Three types of objects in the Microsoft Fabric have similarities in what they can do for an analytics system. These three are; Lakehouse, Data Warehouse, and Power BI Datamart. All three objects provide storage for the data, which can be loaded into them using an ETL process and read using something like a Power BI report. In this article and video, I’ll explain the actual differences and how to choose the best option for your implementation and architecture.
Video
What is Microsoft Fabric
Microsoft Fabric is an end-to-end Data Analytics platform software-as-a-service from Microsoft. This platform uses multiple workloads to carry all services related to data analytics, such as data integration, storage, data warehousing, data engineering, Business intelligence, Data science, etc.
These are good starting points if you want to get yourself more familiar with Microsoft Fabric;
Power BI Datamart
Power BI Datamart brings Azure SQL Database into the Power BI architecture. It combines Dataflow (ETL), Azure SQL Database (Storage for the dimensional model), and Power BI Dataset (Analytical model) in a unified web editor experience. Power BI Datamart enables citizen data analysts to build a data warehouse in a multi-layered Power BI architecture.
You can learn more about Datamart in my article and video below;
Lakehouse
Lakehouse is a place for storing structured data (data tables) and unstructured data (files). Lakehouse is one of the Microsoft Fabric objects under the Data Engineering workloads. It stores the data behind the scene in OneLake. The data can be loaded into the Lakehouse using Data Pipelines, Dataflows Gen2, Notebooks, and Azure Spark Job Definitions. The Data from the Lakehouse can be accessed via the Lakehouse Explorer in the browser and the SQL Endpoint, which allows writing query commands to the Lakehouse. A Power BI dataset will be associated with the Lakehouse, which can be used for reporting.
To learn more about Lakehouse, read my article below;
Data Warehouse or Warehouse
Microsoft Fabric Data Warehouse, or Warehouse, is a place where structured data tables can be stored. A Warehouse is a highly performant and scalable database, and the complexity of the scalability is invisible from the user’s view. The Warehouse will be automatically called up or down to provide the best performance needed. The Warehouse stores the data in OneLake. Data can be loaded into the Warehouse using the Data Pipeline, Dataflows Gen2, or SQL Commands. The Warehouse provides a full SQL Endpoint connection to write SQL commands that can read and write into the data tables. Likewise, the Lakehouse and the Power BI Datamart, a Power BI Dataset, will be associated with the Warehouse for reporting.
To learn more about the Warehouse, read my article here;
Similarities
It is easy to spot the similarities between these three objects. They all act as a storage system (or you can call it a database or a data warehouse) in the analytics solution. There is always an ETL to load data into these and a Power BI dataset associated with them for further reporting. They are all cloud-based database systems. They all use the web editor for editing and modifying the data model. So that is one point of confusion; it looks like all three are doing the same thing.
I often get this question; now that Microsoft Fabric comes with the Lakehosue and the Warehouse, will Power BI Datamart be deprecated? Or what is the reason I should use one over the other? The rest of this article answers these types of questions.
Licensing Matters
One of the most essential matters of the difference is the type of license you are using. With Fabric F SKUs, you would have all three, but you can also have the Power BI Datamart used only with a PPU license (Premium Per User). Some organizations don’t use Premium or Fabric capacities and still using PPU; they can use Power BI Datamart at its full function with a database size of up to 100TB.
Suppose you are working in an organization with only five Power BI users and want full functionality for the Power BI, including a place to store the data. In that case, you can purchase PPU licenses for all five users and use the Power BI Datamart for your purpose.
Scalability
Another big difference between the three options is the scalability. Both Lakehouse and Warehouse are using OneLake as their storage. They both use Microsoft Fabric’s compute bursting behind the scene to speed up the performance when needed (let’s say at the time of loading data or fetching data using a query from Power BI). The Lakehouse and the Warehouse can easily handle massive amounts of data at ease.
Although the Power BI Datamarts can handle large amounts of data, there is a cap of 100TB. And the compute bursting won’t be an option here. On the other hand, the only way to load data into Datamart is Dataflow, which might not perform as fast as Copy data activity in Data Pipeline if we deal with massive data.
Structured and Unstructured Data
You can store Structured and Unstructured Data in the Lakehouse. However, both Warehouse and Datamart only support structured data. The Lakehouse makes a better choice when looking for a storage option to store files and tables.
SQL Endpoint
The SQL Endpoint connects to the database part of the Datamart, Lakehouse, and Warehouse. For the Datamart and the Lakehouse, the connection is read-only. You can only query the data but cannot modify it or write commands that change the structure of the database (such as altering or creating a table). However, the SQL Endpoint for the Warehouse is a fully operational endpoint, which allows you to read and write. You can write a query to fetch the data and write SQL commands to alter and create the tables or insert or delete data from the tables.
This is good evidence that the Warehouse is more for a database developer, whereas the other two options are not.
Are you a Citizen Data Analyst, Database Developer, or Data Engineer?
The persona of the developer working with these technologies is another point of difference. Because each object uses a different set of technology areas, it would need someone with a particular skill set.
Datamart is for Citizen Data Analyst
Datamart is an environment where you can load data using Power BI Dataflows. The knowledge that you need to work in that environment is Power Query and Power BI. These are skills that a Power BI Developer has in the belt.
Data Warehouse is for Database Developer
Because of the powerful SQL endpoint of the Warehouse, the best outcome from it is achieved when a Database Developer works with it. In addition to working with Data Pipelines and Dataflows, the database developer can write SQL query commands or commands to change the data and even the data structure.
Lakehouse is for Data Engineer
You can use Data Pipelines and Dataflows to load data into the Lakehouse. However, you can also use Notebooks and Spark Job Definitions. These two would use languages such as SparkR, Scala, and Python. These languages and working with the Notebooks are usually in the skillset area of a data engineer or a data scientist.
From the Power BI Point of View
If you are only concerned from the Power BI point of view, then these options have a few differences. If you are Importing data into Power BI, then usually only the speed of Import is the important factor (Refresh time), Which would usually work faster when the data is in OneLake (Warehouse or Lakehouse), especially if you are dealing with a large volume of data.
If you use a DirectQuery option, the new Direct Lake option added to the Power BI will bring a much faster response time than the Datamart (This option is available for Lakehouse and Warehouse because their data is stored in OneLake). So if you are working with a huge amount of data, then Data Warehouse and Lakehouse would bring a better performance to the Power BI. But then, the Lakehouse and the Warehouse won’t be much different here because they store their data in the OneLake and provide the SQL Endpoint for SQL queries that read data from there.
All-in-One Comparison
Here are some of the differences mentioned in the below table
Item | Data warehouse | Lakehouse | Power BI Datamart |
---|---|---|---|
Data volume | Unlimited | Unlimited | Up to 100 GB |
Type of data | Structured | Unstructured, semi-structured, structured | Structured |
Primary developer persona | Data warehouse developer, SQL engineer | Data engineer, data scientist | Citizen data analyst |
Primary developer skill set | SQL | Spark (Scala, PySpark, Spark SQL, R) | No code (Power Query), SQL |
ETL | Data Pipelines, Dataflows Gen2, SQL | Data Pipelines, Dataflows Gen2, Notebooks, Spark Job Definitions | Dataflows |
Languages to work with | T-SQL Spark | Spark (Scala, PySpark, Spark SQL, R) | Dataflows, T-SQL |
SQL Endpoint | Read and Write | Read-only | Read-only |
Primary development interface | SQL scripts | Spark notebooks, Spark job definitions | Power BI |
Licenses | Fabric Capacity | Fabric Capacity | Fabric Capacity Power BI Premium PPU |
Why not one? Analogy to Understand
I know that some of you at this point might ask questions like the below;
Why didn’t Microsoft combine all these three into one? Why should we need multiple things with pros and cons and not all pros in one? Why making it so complicated etc.
To answer that, I like to use an analogy. Next time you are in your neighborhood, visit a car dealership of a brand you like (Mercedes, Jaguar, BMW, Ford, Toyota, Nissan etc, I let you choose yours). And ask them for a car! They would either ask you for what you need a car, what is your budget, or family size, etc. Or they might bring you to their yard, and you get a whole lot of different types of cars from the same brand. Here are some of the cars that Mercedes makes;
The list above is all cars; you can drive each, and they take you from point A to point B. So why so many different cars, then? The answer is in their usage and pros and cons. Some are good for families and large groups (SUVs), some are good for simplicity and small family (compacts), some are good for performance (Coupes and Supercars), some are good for offroading, etc. Each has pros and cons. Car companies understand that each car has its own market, so they make not only one type but multiple types.
The Same is true everywhere. There are multiple programming languages. And here we are multiple objects in Fabric; Lakehouse, Warehouse, and the Datamart.
Summary
Although these three objects (Lakehouse, Warehouse, and Datamart) perform similar activities in an analytics project, they differ in many aspects. Their differences depend on the type of license you are using, the skillset and the person of the developer working with it, the scale and column of the data, and the type of data to be stored. Each object has its use case scenarios, and there are situations in which each can be helpful. I don’t see any of these objects getting replaced or deprecated at this stage. However, depending on how Microsoft focuses on its roadmap, things might change in the future.
Is the datamart cap 100TB or 100GB?
100 GB