There have been lots of announcements about Dataflow online, through videos and articles and news. There are many demos about Dataflow as well. However, still when I talk about it in conferences, one of the main questions from the audience is that; “What are use cases for it?”. In this blog post, I’m going to answer that question specifically. I will first explain what Dataflow is, and how it can be useful in different scenarios for business analysts or developers who are using Power BI. If you want to use dataflow in Power BI, but not sure where to use it, this is an article for you. If you want to learn more about Power BI, read Power BI book from Rookie to Rock Star.
What is Dataflow
A simple definition of Dataflow is running Power Query in the cloud. Well, this might not seem very new feature, correct? because right now, you can publish your *.pbix report into Power BI, and then schedule your report to refresh. That means you are scheduling a Power Query script in your report to run in the cloud. So what is new then with Dataflow? The new part is that Dataflow is not part of a single report. So better definition for Dataflow is;
Dataflow is a Power Query process that runs in the cloud independently from any Power BI reports.
Where the Output Stored?
If the Power Query scripts run independently from the report, then where it stores the data? That is a very valid question. Because when Power Query is part of a report, then the output of each query will load into the Power BI model (or let’s say Power BI Dataset). In the case of Dataflow, we said there is no report bind to it, so there is no dataset bind to it then. Dataflow has to store the output of the query somewhere.
Dataflow stores the data in the Azure Data Lake storage.
Azure Data Lake storage is Microsoft cloud storage that can store structured data (like tables) and unstructured data (like files).
But I Don’t Have Azure Data Lake Subscription!
If you don’t have an account in Azure or you don’t have a subscription that you can use for Azure Data Lake, No need to worry! You can still use Dataflow. The whole process of storing data into Azure Data Lake is internally managed through Dataflow. You won’t even need to login to the Azure portal or anywhere else. From your point of view, in the Power BI website, you create a dataflow, and that dataflow manages the whole storage configuration. You won’t need to have any other accounts or pay anything extra or more than what you are paying for Power BI subscription.
Dataflow manages the Data Lake configurations internally. You won’t need anything except your Power BI accounts and subscriptions.
If, however, some users have their own Azure subscriptions and prefer to manage the storage method in the Data Lake themselves. There are methods that they can work with Dataflows and configure their own Azure Data Lake to be the storage engine.
Power BI Can Do Get Data from Dataflow
The other end of the solution with dataflow is that; Power BI has a Get data option from Dataflow. You do not need to connect to Azure Data Lake and find where the data is stored (especially if you are using the built-in storage of dataflow and not the external dataflow with your own data lake storage). You can simply Get Data from a Dataflow in Power BI, and choose the table that you want.
What is the Benefit of Using Dataflow?
We are getting on the right path now. You know what Dataflow is, and now is the time to see what is the usage of it? Why and how running a Power Query script in the cloud independently from a Power BI dataset can be useful? Let’s dig into the answer together through some examples.
Using One Power Query Table in Multiple Power BI Reports
Have you ever had the need to use one Power Query table in multiple Power BI Reports? Of course, you did. If you worked with Power BI for some time, you know that tables generated through Power Query are only part of one Power BI file. If you want to use the same table in another file, with a combination of some other tables which is not in the first file, then you would need to replicate the Power Query transformations (or copy and paste the M script) into the new *.pbix file. You may say, no I don’t, but here is an example: Date Dimension!
Date dimension is a table that you use in a *.pbix let’s say for Sales Analysis, and also in another *.pbix for Inventory reporting, and for HR data analysis *.pbix too. So what do you do in these situations? Copying the script for Date Dimension in all of these three files? What if after a year, you decided to add a transformation or a column to the date dimension? then you need to replicate this change in all *.pbix files that you have, otherwise, your code becomes inconsistent. It would have been much better if you did the transformation once, stored the output somewhere, and then re-use it. That is exactly what Dataflow can do for you!
Re-usable tables or queries across multiple Power BI files, are one of the best candidates for Dataflow.
Different Data Source with Different Schedule of Refresh
What if you have a dataset that includes two tables with totally different schedule options. For example, the Sales transactions table coming from SQL Server database is changing every day, and you need to refresh this data every day. However, the mapping table used for some of the products and maintained by the product team is only changing every quarter. If you have both of these queries in one *.pbix file, then you have no other choice but to refresh at the maximum frequency needed which is once a day.
However, if there be a mechanism that can refresh the mapping table every quarter, apply all transformations needed, and store it in a table. Then on every day you just need to read it. Dataflow can do that for You; With one query running the data transformation script and loading it into a destination table. This can be scheduled for whatever plan you need.
Dataflow can run extract, transformation, and load (ETL) process on a different schedule for every query (or table).
Centralized Data Warehouse
With the evolving of Power BI and other self-service technologies, many companies started to implement a BI system without having a data warehouse. However, if the number of BI systems increases, the need for a centralized data warehouse appears quickly. A data warehouse is a specifically designed database to store data in the format needed for reporting. In traditional BI systems, one of the phases of building a BI system, and one of the most important phases let’s say, is to create a centralized data warehouse. The ETL process will extract data from data sources, and load it into the centralized data warehouse. All reporting solutions, then use the data warehouse as the single source of truth.
Dataflow can be an important part of building a centralized data warehouse for your Power BI solution. You can build the structure you want through Power Query scripts in a dataflow. Dataflow then runs those scripts and stores the data into output tables. Output tables of dataflow can act as a centralized data warehouse for your *.pbix files. Alternatively, you can have your own Azure Data Lake storage and configure it the way that you want, with the structure of tables that you want, and get dataflow to load data into those tables.
Dataflow can be the ETL engine, that fuels the centralized data warehouse in Azure data lake storage.
Versioning Data from a Data Source
One of the most requested features for Power BI users is that, how can I store the output of this report (or in our case a Power Query script) into somewhere, and I can refer to it later? The data is changing every day, but you may want to keep a version of the data at this point of the time and store it somewhere for later use. For example, getting a version of the data at the end of every fiscal year, or period or quarter. Dataflow, although, not built for that purpose, can do that for you. Using dataflow, you can create different dataflow processes to run the ETL anytime you want and store it in a different output table each time.
Dataflow can be used for versioning the data from the source into multiple destination tables.
Other Use Cases
Do you use Dataflow in Power BI for any other use cases? Feel free to share your experience below at comments.
How to Create Dataflow and Use it?
Now that you know what is the Dataflow and benefits of it, Is time to talk about it how to use it. Stay tuned for the next blog posts about Using Dataflow in action.
References to learn about Dataflow
Matthew Roche from Power BI CAT team in Microsoft prepared one of the best set of materials about Dataflow. I highly recommend reading his materials here.
Microsoft MSDN online has also a good set of articles about Dataflow here.