What are the Use Cases of Dataflow for You in Power BI?


There have been lots of announcements about Dataflow online, through videos, articles, and news, and there are many demos about it as well. However, when I talk about it at conferences, one of the main questions from the audience is still: “What are the use cases for it?” In this blog post, I’m going to answer that question specifically. I will first explain what Dataflow is, and then how it can be useful in different scenarios for business analysts or developers who are using Power BI. If you want to use Dataflow in Power BI but are not sure where to use it, this is the article for you. If you want to learn more about Power BI, read the Power BI from Rookie to Rock Star book.

What is Dataflow?

A simple definition of Dataflow is: running Power Query in the cloud. Well, that might not seem like a new feature, right? After all, you can already publish your *.pbix report to Power BI and schedule it to refresh, which means you are already scheduling the Power Query script in your report to run in the cloud. So what is new with Dataflow? The new part is that a Dataflow is not part of a single report. So a better definition of Dataflow is:

Dataflow is a Power Query process that runs in the cloud independently from any Power BI reports.

Where Is the Output Stored?

If the Power Query scripts run independently from the report, then where is the data stored? That is a very valid question, because when Power Query is part of a report, the output of each query is loaded into the Power BI model (or let’s say the Power BI dataset). In the case of Dataflow, we said there is no report bound to it, so there is no dataset bound to it either. Dataflow has to store the output of the query somewhere.

Dataflow stores the data in the Azure Data Lake storage.

Azure Data Lake storage is Microsoft cloud storage that can store structured data (like tables) and unstructured data (like files).

But I Don’t Have an Azure Data Lake Subscription!

If you don’t have an Azure account, or you don’t have a subscription that you can use for Azure Data Lake, there is no need to worry! You can still use Dataflow. The whole process of storing data in Azure Data Lake is managed internally by Dataflow. You won’t even need to log in to the Azure portal or anywhere else. From your point of view, you create a dataflow on the Power BI website, and that dataflow manages the whole storage configuration. You won’t need any other accounts, and you won’t pay anything more than what you are already paying for your Power BI subscription.

Dataflow manages the Data Lake configuration internally. You won’t need anything except your Power BI account and subscription.

Some users, however, have their own Azure subscriptions and prefer to manage the storage in the Data Lake themselves. There are methods for them to work with Dataflows and configure their own Azure Data Lake to be the storage engine.

Power BI Can Get Data from Dataflow

The other end of the solution is that Power BI has a Get Data option for Dataflow. You do not need to connect to Azure Data Lake and find where the data is stored (especially if you are using the built-in storage of Dataflow rather than an external dataflow with your own Data Lake storage). You can simply Get Data from a Dataflow in Power BI and choose the table that you want.
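To give a feel for what that looks like, here is a minimal sketch of the kind of M script that Get Data from a Power BI dataflow produces in Power BI Desktop. The GUIDs and the entity name below are placeholders (the navigator fills in the real values for you when you pick the entity), so treat it as an illustration rather than copy-and-paste code:

    let
        // Connect to the dataflows you have access to
        Source    = PowerBI.Dataflows(null),
        // Placeholder IDs – the navigator generates the real workspaceId / dataflowId GUIDs
        Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
        Dataflow  = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
        // Pick the entity (table) you want from that dataflow
        Entity    = Dataflow{[entity = "Date Dimension"]}[Data]
    in
        Entity

The result is just a table in Power Query, so any further lightweight shaping or modelling happens in the report exactly as it would with any other data source.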

What is the Benefit of Using Dataflow?

We are on the right track now. You know what Dataflow is, and now it is time to see what it can be used for. Why, and how, can running a Power Query script in the cloud independently from a Power BI dataset be useful? Let’s dig into the answer together through some examples.

Using One Power Query Table in Multiple Power BI Reports

Have you ever needed to use one Power Query table in multiple Power BI reports? Of course you have. If you have worked with Power BI for some time, you know that tables generated through Power Query are part of only one Power BI file. If you want to use the same table in another file, in combination with other tables that are not in the first file, then you need to replicate the Power Query transformations (or copy and paste the M script) into the new *.pbix file. You may say you don’t, but here is an example: the Date dimension!

The Date dimension is a table that you use in one *.pbix, let’s say for sales analysis, in another *.pbix for inventory reporting, and in an HR data analysis *.pbix too. So what do you do in these situations? Copy the script for the Date dimension into all three files? What if, after a year, you decide to add a transformation or a column to the Date dimension? Then you need to replicate that change in every *.pbix file you have; otherwise, your code becomes inconsistent. It would be much better to do the transformation once, store the output somewhere, and then re-use it. That is exactly what Dataflow can do for you!

Re-usable tables or queries across multiple Power BI files are one of the best candidates for Dataflow.
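As an illustration, here is a minimal sketch of a Date dimension query that could live as a single dataflow entity and then be consumed by all three reports. The date range and the column list are only examples; extend them to whatever your calendar needs:

    let
        // Example date range – adjust to your reporting needs
        StartDate  = #date(2015, 1, 1),
        EndDate    = #date(2025, 12, 31),
        DayCount   = Duration.Days(EndDate - StartDate) + 1,
        // One row per day
        DateList   = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
        AsTable    = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}),
        Typed      = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
        // Common calendar attributes
        AddYear    = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
        AddMonth   = Table.AddColumn(AddYear, "Month", each Date.Month([Date]), Int64.Type),
        AddMonthNm = Table.AddColumn(AddMonth, "Month Name", each Date.MonthName([Date]), type text),
        AddQuarter = Table.AddColumn(AddMonthNm, "Quarter", each Date.QuarterOfYear([Date]), Int64.Type)
    in
        AddQuarter

Change the logic once in the dataflow, refresh it, and every report that gets data from this entity picks up the same consistent Date dimension.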

Different Data Sources with Different Refresh Schedules

What if you have a dataset that includes two tables with totally different refresh requirements? For example, the sales transactions table, coming from a SQL Server database, changes every day, and you need to refresh that data daily. However, the mapping table used for some of the products, maintained by the product team, only changes every quarter. If you have both of these queries in one *.pbix file, then you have no choice but to refresh at the maximum frequency needed, which is once a day.

However, imagine a mechanism that could refresh the mapping table every quarter, apply all the transformations needed, and store the result in a table; then every day you would just need to read it. Dataflow can do that for you: one query runs the data transformation script and loads the result into a destination table, and that query can be scheduled however you need.

Dataflow can run the extract, transform, and load (ETL) process on a different schedule for every query (or table).
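For example, the quarterly mapping table could be its own dataflow entity with its own schedule, while the daily sales query lives elsewhere. The sketch below assumes, purely for illustration, that the product team maintains the mapping in an Excel file on SharePoint; the URL, sheet, and column names are made up:

    let
        // Hypothetical location of the product team's mapping file
        Source   = Excel.Workbook(Web.Contents("https://contoso.sharepoint.com/sites/products/ProductMapping.xlsx"), null, true),
        // Pick the sheet that holds the mapping
        Mapping  = Source{[Item = "Mapping", Kind = "Sheet"]}[Data],
        Promoted = Table.PromoteHeaders(Mapping, [PromoteAllScalars = true]),
        // Example column types – adjust to the real mapping columns
        Typed    = Table.TransformColumnTypes(Promoted, {{"ProductCode", type text}, {"ProductGroup", type text}})
    in
        Typed

Schedule this entity to refresh once a quarter, and let the daily dataset simply read its stored output instead of re-running the whole transformation every day.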

Centralized Data Warehouse

With the evolution of Power BI and other self-service technologies, many companies have started to implement BI systems without a data warehouse. However, as the number of BI systems increases, the need for a centralized data warehouse appears quickly. A data warehouse is a database specifically designed to store data in the format needed for reporting. In traditional BI systems, one of the phases of building a BI system, and one of the most important phases, is to create a centralized data warehouse. The ETL process extracts data from the data sources and loads it into the centralized data warehouse. All reporting solutions then use the data warehouse as the single source of truth.

Dataflow can be an important part of building a centralized data warehouse for your Power BI solution. You can build the structure you want through Power Query scripts in a dataflow. The dataflow then runs those scripts and stores the data in output tables. The output tables of the dataflow can act as a centralized data warehouse for your *.pbix files. Alternatively, you can bring your own Azure Data Lake Storage, configure it the way you want, with the structure of tables that you want, and get the dataflow to load data into those tables.

Dataflow can be the ETL engine that feeds a centralized data warehouse in Azure Data Lake Storage.

Versioning Data from a Data Source

One of the most requested features from Power BI users is: how can I store the output of this report (or, in our case, of a Power Query script) somewhere, so that I can refer to it later? The data changes every day, but you may want to keep a version of the data at a specific point in time and store it for later use; for example, capturing a version of the data at the end of every fiscal year, period, or quarter. Dataflow, although not built for that purpose, can do that for you. Using Dataflow, you can create different dataflow processes that run the ETL whenever you want and store the result in a different output table each time.

Dataflow can be used for versioning the data from the source into multiple destination tables.
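There is no built-in snapshot feature, so treat the following as a sketch of one possible pattern rather than the official way: a dataflow entity that stamps every row with the date it was captured, which you could pair with a separate entity (or a copy of the dataflow) per fiscal period as described above. The server, database, and table names are made up for the example:

    let
        // Hypothetical source – replace with your own server / database / table
        Source       = Sql.Database("your-sql-server", "SalesDB"),
        Sales        = Source{[Schema = "dbo", Item = "Sales"]}[Data],
        // Tag every row with the date this snapshot was taken
        WithSnapshot = Table.AddColumn(Sales, "SnapshotDate",
                           each Date.From(DateTime.LocalNow()), type date)
    in
        WithSnapshot

Refreshing this entity only at the end of a quarter, and keeping a separate entity per period, gives you a crude but workable versioning mechanism on top of the dataflow storage.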

Other Use Cases

Do you use Dataflow in Power BI for any other use cases? Feel free to share your experience in the comments below.

How to Create Dataflow and Use it?

Now that you know what Dataflow is and what its benefits are, it is time to talk about how to use it. Stay tuned for the next blog posts about using Dataflow in action.

References to learn about Dataflow

Matthew Roche from the Power BI CAT team at Microsoft has prepared one of the best sets of materials about Dataflow. I highly recommend reading his materials here.
Microsoft MSDN online also has a good set of articles about Dataflow here.

Reza Rad
Trainer, Consultant, Mentor
Reza Rad is a Microsoft Regional Director, an Author, Trainer, Speaker, and Consultant. He has a BSc in Computer Engineering and more than 20 years’ experience in data analysis, BI, databases, programming, and development, mostly on Microsoft technologies. He has been a Microsoft Data Platform MVP for nine continuous years (from 2011 until now) for his dedication to Microsoft BI. Reza is an active blogger and co-founder of RADACAD. Reza is also co-founder and co-organizer of the Difinity conference in New Zealand.
His articles on different aspects of technologies, especially on MS BI, can be found on his blog: https://radacad.com/blog.
He has written several books on MS SQL BI and is writing more. He was also an active member of online technical forums such as MSDN and Experts-Exchange, was a moderator of the MSDN SQL Server forums, and is an MCP, MCSE, and MCITP in BI. He is the leader of the New Zealand Business Intelligence user group. He is also the author of the very popular book Power BI from Rookie to Rock Star, which is free with more than 1700 pages of content, and of Power BI Pro Architecture, published by Apress.
He is an international speaker at Microsoft Ignite, Microsoft Business Applications Summit, Data Insight Summit, PASS Summit, SQL Saturday, and SQL user groups, and he is a Microsoft Certified Trainer.
Reza’s passion is to help you find the best data solution; he is a data enthusiast.

48 thoughts on “What are the Use Cases of Dataflow for You in Power BI?”

  • Getting data from a data warehouse that isn’t quite in a usable form (for example, a date table with a non-contiguous value to replace null).
    Getting data from websites is a good use case to me. Since the data is static I wonder if this would protect internal data from getting exposed as well.

  • Good article Reza. I am trying to think of differences/ justifications of using Power BI Data Flows over Azure Data Factory. Maybe they will merge. Do you have any thoughts?

    • Hi Ian.
      Good question. There are pros and cons for each technology. For example, ADF is good at distributing computing power over multiple Spark clusters, and Dataflow is good at the transformation power of Power Query. As for the future plan, I think we still have to wait a few months to see what the direction is going to be.

  • Hi Reza. I have a use case which I proposed to business some time back. As you know, Power BI does not support write-backs yet. There was a requirement wherein the client wanted to do some adjustments to the data by himself. We stripped the Power Query out of the report and into a dataflow. And now, the client (who is good at Excel) can do the cleansing on his own. It saves us a lot of to-and-fro emails and time, and gives us flexibility. There are workarounds for data correction but none that I found which make more sense than this.

    • Hi Sourav
      Thanks for sharing your experience. So you’ve isolated the query that fetches the data the client changes often, you run that transformation set separately through Dataflow, and then get data from it in Power BI. Interesting use case 🙂
      Cheers
      Reza

  • Great post. Could you show the versioning you mentioned? I have not thought about this before, but it sounds interesting. How exactly would this work? Copying the dataflow and then never updating that version?

    • Yes, I’ve been thinking of running Dataflows manually at specific points in time to get snapshots of data. With later support for running Dataflow through the REST API, this process can be implemented even better.

      • I’ve just started building a small API using MS Flow and Azure Functions to achieve data versioning, saving data into OneDrive – this would be an absolute lifesaver if I could just use Dataflows and M to take care of it.
        How do you actually save Dataflows into different locations each time?

        • Hi Sam
          Dataflows can also be configured with the Power BI REST API, and in the REST API, you can rename the entity each time to end up with a different destination, and then run it. However, this is all theory so far, because I haven’t written any code to do it. Maybe a good topic for another blog post 🙂

          Cheers
          Reza

          • I have not seen in the Power BI REST API documentation the endpoint which would allow the renaming of a dataflow. I am looking forward to read your solution !

          • Hi Bertrand
            the REST API part of the dataflow is very new, and hopefully, many features will come for it.
            Cheers
            Reza

  • Good article Reza. Dataflow may be useful for smaller data sets with limited data source connectivity. Also, I'm not sure how Azure Data Lake optimizes data retrieval (a black box for someone without an Azure Data Lake license)? As someone already commented here, Azure Data Factory may be a better option for ETL than this.

    • Hi Chandra.
      You can actually SEE how Dataflow manages the data storage in the data lake. If you use your own Azure Data Lake storage, you can connect a Gen2 storage to a dataflow. I will write about it. However, as I mentioned before, for parallel compute processing or big data scenarios, still ADF seems a better option
      Cheers
      Reza

      • Great blog Reza as usual. I just have a question about your comment here. Would it be possible to access the internal Data Lake storage that Power BI created by default ?
        or Do I have to configure Power BI workspaces to store dataflows in my organization’s Azure Data Lake Storage Gen2 account i.e. (Use your own Azure Data Lake Storage -preview) / (Dataflows and Azure Data Lake integration – Preview) ?
        I assume the first internal default option doesn’t incur additional cost while the second option my organization’s Azure Data Lake Storage Gen2 account does incur additional cost, correct ?

        • Hi Ashraf
          The internal dataflow (which works with just a Power BI account) is limited to be used only through Power BI. So you can’t really see that data from other services. However, if you use Power Platform Dataflows, because that can write into CDS, the story is different and can be accessed from everywhere.
          The external dataflow (which works with your own Azure data lake subscription) can be accessed even outside of Power BI.
          Cheers
          Reza

  • Hi Reza, I am a big fan of your blogs and books. The question I have is slightly different from dataflows. I have 2 reports using exactly the same data and data model. The only difference is that one dataset has security levels and the other doesn’t. Does the security bind to a dataset or a report? Is it possible I can use the same dataset but change the security? It really pains me to load 2 similar datasets in the workspace. I had to do this because the business wanted a report specific to Area Managers but also allow them to see the overall figures. Rather than loading summarized tables in my Area Manager level report and making the data model complex, I came up with a solution to have 2 reports (Area Manager specific & overall Sales), only giving access of the AM level report to AMs but having a dashboard with numbers from both reports. Is there a better way of getting around this? Hope this makes sense. Thanks !!

    • Hi.
      You just need one dataset. If you are publishing two separate datasets only because their RLS configuration is different, then it means you don’t have a proper dynamic RLS implemented. A proper dynamic RLS implementation should need only one dataset, and you can have all the access role logic within your data model. Check out my RLS blog series to learn different patterns for using it.
      Cheers
      Reza

  • Hi, useful post.
    I’d like to give my feedback regarding the “Using One Power Query Table in Multiple Power BI Reports”
    Data flow is not the only solution. I’ve created one specific .pbix for the model definition and one or more .pbix files for the reports, using Power BI Dataset as source.
    All the tables and measures are defined only once centrally in the model.
    Moreover, you take advantage of DAX for model definition.

    Some drawbacks:
    – You cannot have reports that use row level security and free to access reports;
    – Same refresh frequency;

    • Hi Marco,
      Yes, a shared dataset can also share tables. But you cannot edit that model (until the composite model on top of a Power BI dataset becomes available, which is not far away).
      Although you CAN, it doesn’t mean you SHOULD. The dataset is for the relationships between tables and the DAX calculations, whereas the dataflow is for the data transformation layer. You need both layers. Here I explained in detail how these two work with each other.
      Cheers
      Reza

  • Hi Reza, thanks for the article. I am very interested in configuring Dataflows for versioning history, but cannot find a lot of information about this. I have a dataset that changes daily, but would like to compare counts per day. So I need to find a way to cache the daily data somehow…

    Bests

    Ben

    • Hi Ben,
      That is more like a possibility at this stage, not an implementation path.
      Dataflows can be a path towards that if we have a way to dynamically change the destination or the dataflow creation. That can happen using the Power BI REST APIs in the future, if such a thing becomes available.
      Cheers
      Reza

  • Another use is: if we want to edit the dataset, we need to download the pbix file, edit it, and publish. Now with dataflow, we just need to edit the dataflow; no need to download the pbix file. This only applies if we want to edit the dataset only, not the report visuals.

    • Thanks Reddy,
      Yes, right. However, if the change is a structural change, and you want to bring that into the Power BI dataset at the end, then editing the Power BI file is also needed.
      Cheers
      Reza

  • Hi Reza,
    Nice article. We are starting to look into this for a client right now. We have a bit of confusion, maybe you can help: if we start using Dataflow embedded with Power BI, we know that the Data Lake cannot be queried by any other tool.
    If for some reason later we would like to go with our own Data Lake so that we can query it from any other tool, is there a way to “migrate” the embedded data lake to a paid Azure Data Lake?
    Without losing every report made in Power BI, of course.
    Any thoughts?

    • Hi Simon
      I never came across this scenario to tell you exactly how the migration works.
      However, the worst case is to change the source in the Power BI dataset to the new dataflow (even if you need to create a new dataflow), and that is just a simple change in Power Query Editor.
      Cheers
      Reza

  • Hi Reza,
    Thanks for the wonderful post. Could you also explain about security implementation in Dataflows? Can we implement RLS on Dataflows?

    • RLS on the dataflow is not YET supported. However, the Power BI team is working on many features to come, and hopefully RLS will be one of those.
      Cheers
      Reza

  • Hi Raza,
    Is there a way to delete the ‘AllInsights’/ ‘AllInsightsInProc’ AI source from a dataflow in PowerBi?
    I have a situation where the automated AI source had been added to the dataFlow, but now we don’t need it. The dataflow portal does not have a delete option.
    Any suggestions?

    • Hi Jeet
      Have you right-clicked on the query and checked the Advanced Editor? you should be able to modify anything you want from there.
      Cheers
      Reza

  • Hi Reza,

    I have a Dataflow which contains customer data, ideal for marketing purposes. A valuable use case for me would be to use the data in an external application. For this I would like to export the data from the Dataflow to a CSV or something. Is this possible?

    Best regards, Sander

    • If you use dataflow in the default way, the exported CSV file is not visible from other applications. However, if you connect it to your ADLS Gen2, then you can specify the details of the output location and so on; this way, you can use it anywhere else you want.
      Another alternative is to use ADF with SSIS and a PQ data source, and output it wherever you like. But that would be a longer and more techie path to go.
      Cheers
      Reza

  • Hello, Reza.
    Thank you for the great article!
    Did you have a chance to try implementing versioning in practice already?

  • Hi Reza.
    The “Different Data Source with Different Schedule of Refresh” is a very interesting use case.
    But as I can see there is an option to schedule a refresh for the entire Dataflow.
    But I have also found a ” Incremental refresh settings ” setting when I open the dataset. But as I know this option works for the Premium accounts only, correct? If so, Power BI Pro users cannot implement “Different Data Source with Different Schedule of Refresh”, right?

    • Hi Rick
      Incremental refresh is (at the moment) a Premium-only feature.
      However, what I meant wasn’t incremental refresh. I meant that you can build a separate dataflow for each data source, and then schedule them based on what you need. If they are all in the same dataflow, then you cannot have a different schedule for each.

      Cheers
      Reza

      • If so, looks like we can do the same with datasets. We can create separate datasets and schedule them separately, correct?

        • Each dataflow and each dataset can be scheduled separately.
          you can have multiple datasets with multiple refresh schedules,
          and you can have multiple dataflows with multiple refresh schedules too.
          I hope this answered your question.

          Cheers
          Reza

  • Hi Reza! Have you written any more about “Versioning Data from a Data Source” using Dataflow? I would like to learn more about how to do that. Thank you!

  • Hi Reza,
    I have been working on creating dataflows lately. However, I am facing a lot of issues showing refreshed data in reports. I am trying to enable scheduled refresh for the dataflows, and they are running for hours. Could you please assist?

    • Hi.
      Have you scheduled the refresh of the dataflow at a different time than the refresh of the dataset? which one completes and which one doesn’t?
      Cheers
      Reza

  • I just watched your “Dataflow and Shared Dataset: Multi-Developer Architecture for Power BI” Webinar, February 28, 2020. I’m starting to setup my shared queries in the service. My concern is: the Analyst/Business Developer has the option to Get Data from my Dataflow, and transform that dataflow within Power BI, thereby negating all of my hard work in creating a disciplined entity that can be shared and commonly understood by multiple developers. Is there any way to restrict the “Transform” option? Just an option to “Load”? I want to do the entity development work, certify the Dataflow, and all the other developers can do is use Dataflow “As Is”, assuring a controlled common entity in the cloud.

    • Hi Matthew
      Unfortunately not. The Power BI report developer would still be able to transform the data into the shape they want.
      And that is not against the concept of a multi-developer architecture. 100% self-service means having silos, and 100% control of development only by you or a specific group of people means the traditional way of implementing BI systems with no self-service. You always need a balance; finding the sweet spot in between is important.
      That said, if you still prefer not to allow others to make any changes to the data, you can create a shared dataset on top of it. That way, users only have a live connection to the shared dataset and cannot change or transform it.

      Cheers
      Reza

  • Let’s say I have published a report connected to a dataflow. In future, a new column is added to a dataflow entity. If I refresh the report from the Power BI service, does it support a schema refresh, or do I have to refresh the dataflow schema from Power BI Desktop, add the new column to the report, and then publish it again?

    • It depends on the type of transformations you use. If you don’t use any column-specific transformation, then it might just pick up that new column in the next refresh of the Power BI dataset.
      However, it is very likely that you do some transformations that are column-dependent, and in that case, the Power BI Desktop file should be opened, refreshed to get the new metadata in Power Query Editor, and then publish again into the service.
      Cheers
      Reza

  • Thank you for useful article Reza.

    Can Power BI be used as an ETL tool to gather, clean, or transform data, and then that pbix be used as a dataflow? (Or rather, add data to the dataflow as raw as possible, apply all cleansing, transformations, or relationships, and then create datasets?)
    The problem is the quality of the data, and doubts about loading data into a dataflow when it is not yet in shape (instead of loading everything as it is, especially when it comes from outside a data warehouse).

    • Hi Kamil
      You can build your transformation logic in Power BI Desktop and then create a dataflow entity using that M code.
      There are different approaches to this problem. Some transform the data while bringing it in, which may cause some issues considering the data quality. A better approach is to have something like a staging environment, where you bring the data from the source as is, and then a data warehouse environment where you use that staging data as a source and do the transformations, etc. However, considering the amount of processing and possible space needed for these transformations and the data, you might need a Premium Power BI account.
      Cheers
      Reza

