Power BI Shared Datasets: What is it? How does it work? and Why should you care?

Facebooktwittergoogle_plusredditpinterestlinkedintumblrmailFacebooktwittergoogle_plusredditpinterestlinkedintumblrmail

I have written a long time ago about sharing a Power BI dataset between reports, in an article written when the feature announced first. Nowadays, if you are in the world of Power BI, you heard a lot about certified datasets, and also shared datasets becoming available across multiple workspaces. In this article, you will learn about:

  • What is shared dataset in Power BI?
  • How the shared dataset can help in Power BI development?
  • Where is the place of Shared dataset in the Power BI architecture?
  • How shared dataset works behind the scene in the Power BI service?
  • What are Certified and Promoted Datasets?

To learn more about Power BI, read Power BI book from Rookie to Rock Star.

What is the Dataset in Power BI?

When you create a Power BI report (or let’s call it a *.PBIX file), the report has two components (if the data connection mode is import data); A report, and a dataset. When you are in the environment of Power BI Desktop, you can’t see the separation that easy, unless you go to the task manager and see the dataset running behind the scene under the Power BI Desktop task threads.

However, when you publish the PBIX file into the service (the Power BI website), you can easily see that there are two objects; A report, and a Dataset.

  • The report is the visualization layer of your Power BI implementation
  • The dataset includes the data, tables, relationships, calculations, and connection to the data source.

You can schedule the refresh for the dataset, and connect to on-premises sources (through a gateway), or cloud-based sources.

What is a Shared Dataset?

Now that you know about the dataset, let’s talk about the Shared Dataset. A shared dataset is a dataset shared between multiple reports. For a long time, you could create a new report from an existing dataset through Power BI website. This feature has been available from the early days of Power BI;

About April 2017, The ability to create a report from Power BI Desktop that can point at an existing dataset, and has a live connection to the existing dataset, became available, which I have written an article about it here.

2017-04-10_14h23_23

Those days, it was called; Get Data from Power BI service. Nowadays, this feature is renamed as getting Data from Power BI Dataset;

A shared dataset is a dataset that shared between multiple reports. Multiple reports connecting to one shared dataset. When that dataset gets refreshed, all of those reports will have the new data. A shared dataset is one step closer to the multi-developer tenant in Power BI environment.

2017-04-10_15h04_21

Sharing Datasets Across Multiple Workspaces

For a long time, sharing datasets was only possible inside a workspace. You could not use a dataset from workspace 1 as the source for a report in workspace 2. However, recently, the feature became available, and you can share the dataset even across multiple workspaces. This is an amazing update and changes the way that Power BI Development works in the future.

When you get data from a Power BI dataset through the Power BI Desktop, you have the option to select which dataset you want to get data from;

How does Shared Dataset work behind the scene?

When you share a dataset in the same workspace, everything is clear. You have one dataset to schedule refresh, and multiple reports connected to it. however, when you use a dataset shared from another workspace, you get something which might look a bit different.

Linked Dataset

When you get data from a Power BI dataset which is workspace 1, and then save your report in workspace 2, you get something like a copy of your dataset in the workspace 2. This might look strange! You might say, this is not a shared dataset, it is a copied dataset. The fact is that what you see is just a link. Power BI will bring a link to that dataset into the new workspace, this link helps you to understand when the dataset gets refreshed last time.

Here is what a linked dataset looks like and you can see the difference of that with normal datasets.

You cannot manually refresh or refresh based on the schedule on a linked dataset. The refresh action can be only configured in the main dataset. The linked dataset is just a link, showing you when was the last date and time for the refresh, and an easier way to generate more reports from that dataset.

A linked dataset is not a copy, It is a link to the original dataset.

Certified and Promoted Datasets

When Power BI developers use the function of Getting data from Power BI dataset, They see all datasets from all workspaces that they have access to. This might be a bit confusing. There might be tons of datasets shared in the environment. Developer ends up with questions such as: Which of these is the one I can use? Which of these are valid to use? which of these are reconciled and tested? which of these are reliable to use? and etc.

A new labeling system is added to the Power BI datasets, that helps in this scenario. You can mark some of the datasets as certified, and promoted. To get a dataset certified, there is an approval process that can assure the dataset passed some of the tests. You can clarify through this labeling system, that what are datasets good to be used as the source, and what is not. You can build the concept of Gold, Silver, and Bronze datasets. Having gold datasets as datasets that are fully tested, and reconciled, and then down to other levels, where the Bronze dataset is a dataset that hasn’t been through any testing yet.

To use this labeling system, the creator of the dataset can go to the setting of the dataset;

In the settings, you can set the Endorsement level as below;

As you can see the Certified option might not be available. The Power BI tenant administrator has the authority to enable that labeling, and give access to whom needed in the Tenant Settings;

The labeling system helps Power BI developers to then see what is the level of certification that a dataset has to be used as a shared dataset, and then can select based on that respectively;

Shared Dataset in the Power BI Architecture

I have written a lot of articles about the architecture of Power BI, in fact, gathered them in a book, However, in one article specifically, I explained how Dataflow and Shared datasets can play an important role in the multi-developer tenant of Power BI implementation.

In nutshell, using the Dataflow makes sure that you can bring the data well prepared in a central area, which you can call it a centralized data warehouse in the Azure Data Lake. and using the shared datasets, you can build data marts that can be used by multiple reports. Here is how the architecture works in the diagram view;

Instead of having silos of Power BI reports and files everywhere, You can build an architecture that works best with multiple developers, less redundancy in the data, in the code, and the logic, and easier maintenance approach. I highly recommend reading this article to learn more about this architecture, and learning how the shared dataset located in this architecture as a key element.

Summary

The shared dataset is not a new feature in the Power BI, However, the ability to share it between multiple workspaces announced recently and is a game changer in the architecture of Power BI implementation. Using shared dataset, you can have centralized data models (data marts) that can serve multiple reports. You can reduce the maintenance time, the redundancy of the code, and the data through this approach. Having the labeling system of the certified or promoted dataset is also a great way of putting some process and governance in place to make sure the shared datasets has been through some process of testing and reconciling.

Are you using shared datasets in your implementation? if yes, I like to hear about your experience, if not, I like to know why you don’t find it useful for your implementation. Either way, please share your thoughts down in the comments below.

Facebooktwittergoogle_plusredditpinterestlinkedintumblrmailFacebooktwittergoogle_plusredditpinterestlinkedintumblrmail
rssyoutuberssyoutube
Reza Rad on FacebookReza Rad on LinkedinReza Rad on TwitterReza Rad on Youtube
Reza Rad
Trainer, Consultant, Mentor
Reza Rad is a Microsoft Regional Director, an Author, Trainer, Speaker and Consultant. He has a BSc in Computer engineering; he has more than 20 years’ experience in data analysis, BI, databases, programming, and development mostly on Microsoft technologies. He is a Microsoft Data Platform MVP for eight continuous years (from 2011 till now) for his dedication in Microsoft BI. Reza is an active blogger and co-founder of RADACAD. Reza is also co-founder and co-organizer of Difinity conference in New Zealand.
His articles on different aspects of technologies, especially on MS BI, can be found on his blog: https://radacad.com/blog.
He wrote some books on MS SQL BI and also is writing some others, He was also an active member on online technical forums such as MSDN and Experts-Exchange, and was a moderator of MSDN SQL Server forums, and is an MCP, MCSE, and MCITP of BI. He is the leader of the New Zealand Business Intelligence users group. He is also the author of very popular book Power BI from Rookie to Rock Star, which is free with more than 1700 pages of content and the Power BI Pro Architecture published by Apress.
He is an International Speaker in Microsoft Ignite, Microsoft Business Applications Summit, Data Insight Summit, PASS Summit, SQL Saturday and SQL user groups. And He is a Microsoft Certified Trainer.
Reza’s passion is to help you find the best data solution, he is Data enthusiast.

4 Comments

    • Hi Amit.
      Let’s hope for it. At the moment, the only way to do that is through SSAS. but hopefully shared dataset becomes available in PBIRS too
      Cheers
      Reza

  • Great article. 2 questions.

    1. Dataflows allow incremental refresh – but I appears when utilizing them in a .pbix and deploying to the service that it loads all data from the dataflow. Thus, efficiency from the dataflow incremental refresh is lost by the dataset having to load all the data again.

    2. While I can build new reports on shared datasets, I cannot fortify the dataset with any other new information I want for those reports. Meaning, I can’t mashup data with anything new outside that existing model

    Am I doing something wrong in either case?

    • Hi,
      At the moment, the connection from Dataflow to Power BI only supports Import Data. Later in the future, you would be able to do DirectQuery hopefully as well.
      on the other hand side, when using Incremental Refresh, you can set that up also in the Power BI dataset, in addition to the Power BI dataflows. Incremental refresh in the dataflow makes sure the ETL runs on delta load, setting that up in the dataset, makes sure then model also runs under partitioning.
      Regarding mashing up data of the shared dataset. I would say it is a good thing that you cannot do more data mashup with a live connection to the Power BI dataset, because then your source code will spread all over places, and you cannot easily find where you have done what! If you need to use the Power BI dataset as one of the sources, and do data mashup on that and combine it with other sources, then don’t use Get Data from Power BI dataset, You can use Power BI dataset as an Import data source (it is now possible with XMLA endpoint), and then you can do whatever you want with PQ there.
      Cheers
      Reza

Leave a Reply

Your email address will not be published. Required fields are marked *