Despite my many articles about the importance of star-schema and design principals of fact and dimension tables. I still get this question a lot: “When is good to combine tables vs loading them in the Power BI and creating a relationship between them?” This article is going to answer this question.
In Power BI models, you can load tables into the model, and create relationships between tables. The sample below is an example of that type of relationship;
To understand how the relationships work in Power BI, I recommend you to look at my articles here:
- What is the Relationship in Power BI?
- What is the Cardinality of the Relationship?
- What is the Direction of the Relationship?
- Active or Inactive Relationships
You can also combine some of those tables together and have a simpler model with fewer relationships, like the example below;
To combine tables with each other, you can use the Merge transformation in Power Query.
Relationship or Combined?
So the question for this article: Should you combine the Product, ProductCategory, and ProductSubcategory tables into one Product dimension? Or should you leave them as is, and create relationships between tables in the Power BI? Let’s talk about the pros and cons of each method.
One of the distinguishing features of a reporting system is the performance of it. If you are building a reporting system, and the loading the report takes five minutes, then what is the benefit of this report over doing it in the operational system? because it would have taken that much in the operational system anyway!
When you build a reporting system, you should target creating reports that load fast. Having extra relationships, doesn’t help with the performance. The DAX queries behind the scene would be longer.
If I have a visualization like this:
The DAX query generated by this visual is different when tables are combined, or when they are separated with relationships.
This is the DAX query when separated product tables with relationships:
// DAX Query DEFINE VAR __DS0Core = SELECTCOLUMNS( KEEPFILTERS( FILTER( KEEPFILTERS( SUMMARIZECOLUMNS( 'DimProductCategory'[EnglishProductCategoryName], 'DimProductSubcategory'[EnglishProductSubcategoryName], 'DimProduct'[EnglishProductName], "CountRowsDimProduct", CALCULATE(COUNTROWS('DimProduct')) ) ), OR( OR( NOT(ISBLANK('DimProductCategory'[EnglishProductCategoryName])), NOT(ISBLANK('DimProductSubcategory'[EnglishProductSubcategoryName])) ), NOT(ISBLANK('DimProduct'[EnglishProductName])) ) ) ), "'DimProductCategory'[EnglishProductCategoryName]", 'DimProductCategory'[EnglishProductCategoryName], "'DimProductSubcategory'[EnglishProductSubcategoryName]", 'DimProductSubcategory'[EnglishProductSubcategoryName], "'DimProduct'[EnglishProductName]", 'DimProduct'[EnglishProductName] ) VAR __DS0PrimaryWindowed = TOPN( 501, __DS0Core, 'DimProductCategory'[EnglishProductCategoryName], 1, 'DimProductSubcategory'[EnglishProductSubcategoryName], 1, 'DimProduct'[EnglishProductName], 1 ) EVALUATE __DS0PrimaryWindowed ORDER BY 'DimProductCategory'[EnglishProductCategoryName], 'DimProductSubcategory'[EnglishProductSubcategoryName], 'DimProduct'[EnglishProductName]
If tables are combined, the code is like below:
// DAX Query DEFINE VAR __DS0Core = SUMMARIZE( 'DimProduct', 'DimProduct'[EnglishProductCategoryName], 'DimProduct'[EnglishProductSubcategoryName], 'DimProduct'[EnglishProductName] ) VAR __DS0PrimaryWindowed = TOPN( 501, __DS0Core, 'DimProduct'[EnglishProductCategoryName], 1, 'DimProduct'[EnglishProductSubcategoryName], 1, 'DimProduct'[EnglishProductName], 1 ) EVALUATE __DS0PrimaryWindowed ORDER BY 'DimProduct'[EnglishProductCategoryName], 'DimProduct'[EnglishProductSubcategoryName], 'DimProduct'[EnglishProductName]
The second code is much simpler. And it would perform faster when you have many large tables.
Simplicity vs a Mess
Another big deterministic of a good reporting model is the ability to make it easy-to-understand for data visualizers. If a data visualizer have 500 tables in a model all inter-related, he/she would have a hard time to understand the model.
Having a few tables and relationships is always helpful to make the model simple to understand.
Complexity of the relationship
Relationships come with a cost of complexity. You might need a both-directional relationship sometimes, or you might get into the problem of having an inactive relationship and trying to resolve that. Less relationship means fewer problems of this type when it is unnecessary.
Avoid this trap: A big table including everything
The above items might lead you to the thinking that if you have no relationship at all, then your model is the best. No, It is not. If you combine everything in one table, you will have other problems.
When everything is in one big table, then changes would become really hard. Every new table should be combined with the other tables. What if there is a table that you want to use in multiple Power BI files? how you want to re-produce the steps to generate that table when everything is part of one big table?
Here is the answer to the question of this article. You should not have tons of relationships, and you should not create one big table with everything in it. So what is the solution then? The answer is to design a totally different schema. A new layout of tables and relationships. Something that is different from the operating system’s database. Something that helps to get the best of both worlds.
This model, has many names; dimensional model, star schema, data warehouse etc. Through the process of creating this model, you combine some of the tables (dimensions), and keep some of them to be connected through the relationships (dimension to fact table relationships).
Build it from your existing tables
What if you don’t have a reference table that you can use as a dimension? what if there are only mapping tables in your data source? Many, have faced the same challenges. If you don’t have a reference table, you can create them from your mapping tables.
Here is an example, in which I explained how you can build reference (dimension) tables;
Just because you can, it doesn’t mean you should
Just because you can load all the tables into the model and then create relationship, it doesn’t mean it is the best thing to do. Just because you can merge all the tables using Power Query, it doesn’t mean you should do it.
The process of data modeling is not based on CAD DOs, it is based on SHOULD DOs.
Power BI modeling starts with a pen and paper
This is very important that you start your modeling by a pen and paper. Yes, no tools or services needed. Just a pen and paper. You need to draw the dimensional model, the layout of the tables. You need to list fields in your dimension and fact tables. All of these are much better to be done using a pen and paper. You can then take the next step of implementing it using Power BI.
A good model
A good model should consider many things. It should first start with the requirement. How are you going to build a database diagram of a reporting system, when you don’t know what values should be shown in the report?
A good model also requires an understanding of the existing data, and how that data can be transformed into the star-schema model.
Don’t build your model on this rule that; this is the format that I get data, so can’t really do much about it. This is wrong! You need to transform that format.
My report works just fine, why bother?
Why bothering if your report is running just fine? or if it doesn’t have any performance problems? Do you really need to think about the model if you got only a few tables and everything works as it should? Why spending time on something that the users won’t see?
Yes, you need to work on your model. If you didn’t do that from the beginning of your implementation, do it now. Your model gets bigger and bigger every time. You will have more tables, more relationships, bigger challenges.
Building a proper data model is like building a base for a 100 stories building. would you first build the 100 stories and then come and fortify the base?! of course not. You should do it as soon as possible, or your whole building might collapse at some point.
Your model, your world
Your Power BI data model is your world. It is up to you to decide in which world you want to live in. Do you prefer to live in a model that is too complicated that every time someone asks for a change it ends up with hours and often days of work because it is not designed in the way it should? Or do you prefer to build a model that is a high performance, simple, easy to change, and simple to integrate? That is your choice; your model, your world.