I have previously written about relationships and the cardinality of the relationship. Now is a good time to explain the two types of tables we deal with every day in a reporting data model; Fact and Dimension table. In this article, I will explain what a Dimension table is, and examples of it, and how it can be created. Although examples of this article are all using Power BI and Power Query, the basics of modeling are valid regardless of the tool that you use. If you heard about Dimension and have no idea what it is, or if you haven’t heard about it and you want to build a data model for reporting, this article is what you need to read. If you like to learn more about Power BI, read Power BI book from Rookie to Rock Star.
There is no prerequisite for this article. However, I highly recommend you to read below articles beforehand;
Two Types of Tables
In a reporting data model, there are two types of tables. Before I go further on this, I have to point out that there is a big difference between a data model build for a transactional or operational system, compared to the data model build for a reporting system. In this article, we are focusing on a data model for reporting systems, If you want to understand why it is different from transactional data models, read this article. In a reporting data model, we have two types of tables; Fact table, and Dimension table;
A Fact table is a table that keeps numeric data that might be aggregated in the reporting visualizations.
A Dimension table is a table that keeps descriptive information that can slice and dice the data of the fact table.
The definition above, although correct, can lead to creating heaps of different type of tables and calling those a Dimension or Fact table. You do need to understand where and how a Dimension table can be created, what are rules of thumb for creating such a table, and what is a good dimension, and what is bad. and that is what I’m focusing on this article.
A Type of Field: Descriptive
A Dimension table is a table that has descriptive information, which is or can be used for slicing and dicing. A Dimension table has fields that are descriptive. Descriptive fields can be of datatype text, date and time, or even a number. Let’s have a look at some examples;
The table below is a table full of information about customers, and they are all text data type information;
This is a Dimension table because you use the data of this table mainly in visualizations like this:
Most of the times, you use values from this table in an axis or legend of a chart of visualization, but not as values. (You might use these as values in the table visual, but that is an exception, table visual, will show you everything). In the screenshot above, the value is SalesAmount (coming from a different table), and it always sliced and diced by a field from the Customer table; Gender, EnglishEducation, Lastname, etc. That brings us to the first part of the definition of the dimension table;
Fields from a Dimension table are always used for slicing and dicing data of other tables (either through filters or slicers, or through axis or legend of the same visual), We call these fields, Descriptive fields.
You also found that fields of the Customer table above were all text. That can be different, let’s check it out here; we have two new fields in the Customer table now;
The date field (BirthDate), and the numeric field (YearlyIncome) will be also used for slicing and dicing as the below screenshot shows it;
No matter you have date and time fields or even numeric fields in the customer table, they are used in the same way as text fields, they are used for slicing and dicing.
The data type of the descriptive fields can be text, date and time, number or anything else. The data type is not the important property of these fields, the nature of use case for them is important, which is slicing and dicing.
Another Type of Field: Values
Descriptive fields are not the only type of field that we work with it in a data model. There is also another type of field, which is called as Fact. These fields are not used for slicing and dicing. These fields are used as VALUEs in the visualization. In all visualizations above, there was a field that was the Value part of the visual: SalesAmount.
SalesAmount is not used for slicing and dicing. It is used as a value. In fact, SalesAmount is sliced and diced by the descriptive fields of Gender, Education, and Lastname. SalesAmount is not a descriptive field, it is a field that used as a value, and can be aggregated sometimes (like what you see in SalesAmount by Gender or SalesAmount by EnglishEducation). SalesAmount is a FACT field.
Fields that are the VALUE part of a visualization (not for table, and a few other visuals), and sliced and diced by descriptive fields, often aggregated are FACT fields.
Golden Rule: Fact and Descriptive Fields Not in the Same Table!
Now that you know what is a Fact field, and what is a Descriptive field, it is time to explain the first golden rule: Keep these two types of fields away from each other, in separate tables. Fact fields should be in a table which does not have Descriptive fields in it and vice versa. Your first question I believe then is Why?! Well, let’s see.
Let’s assume that we have a table with both types of fields; descriptive fields and also fact fields, like below;
All the seven first fields are descriptive fields, and the last two (SalesAmount and Order Quantity), are fact fields. Using it in a model like above works fine, and you can slice and dice the data of fact fields, by descriptive fields. However, what if you bring another set of descriptive fields, which are not part of this table? for example, what if you want to slice and dice the data of the fact fields by some fields from the Product table, such as ProductName, Color, etc? Then you end up bringing all those into this table too, and end up with a super large table, which has customers, products, and sales values! What if later on, you want to bring dates of the transactions, sales promotion, store details? each of these brings a whole lot of fields with themselves.
Say No to a Single Table Including Everything!
One of the other golden rules of data modeling is to say no to a scenario like above to happen. Do not bring everything into one large table with hundreds of columns. There are several reasons for that, I just point a few;
- Maintenance of such a big table would be always a challenge. Every time you bring a new set of fields, you have to combine them into this table.
- The level of details stored in this table is dependent on the fields in it. If you bring new fields, you need to store more details! This is also a maintenance problem.
- Combining is not always possible! What if you also want to bring a budget into the same table? most of the time is not possible, because you may not have the budget data in the level of customer, you may have it only at the level of the month, and product. then you end up with multiple very large tables.
Every time, you add new fields, your table needs to be combined with the new table to have everything in one place!
The level of Details is dependent on the fields in the table! If you add more fields, you need to store more details.
Combining is Not Always Possible
If the level of details in one or two fact fields is different, then you cannot store them in one table easily. You have to split into two;
This can become another big table, adding dates, and more details about the product, etc. and then you need to create a “many-to-many” relationships between the two tables, which cause heaps of other issues! If you like to learn more about many-to-many relationships, read this article.
So it is not possible to have one table with everything in it, and as you realized it is better to decouple the fact fields from the dimension fields. because then expanding the model is much easier, and it won’t need a really high maintenance practice.
Now it is time to go back to the definition of the dimension table;
Dimension table is a table full of Descriptive Fields and zero Fact fields.
This table needs to be related to other tables including Fact fields to slice and dice their data. so this table needs to have a field which is the unique identifier for each row. This field in database design practices is called as Primary Key. Primary Key is a single field or combination of fields in the dimension table, that can uniquely identify each row. For example, the Customer table can be identified with the CustomerKey column as below;
This is the field, that will be used for the relationships of One-to-many between dimension table and fact tables.
The relationship between Dimension table and Fact tables most of the time is One-to-Many from dimension to the fact table. There are instances of having a one-to-one relationship as well, but not very often.
So, we need to change our definition a bit now;
Dimension table is full of Descriptive fields and zero Fact fields, and one or more Key fields. Key fields are row identifiers of the dimension table.
If you have more than one key field (it is also called a compound key), then you cannot create relationships between tables in Power BI using multiple fields, so you need to use a workaround like what I mentioned here, to create a single field for creating relationship.
When the Primary Key is Not a Primary Key!
The Primary Key is a column (let’s assume it is only one column for this example), that is row identifier. However, what if it isn’t!? Do you recall in your use cases a scenario that something has to be a primary key, but it can’t be? Let me tell you an example; In a reporting model for bank information, we have a table of credit cards. Each credit card has a unique number, so can be used as the primary key;
However, if you look more in details, a credit card will be renewed every few years, with a new expiry date, but the same number!
This is an example of a field that should be somehow primary key (or maybe it is already a primary key in the source operational system), but not in our model. we call these fields; Alternate Key, Business Key, or the Primary Key in the source system. And then we need to create a new field in the Dimension table which can be a real primary key. A primary key of the dimension table, because of this reason is called Surrogate Key, and most of the times can be generated as an auto-number index value;
There are other scenarios that you might end up creating an auto-number surrogate key for your dimension as well. This was just an example to clarify things for you. If scenarios like this happen, most of the times, you do need to also keep the Alternate Key beside the Surrogate Key in the dimension table for slicing and dicing.
Dimension Tables are Wide
Another golden rule in designing dimension tables is to make them wide. Add fields that are related to that, more fields in the dimension table means more power for slicing and dicing the data. If you have just three fields of Customer first name, last name, and gender, then it means you can only slice and dice data by these three. However, if you have a customer table with all columns below, then you will have much more power in slicing and dicing;
Dimension tables have many fields, and that is why they are Wide. And wider means more slicing and dicing power.
Dimension Tables are Shallow
Dimension tables, also wide, but often, they are shallow. They do not have many rows. They are often less than a million row. However, there are always exceptions. For examples, if FaceBook wants to create a customer dimension, it would be hundreds of millions of rows for sure. A dimension table is not storing transactional data, it is storing descriptive information. Transactional data is what it makes the number of rows to grow.
Examples of Dimension Tables
Based on what you learned so far, you can now name some dimension tables in your scenario. It is not a generic dimension table, a table might be a dimension in one model, and with some difference in the set of the fields, might be a fact table in another model. Here are some samples of dimensions, and their fields;
Customer: CustomerKey, First name, Last name, Full name, Age, Gender, Job title, etc.
Product: ProductKey, Product Number (Alternate key), Product Name, Color, Size, etc.
Store: StoreKey, Name, Address, etc.
I can write about the Dimension table even much more, as there are lots of details to talk about. However, let’s leave the rest of that for the other parts of this series of Modeling Basics. In this article, you learned that there are two main types of fields; Descriptive and Fact fields. These two fields should not be at the same table. So that is why we have tables with Descriptive fields and call them Dimension tables. Dimension tables also have key fields called surrogate key as a row identifier. The relationship between a dimension table and fact tables are one-to-many. In the next article, I will explain what is the fact table, and how it connects to a dimension table.
Do you have a table with both fact and descriptive fields in it? it is time to split it, if you have any questions in doing it, let me know in the comments below.