One of the most common transformations in Power Query is the Remove Duplicates. This transformation is used in many scenarios, one of the examples, is to create a dimension table with unique IDs in it, so can be used as the source of a one to many relationships to fact tables in the relationship diagram of Power BI. Remove duplicate will give you the unique output, however, sometimes this won’t function as you expect. Sometimes, you apply a remove duplicate, but there are still duplicates! how to fix that? read the rest of this article to learn about it. If you want to learn more about Power BI, read Power BI book from Rookie to Rock Star.
How does Remove Duplicate work?
Remove duplicate is a transformation in Power Query, which will remove duplicate entries in the selected column(s). As an example, if you consider a dataset such as the below image:
If we use the Remove Duplicate transformation on this column, as the below screenshot shows;
Then we will have the unique list of countries as the output as below;
As you can see in the above screenshot, Remove Duplicate is an easy way to achieve the unique list of values in Power Query.
Remove Duplicate Helps in Modeling
Now, let’s have a look at another example. Let’s say we have two tables; one table for Product, and one for Sales. Here are entries of the Product table;
and the Sales Table as below;
If we load these two tables as is into Power BI and try to create a relationship between them, we get this:
We get the error saying “This relationship has cardinality Man-Many. this should only be used if it is expected that neither column (Product ID and Product ID) contains unique values, and that the significantly different behavior of Many-many relationships is understood”.
If you haven’t used many to many relationships in Power BI previously, I’d recommend to not use it from now on too! this type of relationship is also called Weak relationship because it implies modeling issues. I suggest using one to many and many to one relationship instead. In our case, we need the relationship to be one-to-many from the Product table to the Sales table, because the Product table should have the UNIQUE list of products in it, not duplicate entries.
When the Remove Duplicate does not Work!
So the above example shows that we need to use the Remove Duplicate to get rid of duplicate entries in the product table. which is what you can do in Power Query Editor as below; right-click on the Product ID column and select Remove Duplicates;
After this action, everything should be right, but this time you would get the same error also!
It seems there are still duplicate entries in the product table. when you look at the table, you will see that Yes, there are duplicate entries, despite the fact that we said Remove Duplicate!
Solution: Be cautious of Case SeNsItIvItY!
Power Query is a case sensitive language. the two values of 324Abd and 324abd are considered as two different values as a result. This has nothing to do with Remove Duplicate, anywhere in Power Query, these two values are considered as two different values. If you are going to do remove duplicates despite their case of letters, then you have to apply a transformation to change them all to one case; either UPPERCASE or lowercase.
After getting this done, then you can do the Remove Duplicate, and this time, this would be the output:
So as a result, you can create relationship easily;
Other Scenarios? Characters that you don’t see!
case sensitivity is not the only issue when it comes to removing duplicates. some characters that you don’t see can be also an issue. For example; space characters at the end of a text! Event with changing it to UPPERCASE, you would still have the extra space, which makes the two texts different. Let’s say the above example works for now, but later on, when I refresh the model, I get an error like this:
As you can see, in the error message, it says product ID 324ABD has duplicate entries. and here is what we got!
The reason, in this case, is that the last entry has space at the end. you can see it when I take the length of this column;
so these two values are different. You have to use TRIM transformation to get rid of any extra spaces at the beginning or end of text values.
Now, this time, we would have a unique list again:
Caution: Remove Duplicate for Text Values
You may not have faced any of these issues so far in your model, mainly because you do remove duplicate on numeric fields. In a properly designed data warehouse, you will deal with numeric IDs mostly, and that is why you won’t face these issues often. However, if you are doing data transformations in Power Query, then you will face issues like this. In this post, I mentioned only two of scenarios that happens like that. However, there are other cautions when you are removing duplicate on text values;
- use TRIM transformation to make sure there are no extra spaces at the beginning or end of text values
- use CLEAN transformation to make sure there are no characters that won’t see in the text values
- use UPPERCASE or lowercase transformation to make sure you do remove duplicate despite the case sensitivity
Here is an example of steps that I apply before removing duplicates;
You can also create a custom function including all these steps, and use it in your modeling practices.
Remove Duplicate transformation has no issues, and it simply does the job you asked for; removing duplicates. However, sometimes you have extra characters (such as space at the end of the text) which you need to remove before using the Remove Duplicate. Power Query is a case sensitive language, you need to make sure you use uppercase or lowercase transformations if you want to remove duplicate despite the case sensitivity. Have you faced any of these kind of issues? please write your scenario down in the comments