After a long wait, in the October 2018 release of Power BI Desktop we saw the fuzzy matching feature added finally. Yay! Have you ever wanted to match two tables together but not on exact matches, but also on a threshold of similarity? if your answer to this question is yes, then this feature is built for you. Let’s explore in details how the fuzzy matching works in Power BI. To learn more about Power BI, read Power BI from Rookie to Rock Star.
Enable the Preview Feature
At the time of writing this blog post, Fuzzy matching is a preview feature, and you have to enable it in Power BI Desktop -> Files -> Options and Settings -> Options;
In the Options window, under Preview Features, select the checkbox beside “Enable fuzzy merge”
After this step, you’ll need to close the Power BI Desktop and open it again.
Sample Dataset
for this example; I will be using a sample dataset which has two very simple tables below;
A “source” table which is the data of employees and their departments. Notice that the Department field has data quality issues. we have department values such as “Sales”, and “Sale”. Or another example is “Managmnt” and “Management”.
A “Department” table which has a list of all departments;
As you can see the list of Department Names are clean in this table, and this is the table that should be used to clean the “source” table. Now let’s see how this is possible?
Fuzzy Merge
Fuzzy Merge is a way of joining two tables together, but not on exact matching criteria, but on the similarity threshold. If you want to learn what is the Merge operation itself and the difference of that with Append, read my blog post here. If you want to learn more details about what is Merge and the different types of join or merge, read my other blog post here. Merge or Join is simply the act of combining two tables with different structures, but with link/join columns, to access columns from one of the tables in the other one.
To use Merge operation on the “source” query, You can click on the Merge Queries as New option in the Home tab of Power Query Editor window.
Then you can select the second table and choose Department as the joining field
This process will give you the output below: (result below is after expanding the merge’s column output);
You can see that the Merge operation only finds the EXACT Matching scenarios. Department “Sale” doesn’t match with the Department table, because it is missing an “S” at the end to match with the “Sales”.
Now, let’s see how Fuzzy match works here. To use the Fuzzy Merge, just select the checkbox under the Merge tables dialog box;
When you enable the fuzzy matching, then you can configure it in the “fuzzy merge operations”. you can leave everything optional. or set values. let’s first see the sample output of this operation and then see what are the options. This is the sample output of Fuzzy Merge:
You can see the three highlighted records, which was not recognized as the exact match in the normal merge operation, is not matching the output of the fuzzy merge. Fuzzy merge will check the similarity between joining fields, and if their similarity is more than the threshold configuration, it will pass it as a successful match. You can see that “Managmnt” can match with “Management” with this threshold configuration, but the “Mangmt” doesn’t, it shows that the threshold of similarity is higher than the similarity rate of these two text values with each other.
You can play with Options of Fuzzy Merge and get different outputs. Here is an explanation of these options:
Option | Acceptable Value | Description |
Threshold | a value between 0.00 to 1.00 | if the similarity of the two text values is more than the threshold it will be considered as a successful match. Value 1.00 means exact match. |
Ignore Case | true/false | If you want the similarity algorithm to work regardless of the upper or lower case letters, then select this option. |
Ignore Space | true/false | If you want the similarity algorithm to work regardless of the number of spaces in the text, then select this option. |
Maximum Number of Matches | numeric positive value, between 0 to 2147483647 | The number of rows that can be matched to one value. |
Transformation Table | table | This is like a mapping table, let’s check it out a bit later in this post. It gives you the option to use your own mapping table. This table should have at least two columns of “To” and “From”. |
Power Query Functions
In addition to the option added in the graphical interface of Power Query, we also have two Power Query Functions that do the Fuzzy Merge, Functions are:
Table.FuzzyJoin
Table.FuzzyNestedJoin
Functions above both do have the same fuzzy configurations, their only difference is that one of them gives you the expanded output (FuzzyJoin), the other one gives you the same output as the one that you see in the graphical interface with the table column output after merge (FuzzyNestedJoin). If you use these two functions directly in M script, you will have a couple of more parameters to set, which are for concurrency and culture settings.
These are parameters of the two functions above;
Transformation Table
Sometimes in the merge operation, you need a mapping table. This table is called here as Transformation Table. Here is an example of a mapping table:
Note that this table should have at least the two column of “To”, and “From”. And don’t forget that Power Query is case sensitive!
Now you can select this table in your Merge operation in the Fuzzy configuration as below;
This process is like merging “source” table, which is the first table in our Merge, with the “Department” table based on the “Department” and then “Department Name” column, then merging it with the “mapping” table, based on the “To” column and “Department Name”. The output will bring the “To” column of the mapping table. Here is the sample output:
Summary
Matching based on similarity threshold, or Fuzzy matching is a fantastic feature added to Power Query and Power BI, however, it is still a preview feature, and it may have some more configuration coming up. please try it in your dataset, and let me know if you have any questions in the comment below.
Thanks for well explained blog post!
Oh that’s a great explanation ! I waited for this impatiently.
Thank you Reza
Very useful feature and good explanation
thanks for the post, and you tutorial. Do you know the math used to determine the similarity threshold? If so, I would appreciate an explanation of that as well.
Fuzzy Matching in Power Query, uses Jaccard Index method. explained here: https://en.wikipedia.org/wiki/Jaccard_index
Nice feature and easy to use.
Thanks for the detailed post.
+1
Hello, Can someone help me with which algorithm does Power BI to perform fuzzy match? I would like to replicate that algorithm on the data base side. Please help me with this.
Thanks
Sure. As I mentioned in previous comments, it is Jaccard Index
Cheers
Reza