Fuzzy Matching in Power BI and Power Query; Match based on Similarity Threshold

Facebooktwittergoogle_plusredditpinterestlinkedintumblrmailFacebooktwittergoogle_plusredditpinterestlinkedintumblrmail

After a long wait, in the October 2018 release of Power BI Desktop we saw the fuzzy matching feature added finally. Yay! Have you ever wanted to match two tables together but not on exact matches, but also on a threshold of similarity? if your answer to this question is yes, then this feature is built for you. Let’s explore in details how the fuzzy matching works in Power BI. To learn more about Power BI, read Power BI from Rookie to Rock Star.

Enable the Preview Feature

At the time of writing this blog post, Fuzzy matching is a preview feature, and you have to enable it in Power BI Desktop -> Files -> Options and Settings -> Options;

In the Options window, under Preview Features, select the checkbox beside “Enable fuzzy merge”

After this step, you’ll need to close the Power BI Desktop and open it again.

Sample Dataset

for this example; I will be using a sample dataset which has two very simple tables below;

A “source” table which is the data of employees and their departments. Notice that the Department field has data quality issues. we have department values such as “Sales”, and “Sale”. Or another example is “Managmnt” and “Management”.

A “Department” table which has a list of all departments;

As you can see the list of Department Names are clean in this table, and this is the table that should be used to clean the “source” table. Now let’s see how this is possible?

Fuzzy Merge

Fuzzy Merge is a way of joining two tables together, but not on exact matching criteria, but on the similarity threshold. If you want to learn what is the Merge operation itself and the difference of that with Append, read my blog post here. If you want to learn more details about what is Merge and the different types of join or merge, read my other blog post here. Merge or Join is simply the act of combining two tables with different structures, but with link/join columns, to access columns from one of the tables in the other one.

To use Merge operation on the “source” query, You can click on the Merge Queries as New option in the Home tab of Power Query Editor window.

Then you can select the second table and choose Department as the joining field

This process will give you the output below: (result below is after expanding the merge’s column output);

You can see that the Merge operation only finds the EXACT Matching scenarios. Department “Sale” doesn’t match with the Department table, because it is missing an “S” at the end to match with the “Sales”.

Now, let’s see how Fuzzy match works here. To use the Fuzzy Merge, just select the checkbox under the Merge tables dialog box;

When you enable the fuzzy matching, then you can configure it in the “fuzzy merge operations”. you can leave everything optional. or set values. let’s first see the sample output of this operation and then see what are the options. This is the sample output of Fuzzy Merge:

You can see the three highlighted records, which was not recognized as the exact match in the normal merge operation, is not matching the output of the fuzzy merge. Fuzzy merge will check the similarity between joining fields, and if their similarity is more than the threshold configuration, it will pass it as a successful match. You can see that “Managmnt” can match with “Management” with this threshold configuration, but the “Mangmt” doesn’t, it shows that the threshold of similarity is higher than the similarity rate of these two text values with each other.

You can play with Options of Fuzzy Merge and get different outputs. Here is an explanation of these options:

Option Acceptable Value Description
Threshold a value between 0.00 to 1.00 if the similarity of the two text values is more than the threshold it will be considered as a successful match. Value 1.00 means exact match.
Ignore Case true/false If you want the similarity algorithm to work regardless of the upper or lower case letters, then select this option.
Ignore Space true/false If you want the similarity algorithm to work regardless of the number of spaces in the text, then select this option.
Maximum Number of Matches numeric positive value, between 0 to 2147483647 The number of rows that can be matched to one value.
Transformation Table table This is like a mapping table, let’s check it out a bit later in this post. It gives you the option to use your own mapping table. This table should have at least two columns of “To” and “From”.

Power Query Functions

In addition to the option added in the graphical interface of Power Query, we also have two Power Query Functions that do the Fuzzy Merge, Functions are:

Table.FuzzyJoin

Table.FuzzyNestedJoin

Functions above both do have the same fuzzy configurations, their only difference is that one of them gives you the expanded output (FuzzyJoin), the other one gives you the same output as the one that you see in the graphical interface with the table column output after merge (FuzzyNestedJoin). If you use these two functions directly in M script, you will have a couple of more parameters to set, which are for concurrency and culture settings.

These are parameters of the two functions above;

Transformation Table

Sometimes in the merge operation, you need a mapping table. This table is called here as Transformation Table. Here is an example of a mapping table:

Note that this table should have at least the two column of “To”, and “From”. And don’t forget that Power Query is case sensitive!

Now you can select this table in your Merge operation in the Fuzzy configuration as below;

This process is like merging “source” table, which is the first table in our Merge, with the “Department” table based on the “Department” and then “Department Name” column, then merging it with the “mapping” table, based on the “To” column and “Department Name”. The output will bring the “To” column of the mapping table. Here is the sample output:

Summary

Matching based on similarity threshold, or Fuzzy matching is a fantastic feature added to Power Query and Power BI, however, it is still a preview feature, and it may have some more configuration coming up. please try it in your dataset, and let me know if you have any questions in the comment below.

Facebooktwittergoogle_plusredditpinterestlinkedintumblrmailFacebooktwittergoogle_plusredditpinterestlinkedintumblrmail
rssyoutuberssyoutube
Reza Rad
Reza Rad is an Author, Trainer, Speaker and DW/BI Consultant. He has a BSc in Computer engineering; he has more than 15 years’ experience in databases, programming and development mostly on Microsoft technologies. He is a Microsoft MVP in Data Platform for seven continues years (from 2011 till now) for his dedication in Microsoft BI. He is author of some SQL Server and BI books, and also Power BI online book; from Rookie to Rock Star.

6 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *