Why data preparation and transformation? Why Power Query?

Does Power Query requires training or reading a book, or even reading an article? I get this question a lot from those who started Power BI not long ago. Many users think that Power Query is just the get data from the source, and that’s it. Considering the many data source connectors in the Power BI, it seems Power Query isn’t doing anything at all.

The reality is, however, totally different. Power Query is much more than getting the data. And more importantly, the data always needs transforming, cleaning, filtering, and more preparation, even if it comes from a well-structured data source such as a relational database table.

Throughout my many years of experience, I have seen many flawed data analysis solutions or slow reports due to the lack of data preparation. Users go and read more about the calculation engine in Power BI; DAX. DAX is an essential skill, and there is no doubt about it; I have written a book on DAX. However, I would say you need to know about data transformation before learning about the calculation engine.

Let’s see what data preparation and transformation are and why Power Query is needed.

Getting data

When you are about to analyze data, the first step is connecting to the data source and bringing that data in for the analysis. Let’s go through an example to understand this better;

The analysis requirement is the Sales Vs. Budget dashboard. The budget information is stored in an Excel spreadsheet. The sales information comes from two different sources; The sales in-store, which uses a Windows application, and the data is written in a SQL Server database. The online sales, which come from a WordPress website with the MySQL database backend.

For the analysis above, we need to connect to the three data sources (Excel, MySQL, and SQL Server) and populate all the information in one place.

You need to know how to connect to these data sources and many other data sources to analyze them, and you also need a tool, which is powerful enough to cover a wide range of data sources. Fortunately, Power Query can do that, and it supports more than 80 different data sources.

Badly structured data

If you are about to analyze the data, you need the data in the right shape and format. You cannot have a proper analysis when the data isn’t in the right structure. This means part of your job as a data engineer is to re-shape the data.

As an example, the data below come from a movie database online. Your task is to find out how many movies are in each year and what their ranking is.

You cannot do the analysis as long as the movie’s ranking, title, and the release year are combined in one text field. You have to re-structure it and have a column for each.

It would help if you had a tool that allows you to perform transformations like above in the data table. Fortunately, we have many powerful transformations available in Power Query.

Not clean data

Data is not always come clean and ready to use. You may get a lot of data rows that are not useful in their current form.

You may get data with unnecessary rows

You need to remove the unnecessary rows, clean the data, apply some transformations and get the clean data out for the analysis.

Incomplete data

Let’s jump back to the example earlier about the Sales Vs. Budget dashboard. There is no single data source that is complete enough for this analysis. You need to combine the three data sources to achieve this information.

Combining datasets is an essential part of the data preparation that you should master. Power Query gives you the Append or Merge transformations that can help you in combining datasets.

Too many details

Let’s say you have the data that you need. However, the details are too much. It is more than what you need. The details are at the order line level. However, the analysis requirement is only down to the order level. You need to aggregate the data at the level you need and reduce the amount of data rows for faster analysis.

A data preparation tool such as Power Query gives you options such as Group By to perform such operations.

More calculations needed

Sometimes, you have the data. However, you can get more by further calculations. For example, you have employees and their birthdates, but you need to calculate their age for further analysis.

Or you may need some conditional calculation that checks if their age falls into specific bins or not.

These types of calculations in Power BI can be done in both DAX and Power Query. However, a calculation that can be calculated statically beforehand (such as the age and the age bins) is better to be done at the data preparation layer because it would speed up the performance of the analysis.

More data than what you need

It is good to have more data than you need. However, this might also lead to performance issues. If you have a data table with over 100 million rows for the sales data over 20 years, and you need to analyze the data of the last three years (which is about 10 million rows only), then there is no point in getting that 100 millions of rows of data.

An essential skill for data preparation is filtering the data and narrowing it down to what is required. Power Query gives you many filtering options for the data tables.

Data not in the right shape

If you have the budget data from the finance team, it usually comes in a spreadsheet, like this:

The data above is not in the right shape for analysis. If you want to see each year’s budget, you must add a calculation to sum all the 12 columns’ values. If you want the budget for each quarter, then you have to sum every three columns to get that, and this goes on. You may need hundreds of calculations to cover the poorly shaped data.

However, if you can find the right way of shaping it (like below), you can connect it to a Date table, and all the analysis options with dates will be available for you without any calculations. The data below is unpivoted;

In this case, the data transformation tool, Power Query, should be mature enough to give you the options to shape the data in the right way for modeling.

Prepare for modeling

There are best practices to follow when you build an analytical data model. One of the main ones is to create a star-schema data model. This means the tables will have to change from their format into a new structure and format. It would help if you had a data transformation tool that gives you enough capabilities and features to transform the tables.

Even for transactional data sources, you still need to perform some transformations to get it in the format that can be modeled for analysis.

Avoid repeating the work

An essential feature for any development tool is to reduce the repeatable work. The tool should reduce the code or transformation that needs to be repeated multiple times. For example, If you connect to a web page and populate that information and perform some data transformation on it. And then you want to apply the same thing on every twenty pages of that website; there should be a mechanism to re-use the code and transformations instead of copying it. This way, any changes in the future and maintaining this solution will be much easier.

Power Query enables you to create functions from a set of transformations and then perform those transformations with consuming that function.

Data transformation not just for BI

Finally, data preparation and transformation are not just for BI. You may need to do some data preparation work for your data in Excel before using it, or you may want to have a clean list of email addresses to use for sending an email campaign.

A data transformation tool should be available to you not only through a BI tool. This is why you see Power Query as a component in more than one tool and service. You can see it in Power BI as a data preparation tool for a BI application. But we also have it in Excel and some other places just for pure data preparation and transformation features.

Power Query: Get data, easy then? No. you see the tip of the iceberg

Many users think Power Query means getting data; Connecting to a data source, and bringing that data into Excel or Power BI, and that’s it! Now you know that is not Power Query or data preparation. The more you work in the analysis space, you realize the need for proper data, clean data, well-prepared data, and transformed-in-the-needed-shape data. As a Power Query developer, you have to do it all. The get data is just the tip of the iceberg.

Reference for the photo: https://www.flickr.com/photos/safari_vacation/32306903126

In my experience with data over 20 years, I can tell you that the data preparation and transformation can easily be over 70% of your work in building an analysis solution. Don’t underestimate this step. Data preparation and transformation is a big step, which requires a powerful tool, which is why Power Query is so popular. Power Query is an easy-to-use engine available on many platforms and powerful enough to perform simple to complex data transformations. Learning and mastering this engine and tool is an essential skill for a Power BI developer, Excel developer, and anyone who works with data analysis in Microsoft toolset.

So to answer the first question: “Do you need a read a book for Power Query?” I would say, “Definitely, perhaps more than just a book. I cannot recommend learning it enough”. You have to master a data preparation and transformation tool, and in the world of Microsoft toolset, Power Query is a powerful tool not to be missed.

Reza Rad on FacebookReza Rad on LinkedinReza Rad on TwitterReza Rad on Youtube
Reza Rad
Trainer, Consultant, Mentor
Reza Rad is a Microsoft Regional Director, an Author, Trainer, Speaker and Consultant. He has a BSc in Computer engineering; he has more than 20 years’ experience in data analysis, BI, databases, programming, and development mostly on Microsoft technologies. He is a Microsoft Data Platform MVP for 12 continuous years (from 2011 till now) for his dedication in Microsoft BI. Reza is an active blogger and co-founder of RADACAD. Reza is also co-founder and co-organizer of Difinity conference in New Zealand, Power BI Summit, and Data Insight Summit.
Reza is author of more than 14 books on Microsoft Business Intelligence, most of these books are published under Power BI category. Among these are books such as Power BI DAX Simplified, Pro Power BI Architecture, Power BI from Rookie to Rock Star, Power Query books series, Row-Level Security in Power BI and etc.
He is an International Speaker in Microsoft Ignite, Microsoft Business Applications Summit, Data Insight Summit, PASS Summit, SQL Saturday and SQL user groups. And He is a Microsoft Certified Trainer.
Reza’s passion is to help you find the best data solution, he is Data enthusiast.
His articles on different aspects of technologies, especially on MS BI, can be found on his blog: https://radacad.com/blog.

3 thoughts on “Why data preparation and transformation? Why Power Query?

  • So many articles that don’t cover a business vision of using Power BI. We have the impression that it is still aimed at self-service users who are not IT specialists.

    We must avoid doing all our transformations in Power BI, we must avoid reading non-robust sources (Excel spreadsheet), …

    Power BI is a component only in a BI ecosystem, we have our ELT tool (Azure Data Factory for example), the different stages of data in the data lake and the transformations that we are going to do between them. We have the tables and views in the database (Synapse for example) and some organizations will have Analysis Services.

    In short, juggling the different tools, the different implementation options, the different architectures of a BI solution is much broader than just talking about Power BI.

    I have the impression that on the internet we are only talking about the product, it’s a shame!

    • Hi.
      Power Query is not just for Power BI. You can run it through Azure Data Factory wrangling data flows too, and as Power Platform dataflow too. This post was mostly about why data preparation is important and what Power Query can do about it. different architectures can be applied with a combination of tools and services for sure in any environment.
      Cheers
      Reza

Leave a Reply