DirectQuery, Live Connection or Import Data? Tough Decision!



Power BI supports different methods for connecting to data, and that is why choosing the right method is always a tough decision. You have heard of DirectQuery, Live Connection, and Import Data. What do these three actually mean? In which situations should you choose one over the others? What are the pros and cons of each? What are the best practices for choosing each one? Which one is faster? Which one is more flexible? Are DirectQuery and Live Connection actually the same thing, or two separate methods? And many other questions. I always get a lot of questions like these in my courses, conference talks, and blog posts. In this post, you will get the answer to all of the questions above. This post helps you choose the right data connection methodology and architecture for your Power BI solution. If you would like to learn more about Power BI, read the Power BI book: from Rookie to Rock Star.

Why is this a tough decision?

If Power BI had only one way to connect to data sources, then everything would be easy: you would never need to choose between methods and find the right one for your scenario. However, Power BI supports three different methods for connecting to data: DirectQuery, Live Connection, and Import Data (which some call Scheduled Refresh). Many of you might still think that DirectQuery and Live Connection are the same; however, they are different, and you will learn about their differences later in this post. Each method has some benefits and also some disadvantages. Depending on the scenario you are implementing Power BI for, you might choose one over the others. Changing from one method to another can be a time-consuming task once you are well into the implementation, so it is best to choose the right method from the beginning.

Choosing the right method is an important step in your Power BI solution architecture, and you usually need to decide on it in the early phases, before starting implementation. In this post, I'm going to explain in detail what each method is and answer all of the questions below:

  • What is Import Data/Scheduled Refresh?
  • What is DirectQuery?
  • What is Live Connection?
  • What is the Difference between Live Connection and DirectQuery?
  • Pros and Cons of Each Method
  • Which method performs best and fastest?
  • Which method is more flexible?
  • Which method is more scalable?
  • What are the Architecture Scenarios to Use Each Method?
  • What is the Gateway Role?

What is Import Data or Scheduled Refresh?

This method has two names: some call it Import Data, some call it Scheduled Refresh, and both names describe its behavior. With this method, data from the source is loaded into Power BI. Loading data into Power BI means consuming memory and disk space. While you are developing on your machine with Power BI Desktop, it is the memory and disk space of your machine. When you publish the report to the service, it is the memory and disk space of the Power BI cloud servers.

If you have 1 million rows in a source table and you load it into Power BI with no filtering, you end up with the same number of rows in Power BI. If you have a database with 1,000 tables but only load 10 of those tables into Power BI, then you consume memory only for those 10 tables. The bottom line is that you spend memory and disk space in proportion to the amount of data you load into Power BI.

Compression Engine of xVelocity

The first assumption you might make after reading the explanation above is that if you have a 100 GB database and import it into Power BI, you will get a 100 GB file in Power BI. This is not true. Power BI leverages the xVelocity compression engine and works with a column-store, in-memory technology, which compresses data and stores it in a compressed format. Sometimes you might have a 1 GB Excel file that ends up as a Power BI file of only 10 MB, mainly because of the compression engine. However, the compression rate is not always that high; it depends on many things: the number of unique values in each column, sometimes data types, and many other factors. I will write a post later that explains the compression engine in detail.

The short version is: Power BI stores compressed data, so the size of the data in Power BI is usually much smaller than its size in the data source.

Important Pros and Cons for this method

Power BI Fully Functional

With this method, you get Power BI fully functional. You can use Power Query to combine data from multiple sources, DAX to write advanced time-intelligence expressions, and the full range of visualizations. There is no limitation in the functionality of Power BI with this method; you can use all of its components.

Size Limitation

With this method, you have a limitation on the size of the model: your Power BI model (in other words, the file) cannot be more than 1 GB. You usually have up to 10 GB of storage in your account; however, each file can be at most 1 GB. There is an exception to this: Power BI Premium allows you to load models of up to 50 GB into the Power BI service. Without Premium, the maximum file size is 1 GB. Remember, though, that 1 GB in the Power BI file is not equal to 1 GB of data in the source (as mentioned in the compression engine section).

This is the Fastest Method

This connection method is the fastest option possible. Data is loaded into the memory of the server, and report queries are evaluated against that in-memory data. There is no lag or slowness with this method (as long as you have designed your Power BI model without performance issues).

What is DirectQuery?

DirectQuery is a direct connection to the data source. Data is NOT stored in the Power BI model. Power BI acts as the visualization layer and queries the data source every time. Power BI only stores the metadata of tables (table names, column names, relationships, and so on) but not the data. The Power BI file size is therefore much smaller, and you will most probably never hit the size limitation, because no data is stored in the model.

DirectQuery is only possible with a few data sources. At the time of writing this post, these are the supported data sources for DirectQuery:

  • Amazon Redshift
  • Azure HDInsight Spark (Beta)
  • Azure SQL Database
  • Azure SQL Data Warehouse
  • IBM Netezza (Beta)
  • Impala (version 2.x)
  • Oracle Database (version 12 and above)
  • SAP Business Warehouse (Beta)
  • SAP HANA
  • Snowflake
  • Spark (Beta) (version 0.9 and above)
  • SQL Server
  • Teradata Database

To get the most up-to-date list, look at this post.

Important Pros and Cons for this method

Scalability: The Main Advantage

This method does not have the size limitation, mainly because no data is stored in the Power BI file, so you never run into issues with the size of your data. You can have data sources with petabytes of data in SQL Server, Oracle, or any other supported data source and connect to them from Power BI.

Limited Functionality: Few Power Query Operations, Mainly Visualization

This method does not give you the full functionality of Power BI. With this method, you have only two tabs in Power BI Desktop: Report and Relationship. You can still change relationships in this mode.


In Power Query, you are limited to a subset of operations; most operations that cannot be folded back to the data source cannot be used. To learn more about query folding, read this post. If you want to combine data from multiple sources, you cannot do that in this mode; Power BI will show an error message.


You are also limited in your DAX expressions; you cannot write every type of expression. Many functions are not supported; for example, Time Intelligence functions are not supported.
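
To give you a sense of what this means, here is a sketch of a basic year-to-date measure, the kind of time-intelligence expression that DirectQuery mode rejects at the time of writing (the Sales and Date table and column names are hypothetical):

    -- Hypothetical table and column names, for illustration only.
    -- A time-intelligence measure like this is not accepted in DirectQuery mode.
    Sales YTD =
    TOTALYTD (
        SUM ( Sales[SalesAmount] ),
        'Date'[Date]
    )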


With this mode, however, you have full visualization support.

Slow Connection

A big disadvantage of this method is that the connection is slower than the other connection types. Note that every visual sends a query to the data source, and the data comes back. You usually have more than one visual in your report, and with every slice and dice you send more queries to the data source. Performance tuning is a MUST in the data source for this mode.

Just as a very small example of performance tuning, here is the performance I get with a regular index on a table of 48 million records.


A simple SELECT SUM over this table of 48 million records takes 4 minutes and 4 seconds to run with a regular index. The same query, on the same table with the same number of rows, responds in less than a second once a clustered columnstore index is in place; a significant performance boost.


I'm not going to teach you all of performance tuning in this post; I can't, because you would need to read books and blog posts and watch videos to learn it. That is a whole topic on its own. The most important point is that performance tuning is different for each data source: tuning Oracle, SQL Server, and SSAS are totally different exercises. Your friend for this part is Google and the vast amount of free content available on the internet for you to study.

What is Live Connection?

Live Connection is very similar to DirectQuery in the way it works with the data source: it does not store data in Power BI, and it queries the data source every time. However, it is different from DirectQuery. Live Connection is only supported for these data sources:

  • SQL Server Analysis Services (SSAS) Tabular
  • SQL Server Analysis Services (SSAS) Multi-Dimensional
  • Power BI Service

Because these data sources are modeling engines themselves, Power BI only connects to them and fetches the model metadata (measure names, attribute names, relationships, and so on). With this method, you need to handle all your modeling requirements in the data source, and Power BI just surfaces that model through visualization.

Important Pros and Cons for this method

Large Models with an OLAP or Tabular Engine

The big benefit of this method is that you can have a large data model (not limited to 1 GB) and also leverage the modeling layer of SSAS. SSAS Tabular gives you DAX, and Multi-Dimensional gives you MDX; with either of these languages, you can cover all your calculation and modeling needs. This method has better modeling features than DirectQuery, because in DirectQuery there is no DAX or MDX as an analytical language to help; all calculations need to be done on the database side, and sometimes doing a calculation on the database side is much more complex than doing it in an analytical expression language.
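
As a small illustration of that last point, a share-of-total calculation that would need grouping, subqueries, or window functions on the database side is only a few lines of DAX in a Tabular model. This is just a sketch; the Sales table and SalesAmount column are hypothetical names:

    -- Hypothetical table and column names, for illustration only.
    -- ALL ( Sales ) removes the filter context, so the denominator is the grand total.
    % of Total Sales =
    DIVIDE (
        SUM ( Sales[SalesAmount] ),
        CALCULATE ( SUM ( Sales[SalesAmount] ), ALL ( Sales ) )
    )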

No Power Query. Just Visualization

The big disadvantage of this method is that you do not get even simple Power Query transformations. With this method, you only have the Report tab.


Report Level Measures

With a Live Connection to SSAS Tabular ONLY, you get report-level measures, which give you the ability to write DAX measures inside the report. However, you might want to keep them in the data source to keep your model consistent. To learn more about report-level measures, watch this great video from Patrick. At the time of writing this post, this feature is not available when connecting to SSAS Multi-Dimensional.


Report-level measures are a great feature because users can create measures without having to call a BI developer to do it. However, these measures are not added to the data set; they exist only in the report. So, for consistency of your model, you might want to keep measure creation as part of your SSAS data source model.
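
As a sketch of what a report-level measure might look like, here is a simple margin calculation written on top of two measures assumed to already exist in the SSAS Tabular model ([Total Sales] and [Total Cost] are hypothetical names):

    -- Hypothetical base measures from the underlying SSAS Tabular model.
    -- This measure lives only in the report file, not in the SSAS model.
    Sales Margin % =
    DIVIDE ( [Total Sales] - [Total Cost], [Total Sales] )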

What is the Difference between Live Connection and DirectQuery?

Now you know about all the different types of connections. Before going through their pros and cons in detail, let's focus on the difference between Live Connection and DirectQuery:

  • DirectQuery is a direct connection to the data sources listed above, for example SQL Server, Oracle, IBM, and so on.
  • Live Connection is a direct connection to an Analysis Services model (SSAS Tabular, Multi-Dimensional, or a Power BI report published to the service).

Relationship Configuration

With DirectQuery, you can still configure relationships in some cases.


With Live Connection, there is no Relationship tab; relationships must be handled in the data source.


Report Level Measures

With some types of Live Connection (to an SSAS Tabular model or the Power BI Service), you get report-level measures.


No Power Query in Live Connection

In DirectQuery, you can still do simple Power Query transformations. In Live Connection, however, Power Query is not available at all; all you can do is change the connection to another model or another server.

Pros and Cons of Each Method

I have already highlighted the main pros and cons of each method in its own section. Here, let's go through all of them together again:

Import Data or Scheduled Refresh

Advantages

  • Fastest Possible Connection
  • Power BI Fully Functional
  • Combining Data from different sources
  • Full DAX expressions
  • Full Power Query transformations

Disadvantages

  • Power BI file size limitation (It is different for Premium)

DirectQuery

Advantages

  • Large Scale data sources supported. No size limitation.
  • Pre-Built models in some data sources can be used instantly

Disadvantages

  • Very Limited Power Query functionality
  • DAX very limited
  • Cannot combine data from multiple sources
  • Slower connection type: performance tuning in the data source is a MUST

Live Connection

Advantages

  • Large-scale data sources supported. No size limitation beyond what SSAS supports.
  • Many organizations already have SSAS models built, so they can use them through a Live Connection without replicating that work in Power BI.
  • Report Level Measures
  • The MDX or DAX analytical engine in the SSAS data source is a great asset for modeling compared to DirectQuery

Disadvantages

  • No Power Query
  • Cannot combine data from multiple sources
  • Slower connection type: performance tuning in the data source is a MUST

Which method performs best and fastest?

Import Data is the fastest option possible. Data is loaded into the memory of the server, and all queries are resolved from in-memory data.

Live Connection is the next fastest option, especially if SSAS Tabular or the Power BI Service is used, because these two are in-memory technologies and perform faster than Multi-Dimensional.

DirectQuery is the slowest type of connection. You have to consider performance tuning of your data source.

So the winner for this part is Import Data.

Which method is more flexible?

With Import Data, you get the full functionality of Power BI: full Power Query transformations and DAX measures, as well as all visualizations.

DirectQuery and Live Connection come next, because each of them gives you something: DirectQuery gives you a few Power Query options, while Live Connection gives you report-level measures.

So the winner for this part is again Import Data.

Which method is more scalable?

The Import Data method has a size limitation of 1 GB per model, so without Power BI Premium this method is not very scalable.

With DirectQuery and Live Connection, you get better scalability. Data sources support a large amount of data.

So the winners for this part are Live Connection and DirectQuery.

What are the Architecture Scenarios to Use Each Method?

Import Data for Agility and Performance

Import Data gives you fully functional Power BI with great performance. So if your data set is not huge, you can easily use this method and produce reports within a very fast development time frame.

Live Connection for Enterprise Solution

Many enterprises already have pre-built models in SSAS tabular or multi-dimensional. These models can be easily used in a Power BI Live Connection.

Even if your company hasn't started an SSAS solution yet and you are dealing with a huge data set, this option is better than DirectQuery, because in SSAS you have the analytical expression languages MDX or DAX to cope with calculation and modeling challenges. A typical architecture for this method loads data from the source systems into an SSAS Tabular or Multi-Dimensional model, and Power BI then connects live to that model.


Another approach is to use the Power BI service as the central hub of development; I have explained this in another post.


DirectQuery for Non-Microsoft Sources

The DirectQuery connection is not used that much in Microsoft-based solution architectures. The main reason is that if you have a Microsoft-based architecture, you will probably use SSAS, as mentioned above, to leverage its analytical engine. DirectQuery mode is mainly used in non-Microsoft architectures, where the data model is handled in an Oracle, IBM, or SAP HANA system, or something similar.

What is the Gateway Role?

Regardless of the type of connection you use (Import Data, DirectQuery, or Live Connection), if the data source is located on-premises, you need a gateway for it; otherwise, you do not. To learn more about setting up the gateway, read my other post.

Summary

In summary, you've learned about the three different types of connection: Live Connection, Import Data, and DirectQuery. You've learned their differences, their pros and cons, and the scenarios in which each should be used. There are still a lot of details for each method; in this post, I tried to explain everything at a level that gives you a holistic view. I'm sure you will have some questions after reading this, so please don't hesitate to ask them in the comment area below.

Reza Rad

Reza Rad is an author, trainer, speaker, and DW/BI consultant. He has a BSc in Computer Engineering and more than 15 years' experience in databases, programming, and development, mostly on Microsoft technologies. He has been a Microsoft MVP in Data Platform for seven continuous years (from 2011 until now) for his dedication to Microsoft BI. He is the author of several SQL Server and BI books, as well as the Power BI online book, from Rookie to Rock Star.

