In the last post, you learned about Import Data or Scheduled Refresh as a connection type. In this post, you’ll learn about the second type of connection named; DirectQuery. This type of connection is only supported by a limited number of data sources, and mainly targets systems with a huge amount of data. DirectQuery is different from another type of connection which I’ll talk about it in the next post named Live Connection.
What is DirectQuery?
DirectQuery is a type of connection in Power BI which does not load data into Power BI model. If you remember from the previous post, Power BI loads data into memory (when Import Data or Scheduled Refresh is used as a connection type). DirectQuery doesn’t consume memory because there will be no second copy of the data stored. DirectQuery means Power BI is directly connected to the data source. Anytime you see a visualization in a report; the data comes straight from a query sent to the data source.
Which Data Sources Support DirectQuery?
Unlike Import Data which is supported in all types of data sources, DirectQuery is only supported by a limited number of data sources. You cannot create a connection as a DirectQuery to an Excel File. Usually, data sources that are relational database models, or have modeling engine, supports DirectQuery mode. Here are all data sources supported through DirectQuery (as of today);
- Amazon Redshift
- Azure HDInsight Spark (Beta)
- Azure SQL Database
- Azure SQL Data Warehouse
- Google BigQuery (Beta)
- IBM Netezza (Beta)
- Impala (version 2.x)
- Oracle Database (version 12 and above)
- SAP Business Warehouse (Beta)
- SAP HANA
- Spark (Beta) (version 0.9 and above)
- SQL Server
- Teradata Database
- Vertica (Beta)
This list may change, with every new update in Power BI connectors, some new data sources may be added to the DirectQuery supported lists of Power BI. To view the up-to-date list always use this link:
from the list above, those connections that are Beta or Preview, are not yet supported in Power BI service.
How to Use DirectQuery
For running this example, you need to have a SQL Server instance installed. You can download SQL Server Developer edition from here:
Then set up AdventureWorksDW database on it. You can get the database from here:
This post is not about installing and configuring SQL Server or setting up a database on it. So, I’m not going to explain how to do that. We’ll only talk about Power BI part of the example;
Open a Power BI Desktop and start with the Get Data from SQL Server.
The very first query you will get when connecting to SQL Server data source includes an option for choosing the Data Connectivity mode. Select DirectQuery here. Also, you need to put your server name; If your SQL Server instance is the default instance on your machine, you can use (.) for the server.
Please note that data sources which support DirectQuery, Also, support Import Data (Any data sources support Import Data).
In the Navigator window, you can select some tables from AdventureWorksDW database, such as DimDate, and FactInternetSales.
After selecting tables, click on Load. The data load dialog in this connection mode will be much faster because there is no process of loading data into the memory. This time, only metadata will be loaded into Power BI. The data remains in SQL Server.
No Data Tab in DirectQuery Mode
One of the very first things you will notice in the DirectQuery Mode is that there is no Data Tab (the middle tab on the left-hand side navigations of Power BI.
The Data tab shows you the data in the Power BI model. However, with DirectQuery, there is no data stored in the model.
Also at the bottom right side of the Power BI Desktop, you will notice that there is a note about DirectQuery connection.
How DirectQuery Works
With DirectQuery enabled; every time you see a visualization, Power BI sends a query to the data source, and the result of that comes back. You can check this process in SQL Profiler. SQL Profiler is a tool that you can use to capture queries sent to a SQL Server database. Here is an example Power BI report visual on a DirectQuery model:
And running a SQL Profiler will show you that every time you refresh that report page or change something, there will be one query for every single visualization!
You can see that in the SQL Profiler; Five queries have been sent to the database. One query, for each visualization. Even if two visualizations are showing the same thing, they still send two separate queries to the database.
The understanding of how DirectQuery mode works, bring you straight to one point: How is the performance of this connection? The question about the performance is a very valid question to ask. You must ask this question from yourself Before deciding to use this connection type of course. DirectQuery is performing much slower than the Import Data option.
Import Data loads data into memory. It is always faster to query data from memory (Import Data), rather than querying it from disk (DirectQuery). However, to answer the question that how much faster it is, we need to know more details about the implementation. Depends on the size of data, specification of the server that the database is running on it, the network connection speed, and factors such as is there any database optimization applied on the data source, the answer might be entirely different.
The critical thing to understand is that this type of connection is the slowest type of connection, and whenever you decide to use it, you must immediately think about performance tuning for your database.
Using DirectQuery without performance tuning the source database, is a big mistake that you should avoid.
To understand the need for performance tuning, let’s go through an example. Assume that we have a large table in a SQL Server database, a table with 48 million rows in it. And we are querying the only Sum of Sales column from that table.
Here is the performance I get when I have a normal index on my table with 48 Million records;
A regular select sum from my table with 48 million records takes 4 minutes and 4 seconds to run. And the same query responds in less than a second when I have clustered column store index and significantly improved performance when I have a Clustered Column Store index on the same table with the same amount of data rows;
I’m not going to teach you all performance tuning in this post, and I can’t do it because you have to read books, blog posts, watch videos to learn that. That is a whole different topic on its own. And the most important thing is that; Performance tuning is different for each data source. Performance tuning for Oracle, SQL Server, and SSAS are entirely different. Your friend for this part would be Google, and the vast amount of free content available on the internet for you to study.
One of the newest features added in Power BI is to help to reduce the number of queries sent to the database in the DirectQuery mode. The default behavior of a slicer or filter in Power BI is that by selecting an item in slicer or filter, other visuals will be filtered immediately. In the DirectQuery mode, it means it will send multiple queries to the database with every selection in filter or slicer. Sending multiple queries will reduce the performance of your report.
You may want to select multiple items, but with only selecting the first item, five queries will be sent to the database. Then with selecting the second item, another five queries will be sent to the database. The speed, as a result, will be twice slower. To fix this issue, you can set a property in Options of your Power BI file.
Click on File menu in Power BI Desktop, and then select Options and Settings, from there select Options.
Then click on Query Reduction on the left-hand side (the last item)
There are two sections in the query reduction. The first item is what I do not recommend in most of the cases. The first item will disable cross highlighting/filtering by default. With selecting the first option, the main functionality of Power BI, which is cross highlighting/filtering will not work by default. Cross-highlighting/filtering means that, when you click on a visual, other visuals will be either filtered or highlighted. This feature is what makes the Power BI an interactive visualization tool. The option for removing it for most of the cases is not useful. However, if you have some visuals that don’t need to interact with each other, enabling this option will reduce the number of queries sent to the database.
The second option, however, is beneficial, especially if you have multi-select slicers. When you choose this option, then your filters or slicers will have an Apply button on them. Changes will not apply until you click on the apply button, and then all queries will be sent to the database. This option is highly recommended when you have a multi-selection slicer or filter.
Single Data Source
One of the fundamental limitations of DirectQuery mode is single data source support. With this mode, because data is not loaded into a secondary storage, you can only connect to one single data source. If you are connected to a SQL Server database, then you cannot bring another query from another SQL Server database or any other data sources. If you try to do that, you’ll get a message that says this feature is only supported in Import Data and you have to switch to that.
The concept of using DirectQuery works with the connect of having an enterprise data warehouse. You need to bring all data that you need to analyze in your Power BI report, into an enterprise database, and then use that as the source in Power BI with a DirectQuery connection. Using DirectQuery, as a result, means that you have to use other ETL tools to take care of many data transformation.
Limited Power Query
With DirectQuery, you can apply some data transformations in the Query Editor window. However, not all transformations are supported. To find out which transformations are supported and which are not, you have to check the data source first. Some of the data sources are not supporting any transformation at all, such as SAP Business Warehouse. Some of the transformations, such as SQL Server database, support more transformations.
If you use a transformation which is not supported, you’ll get an error message that says “This step results in a query that is not supported in DirectQuery mode.”
Transformations in Power Query are limited in DirectQuery mode. Depends on the data source, some transformations or sometimes no transformations at all are supported.
As I mentioned earlier; when you use a DirectQuery connection, then you usually should think of another data transformation tool, such as SQL Server Integration Services, etc. to bring data transformed into a data warehouse before connecting Power BI to it.
DAX and Modelling are also limited in DirectQuery mode. You cannot create calculated tables because these are in-memory generated tables, and DirectQuery doesn’t store anything in memory. You cannot use the built-in Date hierarchy in Power BI. The date hierarchy in Power BI is based on an in-memory date table created when you use Import Data. With DirectQuery such table cannot be built, and as a result, the built-in hierarchy doesn’t work in DirectQuery.
You can write DAX calculations as a New Column or New Measure. However, you cannot use all DAX functions. A limited number of functions are available to use. For example; you cannot use PATH functions to create hierarchies on an organizational chart or chart of accounts type of solution. You cannot use many of Time Intelligence functions such as TotalYTD that calculates an insight out of a DateTime field.
Every DAX expression will be converted to T-SQL script and will run on the database. Here is, for example, a DAX expression to calculate sales for productkey=528;
Measure = SUMX( FILTER( FactInternetSales, FactInternetSales[ProductKey]=528 ), FactInternetSales[SalesAmount] )
And here is the t-sql script catched in SQL Profiler for that expression:
SELECT SUM([t1].[SalesAmount]) AS [a0] FROM ( (select [$Table].[ProductKey] as [ProductKey], [$Table].[OrderDateKey] as [OrderDateKey], [$Table].[DueDateKey] as [DueDateKey], [$Table].[ShipDateKey] as [ShipDateKey], [$Table].[CustomerKey] as [CustomerKey], [$Table].[PromotionKey] as [PromotionKey], [$Table].[CurrencyKey] as [CurrencyKey], [$Table].[SalesTerritoryKey] as [SalesTerritoryKey], [$Table].[SalesOrderNumber] as [SalesOrderNumber], [$Table].[SalesOrderLineNumber] as [SalesOrderLineNumber], [$Table].[RevisionNumber] as [RevisionNumber], [$Table].[OrderQuantity] as [OrderQuantity], [$Table].[UnitPrice] as [UnitPrice], [$Table].[ExtendedAmount] as [ExtendedAmount], [$Table].[UnitPriceDiscountPct] as [UnitPriceDiscountPct], [$Table].[DiscountAmount] as [DiscountAmount], [$Table].[ProductStandardCost] as [ProductStandardCost], [$Table].[TotalProductCost] as [TotalProductCost], [$Table].[SalesAmount] as [SalesAmount], [$Table].[TaxAmt] as [TaxAmt], [$Table].[Freight] as [Freight], [$Table].[CarrierTrackingNumber] as [CarrierTrackingNumber], [$Table].[CustomerPONumber] as [CustomerPONumber], [$Table].[OrderDate] as [OrderDate], [$Table].[DueDate] as [DueDate], [$Table].[ShipDate] as [ShipDate] from [dbo].[FactInternetSales] as [$Table]) ) AS [t1] WHERE ( [t1].[ProductKey] = 528 )
The conversion of DAX to T-SQL is not always that simple. With some DAX expressions, the conversion is not good at all. That is why some of the DAX functions are not available to use. There is an option in the Option window to choose to use unlimited DAX functions. However, I do not recommend this option.
This option will allow you to use unrestricted measures and DAX expressions. However, I do not recommend it as it will send heavy T-SQL scripts for some of the expressions.
No Refresh Needed
One the advantages of DirectQuery is that there is no need for data refresh to be scheduled. Because every time a user looks at the report, there is a query sent to the database, and the recent data will be visualized, then there is no need for data refresh schedule.
For Dashboards when the automatic refresh happens, the DirectQuery connection will be refreshed every 15 minutes or more. For reports, data will be updated anytime a new query is sent to the database with refreshing the report or filtering or slicing and dicing.
Large Scale Dataset
The main benefit of DirectQuery is to use it over the massive amount of data. This feature is in fact, the main reason that you may need to consider DirectQuery. Because the data is not loaded into the memory, so there is no limitation on the size of the data source. The only limitation is the limitation that you have in the data source itself. You can have petabytes of data with the DirectQuery connection. However, you need to consider performance tuning in the data source as mentioned before in this post.
No size limitation is the main and in many cases the only reason why some scenarios use DirectQuery. Because based on what you’ve learned in this post, DirectQuery is slower, less flexible, with fewer features in Power Query, and DAX. In overall DirectQuery is the only option if other two types of connections cannot be used.
Power BI supports three types of connection. In this post, you’ve learned about DirectQuery which is one of the connection types. DirectQuery is not storing a second copy of the data into memory. It will keep the data in the data source. With DirectQuery, anything you see in the report will be sent to the database through T-SQL scripts. If you have ten visualizations in your report, it means posting ten queries to the database and getting the result of that back.
DirectQuery supports a huge amount of data. Because the data is not stored in the memory, so the only limitation on the data size, is the limitation of the data source itself. You can easily have petabytes of data in a database and connect to it from Power BI. The limitation that we have in Import Data does not apply here, and you can have large scale dataset for connecting to it. Another advantage of DirectQuery is no need for a scheduled refresh. Because with this mode the data will be updated anytime the report is refreshed with queries sent to the database, there is no need for scheduling refresh.
DirectQuery has many limitations and downsides. With DirectQuery, the speed of Power BI reports is much slower. DirectQuery is limited in using modeling features of Power BI such as DAX, calculated tables, built-in date table, and Power Query functionalities. DirectQuery is as a result less flexible with fewer features. Using DirectQuery means you get slower reports, with less functionality on the Power BI side.
DirectQuery because of the limitations mentioned above is only recommended if the other two types of connection (Import Data, or Live Connection) cannot be applied. In the next post, I’ll explain about Live Connection. If you decided to use DirectQuery, it is then better to use another integration tool and do all the data transformation before loading data into the source database. Also, it is better to take care of all the calculations in the data source itself.
DirectQuery is only recommended if the other two types of connection (Import Data, or Live Connection) cannot be applied
Here are list of pros and cons of the DirectQuery;
- Large Scale Dataset; Size limitation only for the data source itself
- No need for data refresh
- Single Data Source only
- Power Query transformations are limited
- Modeling is limited
- DAX is limited
- Lower speed of the report