Power BI free allows you to have 1GB data per user, and pro 10GB. However in many organizations you have much bigger size of data than 10GB, sometimes you deal with Tera bytes, even Peta bytes of data. Is Power BI limiting you for the visualization? Short answer; No! In this post I’ll explain you options that you have when you deal with large data sets. As any other solutions there are pros and cons of working with large data sets with Power BI, Let’s explore them together.
Compressed Structure of Power BI
First of all I have to mention that Power BI compress each data set effectively before loading into memory, this is one of the big advantages of xVelocity In-memory engine that Power BI, Power Pivot, and SSAS Tabular built on top of that. Compression happens automatically, that means you don’t need to set a configuration, or allow Power BI to do something. Any data set will be compressed in a reasonable level when you import the data into Power BI.
As an example I’ve imported a CSV file with ~800 MB data size. When it loaded into Power BI I had only 8MB power BI file to work with. That’s an impressive compression.
However compression rate isn’t always like that, it depends on type of data set, data type of columns and some other facts. The bottom line in this section is that, if you have a data set that is more than 10GB and won’t grow significantly, then consider loading that into Power BI to check how compressed it would be, maybe you get it under 10GB limitation and then you are good to go If that is not the case then continue reading.
With Power BI we can connect live to some data sources; On premises or in cloud. Well, here’s the trick; if your data set is large, then use Live connection! simple but useful trick. You can have a SQL Server, Oracle or whatever data source you want with any size of data you need, and create a live connection to it with Power BI. Live connection won’t import data into the model in Power BI. Live connection brings the metadata and data structure into Power BI, and then you can visualize data based on that. with every visualization a query will be sent to the data source and brings the response.
Live Connection to SQL Server On Premises
Let’s look at an example of such Live connection to SQL Server on premises database. I have a database with a large table contains 48 million records of data. the table itself is ~800 MB
I know it is not more than 10GB, but the method would be the same even if I have 10TB sized data table. this table is big enough to show how live connection works for this example.
When you go through Get Data and select this table (or any other tables in the data set) and click Load, you’ll see the option to choose between Live connection or Import. DirectQuery means Live Connection.
After creating the live Connection you can create relationship in the model, or build visualization for that.
There are however some limitations with Live connection which I’ll explore later in this post.
** For the Live connection to on-premises data sources you need to have Power BI Enterprise or Personal Gateway installed and configured.
You can connect to wide range of on-premises data sources, such as other databases, and also SSAS. as an example you can do:
Live Connection to Azure
There are also many Azure data sources that you can use for the live connection Azure SQL Database which is the similar database engine of SQL Server on-premises (with some differences of course) but on cloud. Azure SQL Data Warehouse is the cloud database structure which supports both unstructured and structured data. Azure SQL DW is capable of expanding the compute engine of the database regardless of the storage engine of that. And Sparks on Azure HDInsight is the other azure data source that can be used for live connection to big data structure on azure.
- Live Connection to Azure SQL Database
- Live Connection to Azure SQL Data Warehouse
- Live Connection to Spark on Azure HDInsight
Limitations of Live Connection
Live connection is good to connect with large data sets, and won’t load the data into the model, so Power BI solution would be very small. However there are some limitations for this type of connection, such as; No DAX support, no data model formatting in the Data tab of Power BI, No multiple data sources support, and No Power Q&A support. Let’s go through them one by one;
No DAX Support
With Live connection there won’t be any Data tab in Power BI to create calculated measures, columns or tables. You have to create all calculations in the data source level. It means that if you have SQL Server as the source, create all calculated columns or measures in SQL Server source tables. Your modelling should be done through the data source. In Power BI desktop you would be able to only set up relationship and that’s all in terms of modelling in Power BI.
Formatting is not also available through the Modeling tab of Power BI, So if you want to set decimal points, or set a data type of column it is not possible through Power BI. You need to handle all of these through the data source (or Power Query, read further down).
Full Visualization Support
Fortunately visualization part of Power BI is fully supported in Live connection mode. The underlying reason is that visualization is a separate engine in Power BI, and this is one of awesome reasons that a product built on top of separate components usually works better than a product all in one with no underlying component. You can build any visualization and you won’t be limited at all.
Multiple Data Sources is Not Supported
With Power BI one of the great advantages is that you can combine data sets from multiple sources, multiple data bases or files can be participated in building a model. Unfortunately when you work with Live connection you won’t be able to have data from more than one data source. Even if you want to get data from another database on the same server you will face messages like below;
So adding another data source requires data to be imported, which is not useful for the scenario of working with large data sets.
No Power Q&A
As you know one of the features of Power BI website is natural language questions and answering engine called Power Q&A. This feature is not available (at the time of writing this post) in Power BI Live Connection. when you have a live connection your dashboard won’t have Power Q&A question box on the top.
Don’t Forget the Power Query
Fortunately Power Query still is available with Live Connection. This gives you ability to join tables, flatten them if you require, apply data transformation and prepare the data as you want. Power Query can also set the data types in a way that be more familiar for the Power BI model to understand. If you want to learn more about Power Query read Power Query sections of Power BI online book.
In screenshot above you can see that I’ve joined DimProduct, DimProductSubCategory, and DimProductCategory to bring all fields in single table; DimProduct. And this all happened with Live Connection.
Optimization at Data Source Level
Live connection to data source means report will send queries to data source. Data Sources are different in terms of response time. SSAS tabular might produce faster result, and normal SQL Server database slower. Don’t forget that all performance and indexing tips should be considered carefully when you work with Live connection. If you are working for example with SQL Server consider proper indexing, column-store indexes and many other optimization and performance tuning tips.
Just for a very small example of performance tuning; here is what performance I get when I have normal index on my table with 48 Million records;
A regular select sum from my table with 48 million records takes 4 minutes and 4 seconds to run. and the same query responds in less than a second when I have clustered column store index;
and significantly improved performance when I have a Clustered Column Store index on the same table with same amount of data rows;
I’m not going to teach you all performance tuning in this post, and I can’t do it, because you have to read books, blog posts, watch videos to learn that. That is a whole different topic on its own. and the most important thing is that; Performance tuning is different for each data source. performance tuning for Oracle, SQL Server, and SSAS are totally different. Your friend for this part would be Google, and vast amount of free content available on the internet for you to study.