When you work with tables and queries in Power Query and Power BI, you get the option to copy them through these actions: Duplicate, or Reference. It has been always a question in my sessions and courses that what is the actual difference between these two actions. The explanation is simple but very important to understand. Because when you know the difference, you will use it properly. In this short blog post, I’ll explain what is the Reference, and the difference of that with Duplicate. To learn more about Power BI; read Power BI from Rookie to Rock Star book.
If you are looking to copy an entire query with all of its steps, then Duplicate is your friend. Let’s see this in action. As an example, Let’s assume that we got the data from a web page that shows us the best seller’s movies information. If you have done the movies example of my book previously, the website is BoxOfficeMojo. Here is the link to the page:
In Power Query, we got data From Web, and selected this source;
Let’s say for this table, we do some transformations. For example; removing extra “^” character from the last column (Year column);
and then some other transformations, so we end up with a number of steps for this query.
After doing all these transformations, you realize that this data is only for the first hundred best seller movies because that web page doesn’t have the remaining movies. To get the remaining, you need to navigate to page 2, which has a different URL, but the same data structure.
Well, what you need to do? You have to do all those steps on page 2 as well. Let’s keep this example static and basic, (Because in complex scenarios when you have many pages, you may use functions and parameters to loop through all pages and combine them all together. If you are interested to learn about that, read my post here). Let’s say you want to do all those steps that you have done for page one, now for page two. To do that; you can leverage Duplicate.
Create a duplicate of Box Office Mojo (I called it; Box Office Mojo Page 1)
When you create the Duplicate query, it will be an exact copy of the first query, with all steps of it. These two queries are exactly like each other. No difference!
Duplicate copies a query with all the applied steps of it as a new query; an exact copy.
After creating the copy, then you can go to the source step to change the URL:
Using Duplicate, you managed to copy a query with all steps in it, and then make changes in your new query. Your original query is intact.
Duplicate is the option to choose, when you want to copy a query, but do a different configuration in steps.
Reference is another way of copying a query, However, the big difference is that; When you reference a query, the new query will have only one step: sourcing from the original query. A referenced query, will not have the applied steps of the original query. Let’s see this option in action. Continuing the example above; let’s say we want to create a new query that is the result of combining the page 1 and page 2 result. However, we do NOT want to change any of the existing queries, because we want to use those as the source for other operations.
With a right click on Box Office Mojo Page 1, I can create a Reference.
Reference will create a new query which is a copy of the Box Office Mojo Page 1, but only contains one single step:
The only step in the new query is sourcing the data from the original query. What does it mean? It means if you make changes in the original query, then this new query will be impacted.
Reference will create a new query which has only one step: Getting data from the original query.
Now we can use this query, to append to the Box Office Mojo Page 2;
The result would be a query that contains both pages;
To learn more about append and the difference of that with Merge, read my blog post here. In this example; we used Reference option to create a copy of the original query, and then continue some extra steps. There are many other usages for the Reference.
Reference is a good choice, when you want to branch a query into different pathes. One path that follows a number of steps, and another that follows a different steps, and both are sharing some steps in the original query.
After doing the append in this example, it is a good idea to uncheck the enable load on Page 1 and Page 2 queries to save some memory in Power BI.
Finding out that which query is dependent (or referenced from) which query can be a bit challenging when you have too many queries. That is why we have the Query Dependency menu option in the View tab of Power Query;
For our example above, this is the query dependency diagram;
Duplicate vs Reference
Now that you know there are two options when you copy a query, let’s have a closer look at their difference.
Isolation from the Original or Dependency to the Original
Duplicate creates a new copy with all the existing steps. The new copy will be isolated from the original query. You can make changes in the original or the new query, and they will NOT affect each other. Reference, on the other hand, is a new copy with only one single step: getting data from the original query. If you make a change in the original query, the new query will be impacted. For example; If you remove a column from the original query, the new query will not have it if it used the Reference method for copying.
Limitation of the Reference
You can not use referenced queries in all situations. As an example; If you have a Query 1, and then you created a reference from that as Query 2. You cannot use the result from Query 2 in Query 1! It will create a circular reference. You are combining a query with a reference to the query itself, It is impossible!
Some actions that invoke Reference or Duplicate
There are some actions in the Power Query that trigger Reference or Duplicate, let’s check those options:
Append Queries as New / or Merge Queries as New is a Reference action
These two actions are creating a reference from the original query and then they do Append or Merge with other queries.
Add as New Query is a Duplicate action
Believe it or not, when you right click on a column or cell and select Add as New Query, you are creating a duplicate of the original query.
This can be misleading sometimes, because you may expect the new query to source from the original, and with the change of original, this query also to change. However, the truth is that this is a duplicate action, and after this action, your original query and the new copy will be isolated from each other.
Copy and Paste is Neither Duplicate Nor Reference!
This is another misconception that Copy and Paste are similar to Duplicate. It is not, and it is not Reference either. When you do this action on a simple query (I mean a query that is not sourced from any other queries), then you get a result similar to Duplicate.
But when you do the Copy and Paste on a query that is sourced from other queries; the result is a copy of all original queries. Here is the result of Copy and Paste on Box Office Mojo All Pages (which is sourced from Page 1 and Page 2);
Duplicate and Reference are two different actions, and they are also different from Copy and Paste of a query. Duplicate will give you an exact copy of the query with all steps, Reference will create a reference to the original query instead as a new query. Duplicate is a good option to choose when you want the two copies to be isolated from each other, Reference is a good option when you create different branches from one original query. There are some actions in Power Query that trigger Duplicate or Reference as listed in this blog post. Hope this was a good post for you to understand the difference between these two actions clearly, and use them wisely from now on.
22 thoughts on “Reference vs Duplicate in Power BI; Power Query Back to Basics”
Reza, that was a great article. Do you have any articles about the Power BI 70-778 certification?
Specifically on how to prepare for the exam.
I will write a post about that soon 🙂
I have a question:
If I have two queries with reference, does Power BI read once and use it for both queries or read the origin twice?
No. Behind the scene, it will be two queries. That is why you can see in one of my other posts, I mentioned if the purpose is just referencing, then creating a calculated table in DAX can be a better option. Referencing in Power Query is the choice when you want to do branching and continue with a set of transformations on top of it.
Thanks for this – excellent article.
Have you found a way to “materialize” data from the first query so that when calling a second query that references it it doesnt trigger a refresh of the data in the 1st query. I think this is the same sort of thought that Salvador is having. Thanks. Andrew
The referencing does not necessarily materialize the original query. Check out the computed entity in dataflow for a feature like that.
Great article. Is one of them better than the other in terms of refreshing data and the size of a PBIX file that is importing the data?
No difference in the size of the PBIX file or performance. behind the scene would be two queries to the data source. It is just the difference in branching rather than copying.
I have a similar question in regards to using Power Pivot in Excel. You said to do this in Power BI : “After doing the append in this example, it is a good idea to uncheck the enable load on Page 1 and Page 2 queries”. I assume I do this same thing in Power Pivot as only creating a connection when selecting “Load to”, and deselecting “Add to Data Model.” When I click on the properties of the query, I get the option to “Enable Background Refresh.” How is that related to this article, or more specifically, when do I select this option (Extract, Transform, and/or Load queries)? Thanks for sharing.
Your understand is correct. In PowerPivot, it is controlled with creating a connection only.
I believe the “Enable Background Refresh” option that you see is similar to “Include in Report Refresh” that we have in Power BI Desktop. It means that this query will be part of the refresh process (which should be in this case), but won’t be loaded individually to the model.
quick question on reference/dataflow/incremental refresh. We have 2 queries on dataflow with Premium capacity.
First query : SQL connection, no modification, incremental refresh applied
Second query : Reference to the first query, query has many steps.
The pbix only import the second query. Is it a good way to work with reference/dataflow and incremental refresh?
It really depends on what you want to do
Is the referencing from the first query because you MAY use that in other queries in the future?
Is this because you MAY want to switch data sources in the future and you don’t want your whole query to not work as a result?
depends on what is the goal here, there are different answers
Reza, The Reference vs Duplicate article, greatly explained these two important functions. Great job
Very helpful article. Thanks verymuch Sir. I now understand the difference between Duplicating a query and Referencing a query. What I still don’t understand is the difference between Copy and Paste, and Duplicate.
copy paste will also create a copy of all other queries that this query is referenced from. but duplicate, just create a copy of this query.
Hi Reza, Thanks very useful article. I have a Python script in Table 1 that downloads invoices, then it process the data in PowerBI. Each time I make a change to the table, it re-runs the python script. I thought if I create Table 2 by referencing (not duplicating) the first query. Now each time I make a change to table 2 it will not need to refresh table1. But yet it keeps refreshing both tables thus re-creating both tables.
referencing a query does not necessarily materialize the original query.
If you want to do something like that the computed entity in the dataflow might be a better option for you
i have table 1 from SQL query and i need the same table 1 data in the another table 2. the another table 2 should not refresh and the data has to load from table 1 in power query editor.
You can have a duplicate of your table 1, and just change the first few lines of it so that it Gets data from the other data source. something like what I explained here but without parameters
Which of the two methods consumes more resources when updating? Duplicate or Reference?
It depends on many things.
If you run in Power Query dataflows, then a referenced query is using compute engine, so much better for consuming less resources. because it reads from the source once.
if you run in Power Query in Power BI Desktop, most likely both are doing the same querying from the source, especially if query folding isn’t happening.