Incremental Load is always a big challenge in Data Warehouse and ETL implementation. In enterprise world you face millions, billions and even more of records in fact tables. It won’t be a practical practice to load those records every night, as it would have many downsides such as;
- ETL process will slow down significantly, and can’t be scheduled to run on small periods.
- Performance of the source, and destination server will be affected badly, downtime of these systems would be longer.
- More resources will be required to maintain the process. such as better processors, more RAMs… and adding these won’t help so much at the end, because the amount of data is increasing as times passes.
- and many other issues.
So what would be the solution? Solution is Incremental Load approach. In this approach data will be loaded partially, preferably only part of the data that has been changed. A change set will be much smaller than the total amount of data. As an example in a 200 million records fact table which stored data for 10 years, only 10% percent of that data might be related to the current year and changes frequently, so you won’t usually required to re-load the rest 180 million records.
There are different methods to identify change set and implement the ETL process in a way that only transfer the change set. In this post I’ll explain one way of Incremental Load which is an efficient way especially when you work with a source database that supports Change Data Capture technology. Change Data Capture is a technology that capture inserts, updates, and deletes into changed sets. This technology is available in some RDBMS such as SQL Server and Oracle. SQL Server supports Change Data Capture only in Enterprise edition of SQL Server 2008 or higher versions. Change Data Capture works with LSN (Log Sequence Number) and log file to capture changes. In this post I don’t want to talk about CDC (Change Data Capture) as it would require it’s own topic, I encourage you to read more about CDC in books online here.
As always I would explain the method through an example, because I’ve found this as the best way to learn a topic for readers, you’ll see the usage of it through an example.
Change Data Capture Configuration on the source database
This section explains how to enable CDC on the source database and table(s). If you have a database with CDC configuration enabled then you can skip this part.
Enable CDC (Change Data Capture) on a Database
Change Data Capture should be enabled in two levels: Database, and Table. First step to enable CDC on a database is simply possible with running sys.sp_cdc_enable_db stored procedure on your database.
Important Note: For Change Data Capture to work correctly you need to have SQL Server Agent started and running.
Check CDC configuration for databases
with running query below you can check which databases has CDC enabled:
select name from sys.databases
Enable CDC for Table
system stored procedure sys.sp_cdc_enable_table is for enabling CDC on a table. you see the example usage of this in code below;
@source_schema = N’dbo’,
@source_name = N’SalesTransactions_ForIncrementalLoad’,
@role_name = NULL,
@supports_net_changes = 1
As you see in above screenshot, two jobs started with this configuration. That’s why I’ve mentioned earlier that SQL Server Agent needs to be started and working for CDC to work correctly.
Check CDC configuration on Table
After configuring CDC on the table you can check it with query below on sys.tables:
from sys.tables tab
Change Data Capture Control Flow Task in SSIS
SSIS 2012 came with a new control flow task named CDC Control Task, which works with CDC enabled databases and tables. CDC Control Task controls the life cycle of change set for a CDC marked database and table. This task enables SSIS package to use CDC change set, apply the transfer of data as required, and finally mark the change set as accomplished, or if an error happened retain the change set for further investigation. This task holds state of CDC into a package variable, this variable then can be used later in CDC Data Flow components.
Starting the CDC for a table
The very first step is to configure the CDC Control Task is to Mark CDC for Start. For doing this drag and drop a CDC Control Task into your package. Configure it as below:
- Set a connection manager (ADO.NET Connection Manager) to the source database. In my example the source database is SourceDB_CDC.
- Set CDC Control Operation as: Mark CDC Start
- Set a variable of type string for CDC State.
- Set the connection for the database contains state data. In my example the database is the same: SourceDB_CDC
- Set the table for storing CDC state. You can create a new table here if you don’t have any table for it already.
- Verify or Set the State Name values.
Screenshot below shows the configuration for this example;
After configuring this task, run the package! you will see (if everything be fine) a green indicator showing that task ran successfully. You might wonder what happened with running this task?! No data transferred, you haven’t set any actions for the change set. Don’t worry this is just the beginning. this task needs to be run one and only once, and its job is to set the CDC state in cdc_state table.
Now query cdc_state table and you will see a state stored in this table. the state’s value has a timestamp portion in it that shows the data and time of storing the state. So what is this state? this is the state of table stored so SSIS could recognize the very first state of CDC, and get the range of changes afterwards.
Now disable the task, as we don’t want to run that again.
Work with the Range of changes with CDC Control Task
After initializing the CDC state with CDC Control task, we need couple of more CDC control tasks; One for Getting the Range, and another for Marking it. In other words first we have to get the range of data that has been changed, then after a set of successful actions through ETL we should mark the range as processed. So we will create two more CDC control tasks with slightly different configuration as below:
- Create a new CDC Control Task, configure it exactly as you did for the “CDC Control Task – Start”, but with setting the CDC control operation as “Get Processing Range”.
- Create another CDC Control task same as previous one, set the CDC control operation as “Mark Processed Range”.
- Now place a Data Flow Task between these two tasks. leave the data flow empty for now, as we should fill it in next stage. Data Flow is to read the change set and do appropriate action based on delete, insert, or update happened on the data rows.
Read the Changed Set with CDC Source and CDC Splitter
In the data flow task we have to read the changed set with the help of CDC change set table, CDC data flow components (CDC Source, and CDC Splitter), and CDC state (stored by CDC Control Tasks). CDC Source is a component that read the change set and provide it as the main output so it can be used for CDC splitter. CDC Splitter split the change set into three data set: Inserted, Deleted, and Updated outputs. For this example I’ve used a stage table as the destination of this step to be able to write result sets into it. my state table is exactly same as the source table plus single column for Status. I’ll fill the status column in Derived Column depends on the change happened on the data row.
- Create a CDC Source component in the data flow.
- Set ADO.NET connection manager of it to the source database which has CDC enabled.
- Set CDC enabled table name
- verify the CDC processing mode to be Net
- Set CDC state variable the same variable that we’ve used in CDC Control Tasks.
- Create a CDC Splitter component after the CDC Source. CDC Splitter doesn’t require any special configuration, the only requirement is that a CDC source should be connected to it.
- Create a Derived Column transformation and connect InsertOutput of the CDC Splitter to it. Create a Status Column in Derived Column and set its value as 0.
- Create a derived column transformation for DeleteOutput with status of 1, and another for UpdateOutput with status of 2.
- Use a Union All transformation to integrate all three data sets together.
- Load them into the stage table with an OLE DB Destination.
Now if you run the package you won’t see any records in data paths, as there were no changes. but if you make a change in the source table, then records will appear in their appropriate passes. Below screenshot shows how the package run with showing changes.
and If I query my stage table, I can see the change set captured simply. Status column shows that if that record inserted (0), deleted (1), or updated (2);
As I’ve used a stage table to extract the change set, I’ve used an execute sql task to truncate that before any other task;
This stage table simply can be used to apply updates or deletes or inserts in the fact table.
Apply Changes into the Fact Table
After gathering all change set into a stage table, then you can simply use some conditional split and lookup transformations to apply changes into the fact table in the data warehouse. It is always a best practice to go through an staging table to the fact table, with this method if there be any issues you can always trace back and investigate. I used another package for this load section named as cdc_load (all works so far was in another package called cdc_staging).
- Create a single Data Flow task in this package.
- Create an OLE DB Source connected to the stage table.
- Use a Conditional Split to identify if the record is insert or not. (this is simply possible with checking Status column).
- If record is Insert then we simply want to add it to destination. But if the record is update or delete we delete the record first (sometimes you might prefer to soft delete the record). an OLE DB Command transformation can be used for this which takes the record identifier from the stage table and removes the record.
- If the record is Deleted, then our job is done. However if the record is updated, then we have to insert that record after removing it. so we check for the status again and Union updated and inserted records together.
- After some Lookups to identify surrogate keys of dimensions, we can load data into the destination fact table.
This was a simple example of using Change Data Capture for Incremental Load. This method works with very good performance as it only get the changed set and process that part only. I highly recommend using this method especially if your source system is CDC enabled. I’ll write about other methods of implementing Incremental Load in future posts.