Previously I’ve mentioned how to implement Incremental Load within SSIS using CDC (Change Data Capture) feature. Unfortunately CDC is not always supported by the source database, so you have to implement an incremental load solution without CDC. In this post I’ll explain how to implement incremental load if there is a modified datetime column in the data source tables or data sets. This method can be applied to any data sources (relational or non-relational) as long as there a modified datetime column in the source data set.
Why Incremental Load?
- You don’t want to re-load 10 TBs of data from source tables every night, and even if you do want that the ETL process will take long time and the ETL server will crash half way through the process! You will need a very high performing ETL server with massive resources (RAM, CPU…) to run that ETL process.
- BI or Reporting users sometimes need their data to be updated more frequently than daily process. Consider if you want to run ETL process every 30 minutes or every hour; You have to carefully think about what you transfer in your ETL process to minimize the ETL process time.
- Source systems are in use by other users or systems; So you cannot keep them busy for 10 hours to load all the data set from mid-night to 10AM morning!
and many other reasons…
Why So much resistance to Incremental Load?
I have seen and heard everywhere sentences like: “We don’t do Incremental Load, Full Load is a better option!” or “Incremental Load isn’t appropriate, because of …”. In my opinion there are few reasons why people might resist to do incremental load, here are reasons:
- Implementing Full Load is easy! Really, to be honest, it is easier to write “Select * from mysourcetable” and load it into the staging table rather than checking for the change set!
- Low time and budget for implementing Incremental Load in ETL process. I’ve seen this approach in many consultancy companies which care about limitation on the chargeable hours for customer more than the quality of work and customer satisfaction!
- The data set is so small and fast. This might be the only acceptable reason to do a full load rather than incremental, However this should be carefully considered with factors such as growth rate of the data set. For example would this small data set stay small 10 years later? if not, do the incremental load!
With the methods I explain in Incremental Load posts I intend to make the process of writing and developing Incremental Load easier and faster for you.
Using Datetime columns to implement incremental load
As I mentioned in the top, Previously I’ve explained how to do incremental load with CDC (Change Data Capture). CDC method is performing great, however it is not a possible option in all scenarios. In fact the source database should support CDC. If you are using some data sources that doesn’t support CDC then your first stop should be thinking about datetime columns or timestamp column in the source data sets. In many cases the source system have columns such as Created datetime or Modified datetime that stores the datetime of creating and modifying the record in the table. These datetime columns can be used to implement Incremental Load. In this post we will go through process of implementing an Incremental Load solution with SSIS using modified datetime column.
The idea behind this method is to store the latest ETL run time in a config or log table, and then in the next ETL run just load records from the source table that have modified (with their modified date greater than or equal to) after the latest ETL run datetime. This will create the change set for the data table. The change set might contains inserted, updated, or deleted records. to identify which change happened on the record you need to compare the change set with existing records and separate inserted, updated, and deleted records. This change set with the action on each record can be inserted into staging tables, and then be used to apply on the fact table based on appropriate action.
Instead of using an extra table for log or config to store the latest ETL run time, I just use the fact table, and get the maximum modified datetime from the fact table. This will give me the datetime of the latest record loaded into the fact table.
select max(modifieddate) as lastdate from factinternetsales_t
I write the result set into a package variable, and use that variable to generate select command from the source table;
,convert(nvarchar(15),[CustomerID]) as CustomerID
where modifieddate>'"+ @[User::LastModifiedDate]+"' order by [SalesOrderNumber]
Note: the very first time that you run ETL there won’t be any values populated for the LastModifiedDate variable, so you have to check if there is no value then assign datetime 1900-01-01 or a default early date.
Now the command above can be used in the OLE DB Source to get only changed set.
Then I check the change action happened on the change set with a Lookup component, In the Lookup component I only load IDs from the fact table (to minimize the load)
Lookup then gives the match and no-match output. Match output will be updated or deleted records. and no match output will be new records. Depends on the action, I set status of 0 for inserted, 1 for deleted, and 2 for updated. I also use a conditional split to identify if the record has been deleted or not.
Note: In my scenario I assumed that there is an IsDeleted field in the source table, a flag that is 1 if record is deleted. Your data source table might not have IsDeleted, and the actual record might be hard deleted from the table. If that is the case then you can use Merge Join transformation with Full outer join setting instead of Lookup to identify deleted records.
Here is the data flow to find inserted, updated, and deleted records;
Here is the control flow structure of the Staging package (which brings data from source table into staging):
To Load data from stage table into the fact table I select everything from stage table (because stage table contains only change set)
And then if the action is update or delete I’ll remove the record from fact table (if it is update it will be added back again). If the action is insert the record will be inserted again into the fact table.
Before inserting the record into fact table I’ll do dimension lookups to fetch surrogate keys. Here is the data flow to load data into the fact table;
Now your fact table is up-to-date with incremental load approach rather than full load.
Note: there might be some variations in data flow tasks or components to use, but the overall concept of incremental load using modified datetime is always the same. I have seen some implementations that only load data from the source table where the modified datetime is greater than latest ETL run minus 1 day (or 2, or 5…). and they remove everything from the fact table for that period, but that method doesn’t work correctly if the source system doesn’t support soft delete. The method explained above catches deleted changes as well as updates and inserts.