SSIS Incremental Load with Datetime Columns

Posted by on Jan 12, 2016 in ETL, SSIS | 14 Comments
Facebooktwittergoogle_plusredditpinterestlinkedintumblrmailFacebooktwittergoogle_plusredditpinterestlinkedintumblrmail

0

Previously I’ve mentioned how to implement Incremental Load within SSIS using CDC (Change Data Capture) feature. Unfortunately CDC is not always supported by the source database, so you have to implement an incremental load solution without CDC. In this post I’ll explain how to implement incremental load if there is a modified datetime column in the data source tables or data sets. This method can be applied to any data sources (relational or non-relational) as long as there a modified datetime column in the source data set.

Why Incremental Load?

  • You don’t want to re-load 10 TBs of data from source tables every night, and even if you do want that the ETL process will take long time and the ETL server will crash half way through the process! You will need a very high performing ETL server with massive resources (RAM, CPU…) to run that ETL process.
  • BI or Reporting users sometimes need their data to be updated more frequently than daily process. Consider if you want to run ETL process every 30 minutes or every hour; You have to carefully think about what you transfer in your ETL process to minimize the ETL process time.
  • Source systems are in use by other users or systems; So you cannot keep them busy for 10 hours to load all the data set from mid-night to 10AM morning!

and many other reasons…

 

Why So much resistance to Incremental Load?

I have seen and heard everywhere sentences like: “We don’t do Incremental Load, Full Load is a better option!” or “Incremental Load isn’t appropriate, because of …”. In my opinion there are few reasons why people might resist to do incremental load, here are reasons:

  • Implementing Full Load is easy! Really, to be honest, it is easier to write “Select * from mysourcetable” and load it into the staging table rather than checking for the change set!
  • Low time and budget for implementing Incremental Load in ETL process. I’ve seen this approach in many consultancy companies which care about limitation on the chargeable hours for customer more than the quality of work and customer satisfaction!
  • The data set is so small and fast. This might be the only acceptable reason to do a full load rather than incremental, However this should be carefully considered with factors such as growth rate of the data set. For example would this small data set stay small 10 years later? if not, do the incremental load!

With the methods I explain in Incremental Load posts I intend to make the process of writing and developing Incremental Load easier and faster for you.

 

Using Datetime columns to implement incremental load

As I mentioned in the top, Previously I’ve explained how to do incremental load with CDC (Change Data Capture). CDC method is performing great, however it is not a possible option in all scenarios. In fact the source database should support CDC. If you are using some data sources that doesn’t support CDC then your first stop should be thinking about datetime columns or timestamp column in the source data sets. In many cases the source system have columns such as Created datetime or Modified datetime that stores the datetime of creating and modifying the record in the table. These datetime columns can be used to implement Incremental Load. In this post we will go through process of implementing an Incremental Load solution with SSIS using modified datetime column.

The idea behind this method is to store the latest ETL run time in a config or log table, and then in the next ETL run just load records from the source table that have modified (with their modified date greater than or equal to) after the latest ETL run datetime. This will create the change set for the data table. The change set might contains inserted, updated, or deleted records. to identify which change happened on the record you need to compare the change set with existing records and separate inserted, updated, and deleted records. This change set with the action on each record can be inserted into staging tables, and then be used to apply on the fact table based on appropriate action.

Implementation Details

Instead of using an extra table for log or config to store the latest ETL run time, I just use the fact table, and get the maximum modified datetime from the fact table. This will give me the datetime of the latest record loaded into the fact table.

1

I write the result set into a package variable, and use that variable to generate select command from the source table;

2

Note: the very first time that you run ETL there won’t be any values populated for the LastModifiedDate variable, so you have to check if there is no value then assign datetime 1900-01-01 or a default early date.

Now the command above can be used in the OLE DB Source to get only changed set.

3

Then I check the change action happened on the change set with a Lookup component, In the Lookup component I only load IDs from the fact table (to minimize the load)

4

Lookup then gives the match and no-match output. Match output will be updated or deleted records. and no match output will be new records. Depends on the action, I set status of 0 for inserted, 1 for deleted, and 2 for updated. I also use a conditional split to identify if the record has been deleted or not.

Note: In my scenario I assumed that there is an IsDeleted field in the source table, a flag that is 1 if record is deleted. Your data source table might not have IsDeleted, and the actual record might be hard deleted from the table. If that is the case then you can use Merge Join transformation with Full outer join setting instead of Lookup to identify deleted records.

Here is the data flow to find inserted, updated, and deleted records;

5

Here is the control flow structure of the Staging package (which brings data from source table into staging):

6

 

To Load data from stage table into the fact table I select everything from stage table (because stage table contains only change set)

7

And then if the action is update or delete I’ll remove the record from fact table (if it is update it will be added back again). If the action is insert the record will be inserted again into the fact table.

Before inserting the record into fact table I’ll do dimension lookups to fetch surrogate keys. Here is the data flow to load data into the fact table;

8

Now your fact table is up-to-date with incremental load approach rather than full load.

Note: there might be some variations in data flow tasks or components to use, but the overall concept of incremental load using modified datetime is always the same. I have seen some implementations that only load data from the source table where the modified datetime is greater than latest ETL run minus 1 day (or 2, or 5…). and they remove everything from the fact table for that period, but that method doesn’t work correctly if the source system doesn’t support soft delete. The method explained above catches deleted changes as well as updates and inserts.

 

Facebooktwittergoogle_plusredditpinterestlinkedintumblrmailFacebooktwittergoogle_plusredditpinterestlinkedintumblrmail
rssyoutuberssyoutube
Reza Rad
Reza Rad is an Author, Trainer, Speaker and DW/BI Consultant. He has a BSc in Computer engineering; he has more than 15 years’ experience in databases, programming and development mostly on Microsoft technologies. He is a Microsoft MVP in Data Platform for seven continues years (from 2011 till now) for his dedication in Microsoft BI. He is author of some SQL Server and BI books, and also Power BI online book; from Rookie to Rock Star.

14 Comments

  • Hi there,
    Nice Article, I try o follow your instructions but it looks like I’m missing parts.
    Is there any way you can email me this package?
    Thanks,
    Oded Dror

    • Hi Oded,

      Unfortunately I don’t have the sample package I’ve built for this example blog post. If you let me know where you face an issue, I might be able to help

      Cheers,
      Reza

  • Hi Reza,
    Please explain what is done inside the component which you named as remove record.
    After deleting the updated and deleted records, how are you inserting the updated records again.

    Regards,
    Kranthi Kumar Reddy

    • Remove record will just delete the record.
      Because I am using one of the transformations in Data Flow, so the input and output would be the same plus any extra columns for that transformation. so I would have same output as the input row, and I can create the new row based on that then in other steps down to OLE DB Destination for inserting the record.

      Cheers,
      Reza

  • HI Reza

    I really appreciate your article on this incremental load based on the datatime but Unfortunately, Datetime comparison did not happening correctly. can you please send me if u have this package still.
    where modifieddate>'”+ @[User::LastModifiedDate]+”‘

  • Nice Approach !!!!!
    But I have some queries ….if we want to perform incremental load for particular past date , how we can do this … please suggest ASAP as I need to implement it in my project which is in very critical phase .

    • Hi Salma,

      if you want to run incremental load from specific date, you have to create a config table somewhere (for example SQL Server DB), and use SSIS to read that first, and start loading data from that date on-wards.
      Cheers
      Reza

  • How would I do if I have source as ODBC.

    I know one thing we can do is create an SQL Command at control properties level which I did but that does not help me.

    • Same method applies on ODBC connections. Only parameter place holders and names are different. This method is working regardless of data source. you need to have a data source table with modified/created datetime columns only.
      Cheers
      Reza

  • Hi Reza,

    Regarding the same query
    “Your data source table might not have IsDeleted, and the actual record might be hard deleted from the table. If that is the case then you can use Merge Join transformation with Full outer join setting instead of Lookup to identify deleted records.”

    If we implement this way, then there will incremental loading of the fact but without the timestamp which in fact means full records from source (100 million rows) joined with full records from fact (100 million rows).

    Will this not kill the performance of the entire process. Can’t this approached be based on a timestamp?

    Awaiting your reply.

    Thanks
    Uday

    • Hi Uday,
      That is in addition to use Modified DateTime columns. So you still need to get portion of data that their modified date time is after the latest ETL. In that case you are never loading full records from tables.

      Cheers
      Reza

Leave a Reply

Your email address will not be published. Required fields are marked *