Thanks to all attendees to my webinar today, with more than 400 registration I’m thrilled to say this very first RADACAD webinar was a successful step towards regular free webinars. If you are interested in future webinars please stay tuned, we will announce them all in our events page. Today’s webinar was about ETL Incremental Load approach with SSIS. I had only time to go through one of the approaches which was using Change Data Capture, and unfortunately hadn’t chance to explain other approaches, so it is probably that we have another session later on to continue the subject. I’ve got many interesting questions that I’ll share here with answers. Don’t hesitate to ask if you have any other questions.
Webinar recording is available through this link:
Q. Can I get recording of this webinar (this asked by many of you)?
A. Yes, look at links above
Q. What is the requirement for source CDC table format?
A. Yes there are some types of columns that CDC won’t capture their changes, such as: Computed Columns, XML, Timestamp, Blog Data Types. Read the details here:
Q. I have a questions regarding SCD. Can we capture delete and inserts through SCD or just updates?
A. Yes, you can use CDC to capture changes for implementing SCD. However there are some other considerations for SCD which needs to be taken care of, such as Inferred Dimension Member. I’ll do another session in future about implementing SCD and Inferred Dimension members design solutions.
Q. Is there any possible way this can be achieved in BIDS 2008 or less?
A. Yes, CDC is a database engine feature that works with some stored procedures. You can develop your SSIS 2008 package with Execute SQL Tasks and some other components to query the change set. You have to do some custom development, and the job isn’t as easy as working with built-in components of CDC in 2012 or higher. However it is possible. Good question actually, I might write a blog post about it later on. Thanks.
Q. I’ve seen a lot of settings regarding collation going on in the existing packages that running ETL. Since I can’t just change the logic in a running setting, how will CDC work with these collation differences between the environments?
A. When you work with CDC in ETL, you enable that on the source system. Usually source system is in a production server and it’s collation won’t change when you develop your ETL solution.
Q. I have to work with CSV extracted from Oracle financial. This is my source and I will be loading them to SQL Server 2012. Would CDC method still apply if I use extracts to load to SQL tables ?
A. When you work with Oracle as a source you can use CDC for Oracle to get the change set. SSIS supports it simply. However if you can only get CSV files then the answer is no, CDC won’t be helpful for you. However you can ask for CSV files that only contains change set. or work with metadata date columns such as modified_datetime and use Incremental Load with DateTime columns method.
Q. I have huge data size 90 million records, can this method be tried? how will the performance be?
A. Of course! That’s exactly one of the reasons you have to use Incremental Load. I’ve implemented this solution for many fact tables with more than 500 million records, and it worked perfectly. You need to also consider some performance tuning consideration for your overall data warehouse performance such as indexes, best practices in dimensional modelling, columnstore indexes and etc.
Q. I was wondering what would be other recovery options for resuming incremental load apart from doing full load and continue with incremental load, when CDC gets reset (like when ETL from replicated copy of db)?
A. I’m not 100% sure that I understand this question correctly. Why CDC gets reset? if so, then you can assign the relevant LSN in CDC Control Flow task to do incremental load from that point. If I didn’t understand your question clearly, please explain more in comments section below.
Q. Has it need special permission on data source?
A. Yes, For enabling CDC on the database you need to be a member of sysadmin fixed server role. Read more here:
Q. can this approach also be used for incremental load from SQL Server to (other applications)(like salesforce, oracle etc) ?
A. Sure. When we talk about CDC we are talking about source system. your destination data warehouse can be anything, SQL Server, Oracle, MySQL…..
If you have any other questions please let me know in comments section below.
Title: Incremental Load with SSIS
Incremental Load is always a big challenge in Data Warehouse and ETL implementation. In enterprise world you face millions, billions and even more of records in fact tables. It won’t be a practical practice to load those records every night, as it would have many downsides such as;
ETL process will slow down significantly, and can’t be scheduled to run on small periods.
Performance of the source, and destination server will be affected badly, downtime of these systems would be longer.
More resources will be required to maintain the process. such as better processors, more RAMs… and adding these won’t help so much at the end, because the amount of data is increasing as times passes.
and many other issues.
So what would be the solution? Solution is Incremental Load approach. In this approach data will be loaded partially, preferably only part of the data that has been changed. A change set will be much smaller than the total amount of data. As an example in a 200 million records fact table which stored data for 10 years, only 10% percent of that data might be related to the current year and changes frequently, so you won’t usually required to re-load the rest 180 million records.
In this session I will show you how to implement Incremental load with SSIS with demos.