Best Practice of SCD in SSIS

Right now I am making a research about what is BIG DATA, how Microsoft (Azure and Hadoop) can help with it and some brief hands-on (beyond the theory). But before I just wanted to help to any SSIS developer who has to work with SCD (Slowly Changing Dimensions) and is getting a bad performance response and with some little problems.

There is two ways to implement a SCD on SSIS:

 

Using the by default dataflow Item “Slowly Changing Dimension”

This is not the best practice, cause has a serious performance issues. The only benefit of this Item is that you will invest less than a minute creating your SSIS type 2. However, you really need/like it there is 2 little advices that will help you.

In the SCD Type 2, you need a starting date and end date in order to keep a track/history of your records. But the wizard of this item, will ask you to choose which SSIS variable you want to use for your Starting Date and also, it won’t even ask you about which value you want for your end date. By default, the end date will be NULL. What to do then?

* Starting Date: the best practice is to use a customized getdate() instead a system or SSIS variable. Cause you really want to know the exactly moment when is writing in the DB (and could be a gap between the SSIS execution task and your SQL insert). Besides that, imaging that your SSIS is launched in other country with different GMT… then asking for this GETDATE() sql command, will give you a better starting date.

* End Date: Well, the default NULL is not wrong, but SSAS 2008 and NULLs are not a very good friends, therefore I encorage you to use a non reachable date, like “9999/12/31” as the end date.

 

Using personalized Merge and SCD solution

This way will give you from 10x to 100x times faster SCD task (depending on the volume). There is some few steps and some little SQL Coding and you will get the best practice with a SCD Type2 on SSIS.

 

1st Step

Imaging the scenario that you want to move data from your Staging.Employee to your DW.Dim_Employee, of course you need in your DW a business key (something to help making look ups with your StagingArea) and a StartDate and EndDate (to track the history). Then the first step will be create a copy of your DW.Dim_Employee called “DW.Dim_employeeTemp”.

2nd Step

In your CONTROL FLOW you will create first a SQL Command that will truncate your temp table, then a Data flow to move and prepare all your SCD data and finally we will call to a SP in order to insert the new data and update the old versions of this data.

3rd Step

Into your Data Flow you have to merge all the data from the Staging Area to DW through your business key. In order to use the merge Item Data Flow, remember to sort the Sources in their properties. As I said you have to use a LEFT OUTER JOIN (everything in the Staging Area and only the rows of DW with a Business Key found in the Staging Area. Then we have to create a Split with two possible ways:

– IF DW.Key is null = that means this Stage key is not in the DW and therefore, is a new row to insert. Then we have to insert them in the DW table

– IF Stage.Field1 != DW.Field1 ||¬†Stage.Field2 != DW.Field2 ||Stage.Field3 != DW.Field3 ||… = this is the case when we have the same business key in the DW and the Stage, but then we have to check all the possible fields that could change between versions. If there is any field with different value (remember to use a OR condition) then we will insert them into the temp table.

4th and Final Step

We have to create a Stored Procedure in order to update all the DW business keys found in the temp table (with enddate = ‘9999/12/31’ of course) and then insert the temp rows into the DW.

Some pseudo code could be:

UPDATE SET DW.EndDate = getdate() WHERE DW.key in (Select Stage.key from Stage)

INSERT INTO DW SELECT key, field1, field2, field3, getdate() as ‘StartingDate’, ‘9999/12/31’ as ‘EndDate’

That’s all, Cheers!

 


Did you like this? Share it:

Leave a Reply

*