The Data Vault is a detail-oriented, historical-tracking and uniquely linked set of normalized tables that support one or more functional areas of the business. It is a hybrid approach encompassing the best of breed between third normal form (3NF) and the star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

This schema is mainly composed of three types of tables:

 

Hubs:

They hold the core business keys.

* Hubs model entities

* Contain the business key
- PK in the absence of a surrogate key

* Metadata:
- Record source
- Load date/time

* Optional surrogate key
- Used as PK if present

* No foreign keys!

 

Links:

They form all the relationships.

* Links model relationships
- Intersection table (many-to-many, M:N, relationship)

* Foreign keys to the related Hubs or Links
- These form the natural key (business key) of the Link

* Metadata:
- Record source
- Load date/time

* Optional surrogate key

 

Satellites:

They provide all descriptive information.

* Satellites model a group of attributes

* Foreign key to a Hub or Link

* Metadata:
-Record source
-Load date/time

An example: if you want to create diagrams, it is easy to follow these colour and shape conventions:

Codes (colour; shape):
* Hubs: Blue; Square/Cube
* Links: Red; Oblong/Oval
* Satellites: Yellow/Orange, Yellow/Green or Green; Rectangle/Flat
* Reference Entities: None, or Grey/Purple; Circle
* Hierarchical Links: Red; Pyramid
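To make the three table types concrete, here is a minimal T-SQL sketch for a hypothetical Customer/Order area (all table and column names are illustrative, not part of any standard):

-- Hubs: business key plus load metadata, no foreign keys
CREATE TABLE HUB_CUSTOMER (
    CustomerHK     INT IDENTITY(1,1) PRIMARY KEY,  -- optional surrogate key, used as PK
    CustomerNumber VARCHAR(20) NOT NULL UNIQUE,    -- business key
    RecordSource   VARCHAR(50) NOT NULL,
    LoadDate       DATETIME    NOT NULL
);

CREATE TABLE HUB_ORDER (
    OrderHK      INT IDENTITY(1,1) PRIMARY KEY,
    OrderNumber  VARCHAR(20) NOT NULL UNIQUE,
    RecordSource VARCHAR(50) NOT NULL,
    LoadDate     DATETIME    NOT NULL
);

-- Link: M:N relationship; the foreign keys to the Hubs form its natural key
CREATE TABLE LNK_CUSTOMER_ORDER (
    CustomerOrderHK INT IDENTITY(1,1) PRIMARY KEY, -- optional surrogate key
    CustomerHK      INT NOT NULL REFERENCES HUB_CUSTOMER (CustomerHK),
    OrderHK         INT NOT NULL REFERENCES HUB_ORDER (OrderHK),
    RecordSource    VARCHAR(50) NOT NULL,
    LoadDate        DATETIME    NOT NULL
);

-- Satellite: descriptive attributes attached to one Hub (or Link)
CREATE TABLE SAT_CUSTOMER_PROFILE (
    CustomerHK      INT NOT NULL REFERENCES HUB_CUSTOMER (CustomerHK),
    LoadDate        DATETIME NOT NULL,
    RecordSource    VARCHAR(50) NOT NULL,
    CustomerName    VARCHAR(100),
    CustomerSegment VARCHAR(50),
    PRIMARY KEY (CustomerHK, LoadDate)
);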

 

How to build Satellites

From an agility perspective, this approach takes us several steps in the right direction. Things that change frequently are grouped into a common structure, which limits the impact of changes during the lifecycle of the data warehouse. The three to five Satellites that typically hang off a Hub are commonly distinguished from each other based on differing rates of change, different types of data and different record sources. Let's take a brief look at each of these considerations:

 

  • Rates of change.

There will inevitably be some attributes that change over time more frequently than others. When we apply this criterion to modeling Satellites, we need to consider the 80/20 or even the 90/10 rule. That is to say, we generally group the rapidly changing attributes together and then leave the rest. We may even have three categories of rates of change, for example Rapid, Sometimes and Never. But if we analyze too much, or try to use too much precision in our decision, we may find ourselves gravitating towards the extreme end of this continuum: one attribute per Satellite. (A sketch of such a split follows after these three considerations.)

 

  • Types of Data.

There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs.  So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc.  There are no hard and fast rules on these groupings.  Each should become apparent when you as the designer are working with the business to define these Hub keys.

 

  • Record Sources.

Even though we are designing the EDW Data Vault based on the central view (specifically not based on source systems or departmental views), we do have groupings of attributes that come from certain areas within the organization. Often this is in line with the design we were contemplating based on the Types of Data factor, because certain groupings of types of data tend to be related to specific business processes and so to certain source systems. As a first step, then, we should look at how aligned these two factors are for our given EDW scenario.
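As a rough sketch of where these considerations can lead, assuming a Customer Hub and made-up attribute names: fairly static, CRM-sourced base data and rapidly changing, order-system-sourced status data would land in separate Satellites.

-- Slowly changing attributes sourced from the CRM
CREATE TABLE SAT_CUSTOMER_BASE (
    CustomerHK   INT NOT NULL,
    LoadDate     DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,   -- e.g. 'CRM'
    BirthDate    DATE,
    Gender       CHAR(1),
    PRIMARY KEY (CustomerHK, LoadDate)
);

-- Rapidly changing attributes sourced from the order system
CREATE TABLE SAT_CUSTOMER_STATUS (
    CustomerHK     INT NOT NULL,
    LoadDate       DATETIME NOT NULL,
    RecordSource   VARCHAR(50) NOT NULL, -- e.g. 'ORDERS'
    CustomerStatus VARCHAR(20),
    LastOrderDate  DATE,
    PRIMARY KEY (CustomerHK, LoadDate)
);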



I created a quick video explaining some points about what Big Data is and how Microsoft is working on it for all the partners and companies who trust their solutions. In a few words, we could say that working in Big Data is a BIG thing, and working on these kinds of projects will challenge you as a BI or Data Architect/Modeler. I hope you enjoy the video; please use the subtitles and forgive my Spanish accent, but if you are from an English-speaking country I bet I have a better Spanish or Catalan accent than you 😉

In my next post I will dive deeper into Hadoop, Hive and the Windows Azure solution (Microsoft Hadoop). If you want hands-on labs right now, you just have to ask for a trial (no visa required) at www.hadooponazure.com. They will give you access to a 1.5 TB Hadoop environment on Windows Azure for a period of time. Enjoy it!



Right now I am doing some research on what BIG DATA is, how Microsoft (Azure and Hadoop) can help with it, and some brief hands-on work (beyond the theory). But first I just wanted to help any SSIS developer who has to work with SCDs (Slowly Changing Dimensions) and is getting poor performance, along with some other little problems.

There are two ways to implement an SCD in SSIS:

 

Using the default Data Flow item “Slowly Changing Dimension”

This is not the best practice, because it has serious performance issues. The only benefit of this item is that you will invest less than a minute creating your SCD Type 2. However, if you really need or like it, here are two little pieces of advice that will help you.

In an SCD Type 2 you need a start date and an end date in order to keep a history of your records. The wizard for this item will ask you which SSIS variable you want to use for your start date, and it won't even ask which value you want for your end date; by default, the end date will be NULL. What to do then?

* Start date: the best practice is to use GETDATE() in the SQL itself instead of a system or SSIS variable, because you really want to know the exact moment the row is written to the DB (there can be a gap between the SSIS task execution and your SQL insert). Besides that, imagine that your SSIS package is launched in another country in a different time zone; asking for GETDATE() in the SQL command will give you a better start date.

* End date: the default NULL is not wrong, but SSAS 2008 and NULLs are not very good friends, so I encourage you to use an unreachable date, like '9999-12-31', as the end date.
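One simple way to get both behaviours on the database side, as a sketch (assuming the DW.Dim_Employee table used later in this post, with StartDate and EndDate columns), is to add default constraints so that rows inserted without explicit dates pick up GETDATE() and the sentinel end date:

-- Stamp the start date at the moment the row hits the database
ALTER TABLE DW.Dim_Employee
    ADD CONSTRAINT DF_Dim_Employee_StartDate DEFAULT (GETDATE()) FOR StartDate;

-- Use an unreachable end date instead of NULL for the current version
ALTER TABLE DW.Dim_Employee
    ADD CONSTRAINT DF_Dim_Employee_EndDate DEFAULT ('9999-12-31') FOR EndDate;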

 

Using a custom Merge and SCD solution

This approach will give you an SCD task that runs 10x to 100x faster (depending on the volume). With a few steps and a little SQL coding you will get a best-practice SCD Type 2 in SSIS.

 

1st Step

Imagine the scenario where you want to move data from Staging.Employee to DW.Dim_Employee. Of course, in your DW you need a business key (something to help with lookups against your staging area) plus a StartDate and an EndDate (to track the history). The first step is to create a copy of DW.Dim_Employee called DW.Dim_EmployeeTemp.
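A quick way to create that empty copy, as a T-SQL sketch (assuming the dimension already exists), is:

-- Copies the structure of DW.Dim_Employee with zero rows
-- (note that SELECT ... INTO does not copy constraints or indexes)
SELECT TOP (0) *
INTO   DW.Dim_EmployeeTemp
FROM   DW.Dim_Employee;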

2nd Step

In your control flow, first create an Execute SQL Task that truncates your temp table, then a Data Flow to move and prepare all your SCD data, and finally a call to a stored procedure that inserts the new data and updates the old versions of that data.
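As a sketch, the two Execute SQL Tasks that bracket the data flow could run something like this (the stored procedure name is made up for illustration; it would wrap the SQL shown in the final step):

-- First Execute SQL Task: empty the temp table before each load
TRUNCATE TABLE DW.Dim_EmployeeTemp;

-- Last Execute SQL Task: apply the SCD Type 2 changes
EXEC DW.usp_ApplyScd2_DimEmployee;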

3rd Step

In your Data Flow you have to merge all the data from the staging area with the DW through your business key. In order to use the Merge Join data flow item, remember to sort the sources (via their output properties). As I said, you have to use a LEFT OUTER JOIN (everything from the staging area, and only the DW rows whose business key is found in the staging area). Then we have to create a Conditional Split with two possible outputs:

– If DW.Key is NULL: this means the staging key is not in the DW and is therefore a new row to insert. We insert these rows directly into the DW table.

– If Stage.Field1 != DW.Field1 || Stage.Field2 != DW.Field2 || Stage.Field3 != DW.Field3 || …: this is the case where the same business key exists in both the DW and the staging area, so we have to check all the fields that could change between versions. If any field has a different value (remember to use OR conditions), we insert the row into the temp table. The queries after this list show equivalent T-SQL you can use to sanity-check the split.
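These split conditions can be sanity-checked outside SSIS with equivalent T-SQL, as a sketch only (BusinessKey and Field1..Field3 are the generic names used above, and only the current DW versions are compared):

-- Rows that should go straight into DW.Dim_Employee (new business keys)
SELECT s.*
FROM   Staging.Employee AS s
LEFT OUTER JOIN DW.Dim_Employee AS d
       ON  d.BusinessKey = s.BusinessKey
       AND d.EndDate = '9999-12-31'
WHERE  d.BusinessKey IS NULL;

-- Rows that should go into the temp table (existing keys with changed fields)
SELECT s.*
FROM   Staging.Employee AS s
INNER JOIN DW.Dim_Employee AS d
       ON  d.BusinessKey = s.BusinessKey
       AND d.EndDate = '9999-12-31'
WHERE  s.Field1 <> d.Field1
   OR  s.Field2 <> d.Field2
   OR  s.Field3 <> d.Field3;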

4th and Final Step

We have to create a stored procedure that updates (expires) the current DW rows whose business keys are found in the temp table, and then inserts the temp rows into the DW (with EndDate = '9999-12-31', of course).

Some pseudo code could be:

UPDATE DW.Dim_Employee SET EndDate = GETDATE()
WHERE BusinessKey IN (SELECT BusinessKey FROM DW.Dim_EmployeeTemp) AND EndDate = '9999-12-31';

INSERT INTO DW.Dim_Employee (BusinessKey, Field1, Field2, Field3, StartDate, EndDate)
SELECT BusinessKey, Field1, Field2, Field3, GETDATE(), '9999-12-31' FROM DW.Dim_EmployeeTemp;

That’s all, Cheers!

 



Lotus Temple – New Delhi (India)

 



Defining best practices for testing an SSIS solution is not an easy task, but I would like to share what I consider good practices, based on my experience and humble knowledge of the matter.

First of all, you should test your SSIS packages not only when you finish your development but at every little step/task you work on. Most of us are used to developing a whole package without any testing first, because of the unique nature of this visual programming “language”, which is so easy, fast and rewarding to work with. That's just an illusion: you shouldn't work on a whole package (or even solution) without testing it little by little.

So the first good and reasonable practice is to test every task or data flow you create. Thanks to this, you will detect problems earlier (no nasty surprises after days spent developing a wrong idea) and reduce the cost of fixing them, isolating issues into small problems to solve before any further development.

A good (and painful) practice is to apply “Continuous Integration”: whenever you are working on a big solution (perhaps with several developers), you have to test your code every time you check in to your TFS, taking special care not to break any other part of the solution, something called “Regression Testing”.

 

Strategy to implement a UnitTest

 

 

 

UnitTest & Test Suites

There are many good testing tools for SSIS beyond the C# unit tests provided by MS Visual Studio; some of them are:

 

  • ssisUnit: what I consider the best and easiest one for SSIS, with a lot of people working on it and sharing and blogging about their experiences with this tool
    http://ssisunit.codeplex.com/

 

  • xUnit: probably a good test suite for SSIS, but it is built for .NET development (similar to what VS can offer you)
    http://xunit.codeplex.com/

 

  • NUnit: a sibling of xUnit; you can use it for testing your .NET developments and some SSIS as well.
    http://www.nunit.org/

These test suites help you create automated tests for whole packages/solutions at different levels. That is the more professional and correct approach for big solutions.

This video offers some help on using them:
http://files.sqlbits.com/event10/unit_testing_SSIS_packages/unit_testing_SSIS_packages_c_1000.mp4

 

Manual Testing

Here I would like to present what, for me, is more than just an alternative to the unit testing tools. From my point of view, some of these little tips should be applied in 99% of cases; whether or not you are going to implement a full unit test solution right afterwards, they will help you check step by step that everything goes as expected.

1) Use logging in your SSIS packages with a SQL table. It is something I would never skip, because it gives you a whole audit framework for free, writing to a SQL table whether any of your packages/tasks has a problem and how long each one takes (useful for improving performance as well).
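For example, assuming you use the SQL Server log provider (which writes to the dbo.sysssislog table by default in SSIS 2008) and have enabled the OnError and OnTaskFailed events, a couple of sketch queries can surface failures and task durations:

-- Failures logged by the packages
SELECT source, event, starttime, endtime, message
FROM   dbo.sysssislog
WHERE  event IN ('OnError', 'OnTaskFailed')
ORDER BY starttime DESC;

-- Rough duration per executable, useful for spotting performance problems
SELECT source,
       MIN(starttime) AS started,
       MAX(endtime)   AS finished,
       DATEDIFF(SECOND, MIN(starttime), MAX(endtime)) AS seconds
FROM   dbo.sysssislog
GROUP BY executionid, source
ORDER BY seconds DESC;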

2) Checkpoints will help you a lot in a long development, saving time because you don't have to relaunch the whole solution up to the point where it failed.

3) If you are using checkpoints, a good practice is to take care with the TransactionOption property, in case you don't want to commit a data flow that writes something at the beginning and then fails at the last step (for example).

4) Test the same package on another box. This helps to test properties like DelayValidation and PackageProtectionLevel. Whenever SSIS packages are executed, the SSIS runtime first checks the package for validity and, provided it passes validation, then executes the package.

5) Use the ForceExecutionResult property with each task to check how the flow behaves in case of failure, success, and completion.

Checklist

 

Design and best practices: check whether best practices have been followed.

* Best practices related to optimization/tuning/design have been followed. (Passed: OK)

* Naming convention at project/package/task/component level is applied. (Passed: X)

* Package structure is aligned with package design best practices. (Passed: OK)

Testing: check whether testing has been completed.

* Normal flow scenario:
- Verify all tables/rows have been imported.
- Verify all data has been imported without any truncation (for each column).

* Error flow scenario:
- Verify the error flow when no input data source is available.
- Verify the error flow when no output data source is available.
- Verify the error flow related to input data quality (NULL values, strings instead of numeric values, etc.).

Logging: check whether the chosen logging mechanism is in place and working properly for the selected events.

Deploying: check whether the package(s) have been deployed and executed as expected in a proper testing environment.

Configuration: check whether the chosen configuration mechanism is in place and working as expected.

Keep these points in mind to have a complete and successful SSIS test:

1. Verify all the tables have been imported.
2. Verify all the rows in each table have been imported.
3. Verify all the columns specified in the source query for each table have been imported.
4. Verify all the data has been received without any truncation in each column.
5. Verify that the schemas at source and destination match.
6. Verify the time taken/speed of the data transfer.
7. Check for fields truncated due to a difference in field length at the destination.
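As a sketch of points 1, 2 and 4, assuming the Staging.Employee and DW.Dim_Employee tables from the earlier SCD example (and an illustrative 50-character limit on Field1 at the destination), simple reconciliation queries could be:

-- Compare row counts between source and destination (current versions only)
SELECT 'Staging.Employee' AS TableName, COUNT(*) AS RowsCounted FROM Staging.Employee
UNION ALL
SELECT 'DW.Dim_Employee', COUNT(*) FROM DW.Dim_Employee WHERE EndDate = '9999-12-31';

-- Find source values that would not fit the destination column length
SELECT BusinessKey, Field1
FROM   Staging.Employee
WHERE  LEN(Field1) > 50;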

