(DV) Datavault approach for your DWH

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

In this schema is mainly composed by 3 types of tables:

 

Hubs:

They are core business keys

* Hubs Model Entities

* Contains business key
-PK in absence of surrogate key

* Metadata:
-Record source
-Load date/time

* Optional surrogate key
-Used as PK if present

* No foreign keys!

 

Links:

They form all relationships.

* Links model relationships
-Intersection table (M:n relationship)

* Foreign keys to related hubs or links
-Form natural key (business key) of the link

* Metadata:
-Record source
-Load date/time

* Optional surrogate key

 

Satellites:

They provide all descriptive information.

* Satellites model a group of attributes

* Foreign key to a Hub or Link

* Metadata:
-Record source
-Load date/time

An example:

If you want to create diagrams, it is easy to implement following colors and shapes:

Codes Colour Shapes
Hubs Blue Square/Cube
Links Red Oblong/Oval
Satellites Yellow/Orange or Yellow/Green or Green Rectangle/Flat
Reference Entities None or Grey/Purple Circle
Hierarchical Link Red Pyramid

 

How to build Satellites

From an agility perspective, this approach takes us several steps in the right direction.  Things that change frequently are grouped into a common structure.  This limits the impact of changes during the lifecycle of the data warehouse.  The 3 to 5 Satellites above are commonly distinguished from each other based on differing rates of change, different types of data and also different record sources.  Let’s take a brief look at each one of these considerations:

 

  • Rates of change.

There will inevitably be some attributes that change over time more frequently than other attributes.  When we apply this criteria to modeling Satellites we need to consider the 80/20 or even the 90/10 rule.  That is to say we generally group rapidly changing attributes together and then leave the rest.  We may even have three categories of rates of change – Rapid, Sometimes, Never – for example.  But if we analyze too much, try to use too much precision in our decision, then we may find ourselves gravitating towards the left of this continuum, towards the One Attribute per Satellite.

 

  • Types of Data.

There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs.  So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc.  There are no hard and fast rules on these groupings.  Each should become apparent when you as the designer are working with the business to define these Hub keys.

 

  • Record Sources.

Even though we are designing the EDW data vault based on the central view (specifically not based on source systems or departmental views) we do have groupings of attributes that come from certain areas within the organization.  Often this is in line with the design we were contemplating based on the Types of Data factor.  This is because certain groupings of Types of Data tend to also be related to specific business processes and so also certain source systems.  As a first step then we should look to see how in line these two factors are for our given EDW scenario.

There is many good points (as well as weak) but the keys for me are:

* HISTORICAL AND AUDIT DATA: By default all the H, S and L have a ValidFrom, ValidTo, LoadDateTime and RecordSource fields (ALWAYS) which allow you to track history in every single entity in your DWH, plus a perfect audit data to know when this record come in and from where

* UPDATE ONLY WHAT YOU NEED TO UPDATE: You can have a tabular cube /DWH with one fact table or dimension (i.e. using star schema instead of snowflake) with several fields (hundreds of them even) and you can split them into satellites with more and less change rate. That means you don’t need to update all the 100 records in the dimension, only one satellite which it will contain all the possible fields to be updated regularly.

* BIML + DATAVAULT: BIML is a good tool in order to automatize the creation of SSIS Packages quite fast. But you won’t create a BIML only in order to create one package… usually all the dimensions and facts are different and there is no code you can “reuse”, but in the case of DATAVAULT it is fixed enough to know where are the links, where the satellites, which are the validFrom, ValidTo etc fields, PKs, etc. Therefore you can create a SQL Stored Procedure (as I did in some projects) where only giving some data about the new packages/tables like name of the fields and how they are distributed (how many satellites/links) then it can recreate the whole SSIS Solution in 5 seconds, as well as create scripts that will transform quite easily from your DV to a Kimball dimensional modelling with a mathematical 0 error.

* …

 

In the next weeks I will explain with detail my experience creating a framework of BIML with Datavault and how that could recreate all the packages (an average of 60 to 100 SSIS packages (H+S+L) per DV project) in seconds. Stay tuned!

Regards,

Albert Garcia

 


Did you like this? Share it:

Leave a Reply

*