Introduction

This is probably a key factor, technically speaking, in the future health of your BI solution. There is no perfect or unique “Data Modelling” that will fit ALL solutions, but there are some points that will power up your future DW and extend its possibilities to the max.

Data modeling is such an extensive concept that it can’t be covered in one post, nor even in one book, but we can provide some easy-to-understand “dos and don’ts” for creating your Staging Area and DW for reporting.

 

STAGING AREA

Prepare yourself for ANYTHING and get AS MUCH AS YOU CAN

This sentence is key when we are getting our data from the source into the staging area. Basically, we have to capture all the data from the source system and keep a record of the original modeling.

So it is important to get all the business keys, foreign keys, nullable/non-nullable field attributes, etc. from the source system tables. And, most importantly, record all the history at the same time. That means that every time we run our ETL, we need to bring in all the data and capture some audit columns in order to keep a record of the history. So if at any point a PK (or BK) changes in the source, we create another record in our Staging table with the CDC info, i.e. INSERT/UPDATE/DELETE.

That will create a database with the data as it is in the source system, adding some audit fields and history, so at any point in time we can recreate our DW using its history and we won’t miss anything out.
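As a minimal sketch, a staging table could look something like this (the entity and column names are purely illustrative, mirroring a hypothetical source table):

-- Raw copy of the source table, keeping the original datatypes, plus audit/CDC columns
CREATE TABLE Staging.Customer (
    CustomerID    INT          NOT NULL,                   -- source PK / business key, untouched
    CustomerName  VARCHAR(200) NULL,                       -- source columns with their original datatypes
    CountryCode   CHAR(2)      NULL,
    CDC_Operation CHAR(1)      NOT NULL,                   -- audit: I / U / D
    LoadDateTime  DATETIME     NOT NULL DEFAULT GETDATE(), -- audit: when the row landed in staging
    RecordSource  VARCHAR(50)  NOT NULL                    -- audit: which source system it came from
);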

Dos:

  • Raw data in Stage, respecting the source DDL (datatypes, fields, etc.)
  • Get any table that might be of potential interest. DB space/resources are limited, so I won’t say get EVERY single table no matter what, but if possible, that is not a bad idea.
  • Keep history, so you can recreate your DW and Datamarts easily in the future.

Don’ts:

  • Don’t change data, like computing or merging two fields (i.e. name + surname = full name).
  • Don’t crop or truncate any fields. If you think a datatype of VARCHAR(2000) is too big, keep that length in your staging anyway, or even round it up to 4000 or MAX.
  • Don’t start modeling or “fixing” the data model in your Staging Area. Even if it looks wrong, you should respect the source DDL at this stage; that is something you’ll start working on in the next phase, while moving data to the DW or, even better, to your presentation layer.

 

DATA WAREHOUSE

For our Data Warehouse I can’t recommend one unique model, as every solution/business is different, and the purpose of the DW could be archiving, reporting, building a base for OLAP/Tabular cubes, and so on.

The most widespread and well-known models are:

 

3NF

Pros

  • Saves the most storage of all modelling techniques
  • Simple to understand

Cons

  • Bad performance with multiple joins
  • Difficult to load (ETL)
  • Changes are very difficult to handle
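As a tiny, purely illustrative sketch of 3NF (hypothetical tables, not from any specific source system):

-- Each attribute lives in exactly one place; queries have to join the tables back together
CREATE TABLE dbo.Country (
    CountryID   INT PRIMARY KEY,
    CountryName VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Customer (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(200) NOT NULL,
    CountryID    INT NOT NULL REFERENCES dbo.Country (CountryID)  -- stored once, referenced here
);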

 

Star Schema

Pros

  • Simple design
  • Fast queries
  • Fits very well with OLAP models
  • Many DBMSs are optimized for queries on star schemas

Cons

  • Usually centred on one fact table
  • Higher storage usage due to denormalization
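And a correspondingly tiny sketch of a star: one denormalized dimension referenced directly by the central fact table (again, hypothetical names):

-- Denormalized dimension: everything about a date in one row
CREATE TABLE DW.Dim_Date (
    ID_Dim_Date INT PRIMARY KEY,   -- surrogate key, e.g. 20130101
    [Date]      DATE NOT NULL,
    [Month]     INT  NOT NULL,
    [Year]      INT  NOT NULL
);

-- Central fact table pointing straight at its dimensions
CREATE TABLE DW.Fact_Sales (
    ID_Fact_Sales INT IDENTITY(1,1) PRIMARY KEY,
    ID_Dim_Date   INT NOT NULL REFERENCES DW.Dim_Date (ID_Dim_Date),
    Amount        DECIMAL(18, 2) NOT NULL
);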

 

My suggestion is to use “Data Vault” (http://en.wikipedia.org/wiki/Data_Vault_Modeling) in the DW core, and from that point you can build star schemas in your data marts. Explaining the benefits of DV would take more than one post, so if you are really interested, just Google it or have a look at the Wikipedia link I provided.

Cheers!

 

 



The purpose of this post is to clarify some basics of SSAS Tabular cubes and how we can apply incremental updates in order to achieve near real-time, along with the best practice for it.

 

First of all, it is important to know the basic processing options available for Tabular cubes:

 

  • Database: Process Clear, Process Default, Process Defrag, Process Full, Process Recalc
  • Table: Process Clear, Process Data, Process Default, Process Defrag, Process Full
  • Partition: Process Add, Process Clear, Process Data, Process Default, Process Full

 

And the meaning of these processing options:

 

  • Process Add: Adds new rows to a partition. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated.
  • Process Clear: Drops all the data in a database, table, or partition.
  • Process Data: Loads data into a partition or table.
  • Process Default: Loads data into unprocessed partitions or tables. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated.
  • Process Defrag: Optimizes the table dictionary (an internal engine structure) for a given table or for all tables in the database. This operation removes all dictionary entries that no longer exist in the data set and rebuilds the partition data based on the new dictionaries.
  • Process Full: Loads data into all selected partitions or tables. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated.
  • Process Recalc: For all tables in the database, recalculates calculated columns, rebuilds relationships, rebuilds user hierarchies, and rebuilds other internal engine structures. Table dictionaries are not affected.

 

Once we have a clear idea of all the options we can apply in our BI lifecycle, we must understand that updating or refreshing your database does not necessarily mean new data is actually being loaded into the DWH:

 

  • Recalculate (Process Recalc) is not going to add new info; it is just going to recalculate all the measures, even if you don’t insert new data into your cube.

 

  • Partitioning is a very common and old practice, used mainly in big solutions (huge numbers of rows) where you need to process some old part of your BI solution only once, because it is not going to change, e.g. sales 2005, 2004, 2003… When you define your partitions you are flexible enough to specify a query that brings only part of the table into each partition, like “WHERE ID_Dim_Date >= 20010101 AND ID_Dim_Date <= 20021231”, while your current partition will have something like “WHERE ID_Dim_Date > 20130101”. That could be an option to process only part of your data, but on its own it is not the best practice, because your partitions will be defined at the year or month level; that would mean re-processing the whole 2013-11 partition every 5 minutes to aim for real-time data.

 

  • Incremental processing (Process Add) is the one to achieve your beloved real time. It will do the deltas by itself, adding or updating whatever just changed. Sadly, you can’t specify a query to apply the deltas only for a specific period.

 

Best Practice:

The ideal scenario is to use Process Add, but not against the whole table (reducing the amount of data to process). Hence we can use a partition, let’s say for 2013-11, and on top of that run Process Add, which will pick up and process the deltas very quickly. Then you can re-process that partition as many times as you need, getting closer and closer to what we call “real time”.
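As a rough illustration of those partition boundaries, assuming a hypothetical fact table dbo.Fact_Sales keyed by ID_Dim_Date (the names are illustrative, not taken from a real solution):

-- Historic partition, e.g. “Sales 2001-2002”: processed once (Process Full) and never touched again
SELECT * FROM dbo.Fact_Sales
WHERE ID_Dim_Date >= 20010101 AND ID_Dim_Date <= 20021231;

-- Current partition, e.g. “Sales 2013-11”: the only one re-processed every few minutes
SELECT * FROM dbo.Fact_Sales
WHERE ID_Dim_Date >= 20131101 AND ID_Dim_Date <= 20131130;

The narrower the current partition, the less data each near-real-time refresh has to touch.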



BIML (Business Intelligence Markup Language)

BIML is a markup language created by Varigence which will help you automate the creation of many assets in your BI solutions, like SSIS packages or cube (SSAS) definitions. For a more formal definition, visit Wikipedia: http://en.wikipedia.org/wiki/Business_Intelligence_Markup_Language

As you may know, all the metadata of your MS BI suite is based on XML. If you take any cube or dtsx package and open it with Notepad, you’ll notice that everything inside is XML. BIML is a tool to auto-generate this XML code through .NET or even SQL. You can find samples and all the available tags at http://bimlscript.com/.

Based on that, I created a SQL stored procedure that I call “PackageGenerator”. This SP asks for an IDPackage, and that ID leads to a series of simple configuration tables that define your DTSX source, target, fields, transformations, etc., so you can auto-generate the package into your VS solution with just one right click -> Generate.
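Just to give you an idea of the shape of such a framework, here is a minimal sketch of what the configuration tables and the call could look like; the names below are illustrative, not the exact schema I use:

-- Hypothetical metadata that drives the BIML generation
CREATE TABLE dbo.PackageConfig (
    IDPackage   INT IDENTITY(1,1) PRIMARY KEY,
    PackageName VARCHAR(128) NOT NULL,
    SourceTable VARCHAR(256) NOT NULL,   -- e.g. Staging.Employee
    TargetTable VARCHAR(256) NOT NULL    -- e.g. DW.Dim_Employee
);

CREATE TABLE dbo.PackageConfigColumn (
    IDPackage      INT NOT NULL REFERENCES dbo.PackageConfig (IDPackage),
    SourceColumn   VARCHAR(128) NOT NULL,
    TargetColumn   VARCHAR(128) NOT NULL,
    Transformation VARCHAR(256) NULL     -- optional expression applied in the data flow
);

-- The SP reads this metadata and emits the BIML/XML for one package
EXEC dbo.PackageGenerator @IDPackage = 1;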

Data Vault + BIML

As I explained before, Data Vault is a flexible DWH approach that follows rigid standards, which allow you to “guess” how a new entity will be built (for more info, check my previous post: http://www.businessintelligence-solutions.com/dv-datavault-approach-dwh/). For example, we know that the business key will be in a Hub, and we know that all the hubs start with H_[EntityName]. Then, if you have any relationship, you know it will be allocated in a Link, and all the links (one or more) will be named L_[EntityName]_L1, L_[EntityName]_L2, … and the same goes for Satellites: S_[EntityName]_S1, S_[EntityName]_S2, …

You also know from the very beginning that all your tables will have a standardized structure and fields such as ValidFrom and ValidTo (always there, in all the tables, allowing you to do an SCD2 by default), RecordSource (to know where the data is coming from) and LoadDateTime (to know when it was inserted).

Each object also has a link back to the hub, always named H_[EntityName]_SK (int), and your hub will always contain (apart from this unique SK) the business key, called H_[EntityName]_BK.

With all these rules in place, you can create a framework that will allow you (ideally with a UI created on top of it) to generate all your BIML packages using this SP, fed by all these configuration tables.

From there, the possibilities are open to almost everything. For example, in less than a week I prepared an SP that will build (thanks to these tables, or even to the system tables that you can query) your DW. You just need to define which entities are going to be dimensions and which ones fact tables, and the SP will recreate the whole DW (drop and create). It will link your fact tables (through constraints) to your dimensions, and it will create all the tables following the same rules (ID_Dim_, ID_Fact_…). And of course, it will automatically create all the dtsx packages to bring the data from your DV to your dimensional model in seconds, plus the views that wrap all those H, S and L tables into one Dim or Fact (depending on what you defined in your tables). Besides that, I have a table that defines SCD1 or SCD2, and the system will create the dtsx doing either a plain update or a full SCD (http://www.businessintelligence-solutions.com/practice-scd-ssis/).
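As a rough sketch of why the rigid naming makes this possible: a simple query against the system catalog is enough to pair every hub with its satellites, and that metadata is what the generator turns into Dim views and dtsx definitions (illustrative query, not my exact framework code):

-- Pair every hub with its satellites purely from the naming convention
SELECT  h.name AS HubTable,
        s.name AS SatelliteTable
FROM    sys.tables h
LEFT JOIN sys.tables s
       ON s.name LIKE 'S\_' + SUBSTRING(h.name, 3, LEN(h.name)) + '\_S%' ESCAPE '\'
WHERE   h.name LIKE 'H\_%' ESCAPE '\';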

The weak point of DV is that a normal third-normal-form database of around 30 tables will usually become 120+ tables in DV, but with BIML on top of your DV those 120+ packages (dtsx) are created in seconds, and your DW can change as soon as you really need it to, without any problem.

Good luck!
Albert



The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

This schema is mainly composed of 3 types of tables:

 

Hubs:

They hold the core business keys.

* Hubs model entities

* Contain the business key
-PK in absence of a surrogate key

* Metadata:
-Record source
-Load date/time

* Optional surrogate key
-Used as PK if present

* No foreign keys!

 

Links:

They form all relationships.

* Links model relationships
-Intersection table (M:n relationship)

* Foreign keys to related hubs or links
-Form natural key (business key) of the link

* Metadata:
-Record source
-Load date/time

* Optional surrogate key

 

Satellites:

They provide all descriptive information.

* Satellites model a group of attributes

* Foreign key to a Hub or Link

* Metadata:
-Record source
-Load date/time

An example:
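As a minimal sketch (hypothetical Customer/Order entities, illustrative datatypes only), the three table types could look like this:

-- Hub: just the business key plus metadata; no foreign keys
CREATE TABLE dbo.H_Customer (
    H_Customer_SK INT IDENTITY(1,1) PRIMARY KEY,   -- optional surrogate key, used as the PK when present
    H_Customer_BK VARCHAR(50) NOT NULL UNIQUE,     -- business key
    RecordSource  VARCHAR(50) NOT NULL,
    LoadDateTime  DATETIME    NOT NULL
);

-- Link: intersection table modelling the m:n relationship between hubs
CREATE TABLE dbo.L_Customer_Order (
    L_Customer_Order_SK INT IDENTITY(1,1) PRIMARY KEY,
    H_Customer_SK INT NOT NULL REFERENCES dbo.H_Customer (H_Customer_SK),
    H_Order_SK    INT NOT NULL,                    -- FK to a hypothetical H_Order hub
    RecordSource  VARCHAR(50) NOT NULL,
    LoadDateTime  DATETIME    NOT NULL
);

-- Satellite: descriptive attributes hanging off the hub, historised by load date
CREATE TABLE dbo.S_Customer_S1 (
    H_Customer_SK INT NOT NULL REFERENCES dbo.H_Customer (H_Customer_SK),
    CustomerName  VARCHAR(200) NULL,
    CountryCode   CHAR(2)      NULL,
    RecordSource  VARCHAR(50)  NOT NULL,
    LoadDateTime  DATETIME     NOT NULL,
    PRIMARY KEY (H_Customer_SK, LoadDateTime)
);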

If you want to create diagrams, it is easy to follow these colour and shape conventions:

  • Hubs: Blue, drawn as a Square/Cube
  • Links: Red, drawn as an Oblong/Oval
  • Satellites: Yellow/Orange, Yellow/Green or Green, drawn as a flat Rectangle
  • Reference Entities: None or Grey/Purple, drawn as a Circle
  • Hierarchical Links: Red, drawn as a Pyramid

 

How to build Satellites

From an agility perspective, this approach takes us several steps in the right direction. Things that change frequently are grouped into a common structure, which limits the impact of changes during the lifecycle of the data warehouse. The 3 to 5 Satellites a Hub typically ends up with are commonly distinguished from each other based on differing rates of change, different types of data and different record sources. Let’s take a brief look at each of these considerations:

 

  • Rates of change.

There will inevitably be some attributes that change over time more frequently than others. When we apply this criterion to modeling Satellites we need to consider the 80/20 or even the 90/10 rule. That is to say, we generally group the rapidly changing attributes together and then leave the rest. We may even have three categories of rates of change, for example Rapid, Sometimes and Never. But if we analyze too much and try to use too much precision in our decision, we may find ourselves gravitating towards the extreme of one attribute per Satellite. A sketch of such a volatility-based split appears after this list.

 

  • Types of Data.

There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs.  So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc.  There are no hard and fast rules on these groupings.  Each should become apparent when you as the designer are working with the business to define these Hub keys.

 

  • Record Sources.

Even though we are designing the EDW data vault based on the central view (specifically not based on source systems or departmental views) we do have groupings of attributes that come from certain areas within the organization.  Often this is in line with the design we were contemplating based on the Types of Data factor.  This is because certain groupings of Types of Data tend to also be related to specific business processes and so also certain source systems.  As a first step then we should look to see how in line these two factors are for our given EDW scenario.
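Coming back to rates of change, a minimal sketch of splitting one Hub’s context into two Satellites by volatility (hypothetical names, reusing the Customer hub from the example above):

-- Slowly changing, fairly stable attributes
CREATE TABLE dbo.S_Customer_Profile (
    H_Customer_SK INT NOT NULL REFERENCES dbo.H_Customer (H_Customer_SK),
    CustomerName  VARCHAR(200) NULL,
    BirthDate     DATE         NULL,
    RecordSource  VARCHAR(50)  NOT NULL,
    LoadDateTime  DATETIME     NOT NULL,
    PRIMARY KEY (H_Customer_SK, LoadDateTime)
);

-- Rapidly changing attributes kept apart, so their churn does not rewrite the stable rows
CREATE TABLE dbo.S_Customer_Status (
    H_Customer_SK INT NOT NULL REFERENCES dbo.H_Customer (H_Customer_SK),
    StatusCode    CHAR(1)     NULL,
    LastLoginDate DATETIME    NULL,
    RecordSource  VARCHAR(50) NOT NULL,
    LoadDateTime  DATETIME    NOT NULL,
    PRIMARY KEY (H_Customer_SK, LoadDateTime)
);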



I created a quick video explaining some points about what Big Data is and how Microsoft is working on it for all the partners and companies who trust their solutions. In a few words, we could say that Big Data is a BIG thing, and working on these kinds of projects will challenge you as a BI or Data Architect/Modeler. I hope you enjoy the video; please turn on the subtitles and forgive my Spanish accent, but if you are from an English-speaking country I bet my Spanish or Catalan accent is better than yours 😉

In my next post I will dive deeper into Hadoop, Hive and the Windows Azure solution (Microsoft’s Hadoop). If you want hands-on labs right now, you just have to ask for a trial (no credit card required) at www.hadooponazure.com. They will give you access to a 1.5 TB Hadoop instance on Windows Azure for a period of time. Enjoy it!



Right now I am doing some research into what BIG DATA is, how Microsoft (Azure and Hadoop) can help with it, and some brief hands-on work (beyond the theory). But first, I just wanted to help any SSIS developer who has to work with SCDs (Slowly Changing Dimensions) and is getting bad performance and a few other little problems.

There are two ways to implement an SCD in SSIS:

 

Using the default Data Flow item “Slowly Changing Dimension”

This is not the best practice, because it has serious performance issues. The only benefit of this item is that you will invest less than a minute creating your SCD Type 2. However, if you really need/like it, here are two little pieces of advice that will help you.

In an SCD Type 2 you need a start date and an end date in order to keep track of the history of your records. But the wizard of this item will ask you to choose which SSIS variable you want to use for your start date, and it won’t even ask you which value you want for your end date. By default, the end date will be NULL. What to do then?

* Start date: the best practice is to use GETDATE() in the SQL itself instead of a system or SSIS variable, because you really want to know the exact moment the row is written to the DB (there could be a gap between the SSIS task execution and your SQL insert). Besides that, imagine that your SSIS package is launched in another country with a different time zone; asking for GETDATE() in the SQL command will give you a more reliable start date.

* End date: well, the default NULL is not wrong, but SSAS 2008 and NULLs are not very good friends, therefore I encourage you to use an unreachable date, like “9999/12/31”, as the end date.

 

Using a personalized Merge and SCD solution

This way will give you an SCD task that is 10x to 100x faster (depending on the volume). With a few steps and a little SQL coding you will get the best-practice SCD Type 2 in SSIS.

 

1st Step

Imagine the scenario where you want to move data from Staging.Employee to DW.Dim_Employee. Of course, in your DW you need a business key (something to help with lookups against your staging area) plus a StartDate and an EndDate (to track the history). The first step is to create a copy of DW.Dim_Employee called DW.Dim_EmployeeTemp.
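A minimal sketch of creating that temp table (a structural copy; in practice you may prefer an explicit CREATE TABLE):

-- Structural copy of the dimension, used to stage the changed rows
SELECT *
INTO   DW.Dim_EmployeeTemp
FROM   DW.Dim_Employee
WHERE  1 = 0;   -- copies the columns and datatypes only, no rows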

2nd Step

In your Control Flow you will first create an Execute SQL task that truncates the temp table, then a Data Flow to move and prepare all your SCD data, and finally a call to a stored procedure that inserts the new data and updates the old versions of that data.

3rd Step

In your Data Flow you have to merge all the data from the staging area with the DW through your business key. In order to use the Merge Join data flow item, remember to sort the sources in their properties. As I said, you have to use a LEFT OUTER JOIN (everything from the staging area, and only the rows of the DW whose business key is found in the staging area). Then we have to create a Conditional Split with two possible outputs (the same logic is sketched in T-SQL after this list):

– If DW.Key is NULL: this staging key is not in the DW and therefore it is a new row. We insert it directly into the DW table.

– If Stage.Field1 != DW.Field1 || Stage.Field2 != DW.Field2 || Stage.Field3 != DW.Field3 || …: this is the case where the same business key exists in both the DW and the staging area, so we have to check all the fields that could change between versions. If any field has a different value (remember to use OR conditions), we insert the row into the temp table.
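For reference, a rough T-SQL equivalent of that merge-and-split logic, using the hypothetical EmployeeBK and Field1…Field3 columns of this example:

-- New rows: business key exists in staging but not in the dimension -> insert straight into the DW
SELECT s.*
FROM   Staging.Employee s
LEFT JOIN DW.Dim_Employee d ON d.EmployeeBK = s.EmployeeBK
WHERE  d.EmployeeBK IS NULL;

-- Changed rows: same business key, but at least one attribute differs -> insert into the temp table
SELECT s.*
FROM   Staging.Employee s
JOIN   DW.Dim_Employee d ON d.EmployeeBK = s.EmployeeBK
WHERE  d.EndDate = '9999-12-31'   -- compare against the current version only
  AND (s.Field1 <> d.Field1 OR s.Field2 <> d.Field2 OR s.Field3 <> d.Field3);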

4th and Final Step

We have to create a stored procedure that updates all the DW rows whose business keys are found in the temp table (only the current versions, the ones with EndDate = ‘9999/12/31’, of course), setting their EndDate to the current date, and then inserts the temp rows into the DW.

Some pseudo code could be:

UPDATE d SET d.EndDate = GETDATE() FROM DW.Dim_Employee d
WHERE d.EmployeeBK IN (SELECT t.EmployeeBK FROM DW.Dim_EmployeeTemp t) AND d.EndDate = '9999-12-31';

INSERT INTO DW.Dim_Employee (EmployeeBK, Field1, Field2, Field3, StartDate, EndDate)
SELECT t.EmployeeBK, t.Field1, t.Field2, t.Field3, GETDATE(), '9999-12-31' FROM DW.Dim_EmployeeTemp t;

That’s all, Cheers!

 

