A Data Maturity Model is a way to classify how well (or how badly) your company is doing across all data-related fields. The topic is huge, covering security, data governance and much more, so today I only want to cover the levels around the Enterprise Data Warehouse and see how your analytics is doing:

  

DMM1 (circa 1990-2000)

  • No standards
  • Reactive approach
  • No master data plan
  • No strategy
  • Redoing isolated Excel files for each report

 

DMM2 (circa 2000-2005)

  • Standards established
  • People try to copy all the operational data together into a centralized DB and call it a Data Warehouse (no modelling)
  • Waterfall approach
  • No data governance

 

DMM3 (circa 2005-2010)

  • Data specialists are hired and start doing some nominal data governance
  • Some kind of MDM (third-party tool) is used, inefficiently
  • Agile lite or selective Agile
  • First form of DW or marts, creating star schemas and some Slowly Changing Dimensions. Rigid and not very scalable, with a high maintenance cost and a very slow process

 

DMM4 (circa 2010-now)

  • Hadoop in development
  • Some analytic stores
  • MDM managing all the company's metadata, using a third-party tool well adapted to the specific business, or a custom-built one
  • Data layers put in place
  • DW with data quality and a reduced maintenance cost. Dimensional modelling or other techniques applied by experts
  • Chief Data Officer
  • All Agile
  • ETL automation

 

DMM5 (now – future)

  • Hadoop in production
  • Scalable and easy-to-maintain Enterprise Data Warehouse that provides marts to different departments or subject areas
  • Columnar stores
  • NoSQL
  • Data Virtualization: the ability to recreate your virtual data marts, avoiding the whole lifecycle of each mart (schema-on-read)
  • MDM managing all the company's governance data by subject, providing insights and a user-friendly interface
  • Chief Information Architect
  • Data as an asset in financial statements

 

So, how do I get to DMM4 or DMM5? For that, I would recommend reading these two articles:



Introduction

Let me explain my experience leading the whole Data Architecture (split by layers, see my previous post here) and also leading and building from scratch the whole Enterprise Data Warehouse solution for my current company.

 

About the company

Avoiding names, I would like to introduce a few facts about this company. It is one of the largest e-commerce companies in the world, with more than 13,000 employees. It has more than 70 active companies or services attached to its core business, covering a huge variety of domains such as insurance, telecommunications, banking, credit cards, travel and sports (it has bought a soccer club, a baseball club, etc.).

 

About the data

As a result, we have more than 500 databases distributed around the world, with thousands and thousands of tables covering, as mentioned before, very different business domains. The total size is still unknown, but so far we have around 2 PB in our hands (and growing).

 

How to implement Agile DW with DV2.0

 

First: Data ingestion (Bring all your data into your Data Lake and keep history records)

So, in order to get all this data into our History Stage (a copy of the original plus history), we developed a Python script and a Metadata Management layer to automate all the ingestion, but that is another topic that I will cover in another post. Here I am going to focus on the EDW solution to get as much value as possible out of all this data.

  

 

Second: prepare automation for your Data Vault 2.0

This is a very important step, as it provides a huge benefit at a low cost. Speaking from my own experience, I created a whole framework in Python that auto-populates Hub, Satellite, Link and SameAsLink objects, in about 2 weeks of development.

How does it work? Easy! You just need to create a view that follows the Data Vault naming convention. Let's say we create a Satellite view to feed our DV: it will automatically populate not only the Satellite (table and load) but also the Hubs, reusing the same view (the possibilities for automation are limitless). With a bit of code in our framework, from that one view we will be able to:

  • Create the Hub table (if it does not exist).
  • Load the Hub information (only new data).
  • Create the Satellite table (if it does not exist).
  • Drop and create views such as vw_[name]_current and vw_[name]_history.
  • Load the Satellite information (new, updated and deleted data).
  • Capture metadata and governance information, so we know which entities, DV objects and businesses we are loading.

 

This is a sample of my naming convention for the framework:

[DV Object]_[Entity]_[Business]_[common/[others]] ==> sat_customer_bank_common
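To make this more concrete, here is a minimal T-SQL sketch of the kind of objects and loads the framework could generate after reading a view named sat_customer_bank_common. The schema names (dv, dv_stage), the audit columns and the MD5 hashing are my illustrative assumptions for this post, not the exact code the framework produces:

-- Minimal sketch (illustrative names): hub DDL and load derived from the
-- view name sat_customer_bank_common.
IF OBJECT_ID('dv.hub_customer') IS NULL
    CREATE TABLE dv.hub_customer (
        hub_customer_hk CHAR(32)      NOT NULL PRIMARY KEY, -- hash of the business key
        customer_bk     NVARCHAR(100) NOT NULL,             -- business key
        load_dts        DATETIME2     NOT NULL,
        record_source   NVARCHAR(100) NOT NULL
    );

-- Load only the business keys that are not in the hub yet, reusing the same view.
INSERT INTO dv.hub_customer (hub_customer_hk, customer_bk, load_dts, record_source)
SELECT DISTINCT
    CONVERT(CHAR(32), HASHBYTES('MD5', v.customer_bk), 2),
    v.customer_bk,
    SYSUTCDATETIME(),
    'sat_customer_bank_common'
FROM dv_stage.sat_customer_bank_common AS v
WHERE NOT EXISTS (SELECT 1
                  FROM dv.hub_customer h
                  WHERE h.customer_bk = v.customer_bk);

In the same spirit, the framework then creates and loads the Satellite itself and regenerates the vw_[name]_current and vw_[name]_history views, all driven by that single source view.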

 

Third: split your Satellites using the flexibility of DV2.0

So, the beauty of Data Vault is its scalability and flexibility, which is perfect for complex scenarios such as developing an EDW with an initially unknown scope for a very big company. Since the companies are so different from each other and we don't know how many entities we will need in the end, we started by creating some basic entities such as customer, order, orderdetail, product, item and so forth.

We also put all the common attributes in a common Satellite for each entity because, even though a phone company, a bank and an e-commerce company have almost nothing in common, they all still have customers, and these customers share common attributes such as name, address, DOB, gender, etc. This is simply finding the semantic meaning of our data and using it.

For the attributes that are specific to each business, we can just create a dedicated Satellite for bank customers (salary, rate, etc.), other Satellites for the travel business (frequent flyer number, preferred destinations, etc.), and so on.

That means, we don’t need to design the whole EDW in one goal, neither we need to include all the business of each entity in order to start getting value.

With the common attributes, we can reuse the same customer segmentation or KPI reports across each business, or across all of them together (or by groups), with just one report and a filter for the company (or companies) we want to analyse.

What about Item, Order, Order Detail, etc.? Easy: find the semantic meaning of the data. For example, in e-commerce an item could be an iPad 2, in banking a 30-year loan package, in travel a holiday package or even a hotel room (e.g. “Holiday Inn – Atlanta – Double standard”), and so forth. So later you will be able to reuse the same report to check which item was the most successful during a given period, agnostic of which kind of product it is.

NOTE: of course, another good practice is to split your Satellites by rate of change, so the columns that change the most sit in one Satellite that is loaded very often, while the ones with a low change rate/priority can be loaded daily or weekly, etc.
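To illustrate the split (the column lists and names below are hypothetical, just to show the idea), a common customer Satellite and a bank-specific one could look like this:

-- Common attributes shared by customers of every business...
CREATE TABLE dv.sat_customer_common (
    hub_customer_hk CHAR(32)      NOT NULL,  -- points to hub_customer
    load_dts        DATETIME2     NOT NULL,
    record_source   NVARCHAR(100) NOT NULL,
    name            NVARCHAR(200) NULL,
    address         NVARCHAR(500) NULL,
    dob             DATE          NULL,
    gender          CHAR(1)       NULL,
    PRIMARY KEY (hub_customer_hk, load_dts)
);

-- ...and bank-specific attributes in their own Satellite, added later
-- without touching anything that already exists.
CREATE TABLE dv.sat_customer_bank (
    hub_customer_hk CHAR(32)      NOT NULL,
    load_dts        DATETIME2     NOT NULL,
    record_source   NVARCHAR(100) NOT NULL,
    salary          DECIMAL(18,2) NULL,
    rate            DECIMAL(9,4)  NULL,
    PRIMARY KEY (hub_customer_hk, load_dts)
);

A travel Satellite (frequent flyer number, preferred destinations, etc.) would be just another table of the same shape hanging off the same Hub.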

 

Fourth: Supernova layer or “Business Delivery Layer” (c)

Thanks to our Metadata Management, we can automate the creation of views that remove the DV complexity. Business users can then look at the EDW in a 3NF shape they are more comfortable with, or we can create (and even automate) virtual star schemas based purely on views, which are much easier to create, recreate and fine-tune.
* BTW, I prefer to call it “Business Delivery Layer”.
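As a small sketch of the idea, reusing the hypothetical objects from the examples above, a Business Delivery Layer view could hide the DV joins behind a single customer-shaped object:

-- Illustrative Business Delivery Layer view: consumers query a flat,
-- familiar structure without knowing anything about Hubs and Satellites.
CREATE VIEW bdl.vw_customer
AS
SELECT
    h.customer_bk,
    c.name, c.address, c.dob, c.gender,   -- common attributes
    b.salary, b.rate                      -- bank-specific attributes (NULL for other businesses)
FROM dv.hub_customer AS h
-- joining the *_current views keeps only the latest Satellite row per customer
LEFT JOIN dv.vw_sat_customer_common_current AS c ON c.hub_customer_hk = h.hub_customer_hk
LEFT JOIN dv.vw_sat_customer_bank_current   AS b ON b.hub_customer_hk = h.hub_customer_hk;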


 


Pure (and real) Agile Data Warehouse

So the first and second steps are kind of a one-off: once developed, they keep running in an automated way.

The third step can be done one entity/business at a time, because we are not aiming to deliver all of them at once. So we can apply agile sprints and target a number of entities and businesses for each sprint (let's say 2 weeks).

Once these new entities/businesses are in production, you can easily move to the fourth step and create the supernova views on top of them (aim for automation when possible), and data consumers can start enjoying this part of the EDW, or at least checking whether they need something else or are missing things, etc. If a change is needed in the supernova layer, as it is all views, it will be very fast to fix; if it is a major change in the DV layer, then we can fix it in the next sprint just by creating new Satellites (always adding/appending, never rebuilding or throwing anything away).

 



So, in order to show the different layers and artefacts contained in this architecture, see the picture below and refer to the numbers in the list that follows.

Data Architecture, Data Lake, Data Vault, Data Virtualization, Data governance and Data lineage

1. Data source or data owners' space. This could be a relational database, an API, flat files, etc.
2. Customized Python script that does all the ELT, captures audit information, feeds the MDM (with metadata captured automatically and configuration supplied through operators), etc.
3. Layer where we store all the data sets in their raw shape (just adding audit or extra ELT columns). Ideally it is placed on Hive (the Data Lake).
4. When Python (E)xtracts the source information into a file, it then loads it into Hadoop/Hive. This can be a delta or a full extract. This table is purely temporary, so it is not persistent.
5. Now Python does the (L)oad. With our extra columns, we can hash the incoming data and compare it, through the business key, with what is already stored in the persistent History Stage table. Then we insert ONLY new and updated data. Afterwards we run a Delete Detection Pattern and insert the deleted rows.
6. This table is persistent and contains the initial full load plus the records inserted/updated/deleted by the following loads.
7. The enterprise layer (or integration layer) is where our EDW is located. Typically in Hive, but if needed it could be stored anywhere else.
8. Python also loads the already modelled data (ideally through views) from our Raw Layer into persistent EDW tables. At this point we apply only hard business rules, the ones that do not change the meaning of the data, just its shape. Example: changing a data type, standardizing timestamps into a single format, etc. This data lands into what we call the "Raw Data Vault". Modelled in DV 2.0 fashion, the data is not aggregated or changed here.
9. If required (it is optional), we create our Business Data Vault, which could be views or another ELT to transform the data into something more meaningful for the business, where we apply business rules, do aggregations, merge data, etc.
10. This layer works on top of the DV layer. Because DV modelling is quite complex (even for some data engineers), this layer is meant to create (automatically if possible) a set of views exposing the information stored in the RDV and BDV in a 3NF fashion (or even a star-schema fashion), so data consumers don't need to deal with the modelling complexity of DV 2.0.
11. This database with the output of the EDW (DV) could be stored in Hive, Teradata or any other high-performance database for later consumption. Ideally it will contain just views or materialized views.
12. This is our Data Virtualization space. Here we choose a technology able to run cross-platform queries and expose all our Raw Layer and EDW information through this single channel. Because consumers connect over plain JDBC or ODBC, they don't need to deal with the complexity of looking for data in Hive, Teradata, etc. Everything is simplified.
13. We apply security and expose only the access level granted to each specific user (stored in the MDM), creating views on the granted tables and columns. This database (Presto in our example) needs to regenerate all its views to follow the security specs stored in the MDM or to capture new tables (also specified in the MDM).
14. The backbone of everything. Here we store the metadata (refreshed on every load), the loading patterns, the extra configuration (blacklisted columns/tables, identification of delta fields) and the security rules to be exposed in the presentation (or data access) layer.
15. The audit database tracks the data lineage of every single row, stores information about the ELT and even records, through Presto, how many people access which tables and how often. With this we can implement a hot/warm/cold strategy.
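For steps 4 to 6, this is a rough sketch of the hash-compare load and the Delete Detection Pattern. The table and column names are illustrative, and I am writing it in T-SQL for readability, even though in our case the logic is generated by Python and runs against Hive:

-- Step 5 sketch: insert only new or changed rows into the persistent History Stage.
INSERT INTO hstg.customer_hist (business_key, row_hash, cdc_operation, load_dts, col1, col2)
SELECT
    s.business_key,
    s.row_hash,
    CASE WHEN h.business_key IS NULL THEN 'INSERT' ELSE 'UPDATE' END,
    SYSUTCDATETIME(),
    s.col1, s.col2
FROM stg.customer_delta AS s            -- temporary landing table from step 4
LEFT JOIN (SELECT business_key, row_hash,
                  ROW_NUMBER() OVER (PARTITION BY business_key
                                     ORDER BY load_dts DESC) AS rn
           FROM hstg.customer_hist) AS h
       ON h.business_key = s.business_key AND h.rn = 1   -- latest stored version
WHERE h.business_key IS NULL            -- brand new key
   OR h.row_hash <> s.row_hash;         -- attributes changed

-- Delete Detection Pattern (only meaningful on full extracts): keys stored in
-- the History Stage but missing from the extract are flagged as deleted.
INSERT INTO hstg.customer_hist (business_key, row_hash, cdc_operation, load_dts)
SELECT h.business_key, h.row_hash, 'DELETE', SYSUTCDATETIME()
FROM (SELECT business_key, row_hash, cdc_operation,
             ROW_NUMBER() OVER (PARTITION BY business_key
                                ORDER BY load_dts DESC) AS rn
      FROM hstg.customer_hist) AS h
WHERE h.rn = 1
  AND h.cdc_operation <> 'DELETE'
  AND NOT EXISTS (SELECT 1 FROM stg.customer_delta s
                  WHERE s.business_key = h.business_key);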


Basically, if you keep your history by just storing the datetime when each record was inserted or updated, it is possible to get a view of your table in the shape of a Slowly Changing Dimension Type 2. The way to achieve it is to expose your LoadDateTime as the StartDate and then use a window partitioned by the business key to find the next LoadDateTime; if there is none, expose year 9999 or 2100 (no end). The following code shows this in a real scenario. Note: ignore the audit fields that I usually create in the History Staging Area, which I will explain in my next post.

CREATE VIEW [hstg].[vw_Table_History]
AS
SELECT
    S.ETL_EXECUTION_DT AS START_DATE,
    -- next load date for the same business key; open-ended records get 9999-12-31
    ISNULL(LEAD(S.ETL_EXECUTION_DT) OVER (PARTITION BY S.[BusinessKey] ORDER BY S.ETL_EXECUTION_DT), '9999-12-31 00:00:00.0000000') AS END_DATE,
    CASE WHEN LEAD(S.ETL_EXECUTION_DT) OVER (PARTITION BY S.[BusinessKey] ORDER BY S.ETL_EXECUTION_DT) IS NULL THEN 'Y' ELSE 'N' END AS [CURRENT_RECORD],
    S.[BusinessKey],
    S.[Field1],
    S.[Field2],
    -- ... remaining source columns ...
    S.ETL_BATCH_DT AS ETL_ROW_START_DT,
    S.ETL_BATCH_ROW_ID,
    CONVERT(DATETIME, '99991231') AS ETL_ROW_END_DT,
    S.ETL_ROW_HASH,
    CONVERT(NVARCHAR(512), 'AdvWorks') AS ETL_ROW_SOURCE,
    S.ETL_CDC_OPERATION,
    S.ETL_EXECUTION_DT
FROM [hstg].[Table] S
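With that in place, the current picture or a point-in-time snapshot is just a filter on the view (the date below is only an example):

-- Current version of every business key
SELECT * FROM [hstg].[vw_Table_History] WHERE CURRENT_RECORD = 'Y';

-- Point-in-time snapshot: the rows that were valid on 2016-06-30
SELECT * FROM [hstg].[vw_Table_History]
WHERE START_DATE <= '2016-06-30' AND END_DATE > '2016-06-30';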



Introduction

This is probably a key factor, technically speaking, in the future health of your BI solution. There is no perfect or unique data model that fits ALL solutions, but there are some points that will power up your future DW and extend its possibilities to the maximum.

Data modeling is such an extensive concept that it can't be covered in one post, or even one book, but we can provide some easy-to-understand dos and don'ts for creating your Staging Area and DW for reporting.

 

STAGING AREA

Prepare yourself for ANYTHING and get AS MUCH AS YOU CAN

This sentence is key when we are getting our data from the source to the staging area. Basically, we have to capture all the data from the source system and keep a record of the original modeling.

So it is important to get all the business keys, foreign keys, nullable or non-nullable field attributes, etc. from the source system tables. And, most importantly, record all the history at the same time. That means that every time we run our ETL we need to bring all the data and capture some audit columns in order to keep a record of the history. So if at any point a PK (or BK) changes in the source, we need to create another record in our staging table with the CDC info, i.e. INSERT/UPDATE/DELETE.

That creates a database with the data as it is in the source system, plus some audit fields and history, so at any point in time we can recreate our DW from its history and we won't miss anything out.
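As a minimal sketch of what that looks like (the source columns are hypothetical; the audit column names are the ones I tend to use), a staging table that keeps history simply mirrors the source table and adds a few audit columns:

-- Source columns copied as-is (same datatypes as the source DDL),
-- plus a few audit columns to keep history.
CREATE TABLE stg.Customer_History (
    CustomerID        INT            NOT NULL,  -- business key in the source
    Name              NVARCHAR(200)  NULL,
    Email             NVARCHAR(320)  NULL,
    ETL_EXECUTION_DT  DATETIME2      NOT NULL,  -- when this version was captured
    ETL_CDC_OPERATION CHAR(6)        NOT NULL,  -- INSERT / UPDATE / DELETE
    ETL_ROW_HASH      CHAR(32)       NOT NULL,  -- hash of the source columns, to detect changes
    ETL_ROW_SOURCE    NVARCHAR(512)  NOT NULL   -- which source system the row came from
);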

Dos:

  • Raw data in the Stage, respecting the source DDL (datatypes, fields, etc.)
  • Get any table that might potentially be of interest. DB space/resources are limited, so I won't say get EVERY single table no matter what, but if possible, that is not a bad idea.
  • Keep history, so you can recreate your DW and data marts easily in the future.

Don’ts:

  • Don't change data, for example by computing or merging two fields (i.e. name + surname = full name).
  • Don't crop or truncate any fields. If you think a datatype of VARCHAR(2000) is too big, keep that length in your staging, or even round it up to 4000 or MAX.
  • Don't start modeling or "fixing" the data model in your Staging Area. Even if it looks wrong, you should respect the source DDL at this stage; that is something you'll start working on in the next phase, while moving data into the DW or, even better, into your presentation layer.

 

DATA WAREHOUSE

For our Data Warehouse I can't recommend a single model, as every solution/business is different, and the purpose of the DW could be archiving, reporting, serving as a base for OLAP/Tabular cubes and so on.

The most widespread and well-known models are:

 

3NF

Pro

  • Saves the most storage of all modelling techniques
  • Simple to understand

Con

  • Bad performance with multiple joins
  • Difficult to load (ETL)
  • Changes are very difficult to handle

 

Star Schema

Pro

  • Simple design
  • Fast queries
  • Fits very well with OLAP models
  • Many DBMSs are optimized for queries on star schemas

Con

  • Usually centralized on one fact table
  • Higher storage usage due to denormalization

 

My suggestion is to use Data Vault (http://en.wikipedia.org/wiki/Data_Vault_Modeling) in the DW core, and from that point you can build star schemas in the data marts. Explaining the benefits of DV would take more than one post, so if you are really interested, just Google it or have a look at the Wikipedia link I provided.
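As a tiny sketch of what that combination could look like (all names here are hypothetical), the DV core holds Hubs, Links and Satellites, and the star schema in the data mart can be just a view on top of them:

-- A Link in the DV core relating two Hubs (customer and product)...
CREATE TABLE dv.link_customer_product (
    link_customer_product_hk CHAR(32)      NOT NULL PRIMARY KEY,
    hub_customer_hk          CHAR(32)      NOT NULL,
    hub_product_hk           CHAR(32)      NOT NULL,
    load_dts                 DATETIME2     NOT NULL,
    record_source            NVARCHAR(100) NOT NULL
);

-- ...and a star-schema style fact exposed to the data mart as a simple view,
-- assuming a hypothetical Satellite dv.sat_sales_detail holding the measures.
CREATE VIEW mart.fact_sales
AS
SELECT
    l.hub_customer_hk AS customer_key,   -- joins to a customer dimension view
    l.hub_product_hk  AS product_key,    -- joins to a product dimension view
    s.quantity,
    s.amount
FROM dv.link_customer_product AS l
JOIN dv.sat_sales_detail      AS s
  ON s.link_customer_product_hk = l.link_customer_product_hk;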

Cheers!

 

 

