There are three key qualities a Data Warehouse needs to be worth working on:

  • Efficiency: Your Data Warehouse doesn’t need to start from day zero with the perfect data model for the next 10 years; that kind of “big design up front” costs months of analysis and thinking and, 99% of the time, still turns out wrong at some point. We also need to be able to adapt to changes and deliver value and benefits within weeks (instead of months or years). This is how I define efficiency in a DW.
  • Durability: Build your own framework, created in a flexible manner in terms of technologies. The only thing you need to make sure of is that your methodology and algorithms for loading data are always the same regardless of the technology in your pipeline, so that when you move to a new technology or combine several, they all speak the same language.
  • Scalability: Here we are talking about going beyond just adding more servers to the same SQL engine: being able to combine Big Data technologies (S3, Hadoop, … or any schema-on-read store) with high-performance relational databases (such as Teradata, Redshift, MSSQL), real-time SQL engines (such as ksqlDB), or NoSQL into a properly connected Data Mesh of teams, each working in their preferred/required environment without creating silos or duplicating work, and instead being able to combine their data with other teams’ data.

How do we achieve this? With a proper Data Architecture, Data Modelling technique, and ETL Framework that cover the three points above.


Data Architecture

Your data architecture needs to be flexible, especially in a mid-size or large enterprise with hundreds of Data Engineers who have little connection to more distant teams. Ideally this shouldn’t happen, but let’s face it: in big organizations, some teams are built with different skills than others, and some departments have little to no alignment with others (in terms of technologies, ways of ingesting or processing data, etc.).

So, instead of swimming against the current and pretending this is not happening, we should encourage different teams or departments to have their own Data Warehouse. But instead of isolated silos, they have to be consistent in the main methodologies used when ingesting or processing data, and have solid Data Governance in place so it is easy to find out which team is working with which data. Some people call this new way of working the “Data Mesh“, which creates small data sets or DWs as so-called data products. Each team has the responsibility to create and use its data without depending on others, which is great. The important thing to keep in mind is that we need ways to share this data between teams and, of course, Data Modeling must be consistent across the whole org so everyone speaks the same language. That’s our next point.


Data modeling

Data Modelling needs to be extremely flexible, but with strong rules, naming conventions, and patterns. That’s exactly what Data Vault 2.0 is great at. With Dimensional Modelling or 3NF, there are several “schools” or groups of people doing things in different ways, but DV2.0 has a small set of rules that you shouldn’t break, which are clear and easy to follow, and defining naming conventions is very easy here.

This modeling technique avoids the “big design up front” approach: you start small, have results within the first week or two, and then keep appending and expanding your DW without any risk of breaking anything that was already created. It is extremely modular (like microservices, but for the DW) and very easy to automate (which, again, gives you great efficiency at low cost).

An extremely old article on Data Vault can be found in this blog, but I encourage you to search for more resources.


ETL Framework

Nowadays, you have a few options when working on your pipelines (ETL):

  • Cloud: If you are in the cloud and lazy enough to buy every single service your provider offers, then moving to the cloud to lower your costs is going the wrong way. Each of these “easy to build” pipeline services, already integrated into your cloud provider (e.g. AWS Glue), is going to cost you a lot, plus hidden costs, plus the effort of learning new tools that are not customized for you. Spoiler alert: you will still end up using another third-party tool or your own scripts to move the data, perhaps paying a lot of money just to run a shell command from your “Data Pipeline cloud service”.
  • Third-party: Some third-party tools are headaches and a terrible way to pay a lot for little benefit; others are NOT cheap but do deliver on their promises. So a good study and test phase is very important here; at the very least, don’t let their sales representative decide for you based only on words and a short demo. Still, IMHO, we should focus on something with lower cost, more tailor-made for your company and needs.
  • Manual ETL: or, as I call it, artisanal ETL development. These old-school methods keep highly qualified Data Engineers creating and crafting, by hand, some polished but extremely expensive (and inefficient) ETLs. These solutions are legacy from the past that companies prefer to keep (regardless of the cost) because the risk of changing feels scarier than the benefits (or so they think).
  • Tailor-made ETL Framework: For me, the best option by far. This framework works exactly as we need it, built by the data engineers who are going to use it. It excels at our demands because the design perfectly matches our needs. What always scares people about this option is the price, but unless you develop an over-engineered framework, it should always compete with or beat the other options (third-party, etc.). Just remember: simplicity! The most important word when developing your Data Framework.

So, having chosen the tailor-made ETL Framework option: I have plenty of experience building these, and I can share some points very quickly, although going into detail could easily take anywhere from one hour to a week’s worth of meetings presenting the solution and design. Long story short:

  • As mentioned before: keep it simple. Don’t over-engineer it, otherwise it will be costly to maintain and difficult to use.
  • Build it in modules, like microservices. Create APIs that talk to each other across different layers; defining good rules and standards for their interfaces will make it more stable and robust.
  • Prepare at least a few layers. For example:
    – Physical layer: connects to any database engine via shell commands and text, e.g.: hive -e “<content_from_data_layer>” or mysql -e “<…>”
    – Data layer: where we translate the SQL syntax for each technology you add. A CREATE TABLE statement is not the same in Teradata, PostgreSQL, MSSQL, Hive, …
    – Business layer: based on the metadata acquired by the physical layer and requested by the data layer, here we can start building our loading methods, e.g.: load history to history / load SCD1 / delete detection / full or bulk load / …
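To make the three layers concrete, here is a minimal sketch in Python. All function names, the engine map, and the DDL/load statements are illustrative assumptions, not the actual framework; only the hive -e / mysql -e idea comes from the examples above.

```python
import subprocess

# Physical layer: push SQL text to an engine through its CLI,
# exactly like the hive -e / mysql -e examples above.
ENGINE_CLI = {"hive": ["hive", "-e"], "mysql": ["mysql", "-e"]}

def run_sql(engine: str, sql: str) -> str:
    """Run a SQL string against the given engine and return its stdout."""
    cmd = ENGINE_CLI[engine] + [sql]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Data layer: translate a generic table definition into engine-specific DDL.
def create_table_sql(engine: str, table: str, columns: dict) -> str:
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    if engine == "hive":
        return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) STORED AS ORC"
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"

# Business layer: compose loading methods out of the layers below.
def full_load_sql(target: str, source: str) -> str:
    # Simplified full load with overwrite semantics (Hive-style syntax).
    return f"INSERT OVERWRITE TABLE {target} SELECT * FROM {source}"
```

The key design point is the direction of dependencies: the business layer only produces generic statements, the data layer specializes them per engine, and only the physical layer ever touches a real connection.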

Now, I should mention that this kind of framework is compatible with many (if not all) of the SQL engines available. Hence, when doing Data Mesh, all the teams, regardless of their tech preference, will load their data following the same rules and standards, which makes their work extremely compatible and easy to integrate.

Perhaps I will give more details about the framework in my next post; it depends on the feedback I get here 🙂



Hi there,

I’ve had plenty of people and companies asking me for more details about how I created these successful Data Warehouses in the Big Data space.

So here, in the following video, I try to explain the whole architecture that I’ve been using for the last 4 years (in 3 different companies).

If you have further questions, please don’t hesitate to contact me through email, or let’s get in touch on LinkedIn.


Video about my Data Vault framework for Big Data and other relational databases




If you ask any experienced DV architect what the most complicated part of implementing a DV solution is, there is a big chance the answer will be convincing the stakeholders to use the “new” Data Vault concept.

Basically, the DW world has a very bad ratio of successful projects, so nobody is willing to “try” new things. Along these lines, the DV methodology has its benefits, but one of the cons is that there is neither a big community nor plenty of experts available on the market (so it is normal for people to have some fears about this methodology).

So I would like to show how I managed to convince one of my last clients to implement this kind of solution.


1) Problem statement


  • Long initial delivery phase per project
  • Fully manual process
  • Difficult to maintain / Fix
  • Reports are not reusable across different businesses
  • Not possible to adjust quickly to business requirements
  • Unable to do automated data lineage
  • For new projects we need to reverse engineer from ground zero to find the EBOs/meaningful data


2) Current situation


  • Reverse engineering existing data marts can take a very long time due to unclear data lineage and a lack of documentation
  • Any new report requires an analysis of the databases and tables to find the required EBOs
  • Manual creation and maintenance of ETL is a mandatory step
  • A long initial delivery phase can lead to refactoring and delay the project considerably
  • Customers don’t have easy access to create their own jobs and extract the data they need


3) To be (hypothesis)


  • Short initial delivery phase
  • Automation can be applied (ELT and DDL)
  • Easy to maintain and troubleshoot
  • Easy to reuse (80% reusable vs 20% non-reusable)
  • Faster deliveries (quicker iteration between delivery team and business) will help shape the end result as the business expects
  • Generic reports can be reused for all businesses
  • Fully automated data lineage
  • Every time the EDW project covers one more EBO, all departments can use it in a very simple way, without needing to reverse engineer


4) Assertions


  • Applying automation will reduce human error and increase speed
  • Having a unique, well-structured repository for all our data will help other teams easily create their marts
  • Reusing the same generic report across multiple services will save resources
  • Creating or modifying virtual marts will be faster than physical marts


5) Criteria of Success


  • Ability to deliver value on a biweekly basis
  • Enterprise business objects can be reused
  • Ability to run the same report for different businesses
  • A new data consumer should not need to reverse engineer the EBO for each business, finding a useful and easy-to-consume EBO already built in the EDW project
  • Reduce the average time to production for one business to 2/3


6) Definition of done


  • The framework will automate the ingestion of data, requiring only one view to feed our EDW
  • A minimum number of tables and links will provide enough data to start doing reporting
  • Generic reports will show the same generic insights across every business already modelled in the EDW


7) Comparison





Aspect | As-is (manual DW) | To-be (DV2.0)
Initial delivery phase | Long initial phase | Short initial phase
Creation of ETL/Tables | Manual process + lifecycle | Automated
Maintenance | Difficult: modify physical tables and check impact | Very easy: modify virtual tables (views), no impact on current production
Reusability | 20% reusable / 80% non-reusable | 80% reusable / 20% non-reusable
Data lineage | Not available | Fully integrated
Initial analysis | Check all databases/tables to find the EBO every time | Find the EBO only once, then shape it for easy consumption
Agile | Not possible to deliver quickly | Deliver value every 2 weeks



EBO = Enterprise Business Object
DDL = Data Definition Language
ELT = Extract Load Transform
ETL = Extract Transform Load
EDW = Enterprise Data Warehouse



The Data Maturity Model is a way to classify how well (or badly) your company is doing in all data-related fields. The topic is so big that it includes security, data governance, etc. Today I want to cover only the levels around the Enterprise Data Warehouse and see how your analytics is doing:


DMM1 (circa 1990-2000)

  • No standards
  • Reactive approach
  • No master data plan
  • No strategy
  • Redoing isolated Excel files for each report


DMM2 (circa 2000-2005)

  • Standards established
  • People try to copy all operational data together into a centralized DB and call it a Data Warehouse (no modelling)
  • Waterfall approach
  • No Data Governance


DMM3 (circa 2005-2010)

  • Data specialists are hired and start doing some nominal Data Governance
  • Some kind of MDM (third-party tool) is used, inefficiently
  • Agile, lite or selective
  • First form of DW or marts, creating Star Schemas and some Slowly Changing Dimensions. Rigid and not very scalable, with a big maintenance cost and very slow processes


DMM4 (circa 2010-now)

  • Hadoop in development
  • Some analytic stores
  • MDM managing all the company’s metadata, using a third-party tool well accommodated to the specific business, or creating a custom one
  • Data layers put in place
  • DW with data quality and reduced maintenance cost. Dimensional Modelling or other techniques created by experts
  • Chief Data Officer
  • All Agile
  • ETL automation


DMM5 (now – future)

  • Hadoop in production
  • Scalable and easy-to-maintain Enterprise Data Warehouse that provides marts to different departments or subjects
  • Columnar stores
  • NoSQL
  • Data Virtualization: the ability to recreate your virtual data marts, avoiding the whole lifecycle of each mart (schema-on-read)
  • MDM managing all the company’s governance data by subject, providing insights in a user-friendly way
  • Chief Information Architect
  • Data as an asset in financial statements


So, how do I get to DMM4 or DMM5? For that, I would recommend reading these two articles:



Let me explain my experience leading the whole Data Architecture (split by layers, see my previous post here) and also leading and creating, from zero, the whole Enterprise Data Warehouse solution for my current company.


About the company

Avoiding names, I would like to introduce a few facts about this company. It is one of the largest ecommerce companies in the world, with more than 13,000 employees. It has more than 70 active companies or services attached to its core company, covering a huge variety of businesses such as insurance, telecommunications, banking, credit cards, travel, and sports (it bought a soccer club, a baseball club, etc.).


About the data

As a result, we have more than 500 databases distributed around the world, with thousands and thousands of tables covering, as mentioned before, very different business domains. Its total size is still unknown, but so far we have around 2PB in our hands (and growing).


How to implement Agile DW with DV2.0


First: Data ingestion (Bring all your data into your Data Lake and keep history records)

So, in order to get all this data into our History Stage (a copy of the original plus history), we developed a Python script and a Metadata Management layer to automate all the ingestion. That’s another topic that I will cover in another post; here I am going to focus on the EDW solution that gets as much as possible from all this data.



Second: prepare automation for your Data Vault 2.0

This is a very important step, as it provides a huge benefit at a cheap price. Speaking from my own experience: I created a whole framework in Python that auto-populates Hub, Satellite, Link and SameAsLink objects, in about 2 weeks of development.

How does it work? Easy! You just need to create a view that follows the Data Vault naming convention. Let’s say we create a Satellite view to feed our DV: the framework will automatically populate not only the Satellite (table and load), but also the Hubs, reusing the same view (the possibilities for automation are limitless). With a bit of code in our framework, our view will be able to:

  • Create the hub table (if it does not exist).
  • Load hub information (only new data).
  • Create the Satellite table (if it does not exist).
  • Drop and create views such as vw_[name]_current and vw_[name]_history.
  • Load Satellite information (new, updated and deleted data).
  • Get metadata and governance to know which entities, DV objects and businesses we are loading.


This is a sample of my naming convention for the framework:

[DV Object]_[Entity]_[Business]_[common/[others]] ==> sat_customer_bank_common
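A minimal sketch of how such a naming convention can drive the automation. Only sat_customer_bank_common comes from the example above; the object-prefix set and the dispatch logic are assumptions for illustration:

```python
# Assumed prefixes for DV objects; "sal" for SameAsLink is a guess.
DV_OBJECTS = {"hub", "sat", "lnk", "sal"}

def parse_dv_view(view_name: str) -> dict:
    """Split a view name following [DV Object]_[Entity]_[Business]_[common/others]."""
    obj, entity, business, scope = view_name.split("_", 3)
    if obj not in DV_OBJECTS:
        raise ValueError(f"unknown DV object prefix: {obj}")
    return {"object": obj, "entity": entity, "business": business, "scope": scope}

def actions_for(view_name: str) -> list:
    """Decide what the framework should do for this view (simplified dispatch)."""
    parts = parse_dv_view(view_name)
    if parts["object"] == "sat":
        # A satellite view also feeds its hub, as described above.
        return ["create_hub", "load_hub", "create_sat", "recreate_views", "load_sat"]
    return ["create_" + parts["object"], "load_" + parts["object"]]
```

For example, parse_dv_view("sat_customer_bank_common") yields the object, entity, business and scope, and actions_for then returns the full hub-plus-satellite pipeline for it.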


Third: split your Satellites using the flexibility of DV2.0

So, the beauty of Data Vault is its scalability and flexibility, which are perfect for complex scenarios such as developing an EDW with an initially unknown scope for a very big company. Since the companies are so different from each other and we don’t know how many entities we will need in the end, we started by creating some basic entities, such as: customer, order, orderdetail, product, item and so forth.

We are also putting all the common attributes into a common Sat for each entity because, even though a phone company, a bank, and an eCommerce company have almost nothing in common, they all still have customers, and those customers share common attributes such as name, address, DOB, gender, etc. This is just finding the semantic understanding of our data and using it.

For the attributes that are specific to each business, we can create a special satellite for bank customers (salary, rate, etc.), other satellites for the travel business (frequent flyer number, preferred destinations, etc.), and so on.

That means we don’t need to design the whole EDW in one go, nor do we need to include every business of each entity in order to start getting value.

With the common attributes, we can reuse the same customer segmentation or KPI reports across each business, all of them together, or by groups, with only one report and a filter for the company/ies we want to analyse.

What about item, order, order detail, etc.? Easy: find the semantic understanding of the data. For example, in eCommerce an item could be an iPad 2, in banking a 30-year loan package, in travel a travel package or even a hotel room (e.g. “Holiday Inn – Atlanta – Double standard“), and so forth. Later you’ll be able to reuse the same report to check which item was the most successful during a given period (agnostic of which kind of product it is).
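Such a reusable, business-agnostic report could be a single parameterised query. The view and column names below are hypothetical (they merely follow the vw_[name]_current convention from earlier); only the idea of one report plus a company filter comes from the text:

```python
def top_items_sql(businesses: list) -> str:
    """One generic 'most successful item' report, reused across businesses
    by swapping the filter. Table/column names are illustrative."""
    filt = ", ".join(f"'{b}'" for b in businesses)
    return (
        "SELECT item_name, business, COUNT(*) AS times_sold "
        "FROM vw_sat_orderdetail_common_current "
        f"WHERE business IN ({filt}) "
        "GROUP BY item_name, business "
        "ORDER BY times_sold DESC"
    )
```

Running it for the bank and travel businesses is just top_items_sql(["bank", "travel"]); the report logic never changes, only the filter does.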

NOTE: Of course, another good practice is to split your satellites by change rate, so the columns that change most often live in one satellite that is ETL’d very frequently, while the low-change/low-priority ones can be loaded daily or weekly, etc.


Fourth: Supernova layer or “Business Delivery Layer” (c)

Thanks to our MetaDataManagement, we can automate the creation of views that remove the DV complexity. Business users can then look at the EDW in a 3NF language they are more comfortable with, or we can create (and even automate) our virtual Star Schemas based only on views, which are much easier to create, recreate and fine-tune.
* BTW, I prefer to call it “Business Delivery Layer” .
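A sketch of what generating such a business-facing view might look like. The join keys (hk_*), hub table names and output view name are assumptions for illustration; only the vw_[name]_current convention comes from the framework section above:

```python
def supernova_view_sql(entity: str, satellites: list) -> str:
    """Generate a business-friendly view that hides the DV structure,
    joining an entity's hub with the current views of its satellites."""
    joins = " ".join(
        f"LEFT JOIN vw_{sat}_current ON vw_{sat}_current.hk_{entity} = h.hk_{entity}"
        for sat in satellites
    )
    return (
        f"CREATE OR REPLACE VIEW vw_{entity}_business AS "
        f"SELECT * FROM hub_{entity} h {joins}"
    )
```

Because the output is only a view definition, recreating or fine-tuning it is cheap, which is exactly why this layer can absorb consumer feedback so quickly.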


Some references:


Pure (and real) Agile Data Warehouse

So the first and second steps are kind of a one-off: once developed, they run in an automated way.

The third step can be done one entity/business at a time, because we are not aiming to deliver all of them at once. So we can apply agile sprints, targeting a number of entities and businesses for each sprint (let’s say 2 weeks).

Once these new entities/businesses are in production, you can easily go to the fourth step and create the supernova for them (aim for automation when possible), and data consumers can start enjoying this part of the EDW, or at least checking whether they need something else or are missing things. If that happens in the supernova layer, since it is all views, it will be very fast to fix; if it is a major change in the DV layer, then in the next sprint we can fix it just by creating new satellites (always adding/appending, never rebuilding or throwing anything away).

