Next generation Data Warehouse (as of 2022)


There are three qualities that make a Data Warehouse worth working on:

* Efficiency: You need a Data Warehouse that doesn't have to start on day zero with the perfect data model for the next 10 years; that "big design up front" will cost months of analysis and thinking, only to be wrong 99% of the time at some point. We also need to be able to adapt to changes and deliver value and benefits within weeks (instead of months or years). That is how I define efficiency in a DW.
* Durability: Build your own framework, one that stays flexible in terms of technologies. The only thing you need to ensure is that your loading methodology and algorithms stay the same regardless of the technology in your pipeline; then, when you move to a new technology or combine several, they will all speak the same language.
* Scalability: Here we mean more than adding servers to the same SQL engine: being able to combine Big Data technologies (S3, Hadoop, or any schema-on-read system) with high-performance relational databases (such as Teradata, Redshift, MSSQL), real-time SQL engines (such as ksqlDB), or NoSQL stores into a well-connected Data Mesh, where teams work in their preferred or required environment without creating silos or duplicating work, and can combine their data with other teams' data.

How do we achieve this? By having a proper Data Architecture, Data Modelling technique, and ETL Framework that cover the three points above.


Data Architecture

Your data architecture needs to be flexible, especially in a mid-size or large enterprise, where hundreds of Data Engineers work with little connection to distant teams. Ideally this wouldn't happen, but let's face it: in big organizations some teams are built with different skills than others, and some departments have little to no alignment with others (in terms of technologies, or in how they ingest and process data).

So, instead of swimming against the current and pretending this isn't happening, we should encourage different teams or departments to have their own Data Warehouse. But rather than isolated silos, they must be consistent in the main methodologies used to ingest and process data, and have solid Data Governance in place so it is easy to find out which team is working with which data. Some people call this new way of working the "Data Mesh", which creates small data sets or DWs organized as so-called data products. Each team has the responsibility to create and use its own data without depending on others, which is great. Keep in mind that we need ways to share this data between teams and, of course, Data Modeling must be consistent across the whole org, so everyone speaks the same language. That's our next point.
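The governance piece above implies some kind of catalog that answers "which team owns which data product, and where does it live?". Here is a minimal in-memory sketch of such a data-product registry; all class names, fields, and values are hypothetical illustrations, not a real tool's API:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """One entry in a minimal data-product catalog (fields are illustrative)."""
    name: str
    owner_team: str
    engine: str          # e.g. "hive", "teradata", "redshift"
    location: str        # table name, S3 prefix, topic, ...
    tags: list = field(default_factory=list)

class Catalog:
    """In-memory registry so any team can discover who owns which data."""
    def __init__(self):
        self._products = {}

    def register(self, product: DataProduct):
        self._products[product.name] = product

    def owner_of(self, name: str) -> str:
        return self._products[name].owner_team

    def find_by_team(self, team: str):
        return [p for p in self._products.values() if p.owner_team == team]
```

In practice this would be a shared service or metadata store rather than a Python dict, but the interface is the point: registration and discovery are cheap, so teams expose their data products instead of hiding them in silos.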


Data modeling

Data Modelling needs to be extremely flexible, yet governed by strong rules, naming conventions, and patterns. That is exactly what Data Vault 2.0 is great at. With Dimensional Modelling or 3NF there are several "schools" of people doing things in many different ways, but DV2.0 comes with a few clear, easy-to-follow rules that you shouldn't break, and defining naming conventions is very easy here.

This modeling technique avoids the "big design up front" approach: you start small, have results within the first week or two, and then keep appending and expanding your DW without any risk of breaking anything already built. It is extremely modular (think micro-services, but for the DW) and very easy to automate (which, again, gives you great efficiency at low cost).
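The modularity and automation potential come from how regular Data Vault structures are: hubs hold business keys, satellites hold history, and both follow fixed naming patterns. As a rough sketch (the HUB_/SAT_/_HK/LOAD_DTS conventions shown are common DV2.0 style but the exact names are a team choice, and the SQL is generic ANSI-ish DDL), the DDL can be generated rather than hand-written:

```python
def hub_ddl(entity: str, business_key: str) -> str:
    """Generate a Data Vault hub: hash key + business key + load metadata."""
    e = entity.upper()
    return (
        f"CREATE TABLE HUB_{e} (\n"
        f"  {e}_HK CHAR(32) NOT NULL PRIMARY KEY,\n"
        f"  {business_key.upper()} VARCHAR(100) NOT NULL,\n"
        f"  LOAD_DTS TIMESTAMP NOT NULL,\n"
        f"  RECORD_SOURCE VARCHAR(50) NOT NULL\n"
        f")"
    )

def satellite_ddl(entity: str, attributes: list) -> str:
    """Satellite: descriptive attributes keyed by (hash key, load timestamp),
    so history is appended, never updated."""
    e = entity.upper()
    cols = ",\n".join(f"  {a.upper()} VARCHAR(255)" for a in attributes)
    return (
        f"CREATE TABLE SAT_{e} (\n"
        f"  {e}_HK CHAR(32) NOT NULL,\n"
        f"  LOAD_DTS TIMESTAMP NOT NULL,\n"
        f"  RECORD_SOURCE VARCHAR(50) NOT NULL,\n"
        f"{cols},\n"
        f"  PRIMARY KEY ({e}_HK, LOAD_DTS)\n"
        f")"
    )
```

Because every hub and satellite has the same shape, adding a new entity is a one-line call (e.g. `hub_ddl("customer", "customer_id")`) rather than a redesign, which is exactly the "keep appending without breaking anything" property described above.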

A very old article on Data Vault can be found on this blog, but I encourage you to search for more resources.


ETL Framework

Nowadays, you have a few options when building your pipelines (ETL):

* Cloud: If you are in the cloud and lazy enough to buy every single service your provider offers, then moving to the cloud to lower your costs is going the wrong way. Each of these "easy to build" pipelines already integrated into your cloud provider (e.g. AWS Glue) is going to cost you a lot, plus hidden costs, plus the effort of learning new tools that are not customized for you. Spoiler alert: you will still end up using a third-party tool or your own scripts to move the data, perhaps paying a lot of money just to run a shell command from your "data pipeline cloud service".
* Third-party: Some third-party tools are headaches and a terrible way to pay a lot for little benefit; others are not cheap, but deliver on their promises. A good evaluation and test phase is very important here; at the very least, don't let a sales representative decide for you with words and a short demo. Still, IMHO, we should focus on something lower cost and more tailor-made for your company and needs.
* Manual ETL: or, as I call it, artisanal ETL development. These old-school methods keep highly qualified Data Engineers crafting by hand some polished but extremely expensive (and inefficient) ETLs. These solutions are legacy from the past that companies prefer to keep (regardless of the cost) because the risk of changing feels scarier than the benefits (or so they think).
* Tailor-made ETL Framework: For me, the best option by far. This framework will work exactly as we need, built by the very data engineers who will use it, and it will excel at our demands because its design matches our needs perfectly. What always scares people about this option is the price, but unless you develop an over-engineered framework, it should compete with, or beat, the other solutions (third-party, etc.). Just remember: simplicity! The most important word when developing your data framework.

So, going with the tailor-made ETL Framework option: I have plenty of experience doing this and can share some points very quickly, though going into detail could easily take anywhere from an hour to a week of meetings presenting the solution and design. Long story short:

* As mentioned before, keep it simple. Don't over-engineer it, otherwise it will be costly to maintain and difficult to use.
* Build it in modules, like microservices: create APIs that talk to each other in different layers. Defining good rules and standards for their interfaces will make it more stable and robust.
* Prepare at least a few layers. For example:
  – Physical layer: connects to any database engine via shell commands and text, e.g. hive -e "<content_from_data_layer>" or mysql -e "<…>"
  – Data layer: where we translate the SQL syntax for each technology you add. A CREATE TABLE statement is not the same in Teradata, PostgreSQL, MSSQL, Hive, …
  – Business layer: based on the metadata acquired by the physical layer and requested by the data layer, we can start building our loading methods, e.g. load history to history, load SCD1, delete detection, full or bulk load, …
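The three layers above can be sketched very compactly. This is a minimal illustration, not a full framework: the engine list, type mappings, and the simplified SCD1 pattern are assumptions for the example, and a real data layer would cover many more statements and dialects:

```python
import subprocess

# Physical layer: hand SQL text to each engine's shell client.
ENGINE_CMDS = {
    "hive":  ["hive", "-e"],
    "mysql": ["mysql", "-e"],
}

def run_sql(engine: str, sql: str) -> str:
    """Execute a SQL string via the engine's CLI and return stdout."""
    result = subprocess.run(ENGINE_CMDS[engine] + [sql],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Data layer: translate one logical operation into each engine's dialect.
def create_table_sql(engine: str, table: str, columns: dict) -> str:
    """CREATE TABLE differs per engine; map logical types to dialect types."""
    type_map = {
        "hive":     {"string": "STRING",       "int": "INT"},
        "teradata": {"string": "VARCHAR(255)", "int": "INTEGER"},
        "mysql":    {"string": "VARCHAR(255)", "int": "INT"},
    }
    cols = ", ".join(f"{name} {type_map[engine][t]}"
                     for name, t in columns.items())
    return f"CREATE TABLE {table} ({cols})"

# Business layer: loading patterns expressed on top of the layers below.
def load_scd1(target: str, staging: str, key: str) -> str:
    """Simplified SCD1 (overwrite-by-key) load; engines with MERGE
    support would use that instead of delete-then-insert."""
    return (f"DELETE FROM {target} WHERE {key} IN "
            f"(SELECT {key} FROM {staging}); "
            f"INSERT INTO {target} SELECT * FROM {staging}")
```

The key design point is that only the data layer knows about dialects: the business layer asks for "an SCD1 load" and the physical layer just runs text, so adding a new engine means adding one CLI entry and one type map, not rewriting the loading logic.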

Now, I should mention that this kind of framework will be extremely compatible with many (if not all) of the SQL engines available; hence, when doing Data Mesh, all teams, regardless of their tech preference, will load their data the same way, following the same rules and standards, which makes everything extremely compatible and easy to integrate between them.

Perhaps I will give more details about the framework in my next post. It depends on the feedback I get here 🙂

