Case study of a proper Data Architecture

So in order to expose the different layers and artefacts contained in this architecture see in the next picture and refer the numbers in the table below.

Data Architecture, Data Lake, Data Vault, Data Virtualization, Data governance and Data lineage

1Data source or Data Owners space. This could be Relational Database, API, Flat files, etc.
2Python customized script that will do all the ELT, capture audit information, feed the MDM (with Metadata automatically and configuration through operators) etc
3Layer where we will store all the data sets on its raw shape (just adding audit or ELT extra columns). Ideally it will be placed on Hive (Data Lake)
4When python (E)xtracts the source information into a file, then it loads it into Hadoop/Hive. This could be Delta or Full. This table is totally temporal, so it is not persistent.
5Now python will do the (L)oad. With our extra columns, we can hash the upcoming data and compare it through the Business Key to the one already stored in the History Stage (persistent) table. Then we will insert ONLY new and updated data. Later we will run a Delete Detection Pattern and insert the deleted Rows.
6This table is persistent and it will contain the initial full load and the inserted/deleted/updated new records on the following loads.
7The enterprise layer (or integration layer) is where our EDW will be located. Typically in Hive, but if needed could be stored anywhere else.
8Python will also load from our Raw Layer, the already modelled (through views ideally) data into a persistent EDW tables. In this case, we will apply only Hard Business Rules, which are the ones that not change the meaning of the data, just the shape. Example: Change data type, standardize Timestamps into a unique format, etc.
This data will land into what we called "Raw Data Vault". Modelled with DV 2.0 fashion, the data won't be agreegated or changed here.
9If required (optional) we will create out Business Data Vault, which could be views or another ELT to transform the data into more meaningful data for the business, where we will apply business rules, do aggregations, merge data, etc.
10This layer works on top of the DV layer. Because DV modelling is quite complex (even for some Data Engineers), this layer is meant to create (automatically if possible) a set of views with the information stored in RDV and BDV in a 3NF fashion (or even Star Schema fashion). So the Data consumers doesn't need to deal with the modelling complexity of DV2.0
11This Database with the output of EDW (DV) could be stored in Hive or Teradata or any other high performance database for later consumption. Ideally it will be just views or materialized views.
12This is our Data Virtualization space. From here we will choose a technology able to do cross platform queries and expose all our Raw Layer and EDW information through only this channel. Because the will connect to a simple JDBC or ODBC, they don't need to deal with the complexity of looking for data in Hive or in Teradata, etc. Everything will be simplified.
13We will apply Security and expose only the access level for that specific user (stored in MDM) creating views to the granted tables and columns. This database (in presto in our example) will need to regenerate all the views to follow the security specs stored in MDM or to capture new tables (also specified in MDM)
14Backbone of everything. Here we will stored the Metadata (refresh in every load), the patterns to use when loading, the extra configuration (blacklist columns/tables, identify delta fields) and also deal with the security to be exposed in the Presentation (or data Access layer)
15Audit database will track the data lineage of every single row and also stored information about the ELT and even, through presto, how many people are accessing which tables and how often. With this we can do a hot/warm/cold strategy.

Did you like this? Share it:

Leave a Reply