Data warehouse frameworks

 

DOs and DON'Ts

 

and maybe a few words about DBT framework

What is a DWH Framework?

  • Encapsulates transformation
  • Provides logging
  • Composability
  • Teamwork, collaboration
  • Version and change management (different targets, environments, git integration, etc)
  • Documentation
  • Testing pipelines (CI/CD)
  • Deployment 
  • Job execution, dependency management (DAGs)
  • Execution handling

 

Yes, we need all of these.

ETL Tools

Point and click adventure

$$$

 

  • Alteryx
  • Talend
  • Informatica
  • Matillion

ETL Frameworks

Code (SQL and/or Python)

$

 

  • dbt 
  • airflow
  • metaflow
  • beam
  • home grown shit

Good framework generates SQLs and keeps transformation inside the database

And now some rants.

Why you *must* not build frameworks in UDFs/Stored Procedures

Team Work with UDF/SP

Expectation

  • Multiple teams can work on different features in the same database
  • Isolation, people can work on the same loads 

 

Reality

  • Two people cannot work on the same function 
  • No easy way to merge code from different branches

 

Issues

  • Manually identify what things have changed by a new feature?
  • How to tie table changes with code changes as one package

Object Persistence with SPs

Expectation

  • Code is isolated from the data. You can take a production source data dump and easily re-build the data warehouse (One-click deployment in a new db)
  • Seed management (like static lookups)

Reality

  • When you copy prod data to stage you are fucked
  • build a new environment takes forever
  • no clear definition in object dependency
  • shitton of manual metadata 

Issues

  • a lot. 

Version and change control with SPs

Expectation

  • Tight git integration (read only master branch, feature branching)
  • Easy way to change source and target DBs, schemas, tables
  • Jobs ALWAYS coming from the versioned entities

Reality

  • You CANNOT enforce to run things directly from git
  • no automated way to deploy the DB objects with code (migration support)

Issues

  • you will be always inconsistent with your version manager tool as UDFs cannot be executed from git for instance
  • Implementing your own change management will take an infinite amount of time. Rollback will suck, always.

Logging and Metadata stored in Database

Expectation

  • Logs are easy to read, easy to search
  • No impact on database performance or transactions
  • In case of a database issue, jobs still can log what went wrong

Reality

  • Logging will fuck up performance and transaction contexts
  • If the database dies, your logging dies

Issues

  • When refreshing data from prod to dev, you override your past logs and metadata
  • Versioning metadata and seed along with the code is impossible if it is stored in database 

IN SHORT

 

DO NOT, I REPEAT, NEVER USE STORED PROCEDURES OR UDFs TO IMPLEMENT ETL JOBS OR FRAMEWORKS

 

THANK YOU, THE HUMANITY

OKAY, THEN WHAT?

Use DBT!

WTF is dbt?

DBT is data pipeline framework, using SQL, generated from python.

What's inside?

  • All transformations (models) are SQL with jinja templating
  • Utils (SCD, snapshot)
  • GIT, CI/CD and everything you're using when you dev sw
  • Dependency management
  • Deployment (dev, test, prod)
  • Job execution

Few Concepts

Models = Transformation = SQL

Models = Transformation

1 Model = 1 SQL statement

Materialization can be a view, table, temp table, or custom 

Handles schema changes 

BTW, refreshing models

Truncate Insert is dead.

Long Live The

CREATE OR REPLACE TABLE

Other DBT things

  • Source - define source systems, tables
  • Snapshots - SCD Type2/6 tables
  • Incremental Models
  • Macros - inline, reusable functions
  • Exposures - external dependencies 
  • Seeds
  • Tests 

Seed.

DEMO

QUESTIONS

Data warehouse frameworks

By Tamas Foldi

Data warehouse frameworks

  • 486