dbt

data build tool

1. What is it?

2. Why do we need it?

Context

  • Redshift
  • Multiple datasets

1. Raw

2. Derived

  • Copied from somewhere (S3, RDS, ...)
  • Put directly by jobs/apps
  • Generated from raw and other derived

Derived Datasets

Current State

  • Queries for derived datasets are spread across dev machines, confluence pages, dropbox papers, etc.
  • History for queries is sometimes not maintained.
  • There is no place to check the whole dependency graph.
  • Many of the derived datasets live in the patest schema, which makes this schema a complete mess.
  • There is no way to recalculate a part of the graph (or the whole graph).
  • Almost everything is manual and ad-hoc.

How can dbt help here?

What is dbt?

  • Macros folder
  • schema.yml
  • *.sql files
  • dbt as a python script (CLI app)
  • dbt_project.yml
  • venv (source .venv/bin/activate)
  • dbt cli
    • dbt --help
    • dbt compile
  • dbt is a management layer on top of raw SQL
    • Dependency tracking
    • dbt docs generate + serve
    • Materialization
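To make the "management layer" idea concrete: a dbt model is just a SELECT statement in a .sql file, and dependencies are declared with ref(), which is how dbt builds the graph. A minimal sketch; the model and column names here are hypothetical, not part of our project:

```sql
-- models/daily_revenue.sql (hypothetical model name)
-- dbt resolves ref() to the actual schema and table at compile time
-- and records the dependency edge, so daily_orders is built first.
select
    order_date,
    sum(amount) as revenue
from {{ ref('daily_orders') }}
group by order_date
```

Running dbt compile turns this into plain Redshift SQL; dbt docs generate then renders the resulting graph.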

Demo

  • view

  • materialized view

  • table

  • ephemeral

  • incremental
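Each of these materializations is chosen per model (or per folder) through configuration. A hedged sketch of a dbt_project.yml fragment; the project and folder names are assumptions for illustration:

```yaml
# dbt_project.yml fragment (project/folder names hypothetical)
models:
  my_project:
    staging:
      +materialized: view    # cheap, recreated as a view
    marts:
      +materialized: table   # fully rebuilt on every run
```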

Materialization

  • All other materializations (except ephemeral) imply that the dataset is recreated from scratch on every run.

Incremental Materialization

  • Recreating a dataset from scratch is not always efficient, especially for large datasets.

  • Incremental models fix this by copying only a portion of the data on each run.

Incremental Materialization

When a model requires more than a few minutes to be recreated from scratch, and it is partitioned by date, use incremental materialization.


Otherwise, use table materialization.
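A hedged sketch of a date-partitioned incremental model; the model name, source, and columns are hypothetical:

```sql
-- models/events_daily.sql (hypothetical)
{{ config(materialized='incremental') }}

select
    event_date,
    count(*) as events
from {{ ref('raw_events') }}
{% if is_incremental() %}
  -- On incremental runs, copy only rows newer than what is already
  -- in the target table; {{ this }} refers to the existing table.
  where event_date > (select max(event_date) from {{ this }})
{% endif %}
group by event_date
```

On the first run (or with --full-refresh) the whole table is built; afterwards only new dates are appended.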

How to run queries?

dbt run                             # run all models

dbt run --select "model1"           # run a single model

dbt run --select "model1 model2"    # run two specific models

dbt run --select "model1+"          # run model1 and everything downstream of it

dbt run --select "+model1+"         # run model1 plus all upstream and downstream models

dbt run --full-refresh              # rebuild incremental models from scratch

dbt run --help                      # see all options
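To make the selector syntax concrete, here is a toy Python sketch (not dbt internals) of how "model1+" resolves to the model plus everything downstream of it in the dependency graph; the model names are hypothetical:

```python
# Toy illustration: "model1+" = model1 plus all transitive downstream models.
from collections import deque

# edges: model -> models that depend on it (downstream)
downstream = {
    "raw_orders": ["daily_orders"],
    "daily_orders": ["daily_revenue"],
    "daily_revenue": [],
}

def select_plus(model, graph):
    """Return the model and all models transitively downstream of it."""
    seen, queue = {model}, deque([model])
    while queue:
        for child in graph[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(select_plus("daily_orders", downstream)))
# -> ['daily_orders', 'daily_revenue']
```

"+model1" walks the same graph in the opposite direction, and "+model1+" combines both.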

Production

  • Each commit to master triggers a CircleCI job

  • The CI job builds a Docker image and pushes it to ECR

  • An Airflow DAG runs every night

  • It creates a Kubernetes pod from that Docker image

  • And runs dbt (dbt build, in fact, rather than plain dbt run)
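The image the pipeline builds might look roughly like this; a hypothetical Dockerfile sketch, with the base image, adapter version, and target name as assumptions, not our actual setup:

```dockerfile
# Hypothetical sketch: package the dbt project so the nightly
# Airflow-created pod only needs to invoke dbt build.
FROM python:3.11-slim
RUN pip install dbt-redshift
COPY . /app
WORKDIR /app
ENTRYPOINT ["dbt", "build", "--target", "prod"]
```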

How to work with the project?

  • README.md
  • justfile
  • Dockerfile
  • airflow_run.sh
  • 2 main scenarios:

1. Creating something new

2. Changing something that already exists

Creating something new

1. Start with local scripts and experiment in patest schema.

2. When everything works, create a branch in the parrot-dbt-redshift project and add your changes there.

3. Create a pull request to merge changes to master.

4. Remove old tables from patest.

Changing something

  • Work in a local branch
  • To test, run dbt from your local machine, but try to minimise the impact (the number of affected models)
  • When finished, create a pull request to master.

What else can dbt do?

  • Tests
  • Source freshness
  • Maintaining state between runs so that only changed models are rebuilt
  • Much more, see the docs: https://docs.getdbt.com
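Tests and freshness checks are declared next to the models in schema.yml. A minimal sketch; the model, source, and column names are hypothetical:

```yaml
# models/schema.yml fragment (names are hypothetical)
version: 2

models:
  - name: daily_revenue
    columns:
      - name: order_date
        tests:
          - unique      # dbt test fails if order_date repeats
          - not_null

sources:
  - name: raw
    loaded_at_field: loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_events
```

dbt test runs the column tests; dbt source freshness checks how stale each source table is.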


dbt benefits

Without dbt → With dbt

  • Queries for derived datasets are spread across dev machines, confluence pages, dropbox papers, etc. → Everything is in one place.
  • History for queries is sometimes not maintained. → Proper git history.
  • There is no place to check the whole dependency graph. → The dependency graph is visible and manageable.
  • There is no way to recalculate a part of the graph (or the whole graph). → Any precise part of the graph can be recalculated.
  • Many of the derived datasets live in the patest schema, making it a complete mess. → Instead of one shared schema, tables can be placed in the right schema.
  • Almost everything is manual and ad-hoc. → A proper daily production pipeline.

Thank you!
