1. Raw
2. Derived
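In dbt terms, a derived dataset is just a version-controlled SQL model selecting from raw (or other derived) data. A minimal sketch, assuming a hypothetical raw source `raw.events` with `created_at` and `event_type` columns:

```sql
-- models/derived/daily_signups.sql (hypothetical path and name)
-- A derived dataset: a plain select over a raw table, tracked in git.
select
    date_trunc('day', created_at) as signup_date,
    count(*) as signups
from {{ source('raw', 'events') }}  -- raw table declared in the project's sources
where event_type = 'signup'
group by 1
```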
All other materializations (except ephemeral) imply that the dataset will be recreated from scratch every time.
This is not always efficient, especially for large datasets. Incremental models fix this by processing only a portion of the data on each run.
When a model takes more than a few minutes to recreate from scratch and is partitioned by date, use incremental materialization. Otherwise, use table materialization.
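A minimal sketch of an incremental model, assuming a hypothetical upstream model `stg_events` with an `event_date` column:

```sql
-- models/derived/daily_events.sql (hypothetical path and name)
{{ config(
    materialized='incremental',
    unique_key='event_date'
) }}

select
    event_date,
    count(*) as events
from {{ ref('stg_events') }}  -- hypothetical upstream model

{% if is_incremental() %}
-- On incremental runs, reprocess only the most recent partitions;
-- unique_key turns overlapping dates into a delete+insert, so the
-- latest (possibly incomplete) day is replaced rather than duplicated.
where event_date >= (select max(event_date) from {{ this }})
{% endif %}

group by 1
```

On a regular `dbt run` only the newest dates are scanned; `dbt run --full-refresh` rebuilds the whole table from scratch.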
```sh
dbt run                           # run all models in the project
dbt run --select "model1"         # run a single model
dbt run --select "model1 model2"  # run several models
dbt run --select "model1+"        # run model1 and everything downstream of it
dbt run --select "+model1+"       # run model1 with all its upstream and downstream dependencies
dbt run --full-refresh            # rebuild incremental models from scratch
dbt run --help                    # list all available options
```
- Each commit to master triggers a CircleCI job.
- The CI job builds a Docker image and pushes it to ECR.
- An Airflow DAG runs every night.
- It creates a Kubernetes pod from the Docker image and runs the `dbt run` command (`dbt build`, actually).
1. Creating something new
2. Changing something that already exists
1. Start with local scripts and experiment in the `patest` schema.
2. When everything works, create a branch in the `parrot-dbt-redshift` project and add the changes there (a sketch of such a change follows this list).
3. Create a pull request to merge the changes to master.
4. Remove the old tables from `patest`.
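For step 2, a change is usually just a SQL model file committed to the project. A minimal sketch with hypothetical names; the `schema` config is what lets the table land in a dedicated schema instead of `patest` (note that by default dbt appends a custom schema to the target schema, unless the project overrides `generate_schema_name`):

```sql
-- models/marts/active_users.sql (hypothetical path and name)
{{ config(
    materialized='table',
    schema='marts'  -- place the table in a dedicated schema, not patest
) }}

select
    user_id,
    max(event_date) as last_seen_date
from {{ ref('stg_events') }}  -- hypothetical upstream model
group by 1
```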
| Without dbt | With dbt |
|---|---|
| Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Papers, etc. | Everything is in one place. |
| History for queries is sometimes not maintained. | Proper git history. |
| There is no place to check the whole dependency graph. | The dependency graph is visible and manageable. |
| There is no way to recalculate a part of the graph (or the whole graph). | It is possible to recalculate a precise part of the graph. |
| Many derived datasets live in the `patest` schema, which makes this schema a complete mess. | Instead of using a single schema, tables can be placed in the right schema. |
| Almost everything is manual and ad hoc. | A proper daily production pipeline. |