dbt
data build tool
1. What is it?
2. Why do we need it?
Context
- Redshift
- Multiple datasets:
  1. Raw
     - Copied from somewhere (S3, RDS, ...)
     - Put there directly by jobs/apps
  2. Derived
     - Generated from raw and other derived datasets
Derived Datasets
Current State
- Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Paper docs, etc.
- Query history is often not maintained.
- There is no single place to see the whole dependency graph.
- Many of the derived datasets live in the patest schema, which makes that schema a complete mess.
- There is no way to recalculate a part of the graph (or the whole graph).
- Almost everything is manual and ad hoc.
How can dbt help here?
What is dbt?
- Macros folder
- schema.yml
- *.sql files (see the model sketch after this list)
- dbt is a Python CLI app
- dbt_project.yml
- venv (source .venv/bin/activate)
- dbt CLI
- dbt --help
- dbt compile
- dbt is a management layer on top of raw SQL
- Dependency tracking
- dbt docs generate + serve
- Materialization
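A dbt model is just a SELECT statement in a *.sql file; dbt builds the dependency graph from the ref() calls inside it. A minimal sketch (the model and column names here are hypothetical, not from our project):

```sql
-- models/derived/active_users.sql (hypothetical model name)
-- dbt sees the ref() call below, so it knows this model depends on raw_events
-- and always builds raw_events first; this is how dependencies are tracked.
select
    user_id,
    count(*) as events_count
from {{ ref('raw_events') }}
group by user_id
```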
Demo
- view
- materialized view
- table
- ephemeral
- incremental
Materialization
- All other materializations (except ephemeral) imply that the dataset is recreated from scratch on every run.
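The materialization for a model is declared either in dbt_project.yml or with a config() block at the top of the model file. A minimal sketch of the in-file variant (hypothetical model and columns):

```sql
-- models/derived/user_stats.sql (hypothetical model name)
-- 'table' means dbt drops and recreates this table from scratch on every run
{{ config(materialized='table') }}

select
    user_id,
    min(event_date) as first_seen,
    max(event_date) as last_seen
from {{ ref('raw_events') }}
group by user_id
```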
Incremental Materialization
- Recreating a dataset from scratch is not always efficient, especially for large datasets.
- Incremental models fix this by processing only a portion of the data on each run.
Incremental Materialization
When a model requires more than a few minutes to be recreated from scratch, and it is partitioned by date, use incremental materialization.
Otherwise, use table materialization.
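A minimal sketch of an incremental, date-partitioned model (the model and column names are hypothetical):

```sql
-- models/derived/daily_events.sql (hypothetical model name)
{{ config(materialized='incremental', unique_key='event_date') }}

select
    event_date,
    count(*) as events_count
from {{ ref('raw_events') }}
{% if is_incremental() %}
  -- on incremental runs, only process dates newer than what the target table already has
  where event_date > (select max(event_date) from {{ this }})
{% endif %}
group by event_date
```

On the first run (or with --full-refresh) dbt builds the whole table; on later runs the is_incremental() branch limits the work to new dates.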
How to run queries?
dbt run
dbt run --select "model1" (only model1)
dbt run --select "model1 model2" (model1 and model2)
dbt run --select "model1+" (model1 and everything downstream of it)
dbt run --select "+model1+" (model1 plus all its upstream and downstream models)
dbt run --full-refresh (rebuild incremental models from scratch)
dbt run --help
Production
- Each commit to master triggers a CircleCI job.
- The CI job builds a Docker image and pushes it to ECR.
- An Airflow DAG runs every night.
- It creates a k8s pod from the Docker image.
- And runs the dbt run command (actually dbt build).
How to work with the project?
- README.md
- justfile
- Dockerfile
- airflow_run.sh
- 2 main scenarios:
  1. Creating something new
  2. Changing something that already exists
Creating something new
1. Start with local scripts and experiment in the patest schema.
2. When everything works, create a branch in the parrot-dbt-redshift project and add the changes there.
3. Create a pull request to merge the changes to master.
4. Remove the old tables from patest.
Changing something
- Work in a local branch.
- To test, run dbt from your local machine, but try to minimise the impact (e.g. dbt run --select "my_model" to limit the number of affected models).
- When finished, create a pull request to master.
What else can dbt do?
- Tests (see the sketch after this list)
- Source freshness
- Maintaining state between runs to rebuild only the models that changed
- Much more, see docs: https://docs.getdbt.com
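As one example of tests, a "singular" dbt test is just a SQL file in the tests/ directory that selects rows that should not exist; dbt treats any returned row as a failure (the model and column names below are hypothetical):

```sql
-- tests/assert_no_negative_event_counts.sql (hypothetical test)
-- the test fails if this query returns any rows
select *
from {{ ref('daily_events') }}
where events_count < 0
```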
Current State
- Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Paper docs, etc.
- Query history is often not maintained.
- There is no single place to see the whole dependency graph.
- Many of the derived datasets live in the patest schema, which makes that schema a complete mess.
- There is no way to recalculate a part of the graph (or the whole graph).
- Almost everything is manual and ad hoc.
dbt benefits
| Without dbt | With dbt |
|---|---|
| Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Paper docs, etc. | Everything is in one place. |
| Query history is often not maintained. | Proper git history. |
| There is no single place to see the whole dependency graph. | The dependency graph is visible and manageable. |
| There is no way to recalculate a part of the graph (or the whole graph). | A precise part of the graph (or the whole graph) can be recalculated. |
| Many of the derived datasets live in the patest schema, which makes that schema a complete mess. | Instead of a single shared schema, tables can be placed in the right schema. |
| Almost everything is manual and ad hoc. | A proper daily production pipeline. |
Thank you!
dbt
By Yury Badalyants
Internal Presentation for Parrot Analytics