dbt
data build tool
1. What is it?
2. Why do we need it?
Context
- Redshift
- Multiple datasets
  1. Raw
     - Copied from somewhere (S3, RDS, ...)
     - Put directly by jobs/apps
  2. Derived
     - Generated from raw and other derived
Derived Datasets
Current State
- Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Paper docs, etc.
- History for queries is sometimes not maintained.
- There is no place to check the whole dependency graph.
- Many of the derived datasets are in the patest schema, which makes this schema a complete mess.
- There is no way to recalculate a part of the graph (or the whole graph).
- Almost everything is manual and ad hoc.
How can dbt help here?
What is dbt?
- Macros folder
- schema.yml
- *.sql files
- dbt as a Python script (CLI app)
- dbt_project.yml
- venv (source .venv/bin/activate)
- dbt CLI
	- dbt --help
- dbt compile
 
- dbt: a management layer on top of raw SQL
	- Dependency tracking (see the model sketch below)
- dbt docs generate + serve
- Materialization
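
A minimal sketch of what a model file looks like. The names `daily_signups` and `raw_events` are hypothetical, not models from our project; the body is plain SQL, and the `{{ ref() }}` call is resolved by dbt, which is also how dbt learns the dependency graph.

```sql
-- models/daily_signups.sql  (hypothetical model, not from our project)
-- ref() is resolved at compile time and records the dependency on raw_events
select
    date_trunc('day', created_at) as signup_date,
    count(*)                      as signups
from {{ ref('raw_events') }}
where event_type = 'signup'
group by 1
```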
 
Demo
- view
- materialized view
- table
- ephemeral
- incremental
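
A sketch of how a model chooses its materialization: a `config()` block at the top of the model file (or an entry in dbt_project.yml). The model and column names below are hypothetical.

```sql
-- models/active_users.sql  (hypothetical model)
-- common values: view, table, ephemeral, incremental
-- (materialized views depend on the adapter and dbt version)
{{ config(materialized='table') }}

select distinct user_id
from {{ ref('raw_events') }}  -- hypothetical upstream model
```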
Materialization
- All other materializations (except ephemeral) imply that the dataset will be recreated from scratch on every run.
Incremental Materialization
- Recreating a dataset from scratch is not always efficient, especially for large datasets.
- Incremental models fix this by copying only a portion of the data on each run.
Incremental Materialization
When a model requires more than a few minutes to be recreated from scratch, and it is partitioned by date, use incremental materialization.
Otherwise, use table materialization.
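
A sketch of an incremental model partitioned by date, assuming a hypothetical `raw_events` model with an `event_date` column. On the first run (and on `--full-refresh`) the whole query runs; on later runs only the new slice is copied.

```sql
-- models/daily_events.sql  (hypothetical incremental model)
{{ config(materialized='incremental') }}

select
    event_date,
    user_id,
    count(*) as events
from {{ ref('raw_events') }}
{% if is_incremental() %}
  -- on incremental runs, only copy rows newer than what the target table already has
  where event_date > (select max(event_date) from {{ this }})
{% endif %}
group by 1, 2
```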
How to run queries?
dbt run
dbt run --select "model1" (only model1)
dbt run --select "model1 model2" (model1 and model2)
dbt run --select "model1+" (model1 and everything downstream of it)
dbt run --select "+model1+" (model1 plus everything upstream and downstream)
dbt run --full-refresh (rebuild incremental models from scratch)
dbt run --help
Production
- Each master commit triggers a CircleCI job
- The CI job builds a Docker image and pushes it to ECR
- An Airflow DAG runs every night
- It creates a k8s pod from the Docker image
- And runs the dbt run command (dbt build, actually)
How to work with the project?
- README.md
- justfile
- Dockerfile
- airflow_run.sh
- 2 main scenarios:
  1. Creating something new
  2. Changing something that already exists
Creating something new
1. Start with local scripts and experiment in the patest schema.
2. When everything works, create a branch in the parrot-dbt-redshift project and add the changes there (see the sketch after this list).
3. Create a pull request to merge the changes to master.
4. Remove the old tables from patest.
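
A sketch of step 2, assuming a hypothetical query being promoted out of patest; the model name, path, and schema value are assumptions, so check dbt_project.yml for the project's real schema convention.

```sql
-- models/marketing/campaign_spend.sql  (hypothetical model and path)
-- the query that used to live in patest, now versioned in git
-- and materialized into its own schema instead of patest
{{ config(materialized='table', schema='marketing') }}

select
    campaign_id,
    sum(spend) as total_spend
from {{ ref('raw_campaign_events') }}  -- hypothetical upstream model
group by 1
```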
Changing something
- Work in a local branch.
- To test, run dbt from your local machine, but try to minimise the impact (the number of affected models).
- When finished, create a pull request to master.
What else can dbt do?
- Tests (see the sketch below)
- Source freshness
- Maintaining state between runs to update only the models that changed
- Dev/Staging environment
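
For the Tests bullet, a sketch of a singular test: a SQL file under tests/ that selects rows violating an assertion, and dbt test fails if the query returns any rows. The model and column names are hypothetical.

```sql
-- tests/assert_signups_are_non_negative.sql  (hypothetical test)
-- dbt test fails this test if the query returns any rows
select *
from {{ ref('daily_signups') }}
where signups < 0
```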
dbt benefits
| Without dbt | With dbt | 
|---|---|
| Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Paper docs, etc. | Everything is in one place. |
| History for queries is sometimes not maintained. | Proper git history. |
| There is no place to check the whole dependency graph. | The dependency graph is visible and manageable. |
| There is no way to recalculate a part of the graph (or the whole graph). | Any precise part of the graph can be recalculated. |
| Many of the derived datasets are in the patest schema, which makes this schema a complete mess. | Instead of one catch-all schema, tables can be placed in the right schema. |
| Almost everything is manual and ad hoc. | A proper daily production pipeline. |
Thank you!
dbt
By Yury Badalyants
Internal Presentation for Parrot Analytics