1. Raw
2. Derived
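In dbt terms, a derived dataset is just a version-controlled SQL model selecting from raw (or other derived) data. A minimal sketch, assuming a hypothetical raw source `raw.events` with `created_at` and `event_type` columns:

```sql
-- models/derived/daily_signups.sql (hypothetical path and name)
-- A derived dataset: a plain select over a raw table, tracked in git.
select
    date_trunc('day', created_at) as signup_date,
    count(*) as signups
from {{ source('raw', 'events') }}  -- raw table declared in the project's sources
where event_type = 'signup'
group by 1
```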
All other materializations (except ephemeral) imply that the dataset will be recreated from scratch every time.
This is not always efficient, especially for large datasets. Incremental models fix this by processing only a portion of the data on each run.
When a model takes more than a few minutes to recreate from scratch and is partitioned by date, use incremental materialization. Otherwise, use table materialization.
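A minimal sketch of an incremental model, assuming a hypothetical upstream model `stg_events` with an `event_date` column:

```sql
-- models/derived/daily_events.sql (hypothetical path and name)
{{ config(
    materialized='incremental',
    unique_key='event_date'
) }}

select
    event_date,
    count(*) as events
from {{ ref('stg_events') }}  -- hypothetical upstream model

{% if is_incremental() %}
-- On incremental runs, reprocess only the most recent partitions;
-- unique_key turns overlapping dates into a delete+insert, so the
-- latest (possibly incomplete) day is replaced rather than duplicated.
where event_date >= (select max(event_date) from {{ this }})
{% endif %}

group by 1
```

On a regular `dbt run` only the newest dates are scanned; `dbt run --full-refresh` rebuilds the whole table from scratch.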
```sh
dbt run                           # run all models in the project
dbt run --select "model1"         # run a single model
dbt run --select "model1 model2"  # run several models
dbt run --select "model1+"        # run model1 and everything downstream of it
dbt run --select "+model1+"       # run model1 with all its upstream and downstream dependencies
dbt run --full-refresh            # rebuild incremental models from scratch
dbt run --help                    # list all available options
```
- Each commit to master triggers a CircleCI job.
- The CI job builds a Docker image and pushes it to ECR.
- An Airflow DAG runs every night.
- It creates a Kubernetes pod from the Docker image and runs the `dbt run` command (`dbt build`, actually).
1. Creating something new
2. Changing something that already exists
1. Start with local scripts and experiment in the `patest` schema.
2. When everything works, create a branch in the `parrot-dbt-redshift` project and add the changes there (a sketch of such a change follows this list).
3. Create a pull request to merge the changes to master.
4. Remove the old tables from `patest`.
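For step 2, a change is usually just a SQL model file committed to the project. A minimal sketch with hypothetical names; the `schema` config is what lets the table land in a dedicated schema instead of `patest` (note that by default dbt appends a custom schema to the target schema, unless the project overrides `generate_schema_name`):

```sql
-- models/marts/active_users.sql (hypothetical path and name)
{{ config(
    materialized='table',
    schema='marts'  -- place the table in a dedicated schema, not patest
) }}

select
    user_id,
    max(event_date) as last_seen_date
from {{ ref('stg_events') }}  -- hypothetical upstream model
group by 1
```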
| Without dbt | With dbt |
|---|---|
| Queries for derived datasets are spread across dev machines, Confluence pages, Dropbox Papers, etc. | Everything is in one place. |
| History for queries is sometimes not maintained. | Proper git history. |
| There is no place to check the whole dependency graph. | The dependency graph is visible and manageable. |
| There is no way to recalculate a part of the graph (or the whole graph). | It is possible to recalculate a precise part of the graph. |
| Many derived datasets live in the `patest` schema, which makes this schema a complete mess. | Instead of using a single schema, tables can be placed in the right schema. |
| Almost everything is manual and ad hoc. | A proper daily production pipeline. |