Espresso AI
Faster, Cheaper, Easier: Choose 3-
Automatically Optimize dbt at Scale
Overview
Overview
- Why we're working on optimizing dbt
- Automatic Query Incrementalization
- dbt Job Scheduling
Who am I?
# Overview
- One of the founders of Espresso AI. We use machine learning to automatically optimize Snowflake
- I used to work on systems performance at Google Cloud, ML at Google Search, and I’ve done research on RL for systems performance optimization with DeepMind.
# Overview
Why Snowflake?
- Snowflake costs are one of today's top IT line items - and growing fast
- We're saving customers up to 70% on their Snowflake bills.
# How it works
How does it work?
- We have a warehouse optimization agent that manages cluster autoscaling. It accesses Snowflake through a Snowflake account.
- Our dbt-specific features use a SQL proxy. The proxy lets us route and optimize queries on the fly.
# How it Works

Why are we focusing on dbt?
# Why dbt?
- dbt is the most popular tool across our customers
- We're working on dbt-specific optimizations to run dbt jobs faster and cheaper.
- We've found one super simple way to speed up dbt on Snowflake. Stay tuned!
# 1. Query Incrementalization
Query Incrementalization
Query Incrementalization
# Query Incrementalization
- You can run ELT jobs massively faster and cheaper by incrementalizing your queries.
- Incrementalizing means taking a query that processes an entire dataset and changing it to reprocess only new data for each run.
select * from {{ ref('events') }}
{% if is_incremental() %}
where event_time >= (select max(event_timestamp)::date from {{ this }})
{% endif %}# Query Incrementalization
Query Incrementalization Example
# Query Incrementalization
Automatic Query Incrementalization
This gets a lot more complicated.
- Joins
- Multiple Joins
- Aggregations
- Filters
- Subqueries
# Query Incrementalization
Automatic Query Incrementalization

# Query Incrementalization
dbt Microbatch incremental models
- Microbatch incremental models are a new dbt feature
- We can leverage microbatch annotations to help generate more complex incremental queries automatically
# Query Incrementalization
Q&A - Incrementalization
# Scheduling
Scheduling
# Scheduling
Scheduling
- dbt jobs produce a DAG of queries for your job
- There are many ways to execute to most DAGs
- We're using schedling theory to find faster execution plans (and maybe cheaper ones.)
# Scheduling
Scheduling - DAG

# Scheduling
Scheduling - DAG

# Scheduling
Scheduling

Default Snowflake Runtime:
6 minutes on 1 warehouse
# Scheduling
Scheduling - DAG

# Scheduling
Scheduling - DAG

Optimized Runtime:
3 minutes on 4 clusters
# Scheduling
Scheduling - DAG
- De la Vega & Leuker
“Bin packing can be solved within 1 + ε in linear time
-
Békéski & Galambos
“A 5/4 Linear Time Bin Packing Algorithm”
- Bansal, Caprara, & Sviridenko
“Improved approximation algorithms for multidimensional bin packing problem”
# Scheduling
Scheduling - DAG
We're seeing speed-ups of 30% on some customer workloads.
# Scheduling
Where's the ML?
-
We use LLMs to predict query preprocessing time, runtime, scaling, and concurrency
-
Runtime predictions are obviously helpful for scheduling. They also tell us which queries to incrementalize.
-
# Scheduling
Where's the ML?
-
We use LLMs to predict query preprocessing time, runtime, scaling, and concurrency.
-
Scaling (i.e. runtime on different-size clusters) and concurrency (i.e. runtime with multiple queries per cluster) let us reduce runtime and increase utilization.
-
# Scheduling
Where's the ML?
-
We’re building towards an ML-powered runtime scheduler, not just an offline job scheduler.
-
This will work kind of like the borg scheduler or the kubernetes scheduler, but with ML-inferred resource annotations.
# Snowflake dbt Optimization
Snowflake dbt Optimization Advice
If you only take one thing away from this talk:
Increase your dbt threadcount!
-
dbt’s defaults to 4 threads, which means your jobs will block if 4 queries are outstanding.
- Most jobs benefit from more parallelism.
# Snowflake dbt Optimization
Snowflake dbt Optimization Advice
-
Snowflake will queue work for you.
-
No downside for standard warehouses.
-
Can increase costs for multicluster warehouses
-
-
This can increase speed as much as increasing your warehouse a size, and for free.
Get in touch
ben@espresso.ai
Faster, Cheaper, Easer: Automatically Optimize dbt at Scale
By Espresso AI
Faster, Cheaper, Easer: Automatically Optimize dbt at Scale
- 22