An Introduction to Double / Debiased Machine Learning

A practical guide for robust causal inference with modern data

🏁 Title & Credits

An Introduction to Double/Debiased Machine Learning
Based on Ahrens et al. (2025) and supplemental resources
Slide author: Carlos Mendez   •   Updated: {{DATE}}

🗺️ Expanded Road Map

  1. Motivation – why predictive ML alone cannot deliver causal answers
  2. Two pillars of DML – Neyman orthogonality & cross‑fitting
  3. The algorithm – partitioning, nuisance fitting, solving, standard errors
  4. Diagnostics & model choice – loss metrics, CVC, residual checks, stacking
  5. Three empirical case studies – retirement savings, health shocks, online monopsony
  6. Hands‑on tips – folds, repetitions, transparent reporting
  7. Recent extensions – automatic scores, set identification, HD targets
  8. Key take‑aways & open questions

❓ Motivation: The Inference Gap

  • Objective: quantify causal effects with valid uncertainty
  • Obstacle: nuisance objects—propensity scores, outcome models, fixed effects—are high‑dimensional
  • Naïve ML plug‑ins induce regularization bias and over‑fitting bias
  • Classical asymptotics fail ⇒ CIs undercover, p‑values unreliable

DML repairs this gap without sacrificing ML flexibility.

⚙️ Generic Semi‑Parametric Setup

Let \(W=(Y,D,X)\) where

  • \(Y\): scalar outcome
  • \(D\): treatment or regressors
  • \(X\): high‑dimensional controls

Define target \(\theta_0\in\mathbb R^p\) via
$$\mathbb E\big[m\big(W;\theta_0,\eta_0\big)\big]=0\quad (1)$$ with nuisance bundle \(\eta_0=(r_0,\ell_0,\dots)\).
Modern data ⇒ \(\eta_0\) infinite‑dimensional → tough to estimate.
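One standard instance of (1), written out for the partially linear regression of Example 2 below with \(\eta_0=(\ell_0,r_0)\), \(\ell_0(X)=\mathbb E[Y\mid X]\), \(r_0(X)=\mathbb E[D\mid X]\):
$$\mathbb E\Big[\underbrace{\big(Y-\ell_0(X)-\theta_0\,(D-r_0(X))\big)\,\big(D-r_0(X)\big)}_{m(W;\theta_0,\eta_0)}\Big]=0$$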

🎯 Neyman Orthogonality (Pillar 1)

A score \(\psi\) is orthogonal if $$\mathbb E[\psi(W;\theta_0,\eta_0)]=0$$ and $$\frac{\partial}{\partial\lambda}\,\mathbb E\big[\psi\big(W;\theta_0,\eta_0+\lambda(\tilde\eta-\eta_0)\big)\big]\Big|_{\lambda=0}=0\quad (2)$$

Implications

  • First‑order bias from estimating \(\eta\) cancels
  • Only need \(n^{-1/4}\) convergence of \(\hat\eta\)

Think of a Taylor series: the linear term vanishes, leaving only a small quadratic remainder (spelled out below).
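Spelling the analogy out (a heuristic expansion, suppressing regularity conditions):
$$\mathbb E\big[\psi(W;\theta_0,\hat\eta)\big]=\underbrace{\mathbb E\big[\psi(W;\theta_0,\eta_0)\big]}_{=0}+\underbrace{\frac{\partial}{\partial\lambda}\,\mathbb E\big[\psi\big(W;\theta_0,\eta_0+\lambda(\hat\eta-\eta_0)\big)\big]\Big|_{\lambda=0}}_{=0\ \text{by }(2)}+O\big(\lVert\hat\eta-\eta_0\rVert^2\big)$$
So if \(\lVert\hat\eta-\eta_0\rVert=o_P(n^{-1/4})\), the remainder is \(o_P(n^{-1/2})\) and vanishes after scaling by \(\sqrt n\).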

🔄 Cross‑Fitting (Pillar 2)

“Don’t grade your own homework.”

  1. Randomly split data into \(K\) folds \(I_1,\dots,I_K\)
  2. Fit \(\hat\eta_{-k}\) on the other \(K-1\) folds
  3. Evaluate score on held‑out fold
  4. Aggregate all folds and solve for \(\hat\theta\)

Breaks correlation between nuisance errors and scores → kills over‑fitting bias while using every observation efficiently.
Default: \(K=5\)–10 (larger if cheap). A minimal code sketch follows.
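A sketch of steps 1–4 for the partially linear score of Example 2 below, using scikit‑learn random forests as placeholder nuisance learners (function and variable names are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(y, d, X, K=5, seed=0):
    """Cross-fitted DML for the partially linear model:
    out-of-fold residuals, then a residual-on-residual slope."""
    y_res = np.zeros_like(y, dtype=float)
    d_res = np.zeros_like(d, dtype=float)
    for train, test in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        # Step 2: fit nuisances l(X)=E[Y|X] and r(X)=E[D|X] on the other K-1 folds
        l_hat = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        r_hat = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
        # Step 3: evaluate residuals on the held-out fold only
        y_res[test] = y[test] - l_hat.predict(X[test])
        d_res[test] = d[test] - r_hat.predict(X[test])
    # Step 4: aggregate all folds and solve the orthogonal moment for theta
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = d_res * (y_res - d_res * theta)   # orthogonal score at theta_hat
    J = -np.mean(d_res ** 2)                # Jacobian (1/n) sum d(psi)/d(theta)
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / len(y))
    return theta, se
```

Swapping in other learners (lasso, boosting, neural nets) only changes the two `fit` calls; the cross‑fitting logic is unchanged.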

📐 Asymptotics & Variance

With pillars 1 + 2: $$\sqrt n\,(\hat\theta-\theta_0)\;\rightarrow_d\;\mathcal N(0,V)$$ Sandwich variance is plug‑in: $$\hat V=\hat J^{-1}\Big(\tfrac1n\sum_{i}\psi_i^2\Big)\hat J^{-1},\qquad \hat J=\tfrac1n\sum_i\partial_\theta\psi_i$$ No fragile bootstrap needed.
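A direct sketch of the plug‑in formula for a possibly vector‑valued \(\theta\), where \(\psi_i^2\) becomes the outer product \(\psi_i\psi_i'\) (names are illustrative):

```python
import numpy as np

def sandwich_se(psi, J):
    """Plug-in sandwich standard errors from cross-fitted scores.
    psi : (n, p) array, psi[i] = score of observation i at theta_hat
    J   : (p, p) array, J_hat = (1/n) * sum_i d(psi_i)/d(theta)
    """
    n = psi.shape[0]
    Jinv = np.linalg.inv(J)
    V = Jinv @ (psi.T @ psi / n) @ Jinv.T   # asymptotic variance V_hat
    return np.sqrt(np.diag(V) / n)          # SE of each component of theta_hat
```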

🧮 Example 1 – Average Treatment Effect

Parameter: $$\theta_0=\mathbb E[Y(1)-Y(0)]$$

Doubly‑robust score:

$$\psi_{ATE}=\Big[\tfrac{D}{r(X)}-\tfrac{1-D}{1-r(X)}\Big]\big(Y-\ell(D,X)\big)+\ell(1,X)-\ell(0,X)-\theta$$

  • Consistent if either \(r\) or \(\ell\) correct
  • Simulations: 95 % coverage vs 71 % for naïve IPW
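A sketch of the point estimate and standard error implied by this score, given out‑of‑fold nuisance predictions (the propensity clipping is a practical safeguard, not part of the formula):

```python
import numpy as np

def aipw_ate(y, d, r_hat, l1_hat, l0_hat, clip=0.01):
    """Doubly robust ATE from cross-fitted nuisances.
    r_hat  : out-of-fold propensity estimates r(X) = P(D=1|X)
    l1_hat : out-of-fold predictions of l(1, X) = E[Y|D=1, X]
    l0_hat : out-of-fold predictions of l(0, X) = E[Y|D=0, X]
    """
    r = np.clip(r_hat, clip, 1 - clip)                  # trim extreme propensities
    l_d = np.where(d == 1, l1_hat, l0_hat)              # l(D, X) for each unit
    score = (d / r - (1 - d) / (1 - r)) * (y - l_d) + l1_hat - l0_hat
    theta = score.mean()                                # solves (1/n) sum psi = 0
    se = score.std(ddof=1) / np.sqrt(len(y))            # J = -1 for this score
    return theta, se
```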

🧮 Example 2 – Partially Linear Regression

Model: \(Y=D\theta_0+g(X)+\varepsilon\) $$\psi_{PLR}=\big(D-r(X)\big)\big(Y-\ell(X)\big)-\big(D-r(X)\big)^2\theta$$ Effective with text/image controls when \(g\) is ML‑fitted (RF, XGB, BERT, …)
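Setting the sample average of \(\psi_{PLR}\) to zero yields a residual‑on‑residual OLS slope (the Frisch–Waugh–Lovell idea with ML partialling‑out):
$$\hat\theta=\frac{\tfrac1n\sum_i\big(D_i-\hat r(X_i)\big)\big(Y_i-\hat\ell(X_i)\big)}{\tfrac1n\sum_i\big(D_i-\hat r(X_i)\big)^2}$$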

🧮 Example 3 – Group‑Time ATT (Staggered DiD)

For units first treated in \(g\) and evaluated in \(t\): $$ATT_{g,t}=\mathbb E[\Delta Y_{it}(1)-\Delta Y_{it}(0)\mid G=g]$$ An orthogonal score in the spirit of Callaway & Sant'Anna (2021) gives valid DiD inference with heterogeneous effects.

💼 Application 1 — 401(k) Eligibility

  • 1991 SIPP, \(n\approx 10{,}000\); outcome: net assets
  • Random‑forest nuisances
  • DML: −$6.9 k (SE 1.3 k)
  • IPW & RA diverge → DML preferred

🏥 Application 2 — Hospital Admission

  • HRS, 656 households, 5 waves
  • 15‑fold DML per \((g,t)\), median aggregation
  • Spending jumps ≈ $2.4 k in first post‑admission wave, no pre‑trend

🛒 Application 3 — MTurk Monopsony

  • Outcome: log fill‑time; regressor: log payment
  • 30 k text features + BERT embeddings
  • Nine learners tested; diagnostics via $R^2$ & CVC
  • Stacked DML: elasticity ≈ −0.024 (SE 0.005) → moderate monopsony

🧰 Practical Guide & Rules of Thumb

| Aspect | Good practice | Common pitfall |
|---|---|---|
| Learner set | Mix simple & complex models | One flashy black box |
| Hyper‑tuning | Document the grid; nested CV | Undisclosed manual tuning |
| Folds | 5–10 (up to $n$ folds, i.e. leave‑one‑out, if $n$ is tiny) | Just 2 folds with a minuscule holdout |
| Repetitions | Median of ≥ 5 splits | Reporting a lucky split |
| Diagnostics | Out‑of‑fold loss, CVC, residuals | Only in‑sample $R^2$ |
| Reporting | Share code, seeds, learner weights | Opaque methods section |
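A sketch of the "Repetitions" row: rerun cross‑fitting over several random splits and aggregate by the median. One common rule folds the split‑to‑split spread into the reported standard error; `estimate_once` is any single‑split estimator, e.g. the `dml_plr` sketch above (names are illustrative):

```python
import numpy as np

def repeated_dml(estimate_once, S=5, seed=0):
    """Median aggregation over S repeated cross-fitting splits.
    estimate_once(split_seed) must return (theta_hat, se) for one split."""
    rng = np.random.default_rng(seed)
    results = [estimate_once(int(s)) for s in rng.integers(0, 2**31 - 1, S)]
    thetas = np.array([t for t, _ in results])
    ses = np.array([s for _, s in results])
    theta_med = np.median(thetas)
    # fold split-to-split variability into the reported uncertainty
    se_med = np.sqrt(np.median(ses ** 2 + (thetas - theta_med) ** 2))
    return theta_med, se_med
```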

🌱 Recent Extensions

  • Automatic score generators (Chernozhukov et al., 2022)
  • Set‑identified and sensitivity bounds
  • Orthogonal learners for function targets
  • Higher‑order theory: finite‑sample tweaks
  • Causal AI pipelines with LLM embeddings

🚀 Key Take‑Aways

  • DML = Orthogonality ➕ Cross‑fitting → valid inference with ML
  • Mitigates bias without restrictive functional forms
  • Requires thoughtful learner tuning and transparent diagnostics
  • A robust baseline for modern empirical work

🙏 Thank You!

Questions or feedback? → email / GitHub
Code & resources: repository link
