An Introduction to Double / Debiased Machine Learning
A practical guide for robust causal inference with modern data

🏁 Title & Credits
An Introduction to Double/Debiased Machine Learning
Based on Ahrens et al. (2025) and supplemental resources
Slide author: Carlos Mendez • Updated: {{DATE}}

🗺️ Expanded Road Map
- Motivation – why predictive ML alone cannot deliver causal answers
- Two pillars of DML – Neyman orthogonality & cross‑fitting
- The algorithm – partitioning, nuisance fitting, solving, standard errors
- Diagnostics & model choice – loss metrics, CVC, residual checks, stacking
- Three empirical case studies – retirement savings, health shocks, online monopsony
- Hands‑on tips – folds, repetitions, transparent reporting
- Recent extensions – automatic scores, set identification, high‑dimensional targets
- Key take‑aways & open questions
❓ Motivation: The Inference Gap
- Objective: quantify causal effects with valid uncertainty
- Obstacle: nuisance objects—propensity scores, outcome models, fixed effects—are high‑dimensional
- Naïve ML plug‑ins induce regularization bias and over‑fitting bias
- Classical asymptotics fail ⇒ CIs undercover, p‑values unreliable
DML repairs this gap without sacrificing ML flexibility.
⚙️ Generic Semi‑Parametric Setup
Let \(W=(Y,D,X)\) where
- \(Y\): scalar outcome
- \(D\): treatment or regressors
- \(X\): high‑dimensional controls
Define target \(\theta_0\in\mathbb R^p\) via
$$\mathbb E\big[m\big(W;\theta_0,\eta_0\big)\big]=0\quad (1)$$ with nuisance bundle \(\eta_0=(r_0,\ell_0,\dots)\).
Modern data ⇒ \(\eta_0\) is infinite‑dimensional and tough to estimate.
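As a concrete instance of (1), previewing Example 2 below, the partially linear model bundles two conditional means into \(\eta\) (a sketch; the functional form shown here matches the PLR score used later):
$$m(W;\theta,\eta)=\big(D-r(X)\big)\big(Y-\ell(X)\big)-\big(D-r(X)\big)^2\,\theta,\qquad \eta=(r,\ell),$$
$$r_0(X)=\mathbb E[D\mid X],\qquad \ell_0(X)=\mathbb E[Y\mid X],\qquad \mathbb E\big[m(W;\theta_0,\eta_0)\big]=0.$$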
🎯 Neyman Orthogonality (Pillar 1)
A score \(\psi\) is orthogonal if $$\mathbb E[\psi(W;\theta_0,\eta_0)]=0$$ and $$\frac{\partial}{\partial\lambda}\,\mathbb E\big[\psi\big(W;\theta_0,\eta_0+\lambda(\tilde\eta-\eta_0)\big)\big]\Big|_{\lambda=0}=0\quad (2)$$ for all candidate nuisances \(\tilde\eta\).
- First‑order bias from estimating \(\eta\) cancels
- Only need \(n^{-1/4}\) convergence of \(\hat\eta\)
Think of a Taylor series: the linear term vanishes, leaving only a tiny quadratic remainder.
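A one‑line sketch of that Taylor intuition (informal, assuming enough smoothness):
$$\mathbb E\big[\psi(W;\theta_0,\hat\eta)\big]\;\approx\;\underbrace{\mathbb E\big[\psi(W;\theta_0,\eta_0)\big]}_{=0}\;+\;\underbrace{\partial_\eta\,\mathbb E\big[\psi(W;\theta_0,\eta_0)\big]\big[\hat\eta-\eta_0\big]}_{=0\ \text{by (2)}}\;+\;O\!\big(\lVert\hat\eta-\eta_0\rVert^2\big),$$
so the remainder is \(O(\lVert\hat\eta-\eta_0\rVert^2)=o(n^{-1/2})\) once \(\hat\eta\) converges at the \(n^{-1/4}\) rate.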
🔄 Cross‑Fitting (Pillar 2)
“Don’t grade your own homework.”
- Randomly split data into \(K\) folds \(I_1,\dots,I_K\)
- Fit \(\hat\eta_{-k}\) on the other \(K-1\) folds
- Evaluate score on held‑out fold
- Aggregate all folds and solve for \(\hat\theta\)
Breaks correlation between nuisance errors and scores → kills over‑fitting bias while using every observation efficiently.
Default: \(K=5\)–10 (larger if computation is cheap); see the sketch below.
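A minimal cross‑fitting sketch in Python for the partially linear case, assuming scikit‑learn style learners; the function and variable names are illustrative, not the paper's code.

```python
# Minimal cross-fitting sketch (illustrative only): out-of-fold nuisance predictions
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fit_residuals(y, d, X, K=5, seed=0):
    """Return out-of-fold residuals Y - l_hat(X) and D - r_hat(X)."""
    y_res = np.empty_like(y, dtype=float)
    d_res = np.empty_like(d, dtype=float)
    for train, test in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        # Nuisances are fit on the other K-1 folds only ("don't grade your own homework")
        l_hat = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        r_hat = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
        # Scores are evaluated on the held-out fold
        y_res[test] = y[test] - l_hat.predict(X[test])
        d_res[test] = d[test] - r_hat.predict(X[test])
    return y_res, d_res
```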
📐 Asymptotics & Variance
With pillars 1 + 2: $$\sqrt n\,(\hat\theta-\theta_0)\;\rightarrow_d\;\mathcal N(0,V)$$ Sandwich variance is plug‑in: $$\hat V=\hat J^{-1}\Big(\tfrac1n\sum_{i}\psi_i^2\Big)\hat J^{-1},\qquad \hat J=\tfrac1n\sum_i\partial_\theta\psi_i$$ No fragile bootstrap needed.
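The sandwich formula is easy to compute by hand. A sketch using the out‑of‑fold residuals produced above and the PLR score \(\psi_{PLR}\) of Example 2 below (illustrative, not the paper's implementation):

```python
# Plug-in sandwich standard error for the PLR score, from out-of-fold residuals
import numpy as np

def plr_estimate(y_res, d_res):
    n = len(y_res)
    theta = np.sum(d_res * y_res) / np.sum(d_res**2)   # solves (1/n) * sum_i psi_i = 0
    psi = d_res * (y_res - d_res * theta)              # PLR score evaluated at theta_hat
    J = -np.mean(d_res**2)                             # J_hat = (1/n) * sum_i d(psi_i)/d(theta)
    V = np.mean(psi**2) / J**2                         # sandwich: J^{-1} (mean psi^2) J^{-1}
    return theta, np.sqrt(V / n)                       # point estimate and standard error
```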
🧮 Example 1 – Average Treatment Effect
Parameter: $$\theta_0=\mathbb E[Y(1)-Y(0)]$$
Doubly‑robust score:
$$\psi_{ATE}=\Big[\tfrac{D}{r(X)}-\tfrac{1-D}{1-r(X)}\Big]\big(Y-\ell(D,X)\big)+\ell(1,X)-\ell(0,X)-\theta$$
- Consistent if either \(r\) or \(\ell\) is correctly specified
- Simulations: 95 % coverage vs 71 % for naïve IPW
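A minimal numpy sketch of the estimate implied by \(\psi_{ATE}\), assuming cross‑fitted (out‑of‑fold) nuisance predictions are already available; the argument names are assumptions for illustration. Because \(\partial_\theta\psi_{ATE}=-1\), the sandwich variance reduces to the sample variance of the score.

```python
# AIPW / doubly robust ATE sketch from cross-fitted nuisance predictions
import numpy as np

def aipw_ate(y, d, r_hat, ell1_hat, ell0_hat, clip=1e-2):
    r = np.clip(r_hat, clip, 1 - clip)                    # trim extreme propensities
    ell_d = np.where(d == 1, ell1_hat, ell0_hat)          # l(D, X) at the observed D
    weight = d / r - (1 - d) / (1 - r)
    scores = weight * (y - ell_d) + ell1_hat - ell0_hat   # psi_ATE plus theta
    theta = scores.mean()                                 # solves (1/n) * sum_i psi_i = 0
    se = scores.std(ddof=1) / np.sqrt(len(y))             # Jacobian is -1, so SE = sd(score)/sqrt(n)
    return theta, se
```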
🧮 Example 2 – Partially Linear Regression
Model: \(Y=D\theta_0+g(X)+\varepsilon\) $$\psi_{PLR}=\big(D-r(X)\big)\big(Y-\ell(X)\big)-\big(D-r(X)\big)^2\theta$$ Effective with text/image controls when \(g\) is ML‑fitted (RF, XGB, BERT …)
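A toy end‑to‑end usage example for the PLR, reusing the `cross_fit_residuals` and `plr_estimate` sketches from the earlier slides; the data‑generating process is invented purely for illustration.

```python
# Toy check of the PLR sketches on simulated data (true effect = 0.5)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
d = X[:, 0] + rng.normal(size=2000)                     # treatment depends on the controls
y = 0.5 * d + np.sin(X[:, 0]) + rng.normal(size=2000)   # nonlinear confounding via g(X)

theta_hat, se = plr_estimate(*cross_fit_residuals(y, d, X, K=5))
print(f"theta_hat = {theta_hat:.3f} (SE {se:.3f})")     # should land near 0.5
```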
🧮 Example 3 – Group‑Time ATT (Staggered DiD)
For units first treated in \(g\) and evaluated at time \(t\): $$ATT_{g,t}=\mathbb E[\Delta Y_{it}(1)-\Delta Y_{it}(0)\mid G=g]$$ An orthogonal score based on Callaway & Sant'Anna (2021) gives valid DiD inference with heterogeneous effects.
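For intuition, a compact sketch of a doubly robust score of this kind, restricted to the comparison of group \(g\) with never‑treated units (the notation and conditioning choices are shorthand assumed here, not necessarily the paper's):
$$\psi_{g,t}=\Big[\frac{G_g}{\mathbb E[G_g]}-\frac{r(X)\,C}{\mathbb E[G_g]\,\big(1-r(X)\big)}\Big]\big(\Delta Y_t-m(X)\big)-\frac{G_g}{\mathbb E[G_g]}\,\theta,$$
where \(\Delta Y_t=Y_t-Y_{g-1}\), \(G_g\) flags units first treated in \(g\), \(C\) flags never‑treated units, \(r(X)=\Pr(G_g=1\mid X,\,G_g+C=1)\), and \(m(X)=\mathbb E[\Delta Y_t\mid X,\,C=1]\).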
💼 Application 1 — 401(k) Eligibility
- 1991 SIPP, \(n\approx 10{,}000\); outcome: net assets
- Random‑forest nuisances
- DML: −$6.9 k (SE 1.3 k)
- IPW & RA diverge → DML preferred
🏥 Application 2 — Hospital Admission
- HRS, 656 households, 5 waves
- 15‑fold DML per \((g,t)\), median aggregation
- Spending jumps ≈ $2.4 k in first post‑admission wave, no pre‑trend
🛒 Application 3 — MTurk Monopsony
- Outcome: log fill‑time; regressor: log payment
- 30 k text features + BERT embeddings
- Nine learners tested; diagnostics via $R^2$ & CVC
- Stacked DML: elasticity ≈ −0.024 (SE 0.005) → moderate monopsony
🧰 Practical Guide & Rules of Thumb
| Aspect | Good practice | Common pitfall |
|---|---|---|
| Learner set | Mix simple & complex models | One flashy black box |
| Hyper‑tuning | Document grid; nested CV | Undisclosed manual tuning |
| Folds | 5–10 (up to $K=n$, i.e. leave‑one‑out, for tiny $n$) | Just 2 folds with a minuscule holdout |
| Repetitions | Median of ≥5 splits | Reporting a lucky split |
| Diagnostics | Out‑of‑fold loss, CVC, residuals | Only in‑sample $R^2$ |
| Reporting | Share code, seeds, learner weights | Opaque methods section |
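To make the "Repetitions" row concrete, one common aggregation rule takes the median point estimate across sample splits and inflates each split's variance by its squared deviation from that median. A sketch reusing the earlier PLR helpers (illustrative; names are assumptions):

```python
# Median aggregation over repeated sample splits
import numpy as np

def repeated_dml(y, d, X, n_rep=5, K=5):
    # One full cross-fitted DML estimate per random split
    estimates = [plr_estimate(*cross_fit_residuals(y, d, X, K=K, seed=s)) for s in range(n_rep)]
    thetas, ses = map(np.array, zip(*estimates))
    theta_med = np.median(thetas)
    # Median-aggregated SE that accounts for split-to-split variation
    se_med = np.sqrt(np.median(ses**2 + (thetas - theta_med)**2))
    return theta_med, se_med
```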
🌱 Recent Extensions
- Automatic score generators (Chernozhukov et al., 2022)
- Set‑identified and sensitivity bounds
- Orthogonal learners for function targets
- Higher‑order theory: finite‑sample tweaks
- Causal AI pipelines with LLM embeddings
🚀 Key Take‑Aways
- DML = Orthogonality ➕ Cross‑fitting → valid inference with ML
- Mitigates bias without restrictive functional forms
- Requires thoughtful learner tuning and transparent diagnostics
- A robust baseline for modern empirical work
🙏 Thank You!
Questions or feedback? → email / GitHub
Code & resources: repository link