A practical guide for robust causal inference with modern data
An Introduction to Double/Debiased Machine Learning
Based on Ahrens et al. (2025) and supplemental resources
Slide author: Your Name • Updated: {{DATE}}
Off‑the‑shelf ML predicts well, but its regularization and over‑fitting bias invalidates naive causal inference; DML repairs this gap without sacrificing ML flexibility.
Let \(W=(Y,D,X)\) where \(Y\) is the outcome, \(D\) the treatment, and \(X\) a (possibly high‑dimensional) vector of controls.
Define target \(\theta_0\in\mathbb R^p\) via
$$\mathbb E\big[m\big(W;\theta_0,\eta_0\big)\big]=0\quad (1)$$ with nuisance bundle \(\eta_0=(r_0,\ell_0,\dots)\).
Modern data ⇒ \(\eta_0\) infinite‑dimensional → tough to estimate.
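Concrete instance of (1), using the partially linear model from a later slide (one orthogonal choice among several; notation \(r_0(X)=\mathbb E[D\mid X]\)):
$$m(W;\theta,\eta)=\big(Y-D\theta-g(X)\big)\big(D-r(X)\big),\qquad \eta_0=(g_0,r_0)$$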
A score \(\psi\) is orthogonal if $$\mathbb E[\psi(W;\theta_0,\eta_0)]=0$$ and $$\frac{\partial}{\partial\lambda}\,\mathbb E\big[\psi\big(W;\theta_0,\eta_0+\lambda(\tilde\eta-\eta_0)\big)\big]\Big|_{\lambda=0}=0\quad (2)$$
Think of a Taylor series: the linear term vanishes, leaving only a tiny quadratic remainder.
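Worked check for the PLR score \(\psi_{PLR}\) introduced on a later slide: perturb the outcome nuisance \(\ell_0\) in direction \(\tilde\ell-\ell_0\) (the \(\theta\)-term does not involve \(\ell\)):
$$\frac{\partial}{\partial\lambda}\,\mathbb E\big[\big(D-r_0(X)\big)\big(Y-\ell_0(X)-\lambda(\tilde\ell-\ell_0)(X)\big)\big]\Big|_{\lambda=0}=-\,\mathbb E\big[\underbrace{\mathbb E[D-r_0(X)\mid X]}_{=0}\,(\tilde\ell-\ell_0)(X)\big]=0$$
because \(r_0(X)=\mathbb E[D\mid X]\); the derivative in the \(r\)-direction vanishes analogously.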
Implications
“Don’t grade your own homework.”
Breaks correlation between nuisance errors and scores → kills over‑fitting bias while using every observation efficiently.
Default: \(K=5\)–\(10\) folds (larger if computation is cheap); a minimal sketch follows.
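A minimal cross‑fitting sketch in Python (illustrative only: the helper name `cross_fit`, the random‑forest learners, and the variable names are assumptions, not code from Ahrens et al. 2025):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fit(X, Y, D, K=5, seed=0):
    """Out-of-fold nuisance predictions: r_hat ~ E[D|X], l_hat ~ E[Y|X]."""
    r_hat, l_hat = np.empty(len(Y)), np.empty(len(Y))
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # "Don't grade your own homework": fit on K-1 folds, predict on the held-out fold
        r_hat[test] = RandomForestRegressor(random_state=seed).fit(X[train], D[train]).predict(X[test])
        l_hat[test] = RandomForestRegressor(random_state=seed).fit(X[train], Y[train]).predict(X[test])
    return r_hat, l_hat
```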
With pillars 1 + 2: $$\sqrt n\,(\hat\theta-\theta_0)\;\rightarrow_d\;\mathcal N(0,V)$$ Sandwich variance is plug‑in: $$\hat V=\hat J^{-1}\Big(\tfrac1n\sum_{i}\psi_i^2\Big)\hat J^{-1},\qquad \hat J=\tfrac1n\sum_i\partial_\theta\psi_i$$ No fragile bootstrap needed.
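Continuing the sketch above for the partially linear score of a later slide (the closed form for \(\hat\theta\) solves \(\tfrac1n\sum_i\psi_i(\theta)=0\); helper names are assumptions):

```python
import numpy as np
from scipy.stats import norm

def plr_inference(Y, D, r_hat, l_hat, alpha=0.05):
    u, v = Y - l_hat, D - r_hat                    # partialled-out residuals
    theta = (v * u).sum() / (v * v).sum()          # root of (1/n) sum_i psi_i(theta) = 0
    psi = v * u - v**2 * theta
    J = -(v**2).mean()                             # J_hat = (1/n) sum_i d(psi_i)/d(theta)
    se = np.sqrt((psi**2).mean() / J**2 / len(Y))  # sandwich: J^-1 (mean psi^2) J^-1, over n
    z = norm.ppf(1 - alpha / 2)
    return theta, se, (theta - z * se, theta + z * se)
```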
Parameter: $$\theta_0=\mathbb E[Y(1)-Y(0)]$$
Doubly‑robust score:
$$\psi_{ATE}=\Big[\tfrac{D}{r(X)}-\tfrac{1-D}{1-r(X)}\Big]\big(Y-\ell(D,X)\big)+\ell(1,X)-\ell(0,X)-\theta$$
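A sketch of the plug‑in step for \(\psi_{ATE}\), taking out‑of‑fold nuisances as inputs (the clipping threshold is an assumption added to guard against extreme propensities):

```python
import numpy as np

def ate_aipw(Y, D, r_hat, l1_hat, l0_hat, clip=0.01):
    """r_hat ~ P(D=1|X); l1_hat/l0_hat ~ E[Y|D=1,X] / E[Y|D=0,X], all out-of-fold."""
    r = np.clip(r_hat, clip, 1 - clip)
    l_obs = np.where(D == 1, l1_hat, l0_hat)        # ell(D, X) at the observed treatment
    psi = (D / r - (1 - D) / (1 - r)) * (Y - l_obs) + l1_hat - l0_hat
    theta = psi.mean()                              # solves E[psi_ATE] = 0 for theta
    se = psi.std(ddof=1) / np.sqrt(len(Y))          # here J = -1, so V_hat = Var(psi)
    return theta, se
```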
Model: \(Y=D\theta_0+g(X)+\varepsilon\) $$\psi_{PLR}=\big(D-r(X)\big)\big(Y-\ell(X)\big)-\big(D-r(X)\big)^2\theta$$ Effective with text/image controls when \(g\) is ML‑fitted (RF, XGBoost, BERT, …)
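End‑to‑end demo chaining the two sketches above (the data‑generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 2000, 0.5
X = rng.normal(size=(n, 10))
D = np.sin(X[:, 0]) + rng.normal(size=n)            # nonlinear r_0(X) = E[D|X]
Y = theta0 * D + X[:, 1] ** 2 + rng.normal(size=n)  # nonlinear g(X)

r_hat, l_hat = cross_fit(X, Y, D)                   # pillar 2: cross-fitting
theta, se, ci = plr_inference(Y, D, r_hat, l_hat)   # pillars 1 + 3: orthogonal score + plug-in SE
print(f"theta_hat = {theta:.3f} (SE {se:.3f}), 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```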
For units first treated in period \(g\) and evaluated in period \(t\): $$ATT_{g,t}=\mathbb E[\Delta Y_{it}(1)-\Delta Y_{it}(0)\mid G=g]$$ The orthogonal (doubly robust) score of Callaway & Sant'Anna (2021) gives valid DiD under staggered adoption with heterogeneous effects.
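For the canonical two‑period, two‑group case the orthogonal score collapses to a compact form (a sketch, with \(p=\mathbb E[D]\), \(\ell_\Delta(X)=\mathbb E[\Delta Y\mid D=0,X]\), \(r(X)=\mathbb P(D=1\mid X)\)):
$$\psi_{ATT}=\frac{D-r(X)}{p\,\big(1-r(X)\big)}\big(\Delta Y-\ell_\Delta(X)\big)-\frac{D}{p}\,\theta$$
The staggered \(ATT_{g,t}\) applies the same logic cell by cell, comparing group \(g\) with never‑ or not‑yet‑treated units.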
| Aspect | Good practice | Common pitfall |
|---|---|---|
| Learner set | Mix simple & complex models | One flashy black box |
| Hyper‑tuning | Document grid; nested CV | Undisclosed manual tuning |
| Folds | 5–10 (up to $K=n$, i.e. leave‑one‑out, if $n$ tiny) | Just 2 folds with a minuscule holdout |
| Repetitions | Median of ≥5 splits | Reporting a lucky split |
| Diagnostics | Out‑of‑fold loss, CVC, residuals | Only in‑sample $R^2$ |
| Reporting | Share code, seeds, learner weights | Opaque methods section |
Questions or feedback? → email / GitHub
Code & resources: repository link