An Introduction to Double / Debiased Machine Learning

A practical guide for robust causal inference with modern data

🏁 Title & Credits

An Introduction to Double/Debiased Machine Learning
Based on Ahrens et al. (2025) and supplemental resources
Slide author: Carlos Mendez   •   Updated: {{DATE}}

🗺️ Expanded Road Map

  1. Motivation – why predictive ML alone cannot deliver causal answers
  2. Two pillars of DML – Neyman orthogonality & cross‑fitting
  3. The algorithm – partitioning, nuisance fitting, solving, standard errors
  4. Diagnostics & model choice – loss metrics, CVC, residual checks, stacking
  5. Three empirical case studies – retirement savings, health shocks, online monopsony
  6. Hands‑on tips – folds, repetitions, transparent reporting
  7. Recent extensions – automatic scores, set identification, HD targets
  8. Key take‑aways & open questions

❓ Motivation: The Inference Gap

  • Objective: quantify causal effects with valid uncertainty
  • Obstacle: nuisance objects—propensity scores, outcome models, fixed effects—are high‑dimensional
  • Naïve ML plug‑ins induce regularization bias and over‑fitting bias
  • Classical asymptotics fail ⇒ CIs undercover, p‑values unreliable

DML repairs this gap without sacrificing ML flexibility.

⚙️ Generic Semi‑Parametric Setup

Let \(W=(Y,D,X)\) where

  • \(Y\): scalar outcome
  • \(D\): treatment or regressors
  • \(X\): high‑dimensional controls

Define target \(\theta_0\in\mathbb R^p\) via
$$\mathbb E\big[m\big(W;\theta_0,\eta_0\big)\big]=0\quad (1)$$ with nuisance bundle \(\eta_0=(r_0,\ell_0,\dots)\).
Modern data ⇒ \(\eta_0\) infinite‑dimensional → tough to estimate.
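One standard instance of (1), written out for the partially linear regression of Example 2 below with \(\eta_0=(\ell_0,r_0)\), \(\ell_0(X)=\mathbb E[Y\mid X]\), \(r_0(X)=\mathbb E[D\mid X]\):
$$\mathbb E\Big[\underbrace{\big(Y-\ell_0(X)-\theta_0\,(D-r_0(X))\big)\,\big(D-r_0(X)\big)}_{m(W;\theta_0,\eta_0)}\Big]=0$$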

🎯 Neyman Orthogonality (Pillar 1)

A score \(\psi\) is orthogonal if $$\mathbb E[\psi(W;\theta_0,\eta_0)]=0$$ and $$\frac{\partial}{\partial\lambda}\,\mathbb E\big[\psi\big(W;\theta_0,\eta_0+\lambda(\tilde\eta-\eta_0)\big)\big]\Big|_{\lambda=0}=0\quad (2)$$

Implications

  • First‑order bias from estimating \(\eta\) cancels
  • Only need \(n^{-1/4}\) convergence of \(\hat\eta\)

Think of a Taylor series: the linear term vanishes, leaving only a small quadratic remainder (spelled out below).
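Spelling the analogy out (a heuristic expansion, suppressing regularity conditions):
$$\mathbb E\big[\psi(W;\theta_0,\hat\eta)\big]=\underbrace{\mathbb E\big[\psi(W;\theta_0,\eta_0)\big]}_{=0}+\underbrace{\frac{\partial}{\partial\lambda}\,\mathbb E\big[\psi\big(W;\theta_0,\eta_0+\lambda(\hat\eta-\eta_0)\big)\big]\Big|_{\lambda=0}}_{=0\ \text{by }(2)}+O\big(\lVert\hat\eta-\eta_0\rVert^2\big)$$
So if \(\lVert\hat\eta-\eta_0\rVert=o_P(n^{-1/4})\), the remainder is \(o_P(n^{-1/2})\) and vanishes after scaling by \(\sqrt n\).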

🔄 Cross‑Fitting (Pillar 2)

“Don’t grade your own homework.”

  1. Randomly split data into \(K\) folds \(I_1,\dots,I_K\)
  2. Fit \(\hat\eta_{-k}\) on the other \(K-1\) folds
  3. Evaluate score on held‑out fold
  4. Aggregate all folds and solve for \(\hat\theta\)

Breaks correlation between nuisance errors and scores → kills over‑fitting bias while using every observation efficiently.
Default: \(K=5\)–10 (larger if cheap). A minimal code sketch follows.
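A sketch of steps 1–4 for the partially linear score of Example 2 below, using scikit‑learn random forests as placeholder nuisance learners (function and variable names are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(y, d, X, K=5, seed=0):
    """Cross-fitted DML for the partially linear model:
    out-of-fold residuals, then a residual-on-residual slope."""
    y_res = np.zeros_like(y, dtype=float)
    d_res = np.zeros_like(d, dtype=float)
    for train, test in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        # Step 2: fit nuisances l(X)=E[Y|X] and r(X)=E[D|X] on the other K-1 folds
        l_hat = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        r_hat = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
        # Step 3: evaluate residuals on the held-out fold only
        y_res[test] = y[test] - l_hat.predict(X[test])
        d_res[test] = d[test] - r_hat.predict(X[test])
    # Step 4: aggregate all folds and solve the orthogonal moment for theta
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = d_res * (y_res - d_res * theta)   # orthogonal score at theta_hat
    J = -np.mean(d_res ** 2)                # Jacobian (1/n) sum d(psi)/d(theta)
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / len(y))
    return theta, se
```

Swapping in other learners (lasso, boosting, neural nets) only changes the two `fit` calls; the cross‑fitting logic is unchanged.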

📐 Asymptotics & Variance

With pillars 1 + 2: $$\sqrt n\,(\hat\theta-\theta_0)\;\rightarrow_d\;\mathcal N(0,V)$$ Sandwich variance is plug‑in: $$\hat V=\hat J^{-1}\Big(\tfrac1n\sum_{i}\psi_i^2\Big)\hat J^{-1},\qquad \hat J=\tfrac1n\sum_i\partial_\theta\psi_i$$ No fragile bootstrap needed.
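A direct sketch of the plug‑in formula for a possibly vector‑valued \(\theta\), where \(\psi_i^2\) becomes the outer product \(\psi_i\psi_i'\) (names are illustrative):

```python
import numpy as np

def sandwich_se(psi, J):
    """Plug-in sandwich standard errors from cross-fitted scores.
    psi : (n, p) array, psi[i] = score of observation i at theta_hat
    J   : (p, p) array, J_hat = (1/n) * sum_i d(psi_i)/d(theta)
    """
    n = psi.shape[0]
    Jinv = np.linalg.inv(J)
    V = Jinv @ (psi.T @ psi / n) @ Jinv.T   # asymptotic variance V_hat
    return np.sqrt(np.diag(V) / n)          # SE of each component of theta_hat
```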

🧮 Example 1 – Average Treatment Effect

Parameter: $$\theta_0=\mathbb E[Y(1)-Y(0)]$$

Doubly‑robust score:

$$\psi_{ATE}=\Big[\tfrac{D}{r(X)}-\tfrac{1-D}{1-r(X)}\Big]\big(Y-\ell(D,X)\big)+\ell(1,X)-\ell(0,X)-\theta$$

  • Consistent if either \(r\) or \(\ell\) correct
  • Simulations: 95 % coverage vs 71 % for naïve IPW
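A sketch of the point estimate and standard error implied by this score, given out‑of‑fold nuisance predictions (the propensity clipping is a practical safeguard, not part of the formula):

```python
import numpy as np

def aipw_ate(y, d, r_hat, l1_hat, l0_hat, clip=0.01):
    """Doubly robust ATE from cross-fitted nuisances.
    r_hat  : out-of-fold propensity estimates r(X) = P(D=1|X)
    l1_hat : out-of-fold predictions of l(1, X) = E[Y|D=1, X]
    l0_hat : out-of-fold predictions of l(0, X) = E[Y|D=0, X]
    """
    r = np.clip(r_hat, clip, 1 - clip)                  # trim extreme propensities
    l_d = np.where(d == 1, l1_hat, l0_hat)              # l(D, X) for each unit
    score = (d / r - (1 - d) / (1 - r)) * (y - l_d) + l1_hat - l0_hat
    theta = score.mean()                                # solves (1/n) sum psi = 0
    se = score.std(ddof=1) / np.sqrt(len(y))            # J = -1 for this score
    return theta, se
```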

🧮 Example 2 – Partially Linear Regression

Model: \(Y=D\theta_0+g(X)+\varepsilon\) $$\psi_{PLR}=\big(D-r(X)\big)\big(Y-\ell(X)\big)-\big(D-r(X)\big)^2\theta$$ Effective with text/image controls when \(g\) is ML‑fitted (RF, XGB, BERT, …)
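Setting the sample average of \(\psi_{PLR}\) to zero yields a residual‑on‑residual OLS slope (the Frisch–Waugh–Lovell idea with ML partialling‑out):
$$\hat\theta=\frac{\tfrac1n\sum_i\big(D_i-\hat r(X_i)\big)\big(Y_i-\hat\ell(X_i)\big)}{\tfrac1n\sum_i\big(D_i-\hat r(X_i)\big)^2}$$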

🧮 Example 3 – Group‑Time ATT (Staggered DiD)

For units first treated in \(g\) and evaluated in \(t\): $$ATT_{g,t}=\mathbb E[\Delta Y_{it}(1)-\Delta Y_{it}(0)\mid G=g]$$ An orthogonal score in the spirit of Callaway & Sant'Anna (2021) gives valid DiD inference with heterogeneous effects.

💼 Application 1 — 401(k) Eligibility

  • 1991 SIPP, \(n\approx 10{,}000\); outcome: net assets
  • Random‑forest nuisances
  • DML: −$6.9 k (SE 1.3 k)
  • IPW & RA diverge → DML preferred

🏥 Application 2 — Hospital Admission

  • HRS, 656 households, 5 waves
  • 15‑fold DML per \((g,t)\), median aggregation
  • Spending jumps ≈ $2.4 k in first post‑admission wave, no pre‑trend

🛒 Application 3 — MTurk Monopsony

  • Outcome: log fill‑time; regressor: log payment
  • 30 k text features + BERT embeddings
  • Nine learners tested; diagnostics via $R^2$ & CVC
  • Stacked DML: elasticity ≈ −0.024 (SE 0.005) → moderate monopsony

🧰 Practical Guide & Rules of Thumb

| Aspect | Good practice | Common pitfall |
|---|---|---|
| Learner set | Mix simple & complex models | One flashy black box |
| Hyper‑tuning | Document the grid; nested CV | Undisclosed manual tuning |
| Folds | 5–10 (up to $n$ folds, i.e. leave‑one‑out, if $n$ is tiny) | Just 2 folds with a minuscule holdout |
| Repetitions | Median of ≥ 5 splits | Reporting a lucky split |
| Diagnostics | Out‑of‑fold loss, CVC, residuals | Only in‑sample $R^2$ |
| Reporting | Share code, seeds, learner weights | Opaque methods section |
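A sketch of the "Repetitions" row: rerun cross‑fitting over several random splits and aggregate by the median. One common rule folds the split‑to‑split spread into the reported standard error; `estimate_once` is any single‑split estimator, e.g. the `dml_plr` sketch above (names are illustrative):

```python
import numpy as np

def repeated_dml(estimate_once, S=5, seed=0):
    """Median aggregation over S repeated cross-fitting splits.
    estimate_once(split_seed) must return (theta_hat, se) for one split."""
    rng = np.random.default_rng(seed)
    results = [estimate_once(int(s)) for s in rng.integers(0, 2**31 - 1, S)]
    thetas = np.array([t for t, _ in results])
    ses = np.array([s for _, s in results])
    theta_med = np.median(thetas)
    # fold split-to-split variability into the reported uncertainty
    se_med = np.sqrt(np.median(ses ** 2 + (thetas - theta_med) ** 2))
    return theta_med, se_med
```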

🌱 Recent Extensions

  • Automatic score generators (Chernozhukov et al., 2022)
  • Set‑identified and sensitivity bounds
  • Orthogonal learners for function targets
  • Higher‑order theory: finite‑sample tweaks
  • Causal AI pipelines with LLM embeddings

🚀 Key Take‑Aways

  • DML = Orthogonality ➕ Cross‑fitting → valid inference with ML
  • Mitigates bias without restrictive functional forms
  • Requires thoughtful learner tuning and transparent diagnostics
  • A robust baseline for modern empirical work

🙏 Thank You!

Questions or feedback? → email / GitHub
Code & resources: repository link
