Apple's Face ID. Trained on over a billion images.
Works in the dark.
Works with glasses.
Works with hats.
Ship it!
COVID-19 → everyone wears masks
Your phone keeps asking for your passcode.
Again. And again. And again.
Millions of frustrated users worldwide.
Training
Deployment
Face ID didn't get worse. The input to the model changed.
Apple had to ship iOS 15.4 with "Face ID with a Mask" to fix this.
This is Domain Shift
Also called: distribution shift, dataset shift
Let's look at a trained neural network on MNIST digits
Training samples
Accuracy: 97%
Test samples
Accuracy: 81%
Looks like overfitting... or is it?
Activation maps at each layer — what each filter responds to for a digit "4":
Layer 1 (32 filters): strokes, edges — digit still recognizable
Layer 2 (64 filters): stroke fragments, spatial patterns
Layer 3 (128 filters): abstract, sparse codes
The network builds hierarchical features — not memorizing individual digits.
Training: upright, centered
Test: rotated and shifted
The distribution changed, not the model complexity
This is domain shift, not overfitting
| | Overfitting | Domain Shift |
|---|---|---|
| Symptom | Training ✓, testing ✗ | Training ✓, deployment ✗ |
| Analogy | Studies only past exams, tested on same course | Masters C01, tested on C011 exam |
| Root cause | Model memorizes noise | \(P_{\text{source}}(x,y) \neq P_{\text{target}}(x,y)\) |
| Fixes | Regularization, dropout, simpler model | Different — that's today & next lecture |
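The rotation experiment can be reproduced in miniature with scikit-learn's 8×8 digits dataset standing in for MNIST (a sketch, not the original setup — the 30° angle and logistic-regression model are illustrative choices):

```python
# Train on upright digits, test on rotated ones: P(x) shifts, P(y|x) does not.
import numpy as np
from scipy.ndimage import rotate
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.images, digits.target, random_state=0)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train.reshape(len(X_train), -1), y_train)

# In-distribution test: upright digits
acc_same = clf.score(X_test.reshape(len(X_test), -1), y_test)

# Shifted test: the same digits, rotated 30 degrees
X_rot = np.stack([rotate(img, 30, reshape=False) for img in X_test])
acc_shift = clf.score(X_rot.reshape(len(X_rot), -1), y_test)
print(f"upright: {acc_same:.2f}, rotated: {acc_shift:.2f}")
```

The labels never change — a rotated 4 is still a 4 — yet accuracy drops, mirroring the 97% → 81% gap on the slide.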
Source domain = training | Target domain = deployment
Face ID: \(x\) = face scan, \(y\) = unlock / reject
Randomly sample one unlock attempt...
| \(x\) | \(y\) | \(P_{\text{source}}\) (pre-2020) | \(P_{\text{target}}\) (mid-2020) |
|---|---|---|---|
| full face | unlock | 47% | 8% |
| masked face | unlock | 0.1% | 42% |
An \((x, y)\) pair that was 0.1% of training is now 42% of deployment.¹

¹ Contrived numbers for illustration.
Robot platform and distribution shift settings: same task, different object and station configurations
Tedrake et al. (2025)
Top: real world | Middle & bottom: simulation counterparts
Maddukuri et al. (2025)
Large, regular fields → high accuracy
Tiny, irregular fields → much worse
Wang, Waldner & Lobell (2022)
The inputs change, but the rules don't
Training: one location, one season
Deployment: new locations, new times
A deer in a different place, at a different time, is still a deer.
Same ears. Same legs. Same antlers.
Didn't change
The relationship between "deer features" and "deer"
Changed
The background — location, lighting, vegetation
The rules are the same — only the inputs look different.
\(P(x)\) changes — inputs look different
\(P(y|x)\) stays the same — rules unchanged
The deer detector's knowledge is still valid — a deer is still a deer.
It just needs adjusting for different-looking inputs.
The most common type of domain shift.
\(P(x, y) = P(y|x) \cdot P(x)\)
| | \(P(\text{deer} \mid \text{infrared image})\) | \(P(\text{infrared image})\) | \(P(\text{infrared, deer})\) |
|---|---|---|---|
| Training | 20% | 5% | 1% |
| Deployment | 20% | 90% | 18% |
Same conditional × different input frequency = different joint → \(P_{\text{source}} \neq P_{\text{target}}\)
| Example | \(x\) | \(P(x)\) shifted | \(P(y|x)\) same |
|---|---|---|---|
| Face ID + masks | face scan | masked faces everywhere | your identity didn't change |
| MNIST rotation | digit image | rotated pixels | a 3 is still a 3 |
| Sim-to-real robotics | camera image | sim vs real textures | same grasping task |
| Wildlife monitoring | trail camera | day vs infrared night | a deer is still a deer |
Same rules, different-looking inputs → covariate shift.
Weight each training example by: \(w(x) = \frac{P_{\text{target}}(x)}{P_{\text{source}}(x)}\)
Common in target → boost | Rare in target → reduce
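The weight \(w(x)\) requires density ratios we rarely know in closed form. One common estimator is the density-ratio trick: train a classifier to distinguish source from target inputs and convert its probabilities into weights. A minimal sketch on toy Gaussian data (the distributions and classifier choice are assumptions for illustration):

```python
# Estimate w(x) = P_target(x) / P_source(x) with a domain classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(2000, 1))   # training inputs
X_target = rng.normal(1.5, 1.0, size=(2000, 1))   # unlabeled deployment inputs

# Classifier predicting source (0) vs target (1)
X = np.vstack([X_source, X_target])
d = np.concatenate([np.zeros(2000), np.ones(2000)])
domain_clf = LogisticRegression().fit(X, d)

# With equal-size domains, w(x) = P(target | x) / P(source | x)
p = domain_clf.predict_proba(X_source)[:, 1]
w = p / (1 - p)

# Source points that look target-like get large weights, rare ones small
print(w[X_source[:, 0].argmax()], w[X_source[:, 0].argmin()])
```

These weights can then be passed to most training losses (e.g. `sample_weight` in scikit-learn estimators) so that target-like examples dominate training.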
Reweighting isn't the only approach:
🔧
Finetuning
Continue training on target data
Deer detector + a few night images
🎲
Data Augmentation
Simulate target conditions during training
Synthetically darken daytime photos
⚡
Test-Time Adaptation
Adapt on-the-fly at deployment
No target labels needed
We'll cover these in detail next lecture.
P(y) changed
👩💻 — studying or fun? Same image, different odds.
Same cold detection model, different patient populations:
Children's Hospital
Kids get 6–8 colds per year
MGH (adult general hospital)
Adults get 2–3 colds per year
\(P(x|y)\) same — a cold looks the same in kids and adults
\(P(y)\) changed — prevalence is very different
Also called: prior probability shift, target shift
\(P(x, y) = P(x|y) \cdot P(y)\)
| | \(P(\text{runny nose} \mid \text{cold})\) | \(P(\text{cold})\) | \(P(\text{runny nose, cold})\) |
|---|---|---|---|
| Children's | 80% | 40% | 32% |
| MGH | 80% | 10% | 8% |
Same conditional × different prior = different joint → \(P_{\text{source}} \neq P_{\text{target}}\)
3Blue1Brown, The medical test paradox
Sensitivity = 90%, specificity = 91%. You test positive.
| | \(P(\text{disease})\) | \(P(\text{disease} \mid +)\) |
|---|---|---|
| 10% prevalence | 10% | ~50% |
| 1% prevalence | 1% | ~9% |
10% case: \(\frac{0.9 \cdot 0.10}{0.9 \cdot 0.10 + 0.09 \cdot 0.90} = \frac{0.09}{0.171} \approx 0.53 \approx 50\%\)
1% case: \(\frac{0.9 \cdot 0.01}{0.9 \cdot 0.01 + 0.09 \cdot 0.99} \approx 0.09\)
Same test, same \(P(x|y)\) — the prior \(P(y)\) changes everything.
That's label shift.
\[ P(D|+) = \frac{P(+|D)P(D)}{P(+|D)P(D) + P(+|\neg D)P(\neg D)} \]
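The two worked cases above are easy to check numerically with sensitivity 0.90 and specificity 0.91:

```python
# Posterior P(disease | positive test) via Bayes' rule
def posterior(sens, spec, prior):
    # P(+|D)P(D) / [ P(+|D)P(D) + P(+|not D)P(not D) ]
    return sens * prior / (sens * prior + (1 - spec) * (1 - prior))

print(round(posterior(0.90, 0.91, 0.10), 2))  # → 0.53
print(round(posterior(0.90, 0.91, 0.01), 2))  # → 0.09
```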
Adjust predictions: \(P_{\text{target}}(y|x) \propto P_{\text{source}}(y|x) \cdot \frac{P_{\text{target}}(y)}{P_{\text{source}}(y)}\)
Common in target → boost | Rare in target → reduce
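A sketch of the prior-adjustment rule, using the contrived cold-prevalence numbers from the hospital example (40% at the children's hospital, 10% at MGH):

```python
# Rescale P_source(y|x) by the target/source prior ratio, then renormalize.
import numpy as np

def adjust_priors(probs, source_prior, target_prior):
    adjusted = probs * (np.asarray(target_prior) / np.asarray(source_prior))
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

# Classifier trained at the children's hospital, deployed at MGH
p_model = np.array([[0.6, 0.4]])          # [no cold, cold] for one patient
p_adj = adjust_priors(p_model, source_prior=[0.6, 0.4], target_prior=[0.9, 0.1])
print(p_adj)  # the cold probability drops once the rarer prior is applied
```

No retraining needed — only the class priors, which are often cheap to estimate at the deployment site.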
MNIST example demo
Covariate Shift
\(P(x)\) changes
\(P(y|x)\) stays same
World looks different, works the same
e.g. deer detector, sim-to-real, Face ID
Fix: reweight, finetune, augment
Label Shift
\(P(y)\) changes
\(P(x|y)\) stays same
World works the same, mix is different
e.g. cold diagnosis, MNIST class bias
Fix: adjust priors, rebalance
Identifying which pattern you face determines the fix.
Spam filter: trained on Gmail, deployed on corporate email
→ Covariate — different writing style, same spam vs not-spam
Sentiment analysis: trained on restaurant reviews, deployed on electronics reviews
→ Covariate — different vocabulary, same positive/negative meaning
Disease screening: trained in flu season, deployed in summer
→ Label shift — same symptoms, way fewer sick patients
Satellite imagery: trained in France, deployed in India
→ Both! — different terrain (covariate) and different crop mix (label)
Credit scoring model trained at a major bank in New York, deployed at a regional bank in rural Midwest.
What might be different?
Both types! — Different applicant features AND different default rates
When we encounter a new deployment scenario, ask:
Probe: train \(d(x)\) to predict source vs target.
Near chance (~50%) → little detectable input shift; high accuracy → covariate shift signal.
Probe: compare \(\hat{P}(y)\) across source and target.
Similar frequencies → no label shift; large gaps → label shift signal.
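Both probes can be sketched on toy data (the Gaussian inputs with a shifted mean, and the classifier choice, are assumptions for illustration):

```python
# Probe 1: domain classifier for P(x) shift. Probe 2: label frequencies for P(y) shift.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(1000, 5))
X_tgt = rng.normal(0.8, 1.0, size=(1000, 5))   # shifted inputs

# Probe 1: can d(x) tell source from target? AUC near 0.5 means no detectable shift.
X = np.vstack([X_src, X_tgt])
d = np.concatenate([np.zeros(1000), np.ones(1000)])
auc = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                      X, d, cv=3, scoring="roc_auc").mean()
print(f"domain-classifier AUC: {auc:.2f}")

# Probe 2: compare class frequencies (needs target labels or estimates of them)
y_src = rng.choice(2, size=1000, p=[0.7, 0.3])
y_tgt = rng.choice(2, size=1000, p=[0.4, 0.6])
print(np.bincount(y_src) / 1000, np.bincount(y_tgt) / 1000)
```

Here probe 1 reports an AUC well above chance (covariate-shift signal) and probe 2 shows a large frequency gap (label-shift signal) — a mixed shift, like the Waymo case that follows.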
Flat, wide, sunny → hills, fog, narrow streets
| Question | Answer | Diagnosis |
|---|---|---|
| Inputs look different? | Yes — fog, hills, narrow streets | Covariate shift |
| Class frequencies different? | Yes — more cyclists, pedestrians | Label shift |
| Rules changed? | No — a stop sign is still a stop sign | No concept shift |
Domain-classifier check: source-vs-target frame classifier performs well → measurable input shift.
Both types — real-world shifts are often mixed.
After years of adaptation, Waymo now operates commercially in SF.
Domain shift is not overfitting — the world changed, not your model's complexity.
Covariate shift (\(P(x)\) changes, \(P(y|x)\) same) is the most common pattern — fix with reweighting, finetuning, or augmentation.
Label shift (\(P(y)\) changes, \(P(x|y)\) same) requires adjusting class priors, not model architecture.
Diagnosis before treatment — identify what changed, then pick the matching fix.
Models are frozen snapshots of one distribution; the world keeps moving, so anticipate and adapt.
| Shift Type | What Changes? | What Stays Same? | Fix Strategy | Today's Examples |
|---|---|---|---|---|
| Covariate | \(P(x)\) | \(P(y|x)\) | Reweight, finetune, augment | Face ID, deer detector, sim-to-real |
| Label | \(P(y)\) | \(P(x|y)\) | Adjust priors, rebalance | Cold diagnosis, MNIST class bias |
| Concept (preview) | \(P(y|x)\) | Nothing guaranteed | Retrain, continual learning | Next lecture! |
Now that we can diagnose domain shift... how do we fix it?