Shen Shen

Understanding Domain Shift

When the World Changes

Face ID: A Success Story

Apple's Face ID. Trained on over a billion images.

 

Works in the dark.

Works with glasses.

Works with hats.

 

Ship it!

✓ Unlocked

Then 2020 Happened

✗ Locked

COVID-19 → everyone wears masks

Your phone keeps asking for your passcode.

 

Again. And again. And again.

Millions of frustrated users worldwide.

Face ID Meets Masks

Training

Deployment

Face ID didn't get worse. The input to the model changed.

Apple had to ship iOS 15.4 with "Face ID with a Mask" to fix this.

This is Domain Shift

Also called: distribution shift, dataset shift

Wait... Is This Just Overfitting?

Let's look at a trained neural network on MNIST digits

Training samples

Accuracy: 97%

Test samples

Accuracy: 81%

Looks like overfitting... or is it?

What the MNIST Network Learned

Activation maps at each layer — what each filter responds to for a digit "4":

Layer 1 (32 filters): strokes, edges — digit still recognizable

Layer 2 (64 filters): stroke fragments, spatial patterns

Layer 3 (128 filters): abstract, sparse codes

The network builds hierarchical features — not memorizing individual digits.

Training ≠ Deployment

Training: upright, centered

Test: rotated and shifted

The distribution changed, not the model complexity

This is domain shift, not overfitting

Domain Shift

Aspect  |  Overfitting  |  Domain Shift
Symptom  |  Training ✓, testing ✗  |  Training ✓, testing ✗
Analogy  |  Studies only past exams, tested on same course  |  Masters C01, tested on C011 exam
Root cause  |  Model memorizes noise  |  \(P_{\text{source}}(x,y) \neq P_{\text{target}}(x,y)\)
Fixes  |  Regularization, dropout, simpler model  |  Different: that's today & next lecture

Source domain = training  |  Target domain = deployment

What Does \(P_{\text{source}} \neq P_{\text{target}}\) Look Like?

Face ID:  \(x\) = face scan,  \(y\) = unlock / reject

Randomly sample one unlock attempt...

\(x\)  |  \(y\)  |  \(P_{\text{source}}\) (pre-2020)  |  \(P_{\text{target}}\) (mid-2020)
full face  |  unlock  |  47%  |  8%
masked face  |  unlock  |  0.1%  |  42%

An \((x, y)\) pair that was 0.1% of training is now 42% of deployment. (Contrived numbers for illustration.)

Domain Shift is Everywhere

Robot platform and distribution shift settings: same task, different object and station configurations

Tedrake et al. (2025)

The Sim-to-Real Gap

Top: real world  |  Middle & bottom: simulation counterparts

Maddukuri et al. (2025)

Satellite Imagery: Crop Classification

Large, regular fields → high accuracy

Tiny, irregular fields → much worse

Wang, Waldner & Lobell (2022)

Pattern 1: Covariate Shift

The inputs change, but the rules don't

Covariate Shift: Wildlife Monitoring

Training: one location, one season

Deployment: new locations, new times

Same Deer, Different Pixels

A deer in a different place, at a different time, is still a deer.

Same ears. Same legs. Same antlers.

Didn't change

The relationship between "deer features" and "deer"

Changed

The background — location, lighting, vegetation

The rules are the same — only the inputs look different.

Covariate Shift: Definition

\(P(x)\) changes — inputs look different

\(P(y|x)\) stays the same — rules unchanged

The deer detector's knowledge is still valid — a deer is still a deer.
It just needs adjusting for different-looking inputs.

The most common type of domain shift.

Why the Joint Changes: \(P(x)\) Shifts

\(P(x, y) = P(y|x) \cdot P(x)\)

Domain  |  \(P(\text{deer} \mid \text{infrared image})\)  |  \(P(\text{infrared image})\)  |  \(P(\text{infrared, deer})\)
Training  |  20%  |  5%  |  1%
Deployment  |  20%  |  90%  |  18%

Same conditional × different input frequency = different joint → \(P_{\text{source}} \neq P_{\text{target}}\)
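
As a quick check of the table's arithmetic, the minimal Python sketch below multiplies the shared conditional by each domain's input frequency. The variable names are illustrative and not part of the lecture.

```python
# Numbers from the table above: P(deer | infrared) is identical in both
# domains; only the frequency of infrared images differs.
p_deer_given_infrared = 0.20
p_infrared = {"training": 0.05, "deployment": 0.90}

# Product rule: P(x, y) = P(y | x) * P(x)
joint = {domain: p_deer_given_infrared * p_x for domain, p_x in p_infrared.items()}

for domain, p in joint.items():
    print(f"{domain}: P(infrared, deer) = {p:.2f}")
```

The conditional is a single constant here, so any difference in the joint comes entirely from \(P(x)\).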

Covariate Shift: You've Seen It

Example  |  \(x\)  |  \(P(x)\) shifted  |  \(P(y|x)\) same
Face ID + masks  |  face scan  |  masked faces everywhere  |  your identity didn't change
MNIST rotation  |  digit image  |  rotated pixels  |  a 3 is still a 3
Sim-to-real robotics  |  camera image  |  sim vs real textures  |  same grasping task
Wildlife monitoring  |  trail camera  |  day vs infrared night  |  a deer is still a deer

Same rules, different-looking inputs → covariate shift.

Fixing Covariate Shift

Weight each training example by:  \(w(x) = \frac{P_{\text{target}}(x)}{P_{\text{source}}(x)}\)

Common in target → boost  |  Rare in target → reduce
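
A minimal sketch of importance reweighting over discrete input types, in Python. The infrared frequencies reuse the 5% and 90% figures from the earlier table. Treating daytime images as the only other input type, and the per-example losses, are assumptions made purely for illustration. In practice the ratio \(P_{\text{target}}(x) / P_{\text{source}}(x)\) is not known and has to be estimated.

```python
def importance_weights(p_source, p_target):
    """w(x) = P_target(x) / P_source(x) for each discrete input type x."""
    return {x: p_target[x] / p_source[x] for x in p_source}


# Infrared frequencies come from the earlier table; "daytime" as the only
# other input type is an assumption for this sketch.
p_source = {"daytime": 0.95, "infrared": 0.05}
p_target = {"daytime": 0.10, "infrared": 0.90}
w = importance_weights(p_source, p_target)

# Hypothetical per-example losses on a 20-image training set that matches
# the source mix: 19 daytime images (low loss), 1 infrared image (high loss).
train = [("daytime", 0.1)] * 19 + [("infrared", 0.8)]

unweighted = sum(loss for _, loss in train) / len(train)
weighted = sum(w[x] * loss for x, loss in train) / len(train)

print(f"w(infrared) = {w['infrared']:.1f}, w(daytime) = {w['daytime']:.3f}")
print(f"unweighted loss = {unweighted:.3f}, reweighted loss = {weighted:.3f}")
```

Under these assumptions the reweighted training loss equals the expected loss under the target mix, 0.10 × 0.1 + 0.90 × 0.8 = 0.73, while the unweighted average (0.135) understates it.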

Other Fixes (Preview)

Reweighting isn't the only approach:

🔧 Finetuning: continue training on target data (e.g. deer detector + a few night images)

🎲 Data Augmentation: simulate target conditions during training (e.g. synthetically darken daytime photos)

Test-Time Adaptation: adapt on-the-fly at deployment (no target labels needed)

We'll cover these in detail next lecture.

Pattern 2: Label Shift

Semester:  📚 40%  |  😴 30%  |  🏃 15%  |  🥳 15%

\(P(y)\) changed

Summer:  📚 10%  |  😴 25%  |  🏃 30%  |  🥳 35%

👩‍💻 — studying or fun? Same image, different odds.

Label Shift: Cold Diagnosis

Same cold detection model, different patient populations:

Children's Hospital

Kids get 6–8 colds per year

MGH (adult general hospital)

Adults get 2–3 colds per year

\(P(x|y)\) same — a cold looks the same in kids and adults

\(P(y)\) changed — prevalence is very different

Also called: prior probability shift, target shift

Why the Joint Changes: \(P(y)\) Shifts

\(P(x, y) = P(x|y) \cdot P(y)\)

Hospital  |  \(P(\text{runny nose} \mid \text{cold})\)  |  \(P(\text{cold})\)  |  \(P(\text{runny nose, cold})\)
Children's  |  80%  |  40%  |  32%
MGH  |  80%  |  10%  |  8%

Same conditional × different prior = different joint → \(P_{\text{source}} \neq P_{\text{target}}\)

Label Shift as Bayes' Rule

3Blue1Brown, The medical test paradox

Sensitivity = 90%, specificity = 91%. You test positive.

Case  |  \(P(\text{disease})\)  |  \(P(\text{disease} \mid +)\)
10% prevalence  |  10%  |  ~50%
1% prevalence  |  1%  |  ~9%

10% case: \(\frac{0.9 \cdot 0.10}{0.9 \cdot 0.10 + 0.09 \cdot 0.90} = \frac{0.09}{0.171} \approx 0.53 \approx 50\%\)

1% case: \(\frac{0.9 \cdot 0.01}{0.9 \cdot 0.01 + 0.09 \cdot 0.99} \approx 0.09\)

Same test, same \(P(x|y)\) — the prior \(P(y)\) changes everything.

That's label shift.

\[ P(D|+) = \frac{P(+|D)P(D)}{P(+|D)P(D) + P(+|\neg D)P(\neg D)} \]
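
The two prevalence cases can be reproduced with a direct implementation of the formula above. This is a small Python sketch; the function name `posterior` is illustrative.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prior                    # P(+ | D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)


# Same test (sensitivity 90%, specificity 91%), two different priors.
for prior in (0.10, 0.01):
    p = posterior(prior, sensitivity=0.90, specificity=0.91)
    print(f"prevalence {prior:.0%}: P(disease | +) = {p:.3f}")
```

Only the `prior` argument changes between the two calls, which is the label-shift point: the test's behavior is fixed, yet the posterior moves from about 0.53 to about 0.09.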

Fixing Label Shift

Adjust predictions:  \(P_{\text{target}}(y|x) \propto P_{\text{source}}(y|x) \cdot \frac{P_{\text{target}}(y)}{P_{\text{source}}(y)}\)

Common in target → boost  |  Rare in target → reduce
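
A minimal Python sketch of this prior adjustment. The priors reuse the 40% and 10% cold rates from the hospital example, with Children's assumed to be the source domain. The model output for the single patient is hypothetical.

```python
def adjust_for_label_shift(p_source_y_given_x, p_source_y, p_target_y):
    """Scale each class score by P_target(y) / P_source(y), then renormalize."""
    unnormalized = {
        y: p_source_y_given_x[y] * p_target_y[y] / p_source_y[y]
        for y in p_source_y_given_x
    }
    total = sum(unnormalized.values())
    return {y: v / total for y, v in unnormalized.items()}


# Class priors: source = Children's (assumed), target = MGH.
p_source_y = {"cold": 0.40, "no cold": 0.60}
p_target_y = {"cold": 0.10, "no cold": 0.90}

# Hypothetical prediction from a model trained on the source population.
prediction = {"cold": 0.70, "no cold": 0.30}

adjusted = adjust_for_label_shift(prediction, p_source_y, p_target_y)
print({y: round(p, 2) for y, p in adjusted.items()})
```

With these numbers the "cold" probability drops from 0.70 to 0.28. The renormalization step supplies the constant hidden by the \(\propto\) sign.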

MNIST example  demo

Two Patterns, One Framework

Covariate Shift

\(P(x)\) changes

\(P(y|x)\) stays same

World looks different, works the same

e.g. deer detector, sim-to-real, Face ID

Fix: reweight, finetune, augment

Label Shift

\(P(y)\) changes

\(P(x|y)\) stays same

World works the same, mix is different

e.g. cold diagnosis, MNIST class bias

Fix: adjust priors, rebalance

Identifying which pattern → determines the fix.

Which Type of Shift?

Spam filter: trained on Gmail, deployed on corporate email

Covariate — different writing style, same spam vs not-spam

Sentiment analysis: trained on restaurant reviews, deployed on electronics reviews

Covariate — different vocabulary, same positive/negative meaning

Disease screening: trained in flu season, deployed in summer

Label shift — same symptoms, way fewer sick patients

Satellite imagery: trained in France, deployed in India

Both! — different terrain (covariate) and different crop mix (label)

Your Turn: Loan Approval Model

Credit scoring model trained at a major bank in New York, deployed at a regional bank in rural Midwest.

What might be different?

  • Applicant profiles (income levels, employment types)
  • Default rates (different economic conditions)

Both types! — Different applicant features AND different default rates

The Diagnostic Framework

When we encounter a new deployment scenario, ask:

  1. Do the inputs look different? → Check for covariate shift
    • Can a domain classifier separate source vs target? → measurable input shift

Probe: train \(d(x)\) to predict source vs target.
Near chance (~50%) → little detectable input shift; high accuracy → covariate shift signal.

  2. Are the class frequencies different? → Check for label shift
    • Compare class distributions source vs target → class imbalance signals label shift

Probe: compare \(\hat{P}(y)\) across source and target.
Similar frequencies → no label shift; large gaps → label shift signal.

  3. Did the underlying rules change? → Might need to retrain
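
The two probes in this checklist can be sketched in plain Python. Everything here is an illustrative assumption: the data is synthetic, a single "brightness" feature stands in for the image, a one-feature threshold classifier stands in for \(d(x)\), and the label counts simply mirror the 40% and 10% cold rates used earlier. The label probe also assumes target labels, or estimates of them, are available.

```python
import random
from collections import Counter


def stump_accuracy(xs, labels, t, direction):
    """Accuracy of the rule: predict 1 when direction * (x - t) > 0."""
    preds = [int(direction * (x - t) > 0) for x in xs]
    return sum(p == y for p, y in zip(preds, labels)) / len(xs)


def fit_stump(xs, labels):
    """Pick the threshold and direction with the best training accuracy."""
    best_acc, best_t, best_dir = -1.0, xs[0], 1
    for t in sorted(set(xs)):
        for direction in (1, -1):
            acc = stump_accuracy(xs, labels, t, direction)
            if acc > best_acc:
                best_acc, best_t, best_dir = acc, t, direction
    return best_t, best_dir


def domain_probe(source_xs, target_xs, rng):
    """Held-out accuracy of a source (0) vs target (1) classifier."""
    data = [(x, 0) for x in source_xs] + [(x, 1) for x in target_xs]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    t, d = fit_stump([x for x, _ in train], [y for _, y in train])
    return stump_accuracy([x for x, _ in test], [y for _, y in test], t, d)


def label_frequencies(labels):
    """Empirical class frequencies, i.e. an estimate of P(y)."""
    return {y: n / len(labels) for y, n in Counter(labels).items()}


def max_frequency_gap(p, q):
    """Largest absolute difference in class frequency between two domains."""
    return max(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in set(p) | set(q))


rng = random.Random(0)
day = [rng.gauss(0.6, 0.1) for _ in range(400)]       # source brightness
more_day = [rng.gauss(0.6, 0.1) for _ in range(400)]  # same distribution
night = [rng.gauss(0.3, 0.1) for _ in range(400)]     # shifted inputs

acc_same = domain_probe(day, more_day, rng)   # should sit near chance
acc_shift = domain_probe(day, night, rng)     # should be well above chance
print(f"probe accuracy, no shift: {acc_same:.2f}; with shift: {acc_shift:.2f}")

source_labels = ["cold"] * 40 + ["no cold"] * 60
target_labels = ["cold"] * 10 + ["no cold"] * 90
gap = max_frequency_gap(label_frequencies(source_labels),
                        label_frequencies(target_labels))
print(f"largest class-frequency gap: {gap:.2f}")
```

The probe is scored on held-out data on purpose: a classifier can always find some split of its own training set, so only held-out accuracy above chance is evidence of input shift.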

Case Study: Waymo Phoenix → SF

Flat, wide, sunny → hills, fog, narrow streets

Diagnosing Waymo's Shift

Question  |  Answer  |  Diagnosis
Inputs look different?  |  Yes: fog, hills, narrow streets  |  Covariate shift
Class frequencies different?  |  Yes: more cyclists, pedestrians  |  Label shift
Rules changed?  |  No: a stop sign is still a stop sign  |  No concept shift

Domain-classifier check: source-vs-target frame classifier performs well → measurable input shift.

Both types — real-world shifts are often mixed.

How Waymo Adapted

  1. Collected SF data: thousands of miles with safety drivers
  2. Reweighted: upweight scenarios that are common in SF but rare in the Phoenix data (fog, hills) ← covariate shift fix
  3. Rebalanced: adjust for higher cyclist/pedestrian frequency ← label shift fix
  4. Finetuned: Phoenix model as starting point + SF data
  5. Augmented: synthetic fog, hills, unusual traffic

After years of adaptation, Waymo now operates commercially in SF.

Summary

  • Domain shift is not overfitting — the world changed, not your model's complexity.

  • Covariate shift (\(P(x)\) changes, \(P(y|x)\) same) is the most common pattern — fix with reweighting, finetuning, or augmentation.

  • Label shift (\(P(y)\) changes, \(P(x|y)\) same) requires adjusting class priors, not model architecture.

  • Diagnosis before treatment — identify what changed, then pick the matching fix.

  • Models are frozen snapshots of one distribution; the world keeps moving, so anticipate and adapt.

Reference: Shift Types

Shift Type  |  What Changes?  |  What Stays Same?  |  Fix Strategy  |  Today's Examples
Covariate  |  \(P(x)\)  |  \(P(y|x)\)  |  Reweight, finetune, augment  |  Face ID, deer detector, sim-to-real
Label  |  \(P(y)\)  |  \(P(x|y)\)  |  Adjust priors, rebalance  |  Cold diagnosis, MNIST class bias
Concept (preview)  |  \(P(y|x)\)  |  Nothing guaranteed  |  Retrain, continual learning  |  Next lecture!

Next Time: Making Models Robust

Now that we can diagnose domain shift... how do we fix it?

Data Augmentation Ensembles Monitoring Uncertainty Test-Time Adaptation

AI Educators Pilot - Lecture - Domain Shift

By Shen Shen
