WIP DRAFT 2
August 28, 2025
[video edited from 3b1b]
Consider the MNIST dataset
[video edited from 3b1b]
We trained such a 2-hidden-layer network for classification
Training Accuracy 97%
Testing Accuracy 81%
The canonical ____ ?
overfitting
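For concreteness, a minimal sketch of such a 2-hidden-layer network in PyTorch; the hidden-layer widths are my own assumption, not stated in the slides:

```python
import torch.nn as nn

# A 2-hidden-layer MLP for 28x28 MNIST digits; hidden sizes are assumptions.
model = nn.Sequential(
    nn.Flatten(),                        # 28x28 image -> 784-dim vector
    nn.Linear(28 * 28, 128), nn.ReLU(),  # hidden layer 1
    nn.Linear(128, 64), nn.ReLU(),       # hidden layer 2
    nn.Linear(64, 10),                   # logits over the 10 digit classes
)
```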
...
We trained such a 2-hidden-layer network for classification
Training Accuracy 97%
Testing Accuracy 60%
...
The canonical domain shift
Is this overfitting?
Does this performance gap "feel" the same as before?
We usually assume that the data we use to train a model is representative of what it will see in the real world — i.e., training and test data share the same structure and characteristics.
same domain
different domains
In other words, we often assume training and test data come from the same distribution, or the same domain. Domain shift happens when this assumption fails.
| | Overfitting | Domain Shift |
|---|---|---|
| Training vs. Testing Error | Training low, testing high | Training low, testing high |
| Analogy | Studies only past C01 exams, tested on a C01 exam | Studies all C01 materials (hw, lectures, etc.), tested on a C011 exam |
| Cause | Model memorizes noise or overly specific patterns | Distribution mismatch between training and testing data |
| Symptoms | High variance across data splits | Poor generalization to new environments or populations |
| Data Distribution | Same for train and test: \(P_{\text{train}}(x, y) \approx P_{\text{test}}(x, y)\) | Different for train and test: \(P_{\text{train}}(x, y) \neq P_{\text{test}}(x, y)\) |
| Typical Fixes | Regularization, dropout, simpler model, more data | Domain adaptation, importance weighting, data augmentation |
ML models may generalize poorly when domain shift occurs.
E.g., temporal changes:
E.g., demographic/context changes:
trained on 🇬🇧 accent
applied to 🇬🇧 🇮🇪 🇺🇸 🇦🇺 ... accents
trained at
deployed at
we also call this the source, and this the target
E.g., actions we take
Performance can be substantially affected by "domain shift"…
[Ganin et al. 2015; Beery et al. 2020; Koh et al. 2021]
There is very little one can do when the shift is "too significant".
To deal with shift in a meaningful way, there must be some properties (perhaps hidden, or conditional) that remain unchanged despite the overall shift.
How to address domain shift?
E.g., if the target task is to speak a foreign language, training on skateboarding seems almost useless.
E.g., train a deer detector using images (captured in sunny daytime).
A deer detector, trained in daytime and tested in nighttime:
We would like to estimate \(\theta\) that minimizes
\(\sum_{x, y} P_T(x, y) \operatorname{Loss}(x, y, \theta)\)
\(=\sum_{x, y} P_S(x, y) \frac{P_T(x, y)}{P_S(x, y)} \operatorname{Loss}(x, y, \theta)\)
\(=\sum_{x, y} P_S(x, y) \frac{P_T(x) P_T(y \mid x)}{P_S(x) P_S(y \mid x)} \operatorname{Loss}(x, y, \theta)\)
\(=\sum_{x, y} P_S(x, y) \frac{P_T(x)}{P_S(x)} \operatorname{Loss}(x, y, \theta)\)
(the last equality uses the covariate-shift assumption \(P_T(y \mid x) = P_S(y \mid x)\))
The average over the target distribution becomes a weighted average over the source distribution.
If the domain classifier \(Q\) is "perfect" (and trained on equal numbers of source and target samples), then
\(Q(S \mid x)=\frac{P_S(x)}{P_S(x)+P_T(x)} \quad \frac{Q(T \mid x)}{Q(S \mid x)}=\frac{P_T(x)}{P_S(x)}\)
So we can get the ratio weights \(P_T(x)/P_S(x)\) required for estimation directly from the domain classifier.
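A minimal sketch of this trick, assuming numpy and scikit-learn; `X_source` and `X_target` are hypothetical feature arrays, not from the original slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_importance_weights(X_source, X_target):
    """Estimate w(x) = P_T(x)/P_S(x) for source points via a domain classifier."""
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.ones(len(X_source)),    # 1 = source (S)
                        np.zeros(len(X_target))])  # 0 = target (T)
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    q_s = clf.predict_proba(X_source)[:, 1]        # Q(S | x)
    q_t = 1.0 - q_s                                # Q(T | x)
    # Correct for unequal sample sizes: the classifier's prior is n_S : n_T,
    # not 1 : 1 as the "perfect classifier" formula above assumes.
    prior_correction = len(X_source) / len(X_target)
    return prior_correction * q_t / np.clip(q_s, 1e-6, None)
```

The resulting weights would then multiply the per-example losses on the labeled source set when estimating \(\theta\).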
Now let's look at another type of shift that's "fixable" in a relatively easy and principled way
TODO: so many bugs... 😱 need to fix
Similar to the MNIST demo, consider the following scenario.
A species detector, trained using images from location 514.
What are the features, and what are the labels?
Features \(x\): images, labels \(y\): category index
Species detector, trained at location 514, deployed at other locations.
What do these bar charts represent?
Distribution over \(y\): e.g., species 25 is rare at location 514, but not so rare at other locations.
More formally, the assumptions are: \(P_T(x \mid y) = P_S(x \mid y)\) (class-conditional distributions unchanged), while \(P_T(y) \neq P_S(y)\) (the label distribution shifts).
We want the target posterior \(P_T(y \mid x)\):
\(P_T(y \mid x) \propto P_T(x \mid y) P_T(y)\)
\(=P_S(x \mid y) P_T(y)\)
\(=P_S(x \mid y) P_S(y) \frac{P_T(y)}{P_S(y)}\)
\(\propto \frac{P_S(x \mid y) P_S(y)}{P_S(x)} \frac{P_T(y)}{P_S(y)}\)
\(=P_S(y \mid x) \frac{P_T(y)}{P_S(y)}\)
class-conditional distributions assumed to be the same
\(x\) is fixed, so we can introduce any term that depends only on \(x\) without affecting proportionality in \(y\)
We have expressed the target classifier proportionally to the source classifier times the label ratios; normalizing over \(y\) for any given \(x\) gives our target classifier.
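A minimal sketch of this adjustment, assuming numpy; the inputs (source posteriors and the two label marginals) are hypothetical and would come from elsewhere:

```python
import numpy as np

def adjust_for_label_shift(p_s_y_given_x, p_s_y, p_t_y):
    """Turn source posteriors P_S(y|x) into target posteriors P_T(y|x).

    p_s_y_given_x: (n, K) array of source-classifier probabilities
    p_s_y, p_t_y:  length-K arrays of P_S(y) and P_T(y)
    """
    w = p_t_y / p_s_y                   # label ratios w(y) = P_T(y)/P_S(y)
    scores = p_s_y_given_x * w          # proportional to P_T(y | x), per row
    return scores / scores.sum(axis=1, keepdims=True)  # normalize over y
```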
Let’s consider two scenarios where we can adjust what we do
(these are not mutually consistent, and do not cover all shifts)
Domain Shift and Adaptation Summary
| Shift Type | What Changes? | Typical Assumption | Need to Reweight? | Need to Relearn? |
|---|---|---|---|---|
| Covariate shift | \(P(x)\) changes | \(P(y \mid x)\) stays the same | Yes (importance sampling on \(x\)) | No |
| Label shift | \(P(y)\) changes | \(P(x \mid y)\) stays the same | Yes (importance sampling on \(y\)) | No |
| Concept shift | \(P(y \mid x)\) changes | Neither \(P(x)\) nor \(P(y)\) necessarily stays the same | No (reweighting is not enough) | Yes |
The goal is to learn from labeled/annotated training (source) examples and transfer this knowledge for use in a different (target) domain.
For supervised learning: we also have some annotated examples from the target domain
For semi-supervised: we have lots of unlabeled target examples and just a few annotations
For unsupervised: we only have unlabeled examples from the target domain
For reinforcement learning: we have access to (source) environment, and aim to transfer policies to a (target) environment, possibly with limited or no interaction data from the target
TODO: add citations and plots here for each case
Data augmentation
Domain randomization
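As an illustrative sketch (not from the original slides), randomizing appearance at training time with torchvision so the model sees more variation than the raw source data; the specific transforms and parameters are arbitrary choices:

```python
from torchvision import transforms

# Randomized photometric and geometric perturbations applied per image.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```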
Let’s revisit domain shift — this time through the lens of a real-world application.
Autonomous driving gives us rich, practical examples of all types of shift.
Training: Urban, sunny, daytime data in California
Testing: Nighttime, fog, rain, snow — possibly in Europe
[Insert visual of sunny vs. foggy streets]
Same pedestrians, different pixel distributions: \(P(x)\) changes while \(P(y \mid x)\) stays the same, i.e., covariate shift
Training: Suburban setting — fewer pedestrians, many stop signs
Testing: City center near a school — more pedestrians and cyclists
[Insert bar plot of class distribution across domains]
Target domain has more bikes, fewer stop signs: \(P(y)\) changes, i.e., label shift
To adapt to label-shifted data, reweight by
\(w(y)=\frac{P_T(y)}{P_S(y)}\)
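One way to apply these weights during training, sketched with PyTorch's class-weighted cross entropy; the 3-class weight values are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical label ratios w(y) = P_T(y)/P_S(y) for 3 classes,
# e.g., bikes up-weighted, stop signs down-weighted in the target domain.
w = torch.tensor([2.0, 1.0, 0.5])
criterion = nn.CrossEntropyLoss(weight=w)  # weights each example's loss by w[y]
```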
Q: How do we adapt for covariate shift? (Cf. the importance-weighting derivation above: reweight by \(w(x)=P_T(x)/P_S(x)\).)
| Setting | Description |
|---|---|
| Unsupervised | No target labels |
| Semi-supervised | Few labeled target samples |
| Supervised | Fully labeled target data |
Domain shift is not rare — it’s constant in the real world.
Robustness