Shen Shen
April 28, 2025
2:30pm, Room 32-144
Slides adapted from Tommi Jaakkola
[Ganin et al. 2015]
[Koh et al. 2021]
[Beery et al. 2020, Koh et al. 2021]
The goal is to learn from labeled/annotated training (source) examples and to transfer this knowledge for use in a different (target) domain
For supervised learning: we also have some annotated examples from the target domain
For semi-supervised: we have lots of unlabeled target examples and just a few annotations
For unsupervised: we only have unlabeled examples from the target domain
For reinforcement learning: we have access to (source) environment, and aim to transfer policies to a (target) environment, possibly with limited or no interaction data from the target
Let’s consider two scenarios where we can adjust what we do
(their assumptions are not mutually consistent, and they do not cover all possible shifts)
Assumptions: the label marginal changes, \(P_T(y) \neq P_S(y)\), while the class-conditional distributions stay the same, \(P_T(x \mid y)=P_S(x \mid y)\) (label shift)
We want: the target classifier \(P_T(y \mid x)\)
\(P_T(y \mid x) \propto P_T(x \mid y) P_T(y)\)
\(=P_S(x \mid y) P_T(y)\)
\(=P_S(x \mid y) P_S(y) \frac{P_T(y)}{P_S(y)}\)
\(\propto \frac{P_S(x \mid y) P_S(y)}{P_S(x)} \frac{P_T(y)}{P_S(y)}\)
\(=P_S(y \mid x) \frac{P_T(y)}{P_S(y)}\)
class conditional distributions assumed to be the same
x is fixed, so we can multiply or divide by any term that depends only on x without affecting proportionality in y
we have expressed the target classifier as proportional to the source classifier times the label ratio \(P_T(y)/P_S(y)\); normalizing over \(y\) for any given \(x\) gives the target classifier
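A minimal sketch of this adjustment (not from the slides); the classifier outputs and label marginals below are hypothetical placeholders:

```python
import numpy as np

def label_shift_adjust(source_probs, p_y_source, p_y_target):
    """Re-weight source posteriors P_S(y|x) by P_T(y)/P_S(y) and renormalize.

    source_probs: (n, k) array of P_S(y|x) for n inputs and k classes
    p_y_source:   (k,) source label marginal P_S(y)
    p_y_target:   (k,) target label marginal P_T(y)
    """
    ratio = p_y_target / p_y_source               # P_T(y) / P_S(y)
    unnormalized = source_probs * ratio           # proportional to P_T(y|x)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Hypothetical example: a 3-class source classifier evaluated on two inputs.
source_probs = np.array([[0.7, 0.2, 0.1],
                         [0.2, 0.5, 0.3]])
p_y_source = np.array([0.5, 0.3, 0.2])   # assumed source label frequencies
p_y_target = np.array([0.2, 0.3, 0.5])   # assumed target label frequencies
print(label_shift_adjust(source_probs, p_y_source, p_y_target))
```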
We would like to estimate \(\theta\) that minimizes
\(\sum_{x, y} P_T(x, y) \operatorname{Loss}(x, y, \theta)\)
\(=\sum_{x, y} P_S(x, y) \frac{P_T(x, y)}{P_S(x, y)} \operatorname{Loss}(x, y, \theta)\)
\(=\sum_{x, y} P_S(x, y) \frac{P_T(x) P_T(y \mid x)}{P_S(x) P_S(y \mid x)} \operatorname{Loss}(x, y, \theta)\)
\(=\sum_{x, y} P_S(x, y) \frac{P_T(x)}{P_S(x)} \operatorname{Loss}(x, y, \theta)\) (using the covariate-shift assumption \(P_T(y \mid x)=P_S(y \mid x)\))
average over the target distribution
weighted average over the source distribution
If the domain classifier is "perfect" (and source and target examples are mixed in equal proportion), then
\(Q(S \mid x)=\frac{P_S(x)}{P_S(x)+P_T(x)} \quad \frac{Q(S \mid x)}{Q(T \mid x)}=\frac{P_S(x)}{P_T(x)}\)
So we can read the ratio weights \(\frac{P_T(x)}{P_S(x)}=\frac{Q(T \mid x)}{Q(S \mid x)}\) required for estimation directly off the domain classifier
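A minimal sketch of this recipe (not from the slides): it uses scikit-learn's `LogisticRegression` as the domain classifier and assumes the source and target samples enter in roughly equal numbers, so that \(Q(S \mid x) \approx P_S(x)/(P_S(x)+P_T(x))\); the function and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_source, X_target):
    """Estimate w(x) = P_T(x)/P_S(x) = Q(T|x)/Q(S|x) via a domain classifier.

    Trains a classifier to distinguish source (label 0) from target (label 1)
    inputs, then converts its predicted probabilities on the source points
    into importance weights.
    """
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression().fit(X, d)
    q_t = clf.predict_proba(X_source)[:, 1]      # Q(T | x) on source points
    q_s = 1.0 - q_t                              # Q(S | x)
    return q_t / np.clip(q_s, 1e-12, None)       # ratio weights P_T(x)/P_S(x)
```

The resulting per-example weights can then be handed to any learner that accepts them, e.g. `LogisticRegression().fit(X_source, y_source, sample_weight=weights)`, which approximates the weighted average over the source distribution above.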
importance sampling
\(\mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]=\mathbb{E}_{x \sim p}[f(x)]\)
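A quick Monte Carlo check of this identity (an illustrative sketch; the densities \(p=\mathcal{N}(1,1)\), \(q=\mathcal{N}(0,2)\) and the test function \(f(x)=x^2\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2

# Direct Monte Carlo estimate of E_{x~p}[f(x)] with p = N(1, 1).
x_p = rng.normal(1.0, 1.0, 100_000)
direct = f(x_p).mean()

# Importance-sampled estimate using samples from q = N(0, 2).
x_q = rng.normal(0.0, 2.0, 100_000)
weights = gauss_pdf(x_q, 1.0, 1.0) / gauss_pdf(x_q, 0.0, 2.0)   # p(x)/q(x)
reweighted = (weights * f(x_q)).mean()

print(direct, reweighted)   # both approximate E_p[x^2] = 1^2 + 1 = 2
```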
\(U(\theta)=\mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[\frac{P(\tau \mid \theta)}{P\left(\tau \mid \theta_{\text{old}}\right)} R(\tau)\right]\)
\(=\mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[\prod_t \frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right)} R(\tau)\right]\) (the unknown transition probabilities cancel in the ratio)
\(\nabla_\theta U(\theta)=\mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[\frac{\nabla_\theta P(\tau \mid \theta)}{P\left(\tau \mid \theta_{\text{old}}\right)} R(\tau)\right]\)
[Tang and Abbeel, On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient, 2011]
Off-policy evaluation as dealing with covariate shift
What is the conditional distribution that remains the same? The unknown transition dynamics.
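As an illustration of that cancellation, here is a minimal sketch (not from the slides) that evaluates the surrogate objective \(U(\theta)\) from trajectories collected under \(\theta_{\text{old}}\), assuming a hypothetical tabular softmax policy; only policy ratios appear, never the transition probabilities.

```python
import numpy as np

def softmax_policy(theta, state):
    """pi_theta(a | s) for a tabular policy; theta has shape (n_states, n_actions)."""
    logits = theta[state]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def off_policy_objective(theta, theta_old, trajectories):
    """Estimate U(theta) = E_{tau ~ theta_old}[ prod_t pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * R(tau) ].

    Each trajectory is a tuple (states, actions, total_return) collected by
    running the old policy.  The unknown transition probabilities appear in
    both numerator and denominator of P(tau|theta)/P(tau|theta_old) and cancel,
    so only policy ratios are needed.
    """
    estimates = []
    for states, actions, total_return in trajectories:
        ratio = 1.0
        for s, a in zip(states, actions):
            ratio *= softmax_policy(theta, s)[a] / softmax_policy(theta_old, s)[a]
        estimates.append(ratio * total_return)
    return float(np.mean(estimates))
```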
Recall
Domain Adaptation Summary
| Shift Type | What Changes? | Typical Assumption | Need to Reweight? | Need to Relearn? |
|---|---|---|---|---|
| Covariate Shift | \(P(x)\) changes | \(P(y \mid x)\) stays the same | Yes (importance sampling on \(x\)) | No |
| Label Shift | \(P(y)\) changes | \(P(x \mid y)\) stays the same | Yes (importance sampling on \(y\)) | No |
| Concept Shift | \(P(y \mid x)\) changes | Neither \(P(x)\) nor \(P(y)\) necessarily stays the same | No (reweighting is not enough) | Yes (new learning needed) |
Data augmentation
Domain randomization
Robustness
We'd love to hear your thoughts.