Supervised Learning
ML in Feedback Sys #2
Prof Sarah Dean
Announcements
 Sign up to scribe
 Rank preferences for paper presentations
 Email me about first paper presentations 9/12 and 9/14
 HSNL18 Fairness Without Demographics in Repeated Loss Minimization
 PZMH20 Performative Prediction
 Required: meet with Atul at least 2 days before you are scheduled to present
 Working in pairs/groups, self-assessment
ML in Feedback Systems
[Diagram: training data \(\{(x_i, y_i)\}\) → model \(f:\mathcal X\to\mathcal Y\) → policy; observation → action]
Supervised learning
[Diagram: training data \(\{(x_i, y_i)\}\) sampled i.i.d. from \(\mathcal D\) → model \(f:\mathcal X\to\mathcal Y\); observation \(x\sim\mathcal D_{x}\) → prediction]
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
Predictions via Risk Minimization
\(\ell(y,\hat y)\) measures "loss" of predicting \(\hat y\) when it's actually \(y\)
Encode our goal in risk minimization framework:
$$\min_{f\in\mathcal F}\mathcal R(f) = \mathbb E_{x,y\sim\mathcal D}[\ell(y, f(x))]$$
Ex: targeted job ads
\(x_i=\) demographic info and browsing history
\(y_i=\) clicked (\(1\)) or not (\(-1\))
$$\hat \theta = \arg\min_{\theta} \sum_{i=1}^N(-\theta^\top x_i\cdot y_i)_+$$
predict \(\hat f(x) = \mathbb 1\{\hat\theta^\top x \geq t\}\)
The entry of \(\hat\theta\) corresponding to "female" is negative!
The entry of \(\hat\theta\) corresponding to "visited website for women's clothing store" is negative!
No fairness through unawareness!
Statistical Classification Criteria
[Figure: points \(X\) labeled \(Y=1\) or \(Y=0\), split by a classifier \(f(X)\)]
Accuracy
\(\mathbb P( \hat Y = Y)\) = \(3/4\)
Positive rate
\(\mathbb P( \hat Y = 1)\) = \(9/20\)
False positive rate
\(\mathbb P( \hat Y = 1\mid Y = 0)\) = \(1/5\)
False negative rate
\(\mathbb P( \hat Y = 0\mid Y = 1)\) = \(3/10\)
Positive predictive value
\(\mathbb P( Y = 1\mid\hat Y = 1)\) = \(7/9\)
Negative predictive value
\(\mathbb P( Y = 0\mid\hat Y = 0)\) = \(8/11\)
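As a quick check of the definitions above, the sketch below computes the six criteria from confusion-matrix counts. The counts (7 true positives, 2 false positives, 3 false negatives, 8 true negatives) are an assumption: they are one set of 20 points consistent with the fractions listed on this slide, not data read off the figure. The function name `classification_criteria` is hypothetical.

```python
from fractions import Fraction


def classification_criteria(tp, fp, fn, tn):
    """Return the six criteria as exact fractions from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy": Fraction(tp + tn, n),                    # P(Yhat = Y)
        "positive_rate": Fraction(tp + fp, n),               # P(Yhat = 1)
        "false_positive_rate": Fraction(fp, fp + tn),        # P(Yhat = 1 | Y = 0)
        "false_negative_rate": Fraction(fn, fn + tp),        # P(Yhat = 0 | Y = 1)
        "positive_predictive_value": Fraction(tp, tp + fp),  # P(Y = 1 | Yhat = 1)
        "negative_predictive_value": Fraction(tn, tn + fn),  # P(Y = 0 | Yhat = 0)
    }


# Assumed counts, chosen to be consistent with the fractions on the slide.
criteria = classification_criteria(tp=7, fp=2, fn=3, tn=8)
for name, value in criteria.items():
    print(f"{name}: {value}")
```

Using exact fractions rather than floats makes it easy to compare the output with hand computations.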
Nondiscrimination Criteria
 Informally: predictor should treat individuals the "same" across groups
 Formally: equalize positive rate, error rate, or predictive value across groups
Independence: prediction does not depend on \(a\)
\(\hat y \perp a\)
e.g. ad displayed at equal rates across gender
Separation: given outcome, prediction does not depend on \(a\)
\(\hat y \perp a~\mid~y\)
e.g. ad displayed to interested users at equal rates across gender
Sufficiency: given prediction, outcome does not depend on \(a\)
\( y \perp a~\mid~\hat y\)
e.g. users viewing ad are interested at equal rates across gender
In addition to features \(x\) and labels \(y\), individuals have protected attribute \(a\) (e.g. gender, race)
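One way to audit the three criteria is to compare group-wise rates: positive rate for independence, false positive and false negative rates for separation, and predictive values for sufficiency. The following is a minimal sketch on made-up records; the names `group_rates` and `expand`, the group labels, and the counts are all hypothetical.

```python
from collections import defaultdict


def group_rates(records):
    """records: iterable of (a, y, y_hat) with y and y_hat in {0, 1}.

    Returns, per group a, the rates relevant to each criterion
    (None when a rate is undefined because its denominator is zero).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for a, y, y_hat in records:
        key = ("t" if y_hat == y else "f") + ("p" if y_hat == 1 else "n")
        counts[a][key] += 1

    def ratio(num, den):
        return num / den if den else None

    rates = {}
    for a, c in counts.items():
        n = sum(c.values())
        rates[a] = {
            "positive_rate": ratio(c["tp"] + c["fp"], n),  # independence
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),      # separation
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),      # separation
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),      # sufficiency
            "npv": ratio(c["tn"], c["tn"] + c["fn"]),      # sufficiency
        }
    return rates


def expand(a, tp, fp, fn, tn):
    """Build (a, y, y_hat) records from confusion-matrix counts."""
    return ([(a, 1, 1)] * tp + [(a, 0, 1)] * fp
            + [(a, 1, 0)] * fn + [(a, 0, 0)] * tn)


# Made-up data: two groups with different base rates.
records = expand("A", tp=4, fp=1, fn=1, tn=4) + expand("B", tp=2, fp=3, fn=0, tn=5)
rates = group_rates(records)
for a, r in rates.items():
    print(a, r)
```

In this toy data both groups have the same positive rate, so independence holds, while the error rates and predictive values differ across groups, so separation and sufficiency do not.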
Ref: Ch 2 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org; Ch 3 of Barocas, Hardt, Narayanan "Fairness and Machine Learning" fairmlbook.org.
Ex: Pretrial Detention
 COMPAS: a criminal risk assessment tool used in pretrial release decisions
 \(x\) survey about defendant
 \(\hat y\) designation as high- or low-risk
 Audit of data from Broward County, FL
 \(a\) race of defendant
 \(y\) recidivism within two years
“Black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified. [...] White defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often.”
Ex: Pretrial Detention
“In comparison with whites, a slightly lower percentage of blacks were ‘Labeled Higher Risk, But Didn’t Re-Offend.’ [...] A slightly higher percentage of blacks were ‘Labeled Lower Risk, Yet Did Re-Offend.’”
\(\mathbb P(\hat y = 1\mid y=0, a=\text{Black})> \mathbb P(\hat y = 1\mid y=0, a=\text{White}) \)
\(\mathbb P(\hat y = 0\mid y=1, a=\text{Black})< \mathbb P(\hat y = 0\mid y=1, a=\text{White}) \)
COMPAS risk predictions do not satisfy separation
\(\mathbb P(y = 0\mid \hat y=1, a=\text{Black})\approx \mathbb P( y = 0\mid \hat y=1, a=\text{White}) \)
\(\mathbb P(y = 1\mid \hat y=0, a=\text{Black})\approx \mathbb P( y = 1\mid \hat y=0, a=\text{White}) \)
COMPAS risk predictions do satisfy sufficiency
Achieving Nondiscrimination Criteria
 Pre-processing: remove correlations between \(a\) and features \(x\) in dataset.
  Requires knowledge of \(a\) during data cleaning
 In-processing: modify learning algorithm to respect criteria.
  Requires knowledge of \(a\) at training time
 Post-processing: adjust thresholds in group-dependent manner.
  Requires knowledge of \(a\) at decision time
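To make the post-processing idea concrete, here is a minimal sketch that picks one threshold per group so that every group has the same positive rate, which targets independence. This is only one possible rule, and other criteria would need a different way of choosing thresholds. The function names, scores, and target rate are hypothetical, and the scores are assumed to have no ties.

```python
def group_thresholds(scores_by_group, target_rate):
    """Pick a per-group threshold so each group's positive rate is about target_rate."""
    thresholds = {}
    for a, scores in scores_by_group.items():
        ranked = sorted(scores, reverse=True)
        k = round(target_rate * len(ranked))  # number of positives to allow
        # Accept the top-k scores; if k == 0, no score can reach the threshold.
        thresholds[a] = ranked[k - 1] if k > 0 else float("inf")
    return thresholds


def positive_rate(scores, threshold):
    """Fraction of scores classified positive under `score >= threshold`."""
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up model scores for two groups (no ties).
scores_by_group = {
    "A": [0.9, 0.8, 0.7, 0.4, 0.2, 0.1],
    "B": [0.6, 0.45, 0.3, 0.25, 0.15, 0.05],
}

# A single shared threshold gives unequal positive rates.
shared = {a: positive_rate(s, 0.5) for a, s in scores_by_group.items()}

# Group-dependent thresholds equalize them; note that a is needed at decision time.
thresholds = group_thresholds(scores_by_group, target_rate=0.5)
adjusted = {a: positive_rate(s, thresholds[a]) for a, s in scores_by_group.items()}

print("shared threshold 0.5:", shared)
print("group thresholds:", thresholds)
print("adjusted positive rates:", adjusted)
```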
Limitations of Nondiscrimination Criteria
 Tradeoffs: It is impossible to simultaneously satisfy separation and sufficiency if populations have different base rates and the predictor is imperfect
  Kleinberg, Mullainathan & Raghavan, Inherent Trade-Offs in the Fair Determination of Risk Scores
 Observational: Statistical criteria can measure only correlation; intuitive notions of discrimination involve causation, which requires careful modelling
  Simpson's Paradox, Cory Simon
 Unclear legal grounding: While algorithmic decisions may have disparate impact, achieving criteria involves disparate treatment
  Barocas & Selbst, Big Data's Disparate Impact
 Limited view: focusing on risk prediction might miss the bigger picture of how these tools are used by larger systems to make decisions
Discrimination Beyond Classification
image cropping
facial recognition
information retrieval
generative models
Empirical Risk Minimization
 define loss
 do ERM (with fairness constraints)
$$\hat f = \arg\min_{f\in\mathcal F} \underbrace{\frac{1}{N} \sum_{i=1}^N \ell(y_i, f(x_i))}_{\mathcal R_N(f)}$$
performance depends on risk \(\mathcal R(f)\)
[Diagram: training data \(\{(x_i, y_i)\}\) sampled from \(\mathcal D\) → model \(f:\mathcal X\to\mathcal Y\)]
Sample vs. population
Fundamental Theorem of Supervised Learning:
 The risk is bounded by the empirical risk plus the generalization error. $$ \mathcal R(f) \leq \mathcal R_N(f) + |\mathcal R(f) - \mathcal R_N(f)|$$
Empirical risk minimization
$$\hat f = \arg\min_{f\in\mathcal F} \underbrace{\frac{1}{N} \sum_{i=1}^N \ell(y_i, f(x_i))}_{\mathcal R_N(f)}$$
1. Representation
2. Optimization
3. Generalization
Case study: linear regression
Least-squares linear regression models \(y\approx \theta^\top x\)
$$\min_{\theta\in\mathbb R^d} \frac{1}{n}\sum_{i=1}^n \left(\theta^\top x_i - y_i\right)^2$$
At first glance, linear representation seems limiting, but we can encode rich representations by expanding the features (increasing \(d\))
\(y = (x-1)^2 = \begin{bmatrix}1\\-2\\1\end{bmatrix}^\top \underbrace{\begin{bmatrix}1\\x\\x^2\end{bmatrix}}_{\varphi(x)} \)
For more, see Ch 4 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org.
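The feature-expansion example above can be checked numerically. The sketch below fits least squares on the features \(\varphi(x) = [1, x, x^2]\) to noiseless samples of \(y=(x-1)^2\) and should recover coefficients close to \([1, -2, 1]\). The sample grid and the function name `phi` are arbitrary choices for illustration.

```python
import numpy as np


def phi(x):
    """Polynomial feature map phi(x) = [1, x, x^2], applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)


# Noiseless samples of a function that is nonlinear in x but linear in phi(x).
x = np.linspace(-2.0, 3.0, 11)
y = (x - 1.0) ** 2

# Ordinary least squares on the expanded features.
theta_hat, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
print(theta_hat)  # approximately [1, -2, 1]
```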
Case study: linear regression
Optimization is straightforward due to differentiable and convex risk
$$\min_{\theta\in\mathbb R^d} \frac{1}{n}\sum_{i=1}^n \left(\theta^\top x_i - y_i\right)^2$$
[Figure: three loss landscapes: strongly convex (one global min), convex (infinitely many global min), nonconvex (local and global min)]
Case study: linear regression
Derivation of optimal solution
$$\hat\theta\in\arg\min_{\theta\in\mathbb R^d} \frac{1}{n}\sum_{i=1}^n \left(\theta^\top x_i - y_i\right)^2$$
first-order optimality condition: \( \displaystyle \sum_{i=1}^n x_i x_i^\top\hat \theta = \sum_{i=1}^n y_ix_i \)
min-norm solution: \( \displaystyle \hat\theta = \left( \sum_{i=1}^n x_i x_i^\top\right)^\dagger\sum_{i=1}^n y_ix_i \)
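As a small numerical illustration of the two formulas above, the sketch below uses a made-up problem with \(n=2\) samples in \(d=3\) dimensions, so the optimality condition has infinitely many solutions. The data, the null-space direction, and the `rcond` cutoff are assumptions chosen for the example.

```python
import numpy as np

# Made-up data: rows of X are x_i^T, with fewer samples than features.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

A = X.T @ X  # sum_i x_i x_i^T (singular here, since n < d)
b = X.T @ y  # sum_i y_i x_i

# Min-norm solution via the pseudoinverse; rcond treats tiny eigenvalues as zero.
theta_min_norm = np.linalg.pinv(A, rcond=1e-10) @ b

# Adding any direction in the null space of X gives another solution.
null_dir = np.array([1.0, 1.0, -1.0])  # X @ null_dir = 0 for this X
theta_other = theta_min_norm + null_dir

print("min-norm solution:", theta_min_norm)
print("another solution: ", theta_other)
print("norms:", np.linalg.norm(theta_min_norm), np.linalg.norm(theta_other))
```

Both vectors satisfy the first-order optimality condition, but the pseudoinverse picks the one with the smallest Euclidean norm.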
Case study: linear regression
Iterative optimization with gradient descent $$\theta_{t+1} = \theta_t - \alpha\sum_{i=1}^n (\theta_t^\top x_i - y_i)x_i$$
Claim: suppose \(\theta_0=0\) and GD converges to a minimizer. Then it converges to the minimum norm solution.
Proof: exercise. Hint: consider the span of the \(x_i\).
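The claim about gradient descent can be checked numerically (this is a sanity check, not a proof). The sketch below runs the update from \(\theta_0 = 0\) on a made-up underdetermined problem and compares the result with the pseudoinverse solution. The data, step size, and iteration count are assumptions.

```python
import numpy as np

# Made-up underdetermined problem: n = 2 samples, d = 3 features.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

# Assumed step size: 1 / lambda_max(X^T X), small enough for GD to converge here.
alpha = 1.0 / np.linalg.norm(X, 2) ** 2

theta = np.zeros(X.shape[1])  # theta_0 = 0
for _ in range(500):
    grad = X.T @ (X @ theta - y)  # sum_i (theta^T x_i - y_i) x_i
    theta = theta - alpha * grad

theta_min_norm = np.linalg.pinv(X) @ y  # minimum norm least-squares solution
print("gradient descent:", theta)
print("min-norm solution:", theta_min_norm)
```

Starting from zero, every iterate is a linear combination of the \(x_i\), which is the observation the hint points to.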
Case study: linear regression
Generalization: under the fixed design generative model, \(\{x_i\}_{i=1}^n\) are fixed and
\(y_i = \theta_\star^\top x_i + v_i\) with \(v_i\) i.i.d. with mean \(0\) and variance \(\sigma^2\)
\(\mathcal R(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb E_{y_i}\left[(x_i^\top \theta - y_i)^2\right]\)
Claim: when features span \(\mathbb R^d\), the expected excess risk is \(\mathbb E[\mathcal R(\hat\theta) - \mathcal R(\theta_\star)] =\frac{\sigma^2 d}{n}\)
 First, \(\mathcal R(\theta) = \frac{1}{n}\|X(\theta-\theta_\star)\|_2^2 + \sigma^2\), where \(X\) has rows \(x_i^\top\)
 Then, \(\|X(\hat\theta-\theta_\star)\|_2^2 = v^\top X(X^\top X)^{-1}X^\top v\)
 Then take expectation: \(\mathbb E[v^\top X(X^\top X)^{-1}X^\top v] = \sigma^2\operatorname{tr}\left(X(X^\top X)^{-1}X^\top\right) = \sigma^2 d\).
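A Monte Carlo simulation can serve as a sanity check on the \(\sigma^2 d/n\) rate. The sketch below fixes a design matrix, redraws the noise many times, and averages \(\frac{1}{n}\|X(\hat\theta-\theta_\star)\|_2^2\). The dimensions, the noise level, and the Gaussian noise are assumptions made for illustration; the claim itself only needs mean-zero noise with variance \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 5, 0.5
X = rng.normal(size=(n, d))  # fixed design; spans R^d almost surely
theta_star = rng.normal(size=d)

trials = 20000
V = sigma * rng.normal(size=(n, trials))          # one noise vector per column
Y = (X @ theta_star)[:, None] + V                 # labels for every trial
Theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # d x trials least-squares fits

# Excess risk per trial: (1/n) * ||X (theta_hat - theta_star)||_2^2
errors = X @ (Theta_hat - theta_star[:, None])
excess = np.sum(errors**2, axis=0) / n

empirical = excess.mean()
predicted = sigma**2 * d / n
print(f"simulated: {empirical:.5f}, sigma^2 d / n: {predicted:.5f}")
```

The simulated average should be close to the predicted value up to Monte Carlo error, which shrinks as `trials` grows.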
Case study: linear regression
Exercises:
 For the same generative model and a new fixed \(x_{n+1}\), what is the expected loss \(\mathbb E_y[(\hat \theta^\top x_{n+1} - y_{n+1})^2]\)? Can you interpret the quantities?
 In the random design setting, we take each \(x_i\) to be drawn i.i.d. from \(\mathcal N(0,\Sigma)\) and assume that \(v_i\) is also Gaussian. The risk is then $$\mathcal R(\theta) = \mathbb E_{x, y}\left[(x^\top \theta - y)^2\right].$$ What is the excess risk of \(\hat\theta\) in terms of \(X^\top X\) and \(\Sigma\)? What is the excess risk in terms of \(\sigma^2, n, d\) (ref 1, 2)?
Recap
 Nondiscrimination criteria
 independence, separation, and sufficiency
 Least-squares regression
 representation, optimization, generalization
Next time: online learning with linear least-squares case study