CS 4/5789: Introduction to Reinforcement Learning
Lecture 26: Societal Implications
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
 Homework
 5789 Paper Reviews due weekly on Mondays
 PA 4 due tonight
 Midterm corrections due Monday
 Accepted up until final (no late penalty)
 Final exam is Saturday 5/13 at 2pm
 Length: 2 hours
 Location: 155 Olin
 Review lecture Monday
 Course evaluations open! Participation credit
Agenda
1. Real-World RL
2. Specification & Risks
3. Does RL Work?
4. All ML is RL
PollEv
Real World RL
AlphaGo vs. Lee Sedol, 2016
Real World RL

AlphaGo Zero (2017)
 Replaces imitation learning with random exploration, uses MCTS during self-play

AlphaZero (2018)
 Generalizes beyond Go to Chess and Shogi

MuZero (2020)
 Generalizes to Atari by not requiring dynamics \(f\)
 Applied to video compression in 2022
Real World RL
(offline)
Real World RL
Kuaishou
Real World RL
offline contextual bandits
Agenda
1. Real-World RL
2. Specification & Risks
3. Does RL Work?
4. All ML is RL
RL Specification
Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)
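Figure: agent-environment loop, labeled with state \(s_t\), reward \(r_t\), action \(a_t\), policy \(\pi\), discount \(\gamma\), and transitions \(P\)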
 action space and discount known
 states and reward signals observed
 transition probabilities unknown
actions & states determine environment
discount & reward determine objective
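As a rough sketch of how the discount and the reward signal together fix the objective, the snippet below scores one observed reward sequence under two discount factors. The names `KnownSpec` and `discounted_return`, the action labels, and the reward values are hypothetical and chosen only for illustration; the transition probabilities \(P\) are deliberately absent because they are unknown to the agent.

```python
from dataclasses import dataclass
from typing import Hashable, Sequence


@dataclass(frozen=True)
class KnownSpec:
    """The parts of the MDP that are specified up front (hypothetical container)."""
    actions: Sequence[Hashable]  # action space: known
    gamma: float                 # discount factor: known


def discounted_return(rewards, gamma):
    """Objective value of one trajectory: sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Reward signals are observed, not specified as a known function.
observed_rewards = [1.0, 1.0, 1.0]

short_sighted = KnownSpec(actions=("a0", "a1"), gamma=0.5)
far_sighted = KnownSpec(actions=("a0", "a1"), gamma=0.9)

# Same observations, different objectives.
print(discounted_return(observed_rewards, short_sighted.gamma))  # 1.75
print(discounted_return(observed_rewards, far_sighted.gamma))
```

The same trajectory is worth less under the smaller discount, so changing \(\gamma\) changes which behavior counts as optimal even when the environment is untouched.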
Specifying Horizon/Discount
Small discount factor leads to shortsighted agent
 \(0\) cost for \(a_0\)
 \(2\epsilon\) cost for \(a_1\)
 \(\epsilon\) reward in \(s_0\)
 \(1\) reward in \(s_1\)
\(V^{a_0}(s_0) = \frac{\epsilon}{1-\gamma}\) and \(V^{a_1}(s_0) = \frac{1}{1-\gamma} - \frac{2\epsilon}{\gamma}\)
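A small numerical check of the shortsightedness claim, assuming the two values read \(\epsilon/(1-\gamma)\) and \(1/(1-\gamma) - 2\epsilon/\gamma\) (the minus signs are a reconstruction, so treat this as a sketch). Under that reading, always choosing \(a_0\) is preferred exactly when \(\gamma < 2\epsilon/(1+\epsilon)\), for \(0 < \gamma < 1\). The function names are hypothetical.

```python
def value_a0(eps, gamma):
    """Value at s_0 of always taking a_0, as read from the slide."""
    return eps / (1 - gamma)


def value_a1(eps, gamma):
    """Value at s_0 of always taking a_1, as read from the slide (assumed signs)."""
    return 1 / (1 - gamma) - 2 * eps / gamma


def preferred_action(eps, gamma):
    """Action with the larger value at s_0."""
    return "a0" if value_a0(eps, gamma) >= value_a1(eps, gamma) else "a1"


eps = 0.1
# Setting value_a0 = value_a1 and solving for gamma gives this threshold.
threshold = 2 * eps / (1 + eps)
print(f"a0 preferred below gamma = {threshold:.4f}")

for gamma in (0.05, 0.1, 0.5, 0.9):
    print(gamma, preferred_action(eps, gamma))
```

For small \(\gamma\) the agent settles for the \(\epsilon\) reward in \(s_0\) rather than paying a small cost to reach the much larger reward in \(s_1\).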
Specifying Reward
The promise of RL:
translate specified objective into desired behavior
The reality:
Risk: Reward Hacking
"While everyone seemed focused on how many views a video got, we thought the amount of time someone spent watching a video was a better way to understand whether a viewer really enjoyed it."
YouTube in 2014 vs. 2018
Risk: Reward Hacking
Facebook's "Meaningful Social Interaction" metric
"Misinformation, toxicity, and violent content are inordinately prevalent among reshares"
Reward Design
Inverse Reward Design (NeurIPS, 2017)
Idea: treat specified reward as imperfect proxy
Then attempt to learn true reward from other feedback
Directly related to learning human preferences and RLHF
Specifying States & Actions
The interface through which the agent sees and impacts the world
Also delimits reasoning about the world
\(s_t\)
\(a_t\)
Risk: Too Much Information
Evolving an oscillator on hardware (Bird & Layzell, 2002)
Result: a "network of transistors sensing and utilising the radio waves emanating from nearby PCs"
Risk: Too Little Information
The first Tesla autopilot fatality in 2016
Safety systems failed to detect white truck against bright sky
"vehicles [...] will no longer be equipped with radar. Instead, these will [...] rely on camera vision and neural net processing." (Tesla, 2021)
Risk: Inappropriate Actuation
Learning to influence other drivers
Excessive caution around other drivers
Excessive aggression
Example adapted from Anca Dragan
Risk: Inappropriate Actuation
Emotionally charged content effectively grabs attention
Agenda
1. Real-World RL
2. Specification & Risks
3. Does RL Work?
4. All ML is RL
Does RL Work?
Three strikes against RL:
1. Model-based design and optimization works better
ex: Model Predictive Control at Boston Dynamics
Does RL Work?
Three strikes against RL:
1. Model-based design and optimization works better
Data-driven optimization suffers from local minima and large sample complexity (Deep RL doesn't work yet, 2018)
Does RL Work?
Three strikes against RL:
2. Simulation is essentially necessary, but the sim2real gap is huge
RL exploits bugs in simulator code (Nathan Lambert, 2021)
Does RL Work?
Three strikes against RL:
3. Questionable evaluation practices
State-of-the-art algorithms are outperformed by simple baselines: "Simple random search provides a competitive approach to reinforcement learning", 2017
Generality?
This perspective ignores the instance-specific tuning that often goes into making RL algorithms work

AlphaGo Zero (2017)
 Replaces imitation learning with random exploration, uses MCTS during self-play

AlphaZero (2018)
 Generalizes beyond Go to Chess and Shogi

MuZero (2020)
 Generalizes to Atari by not requiring dynamics \(f\)
 Applied to video compression in 2022
 Large pretrained models (e.g. GPT-X, 2018-present)
 Arguably entirely imitation-based
Agenda
1. Real-World RL
2. Specification & Risks
3. Does RL Work?
4. All ML is RL
All ML is RL once deployed
ex: credit score designed within supervised learning framework, but used to make lending decisions
Figure: supervised learning pipeline, labeled with \(\{x_i, y_i\}\), \(x\), \(\widehat y\), and \((x, y)\)
Sometimes ML is actually RL
\(x=\) features about user, video
\(y=\) watchtime of entire remaining session
\(s=\) features about user
\(a=\) features about video
\(r(s,a)=\) watchtime of current video
\(Q^\pi(s, a) = \mathbb E[\sum r_t \mid s,a]=\) watchtime of remaining session
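To make the correspondence concrete, here is a minimal sketch of how the supervised label \(y\) can be built from a logged session: it is the undiscounted reward-to-go, i.e. a single-sample Monte Carlo estimate of \(Q^\pi(s,a)\) under whatever policy produced the log. The function name and the session values are hypothetical.

```python
def remaining_watchtime(watchtimes):
    """For each video in a session, total watch time from that video onward.

    watchtimes[t] plays the role of r(s_t, a_t); the returned value at index t
    is the regression label y for the t-th (user, video) pair.
    """
    targets = []
    running = 0.0
    # Accumulate from the end of the session backward.
    for r in reversed(watchtimes):
        running += r
        targets.append(running)
    targets.reverse()
    return targets


# Hypothetical session: seconds watched for three consecutive videos.
session = [30.0, 120.0, 45.0]
print(remaining_watchtime(session))  # [195.0, 165.0, 45.0]
```

Regressing on these labels is value estimation for the logging policy, even though the pipeline may be described purely as supervised learning.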
"When a measure becomes a target, it ceases to be a good measure"
Goodhart's law
ML and social dynamics
BuzzFeed noticed the success of content that exploited racial divisions, fad/junky science, extremely disturbing news and gross images.
ML and social dynamics
"Some political parties in Europe told Facebook the algorithm had made them shift their policy positions so they resonated more on the platform, according to the documents."
"Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable."
Ursula Franklin, 1989
ExoFeedback
control feedback
data feedback
external feedback
"...social, economic, and political context..."
"...neither foreseen nor foreseeable..."