Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Real-World RL
2. Specification & Risks
3. Does RL Work?
4. All ML is RL
AlphaGo vs. Lee Sedol, 2016
(offline)
Kuaishou
offline contextual bandits
2. Specification & Risks
Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)
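(Diagram: agent with policy \(\pi\) takes action \(a_t\); environment with transitions \(P\) returns state \(s_t\) and reward \(r_t\); discount \(\gamma\))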
actions & states determine environment
discount & reward determine objective
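A minimal sketch of the tuple \(\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\}\) as a data structure, assuming finite state and action sets and a deterministic policy. The class name `TabularMDP`, the helper `policy_value`, and the two-state example are hypothetical and only illustrate how \(P\), \(r\), and \(\gamma\) jointly determine the value of a policy.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TabularMDP:
    P: np.ndarray   # shape (S, A, S): P[s, a, s2] = probability of moving s -> s2 under a
    r: np.ndarray   # shape (S, A): reward for taking action a in state s
    gamma: float    # discount factor in [0, 1)


def policy_value(mdp, pi):
    """Exact V^pi for a deterministic policy pi (one action index per state).

    Solves the linear Bellman equation V = r_pi + gamma * P_pi V.
    """
    n_states = mdp.r.shape[0]
    idx = np.arange(n_states)
    P_pi = mdp.P[idx, pi]   # (S, S) transition matrix under pi
    r_pi = mdp.r[idx, pi]   # (S,) reward vector under pi
    return np.linalg.solve(np.eye(n_states) - mdp.gamma * P_pi, r_pi)


# Hypothetical example: action 0 stays put, action 1 switches state,
# and only state 1 pays reward 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
r = np.array([[0.0, 0.0], [1.0, 1.0]])
mdp = TabularMDP(P=P, r=r, gamma=0.9)

v_move_then_stay = policy_value(mdp, np.array([1, 0]))
v_always_stay = policy_value(mdp, np.array([0, 0]))
```

Changing `P` changes the environment, while changing `r` or `gamma` changes only the objective the same environment is scored against.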
Small discount factor leads to short-sighted agent
\(V^{a_0}(s_0) = \frac{\epsilon}{1-\gamma}\) and \(V^{a_1}(s_0) = \frac{1}{1-\gamma} - \frac{2\epsilon}{\gamma}\)
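A small numerical illustration of the short-sightedness effect. This is not the MDP from the slide, whose transitions are not given in the text; it is a hypothetical two-option comparison. The first option pays \(\epsilon\) at every step, which gives the same \(\frac{\epsilon}{1-\gamma}\) form as above. The second option, an assumption made for illustration, pays 0 now and 1 at every later step.

```python
def constant_stream_value(reward, gamma):
    """Value of receiving `reward` at every step: sum_t gamma^t * reward."""
    return reward / (1.0 - gamma)


def delayed_stream_value(gamma):
    """Value of receiving 0 now and 1 at every later step: sum_{t>=1} gamma^t."""
    return gamma / (1.0 - gamma)


def preferred(eps, gamma):
    """Which hypothetical option a value-maximizing agent picks."""
    v_now = constant_stream_value(eps, gamma)
    v_later = delayed_stream_value(gamma)
    return "immediate" if v_now > v_later else "delayed"


eps = 0.2
for gamma in (0.1, 0.5, 0.9):
    print(gamma, preferred(eps, gamma))
```

In this toy comparison the agent takes the small immediate reward exactly when \(\epsilon > \gamma\), so shrinking the discount factor flips the choice toward the short-term option.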
The promise of RL:
translate specified objective into desired behavior
The reality:
"While everyone seemed focused on how many views a video got, we thought the amount of time someone spent watching a video was a better way to understand whether a viewer really enjoyed it."
YouTube in 2014 vs. 2018
Facebook's "Meaningful Social Interaction" metric
"Misinformation, toxicity, and violent content are inordinately prevalent among reshares"
Inverse Reward Design (NeurIPS, 2017)
Idea: treat specified reward as imperfect proxy
Then attempt to learn true reward from other feedback
Directly related to learning human preferences and RLHF
The interface through which the agent sees and impacts the world
Also delimits reasoning about the world
(Diagram: state \(s_t\) and action \(a_t\) as the agent's interface to the world)
Evolving an oscillator on hardware (Bird & Layzell, 2002)
Result: a "network of transistors sensing and utilising the radio waves emanating from nearby PCs"
The first Tesla autopilot fatality in 2016
Safety systems failed to detect white truck against bright sky
"vehicles [...] will no longer be equipped with radar. Instead, these will [...] rely on camera vision and neural net processing." (Tesla, 2021)
Learning to influence other drivers
Excessive caution around other drivers
Excessive aggression
Example adapted from Anca Dragan
Emotionally charged content effectively grabs attention
3. Does RL Work?
Three strikes against RL:
1. Model-based design and optimization works better
ex - Model Predictive Control at Boston Dynamics
data-driven optimization suffers from local minima, large sample complexity (Deep RL doesn't work yet, 2018)
2. Simulation essentially necessary, but huge sim2real gap
RL exploits bugs in simulator code (Nathan Lambert, 2021)
3. Questionable evaluation practices
State-of-the-art algorithms outperformed by simple baselines: "Simple random search provides a competitive approach to reinforcement learning", 2018
This perspective ignores the instance-specific tuning that often goes into making RL algorithms work
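To make the "simple baseline" in the third strike concrete, here is a bare-bones sketch of random search over parameters using two-sided finite differences along random directions. It is a simplified illustration of the general idea, not the cited paper's full method, which adds further refinements not shown here. The function names, hyperparameter values, and the quadratic `toy_reward` are all hypothetical stand-ins; a real use would replace `toy_reward` with the return of a policy rollout.

```python
import numpy as np


def basic_random_search(reward_fn, theta0, n_iters=200, n_dirs=8,
                        step=0.05, noise=0.1, seed=0):
    """Derivative-free ascent: probe reward at theta +/- noise * delta."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        deltas = rng.standard_normal((n_dirs, theta.size))
        update = np.zeros_like(theta)
        for d in deltas:
            r_plus = reward_fn(theta + noise * d)
            r_minus = reward_fn(theta - noise * d)
            # Directions that improved reward get positive weight.
            update += (r_plus - r_minus) * d
        theta += step / n_dirs * update
    return theta


def toy_reward(theta):
    """Hypothetical objective: negative squared distance to a fixed target."""
    target = np.array([1.0, -2.0])
    return -np.sum((theta - target) ** 2)


theta_hat = basic_random_search(toy_reward, theta0=[0.0, 0.0])
print(theta_hat, toy_reward(theta_hat))
```

Even this sketch has a step size, a perturbation scale, and a number of directions to choose, which is the kind of instance-specific tuning the counterpoint above refers to.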
4. All ML is RL
ex - credit-score designed within supervised learning framework, but used to make lending decisions
(Diagram: supervised learning with training data \(\{x_i, y_i\}\), features \(x\), prediction \(\widehat y\), and samples \((x, y)\))
\(x=\) features about user, video
\(y=\) watch-time of entire remaining session
\(s=\) features about user
\(a=\) features about video
\(r(s,a)=\) watch-time of current video
\(Q^\pi(s, a) = \mathbb E[\sum r_t| s,a]=\) watch-time of remaining session
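The correspondence above can be sketched in a few lines: the supervised label \(y\) (watch-time of the entire remaining session) is the reward-to-go computed from per-video rewards, i.e. a single-sample estimate of \(Q^\pi(s, a)\) under the undiscounted sum written above. The session data and the function name are hypothetical.

```python
def remaining_session_watch_time(watch_times):
    """Reward-to-go: entry t is the sum of watch_times[t:]."""
    totals = []
    running = 0.0
    # Accumulate from the end of the session backwards.
    for w in reversed(watch_times):
        running += w
        totals.append(running)
    return totals[::-1]


# Hypothetical session: seconds watched for each recommended video, in order.
session = [30.0, 120.0, 5.0, 45.0]
labels = remaining_session_watch_time(session)
print(labels)  # [200.0, 170.0, 50.0, 45.0]
```

A regressor trained on `(x, y)` pairs built this way is fitting a value function for whatever recommendation policy generated the logged sessions.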
"When a measure becomes a target, it ceases to be a good measure"
Goodhart's law
"Buzzfeed noticed the success of content that exploited racial divisions, fad/junky science, extremely disturbing news and gross images.
Some political parties in Europe told Facebook the algorithm had made them shift their policy positions so they resonated more on the platform, according to the documents."
“Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable.”
Ursula Franklin, 1989
control feedback
data feedback
external feedback
"...social, economic, and political context..."
"...neither foreseen nor foreseeable..."