CS 4/5789: Introduction to Reinforcement Learning

Lecture 26: Societal Implications

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

Homework
- 5789 Paper Reviews due weekly on Mondays
- PA 4 due tonight
- Midterm corrections due Monday
  - Accepted up until final (no late penalty)
Final exam is Saturday 5/13 at 2pm
- Length: 2 hours
- Location: 155 Olin
- Review lecture Monday
Course evaluations open! Participation credit

Agenda

1. Real-World RL

2. Specification & Risks

3. Does RL Work?

4. All ML is RL

PollEv

Real World RL

AlphaGo vs. Lee Sedol, 2016

Real World RL

How Does AI Improve Human Decision-Making? Evidence from the AI-Powered Go Program, 2021.

Adversarial Policies Beat Superhuman Go AIs, 2022.

Real World RL

AlphaGo Zero (2017)
- Replaces imitation learning with random exploration, uses MCTS during self-play
AlphaZero (2018)
- Generalizes beyond Go to Chess and Shogi
MuZero (2020)
- Generalizes to Atari by not requiring known dynamics \(f\)
- Applied to video compression in 2022

Real World RL

MuZero with Self-competition for Rate Control in VP9 Video Compression, 2022

Real World RL

Magnetic control of tokamak plasmas through deep reinforcement learning, 2022

Real World RL

Autonomous navigation of stratospheric balloons using reinforcement learning, 2020

Real World RL

RL for Amazon Ads and Conversational Music Recommendation, 2022

(offline)

Real World RL

Reinforcing User Retention in a Billion Scale Short Video Recommender System, 2023

Kuaishou

Real World RL

Illustrating Reinforcement Learning from Human Feedback (RLHF), 2022

offline contextual bandits

Agenda

1. Real-World RL

2. Specification & Risks

3. Does RL Work?

4. All ML is RL

RL Specification

Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)

[Diagram labels: \(s_t\), \(r_t\), \(a_t\), \(\pi\), \(\gamma\), \(P\)]

action space and discount known
states and reward signals observed
transition probabilities unknown

actions & states determine environment

discount & reward determine objective

Specifying Horizon/Discount

Small discount factor leads to short-sighted agent

\(0\) cost for \(a_0\)
\(2\epsilon\) cost for \(a_1\)
\(\epsilon\) reward in \(s_0\)
\(1\) reward in \(s_1\)

\(V^{a_0}(s_0) = \frac{\epsilon}{1-\gamma}\) and \(V^{a_1}(s_0) = \frac{1}{1-\gamma} - \frac{2\epsilon}{\gamma}\)
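To see the short-sightedness numerically, the sketch below takes the two closed-form values above at face value and compares them across discount factors. The function names and the choice \(\epsilon = 0.1\) are illustrative assumptions, not part of the lecture. Setting the two values equal gives the threshold \(\gamma^\star = \frac{2\epsilon}{1+\epsilon}\): below it the agent prefers \(a_0\) and never pays the small cost to reach the high-reward state.

```python
def value_a0(eps: float, gamma: float) -> float:
    """V^{a_0}(s_0) as stated above: eps / (1 - gamma)."""
    return eps / (1.0 - gamma)


def value_a1(eps: float, gamma: float) -> float:
    """V^{a_1}(s_0) as stated above: 1 / (1 - gamma) - 2 * eps / gamma."""
    return 1.0 / (1.0 - gamma) - 2.0 * eps / gamma


def gamma_threshold(eps: float) -> float:
    """Discount at which the two values are equal: 2 * eps / (1 + eps)."""
    return 2.0 * eps / (1.0 + eps)


def preferred_action(eps: float, gamma: float) -> str:
    """Action with the larger stated value at s_0."""
    return "a_1" if value_a1(eps, gamma) > value_a0(eps, gamma) else "a_0"


if __name__ == "__main__":
    eps = 0.1  # illustrative choice
    print(f"threshold gamma = {gamma_threshold(eps):.4f}")
    for gamma in (0.05, 0.1, 0.2, 0.5, 0.9):
        print(gamma, preferred_action(eps, gamma))
```

With a small \(\epsilon\) the threshold is close to \(2\epsilon\), so only a very myopic agent picks \(a_0\); the point of the slide is that such an agent exists for any fixed \(\epsilon > 0\).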

Specifying Reward

The promise of RL:

translate specified objective into desired behavior

The reality:

Risk: Reward Hacking

Faulty Reward Functions in the Wild

Risk: Reward Hacking

"While everyone seemed focused on how many views a video got, we thought the amount of time someone spent watching a video was a better way to understand whether a viewer really enjoyed it."

You know what’s cool? A billion hours, 2017.

YouTube in 2014 vs. 2018

Risk: Reward Hacking

Facebook's "Meaningful Social Interaction" metric

"Misinformation, toxicity, and violent content are inordinately prevalent among reshares"

Reward Design

Inverse Reward Design (NeurIPS, 2017)

Idea: treat specified reward as imperfect proxy

Then attempt to learn true reward from other feedback

Directly related to learning human preferences and RLHF

Specifying States & Actions

The interface through which the agent sees and impacts the world

Also delimits reasoning about the world

[Diagram labels: \(s_t\), \(a_t\)]

Risk: Too Much Information

Evolving an oscillator on hardware (Bird & Layzell, 2002)

Result: a "network of transistors sensing and utilising the radio waves emanating from nearby PCs"

Risk: Too Little Information

The first Tesla autopilot fatality in 2016

Safety systems failed to detect white truck against bright sky

"vehicles [...] will no longer be equipped with radar. Instead, these will [...] rely on camera vision and neural net processing." (Tesla, 2021)

Risk: Inappropriate Actuation

Learning to influence other drivers

Excessive caution around other drivers

Excessive aggression

Example adapted from Anca Dragan

Risk: Inappropriate Actuation

Emotionally charged content effectively grabs attention

Agenda

1. Real-World RL

2. Specification & Risks

3. Does RL Work?

4. All ML is RL

Does RL Work?

Three strikes against RL:

1. Model-based design and optimization works better

ex - Model Predictive Control at Boston Dynamics

Does RL Work?

Three strikes against RL:

1. Model-based design and optimization works better

Data-driven optimization suffers from local minima and large sample complexity (Deep RL doesn't work yet, 2018)

Does RL Work?

Three strikes against RL:

2. Simulation essentially necessary, but huge sim2real gap

RL exploits bugs in simulator code (Nathan Lambert, 2021)

Does RL Work?

Three strikes against RL:

3. Questionable evaluation practices

Deep Reinforcement Learning at the Edge of the Statistical Precipice

State-of-the-art algorithms outperformed by simple baselines: Simple random search provides a competitive approach to reinforcement learning, 2017
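To give a sense of how simple such a baseline can be, here is a minimal sketch of a basic random-search update: perturb the parameters in random directions, evaluate the return at the paired perturbations, and step along the return difference. This is a simplified illustration under stated assumptions, not the paper's full method; `toy_return` is a hypothetical stand-in for a rollout return, and the hyperparameters are arbitrary.

```python
import random


def toy_return(theta):
    """Hypothetical stand-in for a rollout return; maximized at theta = (1, -2)."""
    target = (1.0, -2.0)
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def random_search(reward_fn, theta, iters=200, n_dirs=4, step=0.05, noise=0.1, seed=0):
    """Basic random search over parameters using paired +/- perturbations."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(iters):
        update = [0.0] * len(theta)
        for _ in range(n_dirs):
            # Sample a random direction and evaluate both perturbed parameters.
            delta = [rng.gauss(0.0, 1.0) for _ in theta]
            plus = reward_fn([t + noise * d for t, d in zip(theta, delta)])
            minus = reward_fn([t - noise * d for t, d in zip(theta, delta)])
            # Weight the direction by the difference in return.
            for i, d in enumerate(delta):
                update[i] += (plus - minus) * d
        # Average over directions and take a step.
        theta = [t + step / n_dirs * u for t, u in zip(theta, update)]
    return theta


if __name__ == "__main__":
    start = [0.0, 0.0]
    final = random_search(toy_return, start)
    print("return before:", toy_return(start))
    print("return after: ", toy_return(final))
```

No gradients, value functions, or neural networks are involved; the only access to the problem is through evaluating returns, which is why this kind of method is a natural sanity-check baseline.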

Generality?

This perspective ignores the instance-specific tuning that often goes into making RL algorithms work

AlphaGo Zero (2017)
- Replaces imitation learning with random exploration, uses MCTS during self-play
AlphaZero (2018)
- Generalizes beyond Go to Chess and Shogi
MuZero (2020)
- Generalizes to Atari by not requiring known dynamics \(f\)
- Applied to video compression in 2022
Large pretrained models (e.g. GPT-X, 2018-present)
- Arguably entirely imitation-based

Generality?

Agenda

1. Real-World RL

2. Specification & Risks

3. Does RL Work?

4. All ML is RL

All ML is RL once deployed

ex - credit-score designed within supervised learning framework, but used to make lending decisions

[Diagram labels: \(\{x_i, y_i\}\), \(x\), \(\widehat y\), \((x, y)\)]

Sometimes ML is actually RL

\(x=\) features about user, video

\(y=\) watch-time of entire remaining session

\(s=\) features about user

\(a=\) features about video

\(r(s,a)=\) watch-time of current video

\(Q^\pi(s, a) = \mathbb E\left[\sum_t r_t \mid s, a\right]=\) watch-time of remaining session
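The correspondence can be made concrete with a small sketch. Assuming an undiscounted session and a hypothetical list of per-video watch-times, the supervised label \(y\) at each step is exactly the reward-to-go, i.e. a single-sample Monte Carlo estimate of \(Q^\pi(s, a)\) under whatever policy generated the log. The function name and example numbers are illustrative.

```python
def remaining_watch_time(watch_times, gamma=1.0):
    """Reward-to-go at each step: watch-time of the current and all later videos.

    watch_times[t] plays the role of r(s_t, a_t); the output at index t is the
    supervised label y for the example logged at step t.
    """
    returns = []
    running = 0.0
    # Accumulate from the end of the session backwards.
    for r in reversed(watch_times):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    session = [30.0, 120.0, 45.0]  # hypothetical seconds watched per video
    print(remaining_watch_time(session))
```

Regressing \(y\) on features of the user and video therefore fits a Q-function for the logging policy, whether or not the system was designed with RL in mind.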

Manipulation of Social Program Eligibility, 2011

"When a measure becomes a target, it ceases to be a good measure"

Goodhart's law

ML and social dynamics

Creators are making longer videos to cater to the YouTube algorithm, 2018

"When a measure becomes a target, it ceases to be a good measure"

Goodhart's law

ML and social dynamics

Buzzfeed noticed the success of content that exploited racial divisions, fad/junky science, extremely disturbing news and gross images.

ML and social dynamics

"Some political parties in Europe told Facebook the algorithm had made them shift their policy positions so they resonated more on the platform, according to the documents."

“Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable.”

Ursula Franklin, 1989

Exo-Feedback

control feedback

data feedback

external feedback

"...social, economic, and political context..."

"...neither foreseen nor foreseeable..."