CS 4/5789: Introduction to Reinforcement Learning

Lecture 1: Introduction

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Agenda

 

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)

Examples: AlphaGo, Robotic Manipulation, Algorithmic Media Feeds, ...

[Diagram: the agent receives an observation, takes an action, and receives a reward]

a policy maps observation to action

design policy to achieve high reward
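To make "a policy maps observation to action" concrete, here is a minimal sketch of a policy as a function from observations to actions. The observation encoding, the action set, and the linear scoring below are hypothetical choices for illustration, not part of the course material.

```python
import numpy as np

# A minimal sketch: a policy is a map from observation to action.
# The observation encoding, the action set, and the linear scoring
# are all hypothetical choices for illustration.
ACTIONS = ["left", "right", "stay"]

def policy(observation, weights):
    """Score each action linearly in the observation and pick the best one."""
    scores = weights @ observation          # one score per action
    return ACTIONS[int(np.argmax(scores))]

# Example usage with made-up numbers.
obs = np.array([0.2, -1.0, 0.5])
w = np.zeros((len(ACTIONS), obs.shape[0]))
print(policy(obs, w))                       # prints "left": all scores tie at 0
```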

RL is for Sequential Decision-Making

  • reaction
  • adaptation

Sequential Decision-Making

[Diagram: for each example (AlphaGo, Robotic Manipulation, Media Feeds), what are the observation, action, and reward?]

Agenda

 

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)

Logistics

  • Instructor: Prof. Sarah Dean
  • Head TAs: Rohan Banerjee and Runzhe Wu
  • TAs & Consultant: Yann Hicke, Jenna Fields, Ruizhe Wang, Brandon Man, Yijia Dai, Patrick Yuan, Yingbing Huang

 

  • Contact: Ed Discussion
  • Instructor Office Hours: Wednesday 4-4:50pm in 255 Olin, additional time TBD
  • TA Office Hours: TBD, see Ed Discussion

Waitlist and Enrollment

There should be plenty of space!

 

Course staff do not manage waitlist and enrollment.

CS enrollment policies:
https://www.cs.cornell.edu/courseinfo/enrollment

Exams

  • Prelim on March 15 in class
  • Final exam during finals period, time TBD

Homework

  • Homework assignments
    • ~8 problem sets (math)
    • ~4 programming assignments (coding)
  • Gradescope
    • neatly written, ideally typeset with LaTeX
  • 5789: Paper review assignments (after Unit 1)
  • Collaboration Policy: discussion is fine, but write your own solutions and code, and do not look at others' solutions or let others look at yours
  • Late Policy: penalties can be avoided by requesting extensions on Ed (private post)

Participation

Participation is 5% of the final grade, out of 20 points

  • Lecture participation = 1pt each
    • Poll Everywhere: PollEv.com/sarahdean011
  • Helpful posts on Ed Discussion = 2pt each
    • TA endorsement

Schedule

  • Unit 1: Fundamentals of Planning and Control (Jan & Feb)
    • Imitation learning, Markov Decision Processes, Dynamic Programming, Value and Policy Iteration, Continuous Control, Linear Quadratic Regulation
  • Unit 2: Learning in MDPs (Mar)
    • Estimation, Model-based RL, Approximate Dynamic Programming, Policy Optimization
  • Unit 3: Exploration (Apr & May)
    • Multi-armed Bandits, Contextual Bandits
    • State of the art examples

Prerequisites

Machine learning (e.g., CS 4780)

Background in probability, linear algebra, and programming.

Materials

Lecture Slides and Notes

Extra Resources (not required)
RL Theory Book: https://rltheorybook.github.io/
Classic RL Book:  Sutton & Barto (http://www.incompleteideas.net/book/RLbook2020.pdf)

Agenda

 

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)

Types of Machine Learning

  1. Unsupervised Learning
  2. Supervised Learning
  3. Reinforcement Learning

Unsupervised Learning

Examples: clustering, principal component analysis (PCA)

  • Goal:
    • summarization
  • Dataset:
    • information about many instances
    • \(\{x_1, x_2, \dots x_N\}\)
  • Evaluation:
    • qualitative

"descriptive"

Supervised Learning

Examples: classification, regression

  • Goal:
    • prediction
  • Dataset:
    • each instance has features and label
    • \(\{(x_1, y_1), \dots (x_N,y_N)\}\)
  • Evaluation:
    • accuracy, \(y\) vs. \(\hat y\)

"predictive"

Reinforcement Learning

  • Goal:
    • action or decision
  • Dataset:
    • history of observations, actions, and rewards
    • sequential \(\{(o_t, a_t, r_t)\}_{t=1}^T\)
  • Evaluation:
    • cumulative reward

"presciptive"

Types of Machine Learning

  1. Unsupervised Learning
    • summarize unstructured data \(\{x_i\}_{i=1}^N\)
  2. Supervised Learning
    • predict labels from features \(\{(x_i, y_i)\}_{i=1}^N\)
  3. Reinforcement Learning
    • choose actions that lead to high reward
    • sequential data

Difficulties of Sequential Problem

Unlike other types of ML, in RL the data may not be drawn i.i.d. from a fixed distribution:

  1. We may start with no data
  2. Actions have consequences
  3. Solving the task may require a long sequence of correct actions

[Diagram: the interaction unfolds over time: \(a_t \to (o_t, r_t) \to a_{t+1} \to (o_{t+1}, r_{t+1}) \to \dots\)]
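As a sketch of how this sequential history \(\{(o_t, a_t, r_t)\}\) gets generated, the loop below collects a trajectory. The `env.reset`/`env.step` interface and `policy` are hypothetical stand-ins (loosely modeled on common RL simulators), not something prescribed by the lecture.

```python
# Minimal sketch of the agent-environment loop that generates the sequential
# history {(o_t, a_t, r_t)}. The `env` and `policy` objects are hypothetical.
def collect_trajectory(env, policy, T=100):
    history = []                              # holds (o_t, a_t, r_t) tuples
    obs = env.reset()                         # initial observation; no data before this
    for t in range(T):
        action = policy(obs)                  # action depends on the current observation
        next_obs, reward = env.step(action)   # actions have consequences: they change future observations
        history.append((obs, action, reward))
        obs = next_obs
    return history
```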

ML Specification

  • Specifying supervised learning problem
    • feature space \(\mathcal X\) and label space \(\mathcal Y\)
    • distribution over feature and labels \((x,y)\sim \mathcal D\)
      • often empirical, i.e. defined by a dataset
    • pick a loss function to determine accuracy
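As a small illustration of this kind of specification, here is a hypothetical sketch: the feature/label spaces, the tiny empirical dataset, and the squared loss are illustrative choices only.

```python
import numpy as np

# Hypothetical specification of a supervised learning problem:
# feature space X = R^2, label space Y = R, an empirical distribution
# given by a small dataset, and a squared loss.
dataset = [(np.array([0.0, 1.0]), 0.5),
           (np.array([1.0, -1.0]), 2.0)]      # pairs (x_i, y_i)

def squared_loss(y, y_hat):
    """Loss function used to measure accuracy of a prediction."""
    return (y - y_hat) ** 2

def empirical_risk(predictor):
    """Average loss over the empirical distribution (the dataset)."""
    return np.mean([squared_loss(y, predictor(x)) for x, y in dataset])

# Example: evaluate a made-up linear predictor.
print(empirical_risk(lambda x: x @ np.array([1.0, 0.5])))
```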

Agenda

 

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)

General setting

[Diagram: the agent receives observation \(o_t\), takes action \(a_t\), receives reward \(r_t\), and the environment produces the next observation \(o_{t+1}\)]

  1. Agent observes environment
  2. Agent takes action
  3. Environment sends reward and changes

Markov Decision Process (MDP)

[Diagram: the agent observes state \(s_t\), takes action \(a_t \sim \pi(s_t)\), receives reward \(r_t\sim r(s_t, a_t)\), and the state updates as \(s_{t+1}\sim P(s_t, a_t)\)]

  1. Agent observes state of environment
  2. Agent takes action
    • depending on state according to policy
  3. Environment state updates (stochastically) according to transition function

The MDP is an assumption on the structure of observations (the state \(s_t\)) and how they change.

Markov Decision Process (MDP)

Key Markovian Assumption:

  • The state transition is independent of the past when conditioned on the current state and action
    • \(\mathbb P\{s_{t+1}=s\mid s_t, s_{t-1},\dots, s_0, a_t, \dots, a_0 \} = \mathbb P\{s_{t+1}=s\mid s_t, a_t \}\)
  • Similarly for the reward signal
  • Therefore, we write state transition and reward distribution as
    • \(s_{t+1}\sim P(s_t, a_t),\quad r_t\sim r(s_t, a_t)\)

  • Actions can be chosen based only on current state

    • \(a_t \sim \pi(s_t)\)
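To see how these objects fit together, here is a minimal sketch of sampling a trajectory from a small finite MDP. The two-state transition probabilities, rewards, and the fixed policy are made up for illustration.

```python
import numpy as np

# Hypothetical two-state, two-action MDP used only to illustrate the notation.
# P[s][a] is a distribution over next states; R[s][a] is a (deterministic) reward.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.5, 1: 2.0}}

def pi(s):
    """A made-up policy: always choose action 1."""
    return 1

def sample_trajectory(s0=0, T=5, seed=0):
    rng = np.random.default_rng(seed)
    s, traj = s0, []
    for t in range(T):
        a = pi(s)                              # a_t ~ pi(s_t): depends only on the current state
        r = R[s][a]                            # r_t ~ r(s_t, a_t)
        s_next = rng.choice(2, p=P[s][a])      # s_{t+1} ~ P(s_t, a_t): Markovian transition
        traj.append((s, a, r))
        s = s_next
    return traj

print(sample_trajectory())
```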

Markov Decision Process (MDP)

[Diagram: the agent observes state \(s_t\), takes action \(a_t \sim \pi(s_t)\), receives reward \(r_t\sim r(s_t, a_t)\), and the state updates as \(s_{t+1}\sim P(s_t, a_t)\)]

  1. Agent observes state of environment
  2. Agent takes action depending on state according to policy
  3. Environment returns reward and updates state according to reward/transition function

Example: Robotic Manipulation

  • state: \(s\)
    • finger configuration and object pose
  • action: \(a\)
    • joint motor commands
  • transition: \(s'\sim P(s,a)\)
    • physical equations of motion (gravity, contact forces, friction)
  • policy: \(\pi(s)\)
    • maps configurations to motor commands
  • reward: \(r(s,a)\)
    • negative distance to goal (etc)

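As one hedged illustration of the reward above, here is a sketch of "negative distance to goal"; the state layout, field names, and numbers are hypothetical.

```python
import numpy as np

# Hypothetical reward for the manipulation example: negative distance between
# the object's position (assumed to be part of the state) and a goal position.
def reward(state, action, goal_position):
    return -np.linalg.norm(state["object_position"] - goal_position)

# Example with made-up numbers.
s = {"object_position": np.array([0.1, 0.0, 0.3])}
print(reward(s, None, goal_position=np.array([0.0, 0.0, 0.5])))
```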

Infinite Horizon Discounted MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)

  • \(\mathcal{S}\) space of possible states \(s\in\mathcal S\)
  • \(\mathcal{A}\) space of possible actions \(a\in \mathcal{A}\)
  • \(r\) stochastic map from state, action to scalar reward
  • \(P\) stochastic map from current state and action to next state
  • \(\gamma\) discount factor between \(0\) and \(1\)

Goal: achieve high cumulative reward:

$$\sum_{t=0}^\infty \gamma^t r_t$$
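As a quick illustration of this discounted sum, the sketch below computes \(\sum_t \gamma^t r_t\) for a finite prefix of a reward sequence; the reward values are made up.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite prefix of a reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example with made-up rewards: later rewards are weighted less.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```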

Infinite Horizon Discounted MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)

  • \(\mathcal{S}\) space of possible states \(s\in\mathcal S\)
  • \(\mathcal{A}\) space of possible actions \(a\in \mathcal{A}\)
  • \(r\) stochastic map from state, action to scalar reward
  • \(P\) stochastic map from current state and action to next state
  • \(\gamma\) discount factor between \(0\) and \(1\)

maximize over \(\pi\):   \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
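One hedged way to make this objective concrete is to estimate the expected discounted return of a fixed policy by averaging rollouts. The sketch below uses a hypothetical two-state MDP and a truncated horizon; none of it is prescribed by the course.

```python
import numpy as np

# Monte Carlo estimate of E[sum_t gamma^t r(s_t, a_t)] under a fixed policy.
# The MDP (P, R), the policy, and all numbers below are hypothetical.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.5, 1: 2.0}}
gamma = 0.9

def pi(s):
    return 1                                   # a fixed, made-up policy

def estimate_return(s0=0, horizon=200, n_rollouts=1000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t in range(horizon):               # truncate the infinite sum; gamma^t is negligible for large t
            a = pi(s)
            ret += gamma**t * R[s][a]
            s = rng.choice(2, p=P[s][a])
        total += ret
    return total / n_rollouts

print(estimate_return())
```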

Recap

 

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)