CS 4/5789: Introduction to Reinforcement Learning

Lecture 1: Introduction

Prof. Sarah Dean

MW 2:55-4:15pm
255 Olin Hall



1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)


Robotic Manipulation

Algorithmic Media Feeds

Interactive Applications





a policy maps observations to actions

design policy to achieve high reward

RL is for Sequential Decision-Making










  • Instructor: Prof. Sarah Dean
  • Head TAs: Promise Ekpo and Chia-Hsiang Kao
  • TAs: Daniel Mistrik, Hangyu Zhou, Daniel Cao, Rohan Singh, Arnav Agrawal, Alexis Hao, Yijia Dai, Owen Oertell


  • Contact: Ed Discussion
  • Instructor Office Hours: TBA
  • TA Office Hours: TBA, see Ed Discussion

Waitlist and Enrollment

There should be plenty of space!


Course staff do not manage waitlist and enrollment.

CS enrollment policies:


  • Two in-class prelims
    • Monday March 4th
    • Wednesday April 10th
  • Final exam during finals period, time TBD


  • Homework assignments
    • ~8 problem sets (math)
    • ~4 programming assignments (coding)
  • Gradescope
    • neatly written, ideally typeset with LaTeX
  • 5789: Paper review assignments (after Unit 1)
  • Collaboration Policy: discussion is fine, but write your own solutions and code, and do not look at others' work or let others look at yours
  • Late Policy: six slip days, use no more than three at once (see Syllabus)


Participation is 10% of the final grade, scored out of 20 points

  • Lecture participation = 1pt each
    • Poll Everywhere: PollEv.com/sarahdean011
  • Helpful posts on Ed Discussion = 2pt each
    • TA endorsement


  • Unit 1: Fundamentals of Planning and Control (Jan & Feb)
    • Markov Decision Processes, Dynamic Programming, Value and Policy Iteration, Continuous Control, Linear Quadratic Regulation
  • Unit 2: Learning in MDPs (Mar)
    • Estimation, Fitted Dynamic Programming, Policy Optimization
  • Unit 3: Exploration (Apr & May)
    • Multi-armed Bandits, Contextual Bandits, Imitation Learning
    • State of the art examples


Machine learning (e.g., CS 4780)

Background in probability, linear algebra, and programming.


Lecture Slides and Notes on Canvas

Extra Resources (not required)
RL Theory Book: https://rltheorybook.github.io/
Classic RL Book: Sutton & Barto (http://www.incompleteideas.net/book/RLbook2020.pdf)



Types of Machine Learning

  1. Unsupervised Learning
  2. Supervised Learning
  3. Reinforcement Learning

Unsupervised Learning

Examples: clustering, principal component analysis (PCA)

  • Goal:
    • summarization
  • Dataset:
    • information about many instances
    • \(\{x_1, x_2, \dots, x_N\}\)
  • Evaluation:
    • qualitative


Supervised Learning

Examples: classification, regression

  • Goal:
    • prediction
  • Dataset:
    • each instance has features and label
    • \(\{(x_1, y_1), \dots, (x_N,y_N)\}\)
  • Evaluation:
    • accuracy, \(y\) vs. \(\hat y\)


Reinforcement Learning

  • Goal:
    • action or decision
  • Dataset:
    • history of observations, actions, and rewards
    • sequential \(\{(o_t, a_t, r_t)\}_{t=1}^T\)
  • Evaluation:
    • cumulative reward
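
To make the dataset and the evaluation criterion concrete, here is a minimal Python sketch. The `Step` record and the `cumulative_reward` helper are hypothetical names chosen for illustration, and the three-step history is made-up toy data, not something from the course.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """One time step of interaction: (o_t, a_t, r_t)."""
    observation: int
    action: int
    reward: float


def cumulative_reward(history: List[Step]) -> float:
    """Sum the rewards collected over the whole history."""
    return sum(step.reward for step in history)


# Toy history with T = 3 steps (illustrative values only).
history = [Step(0, 1, 0.0), Step(1, 0, 1.0), Step(2, 1, 0.5)]
total = cumulative_reward(history)
print(total)
```

Unlike a supervised dataset, the order of the entries matters here: each observation may depend on the actions taken before it.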


Types of Machine Learning

  1. Unsupervised Learning
    • summarize unstructured data \(\{x_i\}_{i=1}^N\)
  2. Supervised Learning
    • predict labels from features \(\{(x_i, y_i)\}_{i=1}^N\)
  3. Reinforcement Learning
    • choose actions that lead to high reward
    • sequential data

Difficulties of Sequential Problems

Unlike other types of ML, in RL data may not be drawn "i.i.d." from some distribution

  1. May start with no data
  2. Actions have consequences
  3. Solving task may require long sequence of correct actions




ML Specification

  • Specifying supervised learning problem
    • feature space \(\mathcal X\) and label space \(\mathcal Y\)
    • distribution over feature and labels \((x,y)\sim \mathcal D\)
      • often empirical, i.e. defined by a dataset
    • pick a loss function to determine accuracy
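
The three ingredients above can be sketched in a few lines of Python. This is only an illustration under assumptions: the distribution is taken to be empirical (a small made-up dataset), the loss is the squared loss, and the names `squared_loss`, `empirical_risk`, and `predictor` are hypothetical.

```python
from typing import Callable, List, Tuple


def squared_loss(y: float, y_hat: float) -> float:
    """Loss comparing a true label y with a prediction y_hat."""
    return (y - y_hat) ** 2


def empirical_risk(
    predictor: Callable[[float], float],
    dataset: List[Tuple[float, float]],
    loss: Callable[[float, float], float],
) -> float:
    """Average loss of the predictor over an empirical distribution (a dataset)."""
    return sum(loss(y, predictor(x)) for x, y in dataset) / len(dataset)


# Toy dataset of (feature, label) pairs where y = 2x (illustrative values only).
dataset = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]


def predictor(x: float) -> float:
    return 2.0 * x


risk = empirical_risk(predictor, dataset, squared_loss)
print(risk)
```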



General setting








  1. Agent observes environment
  2. Agent takes action
  3. Environment sends reward and changes

Markov Decision Process (MDP)

  • state: \(s_t\)
  • action: \(a_t \sim \pi(s_t)\)
  • reward: \(r_t\sim r(s_t, a_t)\)
  • next state: \(s_{t+1}\sim P(s_t, a_t)\)

  1. Agent observes state of environment
  2. Agent takes action
    • depending on state according to policy
  3. Environment state updates (stochastically) according to transition function

Assumption on structure of observations and how they change


Markov Decision Process (MDP)

Key Markovian Assumption:

  • The state transition is independent of the past when conditioned on the current state and action
    • \(\mathbb P\{s_{t+1}=s\mid s_t, s_{t-1},\dots, s_0, a_t, \dots, a_0 \} = \mathbb P\{s_{t+1}=s\mid s_t, a_t \}\)
  • Similarly for the reward signal
  • Therefore, we write state transition and reward distribution as
    • \(s_{t+1}\sim P(s_t, a_t),\quad r_t\sim r(s_t, a_t)\)
  • Actions can be chosen based only on current state
    • \(a_t \sim \pi(s_t)\)
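
One consequence of the Markov assumption is that the probability of an entire trajectory factorizes into per-step terms. The worked equation below is a sketch under assumptions not stated in the slides: finite state and action spaces, an initial state distribution written \(\mu_0\) (a symbol introduced here only for illustration), and \(\pi(a\mid s)\), \(P(s'\mid s,a)\) denoting the probabilities that the policy and the transition function assign.

```latex
\mathbb P\{s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}, s_T\}
  = \mu_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)
```

Each factor depends only on the current state and action, which is what the assumption buys us: no term needs the full history.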

Markov Decision Process (MDP)


  1. Agent observes state of environment
  2. Agent takes action depending on state according to policy
  3. Environment returns reward and updates state according to the reward and transition functions


Example: robot manipulation

  • state: \(s\)
    • finger configuration and object pose
  • action: \(a\)
    • joint motor commands
  • transition: \(s'\sim P(s,a)\)
    • physical equations of motion (gravity, contact forces, friction)
  • policy: \(\pi(s)\)
    • maps configurations to motor commands
  • reward: \(r(s,a)\)
    • negative distance to goal (etc.)

Finite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\)

  • \(\mathcal{S}\) space of possible states \(s\in\mathcal S\)
  • \(\mathcal{A}\) space of possible actions \(a\in \mathcal{A}\)
  • \(r\) map from state, action to scalar reward
  • \(P\) stochastic map from current state and action to next state
  • \(H\) horizon length (positive integer)

Goal: achieve high cumulative reward

$$\sum_{t=0}^{H-1}  r_t$$
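
As a rough illustration of the tuple \(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\), the Python sketch below builds a tiny tabular MDP and samples one trajectory. Everything specific here is an assumption made for the example: the two-state problem, the class and function names, a deterministic reward function (the slides allow \(r_t\sim r(s_t, a_t)\) to be random), a deterministic policy, and a fixed start state.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FiniteHorizonMDP:
    states: List[int]                              # S
    actions: List[int]                             # A
    reward: Callable[[int, int], float]            # r(s, a), deterministic here
    transition: Dict[Tuple[int, int], List[float]] # P[(s, a)] = probs over next states
    horizon: int                                   # H


def rollout(mdp: FiniteHorizonMDP, policy: Callable[[int], int],
            start_state: int, rng: random.Random):
    """Sample one trajectory of length H and return it with its cumulative reward."""
    s = start_state
    total = 0.0
    trajectory = []
    for _ in range(mdp.horizon):
        a = policy(s)                # a_t chosen from the current state only
        r = mdp.reward(s, a)         # r_t = r(s_t, a_t)
        trajectory.append((s, a, r))
        total += r
        # s_{t+1} ~ P(s_t, a_t): depends only on the current state and action
        s = rng.choices(mdp.states, weights=mdp.transition[(s, a)])[0]
    return trajectory, total


# Toy problem: reward 1 for being in state 1; action 0 tries to stay, action 1 tries to switch.
mdp = FiniteHorizonMDP(
    states=[0, 1],
    actions=[0, 1],
    reward=lambda s, a: 1.0 if s == 1 else 0.0,
    transition={
        (0, 0): [0.9, 0.1], (0, 1): [0.1, 0.9],
        (1, 0): [0.1, 0.9], (1, 1): [0.9, 0.1],
    },
    horizon=5,
)


def policy(s: int) -> int:
    return 1 if s == 0 else 0  # switch when in state 0, stay when in state 1


trajectory, total = rollout(mdp, policy, start_state=0, rng=random.Random(0))
print(trajectory, total)
```

Because the rollout starts in state 0, the first reward is always 0, so the cumulative reward of this policy lies between 0 and \(H-1\).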

