Shen Shen
April 18, 2025
11am, Room 10-250
Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))
Reinforcement Learning with Human Feedback
Policy Evaluation
State Value Functions \(\mathrm{V}^{\pi}\)
Bellman recursions and Bellman equations
Policy Optimization
Optimal policies \(\pi^*\)
Optimal action value functions: \(\mathrm{Q}^*\)
Value iteration
Research area initiated in the 50s by Bellman, known under various names:
Stochastic optimal control (Control theory)
Stochastic shortest path (Operations research)
Sequential decision making under uncertainty (Economics)
Reinforcement learning (Artificial intelligence, Machine learning)
A rich variety of accessible and elegant theory, math, algorithms, and applications, but also considerable variation in notation.
We will use the most RL-flavored notation.
Running example: Mario in a grid-world
Normally, we get to the “intended” state; e.g., in state (7), action “↑” gets to state (4).
If an action would take Mario out of the grid world, he stays put; e.g., in state (9), “→” gets back to state (9).
In state (6), action “↑” leads to two possibilities: 20% chance to (2) and 80% chance to (3).
Mario in a grid-world, cont'd
Rewards are attached to (state, action) pairs; the figure marks, e.g., the reward of \((3, \downarrow)\), of \((3,\uparrow)\), of \((6, \downarrow)\), and of \((6,\rightarrow)\).
Markov Decision Processes - Definition and terminologies
Transition examples in the grid world:
\(\mathrm{T}\left(7, \uparrow, 4\right) = 1\)
\(\mathrm{T}\left(9, \rightarrow, 9\right) = 1\)
\(\mathrm{T}\left(6, \uparrow, 3\right) = 0.8\)
\(\mathrm{T}\left(6, \uparrow, 2\right) = 0.2\)
Reward examples:
\(\mathrm{R}\left(3, \uparrow \right) = 1\)
\(\mathrm{R}\left(6, \rightarrow \right) = -10\)
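For concreteness, here is one possible (hypothetical) Python encoding of these ingredients; it fills in only the example entries above, and the rest of the grid world would be analogous:

```python
# Partial grid-world model, only the example entries quoted above.
# Transition model: (state, action) -> {next_state: probability}
T = {
    (7, "up"):    {4: 1.0},
    (9, "right"): {9: 1.0},
    (6, "up"):    {3: 0.8, 2: 0.2},
}

# Reward model: (state, action) -> immediate reward
R = {
    (3, "up"):    1,
    (6, "right"): -10,
}
```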
The goal of an MDP is to find a "good" policy.
Unrolling the interaction over time: at each step, the policy \(\pi(s)\) picks an action, the transition \(\mathrm{T}\left(s, a, s^{\prime}\right)\) produces the next state, and the reward \(\mathrm{R}(s, a)\) is received.
A trajectory (aka an experience, or a rollout) of horizon \(h\) is
\(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{h-1}, a_{h-1}, r_{h-1}\right)\)
Starting from the given initial state \(s_0\), the rest of the trajectory all depends on \(\pi\).
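For illustration, such a rollout could be sampled with a short function. This is a sketch under assumed conventions (NumPy arrays T of shape (S, A, S) and R of shape (S, A), and a deterministic policy pi stored as an array of action indices), not course-provided code:

```python
import numpy as np

def sample_trajectory(T, R, pi, s0, h, seed=0):
    """Roll out policy pi for h steps from state s0.

    T: transition probabilities, shape (S, A, S), T[s, a, s'] = P(s' | s, a)
    R: rewards, shape (S, A); pi: length-S array of action indices.
    Returns tau = [(s_0, a_0, r_0), ..., (s_{h-1}, a_{h-1}, r_{h-1})].
    """
    rng = np.random.default_rng(seed)
    tau, s = [], s0
    for _ in range(h):
        a = pi[s]                                   # action chosen by the policy
        r = R[s, a]                                 # immediate reward R(s, a)
        tau.append((s, a, r))
        s = rng.choice(len(T[s, a]), p=T[s, a])     # next state s' ~ T(s, a, .)
    return tau
```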
Starting in a given \(s_0\), how "good" is it to follow a policy \(\pi\) for \(h\) time steps?
One idea: add up the rewards received along the way, a sum of \(h\) terms.
But in the Mario game the rewards collected are random: if we start at \(s_0=6\) and the policy is \(\pi(s) =\uparrow, \forall s\), i.e., always up, the special transition out of state 6 makes the trajectory (and hence the sum of rewards) depend on chance. So we use the expected sum of rewards; in 6.390, this expectation is only w.r.t. the transition probabilities \(\mathrm{T}\left(s, a, s^{\prime}\right)\).
Policy Evaluation: State Value Functions \(\mathrm{V}^{\pi}\)
Definition: For a given policy \(\pi(s),\) the state value functions
\(\mathrm{V}_h^\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
Expanded form (\(h\) terms):
\(\mathrm{V}_h^\pi(s)=\mathbb{E}\left[\mathrm{R}\left(s_0, \pi\left(s_0\right)\right)+\gamma \mathrm{R}\left(s_1, \pi\left(s_1\right)\right)+\cdots+\gamma^{h-1} \mathrm{R}\left(s_{h-1}, \pi\left(s_{h-1}\right)\right) \mid s_0=s, \pi\right]\)
Example: evaluate the "\(\pi(s) = \uparrow\) for all \(s\)", i.e., the always-\(\uparrow\), policy in the grid world (states, rewards, and the one special transition out of state 6 as in the figure):
horizon \(h = 0\): no steps left, so \(\mathrm{V}_0^\pi(s) = 0\) for every state \(s\).
horizon \(h = 1\): Mario just receives the one-step reward, \(\mathrm{V}_1^\pi(s) = \mathrm{R}(s, \uparrow)\).
horizon \(h = 2\): the sum inside the expectation has 2 terms, the immediate reward plus the discounted reward from taking \(\uparrow\) once more.
horizon \(h = 3\): one more discounted term is added, and the pattern continues for larger \(h\).
Bellman Recursion
\(\mathrm{V}_h^\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) \mathrm{V}_{h-1}^\pi\left(s^{\prime}\right)\)
The horizon-\(h\) value in state \(s\) (the expected sum of discounted rewards, starting in state \(s\) and following policy \(\pi\) for \(h\) steps) decomposes into the immediate reward for taking the policy-prescribed action \(\pi(s)\) in state \(s\), plus the \((h-1)\)-horizon future values at a next state \(s^{\prime}\), summed up weighted by the probability of getting to that next state \(s^{\prime}\) and discounted by \(\gamma\).
Bellman Equations
If the horizon \(h\) goes to infinity, the finite-horizon Bellman recursions become the infinite-horizon Bellman equations
\(\mathrm{V}^{\pi}_{\infty}(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) \mathrm{V}^{\pi}_{\infty}\left(s^{\prime}\right)\)
which form a system of \(|\mathcal{S}|\) linear equations, one equation for each state.
Typically \(\gamma <1\) in the MDP definition, motivated to make \(\mathrm{V}^{\pi}_{\infty}(s):=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right]\) finite.
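In code, these \(|\mathcal{S}|\) linear equations can be solved directly. A minimal sketch, assuming the NumPy array representation from the rollout sketch above (T of shape (S, A, S), R of shape (S, A), policy pi as an array of action indices):

```python
import numpy as np

def evaluate_policy_infinite(T, R, pi, gamma):
    """Solve the Bellman equations V = R_pi + gamma * T_pi @ V for V^pi_infinity.

    Equivalent to the |S|-by-|S| linear system (I - gamma * T_pi) V = R_pi,
    where R_pi(s) = R(s, pi(s)) and T_pi(s, s') = T(s, pi(s), s').
    """
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]        # R(s, pi(s)),     shape (S,)
    T_pi = T[np.arange(S), pi, :]     # T(s, pi(s), s'), shape (S, S)
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
```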
Recall: For a given policy \(\pi(s),\) the (state) value functions
\(\mathrm{V}_h^\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
Quick summary: given an MDP and a policy \(\pi\), policy evaluation computes these values in two ways:
1. By summing \(h\) terms: expand the expectation directly from the definition.
2. By leveraging structure: apply the Bellman recursion (or, for \(h = \infty\), solve the Bellman equations).
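Here is a minimal sketch of approach 2 (the Bellman recursion), under the same assumed array conventions as the rollout sketch above; it is illustrative, not the course's reference implementation:

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma, h):
    """Finite-horizon policy evaluation: returns V_h^pi as a length-S array."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]        # R(s, pi(s))
    T_pi = T[np.arange(S), pi, :]     # T(s, pi(s), s')
    V = np.zeros(S)                   # V_0^pi = 0: no steps left
    for _ in range(h):
        # Bellman recursion:
        # V_k(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') V_{k-1}(s')
        V = R_pi + gamma * T_pi @ V
    return V
```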
Optimal policy \(\pi^*\)
Definition: for a given MDP and a fixed horizon \(h\) (possibly infinite), \(\pi^*\) is an optimal policy if \(\mathrm{V}_h^{\pi^*}({s}) = \mathrm{V}_h^{*}({s})\geqslant \mathrm{V}_h^\pi({s})\) for all \(s \in \mathcal{S}\) and for all possible policies \(\pi\).
How to search for an optimal policy \(\pi^*\)?
Optimal state-action value functions \(\mathrm{Q}^*_h(s, a)\)
\(\mathrm{Q}^*_h(s, a)\): the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
Recursively finding \(\mathrm{Q}^*_h(s, a)\): the base case is \(\mathrm{Q}^*_1(s, a)=\mathrm{R}(s, a)\) (with one step left, only the immediate reward is received), and longer horizons build on shorter ones, as in the examples below.
Recall: \(\gamma = 0.9\); the grid-world states, rewards \(\mathrm{R}(s,a)\), and the one special transition (out of state 6) are as before.
Let's consider \(\mathrm{Q}^*_2(3, \rightarrow)\):
\(\mathrm{Q}^*_2(3, \rightarrow) = \mathrm{R}(3,\rightarrow) + \gamma \max _{a^{\prime}} \mathrm{Q}^*_{1}\left(3, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(3, a^{\prime}\right) = 1.9\)
Similarly, \(\mathrm{Q}^*_2(3, \uparrow) = \mathrm{R}(3,\uparrow) + \gamma \max _{a^{\prime}} \mathrm{Q}^*_{1}\left(3, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(3, a^{\prime}\right) = 1.9\)
\(\mathrm{Q}_2^*(3, \leftarrow) = \mathrm{R}(3,\leftarrow) + \gamma \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(2, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(2, a^{\prime}\right) = 1\)
\(\mathrm{Q}_2^*(3, \downarrow) = \mathrm{R}(3,\downarrow) + \gamma \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(6, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(6, a^{\prime}\right) = -8\)
The special (stochastic) transition shows up in \(\mathrm{Q}_2^*(6, \uparrow)\):
\(\mathrm{Q}_2^*(6, \uparrow) =\mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} \mathrm{Q}_{1}^*\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} \mathrm{Q}_{1}^*\left(3, a^{\prime}\right)] = -10 + .9 [.2 \times 0+ .8 \times 1] = -9.28\)
In general,
\(\mathrm{Q}^*_h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}^*_{h-1}\left(s^{\prime}, a^{\prime}\right)\)
So what's the optimal action in state 3, with horizon 2, given by \(\pi_2^*(3)\)? Either up or right: both achieve the largest horizon-2 value, \(\mathrm{Q}^*_2(3,\uparrow)=\mathrm{Q}^*_2(3,\rightarrow)=1.9\). In general, \(\pi_h^*(s)=\arg\max_{a} \mathrm{Q}^*_{h}(s, a)\).
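As a quick numeric check of these worked horizon-2 values, here is a tiny script that hard-codes only the quantities read off the examples above (\(\gamma=0.9\), the relevant rewards, and the horizon-1 maxima \(\max_{a^{\prime}} \mathrm{Q}^*_1(s^{\prime}, a^{\prime})\)); it is a sketch for verification, not a general solver:

```python
# Reproduce the worked horizon-2 Q* values, using gamma = 0.9 and the
# horizon-1 maxima read off the examples above:
#   max_a Q*_1(2, a) = 0,  max_a Q*_1(3, a) = 1,  max_a Q*_1(6, a) = -10.
gamma = 0.9
q1_max = {2: 0.0, 3: 1.0, 6: -10.0}

q2 = {
    (3, "up"):    1 + gamma * q1_max[3],    # up from 3 stays in 3     -> 1.9
    (3, "right"): 1 + gamma * q1_max[3],    # right from 3 stays in 3  -> 1.9
    (3, "left"):  1 + gamma * q1_max[2],    # left from 3 goes to 2    -> 1.0
    (3, "down"):  1 + gamma * q1_max[6],    # down from 3 goes to 6    -> -8.0
    (6, "up"):    -10 + gamma * (0.2 * q1_max[2] + 0.8 * q1_max[3]),  # -> -9.28
}
for (s, a), value in q2.items():
    print(f"Q*_2({s}, {a}) = {value:.2f}")
```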
Value Iteration
Given the recursion
\(\mathrm{Q}^*_h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}^*_{h-1}\left(s^{\prime}, a^{\prime}\right),\)
we can have an infinite-horizon equation
\(\mathrm{Q}^*_{\infty}(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}^*_{\infty}\left(s^{\prime}, a^{\prime}\right).\)
Value iteration solves for \(\mathrm{Q}^*_{\infty}(s, a)\): initialize \(\mathrm{Q}(s, a)=0\) for all \(s, a\), then repeatedly apply the update \(\mathrm{Q}(s, a) \leftarrow \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}\left(s^{\prime}, a^{\prime}\right)\) for every \((s, a)\) until the values stop changing. If we run this update block \(h\) times and break, then the returns are exactly \(\mathrm{Q}^*_h\).
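A minimal sketch of this procedure, under the same assumed array representation as the earlier sketches (illustrative, not the course's reference implementation):

```python
import numpy as np

def value_iteration(T, R, gamma, horizon=None, tol=1e-6):
    """Iterate the Bellman update on Q, starting from Q = 0.

    If horizon is an integer h, do exactly h updates and return Q*_h;
    otherwise iterate until the update changes Q by less than tol,
    which approximates Q*_infinity (for gamma < 1).
    """
    Q = np.zeros(R.shape)                         # Q*_0 = 0, shape (S, A)
    t = 0
    while True:
        # Q_new(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * max_a' Q(s', a')
        Q_new = R + gamma * (T @ Q.max(axis=1))
        t += 1
        if horizon is not None and t >= horizon:
            return Q_new                          # exactly Q*_h
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new                          # approximately Q*_infinity
        Q = Q_new
```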
\(\mathrm{V}\) values vs. \(\mathrm{Q}\) values
\(\mathrm{V}_{h}^*(s)=\max_{a}\left[\mathrm{Q}^*_{h}(s, a)\right]\)
\(\pi_{h}^*(s)=\arg\max_{a}\left[\mathrm{Q}^*_{h}(s, a)\right]\)
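In code, given the Q array returned by the value-iteration sketch above (with T, R, gamma again assumed as before), these two relations are one line each:

```python
Q = value_iteration(T, R, gamma=0.9, horizon=2)   # or horizon=None for Q*_infinity
V_star  = Q.max(axis=1)      # V*_h(s)  = max_a Q*_h(s, a)
pi_star = Q.argmax(axis=1)   # pi*_h(s) = argmax_a Q*_h(s, a)
```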
We'd love to hear your thoughts.