Shen Shen
November 15, 2024
Markov Decision Processes
Definition, terminology, and policy
Policy Evaluation
\(V\)-values: State Value Functions
Bellman recursions and Bellman equations
Policy Optimization
Optimal policies \(\pi^*\)
\(Q\)-values: State-action Optimal Value Functions
Value iteration
Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))
Reinforcement Learning from Human Feedback
Research area initiated in the 50s by Bellman, known under various names (in various communities):
Stochastic optimal control (Control theory)
Stochastic shortest path (Operations Research)
Sequential decision making under uncertainty (Economics)
Reinforcement learning (Artificial Intelligence, Machine Learning)
A rich variety of accessible and elegant theory, mathematics, algorithms, and applications; but also considerable variation in notation.
We will use the most RL-flavored notation.
Running example: Mario in a grid-world
Normally, we get to the "intended" state;
E.g., in state (7), action "↑" gets to state (4).
If an action would take Mario out of the grid world, he stays put;
E.g., in state (9), "→" keeps him in state (9).
In state (6), action "↑" leads to two possibilities:
20% chance to (2),
80% chance to (3).
Mario in a grid-world, cont'd
[Figure: the grid world annotated with rewards, including the reward of \((3, \downarrow)\), \((3, \uparrow)\), \((6, \downarrow)\), and \((6, \rightarrow)\).]
Markov Decision Processes - Definition and terminology
\(\mathrm{T}\left(7, \uparrow, 4\right) = 1\)
\(\mathrm{T}\left(9, \rightarrow, 9\right) = 1\)
\(\mathrm{T}\left(6, \uparrow, 3\right) = 0.8\)
\(\mathrm{T}\left(6, \uparrow, 2\right) = 0.2\)
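As a concrete illustration (a sketch, not part of the original slides), each \(\mathrm{T}(s, a, \cdot)\) is a distribution over next states; the dictionary encoding below is a hypothetical representation of just the examples listed above.

```python
# A sketch: the listed transitions encoded as {(s, a): {s': probability}}.
# Only the examples from this slide appear here; a full grid-world model
# would have an entry for every (state, action) pair.
T_examples = {
    (7, "up"):    {4: 1.0},          # T(7, up, 4) = 1
    (9, "right"): {9: 1.0},          # would leave the grid, so Mario stays put
    (6, "up"):    {2: 0.2, 3: 0.8},  # the one stochastic transition
}
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T_examples.values())
```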
The goal in an MDP is to find a "good" policy.
Sidenote: In 6.390, both the policy \(\pi(s)\) and the reward \(\mathrm{R}(s, a)\) are deterministic, so the only randomness in a rollout comes from the transitions \(\mathrm{T}\left(s, a, s^{\prime}\right)\).
Markov Decision Processes - Definition and terminology
State \(s\)
Action \(a\)
Reward \(r\)
Policy \(\pi(s)\)
Transition \(\mathrm{T}\left(s, a, s^{\prime}\right)\)
Reward \(\mathrm{R}(s, a)\)
time
A trajectory (aka an experience, or a rollout) of horizon \(h\):
\(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{h-1}, a_{h-1}, r_{h-1}\right)\)
starting from an initial state \(s_0\); the actions, rewards, and subsequent states all depend on \(\pi\).
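A minimal sketch (not from the slides) of how such a rollout could be sampled, assuming a hypothetical dictionary encoding of the MDP: pi maps states to actions, T maps (s, a) to a next-state distribution (as in the earlier sketch), and R maps (s, a) to a reward.

```python
import random

def rollout(s0, pi, T, R, h):
    """Sample a horizon-h trajectory tau = (s_0, a_0, r_0, ..., s_{h-1}, a_{h-1}, r_{h-1}).

    Hypothetical encoding: pi maps state -> action, T maps (s, a) -> {s': prob},
    R maps (s, a) -> reward.
    """
    tau, s = [], s0
    for _ in range(h):
        a = pi[s]                                     # deterministic policy
        r = R[(s, a)]
        tau.append((s, a, r))
        states, probs = zip(*T[(s, a)].items())
        s = random.choices(states, weights=probs)[0]  # sample s' ~ T(s, a, .)
    return tau
```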
Markov Decision Processes
Definition, terminology, and policy
Policy Evaluation
\(V\)-values: State Value Functions
Bellman recursions and Bellman equations
Policy Optimization
Optimal policies \(\pi^*\)
\(Q\)-values: State-action Optimal Value Functions
Value iteration
Starting in a given \(s_0\), how "good" is it to follow a policy for \(h\) time steps?
One idea: add up the rewards received along a trajectory. But consider the Mario game, with the reward of \((6, \uparrow)\): because the transition out of state 6 is stochastic, different rollouts of the same policy collect different rewards, so we instead use the expected sum.
For a given policy \(\pi(s),\) the (state) value functions
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
(\(h\) terms inside the expectation; in 6.390, this expectation is only w.r.t. the transition probabilities \(\mathrm{T}\left(s, a, s^{\prime}\right)\))
Evaluating the "always \(\uparrow\)" policy
Expanded form: the expectation has \(h\) terms inside; with horizon \(h = 2\), there are 2 terms inside.
[Figure: horizon-2 rollout under the "always \(\uparrow\)" policy from state 6, branching to states 2 and 3.]
Recall:
\(\pi(s) = ``\uparrow",\ \forall s\)
\(\mathrm{R}(3, \uparrow) = 1\)
\(\mathrm{R}(6, \uparrow) = -10\)
\(\gamma = 0.9\)
[Figure: horizon-2 rollouts of the "always \(\uparrow\)" policy from states 2, 3, and 6.]
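For concreteness, here is the horizon-2 value at state 6 under the "always \(\uparrow\)" policy written out, assuming \(\mathrm{R}(2, \uparrow) = 0\) (consistent with the later \(Q\)-value computations, though not stated explicitly here):
\(V^2_\pi(6) = \mathrm{R}(6, \uparrow) + \gamma\left[0.2\, \mathrm{R}(2, \uparrow) + 0.8\, \mathrm{R}(3, \uparrow)\right] = -10 + 0.9\,(0.2 \cdot 0 + 0.8 \cdot 1) = -9.28\)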
Bellman Recursion
\(V^h_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right), \quad \forall s, h\)
Reading the right-hand side: the immediate reward for taking the policy-prescribed action \(\pi(s)\) in state \(s\), plus the \((h-1)\)-horizon values at each possible next state \(s^{\prime}\), weighted by the probability \(\mathrm{T}\left(s, \pi(s), s^{\prime}\right)\) of getting to that next state \(s^{\prime}\), and discounted by \(\gamma\).
The left-hand side is the horizon-\(h\) value in state \(s\): the expected sum of discounted rewards, starting in state \(s\) and following policy \(\pi\) for \(h\) steps.
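A minimal numpy sketch of this recursion on the grid world. It assumes a 3×3 grid with states numbered 1-9 row by row, \(\mathrm{R}(3, a) = 1\) and \(\mathrm{R}(6, a) = -10\) for every action with all other rewards 0 (consistent with the numbers worked out above), and the transition rules described earlier; this is an illustrative encoding, not the course's code.

```python
import numpy as np

# Sketch of the grid world under the stated assumptions:
# 3x3 grid, states 1..9 numbered row by row (stored as indices 0..8);
# moving off the grid keeps Mario in place; the only stochastic transition
# is (state 6, action "up"): 20% to state 2, 80% to state 3.
n_states, n_actions = 9, 4                 # actions: 0=up, 1=down, 2=left, 3=right
moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

T = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    row, col = divmod(s, 3)
    for a, (dr, dc) in moves.items():
        r2, c2 = row + dr, col + dc
        s_next = r2 * 3 + c2 if (0 <= r2 < 3 and 0 <= c2 < 3) else s
        T[s, a, s_next] = 1.0
T[5, 0, :] = 0.0                           # state 6 (index 5), action "up" is special:
T[5, 0, 1] = 0.2                           #   20% chance of landing in state 2 (index 1)
T[5, 0, 2] = 0.8                           #   80% chance of landing in state 3 (index 2)

R = np.zeros((n_states, n_actions))
R[2, :] = 1.0                              # state 3 (index 2): reward 1 for any action
R[5, :] = -10.0                            # state 6 (index 5): reward -10 for any action

gamma = 0.9

def evaluate_policy(pi, h):
    """Finite-horizon policy evaluation via the Bellman recursion.
    pi[s] is the action the (deterministic) policy takes in state s."""
    V = np.zeros(n_states)                 # V^0 = 0 for all states
    for _ in range(h):
        # V^k(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') * V^{k-1}(s')
        V = np.array([R[s, pi[s]] + gamma * T[s, pi[s]] @ V for s in range(n_states)])
    return V

always_up = np.zeros(n_states, dtype=int)  # the "always up" policy
print(evaluate_policy(always_up, h=2)[5])  # V^2 at state 6: -9.28
```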
If the horizon \(h\) goes to infinity (typically with \(\gamma < 1\) in the MDP definition), the finite-horizon Bellman recursions become the infinite-horizon Bellman equations:
\(V^{\infty}_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{\infty}_\pi\left(s^{\prime}\right), \quad \forall s\)
This is a system of \(|\mathcal{S}|\) linear equations, one equation for each state.
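Because these are \(|\mathcal{S}|\) linear equations in \(|\mathcal{S}|\) unknowns, they can be solved directly; a sketch reusing T, R, gamma, and always_up from the snippet above:

```python
# Infinite-horizon policy evaluation for the "always up" policy:
# solve the linear system (I - gamma * T_pi) V = R_pi.
T_pi = T[np.arange(n_states), always_up]   # |S| x |S| matrix: T_pi[s, s'] = T(s, pi(s), s')
R_pi = R[np.arange(n_states), always_up]   # length-|S| vector: R_pi[s] = R(s, pi(s))
V_inf = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
print(V_inf[5])                            # V^infinity at state 6
```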
Quick summary
MDP: states \(s\), actions \(a\), transitions \(\mathrm{T}\left(s, a, s^{\prime}\right)\), rewards \(\mathrm{R}(s, a)\), and a policy \(\pi(s)\).
Policy evaluation: for a given policy \(\pi(s),\) the (state) value functions
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
Markov Decision Processes
Definition, terminology, and policy
Policy Evaluation
\(V\)-values: State Value Functions
Bellman recursions and Bellman equations
Policy Optimization
Optimal policies \(\pi^*\)
\(Q\)-values: State-action Optimal Value Functions
Value iteration
Optimal policy \(\pi^*\)
Definition of \(\pi^*\): for a given MDP and a fixed horizon \(h\) (possibly infinite), \(\mathrm{V}^h_{\pi^*}({s}) \geqslant \mathrm{V}^h_\pi({s})\) for all \(s \in \mathcal{S}\) and for all possible policies \(\pi\).
How do we search for an optimal policy \(\pi^*\)?
Optimal state-action value functions \(Q^h(s, a)\)
\(Q^h(s, a)\): the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) time steps.
\(V\) values vs. \(Q\) values
Recursively finding \(Q^h(s, a)\)
Recall: \(\gamma = 0.9\); the states, the one special (stochastic) transition, and the rewards \(\mathrm{R}(s, a)\) are as in the grid-world figure.
Let's consider \(Q^2(3, \rightarrow)\):
\(Q^2(3, \rightarrow) = \mathrm{R}(3,\rightarrow) + \gamma \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1 + 0.9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1.9\)
Let's consider \(Q^2(3, \uparrow)\):
\(Q^2(3, \uparrow) = \mathrm{R}(3,\uparrow) + \gamma \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1 + 0.9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1.9\)
Let's consider \(Q^2(3, \leftarrow)\):
\(Q^2(3, \leftarrow) = \mathrm{R}(3,\leftarrow) + \gamma \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right) = 1 + 0.9 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right) = 1\)
Let's consider \(Q^2(3, \downarrow)\):
\(Q^2(3, \downarrow) = \mathrm{R}(3,\downarrow) + \gamma \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right) = 1 + 0.9 \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right) = -8\)
Let's consider \(Q^2(6, \uparrow)\):
\(Q^2(6, \uparrow) = \mathrm{R}(6,\uparrow) + \gamma\left[0.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right) + 0.8 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\right] = -10 + 0.9\left[0.2 \cdot 0 + 0.8 \cdot 1\right] = -9.28\)
In general,
\(Q^h(s, a) = \mathrm{R}(s, a) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right), \quad \forall s, a, h\)
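A sketch of this recursion in numpy, reusing n_states, n_actions, T, R, and gamma from the policy-evaluation snippet above; it reproduces the horizon-2 values worked out here.

```python
def compute_Q(h):
    """Q^h(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q^{h-1}(s',a')."""
    Q = np.zeros((n_states, n_actions))    # Q^0 = 0 for all (s, a)
    for _ in range(h):
        Q = R + gamma * T @ Q.max(axis=1)  # one Bellman backup, batched over all (s, a)
    return Q

Q2 = compute_Q(2)
# State 3 is index 2, state 6 is index 5; actions: 0=up, 3=right.
print(Q2[2, 3], Q2[2, 0], Q2[5, 0])        # 1.9, 1.9, -9.28 as computed above
```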
What is the optimal action in state 3, with horizon 2, given by \(\pi_2^*(3)\)? Either \(\uparrow\) or \(\rightarrow\): both achieve the maximum \(Q^2(3, a) = 1.9\).
In general, \(\pi_h^*(s) = \arg\max_{a} Q^h(s, a)\).
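Extracting \(\pi_2^*(3)\) from the Q table computed in the previous snippet (np.argmax breaks the tie between \(\uparrow\) and \(\rightarrow\) by returning the first maximizer):

```python
# Horizon-2 optimal action at state 3 (index 2), actions ordered (up, down, left, right).
print(Q2[2])            # [ 1.9, -8.,  1.,  1.9]
print(Q2[2].argmax())   # 0, i.e. "up"; "right" ties at 1.9
```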
Given the recursion
\(Q^h(s, a) = \mathrm{R}(s, a) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right),\)
we can have an infinite-horizon equation:
\(Q^{\infty}(s, a) = \mathrm{R}(s, a) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{\infty}\left(s^{\prime}, a^{\prime}\right), \quad \forall s, a\)
Infinite-horizon Value Iteration: initialize \(Q(s, a) = 0\) for all \(s, a\); repeatedly apply the update block \(Q(s, a) \leftarrow \mathrm{R}(s, a) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\) until convergence, returning \(Q \approx Q^{\infty}(s, a)\).
If we instead run this block \(h\) times and break, then the returned values are exactly \(Q^h\).
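A minimal sketch of infinite-horizon value iteration, reusing T, R, gamma, n_states, and n_actions from the earlier snippets; the stopping tolerance eps is an illustrative choice.

```python
def value_iteration(eps=1e-8):
    """Iterate the Bellman-backup block until (approximate) convergence to Q^infinity.
    Running the block exactly h times instead (no convergence check) returns Q^h."""
    Q_old = np.zeros((n_states, n_actions))
    while True:
        Q_new = R + gamma * T @ Q_old.max(axis=1)   # the update "block"
        if np.max(np.abs(Q_new - Q_old)) < eps:     # converged (gamma < 1 guarantees this)
            return Q_new
        Q_old = Q_new

Q_inf = value_iteration()
pi_star = Q_inf.argmax(axis=1)   # an optimal policy: pi*(s) = argmax_a Q^infinity(s, a)
```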
We'd love to hear your thoughts.