Shen Shen
April 19, 2024
Recap: Supervised Learning
Markov Decision Processes
Formal definition
Policy Evaluation
State-Value Functions: \(V\)-values
Finite horizon (recursion) and infinite horizon (equation)
Optimal Policy and Finding Optimal Policy
General tool: State-action Value Functions: \(Q\)-values
Value iteration
Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))
Reinforcement Learning with Human Feedback
Foundational tools and concepts for understanding RL.
Research area initiated in the 1950s (Bellman), known under various names (in various communities):
Stochastic optimal control (Control theory)
Stochastic shortest path (Operations research)
Sequential decision making under uncertainty (Economics)
Dynamic programming, control of dynamical systems (under uncertainty)
Reinforcement learning (Artificial Intelligence, Machine Learning)
A rich variety of (accessible & elegant) theory/math, algorithms, and applications/illustrations
As a result, there is quite a large variation in notation.
We will use the most RL-flavored notation.
Normally, actions take Mario to the “intended” state.
E.g., in state (7), action “↑” gets to state (4)
If an action would've taken us out of this world, stay put
E.g., in state (9), action “→” gets back to state (9)
except, in state (6), action “↑” leads to two possibilities:
20% chance ends in (2)
80% chance ends in (3)
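A minimal Python sketch of these dynamics (not from the slides), assuming the 3×3 layout implied above, with states 1, 2, 3 on the top row, 4, 5, 6 in the middle, and 7, 8, 9 on the bottom; the function name is just for illustration:

```python
# Sketch of the grid's transition model T(s, a, s'), under the assumed 3x3 layout.
def transition_probs(s, a):
    """Return a dict {s_next: probability} for state s and action a."""
    if s == 6 and a == "up":              # the one special, stochastic transition
        return {2: 0.2, 3: 0.8}
    row, col = divmod(s - 1, 3)
    if a == "up":    row -= 1
    if a == "down":  row += 1
    if a == "left":  col -= 1
    if a == "right": col += 1
    if not (0 <= row <= 2 and 0 <= col <= 2):
        return {s: 1.0}                   # would leave the world: stay put
    return {3 * row + col + 1: 1.0}

print(transition_probs(7, "up"))       # {4: 1.0}
print(transition_probs(9, "right"))    # {9: 1.0}
print(transition_probs(6, "up"))       # {2: 0.2, 3: 0.8}
```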
example cont'd
reward of being in state 3, taking action \(\uparrow\)
reward of being in state 3, taking action \(\downarrow\)
reward of being in state 6, taking action \(\downarrow\)
reward of being in state 6, taking action \(\rightarrow\)
actions: {Up ↑, Down ↓, Left ←, Right →}
Ultimate goal of an MDP: Find the "best" policy \(\pi\).
Sidenote:
State \(s\)
Action \(a\)
Reward \(r\)
Policy \(\pi(s)\)
Transition \(\mathrm{T}\left(s, a, s^{\prime}\right)\)
Reward function \(\mathrm{R}(s, a)\)
a trajectory (aka an experience or rollout) \(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\right)\)
how "good" is a trajectory?
For a given policy \(\pi(s),\) the finite-horizon, horizon-\(h\) (state) value functions are:
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
MDP
Policy evaluation
Recall:
example: evaluating the "always \(\uparrow\)" policy
\(\pi(s) = ``\uparrow",\ \forall s\)
\(\mathrm{R}(3, \uparrow) = 1\)
\(\mathrm{R}(6, \uparrow) = -10\)
\(\mathrm{R}(s, \uparrow) = 0\) for the other seven states
Suppose \(\gamma = 0.9\)
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
\( h\) terms inside
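In particular, for \(h = 1\) there is a single term, so the horizon-1 values are just the immediate rewards under the policy: \(V^1_\pi(s) = \mathrm{R}(s, \uparrow)\), i.e., \(V^1_\pi(3) = 1\), \(V^1_\pi(6) = -10\), and \(V^1_\pi(s) = 0\) for the other seven states.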
With horizon \(h = 2,\) there are only \(2\) terms inside the sum: \(V^2_\pi(s)=\mathbb{E}\left[\mathrm{R}\left(s_0, \pi\left(s_0\right)\right)+\gamma \mathrm{R}\left(s_1, \pi\left(s_1\right)\right) \mid s_0=s, \pi\right], \forall s\)
[Diagram: computing \(V^2_\pi(6)\) — from state 6, action \(\uparrow\) leads to state 2 (20%) or state 3 (80%); in either case the policy then takes \(\uparrow\) again.]
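Reading off the branches above (and using the horizon-1 values \(V^1_\pi(2) = 0\) and \(V^1_\pi(3) = 1\)): \(V^2_\pi(6)=\mathrm{R}(6, \uparrow)+\gamma\left[0.2\, V^1_\pi(2)+0.8\, V^1_\pi(3)\right]=-10+0.9 \times 0.8=-9.28\).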
Now, let's think about \(V_\pi^3(6)\)
[Diagram: the same branching unrolled one more step — from state 6, action \(\uparrow\) reaches state 2 or state 3, and the policy keeps taking \(\uparrow\) from there.]
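Unrolling one more step, and using \(V^2_\pi(2) = 0\) and \(V^2_\pi(3) = 1 + 0.9 \times 1 = 1.9\) (since \(\uparrow\) from the top-row states 2 and 3 leaves Mario in place): \(V^3_\pi(6)=\mathrm{R}(6, \uparrow)+\gamma\left[0.2\, V^2_\pi(2)+0.8\, V^2_\pi(3)\right]=-10+0.9 \times 0.8 \times 1.9=-8.632\).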
Putting these pieces together gives the finite-horizon Bellman recursion:
\(V^h_\pi(s)=\mathrm{R}(s, \pi(s))+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right)\)
the left-hand side: expected sum of discounted rewards, for starting in state \(s\) and following policy \(\pi(s)\) for horizon \(h\)
\(\mathrm{R}(s, \pi(s))\): immediate reward, for being in state \(s\) and taking the action given by policy \(\pi(s)\)
\(V^{h-1}_\pi(s^{\prime})\): the \((h-1)\)-horizon values at a next state \(s^{\prime}\), discounted by \(\gamma\)
\(\mathrm{T}(s, \pi(s), s^{\prime})\): each next-state value is weighted by the probability of getting to that next state \(s^{\prime}\)
finite-horizon policy evaluation (Bellman recursion):
For a given policy \(\pi(s),\) the finite-horizon, horizon-\(h\) (state) value functions are
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)
infinite-horizon policy evaluation (Bellman equation):
For any given policy \(\pi(s),\) the infinite-horizon (state) value functions are
\(V_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)
(for the infinite-horizon case, \(\gamma\) now necessarily needs to be \(<1\) for convergence)
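A minimal Python sketch of both procedures for the "always \(\uparrow\)" policy. It re-encodes the grid so the block runs on its own; the 3×3 layout, the reward table, and the helper names (`T`, `R`, `evaluate_finite`, `evaluate_infinite`) are the same assumptions as in the earlier sketch, not the slides' code.

```python
GAMMA = 0.9

def T(s, a):                                  # {s_next: prob}; 3x3 grid, states 1..9
    if s == 6 and a == "up":
        return {2: 0.2, 3: 0.8}
    row, col = divmod(s - 1, 3)
    row += {"up": -1, "down": 1}.get(a, 0)
    col += {"left": -1, "right": 1}.get(a, 0)
    if 0 <= row <= 2 and 0 <= col <= 2:
        return {3 * row + col + 1: 1.0}
    return {s: 1.0}

def R(s, a):                                  # R(3, up)=1, R(6, up)=-10, else 0
    return {3: 1.0, 6: -10.0}.get(s, 0.0)

pi = {s: "up" for s in range(1, 10)}          # the "always up" policy

def evaluate_finite(pi, h):
    """Horizon-h values via the Bellman recursion, starting from V^0 = 0."""
    V = {s: 0.0 for s in pi}
    for _ in range(h):
        V = {s: R(s, pi[s]) + GAMMA * sum(p * V[sp] for sp, p in T(s, pi[s]).items())
             for s in pi}
    return V

def evaluate_infinite(pi, tol=1e-8):
    """Iterate the Bellman equation until the values stop changing."""
    V = {s: 0.0 for s in pi}
    while True:
        V_new = {s: R(s, pi[s]) + GAMMA * sum(p * V[sp] for sp, p in T(s, pi[s]).items())
                 for s in pi}
        if max(abs(V_new[s] - V[s]) for s in pi) < tol:
            return V_new
        V = V_new

print(evaluate_finite(pi, 2)[6])          # ≈ -9.28
print(evaluate_finite(pi, 3)[6])          # ≈ -8.632
print(round(evaluate_infinite(pi)[3], 3)) # ≈ 10.0, i.e., 1/(1 - 0.9)
```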
\(V\) values vs. \(Q\) values
\(Q^h(s, a)\) is the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
example: recursively finding \(Q^h(s, a)\)
Recall: \(\gamma = 0.9\); the states and the one special transition are as before (from state 6, action \(\uparrow\) goes to state 2 with 20% chance and to state 3 with 80% chance).
In this example the reward \(\mathrm{R}(s,a)\) depends only on the state (consistent with the calculations below): \(\mathrm{R}(3, a) = 1\) and \(\mathrm{R}(6, a) = -10\) for every action \(a\), and \(\mathrm{R}(s, a) = 0\) otherwise.
Let's consider \(Q^2(3, \rightarrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
\( = 1 + .9 \times 1 = 1.9\)
Let's consider \(Q^2(3, \uparrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
\( = 1 + .9 \times 1 = 1.9\)
Let's consider \(Q^2(3, \leftarrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)\)
\( = 1 + .9 \times 0 = 1\)
Let's consider \(Q^2(3, \downarrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right)\)
\( = 1 + .9 \times (-10) = -8\)
Let's consider \(Q^2(6, \uparrow)\)
\(=\mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] \)
\(= -10 + .9 [.2 \times 0+ .8 \times 1] = -9.28\)
\(Q^2(6, \uparrow) =\mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] \)
in general: \(Q^h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right)\)
what's the optimal action in state 3, with horizon 2, given by \(\pi_2^*(3)=?\)
either up or right (both achieve the maximal \(Q^2(3, a) = 1.9\))
in general: \(\pi_h^*(s)=\arg \max _a Q^h(s, a)\)
Given the finite-horizon recursion
\(Q^h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right),\)
we should easily be convinced of the infinite-horizon equation
\(Q(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right).\)
Infinite-horizon Value Iteration
If, instead of relying on line 6 (the convergence criterion), we run the block of lines 4 and 5 for \(h\) iterations, then the returned values are exactly the horizon-\(h\) Q values.
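A minimal Python sketch of infinite-horizon value iteration (not the slides' pseudocode; the grid encoding and helper names repeat the assumptions from the earlier sketches). Passing a fixed `horizon` instead of relying on the tolerance check returns exactly the horizon-\(h\) Q values, which reproduces the numbers worked out above:

```python
GAMMA = 0.9
ACTIONS = ["up", "down", "left", "right"]

def T(s, a):                                  # {s_next: prob}; 3x3 grid, states 1..9
    if s == 6 and a == "up":
        return {2: 0.2, 3: 0.8}
    row, col = divmod(s - 1, 3)
    row += {"up": -1, "down": 1}.get(a, 0)
    col += {"left": -1, "right": 1}.get(a, 0)
    if 0 <= row <= 2 and 0 <= col <= 2:
        return {3 * row + col + 1: 1.0}
    return {s: 1.0}

def R(s, a):                                  # R(3, a)=1, R(6, a)=-10, else 0
    return {3: 1.0, 6: -10.0}.get(s, 0.0)

def value_iteration(horizon=None, tol=1e-6):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a').
    With a fixed horizon, do exactly that many sweeps (horizon-h Q values);
    otherwise stop when the update no longer changes the values."""
    Q_old = {(s, a): 0.0 for s in range(1, 10) for a in ACTIONS}
    sweep = 0
    while True:
        Q_new = {}
        for (s, a) in Q_old:
            Q_new[(s, a)] = R(s, a) + GAMMA * sum(
                p * max(Q_old[(sp, ap)] for ap in ACTIONS)
                for sp, p in T(s, a).items())
        sweep += 1
        if horizon is not None and sweep == horizon:
            return Q_new
        if horizon is None and max(abs(Q_new[k] - Q_old[k]) for k in Q_new) < tol:
            return Q_new
        Q_old = Q_new

Q2 = value_iteration(horizon=2)
print(Q2[(3, "right")], Q2[(3, "down")], Q2[(6, "up")])   # ≈ 1.9, -8.0, -9.28
```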
We'd appreciate your feedback on the lecture.
Let's try to find \(Q^1 (1, \uparrow)\)
The next state following \((1, \uparrow)\) is only state 1.
Normally, actions take Mario to the “intended” state.
E.g., in state (7), action “↑” gets to state (4)
If an action would've taken us out of this world, stay put
E.g., in state (1), action “↑” gets back to state (1)
except, in state (6), action “↑” leads to two possibilities:
20% chance ends in (2)
80% chance ends in (3)
Recall:
\(\pi(s) = ``\uparrow",\ \forall s\)
\(\mathrm{R}(3, \uparrow) = 1\)
\(\mathrm{R}(6, \uparrow) = -10\)
\(\gamma = 0.9\)
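Putting these facts together: a horizon-1 Q value is just the immediate reward, so \(Q^1(1, \uparrow) = \mathrm{R}(1, \uparrow) = 0\).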