Shen Shen
November 22, 2024
The goal of an MDP is to find a "good" policy.
Sidenote: In 6.390,
Recap:
Markov Decision Processes - definition and terminology
For a given policy \(\pi(s),\) the (state) value functions
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
State value functions \(V\)
Recap:
Bellman Recursion
\(V^h_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right)\)
the immediate reward for taking the policy-prescribed action \(\pi(s)\) in state \(s\), plus the \((h-1)\)-horizon values at each next state \(s^{\prime}\), discounted by \(\gamma\) and weighted by the probability of getting to that next state \(s^{\prime}\)
\(V^h_\pi(s)\): horizon-\(h\) value in state \(s\): the expected sum of discounted rewards, starting in state \(s\) and following policy \(\pi\) for \(h\) steps.
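To make the recursion concrete, here is a minimal sketch of tabular policy evaluation in Python, assuming hypothetical arrays `T` (transition probabilities), `R` (rewards), and `pi` (a deterministic policy); these names are illustrative and do not appear in the slides:

```python
import numpy as np

def evaluate_policy_finite_horizon(T, R, pi, gamma, h):
    """Finite-horizon policy evaluation via the Bellman recursion.

    T:  (S, A, S) array, T[s, a, s'] = transition probability
    R:  (S, A) array, R[s, a] = reward
    pi: length-S integer array, pi[s] = action prescribed by the policy
    Returns V, a length-S array with V[s] = V^h_pi(s).
    """
    num_states = T.shape[0]
    V = np.zeros(num_states)              # V^0_pi(s) = 0 for all s
    for _ in range(h):
        V_new = np.zeros(num_states)
        for s in range(num_states):
            a = pi[s]
            # immediate reward + discounted, probability-weighted (h-1)-horizon values
            V_new[s] = R[s, a] + gamma * T[s, a].dot(V)
        V = V_new
    return V
```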
Recap:
finite-horizon Bellman recursions: \(V^h_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right)\)
infinite-horizon Bellman equations: \(V_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V_\pi\left(s^{\prime}\right)\)
For a given MDP and policy \(\pi(s)\), computing these value functions is called policy evaluation.
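For the infinite-horizon equation, one option is to solve the linear system it defines directly; a sketch under the same assumed tabular `T`, `R`, `pi` arrays as above:

```python
import numpy as np

def evaluate_policy_infinite_horizon(T, R, pi, gamma):
    """Solve V_pi = R_pi + gamma * T_pi V_pi as a linear system."""
    num_states = T.shape[0]
    idx = np.arange(num_states)
    T_pi = T[idx, pi]          # (S, S): row s is T(s, pi(s), .)
    R_pi = R[idx, pi]          # (S,):   R(s, pi(s))
    # (I - gamma * T_pi) V = R_pi
    return np.linalg.solve(np.eye(num_states) - gamma * T_pi, R_pi)
```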
Recap:
Optimal policy \(\pi^*\)
Definition: for a given MDP and a fixed horizon \(h\) (possibly infinite), a policy \(\pi^*\) is an optimal policy if \(\mathrm{V}^h_{\pi^*}({s}) \geqslant \mathrm{V}^h_\pi({s})\) for all \(s \in \mathcal{S}\) and for all possible policies \(\pi\).
Recap:
\(\mathrm{Q}^h(s, a)\): the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
Recipe for constructing an optimal policy: \(\pi^*_h(s) \in \arg\max_a \mathrm{Q}^h(s, a)\).
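The recipe, sketched in code: given a (hypothetical) tabular array `Q` of shape `(S, A)`, the greedy policy simply takes an argmax over actions in each state:

```python
import numpy as np

def greedy_policy(Q):
    """Back out a policy from Q values: pi*(s) = argmax_a Q(s, a)."""
    return np.argmax(Q, axis=1)   # length-S array: an optimal action per state

# e.g., pi_star = greedy_policy(Q_h) for the horizon-h Q table
```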
Recap:
Infinite-horizon Value Iteration
if we run this update block \(h\) times and then break, the returned values are exactly \(Q^h\); run until convergence, the returned \(\mathrm{Q}\) (approximately) satisfies the infinite-horizon equation.
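A sketch of that loop in Python, again over assumed tabular `T` and `R` arrays: stopping after \(h\) sweeps gives \(Q^h\), while running until the values stop changing approximately satisfies the infinite-horizon equation:

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-6, max_sweeps=10_000):
    """Tabular value iteration; returns Q approximately satisfying
    Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q(s',a')."""
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(max_sweeps):
        # one full sweep of the Bellman optimality update
        Q_new = R + gamma * T.dot(Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < eps:
            return Q_new
        Q = Q_new
    return Q
```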
Recap:
Running example: Mario in a grid-world
Normally, we get to the “intended” state;
E.g., in state (7), action “↑” gets to state (4)
If an action would take Mario out of the grid world, stay put;
E.g., in state (9), “→” gets back to state (9)
In state (6), action “↑” leads to two possibilities:
20% chance to (2)
80% chance to (3)
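One possible way to encode these dynamics in code (only the three cases described above; the rest of the grid-world table is not reproduced here, and the action names are just placeholders):

```python
# Partial transition model for the grid-world, encoding only the cases
# described above; T_partial[(s, action)] is a distribution over next states.
T_partial = {
    (7, "up"):    {4: 1.0},          # normally, we reach the intended state
    (9, "right"): {9: 1.0},          # an action that would leave the grid: stay put
    (6, "up"):    {2: 0.2, 3: 0.8},  # the one stochastic transition
}
```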
Recall the rewards of \((3, \downarrow)\), \((3,\uparrow)\), \((6, \downarrow)\), and \((6,\rightarrow)\) (values shown in the grid-world figure).
Reinforcement learning setup
Now: Reinforcement Learning (RL)
The goal of an MDP problem is to find a "good" policy. An MDP is specified by its states \(s\), actions \(a\), rewards \(\mathrm{R}(s, a)\), and transitions \(\mathrm{T}\left(s, a, s^{\prime}\right)\). In RL we still want a good policy \(\pi(s)\), but the agent only interacts with the environment over time: it observes states \(s\), takes actions \(a\), and receives rewards \(r\), without being told \(\mathrm{T}\) or \(\mathrm{R}\).
a trajectory (aka, an experience, or a rollout), of horizon \(h\)
\(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots s_{h-1}, a_{h-1}, r_{h-1}\right)\)
where \(s_0\) is the initial state; the actions all depend on \(\pi\); the resulting states and rewards also depend on \(\mathrm{T}\) and \(\mathrm{R}\), but we do not know \(\mathrm{T}\) or \(\mathrm{R}\) explicitly.
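A sketch of collecting such a rollout, assuming a hypothetical `env_step(s, a)` interface that returns an observed reward and next state (this interface is an illustration, not something defined in the slides):

```python
def collect_rollout(env_step, pi, s0, h):
    """Roll out policy pi for h steps from initial state s0.

    env_step(s, a) -> (r, s_next) hides the unknown T and R: we only see samples.
    Returns tau = [(s_0, a_0, r_0), ..., (s_{h-1}, a_{h-1}, r_{h-1})].
    """
    tau, s = [], s0
    for _ in range(h):
        a = pi(s)                    # action depends on the policy
        r, s_next = env_step(s, a)   # reward/next state depend on the unknown R, T
        tau.append((s, a, r))
        s = s_next
    return tau
```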
Reinforcement learning is very general:
robotics
games
social sciences
chatbot (RLHF)
health care
...
Keep playing the game to approximate the unknown rewards and transitions.
e.g., observe what reward \(r\) is received from taking the \((6, \uparrow)\) pair; that gives us \(\mathrm{R}(6,\uparrow)\)
e.g., execute the \((6, \uparrow)\) pair 1000 times and count the number of times we (start in state 6, take the \(\uparrow\) action, end in state 2); then, roughly, \(\mathrm{T}(6,\uparrow, 2 ) \approx (\text{that count}/1000)\)
Now, with \(\mathrm{R}\) and \(\mathrm{T}\) estimated, we're back in MDP setting.
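A sketch of this counting-and-averaging idea, assuming the experience has been recorded as `(s, a, r, s_next)` tuples:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') by normalized counts and R(s, a) by the average
    observed reward, from a list of (s, a, r, s_next) samples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum, visits = defaultdict(float), defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    T_hat = {sa: {s_next: c / visits[sa] for s_next, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return T_hat, R_hat
```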
In Reinforcement Learning:
[A non-exhaustive, but useful taxonomy of algorithms in modern RL. Source]
Is it possible to get a good policy without learning the transitions or rewards explicitly?
We kinda know a way already:
If we have access to the Q value functions, we can back out an optimal policy easily (without needing the transitions or rewards)
(Recall, from MDP lab)
But... doesn't value iteration rely on the transitions and rewards explicitly?
Idea: use the Value Iteration update as an approximate (rough) update, replacing the expectation with a one-sample "target" \(r + \gamma \max_{a'} \mathrm{Q}_{\text{old}}(s', a')\).
Game setup: the states are known, but the transitions and rewards are unknown.
e.g., execute \((3, \uparrow)\), observe a reward \(r=1\)
Try using such observed samples to update \(\mathrm{Q}_\text{old}(s, a) \rightarrow \mathrm{Q}_{\text{new}}(s, a)\).
Try out this naive update on the estimate of \(\mathrm{Q}(6, \uparrow)\), with \(\gamma = 0.9\) (states known, transition unknown):
Execute \((6, \uparrow)\), observe reward \(r = -10\) and next state \(3\):
\(\mathrm{Q}_{\text{new}}(6, \uparrow) = -10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 = -9.1\)
Execute \((6, \uparrow)\) again, observe reward \(r = -10\) and next state \(2\):
\(\mathrm{Q}_{\text{new}}(6, \uparrow) = -10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0 = -10\)
Each update throws away the old belief and keeps only the latest one-sample target, so the estimate keeps flip-flopping between \(-9.1\) and \(-10\) depending on which next state we happen to observe. 🥺
Better idea: blend the old belief with the new target, using a learning rate \(\alpha = 0.5\) (still \(\gamma = 0.9\), transition unknown). 😍
To update the estimate of \(\mathrm{Q}(6, \uparrow)\):
Observe \(r = -10\) and next state \(3\):
\(\mathrm{Q}_{\text{new}}(6, \uparrow) = (1-0.5)(-10) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\right) = -5 + 0.5(-10 + 0.9) = -9.55\)
Observe \(r = -10\) and next state \(2\):
\(\mathrm{Q}_{\text{new}}(6, \uparrow) = (1-0.5)(-9.55) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = 0.5 \times (-9.55) + 0.5(-10 + 0) = -9.775\)
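The same two updates in code, just to make the moving-average arithmetic explicit; the \(\max_{a'} \mathrm{Q}_{\text{old}}\) values (1 for state 3, 0 for state 2) are the ones used above:

```python
gamma, alpha = 0.9, 0.5

def blended_update(q_old, r, max_q_next):
    """Q_new = (1 - alpha) * old belief + alpha * (one-sample target)."""
    return (1 - alpha) * q_old + alpha * (r + gamma * max_q_next)

q = -10.0                                        # current estimate of Q(6, up)
q = blended_update(q, r=-10, max_q_next=1.0)     # observed next state 3 -> -9.55
q = blended_update(q, r=-10, max_q_next=0.0)     # observed next state 2 -> -9.775
print(q)   # -9.775
```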
Value Iteration\((\mathcal{S}, \mathcal{A}, \mathrm{T}, \mathrm{R}, \gamma, \epsilon)\) is "calculating"; Q-Learning is "learning" (estimating).
Q-Learning\((\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0, \text{max-iter})\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
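A runnable sketch of the pseudocode above, assuming the same hypothetical `env_step(s, a)` interface as before and a simple \(\epsilon\)-greedy choice for `select_action` (the slides leave the action-selection strategy unspecified):

```python
import random

def q_learning(states, actions, gamma, alpha, s0, max_iter, env_step, epsilon=0.1):
    """Tabular Q-learning, mirroring the pseudocode above."""
    Q = {(s, a): 0.0 for s in states for a in actions}       # lines 2-3

    def select_action(s):
        # epsilon-greedy: one simple choice for line 6
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = s0                                                   # line 4
    for _ in range(max_iter):                                # line 5
        a = select_action(s)                                 # line 6
        r, s_next = env_step(s, a)                           # line 7
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target # line 8
        s = s_next                                           # lines 9-10
    return Q                                                 # line 12
```

Here lines 8 and 11 of the pseudocode are merged into a single in-place update of one \((s, a)\) entry, which has the same effect.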
"learning"
Q-Learning \(\left(\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0\right. \text{max-iter})\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
\(^1\) Q-learning converges to the optimal \(\mathrm{Q}\) values, given we visit all \((s, a)\) pairs infinitely often and the learning rate \(\alpha\) satisfies a standard condition.
\(\mathrm{Q}_\text{old}(s, a)\) in the pseudocode holds the current estimate of the \(\mathrm{Q}\) values.
Q-Learning \(\left(\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0\right. \text{max-iter})\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
The update on line 8 is equivalently:
\(\mathrm{Q}_{\text {new}}(s, a) \leftarrow\mathrm{Q}_{\text {old }}(s, a)+\alpha\left([r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old}}(s', a')] - \mathrm{Q}_{\text {old }}(s, a)\right)\)
i.e., new belief \(\leftarrow\) old belief \(+\) learning rate \(\times\) (target \(-\) old belief)
1. parameterize \(\mathrm{Q}_{\theta}(s,a)\)
2. collect data \((r, s')\) to construct the target \(r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\theta}\left(s^{\prime}, a^{\prime}\right)\)
3. update \(\theta\) via gradient-descent methods to minimize \(\left(\mathrm{Q}_{\theta}(s, a)-\text{target}\right)^2\)
Gradient descent does: \(\theta_{\text{new}} \leftarrow \theta_{\text{old}} + \eta (\text{target} - \text{guess}_{\theta})\frac{d \text{guess}}{d \theta}\), where here the "guess" is \(\mathrm{Q}_{\theta}(s, a)\).
[Slide Credit: Yann LeCun]
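A sketch of one such gradient step, using a deliberately simple linear parameterization \(\mathrm{Q}_\theta(s, a) = \theta^\top \phi(s, a)\) over hypothetical features `phi`; a neural network would play the same role, with the gradient computed by backpropagation:

```python
import numpy as np

def q_value(theta, phi_sa):
    """Linear parameterization: Q_theta(s, a) = theta . phi(s, a)."""
    return theta.dot(phi_sa)

def q_gradient_step(theta, phi_sa, r, phi_next_all, gamma, eta):
    """One gradient-descent step on (Q_theta(s, a) - target)^2.

    phi_sa:       feature vector of the (s, a) pair we just tried
    phi_next_all: list of feature vectors phi(s', a') for each action a'
    """
    target = r + gamma * max(q_value(theta, p) for p in phi_next_all)
    guess = q_value(theta, phi_sa)
    # theta_new = theta_old + eta * (target - guess) * d guess / d theta,
    # where d guess / d theta = phi(s, a) for this linear Q
    return theta + eta * (target - guess) * phi_sa
```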
Reinforcement learning has a lot of challenges:
...
We'd love to hear your thoughts.
Recall: recursively finding \(Q^h(s, a)\)
\(Q^h(s, a)\): the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
(Setup as before: \(\gamma = 0.9\); states and one special transition; rewards \(\mathrm{R}(s,a)\) as shown in the grid-world figure.)
Let's consider \(Q^2(3, \downarrow) = \mathrm{R}(3,\downarrow) + \gamma \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right) = 1 + .9(-10) = -8\)
Let's consider \(Q^2(3, \leftarrow) = \mathrm{R}(3,\leftarrow) + \gamma \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)\)
Let's consider \(Q^2(6, \uparrow) = \mathrm{R}(6,\uparrow) + \gamma\left[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\right] = -10 + .9[.2 \times 0 + .8 \times 1] = -9.28\)
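A quick check of this arithmetic in code, using only the numbers used on these slides (\(\mathrm{R}(3,\downarrow)=1\), \(\mathrm{R}(6,\uparrow)=-10\), and the horizon-1 maxima at states 2, 3, and 6):

```python
gamma = 0.9

# horizon-1 maxima read off the slides (recall Q^1(s, a) = R(s, a))
max_Q1 = {2: 0.0, 3: 1.0, 6: -10.0}

Q2_3_down = 1 + gamma * max_Q1[6]                              # = -8.0
Q2_6_up = -10 + gamma * (0.2 * max_Q1[2] + 0.8 * max_Q1[3])    # = -9.28
print(Q2_3_down, Q2_6_up)
```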