Shen Shen
April 26, 2024
Ultimate goal of an MDP: Find the "best" policy \(\pi\).
State \(s\)
Action \(a\)
Reward \(r\)
Policy \(\pi(s)\)
Transition \(\mathrm{T}\left(s, a, s^{\prime}\right)\)
Reward \(\mathrm{R}(s, a)\)
a trajectory (aka an experience or rollout) \(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\right)\)
how "good" is a trajectory?
Normally, actions take Mario to the “intended” state.
E.g., in state (7), action “↑” gets to state (4)
If an action would've taken us out of this world, stay put
E.g., in state (9), action “→” gets back to state (9)
except, in state (6), action “↑” leads to two possibilities:
20% chance ends in (2)
80% chance ends in (3)
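To make these dynamics concrete, here is a minimal Python sketch of a transition model consistent with the examples above, assuming a 3-by-3 grid with states numbered 1 through 9 row by row from the top-left; this layout and the helper names are illustrative assumptions, not spelled out on the slides:

```python
import random

# Assumption: a 3x3 grid, states numbered 1..9 row by row from the top-left,
# so state 7 is bottom-left and state 9 is bottom-right.
ROWS, COLS = 3, 3
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def transition(state, action):
    """Return {next_state: probability} for taking `action` in `state`."""
    # The one special stochastic transition: "up" from state 6.
    if state == 6 and action == "up":
        return {2: 0.2, 3: 0.8}
    row, col = divmod(state - 1, COLS)
    dr, dc = MOVES[action]
    new_row, new_col = row + dr, col + dc
    # If the action would take Mario out of this world, stay put.
    if not (0 <= new_row < ROWS and 0 <= new_col < COLS):
        return {state: 1.0}
    return {new_row * COLS + new_col + 1: 1.0}

def sample_next_state(state, action):
    """Sample a next state according to the transition probabilities."""
    next_states, probs = zip(*transition(state, action).items())
    return random.choices(next_states, weights=probs)[0]

print(transition(7, "up"))     # {4: 1.0}
print(transition(9, "right"))  # {9: 1.0}
print(transition(6, "up"))     # {2: 0.2, 3: 0.8}
```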
example cont'd
actions: {Up ↑, Down ↓, Left ←, Right →}
Now, let's think about \(V_\pi^3(6)\)
Recall:
\(\pi(s) = ``\uparrow",\ \forall s\)
\(\mathrm{R}(3, \uparrow) = 1\)
\(\mathrm{R}(6, \uparrow) = -10\)
\(\gamma = 0.9\)
(Figure: the rollout tree from state 6 under \(\pi\). Taking action \(\uparrow\) in state 6 leads to state 2 (20%) or state 3 (80%); taking \(\uparrow\) again from state 2 or state 3 keeps Mario in the same state.)
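A sketch of the computation this tree encodes, assuming (as the figure and recall box suggest) that all rewards other than \(\mathrm{R}(3, \uparrow)=1\) and \(\mathrm{R}(6, \uparrow)=-10\) are zero, and that \(\uparrow\) keeps states 2 and 3 in place:
\(V_\pi^3(6)=\mathrm{R}(6, \uparrow)+\gamma\left[0.2\left(\mathrm{R}(2, \uparrow)+\gamma \mathrm{R}(2, \uparrow)\right)+0.8\left(\mathrm{R}(3, \uparrow)+\gamma \mathrm{R}(3, \uparrow)\right)\right]=-10+0.9[0.2 \times 0+0.8 \times(1+0.9)]=-8.632\)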
MDP
Policy evaluation
finite-horizon policy evaluation
infinite-horizon policy evaluation
\(\gamma\) now necessarily needs to be \(<1\), in general, for the infinite sum to converge.
Bellman equation
For any given policy \(\pi(s),\) the infinite-horizon (state) value functions are
\(V_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)
For a given policy \(\pi(s),\) the finite-horizon (horizon-\(h\)) (state) value functions are:
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)
Bellman recursion
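For the policy-evaluation setting just defined, one standard way to write the recursion named here (with base case \(V_\pi^0(s)=0\) for all \(s\)):
\(V_\pi^h(s)=\mathrm{R}(s, \pi(s))+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V_\pi^{h-1}\left(s^{\prime}\right), \quad \forall s\)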
example: recursively finding \(Q^h(s, a)\)
Recall:
\(\gamma = 0.9\)
\(Q^h(s, a)\) is the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
States, rewards \(\mathrm{R}(s,a)\), and the one special transition are as in the grid-world figure above.
Let's consider \(Q^2(6, \uparrow)\)
\(Q^2(6, \uparrow) =\mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] \)
\(= -10 + .9 [.2 \times 0+ .8 \times 1] = -9.28\)
in general
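Spelling out the general recursion this example instantiates (the standard finite-horizon one, with base case \(Q^0(s, a)=0\)):
\(Q^h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right), \quad \forall s, a\)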
What's the optimal action in state 3, with horizon 2, given by \(\pi_2^*(3)=?\)
either up or right
in general:
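Presumably the general rule meant here is the usual greedy-in-\(Q\) one:
\(\pi_h^*(s)=\arg \max _a Q^h(s, a), \quad \forall s\)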
Given the finite-horizon recursion, we should easily be convinced of the infinite-horizon equation:
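Written out (in the \(Q\) form used throughout this example), the infinite-horizon equation is the fixed point of the recursion above:
\(Q(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right), \quad \forall s, a\)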
Infinite-horizon Value Iteration
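A minimal Python sketch of infinite-horizon value iteration over \(Q\), in the spirit of the VALUE-ITERATION\((\mathcal{S}, \mathcal{A}, \mathrm{T}, \mathrm{R}, \gamma, \epsilon)\) signature that appears below; the dictionary representation, the callable \(\mathrm{T}\) and \(\mathrm{R}\), and the stopping test on the largest change are choices of this sketch, not prescribed by the slides:

```python
def value_iteration(states, actions, T, R, gamma, eps):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')
    until no entry changes by more than eps.  T(s, a) returns {s': prob}."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        new_Q = {}
        for s in states:
            for a in actions:
                expected_future = sum(
                    p * max(Q[(s_next, a2)] for a2 in actions)
                    for s_next, p in T(s, a).items()
                )
                new_Q[(s, a)] = R(s, a) + gamma * expected_future
        if max(abs(new_Q[k] - Q[k]) for k in Q) < eps:
            return new_Q
        Q = new_Q
```

The greedy policy \(\pi(s)=\arg \max _a Q(s, a)\) can then be read off the returned table.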
Ultimate goal of an MDP: Find the "best" policy \(\pi\).
RL:
Keep playing the game to approximate the unknown rewards and transitions.
Rewards are particularly easy:
e.g. by observing what reward \(r\) we receive from being in state 6 and taking the \(\uparrow\) action, we know \(\mathrm{R}(6,\uparrow)\)
Transitions are a bit more involved, but still simple:
e.g. suppose that, over many plays of the game, we take the \(\uparrow\) action in state 6 a total of 1000 times; count the # of times we end up in state 2, then, roughly, \(\mathrm{T}(6,\uparrow, 2) \approx (\text{that count}/1000)\)
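A small sketch of this counting idea, assuming access to a simulator function step(s, a) that returns an observed reward and next state; the simulator and the function names are hypothetical, for illustration only:

```python
from collections import defaultdict

def estimate_model(step, states, actions, tries_per_pair=1000):
    """Estimate R(s,a) and T(s,a,s') by repeatedly trying each (s, a) pair."""
    R_hat, T_hat = {}, defaultdict(float)
    for s in states:
        for a in actions:
            counts = defaultdict(int)
            total_reward = 0.0
            for _ in range(tries_per_pair):
                r, s_next = step(s, a)          # observe one (reward, next state) sample
                total_reward += r
                counts[s_next] += 1
            R_hat[(s, a)] = total_reward / tries_per_pair      # average observed reward
            for s_next, c in counts.items():
                T_hat[(s, a, s_next)] = c / tries_per_pair     # empirical frequency
    return R_hat, T_hat
```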
Now, with \(\mathrm{R}, \mathrm{T}\) estimated, we're back in the MDP setting.
In Reinforcement Learning:
How do we learn a good policy without learning transition or rewards explicitly?
We kinda already know a way: Q functions!
So once we have "good" Q values, we can find the optimal policy easily.
(Recall from MDP lab)
But didn't we calculate this Q-table via value iteration using transition and rewards explicitly?
Indeed, recall that, in MDP value iteration, we repeatedly apply
\(\mathrm{Q}(s, a) \leftarrow \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}\left(s^{\prime}, a^{\prime}\right)\)
e.g. \(\mathrm{Q}(6, \uparrow) \leftarrow \mathrm{R}(6, \uparrow)+\gamma\left[.2 \max _{a^{\prime}} \mathrm{Q}\left(2, a^{\prime}\right)+.8 \max _{a^{\prime}} \mathrm{Q}\left(3, a^{\prime}\right)\right]\)
In RL, we don't know \(\mathrm{R}\) or \(\mathrm{T}\), but after taking action \(a\) in state \(s\) we do observe a reward \(r\) and a next state \(s^{\prime}\). Can we use the observed sample \(r+\gamma \max _{a^{\prime}} \mathrm{Q}\left(s^{\prime}, a^{\prime}\right)\) as the proxy for the r.h.s. assignment? Q-learning does exactly this: it nudges the old belief \(\mathrm{Q}_{\text {old }}(s, a)\) toward this target, with a learning rate \(\alpha\).
VALUE-ITERATION \((\mathcal{S}, \mathcal{A}, \mathrm{T}, \mathrm{R}, \gamma, \epsilon)\): "calculating"
Q-LEARNING \(\left(\mathcal{S}, \mathcal{A}, \gamma, \mathrm{s}_0,\alpha\right)\): "estimating"
Q-LEARNING \(\left(\mathcal{S}, \mathcal{A}, \gamma, \mathrm{s}_0,\alpha\right)\)
\(\mathrm{Q}_{\text {new}}(s, a) \leftarrow\mathrm{Q}_{\text {old }}(s, a)+\alpha\left([r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old}}(s', a')] - \mathrm{Q}_{\text {old }}(s, a)\right)\)
is equivalently:
old belief \(+\) learning rate \(\times\) (target \(-\) old belief)
pretty similar to SGD.
Indeed, treating the target \(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\) as a fixed label, the update is a gradient step toward minimizing the squared error
\(\left(Q(s, a)-\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)\right)^2\)
via gradient method! Just like in supervised learning.
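Putting the update rule into code: a minimal tabular Q-learning sketch. The exploration policy (uniformly random actions), the fixed number of steps, and the simulator function step(s, a) are all assumptions of this sketch; the slides only specify the update itself.

```python
import random

def q_learning(states, actions, step, s0, gamma=0.9, alpha=0.1, num_steps=10_000):
    """Tabular Q-learning: nudge Q(s,a) toward the target r + gamma * max_a' Q(s',a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = s0
    for _ in range(num_steps):
        a = random.choice(list(actions))                 # exploration: uniformly random action
        r, s_next = step(s, a)                           # observe reward and next state
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])        # old belief + alpha * (target - old belief)
        s = s_next
    return Q
```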
We'd appreciate your feedback on the lecture.