The goal of an MDP is to find a "good" policy.
MDP: Definition and terminology
In 6.390,
Recall
finite-horizon Bellman recursions
infinite-horizon Bellman equations
Policy Evaluation
Use the definition and sum up expected rewards:
Or, leverage the recursive structure (the finite-horizon Bellman recursion): \(\mathrm{V}^{\pi}_h(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}(s, \pi(s), s^{\prime})\, \mathrm{V}^{\pi}_{h-1}(s^{\prime})\)
1️⃣
2️⃣
3️⃣
Recall
the immediate reward for taking the policy-prescribed action \(\pi(s)\) in state \(s\).
horizon-\(h\) value in state \(s\): the expected sum of discounted rewards, starting in state \(s\) and following policy \(\pi\) for \(h\) steps.
\((h-1)\) horizon future value at a next state \(s^{\prime}\)
sum of future values weighted by the probability of reaching that next state \(s^{\prime}\)
discounted by \(\gamma\)
2️⃣
Recall
the optimal state-action value functions \(\mathrm{Q}^*_h(s, a):\)
the expected sum of discounted rewards, obtained by taking action \(a\) in state \(s\) and then acting optimally for the remaining \((h-1)\) steps.
\(\mathrm{V}_h^*(s) = \max_{a} \big[\mathrm{R}(s, a) + \gamma \sum_{s'} \mathrm{T}(s, a, s') \mathrm{V}_{h-1}^*(s') \big]\)
\(=\max_{a}\left[\mathrm{Q}^*_{h}(s, a)\right]\)
\(\mathrm{Q}^*\) satisfies the Bellman recursion:
\(\mathrm{Q}^*_h (s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime} \right) \max _{a^{\prime}} \mathrm{Q}^*_{h-1}\left(s^{\prime}, a^{\prime}\right)\)
4️⃣
5️⃣
Recall
Value Iteration
if we run this block \(h\) times and then break, the returned values are \(\mathrm{Q}^*_h\)
if we run it until convergence, the returned values are \(\mathrm{Q}^*_{\infty}\)
Value iteration: iteratively apply the Bellman recursion
Recall
\(\mathrm{Q}^*_h (s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime} \right) \max _{a^{\prime}} \mathrm{Q}^*_{h-1}\left(s^{\prime}, a^{\prime}\right)\)
5️⃣
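To make this concrete, here is a minimal sketch of tabular value iteration on Q-values, assuming (for illustration) that \(\mathrm{R}\) and \(\mathrm{T}\) are given as NumPy arrays indexed by state and action; with a finite horizon it returns \(\mathrm{Q}^*_h\), otherwise it runs until (approximate) convergence:

```python
import numpy as np

def value_iteration(R, T, gamma, horizon=None, tol=1e-6):
    """Tabular value iteration on Q-values.

    R: array of shape (S, A), immediate rewards R(s, a)
    T: array of shape (S, A, S), transition probabilities T(s, a, s')
    gamma: discount factor
    horizon: if given, run exactly `horizon` Bellman backups (returns Q*_h);
             otherwise iterate until convergence (an estimate of Q*_infinity).
    """
    S, A = R.shape
    Q_old = np.zeros((S, A))
    i = 0
    while True:
        # Bellman backup: Q_new(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q_old(s',a')
        V_old = Q_old.max(axis=1)          # V(s') = max_a' Q_old(s', a')
        Q_new = R + gamma * (T @ V_old)    # shape (S, A)
        i += 1
        if horizon is not None and i >= horizon:
            return Q_new
        if horizon is None and np.max(np.abs(Q_new - Q_old)) < tol:
            return Q_new
        Q_old = Q_new
```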
Mario in a grid-world v1.0
(Markov-decision-process version)
e.g., \(\mathrm{T}\left(7, \uparrow, 4\right) = 1\)
\(\mathrm{T}\left(9, \rightarrow, 9\right) = 1\)
\(\mathrm{T}\left(6, \uparrow, 3\right) = 0.8\)
\(\mathrm{T}\left(6, \uparrow, 2\right) = 0.2\)
Mario in a grid-world v2.0
(reinforcement learning version)
e.g., \(\mathrm{T}\left(7, \uparrow, 4\right) = ?\)
\(\mathrm{T}\left(9, \rightarrow, 9\right) = ?\)
\(\mathrm{T}\left(6, \uparrow, 3\right) = ?\)
\(\mathrm{T}\left(6, \uparrow, 2\right) = ?\)
The goal of an MDP is to find a "good" policy.
Markov Decision Processes: Definition and terminology
Reinforcement Learning
RL
Reinforcement learning is very general:
robotics
games
social sciences
chatbots (RLHF)
health care
...
Model-based RL: Learn the MDP tuple
1. Collect a set of experiences \((s, a, r, s^{\prime})\)
2. Estimate \(\mathrm{\hat{T}}\), \(\mathrm{\hat{R}}\)
3. Solve \(\langle\mathcal{S}, \mathcal{A}, \mathrm{\hat{T}}, \mathrm{\hat{R}}, \gamma\rangle\) via e.g. Value Iteration
e.g. \(\hat{\mathrm{T}}(6,\uparrow, 2) \approx \dfrac{\text{count of observed } (6, \uparrow, 2) \text{ transitions}}{\text{count of observed } (6, \uparrow) \text{ visits}}\)
e.g. \(\hat{\mathrm{R}}(6,\uparrow) = \) the observed reward received from \((6, \uparrow)\)
\(\gamma = 0.9\)
Unknown transitions:
Unknown rewards:
| \((s, a)\) | \(r\) | \(s^{\prime}\) |
| --- | --- | --- |
| \(\dots\) | | |
| \((1,\uparrow)\) | 0 | 1 |
| \((1,\downarrow)\) | 0 | 4 |
| \((3,\uparrow)\) | 1 | 3 |
| \((3,\downarrow)\) | 1 | 6 |
| \(\dots\) | | |
| \((6,\uparrow)\) | -10 | 3 |
| \(\dots\) | | |
| \((6,\uparrow)\) | -10 | 2 |
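A minimal sketch of this count-based estimation, assuming experiences arrive as \((s, a, r, s^{\prime})\) tuples (the function name estimate_model and the string action labels are just for illustration):

```python
from collections import defaultdict

def estimate_model(experiences):
    """Count-based estimates of T-hat and R-hat from (s, a, r, s') tuples.

    T_hat[(s, a)][s'] is the empirical fraction of times action a taken in
    state s led to s'; R_hat[(s, a)] averages observed rewards (here rewards
    are deterministic, so one observation per (s, a) suffices).
    """
    sa_counts = defaultdict(int)                          # times (s, a) was tried
    sas_counts = defaultdict(lambda: defaultdict(int))    # times (s, a) led to s'
    reward_sums = defaultdict(float)

    for s, a, r, s_next in experiences:
        sa_counts[(s, a)] += 1
        sas_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r

    T_hat = {sa: {s_next: c / sa_counts[sa] for s_next, c in nexts.items()}
             for sa, nexts in sas_counts.items()}
    R_hat = {sa: reward_sums[sa] / sa_counts[sa] for sa in sa_counts}
    return T_hat, R_hat

# e.g. from the table above: two (6, "up") experiences, landing in 3 and in 2
experiences = [(1, "up", 0, 1), (1, "down", 0, 4), (3, "up", 1, 3),
               (3, "down", 1, 6), (6, "up", -10, 3), (6, "up", -10, 2)]
T_hat, R_hat = estimate_model(experiences)
print(T_hat[(6, "up")])   # {3: 0.5, 2: 0.5}
print(R_hat[(6, "up")])   # -10.0
```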
compounding error: if the learned MDP model is even slightly wrong, the policy computed from it can be far off
[A non-exhaustive, but useful taxonomy of algorithms in modern RL. Source]
We'll focus on (tabular) Q-learning
and to a lesser extent, touch on fitted Q-learning methods such as DQN
Direct Policy-Based
Is it possible to get an optimal policy without learning transition or rewards explicitly?
Yes! We know one way already:
(Recall, from the MDP lab)
Optimal policy \(\pi^*\) is easily extracted from \(\mathrm{Q}^*\): \(\pi^*(s) = \arg\max_{a} \mathrm{Q}^*(s, a)\)
6️⃣
and doesn't value iteration rely on the transitions and rewards explicitly?
\(\mathrm{Q}_{\text {new }}(s, a) \leftarrow \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\)
Value Iteration
But... didn't we arrive at \(\mathrm{Q}^*\) by value iteration,
(we will see this idea has issues)
\(\mathrm{Q}_{\text {new }}(s, a) \leftarrow \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\)
immediate reward
future value, starting in state \(s'\) and acting optimally for \((h-1)\) steps
expected future value, weighted by the chance of landing in that particular future state \(s'\)
target
\(\mathrm{Q}_{\text {new }}(s, a) \leftarrow ~ r +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\)
\(\mathrm{Q}_\text{old}(s, a)\)
\(\mathrm{Q}_{\text{new}}(s, a)\)
\(\gamma = 0.9\)

| \((s, a)\) | \(r\) | \(s^{\prime}\) |
| --- | --- | --- |
| \((1,\uparrow)\) | 0 | 1 |
| \((1,\downarrow)\) | 0 | 1 |
| \((3,\uparrow)\) | 1 | 3 |
| \((3,\downarrow)\) | 1 | 6 |
| \(\dots\) | | |
| \((6,\uparrow)\) | -10 | 3 |
| \(\dots\) | | |
| \((6,\uparrow)\) | -10 | 2 |
Focusing on the observed \((6, \uparrow)\) experiences, processed one at a time:

| \((s, a)\) | \(r\) | \(s^{\prime}\) |
| --- | --- | --- |
| \((6,\uparrow)\) | -10 | 3 |
| \((6,\uparrow)\) | -10 | 2 |
| \((6,\uparrow)\) | -10 | 3 |
| \((6,\uparrow)\) | -10 | 2 |
| \((6,\uparrow)\) | -10 | 3 |

First observation (\(s^{\prime} = 3\)): \(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow -10 +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 * 1 = -9.1\)
Next observation (\(s^{\prime} = 2\)): \(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow -10 +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0.9 * 0 = -10\)
Next observation (\(s^{\prime} = 3\)): \(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow -10 +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 * 1 = -9.1\)
Next observation (\(s^{\prime} = 2\)): \(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow -10 +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0.9 * 0 = -10\)
🥺 Simply committing to the new target keeps "washing away" the old belief:
Whenever we observe \((6, \uparrow), -10, 3\): \(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow -10 +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 * 1 = -9.1\)
Whenever we observe \((6, \uparrow), -10, 2\): \(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow -10 +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0.9 * 0 = -10\)
The estimate just jumps back and forth between the two targets.
😍 Instead, merge the old belief and the target, weighted by a learning rate \(\alpha\):
\(\underbrace{\mathrm{Q}_{\text {new }}(s, a)}_{\text{new belief}} \leftarrow (1-\alpha)\,\underbrace{\mathrm{Q}_{\text {old }}(s, a)}_{\text{old belief}} + \alpha \underbrace{\left(r +\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\right)}_{\text{target}}\)
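A quick sketch contrasting the two rules on the alternating \((6, \uparrow)\) observations from the frames above (assuming, as there, that \(\max_{a^{\prime}} \mathrm{Q}_{\text{old}}(3, a^{\prime}) = 1\) and \(\max_{a^{\prime}} \mathrm{Q}_{\text{old}}(2, a^{\prime}) = 0\) stay fixed); note that \(\alpha = 1\) recovers the "commit fully to the target" rule:

```python
def q_update(q_old_sa, r, max_q_next, gamma=0.9, alpha=0.7):
    """One Q-learning update: blend the old belief with the new target."""
    target = r + gamma * max_q_next
    return (1 - alpha) * q_old_sa + alpha * target

max_q_next = {3: 1.0, 2: 0.0}        # values read off the Q_old grid in the slides
for alpha in [1.0, 0.7]:             # alpha=1: commit fully; alpha=0.7: blend
    q = 0.0
    history = []
    for s_next in [3, 2, 3, 2, 3]:   # the alternating observed outcomes of (6, up)
        q = q_update(q, -10, max_q_next[s_next], alpha=alpha)
        history.append(round(q, 2))
    print(alpha, history)
# alpha=1.0 jumps fully to -9.1 or -10 at every step;
# alpha=0.7 keeps part of the old belief, staying between the two targets
```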
Let's run this update on the experience data, with \(\gamma = 0.9\), \(\alpha = 0.7\), and all Q-values initialized to 0:

\(\mathrm{Q}_{\text {new }}(s, a) \leftarrow (1-\alpha) \mathrm{Q}_{\text {old }}(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\right)\)

| \((s, a)\) | \(r\) | \(s^{\prime}\) |
| --- | --- | --- |
| \((1,\uparrow)\) | 0 | 1 |
| \((1,\downarrow)\) | 0 | 4 |
| \((1,\leftarrow)\) | 0 | 1 |
| \((1,\rightarrow)\) | 0 | 2 |
| \((3,\uparrow)\) | 1 | 3 |
| \((3,\downarrow)\) | 1 | 6 |
| \(\dots\) | | |

\(\mathrm{Q}_{\text {new }}(1, \uparrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(1, \uparrow)+0.7\left(0+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(1, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( 0 + 0.9*0) = 0\)
\(\mathrm{Q}_{\text {new }}(1, \downarrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(1, \downarrow)+0.7\left(0+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(4, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( 0 + 0.9*0) = 0\)
\(\mathrm{Q}_{\text {new }}(1, \leftarrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(1, \leftarrow)+0.7\left(0+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(1, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( 0 + 0.9*0) = 0\)
\(\mathrm{Q}_{\text {new }}(1, \rightarrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(1, \rightarrow)+0.7\left(0+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( 0 + 0.9*0) = 0\)
\(\mathrm{Q}_{\text {new }}(3, \uparrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(3, \uparrow)+0.7\left(1+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( 1 + 0.9*0) = 0.7\)
\(\mathrm{Q}_{\text {new }}(3, \downarrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(3, \downarrow)+0.7\left(1+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(6, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( 1 + 0.9*0) = 0.7\)
Continuing with the \((6, \uparrow)\) experiences (note \(\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\) is now 0.7):

| \((s, a)\) | \(r\) | \(s^{\prime}\) |
| --- | --- | --- |
| \((6,\uparrow)\) | -10 | 3 |
| \((6,\uparrow)\) | -10 | 2 |
| \((6,\uparrow)\) | -10 | 3 |
| \((6,\uparrow)\) | -10 | 2 |

\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(6, \uparrow)+0.7\left(-10+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\right) = (1-0.7)*0 + 0.7*( -10 + 0.9*0.7) = -6.56\)
\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(6, \uparrow)+0.7\left(-10+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = (1-0.7)*(-6.56) + 0.7*( -10 + 0.9*0) = -8.97\)
\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(6, \uparrow)+0.7\left(-10+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\right) = (1-0.7)*(-8.97) + 0.7*( -10 + 0.9*0.7) = -9.25\)
\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.7) \mathrm{Q}_{\text {old }}(6, \uparrow)+0.7\left(-10+ 0.9\max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = (1-0.7)*(-9.25) + 0.7*( -10 + 0.9*0) = -9.77\)
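A quick sketch replaying these four updates to check the arithmetic (assuming, as above, that \(\max_{a^{\prime}} \mathrm{Q}_{\text{old}}(3, a^{\prime}) = 0.7\) and \(\max_{a^{\prime}} \mathrm{Q}_{\text{old}}(2, a^{\prime}) = 0\) stay fixed while only \(\mathrm{Q}(6, \uparrow)\) is updated):

```python
gamma, alpha = 0.9, 0.7
max_q_next = {3: 0.7, 2: 0.0}   # held fixed while only Q(6, up) is updated
q = 0.0                         # Q_old(6, up) starts at 0
for s_next in [3, 2, 3, 2]:     # the four observed (6, up) outcomes
    target = -10 + gamma * max_q_next[s_next]
    q = (1 - alpha) * q + alpha * target
    print(round(q, 2))          # -6.56, -8.97, -9.25, -9.77
```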
Which experiences we collect depends on how we select actions.
This is the fundamental dilemma in reinforcement learning... and in life: whether to exploit what we've already learned, or explore to discover something better.
\(\epsilon\) controls the trade-off between exploration and exploitation.
During learning, especially in the early stages, we'd like to explore and observe the consequences of diverse \((s,a)\) pairs.
In later stages, we can act more greedily with respect to the current estimated Q-values.
\(\epsilon\)-greedy action selection strategy
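A minimal sketch of \(\epsilon\)-greedy selection (the dictionary representation of Q and the exact function signature are illustrative assumptions):

```python
import random

def select_action(s, Q, actions, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: greedy w.r.t. current Q
```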
Value Iteration\((\mathcal{S}, \mathcal{A}, \mathrm{T}, \mathrm{R}, \gamma, \epsilon)\)
"calculating"
"learning" (estimating)
Q-Learning \((\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0, \text{max-iter})\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
"learning"
\(^1\) provided we visit every \((s,a)\) pair infinitely often and the learning rate \(\alpha\) satisfies a suitable decay condition.
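A Python sketch of this pseudocode, updating Q in place rather than keeping separate \(\mathrm{Q}_\text{old}\)/\(\mathrm{Q}_\text{new}\) copies, using the \(\epsilon\)-greedy select_action sketched above, and assuming a hypothetical environment object env whose env.step(s, a) executes action a in state s and returns (r, s'):

```python
def q_learning(states, actions, gamma, alpha, s0, max_iter, env, epsilon=0.1):
    # lines 1-3: initialize all Q-values to zero
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = s0                                            # line 4
    for _ in range(max_iter):                         # line 5
        a = select_action(s, Q, actions, epsilon)     # line 6: epsilon-greedy choice
        r, s_next = env.step(s, a)                    # line 7: act, observe reward and next state
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target   # line 8: blend old belief and target
        s = s_next                                    # line 9
    return Q                                          # line 12
```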
each of these is between 0 and 1 and controls some trade-off
Continuous state and action space
\(10^{16992}\) (pixels) states
Fitted Q-learning: from table to functions
The tabular Q-learning update is equivalently:
\(\mathrm{Q}_{\text {new}}(s, a) \leftarrow\mathrm{Q}_{\text {old }}(s, a)+\alpha\left([r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old}}(s', a')] - \mathrm{Q}_{\text {old }}(s, a)\right)\)
i.e., new belief \(\leftarrow\) old belief \(+\) learning rate \(\times\) (target \(-\) old belief)
Gradient update rule when minimizing \((\text{target} - \text{guess}_{\theta})^2\)
\(\theta_{\text{new}} \leftarrow \theta_{\text{old}} + \eta (\text{target} - \text{guess}_{\theta})\nabla_{\theta}\text{guess}\)
remind you of something?
Fitted Q-learning takes the same kind of gradient step on the loss \(\left(\text{target} -\mathrm{Q}_{\theta}(s, a)\right)^2\), with target \(= r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\theta}\left(s^{\prime}, a^{\prime}\right)\) held fixed.
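To make the connection concrete, here is a minimal sketch of one fitted-Q gradient step with a linear approximator \(\mathrm{Q}_\theta(s,a) = \theta^\top \phi(s,a)\); the feature map phi and step size eta are illustrative assumptions, not part of the slides:

```python
import numpy as np

def fitted_q_step(theta, phi, s, a, r, s_next, actions, gamma=0.9, eta=0.01):
    """One SGD step on (target - Q_theta(s, a))^2 with Q_theta(s, a) = theta . phi(s, a).

    The target r + gamma * max_a' Q_theta(s', a') is treated as a constant
    (no gradient flows through it), mirroring the tabular update above.
    """
    target = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    guess = theta @ phi(s, a)
    # gradient of the squared error w.r.t. theta is -2 (target - guess) phi(s, a);
    # the factor 2 is folded into the step size eta
    return theta + eta * (target - guess) * phi(s, a)
```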
[Slide Credit: Yann LeCun]
Reinforcement learning has a lot of challenges:
...
We'd love to hear your thoughts.
Let's try updating the estimate of \(\mathrm{Q}(6, \uparrow)\) directly from experience (rewards now known, \(\gamma = 0.9\)):
observe \(s^{\prime} = 2\): \(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0 = -10\)
observe \(s^{\prime} = 3\): \(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 = -9.1\)
observe \(s^{\prime} = 2\): \(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0 = -10\)
observe \(s^{\prime} = 3\): \(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 = -9.1\)
Keep playing the game to approximate the unknown rewards and transitions.
e.g. observe what reward \(r\) is received from taking the \((6, \uparrow)\) pair; that gives us \(\mathrm{R}(6,\uparrow)\).
e.g. play the game many times; among the plays where we take \(\uparrow\) in state 6, the fraction that end up in state 2 is roughly \(\mathrm{T}(6,\uparrow, 2)\).
Now, with \(\mathrm{R}\) and \(\mathrm{T}\) estimated, we're back in the MDP setting.
In Reinforcement Learning:
\(\gamma = 0.9\)
Unknown transitions:
Unknown rewards: since rewards are deterministic, we recover them once every \((s,a)\) pair has been tried at least once.
| \((s, a)\) | \(r\) | \(s^{\prime}\) |
| --- | --- | --- |
| \((1,\downarrow)\) | 0 | 1 |
| \((1,\uparrow)\) | 0 | 1 |
| \((3,\uparrow)\) | 1 | 3 |
| \((3,\downarrow)\) | 1 | 6 |
| \(\dots\) | | |
| \((6,\uparrow)\) | -10 | 3 |
| \(\dots\) | | |
| \((6,\uparrow)\) | -10 | 2 |
Q-learning update: to update the estimate of \(\mathrm{Q}(6, \uparrow)\) (rewards now known, \(\gamma = 0.9\)), e.g. pick \(\alpha = 0.5\):
\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.5)*(-10) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\right) = -5 + 0.5(-10 + 0.9) = -9.55\)
\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.5)*(-9.55) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = -4.775 + 0.5(-10 + 0) = -9.775\)
\(\mathrm{Q}_{\text {new }}(6, \uparrow) \leftarrow (1-0.5)*(-9.775) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = -4.8875 + 0.5(-10 + 0) = -9.8875\)
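A quick sketch replaying these \(\alpha = 0.5\) updates (assuming, as read off the slides, that \(\max_{a^{\prime}} \mathrm{Q}_{\text{old}}(3, a^{\prime}) = 1\), \(\max_{a^{\prime}} \mathrm{Q}_{\text{old}}(2, a^{\prime}) = 0\), and the starting estimate is \(\mathrm{Q}_{\text{old}}(6, \uparrow) = -10\)):

```python
gamma, alpha = 0.9, 0.5
max_q_next = {3: 1.0, 2: 0.0}   # values read off the Q_old grid in the slides
q = -10.0                       # starting estimate of Q(6, up)
for s_next in [3, 2, 2]:        # the three observed (6, up) outcomes above
    target = -10 + gamma * max_q_next[s_next]
    q = (1 - alpha) * q + alpha * target
    print(round(q, 4))          # -9.55, -9.775, -9.8875
```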