Shen Shen
April 16, 2025
2:30pm, Room 32-144
Last time, we ended at:
Value Iteration
which gives us a recipe for finding \(\pi_h^*(s)\), by picking \(\arg \max _a \mathrm{Q}^h(s, a), \forall s, h\)
Recall:
\(\gamma = 0.9\)
States and one special transition:
\(\mathrm{Q}^h(s, a)\): the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
Let's consider \(\mathrm{Q}^2(6, \uparrow)\):
\(\mathrm{Q}^2(6, \uparrow) = \mathrm{R}(6,\uparrow) + \gamma\left[.2 \max _{a^{\prime}} \mathrm{Q}^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right)\right] = -10 + 0.9\,[0.2 \times 0 + 0.8 \times 1] = -9.28\)
In general:
\(\mathrm{Q}^h(s, a) = \mathrm{R}(s, a) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}^{h-1}\left(s^{\prime}, a^{\prime}\right)\)
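To make the recursion concrete, here is a minimal finite-horizon value iteration sketch in Python; the array-based representation and the toy `T`, `R` at the end are illustrative assumptions, not the lecture's example:

```python
import numpy as np

def value_iteration(T, R, gamma, horizon):
    """Finite-horizon value iteration.

    T[s, a, s'] = P(s' | s, a), shape (S, A, S)
    R[s, a]     = immediate reward, shape (S, A)
    Returns Q with shape (horizon + 1, S, A), where Q[h] holds Q^h.
    """
    S, A, _ = T.shape
    Q = np.zeros((horizon + 1, S, A))      # Q^0(s, a) = 0
    for h in range(1, horizon + 1):
        V_prev = Q[h - 1].max(axis=1)      # max_{a'} Q^{h-1}(s', a') for each s'
        Q[h] = R + gamma * (T @ V_prev)    # R(s,a) + gamma * sum_{s'} T(s,a,s') V_prev(s')
    return Q

# Toy 2-state, 2-action problem (illustrative numbers only):
T = np.zeros((2, 2, 2))
T[:, :, 0], T[:, :, 1] = 0.2, 0.8
R = np.array([[0.0, 1.0], [-10.0, 0.0]])
Q = value_iteration(T, R, gamma=0.9, horizon=2)
pi_2 = Q[2].argmax(axis=1)                 # greedy horizon-2 policy, one action per state
```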
States and unknown transition:
Game setup: the transition model and the rewards are unknown; all we can do is try out actions and observe what happens. For example, execute \((3, \uparrow)\) and observe a reward \(r=1\).
We keep a table of current estimates \(\mathrm{Q}_\text{old}(s, a)\) and use each observation to compute updated estimates \(\mathrm{Q}_{\text{new}}(s, a)\).
To update the estimate of \(\mathrm{Q}(6, \uparrow)\) (with \(\gamma = 0.9\)): try out \((6, \uparrow)\), observe the reward and the next state \(s'\), and form the one-step target \(r + \gamma \max_{a^{\prime}} \mathrm{Q}_{\text{old}}\left(s', a^{\prime}\right)\).
If the try-out happens to land in state 3:
\(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right) = -10 + 0.9 = -9.1\)
If instead it lands in state 2:
\(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right) = -10 + 0 = -10\)
Simply overwriting \(\mathrm{Q}_{\text{new}}(6, \uparrow)\) with whichever target we happen to observe makes the estimate bounce around with the randomness of the transition. 🥺
Better idea: rather than overwriting the old belief with the new target, blend them. Pick a learning rate \(\alpha = 0.5\) and take a weighted average: 😍
\(\mathrm{Q}_{\text{new}}(s, a) \leftarrow (1-\alpha)\,\underbrace{\mathrm{Q}_{\text{old}}(s, a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\,\underbrace{\left(r + \gamma \max_{a^{\prime}} \mathrm{Q}_{\text{old}}\left(s', a^{\prime}\right)\right)}_{\text{target}}\)
To update the estimate of \(\mathrm{Q}(6, \uparrow)\) (with \(\gamma = 0.9\) and current estimate \(\mathrm{Q}_{\text{old}}(6, \uparrow) = -10\)):
First try-out lands in state 3:
\(\mathrm{Q}_{\text{new}}(6, \uparrow) \leftarrow (1-0.5)(-10) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(3, a^{\prime}\right)\right) = -5 + 0.5(-10 + 0.9)= -9.55\)
Second try-out lands in state 2:
\(\mathrm{Q}_{\text{new}}(6, \uparrow) \leftarrow (1-0.5)(-9.55) + 0.5\left(-10 + 0.9 \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(2, a^{\prime}\right)\right) = 0.5(-9.55) + 0.5(-10 + 0)= -9.775\)
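A quick numeric check of these two updates (a standalone sketch; the values \(\max_{a'}\mathrm{Q}_{\text{old}}(3,a')=1\) and \(\max_{a'}\mathrm{Q}_{\text{old}}(2,a')=0\) are taken from the example above):

```python
gamma, alpha = 0.9, 0.5

def blend(q_old, r, max_q_next):
    """(1 - alpha) * old belief + alpha * (r + gamma * max_a' Q_old(s', a'))."""
    return (1 - alpha) * q_old + alpha * (r + gamma * max_q_next)

q = -10.0                       # current estimate of Q(6, up)
q = blend(q, -10.0, 1.0)        # lands in state 3: max_a' Q_old(3, a') = 1  -> -9.55
q = blend(q, -10.0, 0.0)        # lands in state 2: max_a' Q_old(2, a') = 0  -> -9.775
print(q)                        # -9.775
```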
Value Iteration\((\mathcal{S}, \mathcal{A}, \mathrm{T}, \mathrm{R}, \gamma, \epsilon)\) takes the transition model \(\mathrm{T}\) and the rewards \(\mathrm{R}\) as inputs: it is "calculating" the \(\mathrm{Q}\) values. Q-learning never gets to see \(\mathrm{T}\) or \(\mathrm{R}\): it is "learning" (estimating) the \(\mathrm{Q}\) values from observed rewards and transitions.
Q-Learning \(\left(\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0, \text{max-iter}\right)\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
"learning"
Q-Learning \(\left(\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0\right. \text{max-iter})\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
\(\mathrm{Q}_{\text{old}}\) holds the current estimate of the \(\mathrm{Q}\) values; remarkably, these estimates converge to the optimal \(\mathrm{Q}\) values.\(^1\)
\(^1\) Given we visit all \((s, a)\) pairs infinitely often, and satisfy a condition on the learning rate \(\alpha\).
"learning"
Q-Learning \(\left(\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0\right. \text{max-iter})\)
1. \(i=0\)
2. for \(s \in \mathcal{S}, a \in \mathcal{A}:\)
3. \({\mathrm{Q}_\text{old}}(s, a) = 0\)
4. \(s \leftarrow s_0\)
5. while \(i < \text{max-iter}:\)
6. \(a \gets \text{select}\_\text{action}(s, {\mathrm{Q}_\text{old}}(s, a))\)
7. \(r,s' \gets \text{execute}(a)\)
8. \({\mathrm{Q}}_{\text{new}}(s, a) \leftarrow (1-\alpha){\mathrm{Q}}_{\text{old}}(s, a) + \alpha(r + \gamma \max_{a'}{\mathrm{Q}}_{\text{old}}(s', a'))\)
9. \(s \leftarrow s'\)
10. \(i \leftarrow (i+1)\)
11. \(\mathrm{Q}_{\text{old}} \leftarrow \mathrm{Q}_{\text{new}}\)
12. return \(\mathrm{Q}_{\text{new}}\)
The update rule in line 8 is equivalently:
\(\mathrm{Q}_{\text {new}}(s, a) \leftarrow\mathrm{Q}_{\text {old }}(s, a)+\alpha\left(\left[r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\text {old}}\left(s', a'\right)\right] - \mathrm{Q}_{\text {old }}(s, a)\right)\)
i.e., new belief \(\leftarrow\) old belief \(+\) learning rate \(\times\) (target \(-\) old belief).
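For reference, a minimal Python sketch of this tabular loop; the `env` object with `reset()` and `step(s, a)` and the ε-greedy choice for `select_action` are assumptions, since the pseudocode leaves both unspecified:

```python
import random

def q_learning(states, actions, env, gamma=0.9, alpha=0.5,
               epsilon=0.1, max_iter=10_000):
    """Tabular Q-learning, following the pseudocode above."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # lines 2-3: initialize to 0
    s = env.reset()                                      # line 4: s <- s_0
    for _ in range(max_iter):                            # line 5
        # line 6: one common select_action choice is epsilon-greedy on current estimates
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        r, s_next = env.step(s, a)                       # line 7: execute a, observe r, s'
        target = r + gamma * max(Q[(s_next, act)] for act in actions)
        # line 8, in the equivalent "old belief + alpha * (target - old belief)" form
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next                                       # line 9
    return Q
```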
When \(\mathrm{Q}\) is represented by a parameterized function \(\mathrm{Q}_\theta\) (e.g., a neural network), we instead take a gradient step on the squared loss \(\left(\mathrm{Q}_{\theta}(s, a)-\text{target}\right)^2\), with \(\text{target} = r+\gamma \max _{a^{\prime}} \mathrm{Q}_{\theta}\left(s^{\prime}, a^{\prime}\right)\).
Gradient descent does: \(\theta_{\text{new}} \leftarrow \theta_{\text{old}} + \eta\, (\text{target} - \text{guess}_{\theta})\,\nabla_\theta \text{guess}_{\theta}\), where \(\text{guess}_{\theta} = \mathrm{Q}_{\theta}(s, a)\).
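A sketch of one such gradient step, assuming (for concreteness) a linear parameterization \(\mathrm{Q}_\theta(s,a) = \theta^\top \phi(s,a)\); the feature map \(\phi\) is an illustrative assumption, not something defined in the lecture:

```python
import numpy as np

def q_gradient_step(theta, phi_sa, r, phi_next_all, gamma=0.9, eta=0.01):
    """One gradient step on (Q_theta(s,a) - target)^2 with Q_theta(s,a) = theta . phi(s,a).

    phi_sa:       feature vector phi(s, a) of the executed state-action pair
    phi_next_all: list of phi(s', a') vectors, one per action a'
    The target is held fixed (no gradient flows through it).
    """
    target = r + gamma * max(theta @ phi for phi in phi_next_all)
    guess = theta @ phi_sa                    # Q_theta(s, a)
    # For a linear Q_theta, grad_theta guess = phi(s, a).
    return theta + eta * (target - guess) * phi_sa
```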
Quick Summary
Identity \(\begin{aligned} \nabla_\theta p_\theta(\tau) & =p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \\ & =p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\end{aligned}\)
Policy gradient: directly optimize a parameterized policy \(\pi_\theta\) by gradient ascent on the expected return \(U(\theta)=\sum_\tau P(\tau ; \theta) R(\tau)\). Using the identity above:
\(\begin{aligned} \nabla_\theta U(\theta) & =\nabla_\theta \sum_\tau P(\tau ; \theta) R(\tau) \\ & =\sum_\tau \nabla_\theta P(\tau ; \theta) R(\tau) \\ & =\sum_\tau \frac{P(\tau ; \theta)}{P(\tau ; \theta)} \nabla_\theta P(\tau ; \theta) R(\tau) \\ & =\sum_\tau P(\tau ; \theta) \frac{\nabla_\theta P(\tau ; \theta)}{P(\tau ; \theta)} R(\tau) \\ & =\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\end{aligned}\)
where \(P(\tau ; \theta)=\prod_{t=0} \underbrace{P\left(s_{t+1} \mid s_t, a_t\right)}_{\text {transition }} \cdot \underbrace{\pi_\theta\left(a_t \mid s_t\right)}_{\text {policy }}\)
But the transition probabilities inside \(P(\tau ; \theta)\) are unknown... Stuck?
Approximate with the empirical estimate for \(m\) sample traj. under policy \(\pi_\theta\)
\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right) R\left(\tau^{(i)}\right)\)
Valid even when the reward \(R(\tau)\) is discontinuous and/or unknown in closed form.
Recall \(P(\tau ; \theta)=\prod_{t=0} \underbrace{P\left(s_{t+1} \mid s_t, a_t\right)}_{\text {transition }} \cdot \underbrace{\pi_\theta\left(a_t \mid s_t\right)}_{\text {policy }}\), so
\(\begin{aligned} \nabla_\theta \log P(\tau ; \theta) & =\nabla_\theta \log \left[\prod_{t=0} P\left(s_{t+1} \mid s_t, a_t\right) \cdot \pi_\theta\left(a_t \mid s_t\right)\right] \\ & =\nabla_\theta\left[\sum_{t=0} \log P\left(s_{t+1} \mid s_t, a_t\right)+\sum_{t=0} \log \pi_\theta\left(a_t \mid s_t\right)\right] \\ & =\nabla_\theta \sum_{t=0} \log \pi_\theta\left(a_t \mid s_t\right) \\ & =\sum_{t=0} \underbrace{\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)}_{\text {no transition model required}}\end{aligned}\)
(the transition term does not depend on \(\theta\), so its gradient is zero).
This \(\hat{g}\) is an unbiased estimator, \(\mathrm{E}[\hat{g}]=\nabla_\theta U(\theta)\), and we can compute it without access to the world model (no transition probabilities needed); it is, however, very noisy.
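A sketch of computing \(\hat{g}\) for a linear softmax policy over discrete actions; the trajectory format and the softmax parameterization are illustrative assumptions:

```python
import numpy as np

def grad_log_softmax_policy(theta, s_feat, a):
    """grad_theta log pi_theta(a | s) for pi_theta(a|s) = softmax_a(theta[a] . phi(s)).

    theta: (n_actions, d) weights; s_feat: (d,) state features; a: taken action index.
    """
    logits = theta @ s_feat
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(probs, s_feat)       # row b: -pi(b|s) * phi(s)
    grad[a] += s_feat                     # row a gets the extra +phi(s)
    return grad

def policy_gradient_estimate(theta, trajectories):
    """g_hat = (1/m) sum_i [ sum_t grad_theta log pi(a_t | s_t) ] * R(tau_i).

    trajectories: list of (steps, R_tau), where steps is a list of (s_feat, a) pairs.
    """
    g_hat = np.zeros_like(theta)
    for steps, R_tau in trajectories:
        score = sum(grad_log_softmax_policy(theta, s, a) for s, a in steps)
        g_hat += R_tau * score
    return g_hat / len(trajectories)
```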
We'd love to hear your thoughts.