Artyom Sorokin | 19 Feb
Reward-to-go / Return:
State Value Function / V-Function:
State-Action Value Function / Q-Function:
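In standard notation (a sketch of the usual definitions; \(\gamma\) is the discount factor and \(r_{t+1}\) the reward received after taking \(a_t\) in \(s_t\)):
\[
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
\]
\[
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s \right]
\qquad
Q_{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s,\, a_t = a \right]
\]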
Bellman Expectation Equations for policy \(\pi\):
Bellman Optimality Equations:
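For reference, a sketch of the usual forms, written with the transition model \(p(s'|s,a)\) and reward \(R(s,a)\) used later in this section.
Expectation (for policy \(\pi\)):
\[
V_{\pi}(s) = \sum_{a}\pi(a|s)\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{\pi}(s')\Big)
\]
\[
Q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, Q_{\pi}(s',a')
\]
Optimality:
\[
V_{*}(s) = \max_{a}\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{*}(s')\Big)
\]
\[
Q_{*}(s,a) = R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, \max_{a'} Q_{*}(s',a')
\]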
Use Bellman Expectation Equations to learn \(V\)/\(Q\) for the current policy
Greedily update the policy w.r.t. the V/Q-function
Policy Evaluation steps
Policy Improvement steps
Improves the value function estimate for the current policy \(\pi\)
Improves the policy \(\pi\) w.r.t. the current value function
Use Bellman Optimality Equations to learn the optimal \(V^*\)/\(Q^*\) directly
Policy Improvement is implicitly used here
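Written out as updates (a sketch in the same notation):
Policy Evaluation, iterating the Bellman Expectation Equation:
\[
V_{k+1}(s) \leftarrow \sum_{a}\pi(a|s)\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{k}(s')\Big)
\]
Policy Improvement, greedy w.r.t. the current value function:
\[
\pi'(s) \leftarrow \arg\max_{a} Q_{\pi}(s,a)
\]
Value Iteration, iterating the Bellman Optimality Equation (the greedy improvement is folded into the \(\max\)):
\[
V_{k+1}(s) \leftarrow \max_{a}\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{k}(s')\Big)
\]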
GOAL: Learn value functions \(Q_{\pi}\) or \(V_{\pi}\) without knowing \(p(s'|s,a)\) and \(R(s,a)\)
RECALL that the value function is the expected return:
By the Law of Large Numbers, \(q(s,a) \rightarrow Q_{\pi}(s,a)\) as \(N(s,a) \rightarrow \infty\)
IDEA: Estimate expectation \(Q_{\pi}(s,a)\) with empirical mean \(q(s,a)\):
We can update mean values incrementally:
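A sketch of the empirical mean and its incremental form (here \(G_i\) denotes the \(i\)-th return observed after a visit to \((s,a)\), and \(\mu_k\) the mean of \(k\) samples \(x_1,\dots,x_k\)):
\[
q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G_i
\]
\[
\mu_k = \frac{1}{k}\sum_{i=1}^{k} x_i = \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big)
\]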
Incremental Monte Carlo Update:
Prediction error
Old estimate
Learning rate
In non-stationary problems we can use a fixed learning rate \(\alpha\) instead of \(\frac{1}{N(s,a)}\):
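Putting it together (a sketch; the term in brackets is the prediction error, \(q(s_t,a_t)\) on the right is the old estimate, and \(\frac{1}{N(s_t,a_t)}\) or \(\alpha\) plays the role of the learning rate):
\[
N(s_t,a_t) \leftarrow N(s_t,a_t) + 1, \qquad
q(s_t,a_t) \leftarrow q(s_t,a_t) + \frac{1}{N(s_t,a_t)}\big(G_t - q(s_t,a_t)\big)
\]
With a fixed learning rate:
\[
q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\big(G_t - q(s_t,a_t)\big)
\]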
Remember Policy Iteration?
What would Policy Iteration look like with Monte-Carlo Policy Evaluation?
Questions:
The agent can't visit every \((s,a)\) with a greedy policy!
The agent can't get correct \(q(s,a)\) estimates without visiting \((s,a)\) frequently!
(i.e. remember the Law of Large Numbers)
Use \(\epsilon\)-greedy policy:
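A sketch of the standard \(\epsilon\)-greedy definition, for \(m = |\mathcal{A}|\) actions:
\[
\pi(a|s) =
\begin{cases}
1 - \epsilon + \frac{\epsilon}{m} & \text{if } a = \arg\max_{a'} q(s,a') \\
\frac{\epsilon}{m} & \text{otherwise}
\end{cases}
\]
With probability \(1-\epsilon\) the agent exploits the greedy action, with probability \(\epsilon\) it picks an action uniformly at random, so every \((s,a)\) keeps being visited.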
Policy Iteration with Monte-Carlo method:
For every episode:
GLIE Monte-Carlo Control (Greedy in the Limit with Infinite Exploration):
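A sketch of one iteration of GLIE Monte-Carlo control, assuming the common schedule \(\epsilon_k = 1/k\) for the \(k\)-th episode: generate an episode with the current \(\epsilon\)-greedy policy, then for every \((s_t,a_t)\) in it
\[
N(s_t,a_t) \leftarrow N(s_t,a_t) + 1, \qquad
q(s_t,a_t) \leftarrow q(s_t,a_t) + \frac{1}{N(s_t,a_t)}\big(G_t - q(s_t,a_t)\big),
\]
and then improve the policy: \(\epsilon \leftarrow 1/k\), \(\pi \leftarrow \epsilon\text{-greedy}(q)\).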
Problems with the Monte-Carlo method: updates need complete returns \(G_t\), so we must wait until the end of an episode before learning anything.
Solution: bootstrap from the current value estimates and update after every step, i.e. Temporal-Difference learning.
Goal: learn \(Q_{\pi}\) online from experience
Incremental Monte-Carlo:
Temporal-Difference learning:
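A sketch of the two updates being contrasted, in the notation above.
Incremental Monte-Carlo:
\[
q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\big(G_t - q(s_t,a_t)\big)
\]
Temporal-Difference learning (TD(0)):
\[
q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\big(r_{t+1} + \gamma\, q(s_{t+1},a_{t+1}) - q(s_t,a_t)\big)
\]
Instead of waiting for the full return \(G_t\), TD bootstraps from the current estimate at the next state.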
\(r_{t+1} + \gamma q(s_{t+1}, a_{t+1})\) is called the TD target
\(\delta_t = r_{t+1} + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t)\) is called the TD error
Temporal Difference Learning:
This update is called SARSA: State, Action, Reward, next State, next Action
Policy Iteration with Temporal Difference Learning:
For every step:
We approximate the Bellman Expectation Equation with the SARSA update:
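A minimal tabular sketch of this loop (all names here are hypothetical and not from the slides; it assumes a Gymnasium-style `env` with discrete states and actions, and a NumPy table `q` of shape `[n_states, n_actions]`):

```python
import numpy as np

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # hypothetical hyperparameters


def epsilon_greedy(q, s, n_actions, eps):
    """Sample an action from the eps-greedy policy w.r.t. the current q-table."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(q[s]))


def sarsa_episode(env, q):
    """Run one episode, applying the SARSA update after every step."""
    s, _ = env.reset()
    a = epsilon_greedy(q, s, env.action_space.n, EPSILON)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = epsilon_greedy(q, s_next, env.action_space.n, EPSILON)
        # On-policy TD target: bootstrap from the action actually taken next.
        target = r if terminated else r + GAMMA * q[s_next, a_next]
        q[s, a] += ALPHA * (target - q[s, a])  # policy evaluation (one SARSA step)
        s, a = s_next, a_next  # policy improvement is implicit: eps-greedy w.r.t. q
    return q
```

Evaluation and improvement are interleaved at every step, which is exactly the Policy Iteration with Temporal Difference Learning loop described above.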
Can we utilize Bellman Optimality Equation for TD-Learning?
Yes, of course:
From Bellman Expectation Equation (SARSA):
From Bellman Optimality Equation (Q-Learning):
In SARSA, \(a'\) comes from the policy \(\pi\) that generated this experience!
In Q-Learning, the \(\max_{a'}\) has no connection to the policy \(\pi\) that generated the experience.
Q-Learning Update:
SARSA Update:
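Side by side, a sketch in the notation above (\(s'\), \(r\) are the next state and reward after taking \(a\) in \(s\)):
Q-Learning Update (Bellman Optimality):
\[
q(s,a) \leftarrow q(s,a) + \alpha\big(r + \gamma \max_{a'} q(s',a') - q(s,a)\big)
\]
SARSA Update (Bellman Expectation):
\[
q(s,a) \leftarrow q(s,a) + \alpha\big(r + \gamma\, q(s',a') - q(s,a)\big), \qquad a' \text{ taken by the behaviour policy}
\]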
SARSA and Monte-Carlo are on-policy algorithms: they evaluate and improve the same policy that is used to select actions.
Q-Learning is an off-policy algorithm: it learns about the greedy policy while the experience can be generated by a different (e.g. \(\epsilon\)-greedy) behaviour policy.
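To make the off-policy point concrete, a minimal tabular Q-Learning step (hypothetical names again; `q` is a NumPy table indexed as `q[state, action]`):

```python
def q_learning_step(q, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update.

    The behaviour policy that produced `a` (e.g. eps-greedy, or even a replayed
    transition from an old policy) never appears here: the target bootstraps
    from the greedy action max_a' q(s', a'), which is what makes it off-policy.
    """
    target = r if terminated else r + gamma * q[s_next].max()
    q[s, a] += alpha * (target - q[s, a])
    return q
```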
Monte Carlo vs. Temporal Difference
Consider the following n-step returns for n = 1, 2, ...:
\[
G^{(1)}_t = r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}) \qquad \text{(TD: SARSA)}
\]
\[
G^{(2)}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2\, q(s_{t+2}, a_{t+2})
\]
\[
\vdots
\]
\[
G^{(\infty)}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_{T} \qquad \text{(MC)}
\]
n-step Temporal Difference Learning:
\[
q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \big( G^{(n)}_t - q(s_t, a_t) \big)
\]
We can average n-step returns over different n,
e.g. average the 2-step and 4-step returns:
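For example, with equal weights (a sketch):
\[
\tfrac{1}{2}\, G^{(2)}_t + \tfrac{1}{2}\, G^{(4)}_t
\]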
But why?
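One standard motivation: a single \(n\) is an awkward hyperparameter, and averaging combines backups of different depths into one target. TD(\(\lambda\)) takes this idea to its limit and averages all n-step returns with geometric weights, giving the \(\lambda\)-return (a sketch of the usual definition):
\[
G^{\lambda}_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G^{(n)}_t
\]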
What happens when \(\lambda = 0\)? The target reduces to the one-step TD target, i.e. just TD-learning (SARSA).
What happens when \(\lambda = 1\)? The target reduces to the full return, i.e. Monte-Carlo learning.
We can rewrite \(G^{\lambda}_t\) recursively. How?
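One way, a sketch of the standard recursion (it follows by splitting \(r_{t+1}\) off every n-step return in the sum):
\[
G^{\lambda}_t = r_{t+1} + \gamma\Big((1-\lambda)\, q(s_{t+1}, a_{t+1}) + \lambda\, G^{\lambda}_{t+1}\Big)
\]
Setting \(\lambda = 0\) recovers the one-step TD (SARSA) target, while \(\lambda = 1\) telescopes into the full Monte-Carlo return.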