Artyom Sorokin | 19 Feb
Reward-to-go / Return:
State Value Function / V-Function:
State-Action Value Function / Q-Function:
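In standard notation (a sketch of the usual definitions; \(\gamma\) is the discount factor and \(r_{t+1}\) the reward received after taking \(a_t\) in \(s_t\)):
\[
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
\]
\[
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s \right]
\qquad
Q_{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s,\, a_t = a \right]
\]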
Bellman Expectation Equations for policy \(\pi\):
Bellman Optimality Equations:
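For reference, a sketch of the usual forms, written with the transition model \(p(s'|s,a)\) and reward \(R(s,a)\) used later in this section.
Expectation (for policy \(\pi\)):
\[
V_{\pi}(s) = \sum_{a}\pi(a|s)\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{\pi}(s')\Big)
\]
\[
Q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, Q_{\pi}(s',a')
\]
Optimality:
\[
V_{*}(s) = \max_{a}\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{*}(s')\Big)
\]
\[
Q_{*}(s,a) = R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, \max_{a'} Q_{*}(s',a')
\]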
Use Bellman Expectation Equations to learn \(V\)/\(Q\) for the current policy
Greedily update the policy w.r.t. the V/Q-function
Policy Evaluation steps
Policy Improvement steps
Improves the value function estimate for the current policy \(\pi\)
Improves the policy \(\pi\) w.r.t. the current value function
Use Bellman Optimality Equations to learn the optimal \(V^*\)/\(Q^*\) directly
Policy Improvement is implicitly used here
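Written out as updates (a sketch in the same notation):
Policy Evaluation, iterating the Bellman Expectation Equation:
\[
V_{k+1}(s) \leftarrow \sum_{a}\pi(a|s)\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{k}(s')\Big)
\]
Policy Improvement, greedy w.r.t. the current value function:
\[
\pi'(s) \leftarrow \arg\max_{a} Q_{\pi}(s,a)
\]
Value Iteration, iterating the Bellman Optimality Equation (the greedy improvement is folded into the \(\max\)):
\[
V_{k+1}(s) \leftarrow \max_{a}\Big(R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_{k}(s')\Big)
\]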
GOAL: Learn value functions \(Q_{\pi}\) or \(V_{\pi}\) without knowing \(p(s'|s,a)\) and \(R(s,a)\)
RECALL that the value function is the expected return:
By the Law of Large Numbers, \(q(s,a) \rightarrow Q_{\pi}(s,a)\) as \(N(s,a) \rightarrow \infty\)
IDEA: Estimate expectation \(Q_{\pi}(s,a)\) with empirical mean \(q(s,a)\):
We can update mean values incrementally:
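A sketch of the empirical mean and its incremental form (here \(G_i\) denotes the \(i\)-th return observed after a visit to \((s,a)\), and \(\mu_k\) the mean of \(k\) samples \(x_1,\dots,x_k\)):
\[
q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G_i
\]
\[
\mu_k = \frac{1}{k}\sum_{i=1}^{k} x_i = \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big)
\]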
Incremental Monte Carlo Update:
Prediction error
Old estimate
Learning rate
In non-stationary problems we can use a fixed learning rate \(\alpha\) instead of \(\frac{1}{N(s,a)}\):
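Putting it together (a sketch; the term in brackets is the prediction error, \(q(s_t,a_t)\) on the right is the old estimate, and \(\frac{1}{N(s_t,a_t)}\) or \(\alpha\) plays the role of the learning rate):
\[
N(s_t,a_t) \leftarrow N(s_t,a_t) + 1, \qquad
q(s_t,a_t) \leftarrow q(s_t,a_t) + \frac{1}{N(s_t,a_t)}\big(G_t - q(s_t,a_t)\big)
\]
With a fixed learning rate:
\[
q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\big(G_t - q(s_t,a_t)\big)
\]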
Remember Policy Iteration?
What would Policy Iteration look like with Monte-Carlo Policy Evaluation?
Questions:
The agent can't visit every \((s,a)\) with a greedy policy!
The agent can't get correct \(q(s,a)\) estimates without visiting \((s,a)\) frequently!
(i.e. remember the Law of Large Numbers)
Use \(\epsilon\)-greedy policy:
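A sketch of the standard \(\epsilon\)-greedy definition, for \(m = |\mathcal{A}|\) actions:
\[
\pi(a|s) =
\begin{cases}
1 - \epsilon + \frac{\epsilon}{m} & \text{if } a = \arg\max_{a'} q(s,a') \\
\frac{\epsilon}{m} & \text{otherwise}
\end{cases}
\]
With probability \(1-\epsilon\) the agent exploits the greedy action, with probability \(\epsilon\) it picks an action uniformly at random, so every \((s,a)\) keeps being visited.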
Policy Iteration with Monte-Carlo method:
For every episode:
GLIE Monte-Carlo Control (Greedy in the Limit with Infinite Exploration):
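A sketch of one iteration of GLIE Monte-Carlo control, assuming the common schedule \(\epsilon_k = 1/k\) for the \(k\)-th episode: generate an episode with the current \(\epsilon\)-greedy policy, then for every \((s_t,a_t)\) in it
\[
N(s_t,a_t) \leftarrow N(s_t,a_t) + 1, \qquad
q(s_t,a_t) \leftarrow q(s_t,a_t) + \frac{1}{N(s_t,a_t)}\big(G_t - q(s_t,a_t)\big),
\]
and then improve the policy: \(\epsilon \leftarrow 1/k\), \(\pi \leftarrow \epsilon\text{-greedy}(q)\).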
Problems with the Monte-Carlo method: updates need complete returns \(G_t\), so we must wait until the end of an episode before learning anything.
Solution: bootstrap from the current value estimates and update after every step, i.e. Temporal-Difference learning.
Goal: learn \(Q_{\pi}\) online from experience
Incremental Monte-Carlo:
Temporal-Difference learning:
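A sketch of the two updates being contrasted, in the notation above.
Incremental Monte-Carlo:
\[
q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\big(G_t - q(s_t,a_t)\big)
\]
Temporal-Difference learning (TD(0)):
\[
q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha\big(r_{t+1} + \gamma\, q(s_{t+1},a_{t+1}) - q(s_t,a_t)\big)
\]
Instead of waiting for the full return \(G_t\), TD bootstraps from the current estimate at the next state.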
\(r_{t+1} + \gamma q(s_{t+1}, a_{t+1})\) is called the TD target
\(\delta_t = r_{t+1} + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t)\) is called the TD error
Temporal Difference Learning:
This update is called SARSA: State, Action, Reward, next State, next Action
Policy Iteration with Temporal Difference Learning:
For every step:
We approximate the Bellman Expectation Equation with the SARSA update:
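A minimal tabular sketch of this loop (all names here are hypothetical and not from the slides; it assumes a Gymnasium-style `env` with discrete states and actions, and a NumPy table `q` of shape `[n_states, n_actions]`):

```python
import numpy as np

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # hypothetical hyperparameters


def epsilon_greedy(q, s, n_actions, eps):
    """Sample an action from the eps-greedy policy w.r.t. the current q-table."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(q[s]))


def sarsa_episode(env, q):
    """Run one episode, applying the SARSA update after every step."""
    s, _ = env.reset()
    a = epsilon_greedy(q, s, env.action_space.n, EPSILON)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = epsilon_greedy(q, s_next, env.action_space.n, EPSILON)
        # On-policy TD target: bootstrap from the action actually taken next.
        target = r if terminated else r + GAMMA * q[s_next, a_next]
        q[s, a] += ALPHA * (target - q[s, a])  # policy evaluation (one SARSA step)
        s, a = s_next, a_next  # policy improvement is implicit: eps-greedy w.r.t. q
    return q
```

Evaluation and improvement are interleaved at every step, which is exactly the Policy Iteration with Temporal Difference Learning loop described above.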
Can we utilize Bellman Optimality Equation for TD-Learning?
Yes, of course:
From Bellman Expectation Equation (SARSA):
From Bellman Optimality Equation (Q-Learning):
In SARSA, \(a'\) comes from the policy \(\pi\) that generated this experience!
In Q-Learning, the \(\max_{a'}\) has no connection to the policy \(\pi\) that generated the experience.
Q-Learning Update:
SARSA Update:
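Side by side, a sketch in the notation above (\(s'\), \(r\) are the next state and reward after taking \(a\) in \(s\)):
Q-Learning Update (Bellman Optimality):
\[
q(s,a) \leftarrow q(s,a) + \alpha\big(r + \gamma \max_{a'} q(s',a') - q(s,a)\big)
\]
SARSA Update (Bellman Expectation):
\[
q(s,a) \leftarrow q(s,a) + \alpha\big(r + \gamma\, q(s',a') - q(s,a)\big), \qquad a' \text{ taken by the behaviour policy}
\]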
SARSA and Monte-Carlo are on-policy algorithms: they evaluate and improve the same policy that is used to select actions.
Q-Learning is an off-policy algorithm: it learns about the greedy policy while the experience can be generated by a different (e.g. \(\epsilon\)-greedy) behaviour policy.
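To make the off-policy point concrete, a minimal tabular Q-Learning step (hypothetical names again; `q` is a NumPy table indexed as `q[state, action]`):

```python
def q_learning_step(q, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update.

    The behaviour policy that produced `a` (e.g. eps-greedy, or even a replayed
    transition from an old policy) never appears here: the target bootstraps
    from the greedy action max_a' q(s', a'), which is what makes it off-policy.
    """
    target = r if terminated else r + gamma * q[s_next].max()
    q[s, a] += alpha * (target - q[s, a])
    return q
```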
Monte Carlo vs. Temporal Difference
Consider the following n-step returns for n = 1, 2, ...:
\[
G^{(1)}_t = r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}) \qquad \text{(TD: SARSA)}
\]
\[
G^{(2)}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2\, q(s_{t+2}, a_{t+2})
\]
\[
\vdots
\]
\[
G^{(\infty)}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_{T} \qquad \text{(MC)}
\]
n-step Temporal Difference Learning:
\[
q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \big( G^{(n)}_t - q(s_t, a_t) \big)
\]
We can average n-step returns over different n,
e.g. average the 2-step and 4-step returns:
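For example, with equal weights (a sketch):
\[
\tfrac{1}{2}\, G^{(2)}_t + \tfrac{1}{2}\, G^{(4)}_t
\]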
But why?
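One standard motivation: a single \(n\) is an awkward hyperparameter, and averaging combines backups of different depths into one target. TD(\(\lambda\)) takes this idea to its limit and averages all n-step returns with geometric weights, giving the \(\lambda\)-return (a sketch of the usual definition):
\[
G^{\lambda}_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G^{(n)}_t
\]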
What happens when \(\lambda = 0\)? The target reduces to the one-step TD target, i.e. just TD-learning (SARSA).
What happens when \(\lambda = 1\)? The target reduces to the full return, i.e. Monte-Carlo learning.
We can rewrite \(G^{\lambda}_t\) recursively. How?
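One way, a sketch of the standard recursion (it follows by splitting \(r_{t+1}\) off every n-step return in the sum):
\[
G^{\lambda}_t = r_{t+1} + \gamma\Big((1-\lambda)\, q(s_{t+1}, a_{t+1}) + \lambda\, G^{\lambda}_{t+1}\Big)
\]
Setting \(\lambda = 0\) recovers the one-step TD (SARSA) target, while \(\lambda = 1\) telescopes into the full Monte-Carlo return.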