Artyom Sorokin | 26 Feb
Recall Q-learning/DQN target:
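For reference, written out with the target network \(Q_{\theta'}\):
\(y_t = r_{t+1} + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a')\)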
The only values that matter if \(Q_{\theta'}\) is bad!
These are important if \(Q_{\theta'}\) is good!
useless estimate at the beginning
noisy but true
Recall N-step Returns:
\(G_t^{(1)} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\)
\(G_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2})\)
\(\dots\)
\(G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})\)
\(\dots\)
\(G_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots\) (Monte-Carlo return)
SARSA/TD-target: High Bias, Low Variance
MC-target: High Variance, No Bias
We can construct similar multi-step targets for Q-learning:
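In standard form:
\(y_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n \max_{a'} Q_{\theta'}(s_{t+n}, a')\)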
This multi-step Q-learning target is only correct for on-policy data:
"Multi-step Q-learning target":
How to fix:
For more details: Safe and Efficient Off-Policy Reinforcement Learning (Munos et al., 2016)
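A minimal sketch of computing the uncorrected n-step target above from a stored trajectory (function and argument names are illustrative, not from the lecture):

def n_step_q_target(rewards, next_state, done, q_target, gamma=0.99):
    # rewards:    the n observed rewards r_{t+1}, ..., r_{t+n} (shorter if the episode ended)
    # next_state: the state s_{t+n} reached after those steps
    # done:       True if the episode terminated within the n steps
    # q_target:   function returning the target network's Q-values for a state
    target = sum(gamma ** k * r for k, r in enumerate(rewards))       # "noisy but true" part
    if not done:
        target += gamma ** len(rewards) * max(q_target(next_state))  # bootstrap part
    return target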
Not all transitions are equally valuable!
Goal: Reach the green target!
Reward: +1 when reaching the target, 0 otherwise
All model estimates are close to zero: \(Q(s,a) \approx 0\).
Let's prioritize transitions by their TD-Error!
Double DQN TD-Error:
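Written out (the online network \(\theta\) selects the action, the target network \(\theta'\) evaluates it):
\(\delta_i = r + \gamma\, Q_{\theta'}\!\left(s', \arg\max_{a'} Q_{\theta}(s', a')\right) - Q_{\theta}(s, a)\)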
proportional prioritization:
Sampling probability:
rank-based prioritization:
small constant \(\epsilon\) added so no transition has zero probability
should be more stable
\(\alpha\) controls how much prioritization is used (\(\alpha = 0\) gives uniform sampling)
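For reference, the two priority variants and the resulting sampling probability from the Prioritized Experience Replay paper (Schaul et al., 2016):
proportional: \(p_i = |\delta_i| + \epsilon\)
rank-based: \(p_i = 1 / \mathrm{rank}(i)\), where transitions are ranked by \(|\delta_i|\)
sampling probability: \(P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}\)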
Problem: sampling by priority is biased, transitions with high TD-Error are replayed much more often than those with low TD-Error:
High TD-Error:
\(r + \gamma Q(S',A') - Q(S,A)\)
Low TD-Error:
\(r + \gamma Q(S'',A'') - Q(S,A)\)
Answer: Fix biased sampling with Importance Sampling
Importance sampling weights:
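As defined in the PER paper:
\(w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}\)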
Here \(N\) is the replay buffer size and \(P(i)\) is the sampling probability of transition \(i\).
Normalize \(w_i\) by \(1/\max_j w_j\) so the weights can only scale updates down and never increase the learning step size.
Use IS weights as coefficient to TD-Error in gradient updates:
\(\beta\) controls IS correction. Slowly increase \(\beta\) to 1.
already normalized value
Q: How to efficiently sample elements from Replay Buffer proportional to their priority?
A: Use SumTree
non-leaf nodes contain the sum of their children's priorities
SumTree example:
Pseudocode for Sampling:
s = uniform(0, total_priority)   # random value in [0, sum of all priorities]
node = root
while not is_leaf(node):
    if s <= p(node.left_child):
        node = node.left_child
    else:
        s = s - p(node.left_child)
        node = node.right_child
# node is now a leaf, sampled with probability proportional to its priority
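A compact array-based SumTree sketch in Python (illustrative, not the lecture's exact code; `capacity` leaves stored at the end of a heap-style array):

import random

class SumTree:
    # Binary tree stored in a flat array: leaves hold priorities,
    # every inner node holds the sum of its two children.
    def __init__(self, capacity):
        self.capacity = capacity                       # number of leaves
        self.tree = [0.0] * (2 * capacity - 1)         # inner nodes first, leaves last

    def total(self):
        return self.tree[0]                            # root = sum of all priorities

    def update(self, leaf, priority):
        # set the priority of leaf `leaf` and propagate the change up to the root
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self):
        # draw a leaf with probability proportional to its priority
        s = random.uniform(0.0, self.total())
        idx = 0
        while idx < self.capacity - 1:                 # while idx is an inner node
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1                         # go to the right child
        return idx - (self.capacity - 1), self.tree[idx]   # (leaf index, its priority)

Both update and sample run in \(O(\log N)\), which is what makes prioritized replay practical for large buffers.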
Full Algorithm:
Observation:
Idea:
Remember Value Functions:
We define Advantage Function:
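In standard notation, with \(V^{\pi}(s)\) the expected return from state \(s\) and \(Q^{\pi}(s,a)\) the expected return after taking action \(a\) in \(s\):
\(A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\)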
Network architectures: DQN vs Dueling-DQN
Dueling-DQN:
But there is a problem with learning from just \(Q\)-values:
Even if the sum \(V(s)+A(s,a)\) is a good estimate of \(Q(s,a)\),
\(V(s)\) and \(A(s,a)\) can both be incorrect estimates!
Notice the following properties of Advantage functions:
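The properties being referred to (standard facts about advantages):
\(\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A^{\pi}(s, a)\right] = 0\)
For a greedy policy with \(a^{*} = \arg\max_{a'} Q(s, a')\): \(A(s, a^{*}) = 0\)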
We can force these properties on our Advantage values!
Q-value aggregation for greedy policy:
Actual Q-value aggregation in the experiments:
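Written out, as in the Dueling DQN paper: with the max, \(Q(s,a) = V(s) + \left(A(s,a) - \max_{a'} A(s,a')\right)\); with the mean (used in the experiments), \(Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)\).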
These are just network outputs
Results in brackets are actual advantages!
Dueling versus Double
Over all 57 games
Greedy action:
Exploration action:
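For reference, the \(\epsilon\)-greedy rule these two lines describe: with probability \(1-\epsilon\) take the greedy action \(a_t = \arg\max_a Q_{\theta}(s_t, a)\), and with probability \(\epsilon\) take a uniformly random action.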
Idea: Add noise to network parameters for better exploration!
Policy Noise: \(\epsilon\)-greedy
Parameter Noise
Implementation details:
Graphical Representation of Noisy Linear Layer
Noise and Parameters:
Policy:
Two types of Noise:
Factorised Gaussian noise: \(\epsilon_{i,j} = f(\epsilon_i)\, f(\epsilon_j)\), where \(f(x) = \mathrm{sgn}(x)\sqrt{|x|}\). Cheaper to sample, works faster with DQN.
Independent Gaussian noise: every \(\epsilon_{i,j}\) is drawn from \(\mathcal{N}(0,1)\). Used only with Actor-Critic methods (A3C).
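A minimal PyTorch-style sketch of a noisy linear layer with factorised Gaussian noise (illustrative; class and method names are assumptions, initialisation constants follow the NoisyNet paper's factorised scheme):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    # Linear layer whose weights are mu + sigma * eps, with factorised Gaussian noise.
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # learnable parameters: means and noise scales
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # noise buffers (resampled, not trained)
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.reset_parameters(sigma0)
        self.reset_noise()

    def reset_parameters(self, sigma0):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(self.in_features))

    @staticmethod
    def _f(x):
        # factorised-noise transform f(x) = sgn(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # factorised noise: one noise vector per input, one per output
        eps_in = self._f(torch.randn(self.in_features))
        eps_out = self._f(torch.randn(self.out_features))
        self.weight_eps.copy_(torch.outer(eps_out, eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        weight = self.weight_mu + self.weight_sigma * self.weight_eps
        bias = self.bias_mu + self.bias_sigma * self.bias_eps
        return F.linear(x, weight, bias)

Typically the final fully-connected layers of the DQN (or of both dueling streams) are replaced with such layers, \(\epsilon\)-greedy exploration is switched off, and reset_noise() is called to resample the noise, e.g. for every new batch or action step.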