Artyom Sorokin | 26 Feb
Recall Q-learning/DQN target:
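For reference, written out with the target network \(Q_{\theta'}\):
\(y_t = r_{t+1} + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a')\)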
The only values that matter if \(Q_{\theta'}\) is bad!
These are important if \(Q_{\theta'}\) is good!
useless estimate at the beginning
noisy but true
Recall N-step Returns:
\(G_t^{(1)} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\)
\(G_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2})\)
\(\dots\)
\(G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})\)
\(\dots\)
\(G_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots\) (Monte-Carlo return)
SARSA/TD-target: High Bias, Low Variance
MC-target: High Variance, No Bias
We can construct similar multi-step targets for Q-learning:
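In standard form:
\(y_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n \max_{a'} Q_{\theta'}(s_{t+n}, a')\)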
This multi-step Q-learning target is only correct for on-policy data:
"Multi-step Q-learning target":
How to fix:
For more details: Safe and Efficient Off-Policy Reinforcement Learning (Munos et al., 2016)
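A minimal sketch of computing the uncorrected n-step target above from a stored trajectory (function and argument names are illustrative, not from the lecture):

def n_step_q_target(rewards, next_state, done, q_target, gamma=0.99):
    # rewards:    the n observed rewards r_{t+1}, ..., r_{t+n} (shorter if the episode ended)
    # next_state: the state s_{t+n} reached after those steps
    # done:       True if the episode terminated within the n steps
    # q_target:   function returning the target network's Q-values for a state
    target = sum(gamma ** k * r for k, r in enumerate(rewards))       # "noisy but true" part
    if not done:
        target += gamma ** len(rewards) * max(q_target(next_state))  # bootstrap part
    return target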
Not all transitions are equally valuable!
Goal: Reach the green target!
Reward: +1 when reaching the target, 0 otherwise
All model estimates are close to zero: \(Q(s,a) \approx 0\).
Let's prioritize transitions by their TD-Error!
Double DQN TD-Error:
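Written out (the online network \(\theta\) selects the action, the target network \(\theta'\) evaluates it):
\(\delta_i = r + \gamma\, Q_{\theta'}\!\left(s', \arg\max_{a'} Q_{\theta}(s', a')\right) - Q_{\theta}(s, a)\)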
proportional prioritization:
Sampling probability:
rank-based prioritization:
small constant \(\epsilon\) added so no transition has zero probability
should be more stable
\(\alpha\) controls how much prioritization is used (\(\alpha = 0\) gives uniform sampling)
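For reference, the two priority variants and the resulting sampling probability from the Prioritized Experience Replay paper (Schaul et al., 2016):
proportional: \(p_i = |\delta_i| + \epsilon\)
rank-based: \(p_i = 1 / \mathrm{rank}(i)\), where transitions are ranked by \(|\delta_i|\)
sampling probability: \(P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}\)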
Problem: sampling by priority is biased, transitions with high TD-Error are replayed much more often than those with low TD-Error:
High TD-Error:
\(r + \gamma Q(S',A') - Q(S,A)\)
Low TD-Error:
\(r + \gamma Q(S'',A'') - Q(S,A)\)
Answer: Fix biased sampling with Importance Sampling
Importance sampling weights:
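As defined in the PER paper:
\(w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}\)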
Here \(N\) is the replay buffer size and \(P(i)\) is the sampling probability of transition \(i\).
Normalize \(w_i\) by \(1/\max_j w_j\) so the weights can only scale updates down and never increase the learning step size.
Use IS weights as coefficient to TD-Error in gradient updates:
\(\beta\) controls IS correction. Slowly increase \(\beta\) to 1.
already normalized value
Q: How to efficiently sample elements from Replay Buffer proportional to their priority?
A: Use SumTree
non-leaf nodes contain the sum of their children's priorities
SumTree example:
Pseudocode for Sampling:
s = uniform(0, total_priority)   # random value in [0, sum of all priorities]
node = root
while not is_leaf(node):
    if s <= p(node.left_child):
        node = node.left_child
    else:
        s = s - p(node.left_child)
        node = node.right_child
# node is now a leaf, sampled with probability proportional to its priority
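A compact array-based SumTree sketch in Python (illustrative, not the lecture's exact code; `capacity` leaves stored at the end of a heap-style array):

import random

class SumTree:
    # Binary tree stored in a flat array: leaves hold priorities,
    # every inner node holds the sum of its two children.
    def __init__(self, capacity):
        self.capacity = capacity                       # number of leaves
        self.tree = [0.0] * (2 * capacity - 1)         # inner nodes first, leaves last

    def total(self):
        return self.tree[0]                            # root = sum of all priorities

    def update(self, leaf, priority):
        # set the priority of leaf `leaf` and propagate the change up to the root
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self):
        # draw a leaf with probability proportional to its priority
        s = random.uniform(0.0, self.total())
        idx = 0
        while idx < self.capacity - 1:                 # while idx is an inner node
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1                         # go to the right child
        return idx - (self.capacity - 1), self.tree[idx]   # (leaf index, its priority)

Both update and sample run in \(O(\log N)\), which is what makes prioritized replay practical for large buffers.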
Full Algorithm:
Observation:
Idea:
Remember Value Functions:
We define Advantage Function:
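In standard notation, with \(V^{\pi}(s)\) the expected return from state \(s\) and \(Q^{\pi}(s,a)\) the expected return after taking action \(a\) in \(s\):
\(A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\)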
Network architectures: DQN vs Dueling-DQN
Dueling-DQN:
But there is a problem with learning from just \(Q\)-values:
Even if the sum \(V(s)+A(s,a)\) is a good estimate of \(Q(s,a)\),
\(V(s)\) and \(A(s,a)\) can both be incorrect estimates!
Notice the following properties of Advantage functions:
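The properties being referred to (standard facts about advantages):
\(\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A^{\pi}(s, a)\right] = 0\)
For a greedy policy with \(a^{*} = \arg\max_{a'} Q(s, a')\): \(A(s, a^{*}) = 0\)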
We can force these properties on our Advantage values!
Q-value aggregation for greedy policy:
Actual Q-value aggregation in the experiments:
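Written out, as in the Dueling DQN paper: with the max, \(Q(s,a) = V(s) + \left(A(s,a) - \max_{a'} A(s,a')\right)\); with the mean (used in the experiments), \(Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)\).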
These are just network outputs
Results in brackets are actual advantages!
Dueling versus Double
Over all 57 games
Greedy action:
Exploration action:
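For reference, the \(\epsilon\)-greedy rule these two lines describe: with probability \(1-\epsilon\) take the greedy action \(a_t = \arg\max_a Q_{\theta}(s_t, a)\), and with probability \(\epsilon\) take a uniformly random action.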
Idea: Add noise to network parameters for better exploration!
Policy Noise: \(\epsilon\)-greedy
Parameter Noise
Implementation details:
Graphical Representation of Noisy Linear Layer
Noise and Parameters:
Policy:
Two types of Noise:
Factorised Gaussian noise: \(\epsilon_{i,j} = f(\epsilon_i)\, f(\epsilon_j)\), where \(f(x) = \mathrm{sgn}(x)\sqrt{|x|}\). Cheaper to sample, works faster with DQN.
Independent Gaussian noise: every \(\epsilon_{i,j}\) is drawn from \(\mathcal{N}(0,1)\). Used only with Actor-Critic methods (A3C).
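A minimal PyTorch-style sketch of a noisy linear layer with factorised Gaussian noise (illustrative; class and method names are assumptions, initialisation constants follow the NoisyNet paper's factorised scheme):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    # Linear layer whose weights are mu + sigma * eps, with factorised Gaussian noise.
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # learnable parameters: means and noise scales
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # noise buffers (resampled, not trained)
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.reset_parameters(sigma0)
        self.reset_noise()

    def reset_parameters(self, sigma0):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(self.in_features))

    @staticmethod
    def _f(x):
        # factorised-noise transform f(x) = sgn(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # factorised noise: one noise vector per input, one per output
        eps_in = self._f(torch.randn(self.in_features))
        eps_out = self._f(torch.randn(self.out_features))
        self.weight_eps.copy_(torch.outer(eps_out, eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        weight = self.weight_mu + self.weight_sigma * self.weight_eps
        bias = self.bias_mu + self.bias_sigma * self.bias_eps
        return F.linear(x, weight, bias)

Typically the final fully-connected layers of the DQN (or of both dueling streams) are replaced with such layers, \(\epsilon\)-greedy exploration is switched off, and reset_noise() is called to resample the noise, e.g. for every new batch or action step.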