Distributional RL
Pavel Temirchev
Distributional, not distributed.
Reminder: Q-function
By definition:
\( Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\; a_0 = a,\; a_{t>0} \sim \pi \right] \)
Motivation
We expect DQN to learn two features:
- is it a rocket? (then -1)
- is it the finish? (then +1)
But it can learn arbitrary stuff instead, since we are learning only the average value (which is zero here)!
Rewards are not equal
Intuitively, which one is better?
Utility functions
Von Neumann-Morgenstern theorem:
An agent that is rational (in the axiomatic sense of the theorem) maximises the mathematical expectation of a utility function.
In our case, the utility function is the return.
Where does the randomness under the expectation come from?
Maximization of the Q-function is enough to obtain an optimal policy.
And the Q-function is defined in terms of expected rewards.
Expectations are enough: we don't need to know the distribution of rewards to succeed.
Let's try to make learning easier
The deep Q-network should return a distribution over returns, not a single number!
DQN
was: \( Q_\theta(s, a) \in \mathbb{R} \), a single number per action.
What we want: \( Z_\theta(s, a) \), a whole distribution over returns per action.
This will work as-is in a simple discrete setting.
We will use a projection step to extend this idea to a broad class of RL problems.
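Below is a minimal numpy sketch of this reparametrization, assuming a C51-style categorical output; all names and sizes (N_ACTIONS, N_ATOMS, the support range) are illustrative and not from the slides.

```python
import numpy as np

N_ACTIONS, N_ATOMS = 4, 51
z = np.linspace(-10.0, 10.0, N_ATOMS)        # fixed support z_0, ..., z_{N-1}

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Imagine `logits` is the last layer's output for a single state:
# one row of atom logits per action instead of one scalar per action.
logits = np.random.randn(N_ACTIONS, N_ATOMS)
probs = softmax(logits, axis=-1)             # p_i(s, a): a categorical distribution per action

q_values = (probs * z).sum(axis=-1)          # taking expectations recovers ordinary Q(s, a)
greedy_action = int(np.argmax(q_values))
```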
Reminder
Bellman equation:
\( Q^\pi(s, a) = \mathbb{E} \left[ r(s, a) + \gamma Q^\pi(s', a') \right], \quad s' \sim p(\cdot \mid s, a), \; a' \sim \pi(\cdot \mid s') \)
Bellman optimality equation:
\( Q^*(s, a) = \mathbb{E} \left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right], \quad s' \sim p(\cdot \mid s, a) \)
In the form of operators:
\( (\mathcal{T}^\pi Q)(s, a) = \mathbb{E}\left[ r(s, a) + \gamma Q(s', a') \right], \qquad (\mathcal{T} Q)(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \max_{a'} Q(s', a') \right] \)
Both operators are \( \gamma \)-contractions in the supremum norm: \( \| \mathcal{T} Q_1 - \mathcal{T} Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty \).
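A quick numpy sanity check of the contraction claim (the random MDP, its sizes, and the seed below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9
P = rng.random((n_s, n_a, n_s)); P /= P.sum(-1, keepdims=True)   # transition kernel p(s'|s,a)
r = rng.random((n_s, n_a))                                       # expected rewards r(s,a)

def bellman_opt(Q):
    # (T Q)(s, a) = r(s, a) + gamma * E_{s'}[ max_{a'} Q(s', a') ]
    return r + gamma * P @ Q.max(axis=1)

Q1, Q2 = rng.random((n_s, n_a)), rng.random((n_s, n_a))
lhs = np.abs(bellman_opt(Q1) - bellman_opt(Q2)).max()   # ||T Q1 - T Q2||_inf
rhs = gamma * np.abs(Q1 - Q2).max()                     # gamma * ||Q1 - Q2||_inf
assert lhs <= rhs + 1e-12
```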
Distributional perspective on RL
Distributional Bellman operator:
\( (\mathcal{T}^\pi Z)(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(s', a') \)
where \( s' \sim p(\cdot \mid s, a) \), \( a' \sim \pi(\cdot \mid s') \), and \( \stackrel{D}{=} \) denotes equality in distribution.
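In sample form the operator just shifts and scales whole return distributions rather than single numbers; a tiny illustrative numpy snippet (the Gaussian stand-in for \( Z(s', a') \) is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r = 0.99, 1.0
z_next = rng.normal(5.0, 2.0, size=10_000)   # samples from Z(s', a'), a' ~ pi(.|s')
z_target = r + gamma * z_next                # samples from (T^pi Z)(s, a): the whole distribution moves
```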
Is the distributional operator a contraction?
Wasserstein metric:
\( W_p(U, V) = \left( \int_0^1 \left| F_U^{-1}(\omega) - F_V^{-1}(\omega) \right|^p d\omega \right)^{1/p} \)
or, equivalently,
\( W_p(U, V) = \inf_{\text{couplings of } (U, V)} \left( \mathbb{E} |U - V|^p \right)^{1/p} \)
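For two empirical distributions with equally many samples, the quantile form of \( W_p \) reduces to comparing sorted samples; a small numpy sketch (the Gaussians are illustrative):

```python
import numpy as np

def wasserstein_p(u_samples, v_samples, p=1.0):
    # Quantile form of W_p for equal-size empirical distributions:
    # pair off sorted samples and take the p-mean of their gaps.
    u, v = np.sort(u_samples), np.sort(v_samples)
    assert u.shape == v.shape
    return (np.abs(u - v) ** p).mean() ** (1.0 / p)

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=10_000)
v = rng.normal(1.0, 1.0, size=10_000)
print(wasserstein_p(u, v, p=1.0))   # close to 1.0: shifting a distribution by c moves W_1 by |c|
```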
Is the distributional operator a contraction?
Wasserstein metric's properties (for a scalar \( a \) and a random variable \( A \) independent of \( U \) and \( V \)):
- \( W_p(aU, aV) = |a| \, W_p(U, V) \)
- \( W_p(A + U, A + V) \le W_p(U, V) \)
What about the optimality distributional Bellman operator?
Lemma
The optimality distributional Bellman operator
\( (\mathcal{T} Z)(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(s', a^*), \quad a^* = \arg\max_{a'} \mathbb{E}\left[ Z(s', a') \right] \)
is not a contraction!
OK! We still want to use it instead of DQN anyway.
What we had in DQN (with target network \( \theta' \)):
\( \mathcal{L}(\theta) = \left( r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a) \right)^2 \)
What we need now: a distance between the predicted return distribution and the target return distribution.
Distributional Bellman update rule
We have a sample from the buffer: \( (s, a, r, s') \).
Pick the greedy action by the MEAN of the target distribution: \( a^* = \arg\max_{a'} \mathbb{E}\left[ Z_{\theta'}(s', a') \right] \).
Now, minimize the distance from a guess to the guess:
\( \mathrm{dist}\left( Z_\theta(s, a), \; r + \gamma Z_{\theta'}(s', a^*) \right) \)
What distance? Wasserstein?
No: sample-based estimates of the Wasserstein distance yield biased gradients, so it is a bad idea to minimize it from samples.
Kullback-Leibler divergence? Kind of, but it goes to infinity if the supports are disjoint.
Project the updated distribution onto the fixed support \( \{ z_0, \dots, z_{N-1} \} \).
For j := 0 to N-1:
clip the shifted atom \( \hat{\mathcal{T}} z_j = \mathrm{clip}(r + \gamma z_j, \, z_{MIN}, z_{MAX}) \) and split its probability between the two nearest atoms of the support.
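A numpy sketch of this projection under the standard C51 conventions (function and argument names are mine; the fixed support z spans \( [z_{MIN}, z_{MAX}] \)):

```python
import numpy as np

def project_categorical(target_probs, rewards, dones, z, gamma):
    """Project the target distribution r + gamma * z onto the fixed support z.
    target_probs: (batch, N) probabilities of Z_target(s', a*); z: (N,) atoms."""
    batch, n_atoms = target_probs.shape
    v_min, v_max = z[0], z[-1]
    dz = (v_max - v_min) / (n_atoms - 1)

    # Shifted and clipped atoms: Tz_j = clip(r + gamma * z_j, z_MIN, z_MAX)
    tz = np.clip(rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :], v_min, v_max)
    b = (tz - v_min) / dz                    # fractional index of each shifted atom
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)

    m = np.zeros((batch, n_atoms))
    for i in range(batch):
        for j in range(n_atoms):             # split p_j between the two nearest atoms
            if lower[i, j] == upper[i, j]:
                m[i, lower[i, j]] += target_probs[i, j]
            else:
                m[i, lower[i, j]] += target_probs[i, j] * (upper[i, j] - b[i, j])
                m[i, upper[i, j]] += target_probs[i, j] * (b[i, j] - lower[i, j])
    return m                                 # projected target; train Z_theta(s, a) with cross-entropy
```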
C51 results (plots omitted)
Cons of C51
- Hyperparameters \( z_{MIN}, z_{MAX} \) are required in advance
- We are not actually minimizing the Wasserstein distance
- A strange projection step is required
Let's transpose the parametrization!
Let's put uniform probabilities \( 1/N \) on \( N \) return atoms, while the atom locations \( (z_\theta)_j \) come from a parametric model.
The minimizer of \( d_1(Z, Z_\theta) \) is such \( \theta \) that \( (z_\theta)_j = F^{-1}_Z(\hat{\tau}_j) \;\; \forall j \), where \( \hat{\tau}_j = \frac{2j + 1}{2N} \) are the quantile midpoints.
What if you don't know \( F^{-1}_Z \) and you have only samples from \( Z \)?
The \( \tau \)-quantile is the minimizer of the quantile regression (pinball) loss:
\( \mathcal{L}_\tau(\theta) = \mathbb{E}_{z \sim Z}\left[ \rho_\tau(z - \theta) \right], \qquad \rho_\tau(u) = u \left( \tau - \mathbb{1}\{ u < 0 \} \right) \)
What if \( \theta \) comes from a neural network? Minimize the same loss by stochastic gradient descent on its parameters.
What if we want gradients that are smooth at zero? Use the Huber-smoothed quantile loss \( \rho^\kappa_\tau \).
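A small numpy check of that claim: minimizing the pinball loss over samples recovers the \( \tau \)-quantile (the grid search below is only to keep the illustration short):

```python
import numpy as np

def pinball_loss(theta, samples, tau):
    # rho_tau(u) = u * (tau - 1{u < 0}),  u = z - theta
    u = samples - theta
    return np.mean((tau - (u < 0.0).astype(float)) * u)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)
tau = 0.75

grid = np.linspace(-3.0, 3.0, 2001)
theta_star = grid[np.argmin([pinball_loss(t, samples, tau) for t in grid])]
print(theta_star, np.quantile(samples, tau))   # both close to the 0.75-quantile of N(0,1), about 0.674
```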
Quantile Regression DQN
Input: \( s, a, r, s' \)
\( a^* = \arg\max_{a'} \frac{1}{N} \sum_j (z_{\theta'})_j(s', a') \)
Target atoms: \( \mathcal{T} z_j = r + \gamma \, (z_{\theta'})_j(s', a^*) \)
Output: the loss \( \sum_i \frac{1}{N} \sum_j \rho^\kappa_{\hat{\tau}_i} \left( \mathcal{T} z_j - (z_\theta)_i(s, a) \right) \)
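A numpy sketch of this update for a single transition, using the plain (non-Huber) quantile loss; shapes, names, and numbers are illustrative:

```python
import numpy as np

def qr_dqn_loss(theta_sa, theta_next_target, reward, gamma):
    """theta_sa: (N,) quantile atoms of Z_theta(s, a);
    theta_next_target: (n_actions, N) target-network atoms for Z(s', .)."""
    n = theta_sa.shape[0]
    tau_hat = (2 * np.arange(n) + 1) / (2.0 * n)            # quantile midpoints tau_hat_i

    a_star = np.argmax(theta_next_target.mean(axis=1))      # greedy w.r.t. the MEAN of Z_target
    t_theta = reward + gamma * theta_next_target[a_star]    # target atoms T z_j

    u = t_theta[None, :] - theta_sa[:, None]                 # all (i, j) pairwise differences
    rho = (tau_hat[:, None] - (u < 0.0).astype(float)) * u   # pinball loss per pair
    return rho.mean(axis=1).sum()                            # sum_i E_j[ rho_{tau_hat_i}(T z_j - theta_i) ]

rng = np.random.default_rng(0)
loss = qr_dqn_loss(rng.normal(size=51), rng.normal(size=(4, 51)), reward=1.0, gamma=0.99)
```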
QR-DQN results (plots omitted)
Is the distributional operator a contraction?
Proposition:
The Wasserstein metric in its maximal form,
\( \bar{d}_p(Z_1, Z_2) = \sup_{s, a} W_p\left( Z_1(s, a), Z_2(s, a) \right), \)
is a distance between value distributions.
Lemma:
The distributional Bellman operator \( \mathcal{T}^\pi \) is a contraction in the maximal Wasserstein distance:
\( \bar{d}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma \, \bar{d}_p(Z_1, Z_2) \)
Is the distributional operator a contraction?
Proof:
\( W_p\left( (\mathcal{T}^\pi Z_1)(s, a), \, (\mathcal{T}^\pi Z_2)(s, a) \right) = W_p\left( R(s, a) + \gamma Z_1(s', a'), \; R(s, a) + \gamma Z_2(s', a') \right) \)
The reward term is shared by both arguments, so shift-invariance and the scaling property give
\( \le \gamma \, W_p\left( Z_1(s', a'), \, Z_2(s', a') \right) \)
Here \( (s', a') \) is random, so the right-hand side compares two mixtures over next state-action pairs.
The last stage of the proof: by Jensen's inequality, \( W_p^p \) of the mixtures is at most the expectation of \( W_p^p \) over \( (s', a') \); bound the expectation by the supremum; then, due to the monotonicity of \( x \mapsto x^{1/p} \), voila:
\( W_p\left( (\mathcal{T}^\pi Z_1)(s, a), \, (\mathcal{T}^\pi Z_2)(s, a) \right) \le \gamma \sup_{s', a'} W_p\left( Z_1(s', a'), \, Z_2(s', a') \right) = \gamma \, \bar{d}_p(Z_1, Z_2) \)
Taking the supremum over \( (s, a) \) on the left completes the proof.
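A toy numerical illustration of the lemma: represent each \( Z(s) \) by samples, apply the evaluation operator once, and compare maximal \( W_1 \) distances before and after (everything below, including the tiny MDP, is made up; the check is approximate because of Monte Carlo noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, gamma, n_samp = 3, 0.9, 20_000

# Toy policy-evaluation setting: the policy is folded into P(s'|s) and r(s).
P = rng.random((n_s, n_s)); P /= P.sum(-1, keepdims=True)
r = rng.random(n_s)

def w1(u, v):
    # W_1 between two equal-size empirical distributions
    return np.abs(np.sort(u) - np.sort(v)).mean()

def apply_T(Z):
    # Sample-based (T^pi Z)(s) = r(s) + gamma * Z(s'), s' ~ P(.|s)
    out = {}
    for s in range(n_s):
        s_next = rng.choice(n_s, size=n_samp, p=P[s])
        z_next = np.array([Z[sp][rng.integers(n_samp)] for sp in s_next])
        out[s] = r[s] + gamma * z_next
    return out

Z1 = {s: rng.normal(s, 1.0, n_samp) for s in range(n_s)}
Z2 = {s: rng.normal(s + 3.0, 2.0, n_samp) for s in range(n_s)}

d_before = max(w1(Z1[s], Z2[s]) for s in range(n_s))
TZ1, TZ2 = apply_T(Z1), apply_T(Z2)
d_after = max(w1(TZ1[s], TZ2[s]) for s in range(n_s))
print(d_after, "<~", gamma * d_before)   # contraction by gamma, up to sampling noise
```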
Thank you