Pavel Temirchev
We expect DQN to learn two features:
But it can learn arbitrary stuff, since we are learning only the average value (which is zero here)!
Intuitively, which one is better?
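For instance, a toy case (numbers chosen for illustration): suppose the return from \((s, a)\) is \(+1\) or \(-1\) with probability \(1/2\) each. Then
\[ Q(s, a) = \mathbb{E}[Z(s, a)] = \tfrac12 (+1) + \tfrac12 (-1) = 0, \]
so DQN cannot distinguish this risky state-action pair from one that deterministically yields \(0\).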
An agent that is rational, in some sense, maximises the mathematical expectation of a utility function.
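For example, with a strictly concave utility \(U\) (a risk-averse agent), Jensen's inequality gives, for the toy \(Z\) above,
\[ \mathbb{E}[U(Z)] = \tfrac12 U(+1) + \tfrac12 U(-1) < U(0) = U(\mathbb{E}[Z]), \]
so the agent prefers the deterministic \(0\) even though both options have the same mean. Such objectives need the whole distribution of \(Z\), not just its expectation.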
Where does the randomness under the expectation come from? From stochastic rewards, stochastic transitions, and (possibly) a stochastic policy.
DQN was learning only the mean of what we want: the full distribution of returns.
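In the standard distributional-RL notation:
\[ Q(s, a) = \mathbb{E}\big[Z(s, a)\big], \qquad Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(S', A'), \]
where the second equality holds in distribution, with \(S' \sim p(\cdot \mid s, a)\) and \(A' \sim \pi(\cdot \mid S')\).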
This will work in a simple discrete setting:
We will use a projection step to extend this idea to a broad class of RL problems; a sketch of such a projection is below.
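A minimal NumPy sketch of such a projection step, assuming a fixed categorical support as in C51 (function name and constants are illustrative, not from the slides):

import numpy as np

def project_target(r, next_probs, z, gamma=0.99):
    # Project the distribution of r + gamma * Z(s', a*) onto the fixed
    # support z (categorical projection, C51-style).
    #   r          -- scalar reward from the sampled transition
    #   next_probs -- (N,) probabilities of Z(s', a*) on the atoms z
    #   z          -- (N,) fixed support, z_i = V_min + i * dz
    # Returns (N,) projected target probabilities m.
    n = z.shape[0]
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]

    tz = np.clip(r + gamma * z, v_min, v_max)  # Bellman-updated atoms
    b = (tz - v_min) / dz                      # fractional grid positions
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    m = np.zeros(n)
    for j in range(n):
        if lower[j] == upper[j]:     # landed exactly on an atom
            m[lower[j]] += next_probs[j]
        else:                        # split mass between the two neighbours
            m[lower[j]] += next_probs[j] * (upper[j] - b[j])
            m[upper[j]] += next_probs[j] * (b[j] - lower[j])
    return m

# toy usage: 51 atoms on [-10, 10], uniform next-state distribution
z = np.linspace(-10.0, 10.0, 51)
m = project_target(r=1.0, next_probs=np.full(51, 1 / 51), z=z)
assert np.isclose(m.sum(), 1.0)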
The operator is not a contraction.
What we had in DQN: a scalar TD target, \(y = r + \gamma \max_{a'} Q(s', a')\).
What we need now: a whole target distribution, \(r + \gamma Z(s', a^*)\), with \(a^* = \arg\max_{a'} \mathbb{E}\,Z(s', a')\).
We have a sample from the buffer: \((s, a, r, s')\).
What distance? Wasserstein?
No. It is a bad idea to estimate it from samples: the stochastic gradients of a sample-based Wasserstein loss are biased.
Kullback–Leibler divergence? Kinda, but it goes to infinity if the supports are disjoint.
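A quick check with two point masses (a toy choice of ours): for \(P = \delta_0\) and \(Q = \delta_1\),
\[ D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \log \frac{1}{0} = \infty. \]
This is exactly why the target is first projected onto the same fixed support, after which the KL (cross-entropy) loss is well-defined.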
Let's set a uniform distribution over some returns: for \(j := 0\) to \(N-1\), atom \((z_\theta)_j\) carries probability \(q_j = 1/N\).
The minimizer of \( d_1(Z, Z_\theta) \) is the \(\theta\) such that \((z_\theta)_j = F^{-1}_Z(\hat \tau_j) \;\; \forall j\), where \(\hat\tau_j = \frac{2j+1}{2N}\) are the quantile midpoints.
So the probabilities are fixed to be uniform, while the values \((z_\theta)_j\) come from a parametric model.
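A small NumPy sketch of this fact (names are ours; the empirical quantiles of samples stand in for the true \(F^{-1}_Z\)):

import numpy as np

def quantile_midpoints(n_atoms):
    # tau_hat_j = (2j + 1) / (2N), j = 0 .. N-1
    return (2 * np.arange(n_atoms) + 1) / (2 * n_atoms)

# best uniform 5-atom approximation (in d_1) of Z ~ N(0, 1)
samples = np.random.randn(100_000)
z_theta = np.quantile(samples, quantile_midpoints(5))
print(z_theta)  # approx. the 10%, 30%, 50%, 70%, 90% quantiles of N(0, 1)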
What if \(F^{-1}_Z\) is not available and you have only samples from \(Z\)?
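The standard remedy is quantile regression (as used in QR-DQN): for each atom \(j\), minimise
\[ \mathbb{E}_{z \sim Z}\Big[\rho_{\hat\tau_j}\big(z - (z_\theta)_j\big)\Big], \qquad \rho_\tau(u) = u \big(\tau - \mathbb{1}[u < 0]\big), \]
whose minimiser is exactly the \(\hat\tau_j\)-quantile of \(Z\), so samples of \(Z\) suffice.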
Input: a transition \((s, a, r, s')\)
Output: the loss for the current \(Z_\theta(s, a)\)
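A hedged NumPy sketch of what such an update step can compute, in the quantile parameterisation above (QR-DQN-style; names and shapes are ours, no Huber smoothing):

import numpy as np

def qr_loss(theta_sa, theta_next, r, gamma=0.99):
    # Quantile-regression loss for one sampled transition.
    #   theta_sa   -- (N,) predicted quantile values of Z_theta(s, a)
    #   theta_next -- (N,) quantile values of Z_theta(s', a*),
    #                 a* greedy w.r.t. the mean of Z_theta(s', .)
    n = theta_sa.shape[0]
    tau_hat = (2 * np.arange(n) + 1) / (2 * n)   # quantile midpoints

    target = r + gamma * theta_next              # distributional TD target, (N,)
    u = target[None, :] - theta_sa[:, None]      # pairwise TD errors, (N, N)

    # rho_tau(u) = u * (tau - 1[u < 0]), averaged over target atoms
    rho = u * (tau_hat[:, None] - (u < 0).astype(float))
    return rho.mean(axis=1).sum()

In practice \(r\) and \(s'\) come from the buffer sample above, and the gradient w.r.t. theta_sa is taken by the autodiff framework.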
The Wasserstein metric in its maximal form is a distance between value distributions:
\[ \bar d_p(Z_1, Z_2) = \sup_{s, a} W_p\big(Z_1(s, a), Z_2(s, a)\big). \]
The distributional Bellman operator \(\mathcal{T}^\pi\) is a \(\gamma\)-contraction in the maximal Wasserstein distance:
\[ \bar d_p\big(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2\big) \le \gamma\, \bar d_p(Z_1, Z_2). \]
Proof sketch: rewrite the RHS of the inequality above, then the LHS (or, equivalently, its \(p\)-th power); due to the monotonicity of \(x \mapsto x^{1/p}\), it suffices to apply Jensen's inequality, and voilà.
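Spelled out (a sketch following Lemma 3 of Bellemare et al., 2017): for every \((s, a)\),
\[
\begin{aligned}
W_p\big(\mathcal{T}^\pi Z_1(s,a),\, \mathcal{T}^\pi Z_2(s,a)\big)
&= W_p\big(R(s,a) + \gamma Z_1(S', A'),\; R(s,a) + \gamma Z_2(S', A')\big) \\
&\le \gamma\, W_p\big(Z_1(S', A'),\, Z_2(S', A')\big)
\;\le\; \gamma \sup_{s', a'} W_p\big(Z_1(s', a'),\, Z_2(s', a')\big),
\end{aligned}
\]
where the first inequality uses shift-invariance and positive homogeneity of \(W_p\), and the mixture over \((S', A')\) is handled by convexity of \(W_p^p\) (Jensen's inequality) followed by the monotone map \(x \mapsto x^{1/p}\). Taking \(\sup_{s,a}\) on the left yields \(\bar d_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma\, \bar d_p(Z_1, Z_2)\).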