Distributional RL

Pavel Temirchev

Distributional - not distributed

Reminder: Q-function

By definition:

Q^\pi(s, a) = \mathbb{E}\big[ \sum_t \gamma^t r(s_t, a_t) \,\big|\, s_0 = s, a_0 = a \big]

Motivation

[Figure: a gridworld where reaching the finish gives V^*(s) = +1 and hitting the rocket gives V^*(s) = -1]

Motivation

[Figure: a state from which the agent ends at the finish or at the rocket with probability 0.5 each, so V^*(s) = 0.5 \cdot (1 - 1) = 0]

We expect DQN to learn two features:

  • is rocket? (then -1)
  • is finish? (then +1)

But it can learn arbitrary stuff, since we are learning the average value (which is zero)!

Rewards are not equal

R_1(a) \in \{-8, 12\} with probability 0.5 each
R_2(a) = 2 deterministically

\mathbb{E} R_1(a) = 2
\mathbb{E} R_2(a) = 2

Intuitively, which one is better?

Utility functions

Von Neumann-Morgenstern theorem:

An agent that is rational (in a specific axiomatic sense) maximizes the mathematical expectation of a utility function.

In our case, the utility function is the return:

Z(s, a) = \sum_t \gamma^t R(s_t, a_t)
Q(s, a) = \mathbb{E}Z(s, a)
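As a quick illustration (not from the slides; the reward sequences are made up), sampling returns \( \tilde z \sim Z(s, a) \) and averaging recovers Q(s, a):

import numpy as np

def sampled_return(rewards, gamma=0.99):
    """One sample z ~ Z(s, a): the discounted sum of rewards along an episode."""
    return float(np.sum(gamma ** np.arange(len(rewards)) * np.asarray(rewards)))

episodes = [[1.0, 0.0, 1.0], [0.0, 1.0]]   # made-up reward sequences starting from (s, a)
q_estimate = np.mean([sampled_return(ep) for ep in episodes])   # Q(s, a) = E Z(s, a)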

Where does the randomness under the expectation come from?

Maximization of the Q-function is enough to obtain an optimal policy.

And the Q-function is defined in terms of expected rewards.

Q(s, a) = \mathbb{E} R(s, a) + \gamma \mathbb{E}Q(s', a')

Expectations are enough: we don't need to know the distribution of rewards to succeed.

Let's try to make learning easier

The deep Q-network should return a distribution over returns!

was:

(s, a) \xrightarrow{\text{DQN}} Q(s, a)

what we want:

(s, a) \xrightarrow{\text{DQN}} p(Q = z_1|s, a), \; p(Q = z_2|s, a), \; \dots, \; p(Q = z_n|s, a)

This will work in a simple discrete setting:

Q(s, a) \in \{z_1, z_2, \dots, z_n \}
Q(s, a) = \sum_i z_i \, p(Q = z_i|s, a)

We will use a projection step to extend this idea to a broad class of RL problems.
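A tiny NumPy illustration of this expectation over atoms (the numbers are made up):

import numpy as np

z = np.array([-1.0, 0.0, 1.0])   # atoms z_i
p = np.array([0.2, 0.3, 0.5])    # p(Q = z_i | s, a)
q = np.dot(z, p)                 # Q(s, a) = sum_i z_i p(z_i | s, a) = 0.3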

Reminder

Bellman equation:

Q^\pi(s, a) = \mathbb{E} R(s, a) + \gamma \mathbb{E}_{p, \pi} Q^\pi(s', a')

Bellman optimality equation:

Q^*(s, a) = \mathbb{E} R(s, a) + \gamma \mathbb{E}_{p} \max_{a'} Q^*(s', a')

In the form of operators:

[\mathcal{T}^\pi Q](s, a) = \mathbb{E} R(s, a) + \gamma \mathbb{E}_{p, \pi} Q(s', a')
[\mathcal{T} Q](s, a) = \mathbb{E} R(s, a) + \gamma \mathbb{E}_{p} \max_{a'} Q(s', a')

Both operators are contractions in \| \cdot \|_\infty.
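As a sanity check (not from the slides), a tiny NumPy experiment on a random MDP confirming the \(\gamma\)-contraction of the optimality operator; all sizes and distributions are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is p(s' | s, a)
R = rng.normal(size=(nS, nA))

def T(Q):
    """Bellman optimality operator: (TQ)(s, a) = E R + gamma * E_p max_a' Q(s', a')."""
    return R + gamma * P @ Q.max(axis=1)

Q1, Q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
before = np.abs(Q1 - Q2).max()
after = np.abs(T(Q1) - T(Q2)).max()
print(after <= gamma * before)                   # True: a gamma-contraction in sup-norm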

Distributional perspective on RL

Distributional Bellman operator:

\mathcal{T}^\pi Z(s, a) := R(s, a) + \gamma P^\pi Z(s, a)

or, written through the random successor pair,

\mathcal{T}^\pi Z(s, a) := R(s, a) + \gamma Z(S', A')

where

S' \sim p(\cdot | s, a), \;\;\; A' \sim \pi(\cdot | S')

Is the distributional operator a contraction?

 

Wasserstein metric

d_p(U, V) = \Big( \inf_{\mu \in \Gamma(U, V)} \mathbb{E}_{(u, v) \sim \mu} ||u-v||^p_p \Big)^{1/p}

or

d_p(U, V) = \Big(\int^1_0 |F^{-1}_U(\omega)-F^{-1}_V(\omega)|^p d\omega\Big)^{1/p}
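A small NumPy sketch of the second (inverse-CDF) form for empirical samples of equal size; the distributions here are illustrative:

import numpy as np

def wasserstein_p(u_samples, v_samples, p=1):
    """d_p between two empirical distributions via sorted samples (inverse CDFs)."""
    u, v = np.sort(u_samples), np.sort(v_samples)
    assert len(u) == len(v)          # equal sizes keep the quantile grid shared
    return np.mean(np.abs(u - v) ** p) ** (1 / p)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)
b = rng.normal(0.5, 1.0, 10_000)
print(wasserstein_p(a, b))           # ~0.5: shifting a Gaussian by c costs d_1 = c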

Is the distributional operator a contraction?

 

Wasserstein metric's properties

For a random variable A, define \|A\|_p = (\mathbb{E}\|A\|_p^p)^{1/p}. For A independent of U and V:

d_p(A + U, A + V) \le d_p(U, V)
d_p(aU, aV) \le |a| \, d_p(U, V)
d_p(AU, AV) \le \|A\|_p \, d_p(U, V)

What about the optimality distributional Bellman operator?

 

Lemma

Operator \mathcal{T} is not a contraction.

Lemma

For the distributional optimality operator

\mathcal{T} Z(s, a) := R(s, a) + \gamma P^{\pi\in \mathcal{G}_Z}Z(s, a)
\mathcal{G}_Z := \{ \pi: \mathbb{E}_\pi \mathbb{E}Z(s,a) = \max_{a'} \mathbb{E}Z(s,a') \}

the expectations still contract, since they evolve by the classical optimality operator \mathcal{T}_E:

|| \mathbb{E} \mathcal{T} Z_1 - \mathbb{E} \mathcal{T} Z_2||_\infty = || \mathcal{T}_E \mathbb{E} Z_1 - \mathcal{T}_E \mathbb{E} Z_2||_\infty \le \gamma ||\mathbb{E}Z_1 - \mathbb{E}Z_2||_\infty

Optimality distributional Bellman operator
is not a contraction!

\mathcal{T} Z(s, a) := R(s, a) + \gamma P^{\pi\in \mathcal{G}_Z}Z(s, a)
\mathcal{G}_Z := \{ \pi: \mathbb{E}_\pi \mathbb{E}Z(s,a) = \max_{a'} \mathbb{E}Z(s,a') \}

Counterexample (\( \gamma = 1, \; 0 < \epsilon < 1/2 \)). State s_1 has a single action a_1 with R(s_1, a_1) = 0 leading to s_2; state s_2 has two terminal actions: a_2^1 with R(s_2, a_2^1) = 0 and a_2^2 with R(s_2, a_2^2) = \epsilon \pm 1 (each of \epsilon - 1, \epsilon + 1 with probability 1/2).

The optimal policy is \pi^*(s_1) = a_1, \; \pi^*(s_2) = a_2^2, so

Z^*(s_1, a_1) = \epsilon \pm 1, \;\;\; Z^*(s_2, a_2^1) = 0, \;\;\; Z^*(s_2, a_2^2) = \epsilon \pm 1

Take Z equal to Z^* except Z(s_2, a_2^2) = -\epsilon \pm 1. The greedy policy under Z now prefers a_2^1 (mean 0 over mean -\epsilon), so \mathcal{T}Z(s_1, a_1) = 0 + Z(s_2, a_2^1) = 0, and

\bar d_1(Z, Z^*) = 2 \epsilon
\bar d_1(\mathcal{T}Z, Z^*) = \frac{1}{2} | 1 - \epsilon| + \frac{1}{2} | 1 + \epsilon| = 1 > 2\epsilon

One application of \mathcal{T} has increased the distance.
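A numerical check of this counterexample (a sketch that assumes the reconstruction above; all value distributions are two-atom uniform, \( \gamma = 1 \)):

import numpy as np

def d1(atoms_u, atoms_v):
    """d_1 between two 2-atom uniform distributions via sorted atoms."""
    return np.mean(np.abs(np.sort(atoms_u) - np.sort(atoms_v)))

eps = 0.1
z_star = {("s1", "a1"): [eps - 1, eps + 1],
          ("s2", "a21"): [0.0, 0.0],
          ("s2", "a22"): [eps - 1, eps + 1]}
z = dict(z_star)
z[("s2", "a22")] = [-eps - 1, -eps + 1]

# greedy policy under z picks a_2^1 at s_2 (mean 0 > -eps), so T z(s1, a1) = z(s2, a21)
t_z = dict(z)
t_z[("s1", "a1")] = z[("s2", "a21")]
t_z[("s2", "a22")] = [eps - 1, eps + 1]   # reward eps +- 1 realized at (s2, a22)

d_before = max(d1(z[k], z_star[k]) for k in z_star)    # 2 * eps = 0.2
d_after = max(d1(t_z[k], z_star[k]) for k in z_star)   # 1.0 > 0.2: no contraction
print(d_before, d_after)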

OK! Still! We want to use it instead of DQN anyway.

 
Distribution parametrization

What we had in DQN: a network with parameters \theta maps (s, a) to a single number

\hat Q (s, a) \in \mathbb{R}

What we need now: a network with parameters \theta maps (s, a) to a categorical distribution over fixed atoms z_0, z_1, \dots placed uniformly between z_{MIN} and z_{MAX}:

p_\theta(z_j|s, a) \;\; \forall j
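A minimal PyTorch-style sketch of such a categorical head; the layer sizes, names, and defaults are illustrative, not from the slides:

import torch
import torch.nn as nn

class CategoricalQNet(nn.Module):
    """Maps a state to per-action categorical distributions over N fixed atoms."""

    def __init__(self, obs_dim, n_actions, n_atoms=51, z_min=-10.0, z_max=10.0):
        super().__init__()
        # fixed, uniformly spaced support z_0 .. z_{N-1} on [z_MIN, z_MAX]
        self.register_buffer("z", torch.linspace(z_min, z_max, n_atoms))
        self.n_actions, self.n_atoms = n_actions, n_atoms
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions * n_atoms),
        )

    def forward(self, s):
        # p_theta(z_j | s, a): a probability vector over atoms, per action
        logits = self.body(s).view(-1, self.n_actions, self.n_atoms)
        return logits.softmax(dim=-1)

    def q_values(self, s):
        # Q(s, a) = sum_j z_j p_theta(z_j | s, a)
        return (self.forward(s) * self.z).sum(dim=-1)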

Distributional Bellman update rule

We have a sample from the buffer: (s, a, r, s')

  • The target network outputs p_{\tilde\theta}(z_j|s', a) for every action a
  • Take means Q(s', a) = \sum_j z_j \, p_{\tilde\theta}(z_j|s', a) and pick the greedy action a' = \arg\max_a Q(s', a)
  • Shift the atoms: \hat{\mathcal{T}}z_j = r + \gamma z_j
  • Keep the probabilities: p(\hat{\mathcal{T}}z_j|s, a) = p_{\tilde\theta}(z_j|s', a')

Now, minimize the distance from a guess to the guess.

 

What distance? Wasserstein?

No. It is a bad idea to estimate it from samples, since the sample estimate is biased:

d_p(\mathbb{E}_i P_i, Q) \le \mathbb{E}_i d_p(P_i, Q)

Kullback-Leibler divergence? Kinda, but it goes to infinity if the supports are disjoint.

KL \Big ( \hat{\mathcal{T}}Z_{\tilde \theta}(s, a) \Big|\Big| Z_\theta(s, a) \Big) = KL ( q || p) = \sum_{z: q(z) > 0} q(z) \log \frac{q(z)}{p(z)}

Project the updated distribution on the support (the shifted atoms \hat{\mathcal{T}}z_j are first clipped to [z_{MIN}, z_{MAX}]).

For j := 0 to N-1:
b = \frac{\hat{\mathcal{T}}z_j - z_{MIN}}{\Delta z}
l = \lfloor b \rfloor ; \;\;\;\;u = \lceil b\rceil
\Phi q(z_l) = \Phi q(z_l) + p_{\tilde\theta}(z_j|s', a')(u - b)
\Phi q(z_u) = \Phi q(z_u) + p_{\tilde\theta}(z_j|s', a')(b - l)

Each shifted atom \hat{\mathcal{T}}z_j splits its mass p_{\tilde\theta}(z_j|s', a') between the neighbouring atoms z_l and z_u. The KL then reduces to a cross-entropy loss:

KL \Big ( \Phi \hat{\mathcal{T}}Z_{\tilde \theta}(s, a) \Big|\Big| Z_\theta(s, a) \Big) = KL ( \Phi q || p_\theta) \propto -\sum_{j=0} ^{N-1} \Phi q(z_j) \log p_\theta(z_j) \rightarrow \min_\theta
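The same projection as a NumPy function (a sketch for a single transition; the names are illustrative):

import numpy as np

def categorical_projection(p_next, z, r, gamma):
    """Project the shifted distribution (r + gamma * z, p_next) back onto atoms z.

    p_next: probabilities p_tilde(z_j | s', a'), shape (N,)
    z:      fixed atoms z_0 .. z_{N-1}, uniformly spaced, shape (N,)
    """
    n = len(z)
    dz = z[1] - z[0]
    q = np.zeros(n)
    tz = np.clip(r + gamma * z, z[0], z[-1])    # shifted, clipped atoms T z_j
    b = (tz - z[0]) / dz                        # fractional index of each shifted atom
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)
    for j in range(n):
        if l[j] == u[j]:                        # atom lands exactly on the grid
            q[l[j]] += p_next[j]
        else:                                   # split mass between neighbours
            q[l[j]] += p_next[j] * (u[j] - b[j])
            q[u[j]] += p_next[j] * (b[j] - l[j])
    return q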

C51 results

[result figures omitted]

Cons of C51

 
  • Hyperparameters \( z_{MIN} , z_{MAX} \) are required in advance

  • We are not minimizing Wasserstein!

  • A strange projection step is required

Let's transpose the parametrization!

Let's fix a uniform distribution over some returns:

p_j = \frac{1}{N}
\tau_j = \sum_{i \le j} p_i

while the return values come from a parametric model:

(s, a), \; \theta \;\rightarrow\; z_\theta(s, a)_j \;\; \forall j

The minimizer of \( d_1(Z, Z_\theta) \) is such \(\theta : (z_\theta)_j = F^{-1}_Z(\hat \tau_j) \;\; \forall j \), where \( \hat \tau_j = \frac{\tau_{j-1} + \tau_j}{2} \) are the quantile midpoints.
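A numerical sanity check of this claim (assumes SciPy; the Gaussian target and N = 4 are arbitrary): the quantile-midpoint atoms achieve a smaller d_1 than a perturbed set.

import numpy as np
from scipy.stats import norm

def d1_to_atoms(theta, inv_cdf, m=100_000):
    """d_1 between Z (given its inverse CDF) and a uniform mixture over atoms theta."""
    omegas = (np.arange(m) + 0.5) / m                  # dense quantile grid
    step = np.repeat(np.sort(theta), m // len(theta))  # inverse CDF of the atom mixture
    return np.mean(np.abs(inv_cdf(omegas) - step))

N = 4
tau_hat = (np.arange(N) + 0.5) / N                     # quantile midpoints
best = norm.ppf(tau_hat)                               # claimed d_1 minimizer for Z = N(0, 1)
print(d1_to_atoms(best, norm.ppf))                     # smaller than any perturbation, e.g.:
print(d1_to_atoms(best + 0.1, norm.ppf))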

Let's transpose the parametrization!

What if \( N = 1 \) and \( \hat \tau = 0.5 \), and you have only samples from \(Z\)? Median regression:

\theta = \arg\min_\theta \mathbb{E}_{\tilde z \sim Z} \Big [ |\tilde z - z_\theta| \Big]

What if \( N = 1 \) and some other \( \hat \tau \)? Quantile regression:

\theta = \arg\min_\theta \mathbb{E}_{\tilde z \sim Z} \Big [ \rho_{\hat \tau} (\tilde z - z_\theta) \Big]

What if \( N > 1 \)?

\theta = \arg\min_\theta \sum_j \mathbb{E}_{\tilde z \sim Z} \Big [ \rho_{\hat \tau_j} (\tilde z - (z_\theta)_j) \Big]

where \( \rho_\tau(x) = x(\tau - [x<0]) \) is the quantile (pinball) loss.
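A quick NumPy check that minimizing the pinball loss recovers the \( \hat\tau \)-quantile (the sampling distribution and search grid are arbitrary):

import numpy as np

def pinball_loss(theta, samples, tau):
    """rho_tau(x) = x * (tau - [x < 0]) averaged over samples, with x = sample - theta."""
    x = samples - theta
    return np.mean(x * (tau - (x < 0)))

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)
tau = 0.9
grid = np.linspace(-3, 3, 601)
losses = [pinball_loss(t, samples, tau) for t in grid]
best = grid[int(np.argmin(losses))]
print(best, np.quantile(samples, tau))   # both close to the 0.9-quantile of N(0, 1)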

Quantile Regression DQN

 
Input: (s, a, r, s')

Q(s', \hat a) = \frac{1}{N} \sum_j z_\theta(s', \hat a)_j
a' = \arg\max_{\hat a} Q(s', \hat a)
\hat{\mathcal{T}}z_j = r + \gamma z_\theta(s', a')_j \;\;\; \forall j

Output: the loss

\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^N \rho_{\hat \tau_i}\Big(\hat{\mathcal{T}}z_j - z_\theta(s, a)_i\Big)
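A PyTorch sketch of this loss (no Huber smoothing, as on the slides); qnet, target_qnet, and the n_quantiles attribute are assumed interfaces, and the loss is averaged rather than summed (same minimizer up to scale):

import torch

def qr_dqn_loss(qnet, target_qnet, s, a, r, s_next, gamma=0.99):
    """Quantile-regression TD loss for a batch of transitions."""
    batch = s.shape[0]
    n = qnet.n_quantiles                              # assumed attribute
    tau_hat = (torch.arange(n, dtype=torch.float32) + 0.5) / n   # quantile midpoints

    with torch.no_grad():
        z_next = target_qnet(s_next)                  # (batch, actions, N) quantile values
        a_next = z_next.mean(dim=-1).argmax(dim=-1)   # greedy action by mean return
        z_target = r[:, None] + gamma * z_next[torch.arange(batch), a_next]   # (batch, N)

    z_pred = qnet(s)[torch.arange(batch), a]          # (batch, N)
    # pairwise TD errors: u[b, i, j] = T z_j - z_theta(s, a)_i
    u = z_target[:, None, :] - z_pred[:, :, None]
    return (u * (tau_hat[None, :, None] - (u < 0).float())).mean()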

QR-DQN results

[result figures omitted]

Is the distributional operator a contraction?

 

Proposition:

The Wasserstein metric in its maximal form is a distance between value distributions:

\bar{d}_p(Z_1, Z_2) = \sup_{s, a} d_p(Z_1(s, a), Z_2(s, a))

Lemma:

The distributional Bellman operator is a contraction in the maximal Wasserstein metric:

\bar{d}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \gamma \bar{d}_p( Z_1, Z_2)

Is the distributional operator a contraction?

 

Proof:

We want to show:

\bar{d}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \gamma \bar{d}_p( Z_1, Z_2)
\sup_{s, a} d_p(\mathcal{T}^\pi Z_1(s,a), \mathcal{T}^\pi Z_2(s, a)) \leq \gamma \sup_{s,a } d_p( Z_1(s,a), Z_2(s,a))

For any (s, a), using the properties of d_p above:

d_p(\mathcal{T}^\pi Z_1(s,a), \mathcal{T}^\pi Z_2(s, a)) =
= d_p\Big(R(s,a) + \gamma Z_1(S',A'), R(s,a) + \gamma Z_2(S', A')\Big) \leq
\le d_p\Big( \gamma Z_1(S',A'), \gamma Z_2(S', A')\Big) \leq
\le \gamma d_p\Big( Z_1(S',A'), Z_2(S', A')\Big) \leq
\le \gamma \sup_{s', a'} d_p\Big( Z_1(s',a'), Z_2(s', a')\Big)

The last step uses the lemma proved below with X = (S', A').

Is the distributional operator a contraction?

 

Proof:

It remains to prove the lemma:

d_p\Big( Z_1(X), Z_2(X)\Big) \leq \sup_{x} d_p\Big( Z_1(x), Z_2(x)\Big)

For a fixed x, with Z_1(x) \sim p(z_1|x) and Z_2(x) \sim p(z_2|x), the RHS is:

\sup_x d_p\Big( Z_1(x), Z_2(x)\Big) = \sup_x \Big(\inf_{\mu \in \Gamma_x} \mathbb{E}_\mu ||z_1-z_2||^p_p \Big)^{1/p}

\Gamma_x(Z_1(x), Z_2(x)) = \Big\{ \mu : \int \mu(z_1, z_2 | x)\, dz_2 = p(z_1|x), \;\; \int \mu(z_1, z_2| x)\, dz_1 = p(z_2|x) \Big\}

Is the distributional operator a contraction?

 

Proof:

d_p\Big( Z_1(X), Z_2(X)\Big) \leq \sup_{x} d_p\Big( Z_1(x), Z_2(x)\Big)

On the LHS, X is random: Z_1(X) \sim \int p(z_1|x) p(x) dx and Z_2(X) \sim \int p(z_2|x) p(x) dx, so

d_p\Big( Z_1(X), Z_2(X)\Big) = \Big(\inf_{\mu \in \Gamma_X} \mathbb{E}_\mu ||z_1-z_2||^p_p \Big)^{1/p}

\Gamma_X(Z_1(X), Z_2(X)) = \Big\{ \mu : \int \mu(z_1, z_2 | x) p(x)\, dz_2 dx = p(z_1), \;\; \int \mu(z_1, z_2| x) p(x)\, dz_1 dx = p(z_2) \Big\}

Is the distributional operator a contraction?

 

Proof:

d_p\Big( Z_1(X), Z_2(X)\Big) \leq \sup_{x} d_p\Big( Z_1(x), Z_2(x)\Big)

LHS:

d_p\Big( Z_1(X), Z_2(X)\Big) = \Big(\inf_{\mu \in \Gamma} \int \mu(z_1, z_2|x)p(x)||z_1-z_2||^p_p\,dz_1dz_2dx \Big)^{1/p}

or, equally:

d_p\Big( Z_1(X), Z_2(X)\Big) = \Big(\inf_{\mu \in \Gamma} \mathbb{E}_x \int \mu(z_1, z_2|x)||z_1-z_2||^p_p\,dz_1dz_2 \Big)^{1/p}

And the RHS of the above inequality can be rewritten as:

\sup_x d_p\Big( Z_1(x), Z_2(x)\Big) = \sup_x \Big(\inf_{\mu \in \Gamma} \int \mu(z_1, z_2|x)||z_1-z_2||^p_p\,dz_1dz_2 \Big)^{1/p}

Is the distributional operator a contraction?

 

The last stage of the proof:

\Big( \inf_\mu \mathbb{E}_{p(x)} f(\mu, x) \Big)^{1/p} \le \sup_x \Big(\inf_\mu f(\mu, x) \Big) ^{1/p}

Due to the monotonicity of \( t \mapsto t^{1/p} \), it suffices to show:

\inf_\mu \mathbb{E}_{p(x)} f(\mu, x) \le \sup_x \inf_\mu f(\mu, x)

Since the coupling \mu(\cdot, \cdot | x) may be chosen separately for each x, the infimum passes inside the expectation, and an average is bounded by the supremum. Voila:

\inf_\mu \mathbb{E}_{p(x)} f(\mu, x) \le \mathbb{E}_{p(x)} \inf_\mu f(\mu, x) \le \sup_x \inf_\mu f(\mu, x)

Thank you

 
