Pavel Temirchev
We expect DQN to learn two features:
But it can learn arbitrary stuff, since we are learning only the average value (which is zero here)!
Intuitively, which one is better?
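For instance, a toy case (numbers chosen for illustration): suppose the return from \((s, a)\) is \(+1\) or \(-1\) with probability \(1/2\) each. Then
\[ Q(s, a) = \mathbb{E}[Z(s, a)] = \tfrac12 (+1) + \tfrac12 (-1) = 0, \]
so DQN cannot distinguish this risky state-action pair from one that deterministically yields \(0\).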
An agent that is rational, in some sense, maximises the mathematical expectation of a utility function.
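For example, with a strictly concave utility \(U\) (a risk-averse agent), Jensen's inequality gives, for the toy \(Z\) above,
\[ \mathbb{E}[U(Z)] = \tfrac12 U(+1) + \tfrac12 U(-1) < U(0) = U(\mathbb{E}[Z]), \]
so the agent prefers the deterministic \(0\) even though both options have the same mean. Such objectives need the whole distribution of \(Z\), not just its expectation.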
Where does the randomness under the expectation come from? From stochastic rewards, stochastic transitions, and (possibly) a stochastic policy.
DQN was learning only the mean of what we want: the full distribution of returns.
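In the standard distributional-RL notation:
\[ Q(s, a) = \mathbb{E}\big[Z(s, a)\big], \qquad Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(S', A'), \]
where the second equality holds in distribution, with \(S' \sim p(\cdot \mid s, a)\) and \(A' \sim \pi(\cdot \mid S')\).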
This will work in a simple discrete setting:
We will use a projection step to extend this idea to a broad class of RL problems; a sketch of such a projection is below.
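A minimal NumPy sketch of such a projection step, assuming a fixed categorical support as in C51 (function name and constants are illustrative, not from the slides):

import numpy as np

def project_target(r, next_probs, z, gamma=0.99):
    # Project the distribution of r + gamma * Z(s', a*) onto the fixed
    # support z (categorical projection, C51-style).
    #   r          -- scalar reward from the sampled transition
    #   next_probs -- (N,) probabilities of Z(s', a*) on the atoms z
    #   z          -- (N,) fixed support, z_i = V_min + i * dz
    # Returns (N,) projected target probabilities m.
    n = z.shape[0]
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]

    tz = np.clip(r + gamma * z, v_min, v_max)  # Bellman-updated atoms
    b = (tz - v_min) / dz                      # fractional grid positions
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    m = np.zeros(n)
    for j in range(n):
        if lower[j] == upper[j]:     # landed exactly on an atom
            m[lower[j]] += next_probs[j]
        else:                        # split mass between the two neighbours
            m[lower[j]] += next_probs[j] * (upper[j] - b[j])
            m[upper[j]] += next_probs[j] * (b[j] - lower[j])
    return m

# toy usage: 51 atoms on [-10, 10], uniform next-state distribution
z = np.linspace(-10.0, 10.0, 51)
m = project_target(r=1.0, next_probs=np.full(51, 1 / 51), z=z)
assert np.isclose(m.sum(), 1.0)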
The operator is not a contraction.
What we had in DQN: a scalar TD target, \(y = r + \gamma \max_{a'} Q(s', a')\).
What we need now: a whole target distribution, \(r + \gamma Z(s', a^*)\), with \(a^* = \arg\max_{a'} \mathbb{E}\,Z(s', a')\).
We have a sample from the buffer: \((s, a, r, s')\).
What distance? Wasserstein?
No. It is a bad idea to estimate it from samples: the stochastic gradients of a sample-based Wasserstein loss are biased.
Kullback–Leibler divergence? Kinda, but it goes to infinity if the supports are disjoint.
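A quick check with two point masses (a toy choice of ours): for \(P = \delta_0\) and \(Q = \delta_1\),
\[ D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \log \frac{1}{0} = \infty. \]
This is exactly why the target is first projected onto the same fixed support, after which the KL (cross-entropy) loss is well-defined.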
Let's set a uniform distribution over some returns: for \(j := 0\) to \(N-1\), atom \((z_\theta)_j\) carries probability \(q_j = 1/N\).
The minimizer of \( d_1(Z, Z_\theta) \) is the \(\theta\) such that \((z_\theta)_j = F^{-1}_Z(\hat \tau_j) \;\; \forall j\), where \(\hat\tau_j = \frac{2j+1}{2N}\) are the quantile midpoints.
So the probabilities are fixed to be uniform, while the values \((z_\theta)_j\) come from a parametric model.
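A small NumPy sketch of this fact (names are ours; the empirical quantiles of samples stand in for the true \(F^{-1}_Z\)):

import numpy as np

def quantile_midpoints(n_atoms):
    # tau_hat_j = (2j + 1) / (2N), j = 0 .. N-1
    return (2 * np.arange(n_atoms) + 1) / (2 * n_atoms)

# best uniform 5-atom approximation (in d_1) of Z ~ N(0, 1)
samples = np.random.randn(100_000)
z_theta = np.quantile(samples, quantile_midpoints(5))
print(z_theta)  # approx. the 10%, 30%, 50%, 70%, 90% quantiles of N(0, 1)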
What if \(F^{-1}_Z\) is not available and you have only samples from \(Z\)?
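The standard remedy is quantile regression (as used in QR-DQN): for each atom \(j\), minimise
\[ \mathbb{E}_{z \sim Z}\Big[\rho_{\hat\tau_j}\big(z - (z_\theta)_j\big)\Big], \qquad \rho_\tau(u) = u \big(\tau - \mathbb{1}[u < 0]\big), \]
whose minimiser is exactly the \(\hat\tau_j\)-quantile of \(Z\), so samples of \(Z\) suffice.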
Input: a transition \((s, a, r, s')\)
Output: the loss for the current \(Z_\theta(s, a)\)
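A hedged NumPy sketch of what such an update step can compute, in the quantile parameterisation above (QR-DQN-style; names and shapes are ours, no Huber smoothing):

import numpy as np

def qr_loss(theta_sa, theta_next, r, gamma=0.99):
    # Quantile-regression loss for one sampled transition.
    #   theta_sa   -- (N,) predicted quantile values of Z_theta(s, a)
    #   theta_next -- (N,) quantile values of Z_theta(s', a*),
    #                 a* greedy w.r.t. the mean of Z_theta(s', .)
    n = theta_sa.shape[0]
    tau_hat = (2 * np.arange(n) + 1) / (2 * n)   # quantile midpoints

    target = r + gamma * theta_next              # distributional TD target, (N,)
    u = target[None, :] - theta_sa[:, None]      # pairwise TD errors, (N, N)

    # rho_tau(u) = u * (tau - 1[u < 0]), averaged over target atoms
    rho = u * (tau_hat[:, None] - (u < 0).astype(float))
    return rho.mean(axis=1).sum()

In practice \(r\) and \(s'\) come from the buffer sample above, and the gradient w.r.t. theta_sa is taken by the autodiff framework.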
The Wasserstein metric in its maximal form is a distance between value distributions:
\[ \bar d_p(Z_1, Z_2) = \sup_{s, a} W_p\big(Z_1(s, a), Z_2(s, a)\big). \]
The distributional Bellman operator \(\mathcal{T}^\pi\) is a \(\gamma\)-contraction in the maximal Wasserstein distance:
\[ \bar d_p\big(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2\big) \le \gamma\, \bar d_p(Z_1, Z_2). \]
Proof sketch: rewrite the RHS of the inequality above, then the LHS (or, equivalently, its \(p\)-th power); due to the monotonicity of \(x \mapsto x^{1/p}\), it suffices to apply Jensen's inequality, and voilà.
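Spelled out (a sketch following Lemma 3 of Bellemare et al., 2017): for every \((s, a)\),
\[
\begin{aligned}
W_p\big(\mathcal{T}^\pi Z_1(s,a),\, \mathcal{T}^\pi Z_2(s,a)\big)
&= W_p\big(R(s,a) + \gamma Z_1(S', A'),\; R(s,a) + \gamma Z_2(S', A')\big) \\
&\le \gamma\, W_p\big(Z_1(S', A'),\, Z_2(S', A')\big)
\;\le\; \gamma \sup_{s', a'} W_p\big(Z_1(s', a'),\, Z_2(s', a')\big),
\end{aligned}
\]
where the first inequality uses shift-invariance and positive homogeneity of \(W_p\), and the mixture over \((S', A')\) is handled by convexity of \(W_p^p\) (Jensen's inequality) followed by the monotone map \(x \mapsto x^{1/p}\). Taking \(\sup_{s,a}\) on the left yields \(\bar d_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma\, \bar d_p(Z_1, Z_2)\).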