speaker: Pavel Temirchev

# What we will NOT discuss today?

• Distributional RL

• hey, we already discussed it

# What we WILL discuss

### How can we treat the RL problem as a probabilistic inference problem?

\mathbb{E}_\pi \sum_{t=0}^T r(s_t, a_t) \rightarrow \max_\pi

p(A|B) = \;?

### We want to infer:

\pi(a_t| s_t, \pi \;\text{is optimal})

# Background

### Generally, a joint probability distribution factorizes by the chain rule:

p(a, b, c, d, e) = p(a|b, c, d, e) p(b|c, d, e) p(c|d, e) p(d|e) p(e)

### A probabilistic graphical model lets us embed structure into the model:

p(a, b, c, d, e) = p(a) p(b) p(c|a, b) p(d|c) p(e|c)

(Figure: directed graph with edges a → c, b → c, c → d, c → e)

# Background

### Model:

p(x_{0:L}, z_{0:L}) = p(x_0)p(z_0|x_0)\prod_{l=1}^L p(x_l|x_{l-1})p(z_l|x_l)

(Figure: chain x_0 → x_1 → \dots → x_L, with an edge x_l → z_l at every step)

### Inference:

p(z_l) = \;? \;\;\; \forall l

### We will discuss a very simple example of the Message Passing Algorithm on trees.

p(z_0) = \int p(z_0|x_0) p(x_0) dx_0
p(x_1) = \int p(x_1|x_0) p(x_0) dx_0
p(z_1) = \int p(z_1|x_1) p(x_1) dx_1
p(x_2) = \int p(x_2|x_1) p(x_1) dx_1
\dots
p(z_l) = \int p(z_l|x_l) p(x_l) dx_l
p(x_{l+1}) = \int p(x_{l+1}|x_l) p(x_l) dx_l
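
The forward recursion above can be sketched for discrete variables, where each integral becomes a matrix-vector product. This is a minimal illustration, not part of the original slides: the function name `chain_marginals` and the toy matrices are assumptions, and the transition and emission tables are taken to be the same at every step.

```python
import numpy as np

def chain_marginals(p_x0, trans, emit, L):
    """Forward message passing on the chain x_0 -> x_1 -> ... -> x_L with z_l | x_l.

    p_x0:  (K,)   prior over x_0
    trans: (K, K) trans[i, j] = p(x_{l+1} = j | x_l = i)
    emit:  (K, M) emit[i, m]  = p(z_l = m | x_l = i)
    Returns the list of marginals p(z_l) for l = 0..L.
    """
    p_x = np.asarray(p_x0, dtype=float)
    marginals = []
    for _ in range(L + 1):
        marginals.append(p_x @ emit)  # p(z_l) = sum_x p(z_l | x) p(x_l = x)
        p_x = p_x @ trans             # p(x_{l+1}) = sum_x p(x_{l+1} | x) p(x_l = x)
    return marginals

# Hypothetical two-state example.
p_x0 = np.array([1.0, 0.0])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emit = np.array([[0.7, 0.3], [0.1, 0.9]])
p_z = chain_marginals(p_x0, trans, emit, L=2)
```

Each marginal costs one pass over the previous message, so the whole chain is handled in time linear in L.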

# Background

### Bayes' rule:

p(z|x) = \frac{p(x|z) p(z)}{p(x)} = \frac{p(x|z) p(z)}{\int p(x|z)p(z)dz}

### Kullback-Leibler divergence measures how one distribution differs from another, reference distribution (it is not symmetric):

\text{KL}\Big( q(x)\; \big|\big|\; p(x) \Big) = \mathbb{E}_{x \sim q} \Big[ \log \frac{q(x)}{p(x)} \Big]

# Background

### The main difficulty of Bayesian inference is the intractability of the evidence term

p(x) = \int p(x|z)p(z)dz

### One way to go is to use an approximate inference procedure called Variational Inference: approximate the posterior $$p(z|x)$$ with the closest distribution $$q(z)$$ from a tractable family $$\mathcal{Q}$$

\text{KL}\Big( q(z)\; \big|\big|\; p(z|x) \Big) \rightarrow \min_{q \in \mathcal{Q}}

# Background

### Variational Inference reduces the inference problem to the maximization of a lower bound on the evidence

p(x) = \int p(x|z)p(z)dz

### We can rewrite the logarithm of the evidence as follows:

\log p(x) = \text{KL}\Big( q(z)\; \big|\big|\; p(z|x) \Big) + \mathcal{L}(q)

### where $$\mathcal{L}(q)$$ is the so-called Evidence Lower Bound Objective (ELBO):

\mathcal{L}(q) = - \mathbb{E}_q \Big[\log q(z) - \log p(x, z) \Big]

### The left-hand side $$\log p(x)$$ does not depend on $$q$$. Hence, the minimization of $$\text{KL}$$ is equivalent to the maximization of the ELBO $$\mathcal{L}(q)$$. You want to minimize:

\text{KL}\big( q\; ||\; p \big) \rightarrow \min_{q \in \mathcal{Q}}

### or, equivalently, you want to maximize:

\mathcal{L}(q) \rightarrow \max_{q \in \mathcal{Q}}
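
The decomposition of the log-evidence into KL plus ELBO can be checked numerically on a small discrete model. The numbers below are a hypothetical toy example (three latent values, one fixed observation), chosen only for illustration.

```python
import numpy as np

# Hypothetical toy model: z takes 3 values, x is one fixed observation.
p_z = np.array([0.5, 0.3, 0.2])            # prior p(z)
p_x_given_z = np.array([0.1, 0.6, 0.3])    # likelihood p(x | z) at the observed x
p_xz = p_x_given_z * p_z                   # joint p(x, z)
log_evidence = float(np.log(p_xz.sum()))   # log p(x)
posterior = p_xz / p_xz.sum()              # p(z | x)

def elbo(q):
    """L(q) = -E_q[log q(z) - log p(x, z)]."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))

def kl(q, p):
    """KL(q || p) = E_q[log q - log p]."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.2, 0.5, 0.3])              # an arbitrary variational distribution
gap = log_evidence - (kl(q, posterior) + elbo(q))
```

Here `gap` is zero up to floating-point error for any valid `q`, and the bound is tight when `q` equals the posterior.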

# Background

### Markov Decision Process:

p(\tau) = p(s_0) \prod_{t=0}^T p(a_t|s_t) p(s_{t+1}|s_t, a_t)

(Figure: graphical model for the MDP with states s_0, s_1, s_2 and actions a_0, a_1, a_2)

### Generally, reward is a random variable:

r(s_t,a_t) = \mathbb{E} \big[ R(s_t, a_t) \big]

### Maximization problem:

\pi^\star = \arg\max_\pi \sum_{t=0}^T \mathbb{E}_{s_t, a_t \sim \pi} [r(s_t, a_t)]

### Value functions (defined for policy):

Q^\pi(s_t,a_t) := r(s_t,a_t) + \sum_{t'=t+1}^T \mathbb{E}_{s_{t'}, a_{t'} \sim \pi} [r(s_{t'}, a_{t'})]
V^\pi(s_t) = \mathbb{E}_a Q^\pi(s_{t}, a)

### Bellman Optimality operator:

Q^\star(s_t,a_t) = r(s_t,a_t) + \mathbb{E}_{s_{t+1}} V^\star(s_{t+1})
V^\star(s_t) = \max_a Q^\star(s_{t}, a)

# A heuristic for better exploration

### Maximum entropy RL

a_t \sim \mathcal{N}(\cdot| \pi^\star, \sigma^2)

a_t \sim \exp{Q(s_t, a_t)}

### How to find such a policy?

\min_\pi\text{KL}\Big(\pi(\cdot|s_0)||\exp{Q(s_0, \cdot)}\Big) =
\max_\pi \mathbb{E}_\pi \Big[ Q(s_0, a_0) - \log \pi(a_0|s_0) \Big] =
\max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) {\color{pink}+ \mathcal{H} \big( \pi(\cdot|s_0) \big)}\Big]
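
(Figure: $$Q^\star(s_0, \cdot)$$ over actions $$a_0$$, from "go left" to "go right", together with the two sampling distributions $$\exp Q^\star(s_0, \cdot)$$ and $$\mathcal{N}(\cdot|\arg\max Q^\star, \sigma^2)$$)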

### It is very similar to the heuristic Maximum Entropy RL objective

\max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) {\color{pink}+ \mathcal{H} \big( \pi(\cdot|s_t) \big)}\Big]
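
For a single state with a few discrete actions, the claim that the policy proportional to exp Q maximizes expected Q plus entropy can be checked directly. This is a small sketch with hypothetical Q-values; the helper names are not from the slides.

```python
import numpy as np

def boltzmann_policy(q):
    """pi(a) proportional to exp Q(a), the minimizer of KL(pi || exp Q)."""
    z = np.exp(q - q.max())
    return z / z.sum()

def entropy_regularized_value(pi, q):
    """E_pi[Q(a) - log pi(a)] = E_pi[Q] + H(pi)."""
    return float(np.sum(pi * (q - np.log(pi))))

q = np.array([1.0, 0.9, -1.0])                 # hypothetical Q(s_0, .) for three actions
pi_soft = boltzmann_policy(q)
pi_near_greedy = np.array([0.98, 0.01, 0.01])  # almost deterministic alternative
best = entropy_regularized_value(pi_soft, q)
other = entropy_regularized_value(pi_near_greedy, q)
log_sum_exp = float(np.log(np.sum(np.exp(q))))
```

The value attained by `pi_soft` equals `log_sum_exp`, the log of the normalizer of exp Q, and no other policy scores higher on this objective.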

# RL as Probabilistic Inference

### Graphical Model with Optimality variables

(Figure: the MDP graphical model with states s_0, s_1, s_2, actions a_0, a_1, a_2, and an optimality variable \mathcal{O}_t attached to each pair (s_t, a_t))

### If $$\mathcal{O}_t = 1$$ then timestep $$t$$ was optimal.

### Probability that the $$(s_t, a_t)$$ pair is optimal:

p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t) = \exp\big(r(s_t,a_t)\big)

### Let us analyze the distribution of trajectories conditioned on optimality:

p(\tau|\mathcal{O}_{0:T}) \propto p(\tau,\mathcal{O}_{0:T}) = p(s_0)\prod_{t=0}^Tp(a_t|s_t)p(s_{t+1}|s_t,a_t) \exp\big(r(s_t, a_t)\big)

# RL as Probabilistic Inference

### Exact inference for Optimal actions

### We can now infer actions conditioned on optimality, i.e. the optimal policy:

p(a_t|s_t, \mathcal{O}_{0:T}) = p(a_t|s_t, \mathcal{O}_{t:T}) \;\;\; (*)
= \frac{ p(s_t, a_t|\mathcal{O}_{t:T})}{ p(s_t|\mathcal{O}_{t:T})}

### $$(*)$$ holds due to the structure of PGM: given $$s_t$$, the action $$a_t$$ does not depend on $$\mathcal{O}_{0:t-1}$$

### let's apply Bayes rule!

\frac{ p(s_t, a_t|\mathcal{O}_{t:T})}{ p(s_t|\mathcal{O}_{t:T})} = \frac{p(\mathcal{O}_{t:T}|s_t, a_t) p(a_t|s_t) p(s_t)}{p(\mathcal{O}_{t:T})}\frac{p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}|s_t) p(s_t)} = \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)} p(a_t|s_t)

### With a uniform prior over actions, $$p(a_t|s_t) = \frac{1}{|\mathcal{A}|}$$, the optimal policy is the following:

p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)}

# Exact inference for optimal actions

### Let's introduce new notation:

\alpha_t(s_t, a_t) := p(\mathcal{O}_{t:T}|s_t, a_t)
\beta_t(s_t) := p(\mathcal{O}_{t:T}|s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t

### For the timestep $$T$$:

\alpha_T(s_T, a_T) = \exp(r(s_T, a_T))
\beta_T(s_T) = \int \alpha_T(s_T, a_T) p(a_T|s_T)da_T

### Recursively:

\alpha_t(s_t, a_t) = \int \beta_{t+1}(s_{t+1}) \exp(r(s_t, a_t)) p(s_{t+1}|s_t, a_t)ds_{t+1}
\beta_t(s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t

### This gives $$p(\mathcal{O}_{t:T}|s_t, a_t)$$ and $$p(\mathcal{O}_{t:T}|s_t)$$, and therefore the policy, for all $$0 \le t \le T$$:

p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)} = \frac{\alpha_t(s_t, a_t)}{\beta_t(s_t)}
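
The backward recursion for the two messages can be sketched for a tabular MDP, where integrals become sums. This is only an illustration under stated assumptions: a uniform action prior, non-positive rewards so that exp(r) is a valid probability, and a hypothetical two-state, two-action MDP. The function names are not from the slides.

```python
import numpy as np

def backward_messages(r, P, T):
    """alpha_t(s, a) = p(O_{t:T} | s, a) and beta_t(s) = p(O_{t:T} | s) for t = 0..T.

    r: (S, A)    rewards, assumed non-positive so that exp(r) is a probability
    P: (S, A, S) P[s, a, s'] = p(s' | s, a)
    Assumes a uniform action prior p(a | s) = 1 / A.
    """
    S, A = r.shape
    alpha = np.zeros((T + 1, S, A))
    beta = np.zeros((T + 1, S))
    alpha[T] = np.exp(r)
    beta[T] = alpha[T].mean(axis=1)
    for t in range(T - 1, -1, -1):
        alpha[t] = np.exp(r) * (P @ beta[t + 1])  # sum over s_{t+1}
        beta[t] = alpha[t].mean(axis=1)           # average over a_t (uniform prior)
    return alpha, beta

def optimal_policy(alpha, beta, t):
    """p(a_t | s_t, O_{t:T}) = alpha_t * p(a_t | s_t) / beta_t with the uniform prior."""
    n_actions = alpha.shape[2]
    return alpha[t] / (n_actions * beta[t][:, None])

# Hypothetical MDP with 2 states and 2 actions.
r = np.array([[-1.0, 0.0], [0.0, -2.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
alpha, beta = backward_messages(r, P, T=2)
pi_last = optimal_policy(alpha, beta, t=2)
pi_first = optimal_policy(alpha, beta, t=0)
```

At the last timestep the policy is simply proportional to exp(r); earlier timesteps also weigh in the probability of staying optimal afterwards.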

# Introducing $$Q^{soft}$$ and $$V^{soft}$$ functions

### Log-scale messages

Q^{soft}(s_t, a_t) := \log\alpha_t(s_t, a_t)
V^{soft}(s_t) := \log\beta_t(s_t)

### Substituting into the recursive relation, we will obtain the following:

V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]

### $$\log \mathbb{E} \exp$$ is a soft maximum: it approximates the hard maximum as $$Q^{soft}(s_t, a_t) \rightarrow \infty$$
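
A quick numerical look at the soft maximum, assuming a uniform distribution over three hypothetical values. The absolute gap to the hard maximum never exceeds the log of the number of values, so it becomes negligible relative to the values themselves as they grow.

```python
import numpy as np

def log_mean_exp(q):
    """Numerically stable log E[exp q] under a uniform distribution over the last axis."""
    q = np.asarray(q, dtype=float)
    m = q.max(axis=-1, keepdims=True)
    return (m + np.log(np.mean(np.exp(q - m), axis=-1, keepdims=True))).squeeze(-1)

q = np.array([1.0, 2.0, 3.0])   # hypothetical values
scales = (1.0, 10.0, 100.0)
# Gap between the hard max and the soft max when all values are scaled up.
gaps = [float(np.max(c * q) - log_mean_exp(c * q)) for c in scales]
# The same gap relative to the size of the hard max.
relative_gaps = [g / (c * q.max()) for g, c in zip(gaps, scales)]
```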

# Compare $$(Q^\star,\;V^\star)$$ with $$(Q^{soft},\;V^{soft})$$

### "Hard" $$Q^\star$$ and $$V^\star$$ functions:

V^\star(s_t) =\max_{a_t} Q^\star(s_t, a_t)
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^\star(s_{t+1})

### "Soft" analogues:

V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]

### Substituting $$V$$ into $$Q$$:

Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1})
Q^{soft}(s_t, a_t) \approx r(s_t, a_t) + \max_{s_{t+1}} \max_{a_{t+1}} Q^{soft}(s_{t+1}, a_{t+1})

### Note the $$\max_{s_{t+1}}$$ in the soft backup, where the hard backup takes an expectation over next states.

# Why are we so optimistic?

p(\tau|\mathcal{O}_{0:T}) = p(s_0|{\color{#ff0000}\mathcal{O}_{0:T}})\prod_{t=0}^T{\color{#00ff00}p(a_t|s_t,\mathcal{O}_{t:T})}p(s_{t+1}|s_t,a_t,{\color{#ff0000}\mathcal{O}_{t+1:T}})

### The initial state and the transitions are conditioned on optimality too, as if the agent could influence the dynamics.

# Variational Inference

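### We cannot sample $$\tau \sim p(\tau|\mathcal{O}_{0:T})$$, because we do not control the dynamics $$p(s_{t+1}|s_t, a_t, \mathcal{O}_{t+1:T})$$. We can only choose the policy $$\pi$$:

q(\tau) = p(s_0)\prod_{t=0}^T \pi(a_t|s_t)p(s_{t+1}|s_t,a_t)

### Find $$\pi$$ such that $$\tau \sim q(\tau)$$ is as close as possible to $$\tau \sim p(\tau|\mathcal{O}_{0:T})$$

### This is a Variational Inference problem:

\text{KL}\big(q(\tau)\;||\;p(\tau|\mathcal{O}_{0:T})\big) \rightarrow \min_\pi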

# Variational Inference

### Let us expand the VI objective using the definition of KL-divergence:

\min_\pi \text{KL}\big(q(\tau)||p(\tau|\mathcal{O}_{0:T})\big) = \min_\pi \Big( - \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)\;p(\mathcal{O}_{0:T})} \Big) = - \max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)} + \text{const}

### The uniform action prior $$p(a_t|s_t)$$ only adds a constant, so we drop it:

\max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)}= \max_\pi \mathbb{E}_q \Big[ \log p(s_0)+\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + r(s_t, a_t) \big) -
- \log p(s_0)-\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + \log \pi(a_t| s_t) \big) \Big]=
= \max_\pi \mathbb{E}_\pi \sum_{t}\Big[ r(s_t, a_t) + \mathcal{H}\big( \pi(\cdot| s_t)\big) \Big]

### this is the Maximum Entropy RL Objective

# Returning to $$Q^{soft}$$ and $$V^{soft}$$ functions

### The objective from the previous slide can be rewritten as follows (check it yourself!):

\sum_{t=0}^T\mathbb{E}_{s_t} \Big[ -\text{KL}\Big(\pi(a_t|s_t)||\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}\Big) + V^{soft}(s_t) \Big] \rightarrow \max_\pi

### where

V^{soft}(s_t) =\log \int \exp Q^{soft}(s_t, a_t) da_t
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^{soft}(s_{t+1})

### It is maximized by the policy

\pi(a_t|s_t) =\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}
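
One way to "check it yourself" numerically is a finite-horizon tabular backward pass with these soft backups, followed by an evaluation of the entropy-regularized return of the resulting policy. The sketch below assumes discrete states and actions (the integral becomes a sum) and a hypothetical two-state MDP; the function names are illustrative only.

```python
import numpy as np

def soft_backward_pass(r, P, T):
    """Finite-horizon soft values: V = log sum_a exp Q, Q = r + E_{s'} V."""
    S, A = r.shape
    Q = np.zeros((T + 1, S, A))
    V = np.zeros((T + 1, S))
    for t in range(T, -1, -1):
        next_v = P @ V[t + 1] if t < T else np.zeros((S, A))
        Q[t] = r + next_v
        m = Q[t].max(axis=1, keepdims=True)
        V[t] = (m + np.log(np.exp(Q[t] - m).sum(axis=1, keepdims=True)))[:, 0]
    policy = np.exp(Q - V[:, :, None])   # pi = exp(Q) / exp(V)
    return Q, V, policy

def entropy_regularized_return(policy, r, P):
    """E_pi sum_t [r - log pi] from each start state, by backward evaluation."""
    T = policy.shape[0] - 1
    J = np.zeros(r.shape[0])
    for t in range(T, -1, -1):
        J = np.sum(policy[t] * (r - np.log(policy[t]) + P @ J), axis=1)
    return J

# Hypothetical MDP with 2 states and 2 actions.
r = np.array([[-1.0, 0.0], [0.0, -2.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
Q, V, policy = soft_backward_pass(r, P, T=3)
J_soft = entropy_regularized_return(policy, r, P)
J_uniform = entropy_regularized_return(np.full_like(policy, 0.5), r, P)
```

The return of the soft policy coincides with the soft value at the first timestep, and the uniform policy does no better.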

# RL as Inference with function approximators

### Policy $$\pi$$ is parametrized with a neural network with parameters $$\theta$$

\pi_\theta(a|s) = \mathcal{N}\Big(a\;\big|\;\mu_\theta(s), \;\sigma^2 \Big)

### Objective:

\mathbb{E}_{\tau \sim \pi_\theta} \sum_{t=0}^T\Big[ r(s_t, a_t) + \mathcal{H}\big(\pi_\theta(\cdot|s_t)\big) \Big] \rightarrow \max_\theta

### For gradients, use the log-derivative trick:

\sum_{t=0}^T\mathbb{E}_{(s_t,a_t) \sim q_\theta} \Big[ \nabla_\theta \log\pi_\theta(a_t|s_t) \sum_{t'=t}^T\Big( r(s_{t'}, a_{t'}) -\log\pi_\theta(a_{t'}|s_{t'}) - b(s_{t'}) \Big)\Big]

# Soft Q-learning

### Train Q-network with parameters $$\phi$$ (use a replay buffer $$\mathcal{D}$$):

\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\phi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi

### where

V^{soft}_\phi(s_t) =\log \int \exp Q^{soft}_\phi(s_t, a_t) da_t

### Policy is implicit

\pi(a_t|s_t) = \exp\big(Q^{soft}_\phi(s_t, a_t) - V^{soft}_\phi(s_t)\big)
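
As a rough illustration of the loss above, here is a sketch with discrete actions, where the integral becomes a sum and a lookup table stands in for the Q-network. Everything here is an assumption made for the example: the table `q_table`, the batch layout, and the absence of terminal-state handling. A real implementation would also take gradient steps, which are omitted.

```python
import numpy as np

def soft_value(q):
    """V(s) = log sum_a exp Q(s, a), computed stably; q has shape (batch, A)."""
    m = q.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(q - m).sum(axis=1, keepdims=True)))[:, 0]

def soft_q_loss(q_table, batch):
    """Mean squared soft Bellman residual on a batch of (s, a, r, s_next) arrays."""
    s, a, r, s_next = batch
    target = r + soft_value(q_table[s_next])   # r + V_soft(s')
    residual = q_table[s, a] - target
    return float(np.mean(residual ** 2))

def implicit_policy(q_table):
    """pi(a | s) = exp(Q(s, a) - V(s)) for every state in the table."""
    return np.exp(q_table - soft_value(q_table)[:, None])

# Hypothetical table with 2 states and 2 actions, and a one-transition batch.
q_table = np.zeros((2, 2))
batch = (np.array([0]), np.array([1]), np.array([1.0]), np.array([1]))
loss = soft_q_loss(q_table, batch)
pi = implicit_policy(q_table)
```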

# Soft Actor-Critic

### Train $$Q^{soft}_\phi$$ and $$V^{soft}_\psi$$ networks jointly with policy $$\pi_\theta$$

### Q-network loss:

\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\psi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi

### V-network loss:

\hat{V}^{soft}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) - \log\pi_\theta(a_t|s_t) \Big]
\mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \hat{V}^{soft}(s_t) - V^{soft}_\psi(s_{t}) \Big]^2 \rightarrow \min_\psi

### Objective for the policy:

\mathbb{E}_{s_t \sim \mathcal{D}, \;a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) -\log\pi_\theta(a_{t}|s_{t})\Big] \rightarrow \max_\theta

# All is good. Stop?

### Let us discuss a simple Multi-Armed Bandit problem

\mathcal{S} = \emptyset
\mathcal{A} = \{1, \;2, \;\dots, \;N\}

(Figure: a row of slot machines, Bandit №1, Bandit №2, Bandit №3, ..., Bandit №N)

### We can model it via sampling MDP from some prior distribution over MDPs:

\mathcal{M} = \{ M^+, M^-\}

### Sample $$M \sim \mathcal{M}$$ and learn in $$L$$ episodes of interaction

| REWARD | $$a = 1$$ | $$a = 2$$ | $$a = 3$$ | ... | $$a = N$$ |
|---|---|---|---|---|---|
| $$M^+$$ | $$1$$ | $$2$$ | $$1-\epsilon$$ | $$1-\epsilon$$ | $$1-\epsilon$$ |
| $$M^-$$ | $$1$$ | $$-2$$ | $$1-\epsilon$$ | $$1-\epsilon$$ | $$1-\epsilon$$ |

# Multi-Armed Bandit example

### How Soft Q-learning will deal with it?

### Let us compute $$Q^{soft}$$ and $$V^{soft}$$:

V^{soft}(s_t) =\log \sum_{a_t} \exp Q^{soft}(s_t, a_t)
Q^{soft}(s_t, a_t) = \mathbb{E}[R(s_t, a_t)] + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^{soft}(s_{t+1})

### For the bandit (no states, one step per episode):

Q^{soft}(a) = \mathbb{E}[R(a)]
V^{soft} =\log \sum_a \exp Q^{soft}(a)

# Multi-Armed Bandit example

### With $$M^+$$ and $$M^-$$ equally likely, the expected rewards give $$Q^{soft}$$:

| | $$a = 1$$ | $$a = 2$$ | $$a = 3$$ | ... | $$a = N$$ |
|---|---|---|---|---|---|
| $$Q^{soft}$$ | $$1$$ | $$0$$ | $$1-\epsilon$$ | $$1-\epsilon$$ | $$1-\epsilon$$ |

### Resulting $$V^{soft}$$ and the probability $$\pi(2)$$ of pulling arm 2:

| $$N$$ | $$V^{soft}$$ | $$\pi(2)$$ |
|---|---|---|
| $$3$$ | $$1.86$$ | $$0.16$$ |
| $$10$$ | $$3.23$$ | $$0.04$$ |
| $$100$$ | $$5.59$$ | $$0.004$$ |
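
The numbers above can be reproduced with a few lines. The slides do not state the value of the small constant in the reward table, so the sketch below assumes 0.01, and it assumes the two MDPs are equally likely; both are assumptions made only for this illustration.

```python
import numpy as np

def soft_bandit(n_arms, eps):
    """Soft value and implicit policy for the bandit, assuming a uniform prior over M+ and M-."""
    q = np.full(n_arms, 1.0 - eps)     # arms 3..N pay 1 - eps in both MDPs
    q[0] = 1.0                         # arm 1 pays 1 in both MDPs
    q[1] = 0.5 * 2.0 + 0.5 * (-2.0)    # arm 2 pays +2 in M+ and -2 in M-
    v = float(np.log(np.sum(np.exp(q))))
    pi = np.exp(q - v)                 # pi(a) = exp(Q_soft(a) - V_soft)
    return v, pi

# eps = 0.01 is an assumed value, not taken from the slides.
results = {n: soft_bandit(n, eps=0.01) for n in (3, 10, 100)}
```

The probability of ever trying arm 2, the only arm whose reward reveals which MDP was sampled, shrinks as the number of arms grows.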

# Reminder

### Regret of an algorithm $$\text{alg}$$ after $$L$$ episodes of interaction with an MDP $$M$$:

\text{Regret}(M, \text{alg}, L) = \mathbb{E}_{\tau \sim M, \;\text{alg}} \Bigg[\sum_{l=0}^L \Bigg( V^\star(s_0^l) - \sum_{t=0}^T r(s_t^l, a_t^l) \Bigg) \Bigg]

### Bayes regret over a family of MDPs $$\mathcal{M}$$ (with associated probabilities $$\phi$$ of being in a concrete MDP $$M$$):

\text{BayesRegret}(\phi, \text{alg}, L) = \mathbb{E}_{M \sim \mathcal{M}} \;\text{Regret}(M, \text{alg}, L)

### Worst-case regret:

\text{WorstCaseRegret}(\mathcal{M}, \text{alg}, L) = \max_{M \in \mathcal{M}}\;\text{Regret}(M, \text{alg}, L)

# K-learning

### A utility function $$u(X)$$ describes how the agent values a random payoff $$X$$. We will discuss the exponential family of utility functions:

u(X) = \tau \big( \exp(X / \tau) - 1 \big)

### Certainty Equivalent Value is the amount of guaranteed payoff that the agent values the same as the random one:

C^X(\tau) = u^{-1}(\mathbb{E}u(X)) = \tau \log \mathbb{E} \exp(X/\tau)

### For exponential utility functions, certainty equivalent values are closely related to the Cumulant Generating Function $$G^X(\beta) = \log \mathbb{E} \exp(\beta X)$$ of a r.v.:

C^X(\tau) = \tau G^X(1/\tau)
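
To get a feel for the certainty equivalent value, it can be computed for a discrete random variable. The sketch below uses the payoff of arm 2 from the bandit example with equal probabilities, which is an assumption for illustration; the function name is hypothetical.

```python
import numpy as np

def certainty_equivalent(values, probs, tau):
    """C^X(tau) = tau * log E exp(X / tau) for a discrete random variable X."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    m = values.max()   # subtract the max for numerical stability
    return float(m + tau * np.log(np.sum(probs * np.exp((values - m) / tau))))

values, probs = [2.0, -2.0], [0.5, 0.5]   # payoff +2 or -2 with equal probability
ce = {tau: certainty_equivalent(values, probs, tau) for tau in (0.1, 1.0, 100.0)}
```

For large temperature the value approaches the mean payoff (0 here), and for small temperature it approaches the best-case payoff (2 here), so the temperature controls how optimistic the agent is.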

# K-learning

### The certainty equivalent value $$C^{\star|t}$$ of the posterior over the optimal Q-values satisfies the following Bellman inequality

C^{\star|t}_{s_l, a_l} \le \tilde G^{\mu|t}_{s_l, a_l}(1/\tau_t)\; +\; \sum_{s_{l+1}} \mathbb{E}^t ( P_{s_{l+1}, s_l, a_l}) \tau_t \log \sum_{a_{l+1}} \exp \big( C^{\star|t}_{s_{l+1}, a_{l+1}} / \tau_t \big)

### or, similarly

C^{\star|t}_{l} \le \mathcal{B}(\tau_t, C^{\star|t}_{l+1})

### where

\tilde G^{\mu|t}_{s_l, a_l}(\beta) = G^{\mu|t}_{s_l, a_l}(\beta) + \frac{(L-l)^2 \beta^2}{2(n^t_{s_l, a_l}+1)}

# K-learning

### The solution of the corresponding Bellman equation is called the K-value:

K^t_{l} = \mathcal{B}(\tau_t, K^t_{l+1})

### And we define policy as follows:

\pi(a_l|s_l) \propto \exp \big( K^t_{s_l, a_l} / \tau_t \big)

# References

Soft Q-learning:

https://arxiv.org/pdf/1702.08165.pdf

Soft Actor Critic:

https://arxiv.org/pdf/1801.01290.pdf

Big Review on Probabilistic Inference for RL:

https://arxiv.org/pdf/1805.00909.pdf

Implementation on TensorFlow:

https://github.com/rail-berkeley/softlearning

Implementation on Catalyst.RL:

https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym