speaker: Pavel Temirchev
RL as Probabilistic Inference
What we will NOT discuss today?
- Use of Bayesian Neural Networks (BNNs) for exploration
- Bayesian model ensembling for model-based RL
- Distributional RL (hey, we already discussed it)
- A lot of other interesting things...
What we WILL discuss
How to treat the RL problem as a probabilistic inference problem?
\mathbb{E}_\pi \sum_{t=0}^T r(s_t, a_t) \rightarrow \max_\pi
Standard RL: optimization
Probabilistic Inference
p(A|B) = \;?
\pi(a_t| s_t, \pi \;\text{is optimal})
maybe something like this will do...
Why will we discuss it?
- Treating RL as inference lets us use effective inference tools for solving RL problems. We can develop new algorithms.
- Bayesians always try to generalize others' ideas.
- As we will see, inference has a close connection to Maximum Entropy RL - maybe it will help to improve exploration!
Background
Probabilistic Graphical Models
Generally, a joint probability distribution
p(a, b, c, d, e)
can be factorized as follows:
p(a, b, c, d, e) = p(a|b, c, d, e) p(b|c, d, e) p(c|d, e) p(d|e) p(e)
A graphical representation of a probabilistic model can help to embed structure into the model:
[Graph: nodes a, b, c, d, e with edges a → c, b → c, c → d, c → e]
p(a, b, c, d, e) = p(a) p(b) p(c|a, b) p(d|c) p(e|c)
Background
Inference on PGMs
Inference:
p(z_1) = \int p(z_1|x_1) p(x_1) dx_1
p(x_2) = \int p(x_2|x_1) p(x_1) dx_1
A graphical representation can make probabilistic inference easier.
There are many algorithms for exact and approximate inference on PGMs.
We will discuss a very simple example:
the Message Passing Algorithm on trees.
Question:
p(z_l) = \;? \;\;\; \forall l
Model:
p(x_{0:L}, z_{0:L}) = p(x_0)p(z_0|x_0)\prod_l p(x_l|x_{l-1})p(z_l|x_l)
[Chain PGM: x_0 → x_1 → … → x_L, with each z_l generated from x_l]
p(z_0) = \int p(z_0|x_0) p(x_0) dx_0
p(x_1) = \int p(x_1|x_0) p(x_0) dx_0
p(z_l) = \int p(z_l|x_l) p(x_l) dx_l
p(x_{l+1}) = \int p(x_{l+1}|x_l) p(x_l) dx_l
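To make the recursions concrete, here is a minimal NumPy sketch (not part of the original slides) for a discrete chain; the chain length, state sizes, and the matrices trans_x and emit_z are made-up assumptions.

import numpy as np

# Minimal sketch of the forward recursions above for a discrete chain.
L = 5                                    # chain length (assumption)
n_x, n_z = 3, 4                          # number of states for x_l and z_l (assumption)
rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

p_x0 = np.full(n_x, 1.0 / n_x)           # p(x_0)
trans_x = random_stochastic(n_x, n_x)    # trans_x[i, j] = p(x_{l+1}=j | x_l=i)
emit_z = random_stochastic(n_x, n_z)     # emit_z[i, k]  = p(z_l=k | x_l=i)

p_x = p_x0
for l in range(L + 1):
    p_z = p_x @ emit_z                   # p(z_l) = sum_x p(z_l|x_l) p(x_l)
    print(f"p(z_{l}) =", np.round(p_z, 3))
    p_x = p_x @ trans_x                  # p(x_{l+1}) = sum_x p(x_{l+1}|x_l) p(x_l)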
Background
Bayes' Rule and Kullback-Leibler divergence (KL)
Bayes' Rule allows us to calculate the posterior distribution of a r.v.
given new data and a prior distribution.
The Kullback-Leibler divergence is a measure of how one distribution differs from another, reference, distribution (it is not symmetric):
p(z|x) = \frac{p(x|z) p(z)}{p(x)} = \frac{p(x|z) p(z)}{\int p(x|z)p(z)dz}
\text{KL}\Big( q(x)\; \big|\big|\; p(x) \Big) = \mathbb{E}_{x \sim q} \Big[ \log \frac{q(x)}{p(x)} \Big]
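A tiny NumPy illustration (my addition, with made-up numbers) of the asymmetry of the KL divergence between two discrete distributions:

import numpy as np

q = np.array([0.7, 0.2, 0.1])
p = np.array([0.4, 0.4, 0.2])

kl_qp = np.sum(q * np.log(q / p))   # E_{x~q}[log q(x) - log p(x)]
kl_pq = np.sum(p * np.log(p / q))
print(kl_qp, kl_pq)                 # the two values differ: KL is not symmetric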
Background
Approximate probabilistic inference: Variational Inference (VI)
When applying Bayes' rule, a common situation
is that the evidence term is intractable:
p(x) = \int p(x|z)p(z)dz
Hence, the exact posterior p(z|x) is intractable!
One way to go is to use an approximate inference procedure called Variational Inference.
We want to minimize the dissimilarity between the true posterior p(z|x)
and our approximation - the variational distribution q(z).
The search is over the chosen family of variational distributions q \in \mathcal{Q}:
\text{KL}\Big( q(z)\; \big|\big|\; p(z|x) \Big) \rightarrow \min_{q \in \mathcal{Q}}
Background
Approximate probabilistic inference: Variational Inference (VI)
It can be shown that the defined minimization problem is closely related
to the maximization of a lower bound on the evidence
p(x) = \int p(x|z)p(z)dz
We can rewrite the logarithm of the evidence as follows:
\log p(x) = \text{KL}\Big( q(z)\; \big|\big|\; p(z|x) \Big) + \mathcal{L}(q) \quad (1)
where \mathcal{L}(q) is the so-called Evidence Lower Bound Objective (ELBO):
\mathcal{L}(q) = - \mathbb{E}_q \Big[\log q(z) - \log p(x, z) \Big]
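For completeness, here is a short derivation of (1) (my own sketch, not on the original slide), using p(z|x) = p(x,z)/p(x):
\text{KL}\big(q(z)\,||\,p(z|x)\big) = \mathbb{E}_q\Big[\log\frac{q(z)}{p(z|x)}\Big] = \mathbb{E}_q\big[\log q(z) - \log p(x,z)\big] + \log p(x) = -\mathcal{L}(q) + \log p(x)
Rearranging gives \log p(x) = \text{KL}\big(q(z)\,||\,p(z|x)\big) + \mathcal{L}(q), which is exactly (1).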
The LHS of (1) is independent of q, whereas each term on the RHS depends on q.
Hence, the minimization of \text{KL} is equivalent to the maximization of the ELBO \mathcal{L}(q).
It is your choice: either you want to minimize
\text{KL}\big( q\; ||\; p \big) \rightarrow \min_{q \in \mathcal{Q}}
or you want to maximize
\mathcal{L}(q) \rightarrow \max_{q \in \mathcal{Q}}
Background
RL Basics
Trajectory distribution for a Markov Decision Process:
p(\tau) = p(s_0) \prod_{t=0}^T p(a_t|s_t) p(s_{t+1}|s_t, a_t)
Maximization problem:
\pi^\star = \arg\max_\pi \sum_{t=0}^T \mathbb{E}_{s_t, a_t \sim \pi} [r(s_t, a_t)]
Value functions (defined for a policy):
Q^\pi(s_t,a_t) := r(s_t,a_t) + \sum_{t'=t+1}^T \mathbb{E}_{s_{t'}, a_{t'} \sim \pi} [r(s_{t'}, a_{t'})]
V^\pi(s_t) = \mathbb{E}_a Q^\pi(s_{t}, a)
Bellman Optimality operator:
Q^\star(s_t,a_t) = r(s_t,a_t) + \mathbb{E}_{s_{t+1}} V^\star(s_{t+1})
V^\star(s_t) = \max_a Q^\star(s_{t}, a)
Generally, the reward is a random variable:
r(s_t,a_t) = \mathbb{E} \big[ R(s_t, a_t) \big]
Probabilistic Graphical Model for an MDP:
[PGM: nodes s_0, a_0, s_1, a_1, s_2, a_2; each a_t depends on s_t, and s_{t+1} depends on (s_t, a_t)]
A heuristic for better exploration
Maximum entropy RL



Standard Policy Gradient:
a_t \sim \mathcal{N}(\cdot| \pi^\star, \sigma^2)
Policy "proportional" to Q:
a_t \sim \exp{Q(s_t, a_t)}
How to find such a policy?
\min_\pi\text{KL}\Big(\pi(\cdot|s_0)||\exp{Q(s_0, \cdot)}\Big) =
\max_\pi \mathbb{E}_\pi \Big[ Q(s_0, a_0) - \log \pi(a_0|s_0) \Big] =
\max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) {\color{pink}+ \mathcal{H} \big( \pi(\cdot|s_0) \big)}\Big]
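A short justification of the first equality (my own sketch; Z = \int \exp Q(s_0, a)\,da denotes the normalizing constant of \exp Q(s_0,\cdot), which the slide leaves implicit):
\text{KL}\Big(\pi(\cdot|s_0)\,\Big|\Big|\,\tfrac{1}{Z}\exp Q(s_0,\cdot)\Big) = \mathbb{E}_\pi\big[\log\pi(a_0|s_0) - Q(s_0,a_0)\big] + \log Z
Since \log Z does not depend on \pi, minimizing this KL over \pi is the same as maximizing \mathbb{E}_\pi\big[Q(s_0,a_0) - \log\pi(a_0|s_0)\big] = \mathbb{E}_\pi\big[Q(s_0,a_0)\big] + \mathcal{H}\big(\pi(\cdot|s_0)\big).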
[Figure: Q^\star(s_0, \cdot) over actions a_0 (go left / go right), compared with the policies \mathcal{N}(\cdot|\arg\max Q^\star, \sigma^2) and \exp Q^\star(s_0, \cdot)]
It is very similar to the heuristic Maximum Entropy RL objective:
\max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) {\color{pink}+ \mathcal{H} \big( \pi(\cdot|s_t) \big)}\Big]
During the lecture we will derive a probabilistic model, inference in which results in the Maximum Entropy RL objective.
RL as Probabilistic Inference
Graphical Model with Optimality variables
What if we had binary optimality variables?
Let us look at the PGM for an MDP:
[PGM: states s_0, s_1, s_2 and actions a_0, a_1, a_2, with an optimality variable \mathcal{O}_t attached to each (s_t, a_t) pair]
If \mathcal{O}_t = 1, then timestep t was optimal.
p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t)
Probability that the pair (s_t, a_t) is optimal: p(\mathcal{O}_t |s_t, a_t)
But how should we define this probability?
Use exponentiation. Exponents are good.
p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t) = \exp\big(r(s_t,a_t)\big)
Let us analyze the distribution of trajectories conditioned on optimality:
p(\tau|\mathcal{O}_{0:T}) \propto p(\tau,\mathcal{O}_{0:T}) = p(s_0)\prod_{t=0}^Tp(a_t|s_t)p(s_{t+1}|s_t,a_t) \exp\big(r(s_t, a_t)\big)
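As an illustration (my own sketch, not from the slides), one can approximate p(\tau|\mathcal{O}_{0:T}) by sampling trajectories from the prior p(\tau) with a uniform policy and reweighting each trajectory by \exp\big(\sum_t r(s_t,a_t)\big); the random tabular MDP below is a made-up assumption.

import numpy as np

# Draw trajectories from the prior p(tau), then reweight by exp(sum of rewards).
rng = np.random.default_rng(0)
n_states, n_actions, T, n_traj = 4, 2, 5, 1000
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)

returns = np.zeros(n_traj)
for i in range(n_traj):
    s = 0
    for t in range(T + 1):
        a = rng.integers(n_actions)              # prior (uniform) policy p(a_t|s_t)
        returns[i] += R[s, a]
        s = rng.choice(n_states, p=P[s, a])      # prior dynamics p(s_{t+1}|s_t,a_t)

w = np.exp(returns - returns.max())              # proportional to exp(sum_t r(s_t,a_t))
w /= w.sum()                                     # self-normalized weights for p(tau|O_{0:T})
print("effective sample size:", 1.0 / np.sum(w ** 2))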
RL as Probabilistic Inference
Exact inference for Optimal actions
We can now infer actions conditioned on optimality - the optimal policy:
p(a_t|s_t, \mathcal{O}_{0:T}) = p(a_t|s_t, \mathcal{O}_{t:T}) \quad (*)
(*): a_t is conditionally independent of \mathcal{O}_{0:t-1} given s_t due to the structure of the PGM
= \frac{ p(s_t, a_t|\mathcal{O}_{t:T})}{ p(s_t|\mathcal{O}_{t:T})}
let's apply Bayes' rule!
= \frac{p(\mathcal{O}_{t:T}|s_t, a_t) p(a_t|s_t) p(s_t)}{p(\mathcal{O}_{t:T})}\frac{p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}|s_t) p(s_t)}
here p(a_t|s_t) is some prior (non-informative) policy
If we set p(a_t|s_t) = \frac{1}{|\mathcal{A}|}, then the optimal policy is the following:
p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)}
Exact inference for optimal actions
Message Passing Algorithm
We want to compute
p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)}
i.e. both p(\mathcal{O}_{t:T}|s_t, a_t) and p(\mathcal{O}_{t:T}|s_t) for all 0 \le t \le T.
Let's introduce new notation:
\alpha_t(s_t, a_t) := p(\mathcal{O}_{t:T}|s_t, a_t)
\beta_t(s_t) := p(\mathcal{O}_{t:T}|s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t
We can find all the \alpha_t and \beta_t via the Message Passing algorithm.
For the timestep T:
\alpha_T(s_T, a_T) = \exp(r(s_T, a_T))
\beta_T(s_T) = \int \alpha_T(s_T, a_T) p(a_T|s_T)da_T
Recursively:
\alpha_t(s_t, a_t) = \int \beta_{t+1}(s_{t+1}) \exp(r(s_t, a_t)) p(s_{t+1}|s_t, a_t)ds_{t+1}
\beta_t(s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t
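A minimal tabular NumPy sketch of this backward message passing (my addition; the random MDP, horizon, and uniform prior policy are assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, T = 4, 2, 10
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a)
R = rng.normal(size=(n_s, n_a))                    # r(s,a)
prior = np.full(n_a, 1.0 / n_a)                    # p(a_t|s_t) = 1/|A|

alpha = np.zeros((T + 1, n_s, n_a))                # alpha_t(s,a) = p(O_{t:T}|s,a)
beta = np.zeros((T + 1, n_s))                      # beta_t(s)   = p(O_{t:T}|s)
alpha[T] = np.exp(R)
beta[T] = alpha[T] @ prior
for t in range(T - 1, -1, -1):
    alpha[t] = np.exp(R) * (P @ beta[t + 1])       # sum_{s'} beta_{t+1}(s') p(s'|s,a)
    beta[t] = alpha[t] @ prior                     # sum_a alpha_t(s,a) p(a|s)

policy0 = alpha[0] * prior / beta[0][:, None]      # p(a_0|s_0, O_{0:T}), rows sum to 1
print(np.round(policy0, 3))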
Introducing Q^{soft} and V^{soft} functions
Log-scale messages
We can find analogues in the log-scale:
Q^{soft}(s_t, a_t) := \log\alpha_t(s_t, a_t)
V^{soft}(s_t) := \log\beta_t(s_t)
Substituting into the recursive relations, we obtain the following:
V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
soft maximum: it approximates the hard maximum as Q^{soft}(s_t, a_t) \rightarrow \infty
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]
kinda a Bellman equation
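A tiny numerical check (my addition, with made-up numbers) that \log\mathbb{E}_{p(a|s)}[\exp Q] behaves like a soft maximum and approaches the hard maximum as the scale of Q grows:

import numpy as np
from scipy.special import logsumexp

Q = np.array([1.0, 2.0, 3.5])
prior = np.full(len(Q), 1.0 / len(Q))      # uniform p(a|s)
for scale in [1, 10, 100]:
    soft = logsumexp(scale * Q, b=prior) / scale   # (1/scale) log E[exp(scale*Q)]
    print(scale, soft, Q.max())                    # soft -> 3.5 as scale grows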
Compare (Q^\star, V^\star) with (Q^{soft}, V^{soft})
Hard approach vs. Soft approach
"Hard" Q^\star and V^\star functions:
V^\star(s_t) =\max_{a_t} Q^\star(s_t, a_t)
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^\star(s_{t+1})
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1})
"Soft" analogues:
V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]
Q^{soft}(s_t, a_t) \approx r(s_t, a_t) + \max_{s_{t+1}} \max_{a_{t+1}} Q^{soft}(s_{t+1}, a_{t+1})
Note the extra \max_{s_{t+1}}: why are we so optimistic?
What we have done is the inference of the policy term p(a_t|s_t,\mathcal{O}_{t:T}),
which was taken from the formula for the distribution of optimal trajectories:
p(\tau|\mathcal{O}_{0:T}) = p(s_0|\mathcal{O}_{0:T})\prod_{t=0}^T p(a_t|s_t,\mathcal{O}_{t:T})\, p(s_{t+1}|s_t,a_t,\mathcal{O}_{t+1:T})
But what are the neighbors of the policy?
This policy is optimal only in the presence of the optimal dynamics p(s_{t+1}|s_t,a_t,\mathcal{O}_{t+1:T})!
Can we fix it?
Variational Inference
Approximate inference for achievable trajectories via VI
The trajectories \tau \sim p(\tau|\mathcal{O}_{0:T}) are not really achievable,
since they are based on the optimistic dynamics p(s_{t+1}|s_t, a_t, \mathcal{O}_{t+1:T}).
Our policy \pi, however, will be exploited with the prior dynamics:
q(\tau) = p(s_0)\prod_{t=0}^T \pi(a_t|s_t)p(s_{t+1}|s_t,a_t)
And we want the policy \pi to produce trajectories \tau \sim q(\tau)
which are as close as possible to the optimal trajectories \tau \sim p(\tau|\mathcal{O}_{0:T}).
This is a Variational Inference problem:
\text{KL}\big(q(\tau)\;||\;p(\tau|\mathcal{O}_{0:T})\big) \rightarrow \min_\pi
Variational Inference
Approximate inference for achievable trajectories via VI
Let us expand the VI objective using the definition of the KL divergence:
\min_\pi \text{KL}\big(q(\tau)||p(\tau|\mathcal{O}_{0:T})\big) = -\max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)\;p(\mathcal{O}_{0:T})} \;\;\Leftrightarrow\;\; \max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)} + \text{const}
\max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)}= \max_\pi \mathbb{E}_q \Big[ \log p(s_0)+\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + r(s_t, a_t) \big) -
- \log p(s_0)-\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + \log \pi(a_t| s_t) \big) \Big]=
= \max_\pi \mathbb{E}_\pi \sum_{t}\Big[ r(s_t, a_t) + \mathcal{H}\big( \pi(\cdot| s_t)\big) \Big]
this is the Maximum Entropy RL Objective
Returning to Q^{soft} and V^{soft} functions
Risk-neutral Soft approach
The objective from the previous slide can be rewritten as follows (check it yourself!):
\sum_{t=0}^T\mathbb{E}_{s_t} \Big[ -\text{KL}\Big(\pi(a_t|s_t)||\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}\Big) + V^{soft}(s_t) \Big] \rightarrow \max_\pi
Hence, the optimal policy is:
\pi(a_t|s_t) =\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}
but with slightly changed Q^{soft} and V^{soft} functions:
V^{soft}(s_t) =\log \int \exp Q^{soft}(s_t, a_t) da_t
- a soft maximum
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^{soft}(s_{t+1})
- the normal Bellman equation
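A sketch of the "check it yourself" step for the last timestep t = T (my own derivation, not spelled out on the slide); with Q^{soft}(s_T, a_T) = r(s_T, a_T) and V^{soft}(s_T) = \log\int\exp Q^{soft}(s_T, a_T)\,da_T:
\mathbb{E}_{\pi}\big[r(s_T,a_T) - \log\pi(a_T|s_T)\big] = \mathbb{E}_{\pi}\Big[\log\frac{\exp Q^{soft}(s_T,a_T)}{\pi(a_T|s_T)}\Big] = -\text{KL}\Big(\pi(\cdot|s_T)\,\Big|\Big|\,\frac{\exp Q^{soft}(s_T,\cdot)}{\exp V^{soft}(s_T)}\Big) + V^{soft}(s_T)
Earlier timesteps follow by applying the same manipulation backwards in time, which is where the expectation over p(s_{t+1}|s_t, a_t) in Q^{soft} comes from.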
RL as Inference with function approximators
- Maximum Entropy Policy Gradients
- Soft Q-learning: https://arxiv.org/abs/1702.08165
- Soft Actor-Critic: https://arxiv.org/abs/1801.01290
Soft Q-learning
RL as Inference with function approximators
Train a Q-network with parameters \phi:
\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\phi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi
(use a replay buffer)
where
V^{soft}_\phi(s_t) =\log \int \exp Q^{soft}_\phi(s_t, a_t) da_t
(for continuous actions, use Importance Sampling)
The policy is implicit:
\pi(a_t|s_t) = \exp\big(Q^{soft}_\phi(s_t, a_t) - V^{soft}_\phi(s_t)\big)
(for samples, use Stein Variational Gradient Descent or MCMC :D)
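A rough PyTorch sketch of this loss (my own simplification, not the authors' code): the Q-network q_net is assumed to be an MLP over concatenated state-action inputs, the importance-sampling proposal is uniform over a bounded action box, and target networks are omitted.

import torch

def v_soft(q_net, states, n_samples=32, action_dim=2, action_scale=1.0):
    # V_soft(s) = log E_{a~u}[exp Q(s,a) / u(a)] via importance sampling,
    # with u uniform on [-action_scale, action_scale]^action_dim (assumption).
    B = states.shape[0]
    a = (torch.rand(B, n_samples, action_dim) * 2 - 1) * action_scale
    s = states.unsqueeze(1).expand(-1, n_samples, -1)
    q = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)          # Q_phi(s, a_i)
    log_u = -action_dim * torch.log(torch.tensor(2.0 * action_scale))
    return torch.logsumexp(q - log_u, dim=1) - torch.log(torch.tensor(float(n_samples)))

def soft_q_loss(q_net, batch):
    s, a, r, s_next = batch                                   # tensors from a replay buffer
    q = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    with torch.no_grad():
        target = r + v_soft(q_net, s_next)                    # r(s,a) + V_soft(s')
    return ((q - target) ** 2).mean()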
Soft Q-learning




Soft Actor-Critic
RL as Inference with function approximators
Train Q^{soft}_\phi and V^{soft}_\psi networks jointly with the policy \pi_\theta.
Q-network loss:
\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\psi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi
V-network loss:
\hat{V}^{soft}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) - \log\pi_\theta(a_t|s_t) \Big]
\mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \hat{V}^{soft}(s_t) - V^{soft}_\psi(s_{t}) \Big]^2 \rightarrow \min_\psi
Objective for the policy:
\mathbb{E}_{s_t \sim \mathcal{D}, \;a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) -\log\pi_\theta(a_t|s_{t})\Big] \rightarrow \max_\theta
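A rough PyTorch sketch of the three losses above (my own simplification, not the authors' implementation): q_net, v_net, and policy.sample_with_log_prob are assumed interfaces, and the target V-network and twin Q-networks from the paper are omitted.

import torch

def sac_losses(q_net, v_net, policy, batch):
    s, a, r, s_next = batch                            # tensors sampled from the replay buffer D

    # Q-network loss: [Q_phi(s,a) - (r + V_psi(s'))]^2
    q = q_net(s, a)
    with torch.no_grad():
        q_target = r + v_net(s_next)
    q_loss = ((q - q_target) ** 2).mean()

    # V-network loss: [V_psi(s) - E_{a~pi}[Q_phi(s,a) - log pi(a|s)]]^2
    a_pi, log_pi = policy.sample_with_log_prob(s)      # assumed policy API
    with torch.no_grad():
        v_target = q_net(s, a_pi) - log_pi
    v_loss = ((v_net(s) - v_target) ** 2).mean()

    # Policy objective (negated to get a loss): E[Q_phi(s,a) - log pi(a|s)]
    a_pi, log_pi = policy.sample_with_log_prob(s)      # reparameterized sample
    policy_loss = (log_pi - q_net(s, a_pi)).mean()

    return q_loss, v_loss, policy_loss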
Soft Actor-Critic
References
Soft Q-learning:
https://arxiv.org/pdf/1702.08165.pdf
Soft Actor Critic:
https://arxiv.org/pdf/1801.01290.pdf
Big Review on Probabilistic Inference for RL:
https://arxiv.org/pdf/1805.00909.pdf
Implementation in TensorFlow:
https://github.com/rail-berkeley/softlearning
Implementation in Catalyst.RL:
https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym
Hierarchical policies (further reading):
Thank you for your attention!