Reinforcement Learning as Probabilistic Inference

Report prepared by Pavel Temirchev

Deep RL reading group

Based on the research of Sergey Levine's team

Motivation

The problems of standard RL:

1) Sample complexity
2) Convergence to local optima

Idea: encourage the agent to investigate all of the promising strategies!

REMINDER: standard RL

Markov process:

[Graphical model of the trajectory: s_0, a_0, s_1, a_1, s_2, a_2, ...]

p(\tau) = p(s_0) \prod_{t=0}^T p(a_t|s_t) p(s_{t+1}|s_t, a_t)

Trajectory:

\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)

Maximization problem:

\pi^\star = \arg\max_\pi \sum_{t=0}^T \mathbb{E}_{s_t, a_t \sim \pi} [r(s_t, a_t)]

Q-function:

Q^\pi(s_t,a_t) := r(s_t,a_t) + \sum_{t'=t+1}^T \mathbb{E}_{s_{t'}, a_{t'} \sim \pi} [r(s_{t'}, a_{t'})]

Bellman equality (optimal Q-function):

Q^\star(s_t,a_t) = r(s_t,a_t) + \mathbb{E}_{s_{t+1}} V^\star(s_{t+1})
V^\star(s_t) = \max_a Q^\star(s_{t}, a)
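To make the reminder concrete, here is a minimal tabular finite-horizon value-iteration sketch of the hard Bellman backup; the toy MDP arrays below are made up for illustration, not taken from the slides:

import numpy as np

# Hypothetical toy MDP (made up for illustration): S states, A actions, horizon T.
S, A, T = 4, 2, 10
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = p(s'|s, a)
R = rng.normal(size=(S, A))                  # R[s, a] = r(s, a)

# Hard Bellman backup: Q*(s,a) = r(s,a) + E_{s'}[ max_{a'} Q*(s',a') ]
Q = np.zeros((S, A))
for _ in range(T):                           # backwards over the finite horizon
    V = Q.max(axis=1)                        # V*(s) = max_a Q*(s, a)
    Q = R + P @ V                            # expectation over next states

print("greedy policy:", Q.argmax(axis=1))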

Maximum Entropy RL

Standard RL picks the single best action (possibly with Gaussian exploration noise):

a_t \sim \mathcal{N}(\cdot| \pi^\star(s_t), \sigma^2)

Maximum Entropy RL instead makes the policy "proportional" to Q:

\pi(a_t|s_t) \propto \exp{Q(s_t, a_t)}

How to find such a policy?

\min_\pi\text{KL}\Big(\pi(\cdot|s_0)||\exp{Q(s_0, \cdot)}\Big) =
= \max_\pi \mathbb{E}_\pi \Big[ Q(s_0, a_0) - \log \pi(a_0|s_0) \Big] =
= \max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) + \mathcal{H} \big( \pi(\cdot|s_t) \big)\Big]
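As a sanity check of the first equality, write out the KL with the normalizer Z(s_0) = \int \exp Q(s_0, a)\,da, which the slide leaves implicit:

\text{KL}\Big(\pi(\cdot|s_0)\,||\,\tfrac{1}{Z(s_0)}\exp Q(s_0,\cdot)\Big) = \mathbb{E}_{a_0 \sim \pi}\big[\log\pi(a_0|s_0) - Q(s_0,a_0)\big] + \log Z(s_0)

Since \log Z(s_0) does not depend on \pi, minimizing the KL is the same as maximizing \mathbb{E}_\pi[Q(s_0,a_0)] + \mathcal{H}(\pi(\cdot|s_0)).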

RL as Probabilistic Inference

[Graphical model: states s_t and actions a_t as before, with an optimality variable \mathcal{O}_t attached to each timestep]

Optimality variables:

p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t) = \exp(r(s_t, a_t))

RL asks: which actions will lead us to the optimal future?

Probabilistic inference asks: which actions were taken, given that the future is optimal?

p(a_t|s_t, \mathcal{O}_{t:T})

Exact Probabilistic Inference

Let's find the optimal policy: apply Bayes' rule!

p(a_t|s_t, \mathcal{O}_{t:T}) = \frac{ p(s_t, a_t|\mathcal{O}_{t:T})}{ p(s_t|\mathcal{O}_{t:T})} =
= \frac{p(\mathcal{O}_{t:T}|s_t, a_t) p(a_t|s_t) p(s_t)}{p(\mathcal{O}_{t:T})} \cdot \frac{p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}|s_t) p(s_t)}

where p(a_t|s_t) is the prior policy.

If the prior is uniform, p(a_t|s_t) = \frac{1}{|\mathcal{A}|}, then

p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)}

Let's introduce new notation (backward messages):

\alpha_t(s_t, a_t) := p(\mathcal{O}_{t:T}|s_t, a_t)
\beta_t(s_t) := p(\mathcal{O}_{t:T}|s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t

We can find all the \alpha_t and \beta_t via a message-passing algorithm.

For the last timestep T:

\alpha_T(s_T, a_T) = \exp(r(s_T, a_T))
\beta_T(s_T) = \int \alpha_T(s_T, a_T) p(a_T|s_T)da_T

Recursively, backwards in time:

\alpha_t(s_t, a_t) = \int \beta_{t+1}(s_{t+1}) \exp(r(s_t, a_t)) p(s_{t+1}|s_t, a_t)ds_{t+1}
\beta_t(s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t
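A minimal sketch of this backward message passing for a small discrete MDP; the toy arrays and the uniform prior below are made up for illustration, not taken from the slides:

import numpy as np

# Hypothetical toy MDP (made up for illustration).
S, A, T = 4, 2, 10
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] = p(s'|s, a)
R = rng.normal(size=(S, A))                   # R[s, a] = r(s, a)
prior = np.full(A, 1.0 / A)                   # uniform prior policy p(a|s)

alpha = np.zeros((T + 1, S, A))               # alpha_t(s, a) = p(O_{t:T} | s, a)
beta = np.zeros((T + 1, S))                   # beta_t(s)     = p(O_{t:T} | s)

# Last timestep
alpha[T] = np.exp(R)
beta[T] = alpha[T] @ prior

# Backward recursion
for t in range(T - 1, -1, -1):
    alpha[t] = np.exp(R) * (P @ beta[t + 1])  # integrate beta_{t+1} over next states
    beta[t] = alpha[t] @ prior                # integrate alpha_t over actions

# Optimal policy: p(a|s, O_{t:T}) = alpha_t(s, a) p(a|s) / beta_t(s)
policy_0 = alpha[0] * prior / beta[0][:, None]
print(policy_0.sum(axis=1))                   # each row sums to 1

In practice these products of exponentiated rewards under- or overflow quickly, which is exactly why the next slide moves to the log scale.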


Exact Probabilistic Inference

Soft Q and V functions

We can find analogues of Q and V on the log scale:

Q^{soft}(s_t, a_t) := \log\alpha_t(s_t, a_t)
V^{soft}(s_t) := \log\beta_t(s_t)

Recursively:

V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]

(a soft maximum over actions)

Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]

(kind of a Bellman equation)
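A tiny numeric illustration of why \log \mathbb{E}[\exp(\cdot)] behaves like a soft maximum; the Q-values below are made up:

import numpy as np

# Made-up Q-values for one state with 4 actions, uniform prior p(a|s) = 1/4.
q = np.array([1.0, 2.0, 3.0, 10.0])
soft_v = np.log(np.mean(np.exp(q)))   # log E_{uniform}[exp Q], a "soft max"

print(q.max(), round(soft_v, 2))      # 10.0 vs 8.62: dominated by the largest Q-value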

Soft and Hard Q and V functions

"Hard" Q and V functions:

V^\star(s_t) =\max_{a_t} Q^\star(s_t, a_t)
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^\star(s_{t+1})
V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]

"Soft" analogues:

Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1})
Q^{soft}(s_t, a_t) \approx r(s_t, a_t) + \max_{s_{t+1}} \max_{a_{t+1}} Q^{soft}(s_{t+1}, a_{t+1})
\max_{s_{t+1}}

What is being optimized?

Let's analyze an "exact variational inference" procedure. The true conditional is

p(\tau|\mathcal{O}_{0:T}) =\frac{p(\tau,\mathcal{O}_{0:T})} {p(\mathcal{O}_{0:T})}

(true conditional = joint / evidence)

With an unrestricted variational family, the optimum of

\arg\min_q \text{KL}\big(q(\tau)||p(\tau|\mathcal{O}_{0:T})\big)

is q(\tau) = p(\tau|\mathcal{O}_{0:T}) itself. Equivalently (the two objectives differ only by the constant \log p(\mathcal{O}_{0:T})), exact inference solves

\text{KL}\big(p(\tau|\mathcal{O}_{0:T})||p(\tau,\mathcal{O}_{0:T})\big) \rightarrow \min_{p(a_t|s_t, \mathcal{O}_{t:T})}

where the joint ("exact") distribution is:

p(\tau,\mathcal{O}_{0:T}) = p(s_0)\prod_{t=0}^Tp(a_t|s_t)p(s_{t+1}|s_t,a_t) \exp\big(r(s_t, a_t)\big)

and the variational one (here, the exact conditional) is:

p(\tau|\mathcal{O}_{0:T}) = p(s_0|\mathcal{O}_{0:T})\prod_{t=0}^Tp(a_t|s_t,\mathcal{O}_{0:T})p(s_{t+1}|s_t,a_t,\mathcal{O}_{0:T})

Note that the initial state and the dynamics are conditioned on optimality too: we were trying to find a policy which is optimal only in an "optimal" environment!

We can fix this!

Variational Inference

The form of q(\tau) is our choice: fix the dynamics!

q(\tau) = p(s_0)\prod_{t=0}^T\pi(a_t|s_t) p(s_{t+1}|s_t,a_t)

Now q is a distribution over ACHIEVABLE trajectories.

Minimization problem for VI:

\text{KL}\big(q(\tau)||p(\tau,\mathcal{O}_{0:T})\big) \rightarrow \min_q

Then:

\min_q \text{KL}\big(q(\tau)||p(\tau,\mathcal{O}_{0:T})\big) = - \max_q \mathbb{E}_q \log \frac{p(\tau,\mathcal{O}_{0:T})}{q(\tau)}

Maximum Entropy RL Objective (the uniform prior p(a_t|s_t) only adds a constant and is dropped):

\max_q \mathbb{E}_q \log \frac{p(\tau,\mathcal{O}_{0:T})}{q(\tau)} = \max_q \mathbb{E}_q \Big[ \log p(s_0)+\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + r(s_t, a_t) \big) - \log p(s_0)-\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + \log \pi(a_t| s_t) \big) \Big]=
= \max_\pi \mathbb{E}_\pi \sum_{t}\Big[ r(s_t, a_t) + \mathcal{H}\big( \pi(\cdot| s_t)\big) \Big]

Returning to the Q and V functions

This objective can be rewritten as follows (check it yourself!):

\sum_{t=0}^T\mathbb{E}_{s_t} \Big[ -\text{KL}\Big(\pi(a_t|s_t)||\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}\Big) + V^{soft}(s_t) \Big] \rightarrow \max_\pi

where

V^{soft}(s_t) =\log \int \exp Q^{soft}(s_t, a_t) da_t

(soft maximum)

Q^{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^{soft}(s_{t+1})

(normal Bellman equation: no more optimism about the dynamics)

Then the optimal policy is:

\pi(a_t|s_t) =\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}
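A hint for the "check it yourself" step, using the fact that V^{soft}(s_t) is exactly the log-normalizer of \exp Q^{soft}(s_t, \cdot):

\mathbb{E}_{a_t \sim \pi}\big[ Q^{soft}(s_t,a_t) - \log\pi(a_t|s_t) \big] = -\text{KL}\Big(\pi(\cdot|s_t)\,||\,\frac{\exp(Q^{soft}(s_t, \cdot))}{\exp(V^{soft}(s_t))}\Big) + V^{soft}(s_t)

Applying this identity backwards in time, starting from t = T, turns the sum of rewards and entropies into the KL form above.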

VI with function approximators

(neural nets)

Maximum Entropy Policy Gradients

Directly maximize the entropy-augmented objective over the policy parameters \theta:

\mathbb{E}_{\tau \sim \pi_\theta} \sum_{t=0}^T\Big[ r(s_t, a_t) + \mathcal{H}\big(\pi_\theta(\cdot|s_t)\big) \Big] \rightarrow \max_\theta

For gradients, use the log-derivative trick with a state-dependent baseline b (a sketch follows the list below):

\sum_{t=0}^T\mathbb{E}_{(s_t,a_t) \sim q_\theta} \Big[ \nabla_\theta \log\pi_\theta(a_t|s_t) \sum_{t'=t}^T\Big( r(s_{t'}, a_{t'}) -\log\pi_\theta(a_{t'}|s_{t'}) - b(s_{t'}) \Big)\Big]

Drawbacks:
  • on-policy
  • unimodal policies
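A minimal sketch of one update with this estimator, assuming a discrete-action policy network and a single rollout; the network sizes, optimizer settings, and the simple constant baseline are illustrative assumptions:

import torch
import torch.nn as nn

# Hypothetical small policy network for discrete actions.
obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def maxent_pg_step(states, actions, rewards):
    """One REINFORCE-style step on the entropy-augmented return.

    states: [T, obs_dim], actions: [T] (int64), rewards: [T]
    """
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)

    # Entropy-augmented reward: r(s_t, a_t) - log pi_theta(a_t | s_t)
    aug = rewards - log_probs.detach()
    # Reward-to-go: sum over t' >= t, plus a simple constant baseline.
    returns = torch.flip(torch.cumsum(torch.flip(aug, [0]), 0), [0])
    baseline = returns.mean()

    loss = -(log_probs * (returns - baseline)).sum()   # log-derivative trick
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()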

Soft Q-learning

Train a Q-network with parameters \phi on transitions from a replay buffer \mathcal{D}:

\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\phi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi

where

V^{soft}_\phi(s_t) =\log \int \exp Q^{soft}_\phi(s_t, a_t) da_t

(for continuous actions, estimate this integral with importance sampling)

The policy is implicit:

\pi(a_t|s_t) = \exp\big(Q^{soft}_\phi(s_t, a_t) - V^{soft}_\phi(s_t)\big)

(to draw samples from it, use SVGD... or MCMC :D)
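A minimal sketch of this loss for discrete actions, where the integral becomes a logsumexp over the Q-values; the networks, batch shapes, and the target-network copy are illustrative assumptions rather than details from the slides:

import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
q_target.load_state_dict(q_net.state_dict())   # common stabilization trick

def soft_q_loss(s, a, r, s_next):
    """Soft Q-learning regression loss on a replay-buffer batch.

    s: [B, obs_dim], a: [B] (int64), r: [B], s_next: [B, obs_dim]
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_phi(s_t, a_t)
    with torch.no_grad():
        v_next = torch.logsumexp(q_target(s_next), dim=1)   # V_soft(s') = log sum_a' exp Q
        target = r + v_next
    return ((q_sa - target) ** 2).mean()

# Implicit policy pi(a|s) = exp(Q(s,a) - V(s)): for discrete actions this is a softmax.
def sample_action(s):
    with torch.no_grad():
        return torch.distributions.Categorical(logits=q_net(s)).sample()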

Soft Q-learning

Benefits:
  • Exploration
  • Robustness
  • Multimodal policy

Soft Actor-Critic

Train Q- and V-networks jointly with the policy.

Q-network loss:

\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\psi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi

V-network loss:

\hat{V}^{soft}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) - \log\pi_\theta(a_t|s_t) \Big]
\mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \hat{V}^{soft}(s_t) - V^{soft}_\psi(s_{t}) \Big]^2 \rightarrow \min_\psi

Objective for the policy:

\mathbb{E}_{s_t \sim \mathcal{D}, \;a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) -\log\pi_\theta(a_{t}|s_{t})\Big] \rightarrow \max_\theta
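A minimal sketch of these three losses for continuous actions, assuming a diagonal-Gaussian policy with reparameterized sampling; the network shapes are made up, and tricks from the paper (tanh squashing, twin Q-networks, a target V-network, the temperature) are omitted:

import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 2 * act_dim))

def policy_dist(s):
    mean, log_std = policy_net(s).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

def sac_losses(s, a, r, s_next):
    """Q-, V- and policy losses on a replay-buffer batch (first dim = batch)."""
    # Q-network loss: regress Q_phi(s, a) onto r + V_psi(s').
    q_sa = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    with torch.no_grad():
        q_backup = r + v_net(s_next).squeeze(-1)
    q_loss = ((q_sa - q_backup) ** 2).mean()

    # V-network loss: regress V_psi(s) onto E_pi[ Q_phi(s, a) - log pi(a|s) ].
    dist = policy_dist(s)
    a_pi = dist.rsample()                               # reparameterized sample
    log_pi = dist.log_prob(a_pi).sum(-1)
    q_pi = q_net(torch.cat([s, a_pi], dim=-1)).squeeze(-1)
    v_loss = ((v_net(s).squeeze(-1) - (q_pi - log_pi).detach()) ** 2).mean()

    # Policy objective: maximize E[ Q_phi(s, a) - log pi(a|s) ], i.e. minimize the negative.
    policy_loss = (log_pi - q_pi).mean()
    return q_loss, v_loss, policy_loss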


Thank you for your attention!

and visit our seminars at the RL Reading Group

telegram: https://t.me/theoreticalrl

REFERENCES:

Soft Q-learning:

https://arxiv.org/pdf/1702.08165.pdf

Soft Actor-Critic:

https://arxiv.org/pdf/1801.01290.pdf

Big Review on Probabilistic Inference for RL:

https://arxiv.org/pdf/1805.00909.pdf

Implementation in TensorFlow:

https://github.com/rail-berkeley/softlearning

Implementation in Catalyst.RL:

https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym

Hierarchical policies (further reading):

https://arxiv.org/abs/1804.02808
