Report made by
Pavel Temirchev
Deep RL Reading Group
based on the research of Sergey Levine's team
The problems of standard RL:
1) Sample Complexity!
2) Convergence to local optima
Idea: encourage the agent to investigate all the promising strategies!
Markov process:
$p(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
Maximization problem:
$\max_\pi \; \mathbb{E}_{\tau \sim p(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big]$
Q-function:
$Q^\pi(s_t, a_t) = \mathbb{E}\Big[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\Big|\, s_t, a_t\Big]$
Bellman equality (optimal Q-function):
$Q^*(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\Big[\max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})\Big]$
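For reference, a tiny value-iteration sketch that solves this Bellman equality on a toy tabular MDP (hypothetical numbers, discounted variant, purely illustrative):

import numpy as np

S, A, gamma = 4, 2, 0.9                       # toy state/action counts and discount
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] transition probabilities
r = rng.normal(size=(S, A))                   # rewards r(s, a)

Q = np.zeros((S, A))
for _ in range(200):
    V = Q.max(axis=1)                         # V(s) = max_a Q(s, a)
    Q = r + gamma * (P @ V)                   # Bellman backup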
Standard RL:
$\pi^*(a_t \mid s_t) = \mathbb{1}\big[a_t = \arg\max_{a} Q^*(s_t, a)\big]$
Policy "proportional" to Q:
$\pi(a_t \mid s_t) \propto \exp\big(Q(s_t, a_t)\big)$
How to find such a policy?
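A minimal sketch of sampling from such a softmax policy over tabular Q-values (names and numbers are illustrative):

import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    # pi(a | s) proportional to exp(Q(s, a) / temperature)
    logits = q_values / temperature
    logits -= logits.max()                    # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

probs = boltzmann_policy(np.array([1.0, 2.0, 0.5]))
action = np.random.default_rng(0).choice(len(probs), p=probs)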
Optimality:
RL:
Which actions will lead us to the optimal future?
Probabilistic Inference:
Which actions were made given that the future is optimal?
Let's find an optimal policy:
$\pi(a_t \mid s_t) = p(a_t \mid s_t, O_{t:T} = 1)$
where $p(a_t \mid s_t)$ - prior policy (assume it is uniform)
if $p(O_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big)$, then we can
apply Bayes rule!
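Spelled out, Bayes rule gives:
$\pi(a_t \mid s_t) = p(a_t \mid s_t, O_{t:T} = 1) = \dfrac{p(O_{t:T} = 1 \mid s_t, a_t)\, p(a_t \mid s_t)}{p(O_{t:T} = 1 \mid s_t)}$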
Let's introduce new notation (backward messages):
$\beta_t(s_t, a_t) = p(O_{t:T} = 1 \mid s_t, a_t)$
$\beta_t(s_t) = p(O_{t:T} = 1 \mid s_t)$
We can find all the $\beta_t(s_t, a_t)$ and $\beta_t(s_t)$ via the Message Passing algorithm:
For the last timestep $t = T$:
$\beta_T(s_T, a_T) = p(O_T = 1 \mid s_T, a_T) = \exp\big(r(s_T, a_T)\big)$
Recursively, for $t < T$:
$\beta_t(s_t) = \int \beta_t(s_t, a_t)\, p(a_t \mid s_t)\, da_t$
$\beta_t(s_t, a_t) = \exp\big(r(s_t, a_t)\big)\, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\beta_{t+1}(s_{t+1})\big]$
$\log \int \exp(\cdot)\, da_t$ - soft maximum
the recursion for $\beta_t(s_t, a_t)$ - kinda Bellman equation
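A minimal numeric sketch of this backward message passing on a hypothetical tabular MDP (uniform action prior assumed):

import numpy as np

S, A, T = 4, 2, 5                             # toy sizes and horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] transition probabilities
r = rng.normal(size=(S, A))                   # rewards r(s, a)

beta_sa = np.exp(r)                           # beta_T(s, a) = p(O_T = 1 | s, a) = exp(r)
for t in reversed(range(T - 1)):
    beta_s = beta_sa.mean(axis=1)             # beta(s) = E_{a ~ uniform prior}[beta(s, a)]
    beta_sa = np.exp(r) * (P @ beta_s)        # beta_t(s, a) = exp(r) * E_{s'}[beta_{t+1}(s')]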
We can find analogues in the log-scale:
"Hard" Q and V functions:
$Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\big[V(s_{t+1})\big], \qquad V(s_t) = \max_{a_t} Q(s_t, a_t)$
"Soft" analogues:
$Q(s_t, a_t) = \log \beta_t(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1}}\big[\exp V(s_{t+1})\big], \qquad V(s_t) = \log \beta_t(s_t) = \log \int \exp\big(Q(s_t, a_t)\big)\, da_t$
Let's analyze an "exact variational inference" procedure:
$\underbrace{p(\tau \mid O_{1:T} = 1)}_{\text{true conditional}} = \underbrace{p(\tau, O_{1:T} = 1)}_{\text{joint}} \,\big/\, \underbrace{p(O_{1:T} = 1)}_{\text{evidence}}$
where the joint ("exact") distribution is:
$p(\tau, O_{1:T} = 1) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \exp\big(r(s_t, a_t)\big)$
and the variational one is:
$q(\tau) = q(s_1) \prod_t q(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t)$
If we match the true conditional exactly, the dynamics get conditioned on optimality as well:
we tried to find a policy which is optimal
only in an optimal environment!
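The reason: in the exact posterior, the dynamics are conditioned on optimality too and become optimistic:
$p(s_{t+1} \mid s_t, a_t, O_{t+1:T} = 1) = \dfrac{p(s_{t+1} \mid s_t, a_t)\, \beta_{t+1}(s_{t+1})}{\mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\beta_{t+1}(s_{t+1})\big]}$
i.e. transitions get reweighted towards lucky next states that the agent cannot actually control.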
We can fix this!
The form of the variational distribution $q(\tau)$ is our choice
Minimization problem for VI:
$\min_q D_{KL}\big(q(\tau)\,\big\|\,p(\tau \mid O_{1:T} = 1)\big)$
$q(\tau)$ is a distribution over
ACHIEVABLE trajectories
fix the dynamics! $q(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$, $q(s_1) = p(s_1)$, only $q(a_t \mid s_t) = \pi(a_t \mid s_t)$ stays free
Then:
$q(\tau) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$
Maximum Entropy RL Objective:
$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim q}\big[r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big]$
This objective can be rewritten as follows:
check it yourself!
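One way to do the check: the dynamics terms of $q(\tau)$ and $p(\tau, O_{1:T} = 1)$ coincide and cancel, so
$J(\pi) = \mathbb{E}_{\tau \sim q}\Big[\sum_t r(s_t, a_t) - \log \pi(a_t \mid s_t)\Big] = \mathbb{E}_{\tau \sim q}\big[\log p(\tau, O_{1:T} = 1) - \log q(\tau)\big] = -D_{KL}\big(q(\tau)\,\big\|\,p(\tau, O_{1:T} = 1)\big)$
so maximizing it is the same (up to the constant $\log p(O_{1:T} = 1)$) as minimizing the KL from the VI problem above.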
Then the optimal policy is:
$\pi^*(a_t \mid s_t) = \exp\big(Q_{soft}(s_t, a_t) - V_{soft}(s_t)\big)$
where
$V_{soft}(s_t) = \log \int \exp\big(Q_{soft}(s_t, a_t)\big)\, da_t$ - soft maximum
$Q_{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[V_{soft}(s_{t+1})\big]$ - normal Bellman equation (plain expectation over the real dynamics, no optimism)
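Numerically, the only change relative to ordinary value iteration is max replaced by log-sum-exp (toy discounted variant again, purely illustrative):

import numpy as np

S, A, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s']
r = rng.normal(size=(S, A))

Q = np.zeros((S, A))
for _ in range(200):
    V = np.log(np.exp(Q).sum(axis=1))         # soft maximum instead of max
    Q = r + gamma * (P @ V)                   # ordinary Bellman backup over the true dynamics
pi = np.exp(Q - np.log(np.exp(Q).sum(axis=1, keepdims=True)))   # pi = exp(Q - V)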
(neural nets)
Directly maximize the entropy-augmented objective
over policy parameters $\phi$:
$J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\Big[\sum_t r(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\Big]$
For gradients, use the log-derivative trick:
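Written out, one standard REINFORCE-style form of this gradient (ignoring baselines and causality refinements) is:
$\nabla_\phi J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\Big[\Big(\sum_t \nabla_\phi \log \pi_\phi(a_t \mid s_t)\Big)\Big(\sum_t r(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\Big)\Big]$
(the extra term from differentiating $-\log \pi_\phi$ inside the expectation vanishes, since the expected score is zero)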
Train a Q-network with parameters $\theta$:
$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - r(s_t, a_t) - V(s_{t+1})\big)^2\Big]$
use a replay buffer $\mathcal{D}$
where
$V(s_{t+1}) = \log \int \exp\big(Q_\theta(s_{t+1}, a_{t+1})\big)\, da_{t+1}$
for continuous actions estimate this integral with
importance sampling
The policy is implicit: $\pi(a_t \mid s_t) \propto \exp\big(Q_\theta(s_t, a_t)\big)$
for samples use SVGD
or MCMC :D
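A sketch of the importance-sampled soft value estimate (the Q-network, proposal sampler and proposal log-density below are hypothetical callables, not the reference code):

import numpy as np

def soft_value(q_fn, state, sample_action, log_q_prime, n=64):
    # V(s) = log E_{a ~ q'}[ exp(Q(s, a)) / q'(a | s) ], estimated from n proposal samples
    actions = [sample_action(state) for _ in range(n)]
    log_w = np.array([q_fn(state, a) - log_q_prime(state, a) for a in actions])
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())   # numerically stable log-mean-exp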
Exploration
Robustness
Multimodal Policy
Train Q- and V-networks jointly with the policy
Q-network loss:
$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma\, V_{\bar\psi}(s_{t+1})\big)^2\Big]$
V-network loss:
$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\big[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\big]\big)^2\Big]$
Objective for the policy:
$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[D_{KL}\Big(\pi_\phi(\cdot \mid s_t)\,\Big\|\, \exp\big(Q_\theta(s_t, \cdot)\big)\big/Z_\theta(s_t)\Big)\Big]$
($\bar\psi$ - slowly updated target parameters for the V-network)
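A minimal PyTorch sketch of these three losses (the network and batch objects are hypothetical placeholders, not the reference softlearning code):

import torch
import torch.nn.functional as F

def sac_losses(q_net, v_net, v_target_net, policy, batch, gamma=0.99):
    # batch of transitions sampled from the replay buffer (tensors)
    s, a, r, s_next, done = batch

    # Q-network loss: match r + gamma * V_target(s')
    with torch.no_grad():
        q_backup = r + gamma * (1.0 - done) * v_target_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), q_backup)

    # fresh action and its log-probability from the reparameterized policy
    a_new, log_pi = policy(s)

    # V-network loss: match E_{a ~ pi}[Q(s, a) - log pi(a | s)]
    with torch.no_grad():
        v_backup = q_net(s, a_new) - log_pi
    v_loss = F.mse_loss(v_net(s), v_backup)

    # policy objective: KL(pi || exp(Q)/Z) up to a constant; step only the policy optimizer on it
    pi_loss = (log_pi - q_net(s, a_new)).mean()
    return q_loss, v_loss, pi_loss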
Thank you for your attention!
and visit the seminars of our RL Reading Group
telegram: https://t.me/theoreticalrl
REFERENCES:
Soft Q-learning:
https://arxiv.org/pdf/1702.08165.pdf
Soft Actor Critic:
https://arxiv.org/pdf/1801.01290.pdf
Big Review on Probabilistic Inference for RL:
https://arxiv.org/pdf/1805.00909.pdf
Implementation on TensorFlow:
https://github.com/rail-berkeley/softlearning
Implementation on Catalyst.RL:
https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym
Hierarchical policies (further reading):