Reinforcement Learning as Probabilistic Inference

Report by Pavel Temirchev

 

Deep RL Reading Group

Based on the research of Sergey Levine's team

Motivation

The problems of standard RL:

1) Sample complexity!

2) Convergence to local optima

Idea: encourage the agent to explore all the promising strategies!

REMINDER: standard RL

Markov decision process:

[Graphical model: a chain of states s_0, s_1, s_2, ... with actions a_0, a_1, a_2, ...]

\tau = (s_0, a_0, \dots, s_t, a_t, \dots, s_T, a_T)

p(\tau) = p(s_0) \prod_{t=0}^T p(a_t|s_t) p(s_{t+1}|s_t, a_t)

Maximization problem:

\pi^\star = \arg\max_\pi \sum_{t=0}^T \mathbb{E}_{s_t, a_t \sim \pi} [r(s_t, a_t)]

Q-function:

Q^\pi(s_t,a_t) := r(s_t,a_t) + \sum_{t'=t+1}^T \mathbb{E}_{s_{t'}, a_{t'} \sim \pi} [r(s_{t'}, a_{t'})]

Bellman equality (optimal Q-function):

Q^\star(s_t,a_t) = r(s_t,a_t) + \mathbb{E}_{s_{t+1}} V^\star(s_{t+1})

V^\star(s_t) = \max_a Q^\star(s_{t}, a)

Maximum Entropy RL

Standard RL typically ends up with a unimodal policy, e.g. Gaussian noise around a single optimal action:

a_t \sim \mathcal{N}(\cdot| \pi^\star(s_t), \sigma^2)

Maximum Entropy RL instead asks for a policy "proportional" to the Q-function:

\pi(a_t|s_t) \propto \exp Q(s_t, a_t)

How to find such a policy? Minimize the KL divergence to the exponentiated Q-function:

\min_\pi\text{KL}\Big(\pi(\cdot|s_0)||\exp{Q(s_0, \cdot)}\Big) =
\max_\pi \mathbb{E}_\pi \Big[ Q(s_0, a_0) - \log \pi(a_0|s_0) \Big] =
\max_\pi \mathbb{E}_\pi \Big[ \sum_{t=0}^T r(s_t, a_t) + \mathcal{H} \big( \pi(\cdot|s_t) \big)\Big]
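As a sanity check, here is a tiny NumPy sketch (not from the talk; the Q-values are made up) showing that, for a discrete action set, the policy minimizing KL(\pi || \exp Q) is exactly the softmax (Boltzmann) policy \pi(a) \propto \exp Q(a):

import numpy as np

# Toy check: for a single state with 3 discrete actions, the minimizer of
# KL(pi || exp(Q)/Z) is the softmax distribution.  Q-values are hypothetical.
Q = np.array([1.0, 2.0, 0.5])

def kl_to_exp_q(pi, Q):
    """KL(pi || exp(Q)/Z), up to the constant log Z."""
    return np.sum(pi * (np.log(pi) - Q))

softmax_pi = np.exp(Q) / np.exp(Q).sum()      # candidate optimum

# Random policies never beat the softmax policy.
rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones_like(Q))
    assert kl_to_exp_q(softmax_pi, Q) <= kl_to_exp_q(p, Q) + 1e-12

print("softmax policy:", softmax_pi)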

RL as Probabilistic Inference

[Graphical model: the same chain of states s_t and actions a_t, now with a binary optimality variable \mathcal{O}_t attached to every timestep]

Optimality:

p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t) = \exp(r(s_t, a_t))

(assuming r(s_t, a_t) \le 0, so that this defines a valid probability)

RL:

Which actions will lead us to the optimal future?

Probabilistic Inference:

Which actions were taken, given that the future is optimal?

p(a_t|s_t, \mathcal{O}_{t:T})

Exact Probabilistic Inference

Let's find the optimal policy: apply Bayes' rule!

p(a_t|s_t, \mathcal{O}_{t:T}) = \frac{p(s_t, a_t|\mathcal{O}_{t:T})}{p(s_t|\mathcal{O}_{t:T})} =
= \frac{p(\mathcal{O}_{t:T}|s_t, a_t) p(a_t|s_t) p(s_t)}{p(\mathcal{O}_{t:T})}\frac{p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}|s_t) p(s_t)}

where p(a_t|s_t) is the prior policy.

If p(a_t|s_t) = \frac{1}{|\mathcal{A}|} (a uniform prior), then

p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t)}{p(\mathcal{O}_{t:T}|s_t)}

Let's introduce new notation (backward messages):

\alpha_t(s_t, a_t) := p(\mathcal{O}_{t:T}|s_t, a_t)
\beta_t(s_t) := p(\mathcal{O}_{t:T}|s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t

We can find all the \alpha_t and \beta_t via the message passing algorithm.

For the last timestep T:

\alpha_T(s_T, a_T) = \exp(r(s_T, a_T))
\beta_T(s_T) = \int \alpha_T(s_T, a_T) p(a_T|s_T)da_T

Recursively, for t < T:

\alpha_t(s_t, a_t) = \int \beta_{t+1}(s_{t+1}) \exp(r(s_t, a_t)) p(s_{t+1}|s_t, a_t)ds_{t+1}
\beta_t(s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t
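Below is a minimal sketch of this backward pass on a made-up tabular MDP (3 states, 2 actions, horizon T = 4); the dynamics, rewards and the uniform prior policy are illustrative assumptions, not anything from the talk:

import numpy as np

# Backward message passing on a toy finite MDP.  Rewards are kept negative so exp(r) <= 1.
S, A, T = 3, 2, 4
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s'] = p(s'|s, a)
r = -rng.uniform(0.1, 1.0, size=(S, A))         # r(s, a) < 0
prior = np.full((S, A), 1.0 / A)                # uniform prior policy p(a|s)

alpha = np.zeros((T + 1, S, A))                 # alpha_t(s, a) = p(O_{t:T} | s, a)
beta = np.zeros((T + 1, S))                     # beta_t(s)     = p(O_{t:T} | s)

# Base case at the last timestep T.
alpha[T] = np.exp(r)
beta[T] = (alpha[T] * prior).sum(axis=1)

# Backward recursion for t = T-1, ..., 0.
for t in reversed(range(T)):
    alpha[t] = np.exp(r) * (P @ beta[t + 1])    # exp(r(s,a)) * sum_s' p(s'|s,a) beta_{t+1}(s')
    beta[t] = (alpha[t] * prior).sum(axis=1)

# Posterior policy from the Bayes-rule slide: p(a|s, O_{t:T}) is proportional to alpha_t(s, a) p(a|s).
policy = alpha[0] * prior
policy /= policy.sum(axis=1, keepdims=True)
print(policy)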


Exact Probabilistic Inference

Soft Q and V functions

We can find analogues of the Q- and V-functions in log-scale:

Q^{soft}(s_t, a_t) := \log\alpha_t(s_t, a_t)
V^{soft}(s_t) := \log\beta_t(s_t)

Recursively:

V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]

(a soft maximum over actions)

Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]

(a "kinda" Bellman equation)

Soft and Hard Q and V functions

"Hard" Q and V functions:

V^\star(s_t) =\max_{a_t} Q^\star(s_t, a_t)
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^\star(s_{t+1})
V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]

"Soft" analogues:

Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1})
Q^{soft}(s_t, a_t) \approx r(s_t, a_t) + \max_{s_{t+1}} \max_{a_{t+1}} Q^{soft}(s_{t+1}, a_{t+1})
\max_{s_{t+1}}
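A tiny numeric illustration (all numbers made up) of this optimism: the log-E-exp backup over the dynamics upper-bounds the expected value and is pulled towards the single best successor state, even if it is unlikely:

import numpy as np

# Three successor states with hypothetical soft values and a risky transition distribution.
V_next = np.array([0.0, 0.0, 10.0])          # V^{soft}(s') for 3 successor states
p_next = np.array([0.45, 0.45, 0.10])        # the good state is reached only 10% of the time

expected_value = np.dot(p_next, V_next)                      # E_{s'} V(s')        = 1.0
soft_backup = np.log(np.dot(p_next, np.exp(V_next)))         # log E_{s'} exp V(s') ~ 7.7
hard_max = V_next.max()                                      # max_{s'} V(s')      = 10.0

print(expected_value, soft_backup, hard_max)
# The soft backup is dominated by the rare lucky outcome: "optimal in an optimal environment".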

What is being optimized?

Let's analyze the "exact inference" procedure. By Bayes' rule:

p(\tau|\mathcal{O}_{0:T}) =\frac{p(\tau,\mathcal{O}_{0:T})} {p(\mathcal{O}_{0:T})}

(true conditional = joint / evidence)

Exact inference corresponds to choosing q(\tau) = p(\tau|\mathcal{O}_{0:T}), i.e. it solves

\arg\min_q \text{KL}\big(q(\tau)||p(\tau|\mathcal{O}_{0:T})\big) = \arg\min_{p(a_t|s_t, \mathcal{O}_{t:T})} \text{KL}\big(p(\tau|\mathcal{O}_{0:T})||p(\tau,\mathcal{O}_{0:T})\big)

(the two KL objectives differ only by the constant evidence term \log p(\mathcal{O}_{0:T}))

What is being optimized?

where the joint ("exact") distribution is:

p(\tau,\mathcal{O}_{0:T}) = p(s_0)\prod_{t=0}^Tp(a_t|s_t)p(s_{t+1}|s_t,a_t) \exp\big(r(s_t, a_t)\big)

and the variational one is the full posterior:

p(\tau|\mathcal{O}_{0:T}) = p(s_0|{\color{#ff0000}\mathcal{O}_{0:T}})\prod_{t=0}^Tp(a_t|s_t,\mathcal{O}_{0:T})p(s_{t+1}|s_t,a_t,{\color{#ff0000}\mathcal{O}_{0:T}})

The initial state and the dynamics (in red) are conditioned on optimality as well: we tried to find a policy which is optimal only in an optimal environment!

We can fix this!


Variational Inference

The form of q(\tau) is our choice. Fix the dynamics, so that q is a distribution over ACHIEVABLE trajectories:

q(\tau) = p(s_0)\prod_{t=0}^T\pi(a_t|s_t) p(s_{t+1}|s_t,a_t)

Minimization problem for VI:

\text{KL}\big(q(\tau)||p(\tau,\mathcal{O}_{0:T})\big) \rightarrow \min_q

Then (with the uniform prior p(a_t|s_t) absorbed into a constant):

-\min_q \text{KL}\big(q(\tau)||p(\tau,\mathcal{O}_{0:T})\big) = \max_q \mathbb{E}_q \log \frac{p(\tau,\mathcal{O}_{0:T})}{q(\tau)} =
= \max_q \mathbb{E}_q \Big[ \log p(s_0)+\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + r(s_t, a_t) \big) - \log p(s_0)-\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + \log \pi(a_t| s_t) \big) \Big]=
= \max_\pi \mathbb{E}_\pi \sum_{t}\Big[ r(s_t, a_t) + \mathcal{H}\big( \pi(\cdot| s_t)\big) \Big]

This is exactly the Maximum Entropy RL objective.

Variational Inference

Returning to the Q and V functions

This objective can be rewritten as follows (check it yourself!):

\sum_{t=0}^T\mathbb{E}_{s_t} \Big[ -\text{KL}\Big(\pi(a_t|s_t)||\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}\Big) + V^{soft}(s_t) \Big] \rightarrow \max_\pi

Then the optimal policy is:

\pi(a_t|s_t) =\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}

where

V^{soft}(s_t) =\log \int \exp Q^{soft}(s_t, a_t) da_t

(a soft maximum over actions)

Q^{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^{soft}(s_{t+1})

(a normal Bellman equation over the dynamics: no more optimism about transitions)
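A minimal sketch of finite-horizon soft value iteration with exactly these backups, on a made-up tabular MDP (all numbers are illustrative), recovering the policy \pi = \exp(Q^{soft} - V^{soft}):

import numpy as np

# Soft value iteration: logsumexp over actions, ordinary expectation over dynamics.
S, A, T = 3, 2, 4
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))           # p(s'|s, a)
r = -rng.uniform(0.1, 1.0, size=(S, A))              # r(s, a)

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

Q_soft = np.zeros((T + 1, S, A))
V_soft = np.zeros((T + 1, S))

Q_soft[T] = r
V_soft[T] = logsumexp(Q_soft[T], axis=1)              # soft maximum over actions
for t in reversed(range(T)):
    Q_soft[t] = r + P @ V_soft[t + 1]                 # normal Bellman backup over dynamics
    V_soft[t] = logsumexp(Q_soft[t], axis=1)

pi = np.exp(Q_soft[0] - V_soft[0][:, None])           # pi(a|s) = exp(Q^soft - V^soft)
print(pi, pi.sum(axis=1))                             # each row sums to 1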

VI with function approximators (neural nets)

Maximum Entropy Policy Gradients

Directly maximize the entropy-augmented objective over policy parameters \theta:

\mathbb{E}_{\tau \sim \pi_\theta} \sum_{t=0}^T\Big[ r(s_t, a_t) + \mathcal{H}\big(\pi_\theta(\cdot|s_t)\big) \Big] \rightarrow \max_\theta

For the gradients, use the log-derivative trick:

\sum_{t=0}^T\mathbb{E}_{(s_t,a_t) \sim q_\theta} \Big[ \nabla_\theta \log\pi_\theta(a_t|s_t) \sum_{t'=t}^T\Big( r(s_{t'}, a_{t'}) -\log\pi_\theta(a_{t'}|s_{t'}) - b(s_{t'}) \Big)\Big]

Drawbacks (see the sketch below):
  • on-policy
  • typically unimodal policies
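A rough sketch of this estimator (not the talk's code) for a tabular softmax policy \pi_\theta(a|s) \propto \exp\theta[s, a] on a made-up MDP; the baseline b(s) is dropped for brevity:

import numpy as np

# Entropy-regularized REINFORCE: returns-to-go use the augmented reward r - log pi.
S, A, T = 3, 2, 4
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))           # p(s'|s, a)
r = -rng.uniform(0.1, 1.0, size=(S, A))              # r(s, a)
theta = np.zeros((S, A))                             # policy logits

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def sample_trajectory(pi):
    s, traj = int(rng.integers(S)), []
    for _ in range(T + 1):
        a = rng.choice(A, p=pi[s])
        traj.append((s, a))
        s = rng.choice(S, p=P[s, a])
    return traj

def policy_gradient_estimate(theta, n_samples=64):
    pi, grad = policy(theta), np.zeros_like(theta)
    for _ in range(n_samples):
        traj = sample_trajectory(pi)
        g = [r[s, a] - np.log(pi[s, a]) for s, a in traj]     # entropy-augmented rewards
        returns_to_go = np.cumsum(g[::-1])[::-1]
        for (s, a), R in zip(traj, returns_to_go):
            grad_log = -pi[s].copy()
            grad_log[a] += 1.0                                # d log pi(a|s) / d theta[s, :]
            grad[s] += grad_log * R
    return grad / n_samples

theta += 0.1 * policy_gradient_estimate(theta)               # one gradient ascent step
print(policy(theta))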

Soft Q-learning

Train a Q-network with parameters \phi, using a replay buffer \mathcal{D}:

\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\phi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi

where

V^{soft}_\phi(s_t) =\log \int \exp Q^{soft}_\phi(s_t, a_t) da_t

For continuous actions, estimate this integral with importance sampling (see the sketch below).

The policy is implicit:

\pi(a_t|s_t) = \exp\big(Q^{soft}_\phi(s_t, a_t) - V^{soft}_\phi(s_t)\big)

To draw samples from it, use SVGD (Stein Variational Gradient Descent)... or MCMC :D
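A minimal sketch of the importance-sampled soft value estimate for a continuous 1-D action, with a Gaussian proposal and a made-up stand-in for the Q-network (everything here is an illustrative assumption):

import numpy as np

# V(s) = log of the integral of exp Q(s,a) over a = log E_{a~q}[ exp Q(s,a) / q(a) ],
# estimated by Monte Carlo with a Gaussian proposal q.
rng = np.random.default_rng(3)

def Q(s, a):
    """Hypothetical soft Q-function (placeholder for a Q-network Q_phi)."""
    return -(a - 0.5) ** 2 + 0.1 * s

def soft_value(s, n_samples=4096, proposal_std=2.0):
    a = rng.normal(0.0, proposal_std, size=n_samples)                 # a ~ q = N(0, std^2)
    log_q = -0.5 * (a / proposal_std) ** 2 - np.log(proposal_std * np.sqrt(2 * np.pi))
    log_w = Q(s, a) - log_q                                           # log [ exp Q(s,a) / q(a) ]
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())                       # stable log-mean-exp

print(soft_value(s=1.0))    # Monte Carlo estimate of the soft value at a made-up state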

Soft Q-learning

Benefits:
  • Exploration
  • Robustness
  • Multimodal policies

Soft Actor-Critic

Train the Q- and V-networks jointly with the policy.

Q-network loss:

\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\psi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi

V-network loss:

\hat{V}^{soft}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) - \log\pi_\theta(a_t|s_t) \Big]
\mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \hat{V}^{soft}(s_t) - V^{soft}_\psi(s_{t}) \Big]^2 \rightarrow \min_\psi

Objective for the policy:

\mathbb{E}_{s_t \sim \mathcal{D}, \;a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) -\log\pi_\theta(a_{t}|s_{t})\Big] \rightarrow \max_\theta
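A simplified PyTorch sketch of these three losses (plain Gaussian policy without tanh squashing, no target networks, a random batch for illustration; not the reference implementation):

import torch
import torch.nn as nn

obs_dim, act_dim, batch = 8, 2, 32

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # Q_phi
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))             # V_psi
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2 * act_dim))  # pi_theta

def policy_dist(s):
    mean, log_std = pi_net(s).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

# A fake replay-buffer batch (s, a, r, s'): random numbers for illustration only.
s, a = torch.randn(batch, obs_dim), torch.randn(batch, act_dim)
r, s2 = torch.randn(batch, 1), torch.randn(batch, obs_dim)

# Q-network loss: (Q(s,a) - (r + V(s')))^2
q_target = (r + v_net(s2)).detach()
q_loss = ((q_net(torch.cat([s, a], dim=-1)) - q_target) ** 2).mean()

# V-network loss: (V(s) - E_{a~pi}[Q(s,a) - log pi(a|s)])^2
dist = policy_dist(s)
a_pi = dist.rsample()                                   # reparameterized action sample
log_pi = dist.log_prob(a_pi).sum(dim=-1, keepdim=True)
v_target = (q_net(torch.cat([s, a_pi], dim=-1)) - log_pi).detach()
v_loss = ((v_net(s) - v_target) ** 2).mean()

# Policy objective: maximize E[Q(s, a_pi) - log pi(a_pi|s)]  (only theta is updated with it)
pi_loss = (log_pi - q_net(torch.cat([s, a_pi], dim=-1))).mean()

print(q_loss.item(), v_loss.item(), pi_loss.item())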


Thank you for your attention!

And visit our seminars at the RL Reading Group

telegram: https://t.me/theoreticalrl

REFERENCES:

Soft Q-learning:

https://arxiv.org/pdf/1702.08165.pdf

Soft Actor-Critic:

https://arxiv.org/pdf/1801.01290.pdf

Big Review on Probabilistic Inference for RL:

https://arxiv.org/pdf/1805.00909.pdf

Implementation in TensorFlow:

https://github.com/rail-berkeley/softlearning

Implementation in Catalyst.RL:

https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym

Hierarchical policies (further reading):

https://arxiv.org/abs/1804.02808
