speaker: Pavel Temirchev
RL as Probabilistic Inference
What we will NOT discuss today?
- Use of Bayesian Neural Networks (BNNs) for exploration
- Bayesian model ensembling for model-based RL
- Distributional RL (hey, we already discussed it)
- A lot of other interesting things...
What we WILL discuss
How to treat the RL problem as a probabilistic inference problem?
\mathbb{E}_\pi \sum_{t=0}^T r(s_t, a_t) \rightarrow \max_\pi
Standard RL: optimization
Probabilistic Inference
p(A|B) = \;?
\pi(a_t| s_t, \pi \;\text{is optimal})
maybe something like this will do...
Why will we discuss it?
- Treating RL as inference lets us use effective inference tools for solving RL problems. We can develop new algorithms.
- Bayesians always try to generalize others' ideas.
- As we will see, inference has a close connection to Maximum Entropy RL - maybe it will help to improve exploration!
Background
Probabilistic Graphical Models
Generally, a joint probability distribution
p(a, b, c, d, e)
can be factorized as follows:
p(a, b, c, d, e) = p(a|b, c, d, e) p(b|c, d, e) p(c|d, e) p(d|e) p(e)
A graphical representation of a probabilistic model can help to embed structure into the model:
[Graph: nodes a, b, c, d, e with edges a → c, b → c, c → d, c → e]
p(a, b, c, d, e) = p(a) p(b) p(c|a, b) p(d|c) p(e|c)
Background
Inference on PGMs
Inference:
p(z_1) = \int p(z_1|x_1) p(x_1) dx_1
p(x_2) = \int p(x_2|x_1) p(x_1) dx_1
A graphical representation can make probabilistic inference easier.
There are many algorithms for exact and approximate inference on PGMs.
We will discuss a very simple example:
the Message Passing Algorithm on trees.
Question:
p(z_l) = \;? \;\;\; \forall l
Model:
p(x_{0:L}, z_{0:L}) = p(x_0)p(z_0|x_0)\prod_l p(x_l|x_{l-1})p(z_l|x_l)
[Chain PGM: x_0 → x_1 → … → x_L, with each z_l generated from x_l]
p(z_0) = \int p(z_0|x_0) p(x_0) dx_0
p(x_1) = \int p(x_1|x_0) p(x_0) dx_0
p(z_l) = \int p(z_l|x_l) p(x_l) dx_l
p(x_{l+1}) = \int p(x_{l+1}|x_l) p(x_l) dx_l
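To make the recursions concrete, here is a minimal NumPy sketch (not part of the original slides) for a discrete chain; the chain length, state sizes, and the matrices trans_x and emit_z are made-up assumptions.

import numpy as np

# Minimal sketch of the forward recursions above for a discrete chain.
L = 5                                    # chain length (assumption)
n_x, n_z = 3, 4                          # number of states for x_l and z_l (assumption)
rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

p_x0 = np.full(n_x, 1.0 / n_x)           # p(x_0)
trans_x = random_stochastic(n_x, n_x)    # trans_x[i, j] = p(x_{l+1}=j | x_l=i)
emit_z = random_stochastic(n_x, n_z)     # emit_z[i, k]  = p(z_l=k | x_l=i)

p_x = p_x0
for l in range(L + 1):
    p_z = p_x @ emit_z                   # p(z_l) = sum_x p(z_l|x_l) p(x_l)
    print(f"p(z_{l}) =", np.round(p_z, 3))
    p_x = p_x @ trans_x                  # p(x_{l+1}) = sum_x p(x_{l+1}|x_l) p(x_l)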
Background
Bayes' Rule and Kullback-Leibler divergence (KL)
Bayes' Rule allows us to calculate the posterior distribution of a r.v.
given new data and a prior distribution.
The Kullback-Leibler divergence is a measure of how one distribution differs from another, reference, distribution (it is not symmetric):
p(z|x) = \frac{p(x|z) p(z)}{p(x)} = \frac{p(x|z) p(z)}{\int p(x|z)p(z)dz}
\text{KL}\Big( q(x)\; \big|\big|\; p(x) \Big) = \mathbb{E}_{x \sim q} \Big[ \log \frac{q(x)}{p(x)} \Big]
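A tiny NumPy illustration (my addition, with made-up numbers) of the asymmetry of the KL divergence between two discrete distributions:

import numpy as np

q = np.array([0.7, 0.2, 0.1])
p = np.array([0.4, 0.4, 0.2])

kl_qp = np.sum(q * np.log(q / p))   # E_{x~q}[log q(x) - log p(x)]
kl_pq = np.sum(p * np.log(p / q))
print(kl_qp, kl_pq)                 # the two values differ: KL is not symmetric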
Background
Approximate probabilistic inference: Variational Inference (VI)
When applying Bayes' rule, a common situation
is that the evidence term is intractable:
p(x) = \int p(x|z)p(z)dz
Hence, the exact posterior p(z|x) is intractable!
One way to go is to use an approximate inference procedure called Variational Inference.
We want to minimize the dissimilarity between the true posterior p(z|x)
and our approximation - the variational distribution q(z).
The search is over the chosen family of variational distributions q \in \mathcal{Q}:
\text{KL}\Big( q(z)\; \big|\big|\; p(z|x) \Big) \rightarrow \min_{q \in \mathcal{Q}}
Background
Approximate probabilistic inference: Variational Inference (VI)
It can be shown that the defined minimization problem is closely related
to the maximization of a lower bound on the evidence
p(x) = \int p(x|z)p(z)dz
We can rewrite the logarithm of the evidence as follows:
\log p(x) = \text{KL}\Big( q(z)\; \big|\big|\; p(z|x) \Big) + \mathcal{L}(q) \quad (1)
where \mathcal{L}(q) is the so-called Evidence Lower Bound Objective (ELBO):
\mathcal{L}(q) = - \mathbb{E}_q \Big[\log q(z) - \log p(x, z) \Big]
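For completeness, here is a short derivation of (1) (my own sketch, not on the original slide), using p(z|x) = p(x,z)/p(x):
\text{KL}\big(q(z)\,||\,p(z|x)\big) = \mathbb{E}_q\Big[\log\frac{q(z)}{p(z|x)}\Big] = \mathbb{E}_q\big[\log q(z) - \log p(x,z)\big] + \log p(x) = -\mathcal{L}(q) + \log p(x)
Rearranging gives \log p(x) = \text{KL}\big(q(z)\,||\,p(z|x)\big) + \mathcal{L}(q), which is exactly (1).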
The LHS of (1) is independent of q, whereas each term on the RHS depends on q.
Hence, the minimization of \text{KL} is equivalent to the maximization of the ELBO \mathcal{L}(q).
It is your choice: either you want to minimize
\text{KL}\big( q\; ||\; p \big) \rightarrow \min_{q \in \mathcal{Q}}
or you want to maximize
\mathcal{L}(q) \rightarrow \max_{q \in \mathcal{Q}}
Background
RL Basics
Trajectory distribution for a Markov Decision Process:
p(\tau) = p(s_0) \prod_{t=0}^T p(a_t|s_t) p(s_{t+1}|s_t, a_t)
Maximization problem:
\pi^\star = \arg\max_\pi \sum_{t=0}^T \mathbb{E}_{s_t, a_t \sim \pi} [r(s_t, a_t)]
Value functions (defined for a policy):
Q^\pi(s_t,a_t) := r(s_t,a_t) + \sum_{t'=t+1}^T \mathbb{E}_{s_{t'}, a_{t'} \sim \pi} [r(s_{t'}, a_{t'})]
V^\pi(s_t) = \mathbb{E}_a Q^\pi(s_{t}, a)
Bellman Optimality operator:
Q^\star(s_t,a_t) = r(s_t,a_t) + \mathbb{E}_{s_{t+1}} V^\star(s_{t+1})
V^\star(s_t) = \max_a Q^\star(s_{t}, a)
Generally, the reward is a random variable:
r(s_t,a_t) = \mathbb{E} \big[ R(s_t, a_t) \big]
Probabilistic Graphical Model for an MDP:
[PGM: nodes s_0, a_0, s_1, a_1, s_2, a_2; each a_t depends on s_t, and s_{t+1} depends on (s_t, a_t)]
A heuristic for better exploration
Maximum entropy RL



Standard Policy Gradient:
a_t \sim \mathcal{N}(\cdot| \pi^\star, \sigma^2)
Policy "proportional" to Q:
a_t \sim \exp{Q(s_t, a_t)}
How to find such a policy?
\min_\pi\text{KL}\Big(\pi(\cdot|s_0)||\exp{Q(s_0, \cdot)}\Big) =
\max_\pi \mathbb{E}_\pi \Big[ Q(s_0, a_0) - \log \pi(a_0|s_0) \Big] =
\max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) {\color{pink}+ \mathcal{H} \big( \pi(\cdot|s_0) \big)}\Big]
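A short justification of the first equality (my own sketch; Z = \int \exp Q(s_0, a)\,da denotes the normalizing constant of \exp Q(s_0,\cdot), which the slide leaves implicit):
\text{KL}\Big(\pi(\cdot|s_0)\,\Big|\Big|\,\tfrac{1}{Z}\exp Q(s_0,\cdot)\Big) = \mathbb{E}_\pi\big[\log\pi(a_0|s_0) - Q(s_0,a_0)\big] + \log Z
Since \log Z does not depend on \pi, minimizing this KL over \pi is the same as maximizing \mathbb{E}_\pi\big[Q(s_0,a_0) - \log\pi(a_0|s_0)\big] = \mathbb{E}_\pi\big[Q(s_0,a_0)\big] + \mathcal{H}\big(\pi(\cdot|s_0)\big).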
[Figure: Q^\star(s_0, \cdot) over actions a_0 (go left / go right), compared with the policies \mathcal{N}(\cdot|\arg\max Q^\star, \sigma^2) and \exp Q^\star(s_0, \cdot)]
It is very similar to the heuristic Maximum Entropy RL objective:
\max_\pi \mathbb{E}_\pi \Big[ \sum_t^T r(s_t, a_t) {\color{pink}+ \mathcal{H} \big( \pi(\cdot|s_t) \big)}\Big]
During the lecture we will derive a probabilistic model, inference in which results in the Maximum Entropy RL objective.
RL as Probabilistic Inference
Graphical Model with Optimality variables
What if we had binary optimality variables?
Let us look at the PGM for an MDP:
[PGM: states s_0, s_1, s_2 and actions a_0, a_1, a_2, with an optimality variable \mathcal{O}_t attached to each (s_t, a_t) pair]
If \mathcal{O}_t = 1, then timestep t was optimal.
p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t)
Probability that the pair (s_t, a_t) is optimal: p(\mathcal{O}_t |s_t, a_t)
But how should we define this probability?
Use exponentiation. Exponents are good.
p(\mathcal{O}_t =1 |s_t, a_t) := p(\mathcal{O}_t |s_t, a_t) = \exp\big(r(s_t,a_t)\big)
Let us analyze the distribution of trajectories conditioned on optimality:
p(\tau|\mathcal{O}_{0:T}) \propto p(\tau,\mathcal{O}_{0:T}) = p(s_0)\prod_{t=0}^Tp(a_t|s_t)p(s_{t+1}|s_t,a_t) \exp\big(r(s_t, a_t)\big)
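As an illustration (my own sketch, not from the slides), one can approximate p(\tau|\mathcal{O}_{0:T}) by sampling trajectories from the prior p(\tau) with a uniform policy and reweighting each trajectory by \exp\big(\sum_t r(s_t,a_t)\big); the random tabular MDP below is a made-up assumption.

import numpy as np

# Draw trajectories from the prior p(tau), then reweight by exp(sum of rewards).
rng = np.random.default_rng(0)
n_states, n_actions, T, n_traj = 4, 2, 5, 1000
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)

returns = np.zeros(n_traj)
for i in range(n_traj):
    s = 0
    for t in range(T + 1):
        a = rng.integers(n_actions)              # prior (uniform) policy p(a_t|s_t)
        returns[i] += R[s, a]
        s = rng.choice(n_states, p=P[s, a])      # prior dynamics p(s_{t+1}|s_t,a_t)

w = np.exp(returns - returns.max())              # proportional to exp(sum_t r(s_t,a_t))
w /= w.sum()                                     # self-normalized weights for p(tau|O_{0:T})
print("effective sample size:", 1.0 / np.sum(w ** 2))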
RL as Probabilistic Inference
Exact inference for Optimal actions
We can now infer actions conditioned on optimality - the optimal policy:
p(a_t|s_t, \mathcal{O}_{0:T}) = p(a_t|s_t, \mathcal{O}_{t:T}) \quad (*)
(*): a_t is conditionally independent of \mathcal{O}_{0:t-1} given s_t due to the structure of the PGM
= \frac{ p(s_t, a_t|\mathcal{O}_{t:T})}{ p(s_t|\mathcal{O}_{t:T})}
let's apply Bayes' rule!
= \frac{p(\mathcal{O}_{t:T}|s_t, a_t) p(a_t|s_t) p(s_t)}{p(\mathcal{O}_{t:T})}\frac{p(\mathcal{O}_{t:T})}{p(\mathcal{O}_{t:T}|s_t) p(s_t)}
here p(a_t|s_t) is some prior (non-informative) policy
If we set p(a_t|s_t) = \frac{1}{|\mathcal{A}|}, then the optimal policy is the following:
p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)}
Exact inference for optimal actions
Message Passing Algorithm
We want to compute
p(a_t|s_t, \mathcal{O}_{t:T}) \propto \frac{p(\mathcal{O}_{t:T}|s_t, a_t) }{p(\mathcal{O}_{t:T}|s_t)}
i.e. both p(\mathcal{O}_{t:T}|s_t, a_t) and p(\mathcal{O}_{t:T}|s_t) for all 0 \le t \le T.
Let's introduce new notation:
\alpha_t(s_t, a_t) := p(\mathcal{O}_{t:T}|s_t, a_t)
\beta_t(s_t) := p(\mathcal{O}_{t:T}|s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t
We can find all the \alpha_t and \beta_t via the Message Passing algorithm.
For the timestep T:
\alpha_T(s_T, a_T) = \exp(r(s_T, a_T))
\beta_T(s_T) = \int \alpha_T(s_T, a_T) p(a_T|s_T)da_T
Recursively:
\alpha_t(s_t, a_t) = \int \beta_{t+1}(s_{t+1}) \exp(r(s_t, a_t)) p(s_{t+1}|s_t, a_t)ds_{t+1}
\beta_t(s_t) = \int \alpha_t(s_t, a_t) p(a_t|s_t)da_t
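A minimal tabular NumPy sketch of this backward message passing (my addition; the random MDP, horizon, and uniform prior policy are assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, T = 4, 2, 10
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a)
R = rng.normal(size=(n_s, n_a))                    # r(s,a)
prior = np.full(n_a, 1.0 / n_a)                    # p(a_t|s_t) = 1/|A|

alpha = np.zeros((T + 1, n_s, n_a))                # alpha_t(s,a) = p(O_{t:T}|s,a)
beta = np.zeros((T + 1, n_s))                      # beta_t(s)   = p(O_{t:T}|s)
alpha[T] = np.exp(R)
beta[T] = alpha[T] @ prior
for t in range(T - 1, -1, -1):
    alpha[t] = np.exp(R) * (P @ beta[t + 1])       # sum_{s'} beta_{t+1}(s') p(s'|s,a)
    beta[t] = alpha[t] @ prior                     # sum_a alpha_t(s,a) p(a|s)

policy0 = alpha[0] * prior / beta[0][:, None]      # p(a_0|s_0, O_{0:T}), rows sum to 1
print(np.round(policy0, 3))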
Introducing Q^{soft} and V^{soft} functions
Log-scale messages
We can find analogues in the log-scale:
Q^{soft}(s_t, a_t) := \log\alpha_t(s_t, a_t)
V^{soft}(s_t) := \log\beta_t(s_t)
Substituting into the recursive relations, we obtain the following:
V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
soft maximum: it approximates the hard maximum as Q^{soft}(s_t, a_t) \rightarrow \infty
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]
kinda a Bellman equation
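A tiny numerical check (my addition, with made-up numbers) that \log\mathbb{E}_{p(a|s)}[\exp Q] behaves like a soft maximum and approaches the hard maximum as the scale of Q grows:

import numpy as np
from scipy.special import logsumexp

Q = np.array([1.0, 2.0, 3.5])
prior = np.full(len(Q), 1.0 / len(Q))      # uniform p(a|s)
for scale in [1, 10, 100]:
    soft = logsumexp(scale * Q, b=prior) / scale   # (1/scale) log E[exp(scale*Q)]
    print(scale, soft, Q.max())                    # soft -> 3.5 as scale grows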
Compare (Q^\star, V^\star) with (Q^{soft}, V^{soft})
Hard approach vs. Soft approach
"Hard" Q^\star and V^\star functions:
V^\star(s_t) =\max_{a_t} Q^\star(s_t, a_t)
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^\star(s_{t+1})
Q^\star(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1})
"Soft" analogues:
V^{soft}(s_t) =\log \mathbb{E}_{p(a_t|s_t)} [\exp Q^{soft}(s_t, a_t)]
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{p(s_{t+1}|s_t, a_t)} [\exp V^{soft}(s_{t+1})]
Q^{soft}(s_t, a_t) \approx r(s_t, a_t) + \max_{s_{t+1}} \max_{a_{t+1}} Q^{soft}(s_{t+1}, a_{t+1})
Note the extra \max_{s_{t+1}}: why are we so optimistic?
What we have done is the inference of the policy term p(a_t|s_t,\mathcal{O}_{t:T}),
which was taken from the formula for the distribution of optimal trajectories:
p(\tau|\mathcal{O}_{0:T}) = p(s_0|\mathcal{O}_{0:T})\prod_{t=0}^T p(a_t|s_t,\mathcal{O}_{t:T})\, p(s_{t+1}|s_t,a_t,\mathcal{O}_{t+1:T})
But what are the neighbors of the policy?
This policy is optimal only in the presence of the optimal dynamics p(s_{t+1}|s_t,a_t,\mathcal{O}_{t+1:T})!
Can we fix it?
Variational Inference
Approximate inference for achievable trajectories via VI
The trajectories \tau \sim p(\tau|\mathcal{O}_{0:T}) are not really achievable,
since they are based on the optimistic dynamics p(s_{t+1}|s_t, a_t, \mathcal{O}_{t+1:T}).
Our policy \pi, however, will be exploited with the prior dynamics:
q(\tau) = p(s_0)\prod_{t=0}^T \pi(a_t|s_t)p(s_{t+1}|s_t,a_t)
And we want the policy \pi to produce trajectories \tau \sim q(\tau)
which are as close as possible to the optimal trajectories \tau \sim p(\tau|\mathcal{O}_{0:T}).
This is a Variational Inference problem:
\text{KL}\big(q(\tau)\;||\;p(\tau|\mathcal{O}_{0:T})\big) \rightarrow \min_\pi
Variational Inference
Approximate inference for achievable trajectories via VI
Let us expand the VI objective using the definition of the KL divergence:
\min_\pi \text{KL}\big(q(\tau)||p(\tau|\mathcal{O}_{0:T})\big) = -\max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)\;p(\mathcal{O}_{0:T})} \;\;\Leftrightarrow\;\; \max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)} + \text{const}
\max_\pi \mathbb{E}_q \log \frac{p(\tau,\;\mathcal{O}_{0:T})}{q(\tau)}= \max_\pi \mathbb{E}_q \Big[ \log p(s_0)+\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + r(s_t, a_t) \big) -
- \log p(s_0)-\sum_{t} \big( \log p(s_{t+1}|s_t,a_t) + \log \pi(a_t| s_t) \big) \Big]=
= \max_\pi \mathbb{E}_\pi \sum_{t}\Big[ r(s_t, a_t) + \mathcal{H}\big( \pi(\cdot| s_t)\big) \Big]
this is the Maximum Entropy RL Objective
Returning to Q^{soft} and V^{soft} functions
Risk-neutral Soft approach
The objective from the previous slide can be rewritten as follows (check it yourself!):
\sum_{t=0}^T\mathbb{E}_{s_t} \Big[ -\text{KL}\Big(\pi(a_t|s_t)||\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}\Big) + V^{soft}(s_t) \Big] \rightarrow \max_\pi
Hence, the optimal policy is:
\pi(a_t|s_t) =\frac{\exp(Q^{soft}(s_t, a_t))}{\exp(V^{soft}(s_t))}
but with slightly changed Q^{soft} and V^{soft} functions:
V^{soft}(s_t) =\log \int \exp Q^{soft}(s_t, a_t) da_t
- a soft maximum
Q^{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{p(s_{t+1}|s_t, a_t)} V^{soft}(s_{t+1})
- the normal Bellman equation
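A sketch of the "check it yourself" step for the last timestep t = T (my own derivation, not spelled out on the slide); with Q^{soft}(s_T, a_T) = r(s_T, a_T) and V^{soft}(s_T) = \log\int\exp Q^{soft}(s_T, a_T)\,da_T:
\mathbb{E}_{\pi}\big[r(s_T,a_T) - \log\pi(a_T|s_T)\big] = \mathbb{E}_{\pi}\Big[\log\frac{\exp Q^{soft}(s_T,a_T)}{\pi(a_T|s_T)}\Big] = -\text{KL}\Big(\pi(\cdot|s_T)\,\Big|\Big|\,\frac{\exp Q^{soft}(s_T,\cdot)}{\exp V^{soft}(s_T)}\Big) + V^{soft}(s_T)
Earlier timesteps follow by applying the same manipulation backwards in time, which is where the expectation over p(s_{t+1}|s_t, a_t) in Q^{soft} comes from.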
RL as Inference with function approximators
- Maximum Entropy Policy Gradients
- Soft Q-learning: https://arxiv.org/abs/1702.08165
- Soft Actor-Critic: https://arxiv.org/abs/1801.01290
Soft Q-learning
RL as Inference with function approximators
Train a Q-network with parameters \phi:
\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\phi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi
(use a replay buffer)
where
V^{soft}_\phi(s_t) =\log \int \exp Q^{soft}_\phi(s_t, a_t) da_t
(for continuous actions, use Importance Sampling)
The policy is implicit:
\pi(a_t|s_t) = \exp\big(Q^{soft}_\phi(s_t, a_t) - V^{soft}_\phi(s_t)\big)
(for samples, use Stein Variational Gradient Descent or MCMC :D)
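A rough PyTorch sketch of this loss (my own simplification, not the authors' code): the Q-network q_net is assumed to be an MLP over concatenated state-action inputs, the importance-sampling proposal is uniform over a bounded action box, and target networks are omitted.

import torch

def v_soft(q_net, states, n_samples=32, action_dim=2, action_scale=1.0):
    # V_soft(s) = log E_{a~u}[exp Q(s,a) / u(a)] via importance sampling,
    # with u uniform on [-action_scale, action_scale]^action_dim (assumption).
    B = states.shape[0]
    a = (torch.rand(B, n_samples, action_dim) * 2 - 1) * action_scale
    s = states.unsqueeze(1).expand(-1, n_samples, -1)
    q = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)          # Q_phi(s, a_i)
    log_u = -action_dim * torch.log(torch.tensor(2.0 * action_scale))
    return torch.logsumexp(q - log_u, dim=1) - torch.log(torch.tensor(float(n_samples)))

def soft_q_loss(q_net, batch):
    s, a, r, s_next = batch                                   # tensors from a replay buffer
    q = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    with torch.no_grad():
        target = r + v_soft(q_net, s_next)                    # r(s,a) + V_soft(s')
    return ((q - target) ** 2).mean()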
Soft Q-learning




Soft Actor-Critic
RL as Inference with function approximators
Train Q^{soft}_\phi and V^{soft}_\psi networks jointly with the policy \pi_\theta.
Q-network loss:
\mathbb{E}_{(s_t,a_t, s_{t+1}) \sim \mathcal{D}} \Big[ Q^{soft}_\phi(s_t, a_t) - \Big( r(s_t, a_t) + V^{soft}_\psi(s_{t+1})\Big) \Big]^2\rightarrow \min_\phi
V-network loss:
\hat{V}^{soft}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) - \log\pi_\theta(a_t|s_t) \Big]
\mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \hat{V}^{soft}(s_t) - V^{soft}_\psi(s_{t}) \Big]^2 \rightarrow \min_\psi
Objective for the policy:
\mathbb{E}_{s_t \sim \mathcal{D}, \;a_t \sim \pi_\theta} \Big[ Q^{soft}_\phi(s_t, a_t) -\log\pi_\theta(a_t|s_{t})\Big] \rightarrow \max_\theta
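A rough PyTorch sketch of the three losses above (my own simplification, not the authors' implementation): q_net, v_net, and policy.sample_with_log_prob are assumed interfaces, and the target V-network and twin Q-networks from the paper are omitted.

import torch

def sac_losses(q_net, v_net, policy, batch):
    s, a, r, s_next = batch                            # tensors sampled from the replay buffer D

    # Q-network loss: [Q_phi(s,a) - (r + V_psi(s'))]^2
    q = q_net(s, a)
    with torch.no_grad():
        q_target = r + v_net(s_next)
    q_loss = ((q - q_target) ** 2).mean()

    # V-network loss: [V_psi(s) - E_{a~pi}[Q_phi(s,a) - log pi(a|s)]]^2
    a_pi, log_pi = policy.sample_with_log_prob(s)      # assumed policy API
    with torch.no_grad():
        v_target = q_net(s, a_pi) - log_pi
    v_loss = ((v_net(s) - v_target) ** 2).mean()

    # Policy objective (negated to get a loss): E[Q_phi(s,a) - log pi(a|s)]
    a_pi, log_pi = policy.sample_with_log_prob(s)      # reparameterized sample
    policy_loss = (log_pi - q_net(s, a_pi)).mean()

    return q_loss, v_loss, policy_loss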
Soft Actor-Critic
References
Soft Q-learning:
https://arxiv.org/pdf/1702.08165.pdf
Soft Actor Critic:
https://arxiv.org/pdf/1801.01290.pdf
Big Review on Probabilistic Inference for RL:
https://arxiv.org/pdf/1805.00909.pdf
Implementation in TensorFlow:
https://github.com/rail-berkeley/softlearning
Implementation in Catalyst.RL:
https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym
Hierarchical policies (further reading):
Thank you for your attention!