Reinforcement Learning as Probabilistic Inference
A report by Pavel Temirchev
Deep RL reading group
Based on the research of Sergey Levine's team
Motivation
The problems of standard RL:
1) Sample complexity!
2) Convergence to local optima
Idea: encourage the agent to explore all the promising strategies!
REMINDER: standard RL
Markov process:
Maximization problem:
Q-function:
Bellman equation (optimal Q-function):
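For reference, a sketch of the four formulas these headers point to, in the same order (standard definitions, notation as in Levine's tutorial; not copied verbatim from the slides):
$$p(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
$$\max_{\pi}\; \mathbb{E}_{\tau \sim p(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big]$$
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\Big[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\Big|\, s_t, a_t\Big]$$
$$Q^{*}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\big[\max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})\big]$$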
Maximum Entropy RL
Standard RL:
Policy "proportional" to Q:
How to find such a policy?
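In formulas (a sketch with an explicit temperature $\alpha$, which is a common convention rather than something taken from the slide):
$$\max_{\pi}\; \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t)}\Big[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big], \qquad \pi(a_t \mid s_t) \propto \exp\Big(\tfrac{1}{\alpha} Q(s_t, a_t)\Big)$$
Such an exponentiated-Q policy keeps probability mass on every action whose Q-value is close to the maximum, which is exactly the "explore all promising strategies" idea from the motivation.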
RL as Probabilistic Inference
Optimality:
RL:
Which actions will lead us to the optimal future?
Probabilistic Inference:
Which actions were made given that the future is optimal?
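The standard way to formalize "optimality" here (as in Levine's tutorial) is a binary variable $O_t$ attached to each timestep:
$$p(O_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big) \quad \text{(assuming rewards are non-positive, or rescaled, so this is a valid probability)}$$
RL then asks for actions that make the future optimal, while inference asks for the posterior $p(a_t \mid s_t, O_{t:T} = 1)$: the distribution of actions given that the future turned out to be optimal.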
Exact Probabilistic Inference
Let's find the optimal policy $\pi(a_t \mid s_t, O_{t:T} = 1)$,
where $\bar\pi(a_t \mid s_t)$ is a prior policy
(if $\bar\pi(a_t \mid s_t)$ is uniform, it can simply be dropped from the derivation).
Apply Bayes' rule:
$$\pi(a_t \mid s_t, O_{t:T}) = \frac{p(O_{t:T} \mid s_t, a_t)\, \bar\pi(a_t \mid s_t)}{p(O_{t:T} \mid s_t)}$$
Let's introduce new notation: the backward messages $\beta_t(s_t, a_t) = p(O_{t:T} \mid s_t, a_t)$ and $\beta_t(s_t) = p(O_{t:T} \mid s_t)$.
We can find all the $\beta_t(s_t, a_t)$ and $\beta_t(s_t)$ via a message-passing algorithm:
For the last timestep $T$:
Recursively, for $t < T$:
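The backward recursion these two steps refer to, reconstructed from the standard derivation:
$$\beta_T(s_T, a_T) = p(O_T \mid s_T, a_T) = \exp\big(r(s_T, a_T)\big)$$
$$\beta_t(s_t) = \mathbb{E}_{a_t \sim \bar\pi}\big[\beta_t(s_t, a_t)\big], \qquad \beta_t(s_t, a_t) = p(O_t \mid s_t, a_t)\; \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\beta_{t+1}(s_{t+1})\big]$$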
Exact Probabilistic Inference
Soft Q and V functions
We can find their analogues in the log-scale: $Q(s_t, a_t) := \log \beta_t(s_t, a_t)$ and $V(s_t) := \log \beta_t(s_t)$.
Recursively:
- $V$ is a soft maximum over actions
- $Q$ satisfies a "kinda" Bellman equation
(see the equations below)
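The two recursions spelled out (reconstructed from the standard derivation):
$$V(s_t) = \log \int \exp\big(Q(s_t, a_t)\big)\, da_t \quad \text{(soft maximum over actions)}$$
$$Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1}}\big[\exp V(s_{t+1})\big] \quad \text{(the "kinda" Bellman equation: the backup over next states is also a soft max, hence optimistic)}$$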
Soft and Hard Q and V functions
"Hard" Q and V functions:
"Soft" analogues:
What is being optimized?
Let's analyze the "exact variational inference" procedure.
It relates three quantities: the true conditional $p(\tau \mid O_{1:T})$, the joint $p(\tau, O_{1:T})$, and the evidence $p(O_{1:T})$ (see the identity below).
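The identity these three labels annotate (standard; $q(\tau)$ is any distribution over trajectories):
$$\log \underbrace{p(O_{1:T})}_{\text{evidence}} \;=\; \mathbb{E}_{q(\tau)}\Big[\log \underbrace{p(\tau, O_{1:T})}_{\text{joint}} - \log q(\tau)\Big] \;+\; D_{KL}\Big(q(\tau)\,\Big\|\,\underbrace{p(\tau \mid O_{1:T})}_{\text{true conditional}}\Big)$$
Exact inference corresponds to choosing $q(\tau) = p(\tau \mid O_{1:T})$, for which the KL term vanishes.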
What is being optimized?
where the joint ("exact") distribution is:
and the variational one is:
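Reconstructed from the standard derivation, the two distributions are (schematically):
$$p(\tau, O_{1:T}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \bar\pi(a_t \mid s_t)\, \exp\big(r(s_t, a_t)\big)$$
$$p(\tau \mid O_{1:T}) = p(s_1 \mid O_{1:T}) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t, O_{t+1:T})\; \pi(a_t \mid s_t, O_{t:T})$$
Note the conditioned, "optimistic" dynamics $p(s_{t+1} \mid s_t, a_t, O_{t+1:T})$ in the second line: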
we tried to find a policy which is optimal
only in an optimal environment!
We can fix this!
Variational Inference
The form of the variational distribution $q(\tau)$ is our choice.
Minimization problem for VI: $\min_{q} D_{KL}\big(q(\tau)\,\big\|\,p(\tau \mid O_{1:T})\big)$
$q(\tau)$ should be a distribution over
ACHIEVABLE trajectories:
fix the dynamics!
Then:
Maximum Entropy RL Objective
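Concretely (a sketch; a uniform action prior is assumed, so it only contributes a constant):
$$q(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t) \quad \text{(true dynamics, learned policy)}$$
$$\mathbb{E}_{q(\tau)}\big[\log p(\tau, O_{1:T}) - \log q(\tau)\big] \;=\; \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim q}\Big[r(s_t, a_t) + \mathcal{H}\big(q(\cdot \mid s_t)\big)\Big] + \text{const}$$
Minimizing the KL above is therefore the same as maximizing expected reward plus policy entropy: the Maximum Entropy RL objective.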
Variational Inference
Returning to the Q and V functions
This objective can be rewritten in terms of the soft Q and V functions (check it yourself!).
Then the optimal policy is
$q^{*}(a_t \mid s_t) = \exp\big(Q(s_t, a_t) - V(s_t)\big)$,
where $V$ is again a soft maximum over actions
and $Q$ now satisfies a normal Bellman equation (see below).
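The recursion behind those two bullets, reconstructed (note that the backup over next states is now a plain expectation under the true dynamics, not a soft max):
$$V(s_t) = \log \int \exp\big(Q(s_t, a_t)\big)\, da_t \quad \text{(soft maximum)}$$
$$Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[V(s_{t+1})\big] \quad \text{(normal Bellman equation)}$$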
VI with function approximators
(neural nets)
- Maximum Entropy Policy Gradients
- Soft Q-learning: https://arxiv.org/abs/1702.08165
- Soft Actor-Critic: https://arxiv.org/abs/1801.01290
Maximum Entropy Policy Gradients
Directly maximize the entropy-augmented objective
over policy parameters $\theta$.
For gradients, use the log-derivative trick:
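A sketch of the objective and its gradient (one common form; the constants that arise from differentiating the entropy are folded into the baseline $b(s_t)$, which is a choice of this write-up rather than a slide detail):
$$J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{\tau \sim \pi_\theta}\big[r(s_t, a_t) - \log \pi_\theta(a_t \mid s_t)\big]$$
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big(\sum_{t'=t}^{T} \big(r(s_{t'}, a_{t'}) - \log \pi_\theta(a_{t'} \mid s_{t'})\big) - b(s_t)\Big)\Big]$$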
Limitations:
- on-policy
- unimodal policies
Soft Q-learning
Train a Q-network with parameters $\theta$
(use a replay buffer),
where the target requires the soft value $V$ of the next state;
for continuous actions, estimate it with
importance sampling (see the sketch below).
The policy is implicit, $\pi(a \mid s) \propto \exp\big(Q_\theta(s, a)\big)$;
to draw samples from it use SVGD (Stein Variational Gradient Descent)
or MCMC :D
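As an illustration of the importance-sampling step, here is a minimal NumPy sketch of the soft value estimate $V(s) = \alpha \log \mathbb{E}_{a \sim q}\big[\exp(Q(s, a)/\alpha) / q(a \mid s)\big]$. The function name, the uniform proposal, and the toy numbers are assumptions of this note, not part of the talk or the Soft Q-learning codebase.

import numpy as np

def soft_value(q_values, log_proposal, alpha=1.0):
    """Importance-sampled soft value:
    V(s) = alpha * log E_{a ~ q}[ exp(Q(s, a) / alpha) / q(a | s) ].

    q_values     : Q(s, a_i) for actions a_i sampled from the proposal q(. | s)
    log_proposal : log q(a_i | s) for the same sampled actions
    """
    logits = q_values / alpha - log_proposal
    # numerically stable log-mean-exp
    m = logits.max()
    return alpha * (m + np.log(np.mean(np.exp(logits - m))))

# toy usage: 64 actions drawn from a uniform proposal on [-1, 1]^2
rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(64, 2))
log_q = np.full(64, -2.0 * np.log(2.0))   # log density of Uniform([-1, 1]^2) is log(1/4)
q_vals = -np.sum(actions ** 2, axis=1)    # stand-in for Q(s, a_i)
print(soft_value(q_vals, log_q, alpha=0.5))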
Soft Q-learning
Exploration
Robustness
Multimodal Policy
Soft Actor-Critic
Train Q- and V-networks jointly with the policy:
Q-network loss:
V-network loss:
Objective for the policy:
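Written out as in the SAC paper (arXiv:1801.01290), with parameters $\theta$, $\psi$, $\phi$ for the Q-network, V-network, and policy, a replay buffer $\mathcal{D}$, and a slowly updated target V-network $V_{\bar\psi}$ (the exact symbols are this note's choice):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma\, V_{\bar\psi}(s_{t+1})\big)^2\Big]$$
$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\big[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\big]\big)^2\Big]$$
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[D_{KL}\Big(\pi_\phi(\cdot \mid s_t)\,\Big\|\,\frac{\exp Q_\theta(s_t, \cdot)}{Z_\theta(s_t)}\Big)\Big]$$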
Thank you for your attention!
and visit our seminars at the RL Reading Group
telegram: https://t.me/theoreticalrl
REFERENCES:
Soft Q-learning:
https://arxiv.org/pdf/1702.08165.pdf
Soft Actor Critic:
https://arxiv.org/pdf/1801.01290.pdf
Big Review on Probabilistic Inference for RL:
https://arxiv.org/pdf/1805.00909.pdf
Implementation on TensorFlow:
https://github.com/rail-berkeley/softlearning
Implementation on Catalyst.RL:
https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym
Hierarchical policies (further reading):