Report made by
Pavel Temirchev
Deep RL Reading Group
based on the research of Sergey Levine's team
The problems of standard RL:
1) Sample Complexity!
2) Convergence to local optima
Idea: encourage the agent to investigate all the promising strategies!
Markov process:
$p(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
Maximization problem:
$\max_\pi \; \mathbb{E}_{\tau \sim p(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big]$
Q-function:
$Q^\pi(s_t, a_t) = \mathbb{E}\Big[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\Big|\, s_t, a_t\Big]$
Bellman equality (optimal Q-function):
$Q^*(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\Big[\max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})\Big]$
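For reference, a tiny value-iteration sketch that solves this Bellman equality on a toy tabular MDP (hypothetical numbers, discounted variant, purely illustrative):

import numpy as np

S, A, gamma = 4, 2, 0.9                       # toy state/action counts and discount
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] transition probabilities
r = rng.normal(size=(S, A))                   # rewards r(s, a)

Q = np.zeros((S, A))
for _ in range(200):
    V = Q.max(axis=1)                         # V(s) = max_a Q(s, a)
    Q = r + gamma * (P @ V)                   # Bellman backup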
Standard RL:
$\pi^*(a_t \mid s_t) = \mathbb{1}\big[a_t = \arg\max_{a} Q^*(s_t, a)\big]$
Policy "proportional" to Q:
$\pi(a_t \mid s_t) \propto \exp\big(Q(s_t, a_t)\big)$
How to find such a policy?
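A minimal sketch of sampling from such a softmax policy over tabular Q-values (names and numbers are illustrative):

import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    # pi(a | s) proportional to exp(Q(s, a) / temperature)
    logits = q_values / temperature
    logits -= logits.max()                    # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

probs = boltzmann_policy(np.array([1.0, 2.0, 0.5]))
action = np.random.default_rng(0).choice(len(probs), p=probs)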
Optimality:
RL:
Which actions will lead us to the optimal future?
Probabilistic Inference:
Which actions were made given that the future is optimal?
Let's find an optimal policy:
$\pi(a_t \mid s_t) = p(a_t \mid s_t, O_{t:T} = 1)$
where $p(a_t \mid s_t)$ - prior policy (assume it is uniform)
if $p(O_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big)$, then we can
apply Bayes rule!
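Spelled out, Bayes rule gives:
$\pi(a_t \mid s_t) = p(a_t \mid s_t, O_{t:T} = 1) = \dfrac{p(O_{t:T} = 1 \mid s_t, a_t)\, p(a_t \mid s_t)}{p(O_{t:T} = 1 \mid s_t)}$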
Let's introduce new notation (backward messages):
$\beta_t(s_t, a_t) = p(O_{t:T} = 1 \mid s_t, a_t)$
$\beta_t(s_t) = p(O_{t:T} = 1 \mid s_t)$
We can find all the $\beta_t(s_t, a_t)$ and $\beta_t(s_t)$ via the Message Passing algorithm:
For the last timestep $t = T$:
$\beta_T(s_T, a_T) = p(O_T = 1 \mid s_T, a_T) = \exp\big(r(s_T, a_T)\big)$
Recursively, for $t < T$:
$\beta_t(s_t) = \int \beta_t(s_t, a_t)\, p(a_t \mid s_t)\, da_t$
$\beta_t(s_t, a_t) = \exp\big(r(s_t, a_t)\big)\, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\beta_{t+1}(s_{t+1})\big]$
$\log \int \exp(\cdot)\, da_t$ - soft maximum
the recursion for $\beta_t(s_t, a_t)$ - kinda Bellman equation
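A minimal numeric sketch of this backward message passing on a hypothetical tabular MDP (uniform action prior assumed):

import numpy as np

S, A, T = 4, 2, 5                             # toy sizes and horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] transition probabilities
r = rng.normal(size=(S, A))                   # rewards r(s, a)

beta_sa = np.exp(r)                           # beta_T(s, a) = p(O_T = 1 | s, a) = exp(r)
for t in reversed(range(T - 1)):
    beta_s = beta_sa.mean(axis=1)             # beta(s) = E_{a ~ uniform prior}[beta(s, a)]
    beta_sa = np.exp(r) * (P @ beta_s)        # beta_t(s, a) = exp(r) * E_{s'}[beta_{t+1}(s')]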
We can find analogues in the log-scale:
"Hard" Q and V functions:
$Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\big[V(s_{t+1})\big], \qquad V(s_t) = \max_{a_t} Q(s_t, a_t)$
"Soft" analogues:
$Q(s_t, a_t) = \log \beta_t(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1}}\big[\exp V(s_{t+1})\big], \qquad V(s_t) = \log \beta_t(s_t) = \log \int \exp\big(Q(s_t, a_t)\big)\, da_t$
Let's analyze an "exact variational inference" procedure:
$\underbrace{p(\tau \mid O_{1:T} = 1)}_{\text{true conditional}} = \underbrace{p(\tau, O_{1:T} = 1)}_{\text{joint}} \,\big/\, \underbrace{p(O_{1:T} = 1)}_{\text{evidence}}$
where the joint ("exact") distribution is:
$p(\tau, O_{1:T} = 1) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \exp\big(r(s_t, a_t)\big)$
and the variational one is:
$q(\tau) = q(s_1) \prod_t q(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t)$
If we match the true conditional exactly, the dynamics get conditioned on optimality as well:
we tried to find a policy which is optimal
only in an optimal environment!
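The reason: in the exact posterior, the dynamics are conditioned on optimality too and become optimistic:
$p(s_{t+1} \mid s_t, a_t, O_{t+1:T} = 1) = \dfrac{p(s_{t+1} \mid s_t, a_t)\, \beta_{t+1}(s_{t+1})}{\mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\beta_{t+1}(s_{t+1})\big]}$
i.e. transitions get reweighted towards lucky next states that the agent cannot actually control.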
We can fix this!
The form of the variational distribution $q(\tau)$ is our choice
Minimization problem for VI:
$\min_q D_{KL}\big(q(\tau)\,\big\|\,p(\tau \mid O_{1:T} = 1)\big)$
$q(\tau)$ is a distribution over
ACHIEVABLE trajectories
fix the dynamics! $q(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$, $q(s_1) = p(s_1)$, only $q(a_t \mid s_t) = \pi(a_t \mid s_t)$ stays free
Then:
$q(\tau) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$
Maximum Entropy RL Objective:
$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim q}\big[r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big]$
This objective can be rewritten as follows:
check it yourself!
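One way to do the check: the dynamics terms of $q(\tau)$ and $p(\tau, O_{1:T} = 1)$ coincide and cancel, so
$J(\pi) = \mathbb{E}_{\tau \sim q}\Big[\sum_t r(s_t, a_t) - \log \pi(a_t \mid s_t)\Big] = \mathbb{E}_{\tau \sim q}\big[\log p(\tau, O_{1:T} = 1) - \log q(\tau)\big] = -D_{KL}\big(q(\tau)\,\big\|\,p(\tau, O_{1:T} = 1)\big)$
so maximizing it is the same (up to the constant $\log p(O_{1:T} = 1)$) as minimizing the KL from the VI problem above.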
Then the optimal policy is:
$\pi^*(a_t \mid s_t) = \exp\big(Q_{soft}(s_t, a_t) - V_{soft}(s_t)\big)$
where
$V_{soft}(s_t) = \log \int \exp\big(Q_{soft}(s_t, a_t)\big)\, da_t$ - soft maximum
$Q_{soft}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[V_{soft}(s_{t+1})\big]$ - normal Bellman equation (plain expectation over the real dynamics, no optimism)
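Numerically, the only change relative to ordinary value iteration is max replaced by log-sum-exp (toy discounted variant again, purely illustrative):

import numpy as np

S, A, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s']
r = rng.normal(size=(S, A))

Q = np.zeros((S, A))
for _ in range(200):
    V = np.log(np.exp(Q).sum(axis=1))         # soft maximum instead of max
    Q = r + gamma * (P @ V)                   # ordinary Bellman backup over the true dynamics
pi = np.exp(Q - np.log(np.exp(Q).sum(axis=1, keepdims=True)))   # pi = exp(Q - V)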
(neural nets)
Directly maximize the entropy-augmented objective
over policy parameters $\phi$:
$J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\Big[\sum_t r(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\Big]$
For gradients, use the log-derivative trick:
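Written out, one standard REINFORCE-style form of this gradient (ignoring baselines and causality refinements) is:
$\nabla_\phi J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\Big[\Big(\sum_t \nabla_\phi \log \pi_\phi(a_t \mid s_t)\Big)\Big(\sum_t r(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\Big)\Big]$
(the extra term from differentiating $-\log \pi_\phi$ inside the expectation vanishes, since the expected score is zero)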
Train a Q-network with parameters $\theta$:
$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - r(s_t, a_t) - V(s_{t+1})\big)^2\Big]$
use a replay buffer $\mathcal{D}$
where
$V(s_{t+1}) = \log \int \exp\big(Q_\theta(s_{t+1}, a_{t+1})\big)\, da_{t+1}$
for continuous actions estimate this integral with
importance sampling
The policy is implicit: $\pi(a_t \mid s_t) \propto \exp\big(Q_\theta(s_t, a_t)\big)$
for samples use SVGD
or MCMC :D
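A sketch of the importance-sampled soft value estimate (the Q-network, proposal sampler and proposal log-density below are hypothetical callables, not the reference code):

import numpy as np

def soft_value(q_fn, state, sample_action, log_q_prime, n=64):
    # V(s) = log E_{a ~ q'}[ exp(Q(s, a)) / q'(a | s) ], estimated from n proposal samples
    actions = [sample_action(state) for _ in range(n)]
    log_w = np.array([q_fn(state, a) - log_q_prime(state, a) for a in actions])
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())   # numerically stable log-mean-exp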
Exploration
Robustness
Multimodal Policy
Train Q- and V-networks jointly with the policy
Q-network loss:
$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma\, V_{\bar\psi}(s_{t+1})\big)^2\Big]$
V-network loss:
$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\big[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\big]\big)^2\Big]$
Objective for the policy:
$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[D_{KL}\Big(\pi_\phi(\cdot \mid s_t)\,\Big\|\, \exp\big(Q_\theta(s_t, \cdot)\big)\big/Z_\theta(s_t)\Big)\Big]$
($\bar\psi$ - slowly updated target parameters for the V-network)
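A minimal PyTorch sketch of these three losses (the network and batch objects are hypothetical placeholders, not the reference softlearning code):

import torch
import torch.nn.functional as F

def sac_losses(q_net, v_net, v_target_net, policy, batch, gamma=0.99):
    # batch of transitions sampled from the replay buffer (tensors)
    s, a, r, s_next, done = batch

    # Q-network loss: match r + gamma * V_target(s')
    with torch.no_grad():
        q_backup = r + gamma * (1.0 - done) * v_target_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), q_backup)

    # fresh action and its log-probability from the reparameterized policy
    a_new, log_pi = policy(s)

    # V-network loss: match E_{a ~ pi}[Q(s, a) - log pi(a | s)]
    with torch.no_grad():
        v_backup = q_net(s, a_new) - log_pi
    v_loss = F.mse_loss(v_net(s), v_backup)

    # policy objective: KL(pi || exp(Q)/Z) up to a constant; step only the policy optimizer on it
    pi_loss = (log_pi - q_net(s, a_new)).mean()
    return q_loss, v_loss, pi_loss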
Thank you for your attention!
and visit the seminars of our RL Reading Group
telegram: https://t.me/theoreticalrl
REFERENCES:
Soft Q-learning:
https://arxiv.org/pdf/1702.08165.pdf
Soft Actor Critic:
https://arxiv.org/pdf/1801.01290.pdf
Big Review on Probabilistic Inference for RL:
https://arxiv.org/pdf/1805.00909.pdf
Implementation on TensorFlow:
https://github.com/rail-berkeley/softlearning
Implementation on Catalyst.RL:
https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym
Hierarchical policies (further reading):