Artyom Sorokin | 13 Apr
Value-Based (Off-Policy, Discrete Actions):
DQN
Many DQN variants

Policy-Based (On-Policy / Close to On-Policy, Continuous Actions):
Policy Gradient: REINFORCE/PG+, A2C/A3C
TRPO
PPO
Where are my off-policy methods for continuous actions?
Value-Based methods learn to estimate \(Q(s,a)\)
How to get the best action from the \(Q\)-values? (see the formula below)
Huge number of actions \(\rightarrow\) expensive
Continuous action space \(\rightarrow\) solve an optimization problem at every step \(t\)
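Spelled out, the greedy action selection these bullets refer to is:

\[
a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)
\]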
Recall Policy Gradients (A2C version):
No need to estimate values for every \((s, a)\) pair
Simplest advantage function estimate (see the sketch below)
Select actions from a parametrized distribution family, e.g., Gaussian
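A minimal sketch of the two formulas referenced above, assuming a learned value function \(V_{\phi}\) and actor \(\pi_{\theta}\):

\[
\nabla_{\theta} J(\pi_{\theta}) \approx \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right],
\qquad
\hat{A}(s_t, a_t) = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)
\]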
Screenshot from the GAE paper (arxiv.org/abs/1506.02438):
Which \(\Psi_{t}\) is suitable for continuous action spaces?
All of them!
In the discrete case we can just output all \(Q(s,a)\) values at once:
But we are interested only in the \(Q\)-value for the particular \(a_t\)!
Continuous case: estimate the \(Q\)-value only for the selected action \(a_t\) (see the sketch below)
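To illustrate the difference, a minimal sketch of the two critic architectures (layer sizes and class names are illustrative):

```python
import torch
import torch.nn as nn

# Discrete case: a single forward pass outputs Q(s, a) for every action.
class DiscreteQNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, s):
        return self.net(s)  # shape: [batch, n_actions]

# Continuous case: the action is part of the input,
# so the critic scores only the (s, a) pair it is given.
class ContinuousQNet(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))  # shape: [batch, 1]
```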
Conclusions:
Continuous case: Estimate \(Q\)-values only for the selected action \(a_t\)
One problem: PG is still on-policy!
Idea:
Problems:
Use deterministic policy ¯\_(ツ)_/¯
Yes! For deterministic policy :)
Now we can rewrite the RL objective (recall the PG lecture):
Let's start with some new definitions:
Deterministic Policy Gradient Theorem:
where,
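A sketch of the statement in the DPG paper's notation, where \(\rho^{\mu}\) denotes the discounted state visitation distribution under \(\mu_{\theta}\):

\[
J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ r\big(s, \mu_{\theta}(s)\big) \right]
\]

\[
\nabla_{\theta} J(\mu_{\theta})
= \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\, ds
= \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)} \right]
\]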
For more details, see the DPG paper and its supplementary materials
Notice that we can also rewrite \(J(\mu_{\theta})\) as:
First, we want to prove that:
Step-by-step annotations for the derivation:
DPG at the first step
Expectation over \(s'\)
\(V\)-value at the next step
Leibniz integral rule: swap the gradient and the integral
Linearity of gradients and integration
By the definition of the \(Q\)-value
By the definition of \(p(s \rightarrow s', k, \mu_{\theta})\)
With this proven, we can recursively apply the formula at every step:
until all the \(\nabla_{\theta} V^{\mu_{\theta}}\) terms are gone!
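A sketch of the recursion being unrolled here, following the DPG paper's supplementary material:

\[
\nabla_{\theta} V^{\mu_{\theta}}(s)
= \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu_{\theta}}(s, a)\big|_{a=\mu_{\theta}(s)}
+ \int_{\mathcal{S}} \gamma\, p(s \rightarrow s', 1, \mu_{\theta})\, \nabla_{\theta} V^{\mu_{\theta}}(s')\, ds'
\]

Iterating this substitution gives:

\[
\nabla_{\theta} V^{\mu_{\theta}}(s)
= \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^{t}\, p(s \rightarrow s', t, \mu_{\theta})\,
\nabla_{\theta} \mu_{\theta}(s')\, \nabla_{a} Q^{\mu_{\theta}}(s', a)\big|_{a=\mu_{\theta}(s')}\, ds'
\]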
Now return to:
And substitute \(\nabla_{\theta} V^{\mu_{\theta}}\) with our new formula:
Leibniz integral rule
Fubini's theorem to swap the order of integration
Off-policy Policy Gradient theorem (arxiv.org/abs/1205.4839):
We can learn DPG off-policy:
Importance sampling
recall the TRPO and PPO derivations... (see the sketch below)
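A sketch of both gradients with a behaviour policy \(\beta\): the stochastic off-policy policy gradient needs an importance ratio over actions, while the deterministic version does not, because the expectation over actions disappears:

\[
\nabla_{\theta} J_{\beta}(\pi_{\theta}) \approx
\mathbb{E}_{s \sim \rho^{\beta},\, a \sim \beta}\!\left[
\frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}\,
\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \right]
\]

\[
\nabla_{\theta} J_{\beta}(\mu_{\theta}) \approx
\mathbb{E}_{s \sim \rho^{\beta}}\!\left[
\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)} \right]
\]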
Mods inspired by DQN: experience replay buffer and target networks
Let's combine DPG with Deep Learning and create DDPG! (a minimal update step is sketched below)
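A minimal sketch of one DDPG update step, assuming an actor/critic pair with target copies and a replay-buffer batch (all names here are illustrative, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG gradient step on a replay-buffer batch."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer

    # Critic: regress Q(s, a) onto a bootstrapped target built with the target nets.
    with torch.no_grad():
        q_next = critic_target(s_next, actor_target(s_next))
        target = r + gamma * (1.0 - done) * q_next
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # DQN-inspired part: slowly track both target networks (Polyak averaging).
    for net, target_net in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(net.parameters(), target_net.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```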
DDPG has many problems! Let's try to fix them!
Fix #1
Clipped Double Q-learning [ this is the "twin" part ], sketched below:
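A minimal sketch of the clipped double-Q target, assuming two target critics and a target actor (names are illustrative):

```python
import torch

def clipped_double_q_target(critic1_t, critic2_t, actor_t, s_next, r, done, gamma=0.99):
    """Pessimistic TD target: bootstrap from the MINIMUM of the two target critics.
    Both critics are then regressed onto this same target."""
    with torch.no_grad():
        a_next = actor_t(s_next)
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        return r + gamma * (1.0 - done) * q_next
```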
Fix #2
Delayed update of Target and Policy Networks:
From the authors' code of TD3 (the delayed-update logic is sketched below):
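This is not the authors' verbatim code; a sketch of the idea under assumed helper names:

```python
def td3_delayed_update(step, policy_delay, update_actor, update_targets):
    """Update the actor and all target networks only once every `policy_delay`
    critic updates (policy_delay = 2 in the TD3 paper), so the actor always
    trains against critics that have had time to settle."""
    if step % policy_delay == 0:
        update_actor()      # gradient step on -Q1(s, actor(s))
        update_targets()    # Polyak-average actor and critic target networks
```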
Fix #3
Target Policy Smoothing [ the only fix with no reference in the name :( ], sketched below:
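A minimal sketch of target policy smoothing (defaults follow the TD3 paper; the function name and `max_action` handling are illustrative):

```python
import torch

def smoothed_target_action(actor_target, s_next, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Add clipped Gaussian noise to the target policy's action, so the critic
    target averages over a small neighbourhood instead of a single point."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        return (a_next + noise).clamp(-max_action, max_action)
```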
Max Entropy Objective:
Regular RL Objective:
Backup Diagram:
Add "soft" to every value function
Maximum Entropy Bellman Operator (sketched below):
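A sketch of both objects in the SAC paper's formulation, where \(\alpha\) is the entropy temperature:

\[
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
\]

\[
\mathcal{T}^{\pi} Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[ V(s_{t+1}) \right],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]
\]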
Why does a softmax over \(Q\)-values improve the policy?
Concave objective (so we can find the global maximum):
Take the partial derivative w.r.t. the action probability:
Set the derivative to 0 and find the maximum:
The only problem: \(\pi(a|s)\) must be a probability distribution (worked out below):
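A sketch of that calculation for a single state \(s\), with temperature \(\alpha\) (set \(\alpha = 1\) for an unscaled entropy bonus):

\[
\max_{\pi(\cdot \mid s)} \sum_{a} \pi(a \mid s)\big( Q(s,a) - \alpha \log \pi(a \mid s) \big)
\quad \text{s.t.} \quad \sum_{a} \pi(a \mid s) = 1
\]

\[
\frac{\partial}{\partial \pi(a \mid s)}\Big[ \cdots + \lambda \big(\textstyle\sum_{a'} \pi(a' \mid s) - 1\big) \Big]
= Q(s,a) - \alpha \log \pi(a \mid s) - \alpha + \lambda = 0
\;\;\Rightarrow\;\;
\pi(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{\sum_{a'} \exp\big(Q(s,a')/\alpha\big)}
\]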
Policy Evaluation doesn't need to change:
Let's upgrade Soft Policy Iteration to continuous action spaces!
Policy Improvement step for a continuous policy:
The critic approximates Policy Evaluation and trains with an MSE loss:
The actor approximates Policy Improvement and trains with the following loss (both losses are sketched below):
where:
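A sketch of the two losses in the SAC paper's notation: \(Q_{\phi}\) is the critic, \(\bar{\phi}\) its target parameters, \(\pi_{\theta}\) the actor, \(\mathcal{D}\) the replay buffer, and \(a' = f_{\theta}(\epsilon; s')\) the reparametrized action sample:

\[
J_{Q}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[
\Big( Q_{\phi}(s,a) - \big( r + \gamma\, \mathbb{E}_{a' \sim \pi_{\theta}}\big[ Q_{\bar{\phi}}(s', a') - \alpha \log \pi_{\theta}(a' \mid s') \big] \big) \Big)^{2} \right]
\]

\[
J_{\pi}(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, \epsilon \sim \mathcal{N}}\!\left[
\alpha \log \pi_{\theta}\big( f_{\theta}(\epsilon; s) \mid s \big) - Q_{\phi}\big( s, f_{\theta}(\epsilon; s) \big) \right]
\]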
Implementation Details:
For environments with continuous action spaces, start from either TD3 or SAC
Graphs in the TD3 paper: TD3 > SAC
Graphs in the SAC paper: SAC > TD3