Deep RL Family Tree:

DQN

Many DQN variants

Policy Gradient

A2C/A3C

REINFORCE/PG+

TRPO

PPO

Value-Based

Policy-Based

Deep RL Family Tree:

Off-Policy

DQN

Many DQN variants

Policy Gradient

A2C/A3C

REINFORCE/PG+

TRPO

PPO

Close to On-Policy

On-Policy

Value-Based

Policy-Based

Deep RL Family Tree:

Off-Policy

DQN

Many DQN variants

Policy Gradient

A2C/A3C

REINFORCE/PG+

TRPO

PPO

Close to On-Policy

On-Policy

Value-Based

Policy-Based

Discrete Actions:

Continuous Actions

Where are my off-policy methods for continuous actions?

Problem with Value-Based Methods

Value-Based methods learn to estimate \(Q(s,a)\)

How to get best action action from \(Q\)-values?

a_{best} = argmax_{a'} Q(s,a')

huge number of actions \(\rightarrow\) expensive

continuous action space \(\rightarrow\) solve optimization problem at each \(t\)

Why PG-based methods work:

Recall Policy Gradients (A2C version):

No need to estimate for (s, a) pairs

Simplest advantage function estimate:

\nabla_{\theta} J(\theta) \approx \textcolor{black}{\frac{1}{N} \sum\limits_{i=1}^{N}} \biggl[ \sum\limits_{t=0}^T \nabla_{\theta}\,log\, \pi_{\theta}(a_{\textcolor{black}{i},t}|s_{\textcolor{black}{i},t}) \textcolor{black}{A_{\pi_{\theta}}(s_{i,t}, a_{i,t})} \biggr]

A_{\pi_{\theta}}(a_t,s_t) \approx r_t + V_{\pi_{\theta}}(s_{t+1}) - V_{\pi_{\theta}}(s_t)

Select actions from parametrized distribution family, e.g, Gaussian

Many version of PG formula:

Screen from GAE paper(arxiv.org/abs/1506.02438):

Which \(\Psi_{t}\) is suitable for continuous action spaces?

All of them!

A2C with Q-values

In discrete case we can just output all \(Q(s,a)\):

But we are interested only in \(Q\)-values for particular \(a_t\)!

A2C: Q-values + Continuous Actions

Continuous case: Estimate \(Q\)-values only for the selected action \(a_t\)

Conclusions:

Explicit policy allows to omit the problem with \(argmax_{a}\) over all actions
PG theorem proofs that this is a correct algorithm

Continuous case: Estimate \(Q\)-values only for the selected action \(a_t\)

One problem: PG is still on-policy!

DPG: Intro

Idea:

Learn Q-values off-policy with 1-step TD-updates
Gradients \(\nabla_{a} Q_{\phi}(s,a)\) tell how to change the action \(a \sim \pi(s)\) to increase \(Q\)-value
Update actor with these gradients!

Problems:

Can't flow gradients via sampled actions!
Do we have any justifications for this?

Use deterministic policy ¯\_(ツ)_/¯

Yes! For deterministic policy :)

Deterministic Policy Gradients:

Now we can rewrite RL objective (Recall PG lecture):

Lets start for new definitions:

\(\rho_1(s)\): The initial distribution over states
\(\rho(s \rightarrow s', k, \mu)\): starting from \(s\), the visitation probability density at \(s'\) after moving k steps with policy \(\mu\)
\(\rho^{\mu}(s')\): Discounted state distribution, defined as:

\textcolor{red}{\rho^{\mu}(s')} = \int_{\mathbb{S}} \sum^{\infty}_{k=0} \gamma^{k-1} \textcolor{green}{\rho_{1}(s)} \textcolor{blue}{\rho(s \rightarrow s',k, \mu)} ds

J(\mu_\theta) = \mathbb{E}_{\tau \sim \mu_{\theta}(\tau)} [ \sum\limits_t \gamma^{t} r(s_t, a_t)] = \int_{\mathbb{\tau}} p_{\mu_{\theta}}(\tau) \sum\limits_t \gamma^{t} r(s_t, a_t)\, d \tau

= \textcolor{red}{\int_{\mathbb{S}} \rho^{\mu_{\theta}}(s)} \int_{\mathbb{A}} \mu_{\theta}(s|a) r(s,a)\, da\,ds = \mathbb{E}_{\textcolor{red}{s \sim \rho^{\mu_{\theta}}}, a \sim \mu_{\theta}} [r(s_t, a_t)]

Deterministic Policy Gradients:

Deterministic Policy Gradient Theorem:

where,

\(\rho_1(s)\): the initial distribution over states
\(\rho(s \rightarrow s', k, \mu)\): starting from \(s\), the visitation probability density at \(s'\) after moving k steps with policy \(\mu\)
\(\rho^{\mu}(s')\): Discounted state distribution, defined as:

\textcolor{red}{\rho^{\mu}(s')} = \int_{\mathbb{S}} \sum^{\infty}_{k=1} \gamma^{k-1} \textcolor{green}{\rho_{1}(s)} \textcolor{blue}{\rho(s \rightarrow s',k, \mu)} ds

\nabla_{\theta} J(\mu_{\theta}) = \textcolor{red}{\int_{\mathbb{S}} \rho^{\mu}(s)} \nabla_{\theta} \mu_{\theta}(s) \nabla_a Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)} \textcolor{red}{\,ds}

= \textcolor{red}{\mathbb{E}_{s \sim \rho^{\mu}}} [ \nabla_{\theta} \mu_{\theta}(s) \nabla_a Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)} ]

Deterministic Policy Gradients:

TODO! Choose one of Two types of Explanation:

Var1:

DPG theorem proof

Var2:

PG theorem with "improper" discounted \(\rho^{\pi}(s)\) as it easier
Show part of DPG Theorem until we start telescoping
Then DPG is similiar with PG only with integrals

DPG Proof V1

For more details see: DPG paper and supplementary materials

Notice that we also can rewrite \(J(\mu_{\theta})\) as:

J(\mu_\theta) = \mathbb{E}_{\tau \sim \mu_{\theta}(\tau)} [ \sum\limits_t \gamma^{t} r(s_t, a_t)] = \textcolor{green}{\int_{\mathbb{S}} p_1(s)} V^{\mu_{\theta}}(s)\, ds

And first we want to prove that:

DPG at first step

expectation over s'

V-value at next step

DPG Proof V1

Leibniz integral rule:

swap grad and integral

same

DPG Proof V1

linearity of gradients and integration

By Q-value definition!

by definition of \(p(s \rightarrow s', k, \mu_{\theta})\)

Deterministic Policy Gradient V1

This is proven:

Now we can recursively iterate with this formula for all steps:

until all \(\nabla _{\theta}V^{\mu_{\theta}}\) are gone!

Deterministic Policy Gradient V1

Now return to:

And substitute \(\nabla_{\theta} V^{\mu_{\theta}}\) with our new formula:

J(\mu_\theta) = \mathbb{E}_{\tau \sim \mu_{\theta}(\tau)} [ \sum\limits_t \gamma^{t} r(s_t, a_t)] = \textcolor{green}{\int_{\mathbb{S}} p_1(s)} V^{\mu_{\theta}}(s)\, ds

Leibniz integral rule

Fibini's theorem to swap integration order

\textcolor{red}{\rho^{\mu}(s')} = \int_{\mathbb{S}} \sum^{\infty}_{k=1} \gamma^{k-1} \textcolor{green}{\rho_{1}(s)} \textcolor{blue}{\rho(s \rightarrow s',k, \mu)} ds

Deterministic Policy Gradient:

Off-policy Policy Gradient theorem (arxiv.org/abs/1205.4839):

We can learn DPG off-policy:

DPG version of Off-Policy PG Theorem allows to ignore importance sampling correction
Learning Q-function with 1-step TD-learning allows to learn critic off-policy

Imporance sampling

recall TRPO, PPO derviations...

Deep Deterministic Policy Gradient

Mods inspired by DQN:

Add noise(Ornstein-Uhlenbeck or Gaussian) to the sampling policy:

Replay Buffer as in DQN
Separate "deep" networks for Actor and Critic
Target Actor and target Critic
Update target networks with moving average:

Lets combine DPG with Deep Learning and create DDPG!