dhfromkorea(at)gmail.com

## Outline

1. Motivation
2. A3C - theory
3. A3C - implementation
4. A3C with GPU
5. Conclusion

## Q Approximation with NN

• approximating Q function with Deep Neural Network
• unstable out of the box:
• correlated samples (breaks i.i.d. assumed for SGD)
• non-stationarity (a butterfly effect on policy/value updates)
• correlation between target and action values
• no convergence guarantee
• high variance as the price of expressive power
$Q(s,a)\approx Q_{\theta} (\phi(s), a)$
1. Motivation

## DQN (Mnih, 2013/2015)

• stabilized through Experience Replay and Target Network
• achieved what were then state-of-the-art results on many Atari games

Keep a separate target Q (a copy of the original Q) and synchronize it with the original Q periodically.

$y_i^{DQN}=r_t+\gamma \max_{a'}{Q(s_{t+1},a';\theta^{-})}$

Replay Memory helps de-correlate transition samples and increases data efficiency (recent samples stay alive).

$\delta_i=y_i^{DQN}-Q(s_t,a_t;\theta)$
$\Delta \theta =\alpha \delta_i \nabla_{\theta}Q(s_t,a_t;\theta)$
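The target and TD-error update on this slide can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a tabular Q (one weight per state-action pair, so the gradient of Q with respect to its own entry is 1), and the names `dqn_update`, `GAMMA`, `ALPHA` and the `done` flag are made up for the example.

```python
# Tabular stand-in for Q(s, a; theta): q[s][a] is the weight for that pair.
GAMMA = 0.99  # discount factor (assumed value)
ALPHA = 0.1   # learning rate (assumed value)


def dqn_update(q, q_target, s, a, r, s_next, done):
    """Apply one TD update to q in place and return the TD error delta."""
    # y = r + gamma * max_a' Q(s', a'; theta^-), read from the frozen target copy.
    bootstrap = 0.0 if done else GAMMA * max(q_target[s_next])
    y = r + bootstrap
    # delta = y - Q(s, a; theta), read from the online table.
    delta = y - q[s][a]
    # Delta theta = alpha * delta * grad Q; the gradient is 1 for a table entry.
    q[s][a] += ALPHA * delta
    return delta


q = [[0.0, 0.0], [1.0, 2.0]]
q_target = [row[:] for row in q]  # target network: a periodically synced copy
delta = dqn_update(q, q_target, s=0, a=1, r=1.0, s_next=1, done=False)
```

Only `q` changes here; `q_target` stays fixed until the next synchronization, which is what keeps the target from chasing itself.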

## DQN (Mnih, 2013/2015)

• Problems
• inherently off-policy (hard to do SARSA or actor-critic)
• more memory and computation required to stabilize the controller
• maximization bias: addressed by Double DQN
• does not separate advantage from state value: addressed by Dueling DQN
• long training time / poor scalability
• took a week to train Breakout (Mnih, 2015)

Why off-policy? Because a transition sampled from the replay memory may have been generated by an outdated policy.

## General RL Architecture (Gorila, Nair 2015)

global and/or local buffer

multiple environments, actors/learners, and replay memories.

num. of actors == num. of learners == num. of replay buffers


asynchronous SGD

basically, scalable DQN

## General RL Architecture (Gorila, Nair 2015)

• 130 machine instances: 100 processes for learners and actors, plus a massive gorila-like(!) model distributed across 30 instances
• shorter training time: cut the week-long Atari training of single-GPU DQN roughly in half (about 3-4 days, surprise?)
• still has issues:
• memory and computation intensive
• still only off-policy (want to use actor-critic)

Motivation of A3C: we want parallelizable RL algorithms that support on-policy as well as off-policy.


2. A3C: theory

$J(\theta) = \mathbb{E}_{\tau}[\sum\limits_{t=0}^{T}{r_t}]$
• We want more rewards.
• Rewards depend on actions.
• Actions depend on policy.
• Policy is parameterized by $$\theta$$
• We must intelligently update $$\theta$$ to get more rewards, usually with gradient ascent algorithms.
• Then...how do we compute the gradient?

The policy gradient theorem lets us estimate the policy gradient in a model-free way: sample the gradients of the log action probabilities and scale each by some measure of reward.

$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau}[\hat{g_i}] \approx \frac{1}{N} \sum\limits_{i=1}^N {\hat{g_i}}$
$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi(a_t|s_t) \psi_t$
$\Delta \theta = \alpha \nabla_{\theta} J(\theta)$

Policy gradient (ascent vector): in which direction and by what magnitude should we move the policy in the weight ($$\theta$$) space?

reward signal $$\psi_t$$ can be expressed in various flavors; we will discuss this point later.
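As a rough sketch of the sampled estimator, the code below averages the score-function term over N trajectories for a softmax policy. It assumes a state-free (bandit-style) policy whose parameters are the logits themselves, so the gradient of the log probability is `one_hot(a) - pi`; the function names and the toy trajectories are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]


def policy_gradient_estimate(logits, trajectories):
    """(1/N) * sum_i sum_t grad log pi(a_t) * psi_t over N sampled trajectories."""
    n = len(trajectories)
    g = [0.0] * len(logits)
    for traj in trajectories:
        for action, psi in traj:
            for i, gi in enumerate(grad_log_pi(logits, action)):
                g[i] += gi * psi / n
    return g


theta = [0.0, 0.0]                # uniform policy over two actions
trajs = [[(0, 1.0)], [(1, 0.0)]]  # each step is (a_t, psi_t)
g = policy_gradient_estimate(theta, trajs)
```

With these toy samples the estimate points toward raising the logit of action 0, the only action that earned a positive signal.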


## PG interpretation

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi(a_t|s_t) \psi_t$
$=\sum\limits_{t=0}^{T}\nabla_{\theta}\pi(a_t|s_t) \frac{\psi_t}{\pi(a_t|s_t)}$
$\psi_t \approx A^{\pi}(s_t,a_t) \approx Q^{\pi}(s_t,a_t) \approx \sum\limits_{l=0}^{\infty}{\delta_{t+l}^{V}}$

$$\nabla_{\theta}\pi(a_t|s_t)$$ gives the steepest direction that increases the probability of taking action $$a_t$$ in the neighborhood of the current $$\theta$$.

Dividing by the scalar probability $$\pi(a_t|s_t)$$ re-scales the reward signal $$\psi_t$$ so that high-probability actions get smaller weight updates than low-probability actions. This is fair because high-probability actions are chosen more often.

$$\psi_t$$: how good or bad was the action $$a_t$$ at $$s_t$$ (or how much better or worse than average)?


## PG interpretation

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi(a_t|s_t) \psi_t$
$=\sum\limits_{t=0}^{T}\nabla_{\theta}\pi(a_t|s_t) \frac{\psi_t}{\pi(a_t|s_t)}$

Notice that a sum of vectors is another vector. It acts as a summary vector that assigns credit to our actions over the whole trajectory.

Intuition: this gradient suggests a direction to make high-reward actions more probable for each state.


## variance of $$\psi_t$$

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi(a_t|s_t) \psi_t$

The choice of $$\psi_t$$ is critical. If chosen badly, the controller will converge slowly, if at all, because of high variance (a large number of samples is needed).

We want the absolute value of $$\psi_t$$ to be small. Recall Var(X) $$= \mathbb{E}[X^2] - \mathbb{E}[X]^2$$


The key is to choose one that reduces variance with acceptable bias.

## choice of $$\psi_t$$: bias-variance

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi(a_t|s_t) \psi_t$

$$\psi_t$$ can be:

• $\sum\limits_{t=0}^{\infty}{r_t}$: REINFORCE (Monte Carlo)
• $\sum\limits_{t=0}^{\infty}{r_t-b(s_t)}$: REINFORCE with baseline
• $Q^{\pi}(s,a)$: Q actor-critic
• $A^{\pi}(s,a)$: advantage actor-critic
• ${r_t + V^\pi(s_{t+1}) - V^\pi(s_t)} = \delta_t^{V}$: TD actor-critic

Baseline is a control variate that does not shift the mean of the gradient estimator (remaining unbiased). If the baseline is highly correlated with the reward, it can reduce variance.
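The control-variate claim can be checked exactly on a toy problem. The sketch below assumes a two-armed bandit with a softmax policy over logits and looks at one component of the gradient estimator; it enumerates both actions instead of sampling, so the mean and variance are exact. The rewards, the baseline value and the function name are illustrative choices, not from the slides.

```python
def estimator_stats(pi, rewards, baseline):
    """Exact mean and variance of d/dlogit_0 log pi(a) * (r(a) - b) under a ~ pi."""
    samples = []
    for a, (p, r) in enumerate(zip(pi, rewards)):
        # Score function of a softmax policy w.r.t. logit 0: 1[a == 0] - pi_0.
        score = (1.0 if a == 0 else 0.0) - pi[0]
        samples.append((p, score * (r - baseline)))
    mean = sum(p * x for p, x in samples)
    var = sum(p * (x - mean) ** 2 for p, x in samples)
    return mean, var


pi = [0.5, 0.5]
rewards = [10.0, 11.0]
mean_raw, var_raw = estimator_stats(pi, rewards, baseline=0.0)
mean_b, var_b = estimator_stats(pi, rewards, baseline=10.5)
```

Both settings give the same mean (the baseline does not bias the gradient), while the variance drops sharply once the baseline sits close to the typical reward.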


TD error/residual can be used to estimate advantage function. (Schulman, 2016)

The TD residual seems to be the go-to choice for $$\psi$$.


## actor and critic

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi_\theta(a_t|s_t) \psi_{\theta^V}$

Actor improves the policy parameters $$\theta$$ as suggested by the critic. The actor hopes to minimize the surrogate entropy-regularized policy loss function:

$L_\pi(\theta)= -\log{\pi_\theta(a|s)}A_{\theta^V}^\pi(s,a)-\beta H(\pi_\theta(\cdot|s))$

Critic evaluates the actor's policy. In practice, it updates the state-value function parameters $$\theta^{V}$$. The critic hopes to minimize the value loss function:

$L_v(\theta^V)= \mathbb{E}[ (R_t -V_{\theta^V}(s_t))^2 ]$

If $$\theta$$ and $$\theta^V$$ share parameters, the gradient update for the actor and the critic can be done in one go:

$L(\theta) = L_\pi(\theta) + L_v(\theta^V)$
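A per-sample version of the combined loss might look like the sketch below. It is written as a quantity to minimize, so the actor term and the entropy bonus carry a minus sign; that sign convention, the `value_coef` weight, the default `beta` and the use of `R_t - V(s_t)` as the advantage estimate are assumptions of this example.

```python
import math


def a3c_loss(probs, action, value, ret, beta=0.01, value_coef=1.0):
    """Return (total, policy_loss, value_loss) for one transition."""
    # Advantage estimate; in a real implementation it is treated as a constant
    # when differentiating the actor term.
    advantage = ret - value
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Minimizing this raises log pi for positive advantages and keeps entropy high.
    policy_loss = -(math.log(probs[action]) * advantage + beta * entropy)
    value_loss = (ret - value) ** 2
    return policy_loss + value_coef * value_loss, policy_loss, value_loss


total, lp, lv = a3c_loss([0.5, 0.5], action=0, value=1.0, ret=2.0)
```

With shared parameters, a single backward pass through `total` would update both heads at once.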

## actor and critic

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi_\theta(a_t|s_t) \psi_{\theta^V}$

Actor: I know what to do!

Critic: hey man, the action you took was stupid!

$L_\pi(\theta)= -\log{\pi_\theta(a|s)}A_{\theta^V}^\pi(s,a)-\beta H(\pi_\theta(\cdot|s))$

Notice actor's loss now depends on what critic says.

Since we're using policy gradients, we still have convergence properties better than purely value-based methods.

$L(\theta) = L_\pi(\theta) + L_v(\theta^V)$

Intuitively, feature representations for actors and critics must overlap a lot, so people just share parameters in practice.


## using advantage for $$\psi_t$$

$A(s_t,a_t)=Q^{\pi, \gamma}(s_t, a_t)- V^{\pi, \gamma}(s_t)$

total profit - opportunity cost*

• almost yields the lowest variance possible (Schulman, 2016)
• intuition: measures how much this action is better or worse than the default action chosen by the policy.
• Since this measures better-than-average values, advantage can be negative. This property allows us to explicitly decrease action probability for bad actions.
• Advantage is unknown; therefore it needs to be estimated.
• If $$Q^{\pi, \gamma}(s_t,a_t) = V^{\pi, \gamma}(s_t)$$ for every action, i.e. $$A(s_t,a_t) = 0$$, the choice of action does not matter. (Motivation of the Dueling network)

opportunity cost*

Schulman, 2016, HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

• okay, I want to use $$A^\pi(s,a)$$ for $$\psi$$...
• well, $$A^\pi(s,a)$$ is unknown, so we must estimate it with $$\hat{A}^\pi(s,a)$$ !
• Under certain conditions (Schulman, 2016), the following can be used:
$Q^{\pi, \gamma}(s_t,a_t)$
$Q^{\pi, \gamma}(s_t,a_t) - V^{\pi,\gamma}(s_t)$
$r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t) = \delta_t^{V}$
$\sum\limits_{t=0}^{\infty}{\gamma^t r_t}$

Let's use the TD residual $$\delta_t^{V}$$!

$\hat{A}^{(1)} := \delta_t^V$
$= - V^{\pi,\gamma}(s_t) + r_t + \gamma V^{\pi,\gamma}(s_{t+1})$
$\hat{A}^{(2)} := \delta_{t}^V + \gamma \delta_{t+1}^V$
$= - V^{\pi,\gamma}(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V^{\pi,\gamma}(s_{t+2})$
$...$
$\hat{A}^{(k)} := \sum\limits_{l=0}^{k-1} \gamma^l \delta_{t+l}^V$
$= - V^{\pi,\gamma}(s_t) + r_t + \gamma r_{t+1} + ... + \gamma^{k-1} r_{t+k-1} + \gamma^k V^{\pi,\gamma}(s_{t+k})$

Similar to how we computed the target in n-step TD learning, we agonize over which k to choose. (bias-variance...)
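The telescoping identity behind the k-step estimator is easy to check numerically. The sketch below computes it both ways, as a discounted sum of TD residuals and as the k-step return minus the value of the starting state. It assumes `values` holds one more entry than `rewards` (the value of the state after the last reward); the numbers are arbitrary.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every step of the rollout."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def k_step_advantage(rewards, values, gamma, t, k):
    """A^(k) as the discounted sum of k TD residuals starting at step t."""
    deltas = td_residuals(rewards, values, gamma)
    return sum(gamma ** l * deltas[t + l] for l in range(k))


def k_step_advantage_direct(rewards, values, gamma, t, k):
    """A^(k) as -V(s_t) + k-step discounted reward + gamma^k * V(s_{t+k})."""
    ret = sum(gamma ** l * rewards[t + l] for l in range(k))
    return -values[t] + ret + gamma ** k * values[t + k]


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # V(s_0) .. V(s_3)
gamma = 0.9
a_sum = k_step_advantage(rewards, values, gamma, t=0, k=3)
a_direct = k_step_advantage_direct(rewards, values, gamma, t=0, k=3)
```

The intermediate value terms cancel pairwise, which is why the two forms agree.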


We estimate the advantage function as a weighted sum of discounted TD prediction errors. This approach is analogous to $$TD(\lambda)$$ that estimates value function.
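One common way to realize this weighted sum is the generalized advantage estimator of Schulman, 2016, computed with a backward recursion. The sketch below is a truncated version over a finite rollout; the function name and the toy numbers are assumptions, and `values` again has one more entry than `rewards`.

```python
def gae(rewards, values, gamma, lam):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l}, truncated at the rollout end."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]
adv_td = gae(rewards, values, gamma=0.9, lam=0.0)  # one-step TD residuals
adv_mc = gae(rewards, values, gamma=0.9, lam=1.0)  # full-rollout estimate
```

Setting `lam` to 0 recovers the one-step TD residual (low variance, more bias) and 1 recovers the longest available k-step estimate (less bias, more variance), mirroring the role of lambda in TD(lambda).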

Why not just use $$Q(s,a)$$? See the FAQ in Schulman, 2016.

## Finally, A3C.


recall: we wanted parallelizable RL algorithms that support on-policy as well as off-policy: something simpler, easier, faster, cheaper(?) and more flexible than DQN or Gorila.

## Finally, A3C.

Multiple instances of the same environment.

Multiple agents (actors) interacting in parallel.


parallel scheme decorrelates states

## Finally, A3C.

Each agent holds a local copy of the same global policy (master model). An actor may choose a slightly different exploration tactic.

A policy gradient is computed by each local agent (learner/trainer) and sent to the master asynchronously.

asynchronous


## on a multi-core CPU

https://www.slideshare.net/ssuser07aa33/introduction-to-a3c-model


actor (master)

critic (master)

Basically, Gorila minus the GPU and the replay buffer.

## Finally, A3C.


simpler because there's no replay buffer, easier because it runs on CPU.

faster because it's asynchronous.

cheaper because it can distribute across a single machine.

more flexible because it supports on-policy.

## Asynchronous Stochastic Gradient Descent (ASGD)

• ASGD is for speed, not for accuracy.
• Asynchronous gradient updates can, in practice, cause instability: the gradient computed by an agent may no longer be valid if another agent updates the master policy before that gradient reaches it. Some of the literature calls this a 'delayed gradient' (a.k.a. policy lag).
• With ASGD, strictly speaking, on-policy actor-critic becomes slightly off-policy.
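The update pattern, and where the delay comes from, can be illustrated with threads. The sketch below is a toy and not A3C itself: a single scalar stands in for the master weights, a quadratic loss replaces the actor-critic loss, and a lock guards each write (an assumption made to keep the update count exact; it is not something the slides prescribe). All names are hypothetical.

```python
import threading


class GlobalModel:
    """Shared scalar parameter standing in for the master policy weights."""

    def __init__(self, theta):
        self.theta = theta
        self.updates = 0
        self.lock = threading.Lock()

    def apply_gradient(self, grad, lr):
        # Only the write is guarded; reads by workers may already be stale.
        with self.lock:
            self.theta -= lr * grad
            self.updates += 1


def worker(model, steps, lr, target):
    for _ in range(steps):
        local_theta = model.theta            # sync the local copy
        grad = 2.0 * (local_theta - target)  # gradient of (theta - target)^2
        # Other workers may have moved model.theta by now: a delayed gradient.
        model.apply_gradient(grad, lr)


model = GlobalModel(theta=0.0)
threads = [
    threading.Thread(target=worker, args=(model, 200, 0.05, 3.0))
    for _ in range(4)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Between the read of `model.theta` and the call to `apply_gradient`, the master may change, which is the policy lag described above; with a small learning rate the toy still settles near the target.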

Mnih, 2016, Asynchronous Methods for Deep Reinforcement Learning

3. A3C: implementation

In the experiment, RMSProp was used.

Advantage estimation: n-step TD prediction errors with up to $$t_{max}$$ steps were used.

CNN + FF/LSTM: softmax output for actor, linear output for critic.
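For a rollout of up to t_max steps, the returns and advantages can be filled in with a backward pass that starts from the critic's value of the last state. This is a minimal sketch under the assumption that the bootstrap value is 0 for a terminal state; the helper name and the numbers are illustrative.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """R_i = r_i + gamma * R_{i+1}, seeded with V(s_last) (0 if terminal)."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [0.0, 0.0, 1.0]  # a rollout of t_max = 3 steps
values = [0.2, 0.4, 0.7]   # critic outputs V(s_t) for the visited states
returns = n_step_returns(rewards, bootstrap_value=0.5, gamma=0.9)
advantages = [R - v for R, v in zip(returns, values)]
```

The first state in the rollout gets the longest look-ahead and the last state only a one-step estimate, so a single rollout mixes several values of k.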

## Experiments

https://www.slideshare.net/ssuser07aa33/introduction-to-a3c-model

1. actor and critic parameters were shared.
2. features extracted from input frames using CNN for both FF and LSTM.

## Experiments

DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores.


## make A3C faster with Gee-PU?

• Notice there was no replay buffer in the original CPU A3C. RL and A3C by design are largely sequential.
• Without mechanisms like Experience Replay, there's not much batch workload to compute gradients/predictions with GPU. GPU would stay idle mostly waiting for transition samples to be sent from agents.
• Okay, what if we can batch action selection (actor) and gradient (learner) tasks so GPU constantly has work to do?
4. improve A3C with GPU

## GA3C


Agents receive an action without making the action prediction themselves.

Agents send transition samples without computing gradients themselves.

There's only one copy of the policy. No synchronization is needed.

If the training queue is not empty when the model gets updated, the gradients of the queued samples can go stale.
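The queue layout can be sketched with the standard library. This is a toy under several assumptions: a fixed rule stands in for the single GPU model, there is one predictor thread, the trainer is omitted (the samples are simply collected at the end), and names such as `prediction_q`, `training_q` and `reply_qs` are invented for the example rather than taken from the GA3C code.

```python
import queue
import threading

prediction_q = queue.Queue()  # (agent_id, state) requests from agents
training_q = queue.Queue()    # (state, action, reward) samples from agents
reply_qs = {i: queue.Queue() for i in range(3)}
STOP = object()


def policy(state):
    # Hypothetical stand-in for the one shared model.
    return state % 2


def predictor():
    while True:
        item = prediction_q.get()
        if item is STOP:
            break
        batch = [item]
        # Drain whatever else is waiting so the model sees one batch.
        while True:
            try:
                nxt = prediction_q.get_nowait()
            except queue.Empty:
                break
            if nxt is STOP:
                prediction_q.put(STOP)  # handle it on the next outer loop
                break
            batch.append(nxt)
        for agent_id, state in batch:
            reply_qs[agent_id].put(policy(state))


def agent(agent_id, steps):
    state = agent_id
    for _ in range(steps):
        prediction_q.put((agent_id, state))
        action = reply_qs[agent_id].get()     # the agent never runs the model
        reward = 1.0 if action == 0 else 0.0  # toy environment
        training_q.put((state, action, reward))
        state += 1


predictor_thread = threading.Thread(target=predictor)
predictor_thread.start()
agents = [threading.Thread(target=agent, args=(i, 5)) for i in range(3)]
for th in agents:
    th.start()
for th in agents:
    th.join()
prediction_q.put(STOP)
predictor_thread.join()

samples = []
while not training_q.empty():
    samples.append(training_q.get())
```

A trainer thread would drain `training_q` in batches the same way the predictor drains `prediction_q`; any sample still queued when the model changes is what can go stale.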

Asynchronous

## GA3C: evaluation metric


trainings per second (TPS)

predictions per second (PPS)

## GA3C


still not a conclusive victory... (policy lag..)


## mitigating the policy lag (stale gradient)

GPU

synchronous update

one agent interacts with multiple env. instances to generate a large batch of samples.

basically synchronous GA3C.

Efficient Parallel Methods for Deep RL, Clemente, 2017


## Clemente, 2017

Batch size is $$n_e t_{max}$$. Notice we are averaging the estimated gradients of all agents; therefore, synchronous.
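To make the batch-size arithmetic concrete, the sketch below steps n_e toy environments in lockstep for t_max steps and then averages over the whole batch in one synchronous update. The environment, the helper names and the use of the reward as a scalar stand-in for a per-sample gradient are all assumptions for illustration.

```python
def collect_batch(n_e, t_max, step_fn):
    """Step n_e environments in lockstep for t_max steps; return n_e * t_max samples."""
    states = list(range(n_e))
    batch = []
    for _ in range(t_max):
        # One batched forward pass would pick all n_e actions at this point.
        transitions = [step_fn(s) for s in states]
        batch.extend(transitions)
        states = [next_state for (_, _, next_state) in transitions]
    return batch


def average_gradient(per_sample_grads):
    """Synchronous update: every sample contributes to one averaged gradient."""
    return sum(per_sample_grads) / len(per_sample_grads)


def toy_step(state):
    # Hypothetical environment: reward 1 on even states, then move to state + 1.
    return state, 1.0 if state % 2 == 0 else 0.0, state + 1


batch = collect_batch(n_e=4, t_max=5, step_fn=toy_step)
# Rewards are used here only as scalar stand-ins for per-sample gradients.
grad = average_gradient([reward for (_, reward, _) in batch])
```

Because every sample in the batch comes from the same version of the policy, no gradient is stale when the averaged update is applied.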

Quad-core i7 intel CPU + Nvidia GTX 980 Ti GPU


## Clemente, 2017

• roughly the same architecture as Mnih, 2016 (CNN+FF, RMSProp).
• the gif on this slide shows a model I personally trained: 15 hours of training on an Nvidia GTX 1050. It began to achieve 400+ scores after 50 million time steps.
• implementation available at: https://github.com/Alfredvc/paac

## A3C rocks, folks!

• You can't run Gorila on your modest laptop setup but you can with A3C!
• Recently PGQL was proposed which beat A3C. It's like A2C + Experience Replay.
• A3C implementations can be found at:
• https://github.com/openai/universe-starter-agent
• https://github.com/miyosuda/async_deep_reinforce
• https://github.com/NVlabs/GA3C
• https://github.com/Alfredvc/paac
5. Conclusion

## To be continued...

12M timesteps

= 48M frames of experience

total reward gained: 0.00

$\hat{g_i}= \sum\limits_{t=0}^{T} \nabla_{\theta} \log\pi_\theta(a_t|s_t) \psi_{\theta^V}$

reward's too sparse...