Asynchronous
Advantage
Actor-Critic
dhfromkorea(at)gmail.com
Outline
 Motivation
 A3C: theory
 A3C: implementation
 A3C with GPU
 Conclusion
Q Approximation with NN
 approximating Q function with Deep Neural Network

unstable out of the box:
 correlated samples (breaks the i.i.d. assumption behind SGD)
 nonstationarity (a butterfly effect on policy/value updates)
 correlation between target and action values
 no convergence guarantee
 high variance as the price of expressive power
1. Motivation
DQN (Mnih, 2013/2015)
 stabilized through Experience Replay and a Target Network
 achieved then state-of-the-art results on many Atari games
Target Network: keep a separate target Q (a copy of the original Q) and synchronize it with the original Q periodically.
Replay Memory: helps decorrelate transition samples and increases data efficiency (past samples stay available for reuse).
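The two stabilizers can be sketched in a few lines. This is a minimal illustration, not the DQN implementation: the class name `ReplayMemory`, the helper `sync_target`, the period `SYNC_EVERY` and the dict-based "parameters" are all hypothetical stand-ins.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)


def sync_target(online_params, target_params):
    """Hard update: copy the online Q parameters into the target Q."""
    target_params.clear()
    target_params.update(online_params)


memory = ReplayMemory(capacity=1000)
online_q = {"w": 0.0}      # stand-in for the online Q-network weights
target_q = dict(online_q)  # the separate copy used to compute targets
SYNC_EVERY = 100           # hypothetical synchronization period

for step in range(1, 251):
    memory.push((step, 0, 1.0, step + 1, False))  # dummy transition
    online_q["w"] += 0.01                         # stand-in for a gradient step
    if step % SYNC_EVERY == 0:
        sync_target(online_q, target_q)

batch = memory.sample(32)
```

Between synchronizations the target stays frozen, so the regression target does not move with every gradient step.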

Problem
 inherently off-policy (hard to do SARSA or actor-critic)
 more memory and computation required to stabilize the controller.
 maximization bias (addressed by Double DQN)
 does not differentiate advantage from state values (addressed by Dueling DQN)

long training time / scalability
 took a week to train Breakout (Mnih, 2015)
Why off-policy? Because a sample drawn from the replay memory may have come from an outdated policy.
General RL Architecture (Gorila, Nair 2015)
global and/or local buffer
multiple environments, actors/learners, replay memories.
num. of actors == num. of learners == num. of replay buffers
asynchronous SGD
basically, scalable DQN
 130 machine instances: 100 processes for learners and actors + a massive gorilla-like(!) model distributed across 30 instances
 short training time: cut the week-long Atari training of single-GPU DQN roughly in half (about 3-4 days; surprised?)

still has issues:
 memory and computation intensive
 still only off-policy (we want to use actor-critic)
Motivation of A3C: we want parallelizable RL algorithms that support on-policy as well as off-policy learning.
Asynchronous
Advantage
Actor-Critic
2. A3C: theory
policy gradient: basic story
 We want more rewards.
 Rewards depend on actions.
 Actions depend on policy.
 Policy is parameterized by \( \theta \).
 We must intelligently update \( \theta \) to get more rewards, usually with gradient ascent algorithms.
 Then...how do we compute the gradient?
policy gradient theorem
The policy gradient theorem lets us estimate the policy gradient in a model-free way: sample the gradients of the log action probabilities and scale each by some measure of reward.
policy gradient (ascent vector): in which direction, and by what magnitude, should we move the policy in the weight (\( \theta \)) space?
reward signal \( \psi_t \) can be expressed in various flavors; we will discuss this point later.
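For reference, a common way to write the estimator that the theorem licenses, in the style of Schulman, 2016. The trajectory symbol \( \tau \), the horizon \( T \) and the sample count \( N \) are notation assumed here, not taken from the slide.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \psi_t^{(i)} \, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)
```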
PG interpretation
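Since \( \nabla_\theta \log \pi_\theta(a_t \mid s_t) = \nabla_\theta \pi_\theta(a_t \mid s_t) / \pi_\theta(a_t \mid s_t) \), each term of the gradient has three parts:
 \( \nabla_\theta \pi_\theta(a_t \mid s_t) \): the steepest direction that increases the probability of taking action \( a_t \) in the neighborhood of the current \( \theta \).
 \( 1 / \pi_\theta(a_t \mid s_t) \): this scalar rescales the reward signal \( \psi_t \) so that high-probability actions get smaller weight updates than low-probability ones. High-probability actions are chosen more often, so this is a fair treatment.
 \( \psi_t \): how good or bad was action \( a_t \) at \( s_t \) (or how much better or worse than average)?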
Notice that a sum of vectors is itself a vector. It acts as a summary vector that assigns credit to our actions over the whole trajectory.
Intuition: this gradient suggests a direction that makes high-reward actions more probable in each state.
Asynchronous
Advantage
Actor-Critic
variance of \(\psi_t\)
The choice of \( \psi_t \) is critical. If chosen badly, the controller will converge slowly, if at all, because of high variance (a large number of samples is needed).
We want the absolute value of this term to be small. Recall \( \mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \): terms of large magnitude inflate the second moment.
The key is to choose one that reduces variance with acceptable bias.
choice of \(\psi_t\): bias-variance
\( \psi_t \) can be:
REINFORCE (Monte Carlo)
REINFORCE with baseline
Q actor-critic
Advantage actor-critic
TD actor-critic
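The formulas behind these labels did not survive on the slide. The standard textbook forms are given below as a reference; the symbols \( G_t \), \( b(s_t) \) and \( \delta_t \) are notation assumed here.

```latex
\begin{aligned}
\text{REINFORCE (Monte Carlo):} \quad & \psi_t = G_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} \\
\text{REINFORCE with baseline:} \quad & \psi_t = G_t - b(s_t) \\
\text{Q actor-critic:} \quad & \psi_t = Q^{\pi}(s_t, a_t) \\
\text{Advantage actor-critic:} \quad & \psi_t = A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \\
\text{TD actor-critic:} \quad & \psi_t = \delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)
\end{aligned}
```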
Baseline is a control variate that does not shift the mean of the gradient estimator (remaining unbiased). If the baseline is highly correlated with the reward, it can reduce variance.
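A tiny exact calculation makes the control-variate claim concrete. The setup is a hypothetical two-armed bandit with a uniform softmax policy and rewards 10 and 12; nothing here comes from the slides beyond the idea itself.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(probs, action):
    # For a softmax policy, d/d(theta_i) log pi(a) = 1[i == a] - pi(i).
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of the first gradient component over actions."""
    samples = [(rewards[a] - baseline) * grad_log_pi(probs, a)[0]
               for a in range(len(probs))]
    mean = sum(p * g for p, g in zip(probs, samples))
    second_moment = sum(p * g * g for p, g in zip(probs, samples))
    return mean, second_moment - mean ** 2


probs = softmax([0.0, 0.0])   # uniform policy over two actions
rewards = [10.0, 12.0]        # hypothetical rewards
mean_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=mean_reward)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

Both estimators have the same mean (-0.5), but subtracting the mean reward as a baseline drops the variance from 30.25 to 0 in this toy case.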
The TD error (residual) can be used to estimate the advantage function (Schulman, 2016).
This seems to be the go-to measure for \( \psi \).
Asynchronous
Advantage
Actor-Critic
actor and critic
The actor improves the policy parameters \( \theta \) as suggested by the critic. The actor tries to minimize the surrogate, entropy-regularized policy loss function:
The critic evaluates the actor's policy. In practice, it updates the state-value function parameters \( \theta^{V} \). The critic tries to minimize the value loss function:
If \( \theta \) and \( \theta^V\) share parameters, the gradient updates for the actor and the critic can be done in one go:
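The three equations were images on the slide. A commonly used form, in the spirit of Mnih, 2016, is sketched below; the entropy weight \( \beta \), the value-loss weight \( c_V \), the n-step return \( R_t \) and the advantage estimate \( \hat{A}_t \) are assumed notation.

```latex
\begin{aligned}
L_{\pi}(\theta) &= -\,\mathbb{E}\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + \beta\, H\big(\pi_\theta(\cdot \mid s_t)\big) \right] \\
L_{V}(\theta^{V}) &= \mathbb{E}\left[ \big( R_t - V_{\theta^{V}}(s_t) \big)^2 \right] \\
L(\theta, \theta^{V}) &= L_{\pi}(\theta) + c_V\, L_{V}(\theta^{V})
\end{aligned}
```

In \( L_{\pi} \), \( \hat{A}_t \) is treated as a constant, so no gradient flows from the policy loss into the critic through the advantage.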
Actor: I know what to do!
Critic: hey man, the action you took was stupid!
Notice that the actor's loss now depends on what the critic says.
Since we're using policy gradients, we still have better convergence properties than purely value-based methods.
Intuitively, the feature representations for the actor and the critic must overlap a lot, so people just share parameters in practice.
Asynchronous
Advantage
Actor-Critic
using advantage for \( \psi_t \)
\( A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \): total profit - opportunity cost*
 yields almost the lowest possible variance (Schulman, 2016)
 intuition: measures how much better or worse this action is than the default action chosen by the policy.
 Since this measures better-than-average values, the advantage can be negative. This property allows us to explicitly decrease the action probability for bad actions.
 The advantage is unknown; therefore it needs to be estimated.
 If \( Q^{\pi, \gamma}(s_t,a_t) = V^{\pi, \gamma}(s_t) \) for every action, i.e. \( A^{\pi, \gamma}(s_t,a_t) = 0 \), the choice of action does not matter. (Motivation of Dueling network)
opportunity cost*
Generalized Advantage Estimation (GAE)
Schulman, 2016, HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION
 okay, I want to use \( A^\pi(s,a)\) for \(\psi\)...
 well, \( A^\pi(s,a)\) is unknown, so we must estimate it with \( \hat{A}^\pi(s,a) \) !
 Under certain conditions (Schulman, 2016), the following can be used:
let's use this!
Similar to how we computed the target in n-step TD learning, we agonize over which k to choose (the bias-variance trade-off again).
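For reference, the k-step advantage estimator from Schulman, 2016, written with the TD residual \( \delta_t \) (the superscript notation \( \hat{A}_t^{(k)} \) follows that paper):

```latex
\begin{aligned}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(k)} &= \sum_{l=0}^{k-1} \gamma^{l}\, \delta_{t+l}
  = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^{k} V(s_{t+k})
\end{aligned}
```

A small k leans on the (possibly wrong) value estimate, giving low variance but more bias; a large k leans on sampled rewards, giving less bias but more variance.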
We estimate the advantage function as an exponentially weighted sum of discounted TD prediction errors. This approach is analogous to \( TD(\lambda) \), which estimates the value function.
Why not just use \(Q(s,a)\)? See the FAQ in Schulman, 2016.
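The weighted sum can be computed with a single backward pass over a rollout. The sketch below is a plain-Python illustration; the function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions, not taken from the slides.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` must have len(rewards) + 1 entries: V(s_0) ... V(s_T), where the
    last entry bootstraps the tail (use 0.0 if s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


example = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=0.5)
```

With `lam=0` this reduces to the one-step TD residual; with `lam=1` it becomes the discounted return minus the value baseline.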
Finally, A3C.
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
recall: we wanted parallelizable RL algorithms that support on-policy as well as off-policy learning: something simpler, easier, faster, cheaper(?) and more flexible than DQN or Gorila.
Multiple instances of the same environment.
Multiple agents (actors) interacting in parallel.
The parallel scheme decorrelates states and adds diversity in experience.
Each actor holds a local copy of the same global policy (master model). Each actor may choose a slightly different exploration tactic.
A policy gradient is computed by each local agent (learner/trainer) and sent to the master model asynchronously.
asynchronous
on a multi-core CPU
https://www.slideshare.net/ssuser07aa33/introductiontoa3cmodel
[Figure: thread 1 ... thread N; actor (master); critic (master)]
Basically, Gorila without the GPU and the replay buffer.
simpler because there's no replay buffer; easier because it runs on CPU.
faster because it's asynchronous.
cheaper because it runs across the cores of a single machine.
more flexible because it supports on-policy learning.
Asynchronous Stochastic Gradient Descent (ASGD)
 ASGD is for speed, not for accuracy.
 Basically, a gradient update happens whenever a new gradient has been computed.
 In practice, asynchronous gradient updates can cause instability: the gradient computed by one agent may no longer be valid if another agent updates the master policy before that gradient reaches it. Some of the literature calls this a 'delayed gradient' (a.k.a. policy lag).
 With ASGD, strictly speaking, on-policy actor-critic becomes slightly off-policy.
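The effect of a delayed gradient can be reproduced without any threads. The sketch below is a deterministic toy, not A3C itself: it minimizes \( f(w) = w^2/2 \) while every gradient is applied `delay` updates after it was computed. The function names and the chosen learning rates are illustrative assumptions.

```python
from collections import deque


def delayed_sgd(delay, lr, steps=200, w0=1.0):
    """Gradient descent on f(w) = 0.5 * w**2 with gradients that arrive late."""
    w = w0
    # Gradients computed on old snapshots, still waiting to be applied.
    # Since grad f(w) = w, storing the stale w is storing the stale gradient.
    in_flight = deque([w0] * delay)
    trajectory = [w]
    for _ in range(steps):
        in_flight.append(w)               # a worker reads the current master weights
        stale_grad = in_flight.popleft()  # the master applies the oldest gradient
        w -= lr * stale_grad
        trajectory.append(w)
    return trajectory


def tail_size(trajectory, last=20):
    """Largest |w| over the final updates: near 0 if converged, huge if diverged."""
    return max(abs(w) for w in trajectory[-last:])


fresh = tail_size(delayed_sgd(delay=0, lr=0.5))           # no lag: converges
stale = tail_size(delayed_sgd(delay=5, lr=0.5))           # same step size, lagged: oscillates and diverges
stale_small_lr = tail_size(delayed_sgd(delay=5, lr=0.1))  # lagged, smaller step: converges again
```

The same step size that is safe with fresh gradients becomes unstable once the gradients lag behind the parameters, which is one reason asynchronous methods need conservative step sizes.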
Mnih, 2016, Asynchronous Methods for Deep Reinforcement Learning
3. A3C: implementation
In the experiment, RMSProp was used.
Advantage estimation: n-step TD prediction errors, with up to \( t_{max} \) steps, were used.
CNN + FF/LSTM: softmax output for actor, linear output for critic.
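The advantage targets for one rollout of at most \( t_{max} \) steps can be computed with a backward recursion, in the spirit of the pseudocode in Mnih, 2016. This is a minimal sketch; the function name and argument layout are assumptions.

```python
def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Backward recursion over one rollout of up to t_max steps.

    rewards[t], values[t]: reward and critic estimate V(s_t) at step t.
    bootstrap_value: V of the state after the last step (0.0 if terminal).
    """
    running_return = bootstrap_value
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        returns[t] = running_return
    # Earlier steps see longer returns: step t uses a (len - t)-step return.
    advantages = [ret - v for ret, v in zip(returns, values)]
    return returns, advantages


returns, advantages = nstep_returns_and_advantages(
    rewards=[1.0, 0.0, 2.0], values=[0.5, 0.5, 0.5], bootstrap_value=4.0, gamma=0.5)
```

The returns serve as regression targets for the critic, and the advantages scale the actor's log-probability gradients.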
Experiments
https://www.slideshare.net/ssuser07aa33/introductiontoa3cmodel
 actor and critic parameters were shared.
 features extracted from input frames using CNN for both FF and LSTM.
DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores.
make A3C faster with GeePU?
 Notice there was no replay buffer in the original CPU A3C. RL and A3C are, by design, largely sequential.
 Without mechanisms like Experience Replay, there's not much batched work (gradients/predictions) for the GPU to compute. The GPU would mostly sit idle, waiting for transition samples to be sent from the agents.
 Okay, what if we batch the action selection (actor) and gradient (learner) tasks so that the GPU constantly has work to do?
4. improve A3C with GPU
GA3C
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
Agents receive an action without making the action prediction themselves.
Agents send transition samples without computing gradients themselves.
There's only one copy of the policy. No synchronization is needed.
If the training queue is not empty and the model gets updated, the gradient can go stale: the queued samples were generated by an older policy.
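The queue-based layout can be sketched with the standard library alone. This is a toy under stated assumptions, not the GA3C code: there is no neural network, the "policy" is `state % 2`, and all names (`take_batch`, `predictor`, `trainer`, `agent`) are hypothetical.

```python
import queue
import threading

NUM_AGENTS = 4
STEPS_PER_AGENT = 5

prediction_q = queue.Queue()   # agents -> predictor: (agent_id, state)
training_q = queue.Queue()     # agents -> trainer: (state, action, reward)
reply_qs = [queue.Queue() for _ in range(NUM_AGENTS)]  # predictor -> each agent
model = {"updates": 0}         # the single shared model (stand-in)
trained = []                   # every experience the trainer has consumed


def take_batch(q):
    """Block for one item, then grab whatever else is already queued.

    Returns (batch, stop); stop becomes True once the None sentinel is seen.
    """
    first = q.get()
    if first is None:
        return [], True
    batch = [first]
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return batch, False
        if item is None:
            return batch, True
        batch.append(item)


def predictor():
    stop = False
    while not stop:
        batch, stop = take_batch(prediction_q)
        for agent_id, state in batch:           # one batched "forward pass"
            reply_qs[agent_id].put(state % 2)   # dummy policy


def trainer():
    stop = False
    while not stop:
        batch, stop = take_batch(training_q)
        if batch:
            trained.extend(batch)
            model["updates"] += 1               # one batched "gradient step"


def agent(agent_id):
    state = agent_id
    for _ in range(STEPS_PER_AGENT):
        prediction_q.put((agent_id, state))     # ask for an action
        action = reply_qs[agent_id].get()       # no local model, just wait
        training_q.put((state, action, float(action)))  # send the experience
        state += 1


workers = [threading.Thread(target=predictor), threading.Thread(target=trainer)]
agents = [threading.Thread(target=agent, args=(i,)) for i in range(NUM_AGENTS)]
for t in workers + agents:
    t.start()
for t in agents:
    t.join()
prediction_q.put(None)   # sentinels shut the workers down
training_q.put(None)
for t in workers:
    t.join()
```

Experiences that sit in `training_q` while the model is being updated are exactly the samples that go stale.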
Asynchronous
GA3C: evaluation metric
trainings per second (TPS)
predictions per second (PPS)
still not a conclusive victory... (policy lag..)
mitigating the policy lag (stale gradient)
GPU
synchronous update
one agent interacts with multiple env. instances to generate a large batch of samples.
basically synchronous GA3C.
Efficient Parallel Methods for Deep RL, Clemente, 2017
Batch size is \( n_e t_{max} \). Notice we are averaging the estimated gradients of all agents; therefore, synchronous.
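A minimal sketch of one synchronous update, assuming scalar parameters and toy stand-ins for the environment and the gradient (`toy_step` and `toy_grad` are hypothetical, not from Clemente, 2017):

```python
def synchronous_update(theta, env_states, t_max, lr, step_fn, grad_fn):
    """One agent steps every environment t_max times, then updates once."""
    batch = []
    for _ in range(t_max):
        # The same parameters theta act in every environment instance.
        for i, state in enumerate(env_states):
            action, next_state, reward = step_fn(theta, state)
            batch.append((state, action, reward))
            env_states[i] = next_state
    # Average the per-sample gradient estimates, then take a single step.
    grads = [grad_fn(theta, sample) for sample in batch]
    mean_grad = sum(grads) / len(grads)
    return theta + lr * mean_grad, len(batch)


def toy_step(theta, state):
    action = 1 if theta * state >= 0 else 0
    reward = 1.0 if action == 1 else 0.0
    return action, state + 1, reward


def toy_grad(theta, sample):
    state, action, reward = sample
    return reward  # stand-in for psi_t * grad log pi(a_t | s_t)


env_states = [0, 1, 2, 3]  # n_e = 4 environment instances
new_theta, batch_size = synchronous_update(
    theta=0.0, env_states=env_states, t_max=5, lr=0.1,
    step_fn=toy_step, grad_fn=toy_grad)
```

Because every sample in the batch was generated by the same `theta` that is being updated, there is no policy lag.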
Quad-core Intel i7 CPU + Nvidia GTX 980 Ti GPU
 roughly the same architecture as Mnih 2016 (CNN+FF, RMSProp).
 to the left is a GIF of the model I personally trained: 15 hours of training on an Nvidia GTX 1050. It began to achieve 400+ scores after 50 million time steps.
 implementation available at: https://github.com/Alfredvc/paac
A3C rocks, folks!
 You can't run Gorila on your modest laptop setup but you can with A3C!
 Recently, PGQL was proposed and beat A3C. It's like A2C + Experience Replay.
 A3C implementations can be found at:
 https://github.com/openai/universe-starter-agent
 https://github.com/miyosuda/async_deep_reinforce
 https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient/a3c
 https://github.com/NVlabs/GA3C
 https://github.com/Alfredvc/paac
5. Conclusion
To be continued...
12M timesteps
= 48M frames of experience
total reward gained: 0.00
reward's too sparse...
Thanks!
A3C(WIP)
By dh