Asynchronous
Advantage
Actor-Critic
dhfromkorea(at)gmail.com
Outline
- Motivation
- A3C - theory
- A3C - implementation
- A3C with GPU
- Conclusion
Q Approximation with NN
- approximating Q function with Deep Neural Network
- unstable out of the box:
  - correlated samples (breaks the i.i.d. assumption behind SGD)
  - non-stationarity (a butterfly effect on policy/value updates)
  - correlation between the TD targets and the action values being learned
  - no convergence guarantee
  - expressive power at the cost of high variance
1. Motivation
DQN (Mnih, 2013/2015)
- stabilized through Experience Replay and Target Network
- achieved, then, state-of-the-art results on many Atari games
Target Network: keep a separate target Q (a copy of the online Q) and synchronize it with the online Q periodically.
Replay Memory: helps de-correlate transition samples and improves data efficiency (past samples get reused across many updates).
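A minimal sketch of these two stabilizers (illustrative, PyTorch-flavored; the buffer size, batch size, and function names are assumptions, not the DQN authors' code):

```python
import random
from collections import deque

# Illustrative sketch of DQN's two stabilizers (not the authors' code).
# q_net / target_q_net are assumed to be PyTorch modules.
replay_memory = deque(maxlen=100_000)

def store(transition):
    # transition = (state, action, reward, next_state, done)
    replay_memory.append(transition)

def sample_batch(batch_size=32):
    # uniform sampling de-correlates consecutive transitions
    return random.sample(replay_memory, batch_size)

def sync_target(q_net, target_q_net):
    # periodically copy the online Q-network into the frozen target network
    target_q_net.load_state_dict(q_net.state_dict())
```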
1. Motivation
DQN (Mnih, 2013/2015)
- Problems:
- inherently off-policy (hard to do SARSA or actor-critic)
- more memory and computation required to stabilize the controller.
- maximization bias (addressed by Double DQN)
- does not separate the advantage from the state value (addressed by Dueling DQN)
- long training time / scalability
- took a week to train Breakout (Mnih, 2015)
1. Motivation
why off-policy?: because a replayed sample may have come from an outdated policy.
General RL Architecture (Gorila, Nair 2015)
global and/or local buffer
multiple environments, actors/learners, replay memories.
num. of actors == num. of learners == num. of replay buffers
1. Motivation
asynchronous SGD
basically, scalable DQN
General RL Architecture (Gorila, Nair 2015)
- 130 machine instances: 100 processes for actors & learners + a massive, gorilla-like(!) model distributed across 30 instances
- shorter training time: cut the week-long Atari training of single-GPU DQN roughly in half (3-4 days; surprise?)
- still has issues:
- memory and computation intensive
- still only off-policy (want to use actor-critic)
Motivation of A3C: we want parallelizable RL algorithms that support on-policy as well as off-policy.
1. Motivation
Asynchronous
Advantage
Actor-Critic
2. A3C: theory
policy gradient: basic story
2. A3C: theory
- We want more rewards.
- Rewards depend on actions.
- Actions depend on policy.
- Policy is parameterized by \( \theta \).
- We must intelligently update \( \theta \) to get more rewards, usually with gradient ascent algorithms.
- Then...how do we compute the gradient?
policy gradient theorem
The policy gradient theorem lets us estimate the policy gradient, in a model-free way, by sampling gradients of log action probabilities scaled by some measure of reward.
policy gradient (ascent vector): in which direction, and by what magnitude, should we move the policy in the weight (\( \theta \)) space?
reward signal \( \psi_t \) can be expressed in various flavors; we will discuss this point later.
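In symbols, the sampled estimator the theorem justifies looks like the standard form below (a reconstruction; the slide's original formula is an image not included here):

\[
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a^{(i)}_t \mid s^{(i)}_t\right) \psi^{(i)}_t
\]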
2. A3C: theory
PG interpretation
This gives the steepest direction that increases the probability of taking action \( a_t \) in the neighborhood of the current \( \theta \).
This scalar probability re-scales the reward signal \( \psi_t \) so that high-probability actions receive smaller weight updates than low-probability ones; since high-probability actions are chosen more often, this is a fair treatment.
How good/bad was the action, \( a_t \) at \(s_t\) (or better/worse than average)?
2. A3C: theory
PG interpretation
Notice that a sum of vectors is itself a vector: a summary direction along which we can best assign credit to our actions over the trajectory.
Intuition: this gradient suggests a direction to make high-reward actions more probable for each state.
2. A3C: theory
Asynchronous
Advantage
Actor-Critic
2. A3C: theory
variance of \(\psi_t\)
Choice of \( \psi_t \) is critical. If chosen badly, the controller will likely converge slowly, if at all, due to high variance (large numbers of samples needed).
We want the absolute value of this term to be small. Recall Var(X) \( = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \)
2. A3C: theory
The key is to choose one that reduces variance with acceptable bias.
choice of \(\psi_t\): bias-variance
\( \psi_t \) can be:
- REINFORCE (Monte Carlo return)
- REINFORCE with baseline
- Q actor-critic
- Advantage actor-critic
- TD actor-critic
A baseline is a control variate: it does not shift the mean of the gradient estimator (which stays unbiased), and if it is highly correlated with the reward it reduces variance.
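One way to spell the options above out (following the list in Schulman, 2016):

\[
\psi_t \in \Big\{ \underbrace{\textstyle\sum_{t' \ge t} \gamma^{t'-t} r_{t'}}_{\text{MC return}},\;\;
\underbrace{\textstyle\sum_{t' \ge t} \gamma^{t'-t} r_{t'} - b(s_t)}_{\text{baselined}},\;\;
\underbrace{Q^{\pi}(s_t,a_t)}_{\text{Q-AC}},\;\;
\underbrace{A^{\pi}(s_t,a_t)}_{\text{advantage-AC}},\;\;
\underbrace{r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)}_{\text{TD-AC}} \Big\}
\]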
2. A3C: theory
The TD error/residual can be used to estimate the advantage function (Schulman, 2016).
this seems to be the go-to measure for \( \psi \).
Asynchronous
Advantage
Actor-Critic
2. A3C: theory
actor and critic
The actor improves the policy parameters \( \theta \) as suggested by the critic. The actor tries to minimize the surrogate, entropy-regularized policy loss function:
The critic evaluates the actor's policy. In practice, it updates the value function parameters \( \theta^{V} \). The critic tries to minimize the value loss function:
If \( \theta \) and \( \theta^V \) share parameters, the gradient updates for the actor and the critic can be done in one go:
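A minimal PyTorch-flavored sketch of how the shared update could look, assuming a network with a softmax policy head and a scalar value head (the function name, coefficients, and tensor shapes are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def a3c_loss(logits, values, actions, returns, entropy_beta=0.01, value_coef=0.5):
    """Combined actor-critic loss for one rollout (illustrative sketch).

    logits : (T, num_actions) policy head output
    values : (T,) critic head output V(s_t)
    actions: (T,) actions actually taken
    returns: (T,) bootstrapped n-step returns, treated as fixed targets
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    advantages = returns - values.detach()                     # critic's estimate of A(s_t, a_t)

    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen_log_probs * advantages).mean()      # actor term
    entropy = -(probs * log_probs).sum(dim=-1).mean()          # exploration bonus
    value_loss = F.mse_loss(values, returns)                   # critic term

    # shared parameters => one backward pass updates actor and critic together
    return policy_loss - entropy_beta * entropy + value_coef * value_loss
```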
2. A3C: theory
actor and critic
Actor: I know what to do!
Critic: hey man, the action you took was stupid!
2. A3C: theory
Notice actor's loss now depends on what critic says.
Since we're using policy gradients, we still have convergence properties better than purely value-based methods.
Intuitively, feature representations for actors and critics must overlap a lot, so people just share parameters in practice.
Asynchronous
Advantage
Actor-Critic
2. A3C: theory
using advantage for \( \psi_t \)
total profit - opportunity cost*
- yields almost the lowest variance possible (Schulman, 2016)
- intuition: measures how much this action is better or worse than the default action chosen by the policy.
- Since this measures how much better than average an action is, the advantage can be negative. This property lets us explicitly decrease the probability of bad actions.
- Advantage is unknown; therefore it needs to be estimated.
- If \( Q^{\pi, \gamma}(s_t,a_t) = V^{\pi, \gamma}(s_t) \) for all actions (i.e. the advantage is zero), the choice of action does not matter. (This is the motivation for the Dueling network.)
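Concretely, this is the usual definition: the "total profit" of committing to \( a_t \) (Q) minus the "opportunity cost" of what the policy earns from \( s_t \) on average (V):

\[
A^{\pi, \gamma}(s_t, a_t) = Q^{\pi, \gamma}(s_t, a_t) - V^{\pi, \gamma}(s_t)
\]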
2. A3C: theory
Generalized Advantage Estimation (GAE)
Schulman, 2016, HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION
2. A3C: theory
- okay, I want to use \( A^\pi(s,a)\) for \(\psi\)...
- well, \( A^\pi(s,a)\) is unknown, so we must estimate it with \( \hat{A}^\pi(s,a) \) !
- Under certain conditions (Schulman, 2016), the following can be used:
let's use this!
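The estimator referred to here is presumably the k-step advantage estimate from Schulman, 2016, which can be written as a sum of discounted TD errors:

\[
\hat{A}^{(k)}_t = \sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\]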
Generalized Advantage Estimation (GAE)
Schulman, 2016, HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION
2. A3C: theory
Similar to how we computed the target in n-step TD learning, we agonize over which k to choose (the bias-variance trade-off again...).
Generalized Advantage Estimation (GAE)
Schulman, 2016, HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION
2. A3C: theory
We estimate the advantage function as a weighted sum of discounted TD prediction errors. This approach is analogous to \( TD(\lambda) \), which estimates the value function.
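That weighted sum is GAE's exponentially weighted average of the k-step estimates (Schulman, 2016), with \( \lambda \) playing the same role as in \( TD(\lambda) \):

\[
\hat{A}^{GAE(\gamma, \lambda)}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}
\]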
Why not just use \( Q(s,a) \)? Read the FAQ in the paper below.
Finally, A3C.
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
2. A3C: theory
recall: we wanted parallelizable RL algorithms that support on-policy as well as off-policy: something simpler, easier, faster, cheaper(?) and more flexible than DQN or Gorila.
Finally, A3C.
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
Multiple instances of the same environment.
Multiple agents (actors) interacting in parallel.
2. A3C: theory
The parallel scheme decorrelates states and adds diversity to the experience.
Finally, A3C.
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
Each agent keeps a local copy of the same global policy (master model); each actor may use a slightly different exploration tactic.
A policy gradient is computed by each local agent (learner/trainer) and sent to the master asynchronously.
asynchronous
2. A3C: theory
on a multi-core CPU
https://www.slideshare.net/ssuser07aa33/introduction-to-a3c-model
2. A3C: theory
[diagram: worker threads 1...N alongside a master actor and a master critic]
Basically, Gorila minus the GPU and the replay buffer.
Finally, A3C.
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
2. A3C: theory
- simpler: no replay buffer.
- easier: runs on a multi-core CPU.
- faster: asynchronous updates.
- cheaper: fits on a single machine instead of a cluster.
- more flexible: supports on-policy methods.
Asynchronous Stochastic Gradient Descent (ASGD)
- ASGD is for speed, not for accuracy.
- Basically, a gradient update happens whenever a new gradient is computed.
- In practice, asynchronous gradient updates can cause instability: the gradient computed by one agent may no longer be valid if another agent updates the master policy before that gradient arrives. This is called a 'delayed' (stale) gradient in some literature (a.k.a. policy lag).
- With ASGD, strictly speaking, on-policy actor-critic becomes slightly off-policy.
2. A3C: theory
Mnih, 2016, Asynchronous Methods for Deep Reinforcement Learning
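To make the mechanism concrete, here is a rough PyTorch-flavored sketch of one A3C worker doing Hogwild-style asynchronous updates. Everything here is illustrative: `rollout_and_loss` is a hypothetical placeholder for the \( t_{max} \)-step rollout plus the actor-critic loss from earlier, and the optimizer and learning-rate choices are assumptions.

```python
import torch

# Illustrative Hogwild-style A3C worker (not the paper's code).
# `rollout_and_loss` is a hypothetical placeholder for a t_max-step rollout
# that returns the combined actor-critic loss and the next state.
def worker(global_model, local_model, env, t_max=5, lr=1e-4):
    optimizer = torch.optim.RMSprop(global_model.parameters(), lr=lr)
    state = env.reset()
    while True:
        # 1. sync the local copy with the (possibly newer) global policy
        local_model.load_state_dict(global_model.state_dict())
        # 2. roll out up to t_max steps and build the actor-critic loss
        loss, state = rollout_and_loss(local_model, env, state, t_max)
        # 3. compute gradients locally, copy them onto the global model,
        #    and apply them without any locking (asynchronous update)
        local_model.zero_grad()
        loss.backward()
        for local_p, global_p in zip(local_model.parameters(),
                                     global_model.parameters()):
            global_p.grad = local_p.grad.clone()
        optimizer.step()  # other workers may step in between; hence policy lag
```

In practice the global model sits in shared memory (e.g. `global_model.share_memory()` in PyTorch) and one such worker runs per CPU core.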
3. A3C: implementation
In the experiment, RMSProp was used.
Advantage estimation: k-step sums of TD prediction errors, with k up to \( t_{max} \), were used.
CNN + FF/LSTM: softmax output for actor, linear output for critic.
Experiments
https://www.slideshare.net/ssuser07aa33/introduction-to-a3c-model
- actor and critic parameters were shared.
- features extracted from input frames using CNN for both FF and LSTM.
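As a concrete (illustrative) picture of that shared architecture, here is a PyTorch-flavored sketch of the feedforward variant; the layer sizes follow the common Atari convention and are an assumption, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative shared actor-critic network: CNN trunk + softmax policy head
# + linear value head. Layer sizes are assumptions (84x84x4 Atari inputs).
class ActorCriticNet(nn.Module):
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, 256)             # 84x84 input -> 9x9 feature map
        self.policy_head = nn.Linear(256, num_actions)   # softmax over actions (actor)
        self.value_head = nn.Linear(256, 1)              # linear scalar V(s) (critic)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return F.softmax(self.policy_head(x), dim=-1), self.value_head(x)
```

The LSTM variant adds a recurrent layer on top of these features; the two heads stay the same.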
3. A3C: implementation
Experiments
DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores.
3. A3C: implementation
make A3C faster with Gee-PU?
- Notice there was no replay buffer in the original CPU A3C. RL and A3C by design are largely sequential.
- Without mechanisms like Experience Replay, there isn't much batched work for a GPU to do for gradients/predictions; the GPU would mostly sit idle, waiting for transition samples to arrive from the agents.
- Okay, what if we batch the action-selection (prediction) and gradient (training) work so the GPU constantly has something to do? (A rough sketch follows.)
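A rough sketch of the batching idea behind GA3C's prediction queue (names, batch size, and timeout are illustrative; the real implementation is in the NVlabs/GA3C repo linked in the conclusion):

```python
import queue

import torch

# Illustrative sketch of GA3C-style batching: agents never run the network
# themselves; they enqueue states and a single predictor thread batches
# them for one GPU forward pass.
prediction_queue = queue.Queue()

def predictor(model, batch_size=32, timeout=0.01):
    while True:
        requests = [prediction_queue.get()]             # (agent_id, state, reply_queue)
        while len(requests) < batch_size:
            try:
                requests.append(prediction_queue.get(timeout=timeout))
            except queue.Empty:
                break                                    # don't wait forever for a full batch
        states = torch.stack([s for _, s, _ in requests])
        with torch.no_grad():
            policies, values = model(states)             # one batched GPU call
        for (agent_id, _, reply_q), p, v in zip(requests, policies, values):
            reply_q.put((p, v))                          # send the action distribution back
```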
4. improve A3C with GPU
GA3C
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
4. improve A3C with GPU
Agents receive actions without making the predictions themselves.
Agents send transition samples without computing gradients themselves.
There's only one copy of the policy, so no synchronization is needed.
If the training queue is not empty when the model gets updated, the queued gradients/samples can go stale.
Asynchronous
GA3C: evaluation metric
4. improve A3C with GPU
trainings per second (TPS)
predictions per second (PPS)
GA3C
4. improve A3C with GPU
still not a conclusive victory... (policy lag..)
4. improve A3C with GPU
mitigating the policy lag (stale gradient)
GPU
synchronous update
one agent interacts with multiple env. instances to generate a large batch of samples.
basically, synchronous GA3C.
Efficient Parallel Methods for Deep RL, Clemente, 2017
4. improve A3C with GPU
Clemente, 2017
Batch size is \( n_e t_{max} \). Notice we are averaging the estimated gradients of all agents; therefore, synchronous.
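Schematically (a reconstruction, not the paper's exact notation), one synchronous update over the whole batch of \( n_e t_{max} \) samples looks like:

\[
\theta \;\leftarrow\; \theta + \alpha \, \frac{1}{n_e t_{max}} \sum_{e=1}^{n_e} \sum_{t=1}^{t_{max}} \nabla_\theta \log \pi_\theta\!\left(a^{e}_t \mid s^{e}_t\right) \hat{A}\!\left(s^{e}_t, a^{e}_t\right)
\]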
Quad-core i7 intel CPU + Nvidia GTX 980 Ti GPU
Clemente, 2017
4. improve A3C with GPU
Clemente, 2017
- roughly the same architecture as Mnih 2016 (CNN+FF, RMSProp).
- to the left is a GIF of a model I trained myself: 15 hours of training on an Nvidia GTX 1050; it began to reach 400+ scores after 50 million time steps.
- implementation available at: https://github.com/Alfredvc/paac
4. improve A3C with GPU
A3C rocks, folks!
- You can't run Gorila on a modest laptop setup, but you can run A3C!
- Recently PGQL was proposed, which beat A3C; it's roughly A2C + Experience Replay.
- A3C implementations can be found at:
- https://github.com/openai/universe-starter-agent
- https://github.com/miyosuda/async_deep_reinforce
- https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient/a3c
- https://github.com/NVlabs/GA3C
- https://github.com/Alfredvc/paac
5. Conclusion
To be continued...
12M timesteps = 48M frames of experience; total reward gained: 0.00. The reward is too sparse...
Thanks!