dhfromkorea(at)gmail.com
1. Motivation
Keep a separate target Q network (a copy of the online Q network) and synchronize it with the online network periodically.
Replay memory helps de-correlate transition samples and increases data efficiency (past samples are reused instead of discarded).
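A minimal sketch of these two DQN ingredients (class and function names are illustrative, not from any reference implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; uniform random sampling de-correlates transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted

    def add(self, transition):
        self.buffer.append(transition)  # (s, a, r, s_next, done)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def sync_target(q_params, target_params):
    """Periodically copy the online Q parameters into the target Q."""
    target_params.update(q_params)
```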
Why off-policy? Because a replayed transition may have come from an outdated policy.
global and/or local buffer
multiple environments, actors/learners, replay memories.
num. of actors == num. of learners == num. of replay buffers
asynchronous SGD
basically, scalable DQN
Motivation of A3C: we want parallelizable RL algorithms that support on-policy as well as off-policy.
2. A3C: theory
The policy gradient theorem lets us estimate the policy gradient, model-free, by sampling gradients of log action probabilities scaled by some measure of reward.
policy gradient (ascent vector): in which direction and by what magnitude should we move the policy in the weight(theta) space?
reward signal \( \psi_t \) can be expressed in various flavors; we will discuss this point later.
This gives the steepest direction that increases the probability of taking action \( a_t \) in the neighborhood of the current \( \theta \).
This scalar probability re-scales the reward signal \( \psi_t \) so that high-probability actions receive smaller weight updates than low-probability ones; since high-probability actions are chosen more frequently, this is a fair treatment.
How good/bad was the action, \( a_t \) at \(s_t\) (or better/worse than average)?
Notice that the sum of gradient vectors is itself a vector: a summary direction that assigns credit to our actions over the whole trajectory.
Intuition: this gradient suggests a direction to make high-reward actions more probable for each state.
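A tabular-softmax sketch of this estimator (the function names and the tabular parameterization are assumptions for illustration):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradient(theta, trajectory):
    """Estimate sum_t psi_t * grad_theta log pi(a_t | s_t).

    theta: (n_states, n_actions) logits of a tabular softmax policy.
    trajectory: list of (state, action, psi) tuples; psi is the reward signal.
    """
    grad = np.zeros_like(theta)
    for s, a, psi in trajectory:
        probs = softmax(theta[s])
        # d/d(logits) of log pi(a|s) = one_hot(a) - probs
        dlog = -probs
        dlog[a] += 1.0
        grad[s] += psi * dlog
    return grad  # ascend: theta += lr * grad
```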
Choice of \( \psi_t \) is critical. If chosen badly, the controller will converge slowly, if at all, due to high variance (large numbers of samples needed).
We want the absolute value of this term to be small. Recall Var(X) \( = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \)
The key is to choose one that reduces variance with acceptable bias.
\( \psi_t \) can be:
REINFORCE Monte Carlo
REINFORCE with baseline
Q actor-critic
Advantage actor-critic
TD actor-critic
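Written out (following the GAE paper's notation; \( b(s_t) \) is a learned baseline), the flavors above are:

\( \psi_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \) (REINFORCE Monte Carlo return)
\( \psi_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - b(s_t) \) (REINFORCE with baseline)
\( \psi_t = Q^{\pi}(s_t, a_t) \) (Q actor-critic)
\( \psi_t = A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \) (advantage actor-critic)
\( \psi_t = \delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \) (TD actor-critic)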
Baseline is a control variate that does not shift the mean of the gradient estimator (remaining unbiased). If the baseline is highly correlated with the reward, it can reduce variance.
TD error/residual can be used to estimate advantage function. (Schulman, 2016)
this seems to be the go-to measure for \( \psi \).
Actor improves the policy parameters \( \theta \) as suggested by the critic. The actor hopes to minimize the surrogate entropy-regularized policy loss function:
Critic evaluates the actor's policy. In practice, it updates the value-function parameters \( \theta^{V} \). The critic hopes to minimize the value loss function:
If \( \theta \) and \( \theta^V \) share parameters, the gradient update for actor and critic can be done in one go:
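A sketch of that combined update as a scalar loss for one transition (the function name is illustrative; the coefficients 0.5 and 0.01 are commonly used defaults, not mandated by the paper):

```python
def a3c_loss(log_prob, entropy, value, ret, beta=0.01, value_coef=0.5):
    """Entropy-regularized A3C objective for one transition.

    log_prob: log pi(a_t|s_t); entropy: H(pi(.|s_t));
    value: V(s_t); ret: n-step return R_t.
    The advantage (ret - value) is treated as a constant in the actor term.
    """
    advantage = ret - value
    policy_loss = -log_prob * advantage - beta * entropy  # actor
    value_loss = (ret - value) ** 2                       # critic
    return policy_loss + value_coef * value_loss          # one combined update
```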
Actor: I know what to do!
Critic: hey man, the action you took was stupid!
Notice actor's loss now depends on what critic says.
Since we're using policy gradients, we still have convergence properties better than purely value-based methods.
Intuitively, feature representations for actors and critics must overlap a lot, so people just share parameters in practice.
Advantage intuition: total profit, \( Q(s,a) \), minus opportunity cost*, \( V(s) \).
Schulman, 2016, High-Dimensional Continuous Control Using Generalized Advantage Estimation
let's use this!
Similar to how we computed the target in n-step TD learning, we agonize over which k to choose (bias-variance tradeoff).
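Concretely, the k-step advantage estimator from the GAE paper telescopes into:

\( \hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) \)

Small k means low variance but high bias (it leans on the learned \( V \)); large k means the reverse.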
We estimate the advantage function as a weighted sum of discounted TD prediction errors. This approach is analogous to \( TD(\lambda) \) for value-function estimation.
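A sketch of that weighted sum (the function name is illustrative; the backward recursion \( \hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1} \) is equivalent to the paper's forward sum):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman, 2016).

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (one extra bootstrap value).
    A_t = sum_l (gamma * lam)^l * delta_{t+l}, where
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running  # accumulate backward
        advantages[t] = running
    return advantages
```

With \( \lambda = 0 \) this reduces to the one-step TD error; with \( \lambda = 1 \) it recovers the Monte Carlo advantage.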
why not just use \(Q(s,a)\)? read FAQ in this paper below.
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
recall: we wanted parallelizable RL algorithms that support on-policy as well as off-policy: something simpler, easier, faster, cheaper(?) and more flexible than DQN or Gorila.
Multiple instances of the same environment.
Multiple agents (actors) interacting in parallel.
The parallel scheme decorrelates states and adds diversity in experience.
Each actor keeps a local copy of the same global policy (master model) and may use a slightly different exploration tactic.
Each local agent (learner/trainer) computes a policy gradient and sends it to the master asynchronously.
asynchronous
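The worker loop can be sketched as follows (the `rollout` and `compute_grads` callables are hypothetical stand-ins for the local environment loop and the local model's gradient computation):

```python
def worker(global_params, compute_grads, rollout, lr=1e-3, max_updates=100):
    """One A3C worker thread (sketch).

    Each iteration: copy the master weights into a local model, act for up
    to t_max steps, compute a gradient on the local copy, and apply it to
    the master asynchronously (lock-free, Hogwild!-style in the paper).
    """
    for _ in range(max_updates):
        local_params = dict(global_params)   # sync local copy with master
        trajectory = rollout(local_params)   # act for up to t_max steps
        grads = compute_grads(local_params, trajectory)
        for k, g in grads.items():           # async gradient-ascent step
            global_params[k] -= lr * g
```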
https://www.slideshare.net/ssuser07aa33/introduction-to-a3c-model
(diagram: threads 1 through N, each running a local actor-critic that synchronizes with the master actor and critic)
Basically, Gorila minus the GPU and the replay buffer.
simpler because there's no replay buffer, easier because it runs on CPU.
faster because it's asynchronous.
cheaper because it runs on a single machine (CPU cores instead of a distributed cluster).
more flexible because it supports on-policy.
Mnih, 2016, Asynchronous Methods for Deep Reinforcement Learning
3. A3C: implementation
In the experiment, RMSProp was used.
Advantage estimation: \( t_{max} \)-step TD prediction errors (n-step returns) were used.
CNN + FF/LSTM: softmax output for actor, linear output for critic.
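The two output heads over the shared trunk can be sketched in numpy (the function name and shapes are assumptions; the array `features` stands in for the CNN/LSTM representation):

```python
import numpy as np

def actor_critic_heads(features, w_pi, w_v):
    """A3C output heads (sketch): softmax policy head + linear value head.

    features: shared representation phi(s) from the conv/LSTM trunk.
    w_pi: (n_features, n_actions) actor weights; w_v: (n_features,) critic weights.
    """
    logits = features @ w_pi
    z = logits - logits.max()              # stable softmax
    policy = np.exp(z) / np.exp(z).sum()   # distribution over actions (actor)
    value = float(features @ w_v)          # scalar state value (critic)
    return policy, value
```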
DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores.
4. improve A3C with GPU
Babaeizadeh, 2017, REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU
Agents receive actions from a predictor without running the model themselves.
Agents send transition samples without computing gradients themselves.
There's only one copy of the policy. No synchronization is needed.
If training queue is not empty and the model gets updated, the gradient can go stale.
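The two-queue structure can be sketched as follows (class and attribute names are illustrative, not from the GA3C code):

```python
import queue

class GA3CQueues:
    """GA3C's two queues (sketch): agents never touch the model directly.

    Agents put states on the prediction queue and transition batches on the
    training queue; dedicated GPU threads serve both against the single
    shared policy. If the model updates while the training queue is still
    non-empty, the queued samples came from an older policy (policy lag).
    """
    def __init__(self, maxsize=128):
        self.prediction_q = queue.Queue(maxsize)  # states awaiting actions
        self.training_q = queue.Queue(maxsize)    # experiences awaiting updates
```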
Asynchronous
trainings per second (TPS)
predictions per second (PPS)
still not a conclusive victory... (policy lag..)
GPU
synchronous update
one agent interacts with multiple env. instances to generate a large batch of samples.
basically synchronous GA3C.
Clemente, 2017, Efficient Parallel Methods for Deep Reinforcement Learning
Batch size is \( n_e t_{max} \). Notice we are averaging the estimated gradients of all agents; therefore, synchronous.
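The synchronous averaged update can be sketched as (function name is illustrative):

```python
def paac_update(params, grads_per_env, lr=1e-3):
    """Synchronous PAAC-style step (sketch): average the gradients estimated
    from all n_e environment copies, then apply one update to the single model.

    grads_per_env: list of {param_name: gradient} dicts, one per environment.
    """
    n = len(grads_per_env)
    for k in params:
        avg = sum(g[k] for g in grads_per_env) / n  # average across envs
        params[k] -= lr * avg                       # one synchronized step
    return params
```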
Quad-core i7 intel CPU + Nvidia GTX 980 Ti GPU
5. Conclusion
12M timesteps
= 48M frames of experience
total reward gained: 0.00
reward's too sparse...