Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver & Daan Wierstra (Google DeepMind, London, UK)
Presented by: Tyler Becker
Problem: Large Action Space
Even an extremely coarse per-joint discretization (rotate CCW, rotate CW, no rotation) explodes: $$|\mathcal{A}| = 3^6 = 729$$
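A quick sketch of the blow-up for a hypothetical 6-joint arm:

```python
# Coarse per-joint discretization of a hypothetical 6-joint arm:
# each joint may rotate CW, rotate CCW, or stay still.
from itertools import product

per_joint = (-1, 0, +1)                      # CW, none, CCW
joint_actions = list(product(per_joint, repeat=6))
print(len(joint_actions))                    # 3**6 = 729 discrete actions
```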
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
Problem: Complex State Value Function
Want compatibility with continuous \(\mathcal{A}\), along with the general applicability promised by the Universal Approximation Theorem
"standard multilayer feedforward networks are capable of approximating any measurable function to any desired degree of accuracy, in a very specific and satisfying sense"
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
Deep Q-Learning
- Separate target networks
- Batch learning
- NN function approximation
Deterministic Policy Gradients
- Continuous \(\mathcal{A}\)
- Actor-critic sample efficiency
Deep Deterministic Policy Gradients: combines both sets of ideas
Deterministic Policy Gradient
Previously in RL, we had stochastic policies:
\(\pi : \mathcal{S} \rightarrow \mathbb{R}^{|\mathcal{A}|}\)
$$\pi_\theta(a|s) = \mathbb{P}[a \mid s; \theta]$$
What if \(\mathcal{A} = \mathbb{R}^N \)?
Instead, have the policy be deterministic
\(\pi : \mathcal{S} \rightarrow \mathcal{A}\)
$$\pi_\theta(s) = a$$
How do we optimize \(\theta\) for \(\pi_\theta\)?
Want \(\pi_\theta(s) = \arg\max_a Q(s,a)\)
Determinization removes the distribution over actions, but a new unknown comes back: we need \(Q(s,a)\) to know which action the policy should output.
$$\pi_\theta(a|s) = \mathbb{P}[a \mid s;\theta] \;\rightarrow\; \pi_\theta(s) = a$$
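To climb \(Q\) in \(\theta\), apply the chain rule through the actor; this is the deterministic policy gradient that DDPG uses:
$$\nabla_\theta J \approx \mathbb{E}_{s}\!\left[\nabla_a Q_\phi(s,a)\big|_{a=\pi_\theta(s)}\,\nabla_\theta \pi_\theta(s)\right]$$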
Actor \(\pi_\theta(s)\): climb the gradient of \(Q\)
Critic \(Q_\phi(s,a)\): descend the gradient of the TD mean-squared error
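A minimal PyTorch sketch of these two updates; the network sizes, learning rates, and names are illustrative assumptions, and target networks are omitted here (they are added later):

```python
# Minimal actor-critic update sketch in PyTorch; sizes, learning rates,
# and names are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 8, 2  # hypothetical dimensions

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s2, gamma=0.99):
    """One gradient step on a batch of transitions (s, a, r, s2)."""
    # Critic: descend the TD mean-squared-error gradient.
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s2, actor(s2)], dim=1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: climb the Q gradient by minimizing -Q(s, pi(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```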
Deep Q-Network
Problems with vanilla Q-learning:
- Samples \((s,a,r,s')\) are highly correlated
- "Batches" consist of a single sample
- Moving targets: changing \(\theta\) to move \(Q_\theta(s,a)\) towards the target \(r + \gamma\max_{a'}Q_\theta(s',a')\) also changes the target itself
Solutions:
- Maintain an experience replay buffer and sample minibatches from it at random
- Maintain a separate target network \(Q_\phi(s',a')\) that is frozen and only synced intermittently, so target values do not vary wildly with the current \(Q\) estimate
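A minimal sketch of both fixes; the buffer size, batch size, and the `q_net` / `target_net` names are illustrative assumptions:

```python
# Sketch of the two DQN fixes: (1) experience replay, (2) a separate,
# intermittently synced target network. Sizes/names are illustrative.
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience replay buffer

def store(s, a, r, s2, done):
    replay_buffer.append((s, a, r, s2, done))

def sample_batch(batch_size=64):
    # Uniform random sampling breaks the correlation between
    # consecutive transitions.
    return random.sample(replay_buffer, batch_size)

# Separate target network: keep a copy of the online Q-network whose
# weights are only synced every N gradient steps, so the regression
# target r + gamma * max_a' Q_target(s', a') stays fixed in between.
# target_net = copy.deepcopy(q_net)
# if step % N == 0:
#     target_net.load_state_dict(q_net.state_dict())
```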
DQN works well for discrete action spaces and can approximate complex Q functions
DPG works with continuous \(\mathcal{A}\), but the complexity of the Q function we can approximate depends on the parametric model we choose
Combining DQN and DPG gives more freedom in the state/action-space complexity of the problems we can solve
Given direct state information: DNN taking body positions/velocities
Given raw pixel data: CNN processing the frames
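A rough PyTorch sketch of the two input pathways; the layer sizes and dimensions are illustrative assumptions, not the paper's exact architecture:

```python
# Two ways of feeding the networks, with illustrative sizes.
import torch.nn as nn

# (a) Direct state information: positions/velocities go straight
#     into a fully connected network.
low_dim_actor = nn.Sequential(
    nn.Linear(17, 400), nn.ReLU(),    # 17 = hypothetical state dimension
    nn.Linear(400, 300), nn.ReLU(),
    nn.Linear(300, 6), nn.Tanh(),     # 6 = hypothetical action dimension
)

# (b) Pixel input: a small CNN turns (stacked) frames into features
#     that are then fed to the same kind of fully connected head.
pixel_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(),
)
```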
Exploration:
- Discrete \(\mathcal{A}\): softmax-sample \(a\in\mathcal{A}\) according to \(Q(s,a)\)
- Continuous \(\mathcal{A}\): no explicitly listed \(a\in\mathcal{A}\) to choose from, so add noise to the actor's output instead (see the sketch below)
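DDPG explores by perturbing the deterministic policy with temporally correlated noise from an Ornstein-Uhlenbeck process; a minimal sketch, with illustrative hyperparameters:

```python
# Exploration for a deterministic policy: action = pi(s) + noise.
# Ornstein-Uhlenbeck noise (temporally correlated); the hyperparameters
# below are illustrative, not necessarily the paper's values.
import numpy as np

class OUNoise:
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu, dtype=np.float64)

    def sample(self):
        # Mean-reverting random walk: drift toward mu plus Gaussian kicks.
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x += dx
        return self.x

noise = OUNoise(dim=6)
# exploratory_action = actor(state) + noise.sample()
```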
For critic: want to minimize TD MSE
For actor: want to climb Q gradient
For target networks: want steady/stable convergence
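Concretely, with target networks \(Q_{\phi'}\), \(\pi_{\theta'}\) and soft-update rate \(\tau \ll 1\) (as in DDPG):
$$y = r + \gamma\, Q_{\phi'}\!\left(s', \pi_{\theta'}(s')\right), \qquad L(\phi) = \mathbb{E}\left[\left(Q_\phi(s,a) - y\right)^2\right]$$
$$\phi' \leftarrow \tau\phi + (1-\tau)\phi', \qquad \theta' \leftarrow \tau\theta + (1-\tau)\theta'$$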
Batch Normalization
https://www.jeremyjordan.me/batch-normalization/
Result variants compared: Batch Normalization, Target Network, Target Network + Batch Normalization, Pixel-only inputs
Does not work in partially observable environments
Yan Duan, Xi Chen, Rein Houthooft, John Schulman, & Pieter Abbeel. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control.
Original paper does not compare to other algorithms
Benchmark tasks shown: Walker, Half-Cheetah
Included in many RL overviews
Efforts to quantify sample efficiency:
- Dorner, Measuring Progress in Deep Reinforcement Learning Sample Efficiency