On the Convergence of Smooth Regularized Approximate Value Iteration Schemes
Elena Smirnova
Joint work with Elvis Dohmatob, Criteo AI Lab
State-of-the-art RL algorithms such as Soft Actor-Critic feature:
1. Q-value smoothing
2. Entropy regularization
3. Neural network function approximators
In the large-scale setting, approximations are omnipresent
=> error accumulation
=> high risk of divergence
We analyse the error propagation of RL algorithms with these techniques
using Approximate Dynamic Programming
Approximate Modified Policy Iteration
Policy improvement
(Partial) Policy evaluation
\(m = \infty\): policy iteration; \(m = 1\): value iteration
Scherrer et al. 2015
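A sketch of the AMPI recursion behind these two steps, in the notation of Scherrer et al. 2015 (\(\epsilon_{k+1}\): evaluation error, \(\epsilon'_{k+1}\): greedy error, \(T_\pi\): Bellman operator of policy \(\pi\)):
\[
\pi_{k+1} \leftarrow \text{greedy}(V_k) \ \text{up to error } \epsilon'_{k+1},
\qquad
V_{k+1} = (T_{\pi_{k+1}})^{m}\, V_k + \epsilon_{k+1},
\]
so \(m=1\) recovers (approximate) value iteration and \(m=\infty\) (approximate) policy iteration.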
AMPI Error propagation
Scherrer et al. 2015
Initialization
Cumulative error
\(\gamma\): rate of convergence
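A simplified sup-norm form of the bound, for evaluation errors only and up to constants (a sketch; the exact statement, including greedy errors and concentrability coefficients, is in Scherrer et al. 2015):
\[
\|V^* - V^{\pi_k}\|_\infty \;\lesssim\;
\underbrace{\frac{2\gamma}{(1-\gamma)^2}\,\max_{j<k}\|\epsilon_j\|_\infty}_{\text{cumulative error}}
\;+\;
\underbrace{\frac{2\gamma^{k}}{1-\gamma}\,\|V^* - V_0\|_\infty}_{\text{initialization}} .
\]
The initialization term is forgotten at the geometric rate \(\gamma\), while approximation errors accumulate with an amplification factor of order \(1/(1-\gamma)^2\).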
1. Value Smoothing
Weighted average of the value iterates
1. Value Smoothing
\(\tilde{\gamma}\)-contraction
Does not change the fixed point!
Main tool: Smooth Bellman operator
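A sketch of one instantiation consistent with the bullets above, writing \(\mathcal{T}\) for the Bellman optimality operator and \(\beta \in (0,1]\) for the smoothing weight (symbols assumed):
\[
\mathcal{T}_\beta V = (1-\beta)\,V + \beta\,\mathcal{T}V,
\qquad
\|\mathcal{T}_\beta V - \mathcal{T}_\beta W\|_\infty \le \tilde{\gamma}\,\|V - W\|_\infty,
\quad
\tilde{\gamma} = 1 - \beta(1-\gamma) \ge \gamma .
\]
Since \(\mathcal{T}_\beta V^* = (1-\beta)V^* + \beta V^* = V^*\), the fixed point is unchanged; the noisy iterate reads \(V_{k+1} = \mathcal{T}_\beta V_k + \beta\,\epsilon_k\), so the per-iteration error enters scaled by \(\beta\).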
1. Value Smoothing
Errors are downweighted by a factor of \(\beta\)
Slower convergence due to a weaker contraction: \(\tilde{\gamma} \geq \gamma\)
Increased stability
but
Sensitivity to initialization / random seed
Slower convergence
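To make the trade-off concrete, a minimal numpy sketch (not from the talk) of smoothed approximate value iteration on a small random MDP, with i.i.d. noise injected into every backup; all names and constants are illustrative:
```python
import numpy as np

rng = np.random.default_rng(0)

# Small random MDP: P[s, a] is a distribution over next states, R[s, a] a deterministic reward.
nS, nA, gamma = 20, 4, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.uniform(0.0, 1.0, size=(nS, nA))

def bellman(V):
    """Hard-max Bellman backup: (T V)(s) = max_a [ r(s, a) + gamma * E_{s'} V(s') ]."""
    return (R + gamma * P @ V).max(axis=1)

def smoothed_avi(beta, noise_std=0.5, iters=2000):
    """V_{k+1} = (1 - beta) V_k + beta * (T V_k + eps_k): the error enters scaled by beta."""
    V = np.zeros(nS)
    for _ in range(iters):
        eps = noise_std * rng.standard_normal(nS)   # approximation error eps_k
        V = (1.0 - beta) * V + beta * (bellman(V) + eps)
    return V

# Noise-free value iteration as a reference for V*.
V_star = np.zeros(nS)
for _ in range(2000):
    V_star = bellman(V_star)

for beta in (1.0, 0.3, 0.1):                        # beta = 1 recovers plain (noisy) value iteration
    err = np.linalg.norm(smoothed_avi(beta) - V_star, np.inf)
    print(f"beta={beta:.1f}  ||V_k - V*||_inf = {err:.3f}")
```
Smaller \(\beta\) averages the injected noise over iterations (increased stability) but takes longer to forget the initial \(V_0\) (slower convergence).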
1. Value Smoothing
Entropy-regularized Bellman operator
2. Entropy regularization
\(\gamma\)-contraction
converges to a different regularized value function
Boltzmann / softmax policy
Geist et al. 2019
Main tool: Regularization gap
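A sketch of the entropy-regularized operator named above, writing \(\lambda\) for the temperature, \(\mathcal{H}\) for the entropy, and \(Q_V(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')]\) (symbols assumed):
\[
\mathcal{T}_\lambda V(s)
= \max_{\pi(\cdot|s)} \Big\{ \big\langle \pi(\cdot|s),\, Q_V(s,\cdot) \big\rangle + \lambda\,\mathcal{H}\big(\pi(\cdot|s)\big) \Big\}
= \lambda \log \sum_{a} \exp\!\big( Q_V(s,a)/\lambda \big),
\]
with the Boltzmann / softmax policy \(\pi_\lambda(a \mid s) \propto \exp\big(Q_V(s,a)/\lambda\big)\) attaining the maximum. The operator remains a \(\gamma\)-contraction, but its fixed point is the regularized value function \(V^*_\lambda \neq V^*\) (Geist et al. 2019).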
2. Entropy regularization
\(\Omega^*(A_V)\) = smooth maximum of action advantages
Controlled by the temperature and the entropy of the softmax policy
Positive overestimation errors \(\epsilon_k > 0\) can be reduced
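For the (scaled) negative-entropy regularizer, the standard conjugate identities behind these bullets read (a sketch, with \(A_V\) the action advantages and \(\pi_\lambda\) the softmax policy above):
\[
\max_a A_V(s,a)
\;\le\;
\Omega^*\big(A_V(s,\cdot)\big)
= \lambda \log \sum_a \exp\!\big( A_V(s,a)/\lambda \big)
= \big\langle \pi_\lambda(\cdot \mid s),\, A_V(s,\cdot) \big\rangle + \lambda\,\mathcal{H}\big(\pi_\lambda(\cdot \mid s)\big)
\;\le\; \max_a A_V(s,a) + \lambda \log |\mathcal{A}| ,
\]
so the gap between the smooth and the hard maximum is governed by the temperature \(\lambda\) and the entropy of the softmax policy.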
2. Entropy regularization
Good temperature parameter matches the level of noise
Robustness to noise
but
Convergence to a different value function
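A minimal Monte Carlo sketch (not from the talk) of why a well-chosen temperature helps under noise: the hard max over noisy Q-estimates is biased upwards, while a Boltzmann-policy backup tempers that bias at the price of drifting away from the true maximum; all values are illustrative:
```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

# True action values at a single state, and noisy Q-estimates of them.
q_true = np.array([1.0, 0.9, 0.8, 0.5])
noise_std = 0.3
samples = rng.normal(q_true, noise_std, size=(100_000, q_true.size))

# Hard-max backup over noisy estimates: biased upwards (Jensen).
hard = samples.max(axis=1).mean()
print(f"true max = {q_true.max():.2f}, hard-max backup = {hard:.2f}")

# Boltzmann-policy backup: a higher temperature averages out the noise,
# at the price of drifting away from the true maximum.
for lam in (0.05, 0.3, 1.0):
    pi = softmax(samples / lam)
    soft = (pi * samples).sum(axis=1).mean()
    print(f"lam = {lam:4.2f}  softmax-policy backup = {soft:.2f}")
```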
2. Entropy regularization
3. Neural Network FA
Function approximation errors
Value network trained by gradient descent on the squared loss
Scaling of intermediate layers to preserve the input norm
\(m_h\) is the width of layer \(h\)
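A sketch of the NTK-style parameterization this refers to (biases omitted; \(\sigma\) denotes the activation):
\[
x^{0} = s,
\qquad
x^{h} = \sigma\!\Big( \tfrac{1}{\sqrt{m_{h-1}}}\, W^{h} x^{h-1} \Big), \quad h = 1, \dots, L-1,
\qquad
f_\theta(s) = \tfrac{1}{\sqrt{m_{L-1}}}\, W^{L} x^{L-1},
\]
with \(W^{h} \in \mathbb{R}^{m_h \times m_{h-1}}\) and entries drawn i.i.d. \(\mathcal{N}(0,1)\) at initialization; the \(1/\sqrt{m_{h-1}}\) factors give the norm-preserving scaling.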
3. Neural Network FA
\(L\)-layer fully connected neural network \(f_\theta: \mathbb{R}^{m_0} \rightarrow \mathbb{R}^{m_L}\) with randomly initialized weights
Main tool: Value function NTK
Full-batch gradient flow
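A sketch of the two objects just named, for a training set \(\mathcal{X} = \{x_i\}_{i=1}^{n}\) with regression targets \(\mathcal{Y}\) and the squared loss:
\[
K_t(x, x') = \nabla_\theta f_{\theta_t}(x)^{\top} \nabla_\theta f_{\theta_t}(x'),
\qquad
\frac{d}{dt} f_{\theta_t}(\mathcal{X}) = -\,K_t(\mathcal{X}, \mathcal{X})\,\big( f_{\theta_t}(\mathcal{X}) - \mathcal{Y} \big),
\]
the second identity being the full-batch gradient flow \(\dot{\theta}_t = -\nabla_\theta \tfrac{1}{2} \| f_{\theta_t}(\mathcal{X}) - \mathcal{Y} \|^2\) written in function space.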
3. Neural Network FA
Jacot et al. 2018: for large widths \(m_h\), the NTK converges to a deterministic limiting kernel \(K\)
Lee et al. 2019: with sufficiently large width, \(K(t) \approx K\) throughout training
The spectrum of the limiting NTK governs convergence
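In that regime, treating \(K(t) \approx K\) as constant linearizes the residual dynamics (a sketch):
\[
f_{\theta_t}(\mathcal{X}) - \mathcal{Y} \;\approx\; e^{-K t}\,\big( f_{\theta_0}(\mathcal{X}) - \mathcal{Y} \big),
\]
so each error mode decays at a rate given by the corresponding eigenvalue of \(K\), and the fit error vanishes whenever \(\lambda_{\min}(K) > 0\).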
3. Neural Network FA
Function approximation errors vanish
if the limiting NTK is positive definite
and the value network is sufficiently wide
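A minimal numpy sketch (not from the talk) that computes the empirical NTK Gram matrix of a one-hidden-layer ReLU network under the \(1/\sqrt{\text{width}}\) scaling and checks its smallest eigenvalue; all names and sizes are illustrative:
```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer ReLU network f(x) = (1/sqrt(m)) a^T relu((1/sqrt(d)) W x)
# with N(0, 1) initialization, in the 1/sqrt(width) scaling discussed above.
d, m, n = 5, 4096, 8                                  # input dim, hidden width, number of inputs
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)         # distinct inputs on the unit sphere

def grad_f(x):
    """Gradient of f(x) with respect to all parameters (W, a), flattened into one vector."""
    pre = W @ x / np.sqrt(d)                          # pre-activations
    act = np.maximum(pre, 0.0)                        # ReLU activations
    grad_a = act / np.sqrt(m)                         # df/da
    grad_W = np.outer(a * (pre > 0) / np.sqrt(m * d), x)  # df/dW
    return np.concatenate([grad_a, grad_W.ravel()])

# Empirical NTK Gram matrix at initialization: K_ij = <grad f(x_i), grad f(x_j)>.
J = np.stack([grad_f(x) for x in X])
K = J @ J.T
print("smallest eigenvalue of K(0):", np.linalg.eigvalsh(K).min())
# For distinct inputs and large width, K(0) concentrates around the limiting NTK;
# a strictly positive smallest eigenvalue is what drives the training error to zero.
```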
1. Value smoothing
stability vs sensitivity to the initialization / random seed
2. Entropy regularization
reduction of overestimation errors vs modification of the original problem
3. Neural network approximation
FA errors vanish, under certain conditions, for sufficiently large network width
Summary
References
- B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist. Approximate modified policy iteration and its application to the game of Tetris. JMLR, 2015.
- M. Geist, B. Scherrer, and O. Pietquin. A theory of regularized Markov decision processes. ICML, 2019.
- A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 2018.
- J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. NeurIPS, 2019.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.