On the Convergence of Smooth Regularized Approximate Value Iteration Schemes

Elena Smirnova

Joint work with Elvis Dohmatob, Criteo AI Lab

SOTA RL algorithms, like Soft Actor-Critic, feature:

1. Q-value smoothing

2. Entropy regularization

3. Neural network function approximators

In the large-scale setting, approximations are omnipresent

=> error accumulation

=> high risk of divergence

We analyse the error propagation of RL algorithms with these techniques

using Approximate Dynamic Programming

Approximate Modified Policy Iteration

Policy improvement:

\pi_{t+1} \leftarrow \mathcal{G}^{\epsilon'_{t+1}}(V_t)

(Partial) Policy evaluation:

V_{t+1} \leftarrow (\mathcal{T}^{\pi_{t+1}})^m V_t + \epsilon_{t+1}

\(m = \infty\): policy iteration; \(m = 1\): value iteration

Scherrer et al. 2015
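To make the scheme concrete, here is a minimal tabular sketch (not from the paper): the MDP is given as arrays `P` and `R`, the greedy step is exact (\(\epsilon'_t = 0\)), and a Gaussian `noise` term stands in for the evaluation error \(\epsilon_t\).

```python
import numpy as np

def ampi(P, R, gamma=0.9, m=5, n_iters=50, noise=0.0, seed=0):
    """Tabular AMPI sketch: greedy improvement + m-step partial policy evaluation.

    P: transition kernel, shape (A, S, S); R: rewards, shape (S, A).
    `noise` injects i.i.d. evaluation error epsilon_t to mimic approximation.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    V = np.zeros(S)
    for t in range(n_iters):
        # Policy improvement: pi_{t+1} greedy w.r.t. V_t (epsilon'_t = 0 here).
        Q = R + gamma * (P @ V).T               # Q_V(s, a), shape (S, A)
        pi = Q.argmax(axis=1)                   # deterministic greedy policy
        # (Partial) policy evaluation: apply T^{pi_{t+1}} m times, plus noise.
        R_pi = R[np.arange(S), pi]              # r(s, pi(s))
        P_pi = P[pi, np.arange(S), :]           # P(. | s, pi(s)), shape (S, S)
        for _ in range(m):
            V = R_pi + gamma * P_pi @ V
        V = V + noise * rng.standard_normal(S)  # epsilon_{t+1}
    return V, pi
```

Setting `m=1` recovers approximate value iteration; large `m` approaches approximate policy iteration.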


AMPI Error propagation

\|V^{\pi_N} - V^*\|_\infty \le \frac{2}{1-\gamma}\left( E_N + \gamma^N\|V_0-V^*\|_\infty \right)

E_N := \sum_{t=1}^{N-1}\tilde{\gamma}^{N-t}\left(\|\epsilon_t\|_\infty+\|\epsilon'_t\|_\infty\right)

\(E_N\): cumulative error; \(\gamma^N\|V_0-V^*\|_\infty\): initialization term; \(\gamma\): rate of convergence

Scherrer et al. 2015

1. Value Smoothing

\begin{cases} \pi_{t+1} = \mathcal{G}^{\epsilon'_{t+1}}(\tilde{V}_t) \\ V_{t+1} = \mathcal{T}^{\pi_{t+1}} V_t + \epsilon_{t+1} \\ \tilde{V}_{t+1} = \beta V_{t+1} + (1-\beta) \tilde{V}_t \end{cases}
\beta \in (0,1]

Weighted average of the value iterates
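A minimal numpy sketch of this scheme (illustrative only; array shapes and the noise model are assumptions): the greedy policy is computed from the smoothed iterate \(\tilde{V}_t\), the Bellman backup is applied to \(V_t\), and the two are averaged with weight \(\beta\).

```python
import numpy as np

def smoothed_avi(P, R, gamma=0.9, beta=0.5, n_iters=100, noise=0.1, seed=0):
    """Value-smoothed approximate value iteration sketch.

    P: transitions, shape (A, S, S); R: rewards, shape (S, A).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    V = np.zeros(S)
    tilde_V = V.copy()
    for t in range(n_iters):
        # pi_{t+1}: greedy w.r.t. the smoothed iterate tilde_V_t.
        Q = R + gamma * (P @ tilde_V).T
        pi = Q.argmax(axis=1)
        # V_{t+1} = T^{pi_{t+1}} V_t + epsilon_{t+1} (noise mimics approximation error).
        V = (R[np.arange(S), pi] + gamma * P[pi, np.arange(S), :] @ V
             + noise * rng.standard_normal(S))
        # tilde_V_{t+1}: weighted average of the value iterates.
        tilde_V = beta * V + (1 - beta) * tilde_V
    return tilde_V
```

With `beta=1` the averaging disappears and the scheme reduces to plain approximate value iteration.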

1. Value Smoothing

Main tool: Smooth Bellman operator

\mathcal{T}^\pi_\beta := \beta \mathcal{T}^\pi + (1-\beta)I

\(\tilde{\gamma}\)-contraction with \(\gamma \leq \tilde{\gamma} := \beta \gamma + (1-\beta) < 1\)

Does not change the fixed point!

Equivalent rewriting of the value smoothing scheme:

\begin{cases} \pi_{t+1} = \mathcal{G}_{\beta}^{\beta\epsilon'_{t+1}}(V_t) \\ V_{t+1} = \mathcal{T}^{\pi_{t+1}}_\beta V_t + \beta\epsilon_{t+1} \end{cases}
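Both claims follow in one line each; written out here for completeness (this derivation is not on the original slides). For \(\beta > 0\):

\mathcal{T}^\pi_\beta V = V \;\Longleftrightarrow\; \beta\,\mathcal{T}^\pi V + (1-\beta) V = V \;\Longleftrightarrow\; \mathcal{T}^\pi V = V

\|\mathcal{T}^\pi_\beta U - \mathcal{T}^\pi_\beta V\|_\infty \le \beta\gamma\,\|U - V\|_\infty + (1-\beta)\,\|U - V\|_\infty = \tilde{\gamma}\,\|U - V\|_\infty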

1. Value Smoothing

\|V^{\pi_N} - V^*\|_\infty \le \frac{2}{1-\tilde{\gamma}}\left( \beta E_N + \tilde{\gamma}^N\|V_0-V^*\|_\infty \right)

Errors are downweighted by a factor of \(\beta\)

Slower convergence due to the larger contraction factor \(\tilde{\gamma} \geq \gamma\)

Increased stability

but

Sensitivity to initialization / random seed

Slower convergence


2. Entropy regularization

Entropy-regularized Bellman operator

\mathcal{T}^\pi_\Omega := \mathcal{T}^\pi + \alpha \mathcal{H}(\pi)

\(\gamma\)-contraction

converges to a different, regularized value function

Boltzmann / softmax policy

\mathcal{G}_\Omega(V) := (\mathrm{softmax}(Q_V(s,\cdot)))_{s \in \mathcal{S}}

Geist et al. 2019
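A small numpy/scipy sketch of the regularized greedy step (array shapes and parameter values are assumptions, not from the slides): it builds the softmax policy \(\mathcal{G}_\Omega(V)\) and checks that \(\mathcal{T}^\pi_\Omega V\) for this policy equals the smooth maximum \(\alpha \log \sum_a \exp(Q_V(s,a)/\alpha)\).

```python
import numpy as np
from scipy.special import logsumexp, softmax, xlogy

def regularized_greedy_step(P, R, V, gamma=0.9, alpha=0.1):
    """Entropy-regularized greedy step sketch.

    P: transitions, shape (A, S, S); R: rewards, shape (S, A); V: values, shape (S,).
    Returns T^pi_Omega V for the softmax policy pi = G_Omega(V), and pi itself.
    """
    Q = R + gamma * (P @ V).T                      # Q_V(s, a)
    pi = softmax(Q / alpha, axis=1)                # Boltzmann / softmax policy G_Omega(V)
    H = -xlogy(pi, pi).sum(axis=1)                 # per-state policy entropy
    T_Omega_V = (pi * Q).sum(axis=1) + alpha * H   # T^pi V + alpha * H(pi)
    # For the softmax policy this is exactly the smooth max alpha * logsumexp(Q / alpha).
    assert np.allclose(T_Omega_V, alpha * logsumexp(Q / alpha, axis=1))
    return T_Omega_V, pi
```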

2. Entropy regularization

Main tool: Regularization gap

\Omega^*(A_V) := \mathcal{T}^{\pi^{\text{softmax}}}_\Omega V - \mathcal{T}^{\pi^{\max}} V

A_V(s,a) := Q_V(s,a) - \max Q_V(s,\cdot)

0 \leq \Omega^*(A_V(s, \cdot)) \leq \alpha\mathcal{H}(\pi^{\text{softmax}}(\cdot|s))

\(\Omega^*(A_V)\) = smooth maximum of the action advantages

Controlled by the temperature and by the entropy of the softmax policy

Positive overestimation errors \(\epsilon_t > 0\) could be reduced
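A quick numerical check of these two bounds under the entropy regularizer, for which \(\Omega^*(A_V(s,\cdot)) = \alpha \log \sum_a \exp(A_V(s,a)/\alpha)\) (a standard identity); the random Q-table below is purely illustrative.

```python
import numpy as np
from scipy.special import logsumexp, softmax, xlogy

def regularization_gap(Q, alpha=0.1):
    """Omega*(A_V) for the entropy regularizer: the smooth max of the advantages."""
    A = Q - Q.max(axis=1, keepdims=True)        # A_V(s, a) <= 0
    return alpha * logsumexp(A / alpha, axis=1)

# Illustrative check of 0 <= Omega*(A_V(s,.)) <= alpha * H(pi_softmax(.|s)).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))                     # toy Q-table: 4 states, 3 actions
alpha = 0.5
gap = regularization_gap(Q, alpha)
pi = softmax(Q / alpha, axis=1)
H = -xlogy(pi, pi).sum(axis=1)
assert np.all(gap >= 0) and np.all(gap <= alpha * H + 1e-12)
print(gap, alpha * H)
```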

2. Entropy regularization

V_{t+1} \leftarrow \mathcal{T}^{\pi_t^{\text{softmax}}}_{\Omega} V_t + \bar{\epsilon}_{t+1}

\bar{\epsilon}_{t+1} = \epsilon_{t+1} - \Omega^*(A_{V_t})

A good temperature parameter matches the level of noise

Robustness to noise

but

Convergence to a different value function

2. Entropy regularization

\|V_{N} - V^*\|_\infty \le E_N + A_N + \gamma^N\|V_0-V^*\|_\infty

E_N := \sum_{t=1}^{N}\gamma^{N-t} \|\epsilon_t - \Omega^*(A_{V_{t-1}})\|_\infty

A_N := \sum_{t=1}^{N} \gamma^{N-t} \|\Omega^*(A_{V_{t-1}})\|_\infty

\epsilon_t := V_t - \mathcal{T} V_{t-1}

3. Neural Network FA

Function approximation errors

V(s) := V_\theta(s)

Value network trained using gradient descent over the squared loss

\theta_{k+1} \leftarrow \argmin_{\theta} \|V_{\theta} - \mathcal{T}_{\Omega,\beta} V_k\|_2^2, \qquad V_{k+1} \leftarrow V_{\theta_{k+1}}

3. Neural Network FA

\(L\)-layer fully connected neural network \(f_\theta: \mathbb{R}^{m_0} \rightarrow \mathbb{R}^{m_L}\) with randomly initialized weights

f_\theta^{(h)}(x) = \theta^{(h)} g^{(h-1)}(x), \qquad g^{(h)}(x) = \sqrt{\frac{c_\sigma}{m_h}} \sigma(f_\theta^{(h)}(x))

Scaling of the intermediate layers to preserve the input norm; \(m_h\) is the width of layer \(h\)
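A minimal numpy sketch of a forward pass under this scaling (the ReLU nonlinearity with \(c_\sigma = 2\) and the linear output layer are assumptions, not stated on the slide):

```python
import numpy as np

def init_params(widths, seed=0):
    """Standard-normal weights for an L-layer network, widths = [m_0, ..., m_L]."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((m_out, m_in))
            for m_in, m_out in zip(widths[:-1], widths[1:])]

def forward(params, x, c_sigma=2.0):
    """NTK-style forward pass: f^(h) = theta^(h) g^(h-1),
    g^(h) = sqrt(c_sigma / m_h) * sigma(f^(h)); here sigma = ReLU, c_sigma = 2."""
    g = x
    for h, theta in enumerate(params):
        f = theta @ g
        if h < len(params) - 1:
            # Scale the activation by sqrt(c_sigma / m_h) to preserve the input norm.
            g = np.sqrt(c_sigma / f.shape[0]) * np.maximum(f, 0.0)
        else:
            g = f                              # linear output layer (assumption)
    return g

# Example: a value network mapping a 4-dimensional state to a scalar value.
params = init_params([4, 256, 256, 1])
print(forward(params, np.ones(4)))
```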

3. Neural Network FA

Main tool: Value function NTK

Full-batch gradient flow

\dot{\theta}(t) = - (\nabla_\theta V_\theta)^T (V_{\theta} - \mathcal{T}_{\Omega,\beta} V_{k}) \big|_{\theta=\theta(t)}

\frac{d V_\theta}{dt} = \nabla_\theta V_\theta \, \dot{\theta}(t) = -K(t) (V_{\theta(t)} - \mathcal{T}_{\Omega,\beta} V_{k})

K(s,s') = \mathbb{E}_{\theta \sim \text{init}} \left[ \frac{\partial V_\theta(s)}{\partial \theta}^T \frac{\partial V_\theta(s')}{\partial \theta} \right]

With sufficiently large width, \(K(t) \approx K\):

K(t) \stackrel{\text{a.s.}}{=} K + \mathcal{O}(m^{-1/2}) \quad \text{for large } m

Jacot et al. 2018; Lee et al. 2019

Spectrum of the limiting NTK defines convergence

3. Neural Network FA

\|V_{\theta_{k+1}} - \mathcal{T}_{\Omega,\beta}V_k\| \stackrel{\text{a.s.}}{=} \mathcal{O}(e^{-\lambda_{\min}(K)T})

Function approximation errors vanish,

if the limiting NTK is positive definite,

for a sufficiently wide value network
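A toy numerical illustration of this rate (the kernel below is a random positive-definite stand-in for the limiting NTK, not an actual network kernel): under the linearized dynamics the fitting residual solves \(\dot{r} = -K r\), so \(\|r(T)\| \le e^{-\lambda_{\min}(K) T} \|r(0)\|\).

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 20                                         # number of states in the training batch
B = rng.standard_normal((n, n))
K = B @ B.T + 1e-2 * np.eye(n)                 # positive-definite stand-in for the limiting NTK
r0 = rng.standard_normal(n)                    # initial residual V_theta(0) - T_{Omega,beta} V_k

T = 5.0
rT = expm(-K * T) @ r0                         # residual after training time T under dr/dt = -K r
bound = np.exp(-np.linalg.eigvalsh(K).min() * T) * np.linalg.norm(r0)
print(np.linalg.norm(rT), "<=", bound)
```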

Summary

1. Value smoothing

stability vs sensitivity to the initialization / random seed

2. Entropy regularization

reduces overestimation errors vs modifying the original problem

3. Neural network approximation

FA errors vanish, under certain conditions, at large width of the network

References

  1. B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist. Approximate modified policy iteration and its application to the game of Tetris. JMLR, 2015.
  2. M. Geist, B. Scherrer, and O. Pietquin. A theory of regularized Markov decision processes. ICML, 2019.
  3. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 2018.
  4. J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. NeurIPS, 2019.
  5. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.