We show that, in expectation, the random mini-batch update corresponds to a smooth Bellman operator \(\mathcal{T}_\beta\), where \(\beta = m/M\) is the ratio of mini-batch size to replay buffer size.
Proof. Define the asynchronous Bellman operator \([\hat{\mathcal{T}}_\beta Q](s,a) := [\mathcal{T}Q](s,a)\) if \(y_{s,a}=1\) and \(Q(s,a)\) otherwise, where \(y_{s,a} \sim \text{Bern}(\beta)\). Then \(\mathbb{E}[\hat{\mathcal{T}}_\beta] = \mathbb{E}_y [(1-y) I + y \mathcal{T}] = (1-\beta)I + \beta \mathcal{T} = \mathcal{T}_\beta\).
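As a quick numerical check (our own sketch, not part of the analysis), the script below follows the Bernoulli selection model used in the proof on a hypothetical tabular MDP (the names nS, nA, bellman and the random tables are our assumptions) and verifies that the averaged asynchronous update matches \(\mathcal{T}_\beta Q\).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))                 # reward table

def bellman(Q):
    # [T Q](s, a) = r(s, a) + gamma * E_{s'}[ max_a' Q(s', a') ]
    return R + gamma * P @ Q.max(axis=1)

M, m = 1000, 100                               # replay buffer size and mini-batch size
beta = m / M

Q = rng.uniform(size=(nS, nA))
TQ = bellman(Q)
T_beta_Q = (1 - beta) * Q + beta * TQ          # smooth Bellman operator applied to Q

avg, n_trials = np.zeros_like(Q), 20000
for _ in range(n_trials):
    y = rng.random((nS, nA)) < beta            # y_{s,a} ~ Bern(beta): which entries get updated
    avg += np.where(y, TQ, Q)                  # asynchronous update: T Q where selected, Q elsewhere
avg /= n_trials

print(np.abs(avg - T_beta_Q).max())            # -> approximately 0
\end{verbatim}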
We assume the Q-network \(Q_\theta\), \(\theta \in \mathbb{R}^d\), is approximately linear in \(\theta\) around the initialization \(\theta_0\) (Jacot et al.), namely \(Q_{\text{lin},\theta} = Q_0 + \nabla_{\theta} Q_0 (\theta - \theta_0)\), where \(Q_0 = Q_{\theta_0}\).
Define the Linearized Q-iteration \(\theta_{k+1} \leftarrow \argmin_{\theta \in \mathbb{R}^d} \|Q_{\text{lin},\theta} - \mathcal{T} Q_k\|^2_2\), \(Q_{k+1} \leftarrow Q_{\theta_{k+1}}\).
Here \(K: V \rightarrow V\) denotes a kernel operator acting on value functions, e.g. the Neural Tangent Kernel \(K = \nabla_\theta V_{\theta_0} \nabla_\theta V_{\theta_0}^T\).
We consider Q-iteration \(Q_{k+1} \leftarrow \mathcal{T}Q_k\) with a (neural network) function approximator of the Q-function and transitions sampled from a replay buffer.
Mnih et al. Human-level control through deep reinforcement learning.
Jacot et al. Neural tangent kernel: Convergence and generalization in neural networks.
Smirnova and Dohmatob. On the convergence of smooth regularized approximate value iteration schemes.
Bietti and Mairal. On the inductive bias of neural tangent kernels.
Xu et al. A finite-time analysis of Q-learning with neural network function approximation.
Fan et al. A theoretical analysis of deep Q-learning.
Note that \(K=I\) recovers standard Value Iteration.
We model the DQN incremental update as one step of gradient descent on the Linearized Q-iteration objective. We show that, in Q-function space, it results in the Kernel Q-iteration \(Q_{k+1} \leftarrow K_0 \mathcal{T}Q_k\), where \(K_0 := \nabla_{\theta} Q_0 \nabla_{\theta} Q_0^T\) is the Neural Tangent Kernel at initialization.
Proof. \(\theta_{k+1} \leftarrow \theta_0 - \eta \nabla_\theta Q_0^T (Q_0 + \nabla_\theta Q_0 (\theta-\theta_0) - \mathcal{T} Q_k)\big|_{\theta=\theta_0} = \theta_0 - \eta \nabla_\theta Q_0^T (Q_0 - \mathcal{T} Q_k)\). In Q-function space, \(Q_{\theta_{k+1}} \leftarrow Q_0 + \nabla_\theta Q_0(\theta_{k+1} - \theta_0) = Q_0 + \eta \nabla_\theta Q_0 \nabla_\theta Q_0^T (\mathcal{T}Q_k - Q_0)\). The result follows with \(\eta = 1\) and \(Q_0 = 0\).
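The identification with the NTK can be checked directly. The sketch below is our own illustration (the width, input features, scaling, and variable names are assumptions, not the paper's code): it builds a small two-layer ReLU network, forms the Jacobian \(\nabla_\theta Q_0\) analytically, takes one gradient step on \(\tfrac{1}{2}\|Q_\theta - \mathcal{T}Q_k\|_2^2\) at \(\theta_0\), and compares the resulting network outputs to the kernel prediction \(Q_0 + \eta K_0(\mathcal{T}Q_k - Q_0)\); the agreement is exact for the linearized model and approximate for the finite-width network.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
d, h, N = 8, 512, 10                           # input dim, hidden width, number of (s, a) pairs
X = rng.standard_normal((N, d)) / np.sqrt(d)   # feature vectors of the (s, a) pairs
W = rng.standard_normal((h, d))                # first-layer weights at initialization
a = rng.standard_normal(h) / np.sqrt(h)        # second-layer weights at initialization

def forward(W, a, X):
    return np.maximum(X @ W.T, 0.0) @ a        # two-layer ReLU network, one output per row of X

# Jacobian of the network outputs w.r.t. all parameters, evaluated at initialization.
H = np.maximum(X @ W.T, 0.0)                   # hidden activations, shape (N, h)
mask = (X @ W.T > 0).astype(float)             # ReLU gates, shape (N, h)
J_a = H                                        # d f / d a
J_W = (mask * a)[:, :, None] * X[:, None, :]   # d f / d W, shape (N, h, d)
J = np.concatenate([J_W.reshape(N, -1), J_a], axis=1)
K0 = J @ J.T                                   # empirical Neural Tangent Kernel at theta_0

Q0 = forward(W, a, X)
target = rng.standard_normal(N)                # stands in for T Q_k on the sampled pairs
eta = 1e-2

grad = J.T @ (Q0 - target)                     # gradient of 0.5 * ||Q_theta - target||^2 at theta_0
theta1 = np.concatenate([W.reshape(-1), a]) - eta * grad
W1, a1 = theta1[:h * d].reshape(h, d), theta1[h * d:]

Q_net = forward(W1, a1, X)                     # actual network after one gradient step
Q_kernel = Q0 + eta * K0 @ (target - Q0)       # kernel Q-iteration prediction
print(np.abs(Q_net - Q_kernel).max())          # small for wide networks / small eta
\end{verbatim}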
\(\theta_{k+1} \leftarrow \argmin_{\theta \in \mathbb{R}^d} \|Q_{\theta} - \mathcal{T}_\beta Q_k\|^2_2\), \(Q_{k+1} \leftarrow Q_{\theta_{k+1}}\), where \(\mathcal{T}_\beta := (1-\beta)I + \beta \mathcal{T}\) is the smooth Bellman operator and \(Q_\theta\), \(\theta \in \mathbb{R}^d\), is a Q-network.
Kernel Value Iteration: \(V_{k+1} \leftarrow K\mathcal{T}V_k\).
Proposition 1. Let \(k: \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{R}\) be a positive definite kernel function. If \(\|K\|_{op}:=\sup_{s \in \mathcal{S}} \|k(s,\cdot)\|_1 \leq 1\), Kernel Value Iteration converges at rate \(\tilde{\gamma} := \|K\|_{op} \gamma\). In particular, convergence holds for a normalized kernel function, i.e. \(k(s,s) = 1\).
Proof. Under the conditions of Proposition 1, \(K\mathcal{T}\) is a valid Bellman operator, i.e. it satisfies the contraction, monotonicity and distributivity properties; in particular, \(\|K\mathcal{T}V - K\mathcal{T}V'\|_\infty \leq \|K\|_{op}\,\|\mathcal{T}V - \mathcal{T}V'\|_\infty \leq \|K\|_{op}\,\gamma\,\|V - V'\|_\infty\).
Its fixed point \(\tilde{V}^*\) satisfies \(K\mathcal{T}\tilde{V}^* = \tilde{V}^*\), and the distance to the optimal value function is bounded by \(\|V^* -\tilde{V}^*\|_\infty \leq \frac{1}{1-\gamma} \|(I-K)\mathcal{T}\tilde{V}^*\|_\infty\).
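Proposition 1 can be illustrated numerically. The sketch below runs Kernel Value Iteration under our own illustrative assumptions (a random MDP and a row-subnormalized RBF kernel matrix with \(\sup_s \|k(s,\cdot)\|_1 = 0.8\)) and checks that the per-step error ratios stay below \(\tilde{\gamma} = \|K\|_{op}\gamma\).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 20, 4, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))

def T(V):
    # optimal Bellman operator: [T V](s) = max_a r(s, a) + gamma * E[V(s')]
    return (R + gamma * P @ V).max(axis=1)

# Kernel matrix with max row sum 0.8, so that ||K||_op <= 1 as required by Proposition 1.
S = rng.standard_normal((nS, 3))
K = np.exp(-0.5 * np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1))
K = 0.8 * K / K.sum(axis=1, keepdims=True)
op_norm = np.abs(K).sum(axis=1).max()          # sup_s ||k(s, .)||_1
rate = op_norm * gamma                         # gamma_tilde

V_star = np.zeros(nS)                          # fixed point of K T, by iterating to convergence
for _ in range(500):
    V_star = K @ T(V_star)

V, errs = rng.uniform(size=nS), []
for _ in range(30):
    V = K @ T(V)
    errs.append(np.abs(V - V_star).max())

ratios = np.array(errs[1:]) / np.array(errs[:-1])
print(rate, ratios.max())                      # empirical contraction ratios stay below gamma_tilde
\end{verbatim}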
\(V_{k+1} \leftarrow K \hat{\mathcal{T}} V_k\), where \([\hat{\mathcal{T}}V](s) := r(s, a^*) + \gamma V(s')\) is the sampled Bellman operator, with \(s' \sim P(\cdot|s,a^*)\) and \(a^* = \argmax_a Q_V(s, a)\).
Proposition 2. Sampling error propagation of Kernel VI is given by \(\|V_N - \tilde{V}^*\|_\infty \leq \sum_{k=1}^N \tilde{\gamma}^{N-k} \|\tilde{\epsilon}_k\|_\infty + \tilde{\gamma}^N\|V_0 - \tilde{V}^*\|_\infty\), where \(\tilde{\gamma}:=\|K\|_{op} \gamma\) and \(\tilde{\epsilon}_k:=K\epsilon_k\), \(\epsilon_k \in \mathbb{R}^{\mathcal{S}}\) is a noise vector.
Proof. \(V_{k+1} \leftarrow K(\mathcal{T}V_k + \epsilon_{k+1}) = K\mathcal{T}V_k + K\epsilon_{k+1}\), where \(\mathbb{E}[\epsilon_k] = 0\) and \(\tilde{\epsilon}_{k} := K\epsilon_k\) satisfies \(\mathbb{E}[\tilde{\epsilon}_k] = 0\) and \(\|\tilde{\epsilon}_k\|_\infty \leq \|K\|_{op}\|\epsilon_k\|_\infty \leq \|\epsilon_k\|_\infty\).
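For completeness, using \(\|V_{k+1} - \tilde{V}^*\|_\infty \leq \tilde{\gamma}\|V_k - \tilde{V}^*\|_\infty + \|\tilde{\epsilon}_{k+1}\|_\infty\), the recursion unrolls into the stated bound:
\[
\|V_N - \tilde{V}^*\|_\infty \leq \tilde{\gamma}\,\|V_{N-1} - \tilde{V}^*\|_\infty + \|\tilde{\epsilon}_N\|_\infty \leq \cdots \leq \tilde{\gamma}^N \|V_0 - \tilde{V}^*\|_\infty + \sum_{k=1}^{N} \tilde{\gamma}^{N-k}\, \|\tilde{\epsilon}_k\|_\infty.
\]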
The finite-sample iteration \(V_{k+1} \leftarrow K_n \mathcal{T}V_k\), where \((K_n)_{ij} = \frac{1}{n} k(s_i, s_j)\), \(i,j \in [n]\), is the kernel Gram matrix and the \(s_i\) are independent samples from \(P_\mu\), converges at rate \(\tilde{\gamma}:= \|K_n\|_{op}\gamma\). Here \(K_n\) is a finite-sample approximation of \(K\), i.e. \(K_n \xrightarrow[n \rightarrow \infty]{} K\). In the eigenbasis of \(K_n\), the convergence rate is governed by the largest eigenvalue of \(K_n\), i.e. \(\|K_n\|_{op} = \lambda_{\max}(K_n)\).
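As an illustration (the RBF kernel and the sampling distribution here are our own choices, not the paper's), the Gram matrix and its largest eigenvalue can be formed as follows.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, gamma = 50, 0.9
S = rng.standard_normal((n, 4))                       # n independent samples s_i (illustrative P_mu)
k = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))  # any positive definite kernel, here RBF

K_n = np.array([[k(si, sj) for sj in S] for si in S]) / n   # (K_n)_{ij} = k(s_i, s_j) / n
lam_max = np.linalg.eigvalsh(K_n).max()               # symmetric PSD: operator norm = largest eigenvalue
print(lam_max, lam_max * gamma)                       # ||K_n||_op and the induced rate gamma_tilde
\end{verbatim}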
Smooth Kernel Q-iteration with transition sampling: \(Q_{k+1} \leftarrow K_n \hat{\mathcal{T}}_\beta Q_k\).
Proposition 3. Sampling error propagation of smooth Kernel Q-iteration is given by \(\|Q_N - \tilde{Q}^*\|_\infty \leq \sum_{k=1}^N \tilde{\gamma}^{N-k} \|\beta\tilde{\epsilon}_k\|_\infty + \tilde{\gamma}^N\|Q_0 - \tilde{Q}^*\|_\infty\), where \(\tilde{\gamma}:=\|K_n\|_{op}(\beta \gamma + 1 - \beta)\) and \(\tilde{\epsilon}_k:=K_n \epsilon_k\), \(\epsilon_k \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}\) is a noise vector.
Proof. Follows from Proposition 2, \(Q_{k+1} \leftarrow K_n ((1-\beta) Q_k + \beta\hat{\mathcal{T}} Q_k) = K_n (\mathcal{T}_\beta Q_k + \beta \epsilon_{k+1})\), and the fact that \(\mathcal{T}_\beta\) is a \((1-\beta+\beta\gamma)\)-contraction.
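The sketch below (a random MDP, a row-subnormalized kernel matrix, and one sampled next state per pair are all our own illustrative choices) runs the smooth Kernel Q-iteration with transition sampling and reports the theoretical modulus \(\|K_n\|_{op}(\beta\gamma + 1 - \beta)\) next to the residual error, which settles at a noise floor rather than decaying to zero, consistent with Proposition 3.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, beta = 15, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))

def T(Q):                                          # exact Bellman operator
    return R + gamma * P @ Q.max(axis=1)

def T_hat(Q):                                      # sampled Bellman operator: one next state per (s, a)
    s_next = np.array([[rng.choice(nS, p=P[s, a]) for a in range(nA)] for s in range(nS)])
    return R + gamma * Q.max(axis=1)[s_next]

# Row-subnormalized kernel matrix over states (illustrative; max row sum 0.5, so ||K||_op = 0.5).
S = rng.standard_normal((nS, 3))
K = np.exp(-0.5 * np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1))
K = 0.5 * K / K.sum(axis=1, keepdims=True)
rate = np.abs(K).sum(axis=1).max() * (beta * gamma + 1 - beta)

Q_star = np.zeros((nS, nA))                        # fixed point of the noiseless smooth kernel iteration
for _ in range(500):
    Q_star = K @ ((1 - beta) * Q_star + beta * T(Q_star))

Q, errs = rng.uniform(size=(nS, nA)), []
for _ in range(200):
    Q = K @ ((1 - beta) * Q + beta * T_hat(Q))     # smooth kernel Q-iteration with transition sampling
    errs.append(np.abs(Q - Q_star).max())

print(rate, errs[0], errs[-1])                     # theoretical modulus, initial error, residual noise floor
\end{verbatim}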
We evaluate \(V_{k+1} \leftarrow K_{NTK,n} \hat{\mathcal{T}}_\beta V_k\) on a stochastic cliff walking problem.
Setup. \(K_{NTK, n}(s_i, s_j) = h_{NTK}(\langle s_i, s_j \rangle)/2\), where \(\langle s_i, s_j \rangle = \mathbf{1}_{s_i=s_j}\), \(i,j \in [n]\), and \(h_{NTK}\) is the Neural Tangent Kernel of a two-layer ReLU neural network; \(\gamma = 0.9\), \(\beta = 0.1\).
The figure shows the empirical convergence to a fixed point together with the theoretical rate \(C\tilde{\gamma}^N\), where \(\tilde{\gamma}:= \|K_{NTK,n}\|_{op}(\beta \gamma + 1-\beta) \approx 0.19\) and \(\|K_{NTK,n}\|_{op} \approx 0.19\).
The NTK of a multi-layer ReLU neural network is given by a dot-product kernel function on the unit sphere, \(k_{NTK}(s,s') = h_{NTK}(\langle s,s' \rangle)\), where \(h_{NTK}: \mathbb{R} \rightarrow \mathbb{R}\) is related to the arc-cosine kernel function (Bietti and Mairal). If \(\mathcal{S} = \mathbb{S}^{d-1}\) is the unit sphere, then the normalized NTK kernel \(h_{NTK}/h_{NTK}(1)\) satisfies the conditions of Proposition 1.
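For reference, \(h_{NTK}\) of a two-layer ReLU network can be written in terms of the degree-0 and degree-1 arc-cosine kernels. The sketch below uses one common convention (constant factors may differ from the paper's normalization), under which \(h_{NTK}(1) = 2\), matching the \(h_{NTK}/2\) normalization used in the experiment.
\begin{verbatim}
import numpy as np

def h_ntk(u):
    # NTK of a two-layer ReLU network on unit-norm inputs, via arc-cosine kernels
    # (one common convention; constant factors may differ from the paper's).
    u = np.clip(u, -1.0, 1.0)
    kappa0 = (np.pi - np.arccos(u)) / np.pi                                 # degree-0 arc-cosine kernel
    kappa1 = (np.sqrt(1.0 - u ** 2) + (np.pi - np.arccos(u)) * u) / np.pi   # degree-1 arc-cosine kernel
    return u * kappa0 + kappa1

print(h_ntk(1.0))   # 2.0, so h_NTK / h_NTK(1) = h_NTK / 2 is the normalized kernel
print(h_ntk(0.0))   # 1 / pi for orthogonal inputs
\end{verbatim}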