We show that, in expectation, the random mini-batch update corresponds to a smooth Bellman operator \(\mathcal{T}_\beta\), where \(\beta = m/M\) is the ratio of mini-batch size to replay buffer size.
Proof. Define the asynchronous Bellman operator \([\hat{\mathcal{T}}_\beta Q](s,a) := [\mathcal{T}Q](s,a)\) if \(y_{s,a}=1\) and \(Q(s,a)\) otherwise, where \(y_{s,a} \sim \text{Bern}(\beta)\). Then \(\mathbb{E}[\hat{\mathcal{T}}_\beta] = \mathbb{E}_y [(1-y) I + y \mathcal{T}] = (1-\beta)I + \beta \mathcal{T} = \mathcal{T}_\beta\).
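As a quick numerical check (our own sketch, not part of the analysis), the script below follows the Bernoulli selection model used in the proof on a hypothetical tabular MDP (the names nS, nA, bellman and the random tables are our assumptions) and verifies that the averaged asynchronous update matches \(\mathcal{T}_\beta Q\).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))                 # reward table

def bellman(Q):
    # [T Q](s, a) = r(s, a) + gamma * E_{s'}[ max_a' Q(s', a') ]
    return R + gamma * P @ Q.max(axis=1)

M, m = 1000, 100                               # replay buffer size and mini-batch size
beta = m / M

Q = rng.uniform(size=(nS, nA))
TQ = bellman(Q)
T_beta_Q = (1 - beta) * Q + beta * TQ          # smooth Bellman operator applied to Q

avg, n_trials = np.zeros_like(Q), 20000
for _ in range(n_trials):
    y = rng.random((nS, nA)) < beta            # y_{s,a} ~ Bern(beta): which entries get updated
    avg += np.where(y, TQ, Q)                  # asynchronous update: T Q where selected, Q elsewhere
avg /= n_trials

print(np.abs(avg - T_beta_Q).max())            # -> approximately 0
\end{verbatim}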
We assume the Q-network \(Q_\theta\), \(\theta \in \mathbb{R}^d\), is approximately linear in \(\theta\) around the initialization \(\theta_0\) (Jacot et al.), namely \(Q_{\text{lin},\theta} = Q_0 + \nabla_{\theta} Q_0 (\theta - \theta_0)\), where \(Q_0 = Q_{\theta_0}\).
Define the Linearized Q-iteration \(\theta_{k+1} \leftarrow \argmin_{\theta \in \mathbb{R}^d} \|Q_{\text{lin},\theta} - \mathcal{T} Q_k\|^2_2\), \(Q_{k+1} \leftarrow Q_{\theta_{k+1}}\).
Here \(K: V \rightarrow V\) denotes a kernel operator acting on value functions, e.g. the Neural Tangent Kernel \(K = \nabla_\theta V_{\theta_0} \nabla_\theta V_{\theta_0}^T\).
We consider Q-iteration \(Q_{k+1} \leftarrow \mathcal{T}Q_k\) with a (neural network) function approximator of the Q-function and transitions sampled from a replay buffer.
Mnih et al. Human-level control through deep reinforcement learning.
Jacot et al. Neural tangent kernel: Convergence and generalization in neural networks.
Smirnova and Dohmatob. On the convergence of smooth regularized approximate value iteration schemes.
Bietti and Mairal. On the inductive bias of neural tangent kernels.
Xu et al. A finite-time analysis of Q-learning with neural network function approximation.
Fan et al. A theoretical analysis of deep Q-learning.
Note that \(K=I\) recovers standard Value Iteration.
We model the DQN incremental update as one step of gradient descent on the Linearized Q-iteration objective. We show that, in Q-function space, it results in the Kernel Q-iteration \(Q_{k+1} \leftarrow K_0 \mathcal{T}Q_k\), where \(K_0 := \nabla_{\theta} Q_0 \nabla_{\theta} Q_0^T\) is the Neural Tangent Kernel at initialization.
Proof. \(\theta_{k+1} \leftarrow \theta_0 - \eta \nabla_\theta Q_0^T (Q_0 + \nabla_\theta Q_0 (\theta-\theta_0) - \mathcal{T} Q_k)\big|_{\theta=\theta_0} = \theta_0 - \eta \nabla_\theta Q_0^T (Q_0 - \mathcal{T} Q_k)\). In Q-function space, \(Q_{\theta_{k+1}} \leftarrow Q_0 + \nabla_\theta Q_0(\theta_{k+1} - \theta_0) = Q_0 + \eta \nabla_\theta Q_0 \nabla_\theta Q_0^T (\mathcal{T}Q_k - Q_0)\). The result follows with \(\eta = 1\) and \(Q_0 = 0\).
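The identification with the NTK can be checked directly. The sketch below is our own illustration (the width, input features, scaling, and variable names are assumptions, not the paper's code): it builds a small two-layer ReLU network, forms the Jacobian \(\nabla_\theta Q_0\) analytically, takes one gradient step on \(\tfrac{1}{2}\|Q_\theta - \mathcal{T}Q_k\|_2^2\) at \(\theta_0\), and compares the resulting network outputs to the kernel prediction \(Q_0 + \eta K_0(\mathcal{T}Q_k - Q_0)\); the agreement is exact for the linearized model and approximate for the finite-width network.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
d, h, N = 8, 512, 10                           # input dim, hidden width, number of (s, a) pairs
X = rng.standard_normal((N, d)) / np.sqrt(d)   # feature vectors of the (s, a) pairs
W = rng.standard_normal((h, d))                # first-layer weights at initialization
a = rng.standard_normal(h) / np.sqrt(h)        # second-layer weights at initialization

def forward(W, a, X):
    return np.maximum(X @ W.T, 0.0) @ a        # two-layer ReLU network, one output per row of X

# Jacobian of the network outputs w.r.t. all parameters, evaluated at initialization.
H = np.maximum(X @ W.T, 0.0)                   # hidden activations, shape (N, h)
mask = (X @ W.T > 0).astype(float)             # ReLU gates, shape (N, h)
J_a = H                                        # d f / d a
J_W = (mask * a)[:, :, None] * X[:, None, :]   # d f / d W, shape (N, h, d)
J = np.concatenate([J_W.reshape(N, -1), J_a], axis=1)
K0 = J @ J.T                                   # empirical Neural Tangent Kernel at theta_0

Q0 = forward(W, a, X)
target = rng.standard_normal(N)                # stands in for T Q_k on the sampled pairs
eta = 1e-2

grad = J.T @ (Q0 - target)                     # gradient of 0.5 * ||Q_theta - target||^2 at theta_0
theta1 = np.concatenate([W.reshape(-1), a]) - eta * grad
W1, a1 = theta1[:h * d].reshape(h, d), theta1[h * d:]

Q_net = forward(W1, a1, X)                     # actual network after one gradient step
Q_kernel = Q0 + eta * K0 @ (target - Q0)       # kernel Q-iteration prediction
print(np.abs(Q_net - Q_kernel).max())          # small for wide networks / small eta
\end{verbatim}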
\(\theta_{k+1} \leftarrow \argmin_{\theta \in \mathbb{R}^d} \|Q_{\theta} - \mathcal{T}_\beta Q_k\|^2_2\), \(Q_{k+1} \leftarrow Q_{\theta_{k+1}}\), where \(\mathcal{T}_\beta := (1-\beta)I + \beta \mathcal{T}\) is the smooth Bellman operator and \(Q_\theta\), \(\theta \in \mathbb{R}^d\), is a Q-network.
Kernel Value Iteration: \(V_{k+1} \leftarrow K\mathcal{T}V_k\).
Proposition 1. Let \(k: \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{R}\) be a positive definite kernel function. If \(\|K\|_{op}:=\sup_{s \in \mathcal{S}} \|k(s,\cdot)\|_1 \leq 1\), Kernel Value Iteration converges at rate \(\tilde{\gamma} := \|K\|_{op} \gamma\). In particular, convergence holds for a normalized kernel function, i.e. \(k(s,s) = 1\).
Proof. Under the conditions of Proposition 1, \(K\mathcal{T}\) is a valid Bellman operator, i.e. it satisfies the contraction, monotonicity and distributivity properties; in particular, \(\|K\mathcal{T}V - K\mathcal{T}V'\|_\infty \leq \|K\|_{op}\,\|\mathcal{T}V - \mathcal{T}V'\|_\infty \leq \|K\|_{op}\,\gamma\,\|V - V'\|_\infty\).
Its fixed point \(\tilde{V}^*\) satisfies \(K\mathcal{T}\tilde{V}^* = \tilde{V}^*\), and the distance to the optimal value function is bounded by \(\|V^* -\tilde{V}^*\|_\infty \leq \frac{1}{1-\gamma} \|(I-K)\mathcal{T}\tilde{V}^*\|_\infty\).
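Proposition 1 can be illustrated numerically. The sketch below runs Kernel Value Iteration under our own illustrative assumptions (a random MDP and a row-subnormalized RBF kernel matrix with \(\sup_s \|k(s,\cdot)\|_1 = 0.8\)) and checks that the per-step error ratios stay below \(\tilde{\gamma} = \|K\|_{op}\gamma\).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 20, 4, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))

def T(V):
    # optimal Bellman operator: [T V](s) = max_a r(s, a) + gamma * E[V(s')]
    return (R + gamma * P @ V).max(axis=1)

# Kernel matrix with max row sum 0.8, so that ||K||_op <= 1 as required by Proposition 1.
S = rng.standard_normal((nS, 3))
K = np.exp(-0.5 * np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1))
K = 0.8 * K / K.sum(axis=1, keepdims=True)
op_norm = np.abs(K).sum(axis=1).max()          # sup_s ||k(s, .)||_1
rate = op_norm * gamma                         # gamma_tilde

V_star = np.zeros(nS)                          # fixed point of K T, by iterating to convergence
for _ in range(500):
    V_star = K @ T(V_star)

V, errs = rng.uniform(size=nS), []
for _ in range(30):
    V = K @ T(V)
    errs.append(np.abs(V - V_star).max())

ratios = np.array(errs[1:]) / np.array(errs[:-1])
print(rate, ratios.max())                      # empirical contraction ratios stay below gamma_tilde
\end{verbatim}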
\(V_{k+1} \leftarrow K \hat{\mathcal{T}} V_k\), where \([\hat{\mathcal{T}}V](s) := r(s, a^*) + \gamma V(s')\) is the sampled Bellman operator, with \(s' \sim P(\cdot|s,a^*)\) and \(a^* = \argmax_a Q_V(s, a)\).
Proposition 2. Sampling error propagation of Kernel VI is given by \(\|V_N - \tilde{V}^*\|_\infty \leq \sum_{k=1}^N \tilde{\gamma}^{N-k} \|\tilde{\epsilon}_k\|_\infty + \tilde{\gamma}^N\|V_0 - \tilde{V}^*\|_\infty\), where \(\tilde{\gamma}:=\|K\|_{op} \gamma\) and \(\tilde{\epsilon}_k:=K\epsilon_k\), \(\epsilon_k \in \mathbb{R}^{\mathcal{S}}\) is a noise vector.
Proof. \(V_{k+1} \leftarrow K(\mathcal{T}V_k + \epsilon_{k+1}) = K\mathcal{T}V_k + K\epsilon_{k+1}\), where \(\mathbb{E}[\epsilon_k] = 0\) and \(\tilde{\epsilon}_{k} := K\epsilon_k\) satisfies \(\mathbb{E}[\tilde{\epsilon}_k] = 0\) and \(\|\tilde{\epsilon}_k\|_\infty \leq \|K\|_{op}\|\epsilon_k\|_\infty \leq \|\epsilon_k\|_\infty\).
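For completeness, using \(\|V_{k+1} - \tilde{V}^*\|_\infty \leq \tilde{\gamma}\|V_k - \tilde{V}^*\|_\infty + \|\tilde{\epsilon}_{k+1}\|_\infty\), the recursion unrolls into the stated bound:
\[
\|V_N - \tilde{V}^*\|_\infty \leq \tilde{\gamma}\,\|V_{N-1} - \tilde{V}^*\|_\infty + \|\tilde{\epsilon}_N\|_\infty \leq \cdots \leq \tilde{\gamma}^N \|V_0 - \tilde{V}^*\|_\infty + \sum_{k=1}^{N} \tilde{\gamma}^{N-k}\, \|\tilde{\epsilon}_k\|_\infty.
\]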
The finite-sample iteration \(V_{k+1} \leftarrow K_n \mathcal{T}V_k\), where \((K_n)_{ij} = \frac{1}{n} k(s_i, s_j)\), \(i,j \in [n]\), is the kernel Gram matrix and the \(s_i\) are independent samples from \(P_\mu\), converges at rate \(\tilde{\gamma}:= \|K_n\|_{op}\gamma\). Here \(K_n\) is a finite-sample approximation of \(K\), i.e. \(K_n \xrightarrow[n \rightarrow \infty]{} K\). In the eigenbasis of \(K_n\), the convergence rate is governed by the largest eigenvalue of \(K_n\), i.e. \(\|K_n\|_{op} = \lambda_{\max}(K_n)\).
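As an illustration (the RBF kernel and the sampling distribution here are our own choices, not the paper's), the Gram matrix and its largest eigenvalue can be formed as follows.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, gamma = 50, 0.9
S = rng.standard_normal((n, 4))                       # n independent samples s_i (illustrative P_mu)
k = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))  # any positive definite kernel, here RBF

K_n = np.array([[k(si, sj) for sj in S] for si in S]) / n   # (K_n)_{ij} = k(s_i, s_j) / n
lam_max = np.linalg.eigvalsh(K_n).max()               # symmetric PSD: operator norm = largest eigenvalue
print(lam_max, lam_max * gamma)                       # ||K_n||_op and the induced rate gamma_tilde
\end{verbatim}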
Smooth Kernel Q-iteration with transition sampling: \(Q_{k+1} \leftarrow K_n \hat{\mathcal{T}}_\beta Q_k\).
Proposition 3. Sampling error propagation of smooth Kernel Q-iteration is given by \(\|Q_N - \tilde{Q}^*\|_\infty \leq \sum_{k=1}^N \tilde{\gamma}^{N-k} \|\beta\tilde{\epsilon}_k\|_\infty + \tilde{\gamma}^N\|Q_0 - \tilde{Q}^*\|_\infty\), where \(\tilde{\gamma}:=\|K_n\|_{op}(\beta \gamma + 1 - \beta)\) and \(\tilde{\epsilon}_k:=K_n \epsilon_k\), \(\epsilon_k \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}\) is a noise vector.
Proof. Follows from Proposition 2, \(Q_{k+1} \leftarrow K_n ((1-\beta) Q_k + \beta\hat{\mathcal{T}} Q_k) = K_n (\mathcal{T}_\beta Q_k + \beta \epsilon_{k+1})\), and the fact that \(\mathcal{T}_\beta\) is a \((1-\beta+\beta\gamma)\)-contraction.
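The sketch below (a random MDP, a row-subnormalized kernel matrix, and one sampled next state per pair are all our own illustrative choices) runs the smooth Kernel Q-iteration with transition sampling and reports the theoretical modulus \(\|K_n\|_{op}(\beta\gamma + 1 - \beta)\) next to the residual error, which settles at a noise floor rather than decaying to zero, consistent with Proposition 3.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, beta = 15, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))

def T(Q):                                          # exact Bellman operator
    return R + gamma * P @ Q.max(axis=1)

def T_hat(Q):                                      # sampled Bellman operator: one next state per (s, a)
    s_next = np.array([[rng.choice(nS, p=P[s, a]) for a in range(nA)] for s in range(nS)])
    return R + gamma * Q.max(axis=1)[s_next]

# Row-subnormalized kernel matrix over states (illustrative; max row sum 0.5, so ||K||_op = 0.5).
S = rng.standard_normal((nS, 3))
K = np.exp(-0.5 * np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1))
K = 0.5 * K / K.sum(axis=1, keepdims=True)
rate = np.abs(K).sum(axis=1).max() * (beta * gamma + 1 - beta)

Q_star = np.zeros((nS, nA))                        # fixed point of the noiseless smooth kernel iteration
for _ in range(500):
    Q_star = K @ ((1 - beta) * Q_star + beta * T(Q_star))

Q, errs = rng.uniform(size=(nS, nA)), []
for _ in range(200):
    Q = K @ ((1 - beta) * Q + beta * T_hat(Q))     # smooth kernel Q-iteration with transition sampling
    errs.append(np.abs(Q - Q_star).max())

print(rate, errs[0], errs[-1])                     # theoretical modulus, initial error, residual noise floor
\end{verbatim}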
We evaluate \(V_{k+1} \leftarrow K_{NTK,n} \hat{\mathcal{T}}_\beta V_k\) on a stochastic cliff walking problem.
Setup. \(K_{NTK, n}(s_i, s_j) = h_{NTK}(\langle s_i, s_j \rangle)/2\), where \(\langle s_i, s_j \rangle = \mathbf{1}_{s_i=s_j}\), \(i,j \in [n]\), and \(h_{NTK}\) is the Neural Tangent Kernel of a two-layer ReLU neural network; \(\gamma = 0.9\), \(\beta = 0.1\).
The figure shows the empirical convergence to a fixed point together with the theoretical rate \(C\tilde{\gamma}^N\), where \(\tilde{\gamma}:= \|K_{NTK,n}\|_{op}(\beta \gamma + 1-\beta) \approx 0.19\) and \(\|K_{NTK,n}\|_{op} \approx 0.19\).
The NTK of a multi-layer ReLU neural network is given by a dot-product kernel function on the unit sphere, \(k_{NTK}(s,s') = h_{NTK}(\langle s,s' \rangle)\), where \(h_{NTK}: \mathbb{R} \rightarrow \mathbb{R}\) is related to the arc-cosine kernel function (Bietti and Mairal). If \(\mathcal{S} = \mathbb{S}^{d-1}\) is the unit sphere, then the normalized NTK kernel \(h_{NTK}/h_{NTK}(1)\) satisfies the conditions of Proposition 1.
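For reference, \(h_{NTK}\) of a two-layer ReLU network can be written in terms of the degree-0 and degree-1 arc-cosine kernels. The sketch below uses one common convention (constant factors may differ from the paper's normalization), under which \(h_{NTK}(1) = 2\), matching the \(h_{NTK}/2\) normalization used in the experiment.
\begin{verbatim}
import numpy as np

def h_ntk(u):
    # NTK of a two-layer ReLU network on unit-norm inputs, via arc-cosine kernels
    # (one common convention; constant factors may differ from the paper's).
    u = np.clip(u, -1.0, 1.0)
    kappa0 = (np.pi - np.arccos(u)) / np.pi                                 # degree-0 arc-cosine kernel
    kappa1 = (np.sqrt(1.0 - u ** 2) + (np.pi - np.arccos(u)) * u) / np.pi   # degree-1 arc-cosine kernel
    return u * kappa0 + kappa1

print(h_ntk(1.0))   # 2.0, so h_NTK / h_NTK(1) = h_NTK / 2 is the normalized kernel
print(h_ntk(0.0))   # 1 / pi for orthogonal inputs
\end{verbatim}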