Approximate modified policy iteration (AMPI; Scherrer et al. 2015): the parameter m interpolates between the two classical schemes, with m = ∞ recovering policy iteration and m = 1 recovering value iteration.
The error propagation features two terms: a cumulative approximation error and an initialization error that contracts at the γ rate of convergence.
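Schematically, the AMPI bounds have the following shape (constants simplified; see Scherrer et al. 2015 for the exact statement, with ε a uniform bound on the per-iteration approximation error and d_0 the initial error):

  ‖l_k‖_∞ ≲ 2γ/(1−γ)² · ε + 2γ^k/(1−γ) · ‖d_0‖_∞

The first term is the cumulative approximation error, which does not vanish with k; the second is the initialization error, which contracts geometrically at rate γ.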
Regularized MDPs (Geist et al. 2019): the regularized Bellman operator replaces the hard maximum with Ω∗(A_V), a smooth maximum of the action advantages. Its sharpness is controlled by the temperature of the entropy regularizer, whose maximizing policy is a softmax.
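For the negative-entropy regularizer this is explicit: Ω∗ is the log-sum-exp, a smooth maximum whose gradient is the softmax policy, and the temperature λ sets how closely it tracks the hard maximum (λ → 0 recovers the unregularized case):

  Ω(π_s) = λ Σ_a π(a|s) log π(a|s),   Ω∗(q_s) = λ log Σ_a exp(q_s(a)/λ),   ∇Ω∗(q_s) = softmax(q_s/λ)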
Setting: an L-layer fully connected neural network fθ : R^(m0) → R^(mL) with randomly initialized weights, where m_h is the width of layer h. Each intermediate layer is rescaled by 1/√m_h so that the input norm is preserved at initialization (the NTK parameterization). The value network is trained on the squared loss by full-batch gradient flow, the continuous-time limit of gradient descent.
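A minimal sketch of this parameterization (the widths, the ReLU activation, and its √2 variance correction are illustrative assumptions, not taken from the papers):

import numpy as np

def init_params(widths, seed=0):
    # N(0,1) weights; the 1/sqrt(m_h) factor is applied in the forward pass
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((m_in, m_out))
            for m_in, m_out in zip(widths[:-1], widths[1:])]

def forward(params, x, verbose=False):
    h = x
    for W in params[:-1]:
        # 1/sqrt(m_h) keeps per-unit activation variance O(1) as width grows;
        # sqrt(2) compensates for ReLU zeroing half of the pre-activations
        h = np.sqrt(2.0) * np.maximum(0.0, h @ W) / np.sqrt(W.shape[0])
        if verbose:
            print(f"hidden RMS ~ {np.linalg.norm(h) / np.sqrt(h.size):.3f}")
    return h @ params[-1] / np.sqrt(params[-1].shape[0])

widths = [64, 2048, 2048, 1]   # m_0, m_1, m_2, m_L (hypothetical sizes)
params = init_params(widths)
x = np.random.default_rng(1).standard_normal(widths[0])
print(f"input RMS  ~ {np.linalg.norm(x) / np.sqrt(x.size):.3f}")
print("f(x) =", forward(params, x, verbose=True))

The printed per-unit RMS stays near 1 across the hidden layers, which is exactly the norm-preservation property the 1/√m_h scaling is designed to give.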
Jacot et al. 2018: in the infinite-width limit, training is governed by the neural tangent kernel (NTK), and for large m the empirical kernel concentrates around this limit. Lee et al. 2019: with sufficiently large width, K(t) ≈ K throughout training, so the network evolves like its linearization around initialization. The spectrum of the limiting NTK then determines the convergence of gradient flow.
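Why the spectrum matters, written out for the linearized model under full-batch gradient flow on the squared loss (a standard computation with learning rate η, not specific to the value-network setting):

  d/dt f_t(X) = −η K (f_t(X) − y)   ⇒   f_t(X) − y = e^(−ηKt) (f_0(X) − y)

Decomposing K = Σ_i λ_i v_i v_iᵀ, the residual along eigenvector v_i decays as e^(−ηλ_i t): directions with large NTK eigenvalues are learned quickly, directions with small eigenvalues slowly.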
Trade-offs: stability vs. sensitivity to the initialization / random seed
Reduced overestimation errors vs. modifying the original problem
Function-approximation (FA) errors vanish, under certain conditions, in the large-width limit