Deep Learning HMC
Building Topological Samplers for Lattice QCD




Sam Foreman
May, 2021
Acknowledgements
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Collaborators:
- Xiao-Yong Jin
- James C. Osborn
Huge thank you to:
- Norman Christ
- Akio Tomiya
- Luchang Jin
- Chulwoo Jung
- Peter Boyle
- Taku Izubuchi
- Critical Slowing Down group (ECP)
- ALCF Staff + Datascience group
MCMC in Lattice QCD
- Generating independent gauge configurations is a MAJOR bottleneck for Lattice QCD.

- As the lattice spacing \(a \rightarrow 0\), the MCMC updates tend to get stuck in sectors of fixed gauge topology.
	- This causes the number of steps needed to adequately sample different topological sectors to increase exponentially.
 
Critical slowing down!
Markov Chain Monte Carlo (MCMC)
- Goal: Draw independent samples from a target distribution, \(p(x)\)
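- Starting from some initial state \(x_{0}\sim \mathcal{N}(0, \mathbb{1})\), we generate proposal configurations \(x^{\prime}\)
- Use the Metropolis-Hastings acceptance criterion: for a symmetric proposal, accept \(x^{\prime}\) with probability \(A(x^{\prime}|x) = \min\left\{1, \frac{p(x^{\prime})}{p(x)}\right\}\)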
Inefficient!
Issues with MCMC
Goal: Generate an ensemble of independent configurations
1. Construct chain:
\(x_{0}\rightarrow x_{1}\rightarrow x_{2}\rightarrow\cdots\rightarrow x_{m-1}\rightarrow x_{m}\rightarrow x_{m+1}\rightarrow\cdots\rightarrow x_{n-2}\rightarrow x_{n-1}\rightarrow x_{n}\)
- Generate proposal \(x^{\prime}\) (random walk): \(x^{\prime} = x + \delta\), where \(\delta \sim \mathcal{N}(0, \mathbb{1})\)
2. Thermalize ("burn-in"):
\(x_{0}\rightarrow x_{1}\rightarrow x_{2}\rightarrow\cdots\rightarrow x_{m-1}\rightarrow x_{m}\rightarrow x_{m+1}\rightarrow\cdots\rightarrow x_{n-2}\rightarrow x_{n-1}\rightarrow x_{n}\)
3. Drop correlated samples ("thinning"):
\(x_{0}\rightarrow x_{1}\rightarrow x_{2}\rightarrow\cdots\rightarrow x_{m-1}\rightarrow x_{m}\rightarrow x_{m+1}\rightarrow\cdots\rightarrow x_{n-2}\rightarrow x_{n-1}\rightarrow x_{n}\)
(each state in the chain is marked as saved or dropped)
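The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the function names, the `step_size` parameter, and the 2D standard-normal toy target are assumptions made for the example.

```python
import numpy as np


def random_walk_metropolis(log_prob, x0, n_steps, step_size=1.0, seed=0):
    """Build a chain with proposals x' = x + step_size * delta, delta ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    logp = log_prob(x)
    chain = np.empty((n_steps,) + x.shape)
    n_accept = 0
    for i in range(n_steps):
        proposal = x + step_size * rng.normal(size=x.shape)
        logp_prop = log_prob(proposal)
        # Symmetric proposal: accept with probability min(1, p(x') / p(x)).
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = proposal, logp_prop
            n_accept += 1
        chain[i] = x  # a rejected proposal repeats the current state
    return chain, n_accept / n_steps


def thermalize_and_thin(chain, n_burn, stride):
    """Drop the first n_burn states (burn-in), then keep every stride-th state."""
    return chain[n_burn::stride]


# Toy target (assumption): 2D standard normal, log p(x) = -|x|^2 / 2 + const.
chain, acc_rate = random_walk_metropolis(
    lambda x: -0.5 * np.sum(x**2), np.zeros(2), n_steps=20000
)
samples = thermalize_and_thin(chain, n_burn=1000, stride=10)
```

Only a small fraction of the generated states survives burn-in and thinning, which is the inefficiency this slide points at.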
Hamiltonian Monte Carlo (HMC)
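- Introduce fictitious momentum: \(v\sim\mathcal{N}(0, \mathbb{1})\)
- Target distribution: \(p(x)\propto e^{-S(x)}\)
- Joint target distribution (lift to phase space): \(p(x, v) = p(x)\cdot p(v) \propto e^{-S(x)}e^{-\frac{1}{2}v^{T}v} = e^{-H(x, v)}\)
- Hamilton's Equations: \(\dot{x} = \frac{\partial H}{\partial v}\), \(\dot{v} = -\frac{\partial H}{\partial x}\)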
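HMC: Leapfrog Integrator
- Hamilton's Eqs: \(\dot{x} = \frac{\partial H}{\partial v}\), \(\dot{v} = -\frac{\partial H}{\partial x}\)
- Hamiltonian: \(H(x, v) = S(x) + \frac{1}{2}v^{T}v\)
- \(N_{\mathrm{LF}}\) leapfrog steps (trajectory), each with step size \(\varepsilon\):
Leapfrog Integrator
1. Half-step \(v\)-update: \(\tilde{v} = v - \frac{\varepsilon}{2}\partial_{x}S(x)\)
2. Full-step \(x\)-update: \(x' = x + \varepsilon\tilde{v}\)
3. Half-step \(v\)-update: \(v' = \tilde{v} - \frac{\varepsilon}{2}\partial_{x}S(x')\)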
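The half-step, full-step, half-step pattern is the standard leapfrog scheme for \(H(x, v) = S(x) + \frac{1}{2}v^{T}v\). The sketch below is illustrative only: the function names, the step size `eps`, and the Gaussian toy action are assumptions, not code from this work.

```python
import numpy as np


def leapfrog(x, v, grad_S, eps, n_lf):
    """Integrate Hamilton's equations for n_lf leapfrog steps of size eps."""
    x, v = np.array(x, dtype=float), np.array(v, dtype=float)
    for _ in range(n_lf):
        v = v - 0.5 * eps * grad_S(x)  # 1. half-step v-update
        x = x + eps * v                # 2. full-step x-update
        v = v - 0.5 * eps * grad_S(x)  # 3. half-step v-update
    return x, v


def hmc_step(x, S, grad_S, eps, n_lf, rng):
    """One HMC update: resample v, run a trajectory, accept/reject on the change in H."""
    v = rng.normal(size=np.shape(x))
    h_old = S(x) + 0.5 * np.sum(v**2)
    x_new, v_new = leapfrog(x, v, grad_S, eps, n_lf)
    h_new = S(x_new) + 0.5 * np.sum(v_new**2)
    if np.log(rng.uniform()) < h_old - h_new:
        return x_new, True
    return np.array(x, dtype=float), False


# Toy action (assumption): S(x) = |x|^2 / 2, so grad S(x) = x.
S = lambda x: 0.5 * np.sum(x**2)
grad_S = lambda x: x
x0, v0 = np.array([1.0, -0.5]), np.array([0.3, 0.2])
x1, v1 = leapfrog(x0, v0, grad_S, eps=0.1, n_lf=10)
```

Flipping the sign of the final momentum and integrating again returns to the starting point, which is the reversibility property the accept/reject step relies on.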
HMC: Issues
- Cannot easily traverse low-density zones.
- What do we want in a good sampler?
	- Fast mixing
	- Fast burn-in
	- Mix across energy levels
	- Mix between modes
- Energy levels selected randomly \(\longrightarrow\) slow mixing!
Stuck!


Leapfrog Layer
- Introduce a persistent direction \(d \sim \mathcal{U}(+,-)\) (forward/backward)
- Introduce a discrete index \(k \in \{1, 2, \ldots, N_{\mathrm{LF}}\}\) to denote the current leapfrog step
- Let \(\xi = (x, v, \pm)\) denote a complete state, then the target distribution is given by \(p(\xi) = p(x)\cdot p(v)\cdot p(d)\)
- Each leapfrog step transforms \(\xi_{k} = (x_{k}, v_{k}, \pm) \rightarrow (x''_{k}, v''_{k}, \pm) = \xi''_{k}\) by passing it through the \(k^{\mathrm{th}}\) leapfrog layer
- \(v\)-update \((d = +)\): (\(v\)-independent)
	- Momentum (\(v_{k}\)) scaling
	- Gradient \(\partial_{x}S(x_{k})\) scaling
	- Translation
- \(x\)-update \((d = +)\): (\(m_{t}\odot x\))-independent
- masks:
- where \((s_{v}^{k}, q^{k}_{v}, t^{k}_{v})\) and \((s_{x}^{k}, q^{k}_{x}, t^{k}_{x})\) are parameterized by neural networks
L2HMC: Generalized Leapfrog


- Complete (generalized) update:
	- Half-step \(v\) update:
	- Full-step update of one half of \(x\) (selected by the mask):
	- Full-step update of the other half of \(x\) (selected by the complementary mask):
	- Half-step \(v\) update:
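A minimal sketch of one generalized leapfrog layer in the forward direction (\(d = +\)) is given below. The update formulas are not preserved in this text, so the sketch assumes the form used in the L2HMC construction, with a scaling, a gradient or momentum scaling, and a translation term per update: \(v' = v\odot e^{\frac{\varepsilon}{2}s_{v}} - \frac{\varepsilon}{2}\left[\partial_{x}S(x)\odot e^{\varepsilon q_{v}} + t_{v}\right]\) and \(x' = m\odot x + \bar{m}\odot\left[x\odot e^{\varepsilon s_{x}} + \varepsilon\left(v\odot e^{\varepsilon q_{x}} + t_{x}\right)\right]\). The callables `nets_v` and `nets_x` are hypothetical stand-ins for the neural networks, and all function names are made up for the example.

```python
import numpy as np


def v_update(x, v, grad_S, eps, nets_v):
    """Generalized half-step v-update (forward direction)."""
    g = grad_S(x)
    s, q, t = nets_v(x, g)  # network inputs do not depend on v
    v_new = v * np.exp(0.5 * eps * s) - 0.5 * eps * (g * np.exp(eps * q) + t)
    return v_new, np.sum(0.5 * eps * s)  # new momentum, log|det Jacobian|


def x_update(x, v, eps, mask, nets_x):
    """Generalized update of the entries of x where mask == 0 (forward direction)."""
    mask_bar = 1.0 - mask
    s, q, t = nets_x(mask * x, v)  # networks only see the entries held fixed
    x_new = mask * x + mask_bar * (x * np.exp(eps * s) + eps * (v * np.exp(eps * q) + t))
    return x_new, np.sum(mask_bar * eps * s)


def leapfrog_layer(x, v, grad_S, eps, mask, nets_v, nets_x):
    """Half-step v, two complementary masked x-updates, half-step v."""
    v, ld1 = v_update(x, v, grad_S, eps, nets_v)
    x, ld2 = x_update(x, v, eps, mask, nets_x)
    x, ld3 = x_update(x, v, eps, 1.0 - mask, nets_x)
    v, ld4 = v_update(x, v, grad_S, eps, nets_v)
    return x, v, ld1 + ld2 + ld3 + ld4


def zero_nets(a, b):
    """Hypothetical 'untrained' networks: s = q = t = 0."""
    z = np.zeros_like(b)
    return z, z, z
```

With `zero_nets` the layer reduces to an ordinary leapfrog step with zero log-determinant, so standard HMC is recovered as a special case. Because each sub-update only rescales and shifts the variables it changes, using functions of the variables it leaves fixed, the Jacobian is triangular and its log-determinant is a simple sum.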
 
Leapfrog Layer

masks:

Stack of fully-connected layers
\(x_{k} \in U(1) \longrightarrow x_{k} = \left[\cos\theta, \sin\theta\right]\)

Training Algorithm
construct trajectory
Compute loss + backprop
Metropolis-Hastings accept/reject
re-sample momentum + direction
Annealing Schedule
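- Introduce an annealing schedule during the training phase: \(\{\gamma_{t}\}_{t=0}^{N} = \{\gamma_{0}, \gamma_{1}, \ldots, \gamma_{N-1}, \gamma_{N}\}\), e.g. \(\{0.1, 0.2, \ldots, 0.9, 1.0\}\)
	- increasing: \(\gamma_{0} < \gamma_{1} < \cdots < \gamma_{N} \equiv 1\)
	- varied slowly: \(|\gamma_{t+1} - \gamma_{t}| \ll 1\)
- Target distribution becomes: \(p_{t}(x)\propto e^{-\gamma_{t}S(x)}\)
- For \(\gamma_{t} < 1\), this helps to rescale (shrink) the energy barriers between isolated modes
	- Allows our sampler to explore previously inaccessible regions of the target distribution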
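To see how the schedule shrinks barriers, the sketch below assumes the annealed target takes the form \(p_{t}(x)\propto e^{-\gamma_{t}S(x)}\) and uses a one-dimensional double-well action as a toy example. Both `annealed_action` and `double_well` are hypothetical names introduced for illustration.

```python
import numpy as np


def annealed_action(S, gamma):
    """Action of the annealed target p_t(x) ∝ exp(-gamma * S(x))."""
    return lambda x: gamma * S(x)


def double_well(x):
    """Toy action (assumption): minima at x = ±1, barrier of height 4 at x = 0."""
    return 4.0 * (x**2 - 1.0) ** 2


# Increasing schedule {0.1, 0.2, ..., 1.0}; the final value recovers the true target.
schedule = np.linspace(0.1, 1.0, 10)

# Barrier height seen by the sampler at each stage of the schedule.
barriers = np.array(
    [annealed_action(double_well, g)(0.0) - annealed_action(double_well, g)(1.0)
     for g in schedule]
)
```

The barrier grows linearly with \(\gamma_{t}\), from 0.4 at the start of this schedule to the full height of 4 at \(\gamma_{t} = 1\), so early training sees a much flatter landscape.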
 
Example: GMM \(\in\mathbb{R}^{2}\)
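- Define the squared jump distance: \(\delta(\xi', \xi) = \|x' - x\|^{2}_{2}\)
- Maximize expected squared jump distance: \(\mathcal{L}(\theta) = \mathbb{E}_{p(\xi)}\left[A(\xi'|\xi)\cdot\delta(\xi',\xi)\right]\)
Note:
\(\xi\) = initial state
\(\xi'\) = proposed state
\(A(\xi'|\xi)\) = acceptance probability
\(A(\xi'|\xi)\cdot\delta(\xi',\xi)\) = avg. distance
Figure panels: HMC, L2HMC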
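A batch estimate of this objective can be sketched as follows. The sketch assumes the acceptance probability has the usual Metropolis-Hastings form \(A(\xi'|\xi) = \min\{1, e^{H(\xi) - H(\xi')}\,|\partial\xi'/\partial\xi|\}\), where the Jacobian factor only matters for the generalized (non-volume-preserving) update. The function names and the toy numbers are made up for the example.

```python
import numpy as np


def acceptance_prob(h_init, h_prop, logdet=0.0):
    """A(xi'|xi) = min(1, exp(H(xi) - H(xi') + log|det J|)), evaluated per chain."""
    return np.exp(np.minimum(0.0, h_init - h_prop + logdet))


def expected_squared_jump_distance(x_init, x_prop, acc):
    """Batch average of A(xi'|xi) * |x' - x|^2."""
    sq_dist = np.sum((x_prop - x_init) ** 2, axis=-1)
    return np.mean(acc * sq_dist)


def esjd_loss(x_init, x_prop, acc):
    """Minimizing this loss maximizes the expected squared jump distance."""
    return -expected_squared_jump_distance(x_init, x_prop, acc)


# Toy batch of two chains in R^2 (values chosen only for illustration).
x_init = np.zeros((2, 2))
x_prop = np.array([[3.0, 4.0], [1.0, 0.0]])
acc = np.array([0.5, 1.0])
esjd = expected_squared_jump_distance(x_init, x_prop, acc)  # (0.5 * 25 + 1.0 * 1) / 2
```

Weighting by the acceptance probability means long proposed jumps only count if they are likely to be accepted.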
Lattice Gauge Theory
- Link variables:
- Wilson action:
- Topological charge:
	- real-valued definition: continuous, differentiable
	- integer-valued definition \(\mathcal{Q}_{\mathbb{Z}}\): discrete, hard to work with
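The formulas on this slide are not preserved in this text, so the sketch below assumes the standard 2D \(U(1)\) definitions: link angles \(x_{\mu}(n)\in[-\pi,\pi]\), plaquette angle \(x_{P} = x_{\mu}(n) + x_{\nu}(n+\hat{\mu}) - x_{\mu}(n+\hat{\nu}) - x_{\nu}(n)\), Wilson action \(S_{\beta}(x) = \beta\sum_{P}(1 - \cos x_{P})\), a continuous charge \(\frac{1}{2\pi}\sum_{P}\sin x_{P}\), and an integer charge \(\frac{1}{2\pi}\sum_{P}\lfloor x_{P}\rfloor\) with \(\lfloor\cdot\rfloor\) the projection onto \([-\pi,\pi)\). The array layout and function names are assumptions.

```python
import numpy as np


def plaquettes(x):
    """Plaquette angles for link angles x of shape (2, L, L) with periodic boundaries."""
    x0, x1 = x[0], x[1]
    return x0 + np.roll(x1, -1, axis=0) - np.roll(x0, -1, axis=1) - x1


def wilson_action(x, beta):
    """S_beta(x) = beta * sum_P (1 - cos x_P)."""
    return beta * np.sum(1.0 - np.cos(plaquettes(x)))


def charge_real(x):
    """Continuous, differentiable charge: (1 / 2 pi) * sum_P sin(x_P)."""
    return np.sum(np.sin(plaquettes(x))) / (2.0 * np.pi)


def charge_int(x):
    """Integer-valued charge: (1 / 2 pi) * sum_P of x_P projected onto [-pi, pi)."""
    wrapped = (plaquettes(x) + np.pi) % (2.0 * np.pi) - np.pi
    return np.sum(wrapped) / (2.0 * np.pi)


# Random ("hot") configuration on an 8 x 8 lattice.
rng = np.random.default_rng(0)
x_hot = rng.uniform(-np.pi, np.pi, size=(2, 8, 8))
q_int, q_real = charge_int(x_hot), charge_real(x_hot)
```

On a periodic lattice the unprojected plaquette angles sum to zero, so the integer-valued charge lands on an integer up to rounding, while the sine-based charge varies smoothly with the links and can be differentiated in a loss.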
Non-Compact Projection
[1.]
- Project \([-\pi, \pi]\) onto \(\mathbb{R}\) using a transformation: \(z = g(x)\), \(g: [-\pi, \pi] \rightarrow \mathbb{R}\)
	- \(z = \tan\left(\frac{x}{2}\right)\)
 
- Perform the update in \(\mathbb{R}\)
	- \(z' = m^{t}\odot z + \bar{m}^{t}\odot [\alpha z + \beta]\)
 
- Project back to \([-\pi, \pi]\) using the inverse transformation \(x = g^{-1}(z)\), \(g^{-1}: \mathbb{R}\rightarrow [-\pi, \pi]\)
	- \(x = 2\tan^{-1}(z)\)
 
- These steps can be combined into a single update equation
	- \(x' = m^{t}\odot x + \bar{m}^{t}\odot\left[2\tan^{-1}\left(\alpha\tan\left(\frac{x}{2}\right)\right) + \beta\right]\)
	- with corresponding Jacobian factor (for \(\alpha = \exp(\varepsilon s_{x})\))
		- \(\frac{\partial x'}{\partial x} = \frac{\exp(\varepsilon s_{x})}{\cos^{2}(x/2) + \exp(2\varepsilon s_{x})\sin^{2}(x/2)}\)
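The combined update and its Jacobian can be checked numerically. Differentiating \(2\tan^{-1}(\alpha\tan(x/2))\) with respect to \(x\) gives \(\alpha / \left(\cos^{2}(x/2) + \alpha^{2}\sin^{2}(x/2)\right)\), which is what the sketch below uses, assuming \(\alpha > 0\) and \(x\in(-\pi,\pi)\). The function name and the test values are hypothetical.

```python
import numpy as np


def noncompact_x_update(x, alpha, beta, mask):
    """x' = m * x + (1 - m) * [2 atan(alpha * tan(x / 2)) + beta], with log|det J|."""
    mask_bar = 1.0 - mask
    x_new = mask * x + mask_bar * (2.0 * np.arctan(alpha * np.tan(0.5 * x)) + beta)
    # Per-entry derivative of the bracket with respect to x.
    jac = alpha / (np.cos(0.5 * x) ** 2 + alpha**2 * np.sin(0.5 * x) ** 2)
    logdet = np.sum(mask_bar * np.log(jac))
    return x_new, logdet


# Illustrative values (assumptions): update every entry, with entry-wise alpha > 0.
x = np.array([-2.0, -0.5, 0.3, 1.7])
alpha = np.array([0.5, 1.5, 2.0, 0.8])
x_new, logdet = noncompact_x_update(x, alpha, beta=0.1, mask=np.zeros(4))
```

For \(\alpha = 1\) and \(\beta = 0\) the map is the identity, and entries protected by the mask contribute nothing to the log-determinant. After adding \(\beta\) the result can leave \([-\pi,\pi]\), so a final wrap back onto the circle may be needed in practice.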
 
 
\(x_{k} \in U(1) \longrightarrow x_{k} = \left[\cos\theta, \sin\theta\right]\)
Loss function: \(\mathcal{L}(\theta)\)
- We maximize the expected squared charge difference:




Figure panels: \(\beta = 5\), \(\beta = 6\), \(\beta = 7\)
Results: \(\tau^{\mathcal{Q}_{\mathbb{Z}}}_{\mathrm{int}}\)
- Want to calculate: \(\langle \mathcal{O}\rangle\propto \int \left[\mathcal{D} x\right] \mathcal{O}(x)e^{-S[x]}\)
- If we had independent configurations, we could approximate by \(\langle\mathcal{O}\rangle \simeq \frac{1}{N}\sum_{n=1}^{N} \mathcal{O}(x_{n})\longrightarrow \sigma^{2}=\frac{1}{N}\mathrm{Var}\left[\mathcal{O}(x)\right]\propto\frac{1}{N}\)
- Instead, we account for the autocorrelation, so the variance becomes: \(\sigma^{2} = \frac{\tau^{\mathcal{O}}_{\mathrm{int}}}{N}\mathrm{Var}\left[\mathcal{O}(x)\right]\)
Rescale: \(N_{\mathrm{LF}}\cdot\tau^{\mathcal{Q}_{\mathbb{Z}}}_{\mathrm{int}}\) to account for different trajectory lengths
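One common way to estimate \(\tau_{\mathrm{int}}\) from a measured time series is sketched below. It assumes the convention \(\tau_{\mathrm{int}} = 1 + 2\sum_{t\geq 1}\rho(t)\), which matches \(\sigma^{2} = \frac{\tau_{\mathrm{int}}}{N}\mathrm{Var}[\mathcal{O}]\), and truncates the sum with a self-consistent window (smallest \(W\) with \(W \geq c\,\tau_{\mathrm{int}}(W)\)). The window constant `c` and the function names are assumptions, and this is not necessarily the estimator used for the results shown.

```python
import numpy as np


def autocorrelation(series):
    """Normalized autocorrelation function rho(t), computed with zero-padded FFTs."""
    x = np.asarray(series, dtype=float)
    n = x.size
    x = x - x.mean()
    f = np.fft.rfft(x, n=2 * n)
    acov = np.fft.irfft(f * np.conjugate(f), n=2 * n)[:n]
    acov /= np.arange(n, 0, -1)  # divide lag t by the number of pairs, n - t
    return acov / acov[0]


def integrated_autocorr_time(series, c=5.0):
    """tau_int = 1 + 2 * sum_{t=1}^{W} rho(t), with the smallest W >= c * tau_int."""
    rho = autocorrelation(series)
    tau = 1.0
    for window in range(1, rho.size):
        tau = 1.0 + 2.0 * np.sum(rho[1:window + 1])
        if window >= c * tau:
            break
    return tau


def error_on_mean(series):
    """Standard error of the mean, inflated by the autocorrelation time."""
    x = np.asarray(series, dtype=float)
    return np.sqrt(integrated_autocorr_time(x) * x.var() / x.size)
```

For uncorrelated data the estimate is close to 1 and the usual \(1/\sqrt{N}\) error is recovered. For a correlated chain the error bar grows by a factor \(\sqrt{\tau_{\mathrm{int}}}\).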
Results: \(\tau^{\mathcal{Q}_{\mathbb{Z}}}_{\mathrm{int}}\)





Interpretation
- Look at how different quantities evolve over a single trajectory
	- The sampler artificially increases the energy during the first half of the trajectory (before returning to its original value)
Figure panels (vs. leapfrog step): variation in the avg plaquette, continuous topological charge, shifted energy

Interpretation
- Look at how \(\langle\delta x_{P}\rangle\) varies for different values of \(\beta\)
Figure annotations: \(\beta = 7\), \(\simeq \beta = 3\)
Training Costs
- We trained our model(s) using Horovod with TensorFlow on the ThetaGPU supercomputer at the Argonne Leadership Computing Facility.
- A typical training run:
	- 1 node (8\(\times\) NVIDIA A100 GPUs)
	- Batch size \(M = 2048\)
	- Hidden layer shapes \(=\{256, 256, 256\}\)
	- Leapfrog layers \(N_{\mathrm{LF}}=10\)
	- Lattice volume \(=16\times 16\)
	- Training steps \(=5\times 10^{5}\)
	- \(\simeq\) 24 hours to complete.
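For reference, the run parameters listed above can be collected in a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical and do not come from the project's code, and the derived link count assumes one link variable per direction per site.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class TrainingConfig:
    """Parameters of a typical training run, as listed on this slide."""
    num_gpus: int = 8
    batch_size: int = 2048
    hidden_layers: tuple = (256, 256, 256)
    num_leapfrog_layers: int = 10
    lattice_shape: tuple = (16, 16)
    train_steps: int = 500_000

    @property
    def links_per_config(self):
        # Assumption: one link variable per direction per lattice site.
        return len(self.lattice_shape) * prod(self.lattice_shape)

    @property
    def links_per_batch(self):
        return self.batch_size * self.links_per_config


config = TrainingConfig()
```

Under that assumption each \(16\times 16\) configuration has 512 link variables, so a batch of 2048 chains carries about \(10^{6}\) angles through every leapfrog layer.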
 
Next Steps
- Going forward, we plan to:
	- Reduce training cost
	- Continue testing on larger lattice volumes to better understand scaling efficiency
	- Generalize to 2D / 4D \(SU(3)\)
	- Test alternative network architectures
	- Gauge equivariant layers
 
Leapfrog Layer
- Each leapfrog step transforms \(\xi_{k}=(x_{k}, v_{k}, \pm)\rightarrow (x''_{k}, v''_{k}, \pm) = \xi''_{k}\) by passing it through the \(k^{\mathrm{th}}\) leapfrog layer.
- \(v\)-update \((d = +)\):
	- Momentum (\(v_{k}\)) scaling: \(\alpha_{s^{k}_{v}}\cdot s_{v}^{k}(\zeta_{v_{k}})\)
	- Gradient \(\partial_{x}S(x_{k})\) scaling: \(\alpha_{q^{k}_{v}}\cdot q_{v}^{k}(\zeta_{v_{k}})\)
	- Translation: \(\alpha_{t^{k}_{v}}\cdot t_{v}^{k}(\zeta_{v_{k}})\)
- \(x\)-update \((d = +)\):
	- \(\alpha_{s^{k}_{x}}\cdot s_{x}^{k}(\zeta_{x_{k}})\)
	- \(\alpha_{q^{k}_{x}}\cdot q_{x}^{k}(\zeta_{x_{k}})\)
	- \(\alpha_{t^{k}_{x}}\cdot t_{x}^{k}(\zeta_{x_{k}})\)
- where \((s_{v}^{k}, q^{k}_{v}, t^{k}_{v})\) and \((s_{x}^{k}, q^{k}_{x}, t^{k}_{x})\) are parameterized by neural networks, each multiplied by a coefficient \(\alpha \in (0, 1)\)

Scaling test: Training
\(512\sim 1\times\)
\(1024 \sim 1.04\times\)
\(2048 \sim 1.29\times\)
\(4096 \sim 1.73\times\)
\(8192 \sim 2.19\times\)

Scaling test: Inference
Figure legend: 512, 1024, 2048, 4096, 8192









