Neural Representations for Computational Physics:

from Reduced Order Models to Scalable Transformers



Vedant Puri
DEC 04, 2025

Committee:

Levent Burak Kara, Yongjie Jessica Zhang, Amir Barati Farimani, Krishna Garikipati

Computer simulations are critical for industrial applications

1

Modern engineering relies on computer simulations

Predictive maintenance

Design space exploration

[2]

[1]

[3]

[1] CFD Direct / OpenFOAM – “OpenFOAM HPC on AWS with EFA”, cfd.direct  
[2] EurekAlert — “New concrete system may reduce wind-turbine costs”  
[3] Flow-3D, “FLOW-3D AM” product page, flow3d.com 

Process optimization

Automotive engineering

Civil engineering

Advanced manufacturing

[1]

2

Physics-based simulations bottleneck many engineering workflows

[1] COMSOL — “Mesh Refinement” 

[2] Langtangen, H. P. — INF5620: Finite Element Methods (Lecture Notes)

[3] GridPro Blog — “The Art and Science of Meshing Airfoil”  

[4] ResearchGate — “Transition to turbulence of Taylor-Green Vortex at different time (DNS)” (figure)  
[5] ORNL / U.S. Department of Energy — “DOE and Cray deliver record-setting Frontier supercomputer at ORNL”  

\partial_t \boldsymbol{u} + (\boldsymbol{u} \cdot \boldsymbol{\nabla})\boldsymbol{u} = -\boldsymbol{\nabla} p + \frac{1}{\mathrm{Re}}\Delta \boldsymbol{u} + \boldsymbol{f}\\ \boldsymbol{\nabla}\cdot\boldsymbol{u} = 0

Governing Equations

\boldsymbol{\nabla}\cdot \boldsymbol{\sigma} + \boldsymbol{F} = \rho\boldsymbol{\ddot{{u}}}

[1]

[2]

Discretization machinery

Repeated large system solves

[5]

Multiscale physics \(\implies\) small \(\Delta t\)

[4]

Complex geometry \(\implies\) fine meshes

[3]


The cost of this procedure scales poorly for several reasons.

Neural signal representations learn to emulate physics from data

3

{\mathbf{u}}(\mathbf{x}) = \sum_{i=1}^{N} \mathbf{u}_i \phi_i(\mathbf{x})

(Explicit) Weighted sum of polynomial interpolants

Finite Elements

[1]

{\mathbf{u}}(\mathbf{x}) = (Z_L \circ \dotsc \circ Z_0)(\mathbf{x})

(Implicit) High-dim nonlinear feature learners

Multilayer Perceptron (MLP)

[2]

Cannot learn from data

Can learn from data

Large cost per simulation

Cheap evaluation after training

High accuracy

Problem-specific

Robust

Relative errors down to \(\sim 0.1\%\)

[1] Math StackExchange — “Interpolation in Finite Element Method”  
[2] ResearchGate — “Structure of a Deep Neural Network” (figure)  

\(\text{Mesh ansatz}\)

\({u}(x)=\)

\(u(x) = \)

\(\text{Neural ansatz}\)

\(\text{Physics-based}\)

\(\text{Data-driven}\)

\(\text{Numerical}\)

\(\text{Simulation}\)

\(\text{Reduced Order}\)

\(\text{Modeling}\)

\(\text{Neural ROMs}\)

\(\text{Surrogate}\)

\(\text{Learning}\)

\(\text{Transformers}\)

\(\text{PINNs}\)

\(\text{Finite Elements}\)

\(\text{PCA/POD}\)

\(\text{Graph Networks}\)

Landscape of data-driven methods in computational physics

 Fast and accurate latent space traversal in neural ROMs

4

 Scalable transformer models for large-scale surrogate modeling

[1]

[3]

[2]

[1] CFD Direct / OpenFOAM — “Introduction to Computational Fluid Dynamics”  
[2] ResearchGate — “Schematic of a Vanilla Physics-Informed Neural Network” (figure)

[3] Kutz, J. N. — “Data-Driven Modeling & Dynamical Systems” (UW)  

Outline

Project 2: PDE surrogate models

Project 1: Neural Reduced Order Modeling

Proposed work: transient PDE surrogates

5

Accelerate PDE solves with structure learned from data.

Replace simulation with solution operator learned from data.

Extend surrogate methodology to transient PDE problems.

Neural Reduced Order Modeling

 

Accelerate simulations with structure learned from data.

\mathbb{R}^{N_\text{FOM}}
\bar{u}(0)
\tilde{u}(0)
\tilde{u}(T)
\mathcal{M}
\bar{u}(T)
h_\text{ROM}
g_\text{ROM}
\begin{pmatrix} \hspace{0.4em} \\ \\ \\ \end{pmatrix}
\begin{pmatrix} \hspace{0.4em} \\ \\ \\ \end{pmatrix}

Primer on model order reduction

6

Learn low-dimensional solution representation with data + evolve with physics.

High-dimensional simulation data

\frac{\partial \boldsymbol{u}}{\partial t} = \mathcal{L}(\boldsymbol{x}, t, \boldsymbol{u}; \boldsymbol{\mu})

Collect and compress data

Evolve ODE on low-dim manifold

\bar{u}(0)
\tilde{u}(0)
\tilde{u}(T)
\mathcal{M}
\bar{u}(T)
\begin{pmatrix} \hspace{0.4em} \\ \\ \\ \end{pmatrix}
\begin{pmatrix} \hspace{0.4em} \\ \\ \\ \end{pmatrix}
\vec{u}_1
\tilde{u}_1
\bar{u}_1
\bar{u}_T
\cdots
\begin{bmatrix} \hspace{6.5em} \\ \\ \\ \\ \\ \end{bmatrix}
\cdots
\tilde{u}_T

Cheap online solve can be deployed for time-critical applications.

Cost savings from solving smaller ODE system.

\frac{\partial \textcolor{red}{\bar{u}} }{\partial t} = \bar{\mathcal{L}}(t, \textcolor{red}{\bar{u}}; \boldsymbol{\mu})
\frac{\partial \textcolor{blue}{\tilde{u}} }{\partial t} = \tilde{\mathcal{L}}(t, \textcolor{blue}{\tilde{u}}; \boldsymbol{\mu})
\text{dim}(\textcolor{red}{\bar{u}}) \gg \text{dim}(\textcolor{blue}{\tilde{u}})
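The compress step can be made concrete with a short sketch. The snippet below is illustrative only (random data stands in for simulation snapshots, and all sizes are hypothetical): it builds a rank-\(r\) POD basis via the SVD and encodes/decodes one full-order state.

```python
import numpy as np

# Hypothetical snapshot matrix: N_fom spatial DoFs x N_t saved time steps.
N_fom, N_t, r = 4096, 200, 8
U_snap = np.random.randn(N_fom, N_t)              # stand-in for simulation snapshots

u0 = U_snap.mean(axis=1, keepdims=True)           # reference (mean) state
P, S, _ = np.linalg.svd(U_snap - u0, full_matrices=False)
P = P[:, :r]                                      # POD basis, shape (N_fom, r)

u_bar = U_snap[:, [10]]                           # one full-order state, shape (N_fom, 1)
u_tilde = P.T @ (u_bar - u0)                      # compress: r reduced coefficients
u_rec = u0 + P @ u_tilde                          # decompress: linear reconstruction

rel_err = np.linalg.norm(u_bar - u_rec) / np.linalg.norm(u_bar)
print(f"rank-{r} reconstruction error: {rel_err:.3e}")
```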

Background on manifold learning

\frac{\partial \boldsymbol{u}}{\partial t} = \mathcal{L}(\boldsymbol{x}, t, \boldsymbol{u}; \boldsymbol{\mu})

7

Full order model (FOM)

\boldsymbol{u}(\boldsymbol{x}, t; \boldsymbol{\mu}) \approx g_\text{FOM}(\boldsymbol{x}, \textcolor{red}{\bar{u}(t; \boldsymbol{\mu})}) = \mathbf{\Phi}(\boldsymbol{x}) \cdot \textcolor{red}{\bar{u}(t; \boldsymbol{\mu})}

Linear POD-ROM

Nonlinear ROM

\textcolor{red}{\bar{u}(t; \boldsymbol{\mu})} \approx g'_\text{ROM}(\textcolor{orange}{\tilde{u}(t; \boldsymbol{\mu})}) = \bar{u}_0 + \mathbf{P} \cdot \textcolor{orange}{\tilde{u}(t; \boldsymbol{\mu})}
\boldsymbol{u}(\boldsymbol{x}, t; \boldsymbol{\mu}) \approx g_\text{ROM}(\boldsymbol{x}, \textcolor{blue}{\tilde{u}(t; \boldsymbol{\mu})}) = \mathrm{NN}_\theta\left(\boldsymbol{x}, \textcolor{blue}{\tilde{u}(t; \boldsymbol{\mu})} \right)

Learn low-order spatial representations

\textcolor{blue}{N_\text{Nl-ROM}} \leq \textcolor{orange}{N_\text{Lin-ROM}} \ll \textcolor{red}{N_\text{FOM}}
\begin{pmatrix} \hspace{0.8em} \\ \\ \\ \\ \end{pmatrix}
\begin{pmatrix} \hspace{0.8em} \\ \\ \\ \\ \end{pmatrix}
\mathbb{R}^{N_\text{FOM}}
\bar{u}(0)
\bar{u}(T)
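A minimal PyTorch sketch (dimensions and widths are hypothetical) contrasting the two ansätze on this slide: the POD decoder multiplies a fixed basis at the mesh nodes by the reduced state, while the nonlinear decoder is a coordinate network \( \mathrm{NN}_\theta(\boldsymbol{x}, \tilde{u}) \) that can be queried at arbitrary points.

```python
import torch
import torch.nn as nn

d_x, r_lin, r_nl = 2, 16, 2   # space dim; linear and nonlinear reduced sizes (hypothetical)

class LinearDecoder(nn.Module):
    """POD ansatz: u ~ u0 + Phi @ u_tilde, with basis Phi fixed at the mesh nodes."""
    def __init__(self, n_nodes, r):
        super().__init__()
        self.register_buffer("u0", torch.zeros(n_nodes))
        self.register_buffer("Phi", torch.randn(n_nodes, r))
    def forward(self, u_tilde):                        # (r,) -> (n_nodes,)
        return self.u0 + self.Phi @ u_tilde

class NeuralDecoder(nn.Module):
    """Nonlinear ansatz: u(x) ~ NN_theta(x, u_tilde), evaluable at arbitrary coordinates."""
    def __init__(self, d_x, r, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_x + r, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),
        )
    def forward(self, x, u_tilde):                     # x: (N, d_x), u_tilde: (r,)
        z = torch.cat([x, u_tilde.expand(x.shape[0], -1)], dim=-1)
        return self.net(z).squeeze(-1)

lin, nl = LinearDecoder(1024, r_lin), NeuralDecoder(d_x, r_nl)
x = torch.rand(1024, d_x)
print(lin(torch.zeros(r_lin)).shape, nl(x, torch.zeros(r_nl)).shape)  # both (1024,)
```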

Convolutional autoencoder ROMs [1] cause deviations in the physics solve

8

\text{FOM (ground truth)}
\text{CAE-ROM} \,[1]
\tilde{u}(t)
\begin{pmatrix} \hspace{0.9em} \\ \\ \\ \\ \end{pmatrix}
\begin{pmatrix} \textcolor{blue}{*}\\ \textcolor{blue}{*}\\ \end{pmatrix}
\begin{pmatrix} \hspace{0.9em} \\ \\ \\ \\ \end{pmatrix}
\bar{u}(t)
\bar{u}(t)

\(\text{Encoder}\)

\(\text{Decoder}\)

\text{Projection}
\text{Inference}

Intrinsic perspective

[1] Lee & Carlberg — Nonlinear manifold ROM via CNN autoencoders (JCP 2020)

Extrinsic perspective

\frac{\partial \textcolor{black}{\tilde{u}} }{\partial t} = \tilde{\mathcal{L}}(t, \textcolor{black}{\tilde{u}}; \boldsymbol{\mu})
\text{Projection}
\text{Encoder projection}
\text{Physics solve}
\text{distribution of }\tilde{u}

Compression/decompression workflow offers no control over latent trajectories.

2D Burgers \(\mathit{Re}=1\mathit{k}\)

Smooth Neural Field ROM directly controls latent trajectories

9

\varrho, \, \theta = \argmin_{\varrho, \, \theta}\left\{ \sum_{\boldsymbol{x}, \, t, \, \boldsymbol{\mu}} || \boldsymbol{u}(\boldsymbol{x}, t; \boldsymbol{\mu}) - g_\theta(\boldsymbol{x}, \Xi_\varrho(t; \boldsymbol{\mu})) ||_2^2 \right\}

Supervised learning problem jointly learns latent trajectories and data manifold.

\(\text{Loss } (L)\)

\(\text{Backpropagation}\)

\(\nabla_\theta L\)

\(\nabla_\varrho L\)

\(\nabla_\theta L\)

\(\text{PDE Problem}\)

\((\boldsymbol{x}, t, \boldsymbol{\mu})\)

\(\text{ Parameters}\)

\( \text{and time}\)

\(\text{ Intrinsic ROM manifold}\)

\tilde{\mathcal{U}} = \left\{ \tilde{u}(t; \mathbf{\mu}) |~ t,\, \mathbf{\mu} \right\}
\tilde{u}(t; \mathbf{\mu})

\(\text{Coordinates}\)

\(\text{Smooth neural field MLP }(g_\theta)\)

\(\tilde{u}\)

\(\boldsymbol{x}\)

\(\boldsymbol{u}\left( \boldsymbol{x}, t; \boldsymbol{\mu} \right)\)

Force the map \( t \mapsto \tilde{u}(t) \) to be simple, e.g., a shallow MLP.

Coordinate MLPs with sinusoidal activations offer grid independence.

Replace autoencoder with a direct prediction workflow.
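The joint optimization over \((\varrho, \theta)\) reduces to a standard supervised loop. The skeleton below is a hedged sketch under simplifying assumptions: synthetic values stand in for snapshot data, all sizes are hypothetical, and the actual SNF-ROM uses sinusoidal coordinate MLPs plus the smoothness regularizers discussed on the next slide.

```python
import torch
import torch.nn as nn

r, d_x = 2, 1
latent_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, r))        # Xi_rho(t, mu)
field_net  = nn.Sequential(nn.Linear(d_x + r, 64), nn.Tanh(),
                           nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))       # g_theta(x, u_tilde)
opt = torch.optim.Adam([*latent_net.parameters(), *field_net.parameters()], lr=1e-3)

def loss_fn(x, t, mu, u):                              # all tensors of shape (B, .)
    u_tilde = latent_net(torch.cat([t, mu], dim=-1))   # latent trajectory sample, (B, r)
    u_pred  = field_net(torch.cat([x, u_tilde], dim=-1))
    return ((u_pred - u) ** 2).mean()

B = 256
x, t, mu = torch.rand(B, d_x), torch.rand(B, 1), torch.rand(B, 1)
u = torch.sin(torch.pi * x) * torch.exp(-t)            # synthetic stand-in for u(x, t; mu)
opt.zero_grad(); loss_fn(x, t, mu, u).backward(); opt.step()
```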

Accurate derivatives require smooth neural field representations

10

\mathrm{NN}(x) \approx u(x)
\textcolor{red}\times
\dfrac{\mathrm{d}^k}{\mathrm{d}x^k} \mathrm{NN}(x) \approx \dfrac{\mathrm{d}^k}{\mathrm{d}x^k} u(x)

SNF-ROM with Lipschitz regularization (SNFL-ROM)

\(\text{Penalize the \textcolor{blue}{Lipschitz constant} of the MLP [arXiv:2202.08345]}\)

\varrho, \, \theta = \argmin_{\varrho, \, \theta}\left\{ L_\text{data}(\varrho, \theta) + \textcolor{blue}{\alpha \bar{c}(\theta)} \right\}
\text{For MLP: } c_\theta \leq \textcolor{blue}{\bar{c}(\theta)} = \prod_{l=1}^L ||W_l||_p
||f(x_2) - f(x_1)||_p \leq \textcolor{blue}{c}||x_2 - x_1||_p
\text{change in output}
\text{change in input}
\text{For a single layer: } \textcolor{blue}{c_l} = ||W_l||_p

\(\text{[enwiki:1230354413]}\)

SNF-ROM with Weight regularization (SNFW-ROM)

\(\text{Directly penalize \textcolor{red}{high-frequency components} in }\dfrac{\text{d}}{\text{d} x}\text{NN}_\theta(x)\)

\frac{\text{d}}{\text{d} x} \mathrm{NN}_\theta(x) = \left( \prod_{l=2}^L W_l \cdot \text{diag}(\textcolor{red}{\sigma'(z_{l-1})}) \right) \cdot W_1
\textcolor{red}{ \cos\left( W_l z_{l-1} + b_l \right) }
\varrho, \, \theta = \argmin_{\varrho, \, \theta}\left\{ L_\text{data}(\varrho, \theta) + \textcolor{red}{ \frac{\gamma}{2} \sum_{l=1}^L \sum_{i,j} ||W_l^{ij}||_2^2 } \right\}

We present two approaches to learn inherently smooth and accurately differentiable neural field MLPs.

\({x}\)

\({u(x)}\)

\mathrm{NN}(x)
u(x)
\text{NN}
\frac{\mathrm{d}}{\mathrm{d}x} \mathrm{NN}(x)
\frac{\mathrm{d}^2}{\mathrm{d}x^2} \mathrm{NN}(x)
\textbf{SNFL (ours)}
\textbf{SNFW (ours)}

High freq. noise
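Both regularizers fit in a few lines. The sketch below is illustrative (Tanh stands in for the sinusoidal activations, the data loss is a placeholder, and the penalty weights are arbitrary): SNFL adds the layer-wise product of weight-matrix norms as an upper bound on the Lipschitz constant, while SNFW penalizes the squared entries of every weight matrix.

```python
import torch
import torch.nn as nn

def lipschitz_bound(mlp, p=2):
    """SNFL-style bound: c_bar(theta) = prod_l ||W_l||_p over the MLP's linear layers."""
    c = torch.ones(())
    for m in mlp.modules():
        if isinstance(m, nn.Linear):
            c = c * torch.linalg.matrix_norm(m.weight, ord=p)
    return c

def weight_penalty(mlp):
    """SNFW-style penalty: sum of squared entries of every weight matrix."""
    return sum((m.weight ** 2).sum() for m in mlp.modules() if isinstance(m, nn.Linear))

mlp = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
alpha, gamma = 1e-3, 1e-4
data_loss = torch.zeros(())                            # placeholder for L_data(rho, theta)
loss_snfl = data_loss + alpha * lipschitz_bound(mlp)
loss_snfw = data_loss + 0.5 * gamma * weight_penalty(mlp)
print(float(loss_snfl), float(loss_snfw))
```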

11

Experiment: 1D Viscous Burgers problem \( (\mathit{Re} = 10~{k})\) 

\frac{\partial {u}}{\partial t} + {u} \frac{\partial {u}}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}, \qquad u(t=0) = u_0(\mu)

\(\text{CAE-ROM}\) [1]

\(\text{SNFL-ROM (ours)}\)

\(\text{SNFW-ROM (ours)}\)

Online dynamics solve matches learned trajectories

Online evaluation deviates!

Distribution of reduced states \((\tilde{u})\)

[1] Lee & Carlberg — Nonlinear manifold ROM via CNN autoencoders (JCP 2020)

Experiment: 1D Kuramoto-Sivashinsky problem

12

\frac{\partial {u}}{\partial t} + u\frac{\partial {u}}{\partial x} + \frac{\partial^2 {u}}{\partial x^2} + \nu\frac{\partial^4 {u}}{\partial x^4} = 0

SNF-ROM maintains high accuracy even with larger time-steps.

\(\text{Relative error vs time } (\Delta t = \Delta t_0)\)

\(\text{Relative error vs time } (\Delta t = 10\Delta t_0)\)

[1] Lee & Carlberg — Nonlinear manifold ROM via CNN autoencoders (JCP 2020)

Experiment: 2D Viscous Burgers problem \( (\mathit{Re} = 1~{k})\) 

\frac{\partial \boldsymbol{u}}{\partial t} + \boldsymbol{u} \cdot \boldsymbol{\nabla}\boldsymbol{u} = \nu \Delta \boldsymbol{u}

13

\(\text{CAE-ROM}\) [1]

\(\text{SNFL-ROM (ours)}\)

\(\text{SNFW-ROM (ours)}\)

Relative error

[1] Lee & Carlberg — Nonlinear manifold ROM via CNN autoencoders (JCP 2020)

\([1]\)

\(0.4\%\) relative error

\(\text{DoFs: }524~k \to 2\)

\(\text{Time }(t)\)

\(\text{Relative Error}(t)\)

\(199\times\) speed-up

\text{FOM}
\text{POD-ROM}
\text{SNFL-ROM (ours)}
\text{CAE-ROM}
\text{SNFW-ROM (ours)}

Takeaways from SNF-ROM

14

 Accurate derivative evaluation for neural representations.

 Fast and accurate latent space traversal in neural ROMs

Contributions

Won poster award at World Conf. Comp. Mech. 2024

Published in Journal of Comp. Phys.

Data-Driven Modeling

 

Scalable neural surrogates for PDEs and beyond!

Surrogate models learn PDE solution operator from data

15

\mathcal{L}(\boldsymbol{x}, t, \boldsymbol{u}; \boldsymbol{\mu}) = 0
\mathcal{G}: \boldsymbol{\mu} \mapsto \boldsymbol{u}

Training

\mathcal{G}_\theta \approx \mathcal{G}

Inference

\mathcal{G}_\theta

Large training cost is amortized over several evaluations

Model learns to predict \(\boldsymbol{u}\) over a distribution of \(\boldsymbol{\mu}\)

Transformers [1] are state-of-the-art surrogate models

16

Message-passing on a dynamic all-to-all graph.

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017  

Quadratic (\(\mathcal{O}(N^2)\)) cost limits scalability

Quadratic \((\mathcal{O}(N^2))\) cost in transformers limits scalability

17

Over \(20~\text{s}\) per gradient step on a mesh of 1M points!

Goal: enable transformer models on large meshes.

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017  

\([1]\)

What are the limitations on communication patterns?

18

\Delta u = f
\mathbf{A} \cdot \underline{u} = \underline{f} \qquad \text{(sparse forward operator)}
\underline{u} = \mathbf{A}^{-1} \cdot \underline{f} \qquad \text{(dense solution operator)}

Solution operator requires global communication.

Forward operator is implemented with sparse, structured communication.

Need principled strategy for reducing communication cost.

Detour: finite elements

[1] ParticleInCell.com — “Finite Element Experiments in MATLAB” (2012)  

[1]
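The communication-pattern point can be checked numerically: for a 1-D Poisson problem, the discrete forward operator is tridiagonal, while its inverse, the solution operator, is dense. A small NumPy sketch with arbitrary sizes:

```python
import numpy as np

# 1-D Poisson test: the forward operator A is tridiagonal (local, sparse communication),
# while its inverse -- the solution operator -- is dense (global communication).
n, h = 64, 1.0 / 65
A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
A_inv = np.linalg.inv(A)

print("nonzeros in A    :", np.count_nonzero(A))                      # about 3n
print("nonzeros in A^-1 :", np.count_nonzero(np.abs(A_inv) > 1e-12))  # about n^2
```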

Are \(N \times N\) messages really necessary?

19

Smoothness implies redundancy in communication.

Method: group matching points into one cluster and communicate their messages together.

FLARE: Fast Low-rank Attention Routing Engine

20

Encoding: introduce \(M\) latent clusters to pool messages from matching tokens

\(M\) learned queries

Decoding: map pooled messages to matching output tokens

FLARE: Fast Low-rank Attention Routing Engine

21

\(\mathcal{O}(2MN) \ll \mathcal{O}(N^2)\)

\(\text{rank}(W_\text{encode}\cdot W_\text{decode}) \leq M\)

\(>200\times\) speedup

\(\text{(} M \text{ tokens)}\)

\(\text{Latent}\)

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017  

\([1]\)
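A hedged sketch of the routing pattern described on this slide, not the actual FLARE implementation: \(M\) learned latent queries pool messages from all \(N\) tokens (encode), and each token then reads back from the \(M\) pooled messages (decode), so both attention calls cost \(\mathcal{O}(MN)\). All layer names and sizes below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankRouting(nn.Module):
    """Sketch of latent-bottleneck attention: N tokens -> M latents -> N tokens, O(2MN)."""
    def __init__(self, dim, num_latents=64, head_dim=64):
        super().__init__()
        self.latent_q = nn.Parameter(torch.randn(num_latents, head_dim) * 0.02)  # M learned queries
        self.to_k = nn.Linear(dim, head_dim)
        self.to_v = nn.Linear(dim, head_dim)
        self.to_q = nn.Linear(dim, head_dim)   # per-token queries for the decode step
        self.out  = nn.Linear(head_dim, dim)

    def forward(self, x):                                              # x: (B, N, dim)
        k, v, q = self.to_k(x), self.to_v(x), self.to_q(x)
        lat = self.latent_q.expand(x.shape[0], -1, -1)                 # (B, M, head_dim)
        enc = F.scaled_dot_product_attention(lat, k, v)                # encode: pool N -> M
        dec = F.scaled_dot_product_attention(q, lat, enc)              # decode: broadcast M -> N
        return self.out(dec)                                           # (B, N, dim)

x = torch.randn(2, 4096, 128)
print(LowRankRouting(128)(x).shape)   # torch.Size([2, 4096, 128])
```

Because every token-to-token interaction passes through the \(M\) latents, the induced mixing operator has rank at most \(M\), matching the bound above.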

PDE surrogate benchmark problems

Relative \(L_2\) error \( (\times 10^{-3})\) (lower is better)

22

Pipe

Darcy

Elasticity

LPBF

DrivAerML

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017 

[2] Jaegle et al. — "Perceiver IO: A General Architecture for Structured Inputs & Outputs", ICLR 2022

[3] Hao et al. — "GNOT: A General Neural Operator Transformer for Operator Learning", ICML 2023

[4] Wang et al. — "Latent Neural Operator", NeurIPS 2024

[5] Wu et al. — "Transolver: A Fast Transformer Solver for PDEs on General Geometries", ICML 2024

Elasticity benchmark problem

23

Pipe flow, Darcy flow benchmark problems

24

Laser powder bed fusion benchmark problem

25

FLARE learns a surrogate on a million-point mesh!

Largest experiment on a single GPU!

26

[1]

[1] Ashton et al. — “DrivAerML: High-Fidelity CFD Dataset for Road-Car Aerodynamics” (arXiv:2408.11969, 2024)  

FLARE generalizes beyond PDE tasks

27

Pathfinder

\texttt{INPUT:\, [MAX 4 3 [MIN 2 3 ] 1 0 [MEDIAN 1 5 8 9, 2]] \,OUTPUT: 5}

Listops

\texttt{INPUT:\, [MAX 7 [MEDIAN 1 2 3 ] [MAX 9 2 2] [MIN 2 8]] \,OUTPUT: 9}

Image classification

Text sentiment analysis

[7]

[8]

[1]

[5] Choromanski et al. — "Rethinking Attention with Performers", ICLR 2021

[6] Tay, Y. et al. — “Long Range Arena: A Benchmark for Efficient Transformers” (arXiv 2020)  

[7] Centric Consulting — “Sentiment Analysis: Way Beyond Polarity” (blog)  
[8] Krizhevsky — CIFAR dataset homepage  

[6]

Accuracy \((\%)\) (higher is better)

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017 

[2] Katharopoulos et al. — "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", ICML 2020

[3] Wang et al. — "Linformer: Self-attention with linear complexity", arXiv:2006.04768 2020

[4] Qin et al. — "The devil in linear transformer", arXiv:2210.10340 2022

Proposed Work

 

Extend FLARE to transient problems

Motivation: preempt build failures in metal additive manufacturing

Laser Powder Bed Fusion (LPBF)

Dataset of 20k LPBF calculations

Goal: develop fast surrogate model to predict warpage during build

\rho C_p \frac{\partial T}{\partial t} = \nabla \cdot \left( k \nabla T(\mathbf{x}, t) \right) + Q(\mathbf{x}, t)
\nabla \cdot \boldsymbol{\sigma} = 0, \,\, \boldsymbol{\sigma} = C\,\boldsymbol{\varepsilon}_e

Governing equations

End results could be deployed as a valuable design tool for metal AM.

28

[1]

[2]

[1] Nature Scientific Data — High-resolution dataset (2025)  
[2] TechXplore — “Synergetic optimization reduces residual warpage in LPBF” (2022)  

Proposed aims

Aim 1: Advance FLARE for enhanced surrogate modeling

29

Aim 2: Develop decoder version of FLARE

AIM 1(a): rank-adaptive FLARE

AIM 1(b): conditioning mechanism for FLARE

AIM 1: advance FLARE for enhanced surrogate learning

AIM 1(a): rank-adaptive FLARE for faster training

30

Complexity scales with latents (\(M\)): \(\mathcal{O}(2MN)\)

Accuracy increases with \(M\)

 Method: progressively increase latents (\(M\)) through training.

Challenge: minimize loss spikes and training instabilities.
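Aim 1(a) is proposed work; as a purely illustrative sketch, one way to grow the latent bank during training is to append near-zero-initialized query rows (the growth schedule, optimizer re-registration, and stabilization against loss spikes are precisely the open questions of this aim).

```python
import torch
import torch.nn as nn

class GrowableQueries(nn.Module):
    """Illustrative latent-query bank whose size M can increase mid-training."""
    def __init__(self, m_init, head_dim):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(m_init, head_dim) * 0.02)

    @torch.no_grad()
    def grow(self, m_new):
        """Append new latent queries, initialized near zero to limit loss spikes."""
        extra = torch.randn(m_new, self.queries.shape[1], device=self.queries.device) * 1e-4
        self.queries = nn.Parameter(torch.cat([self.queries.data, extra], dim=0))

bank = GrowableQueries(m_init=16, head_dim=64)
bank.grow(16)                      # M: 16 -> 32 partway through training
print(bank.queries.shape)          # torch.Size([32, 64])
```

Note that swapping the parameter also requires re-registering it with the optimizer, which is one likely source of the instabilities mentioned above.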

Aim 1(b) background on conditioning in transformers

\left( t,\, \mathbf{x},\, \mathbf{u}_{t-k},\, \cdots, \mathbf{u}_{t} \right) \mapsto \mathbf{u}_{t+1}

Token mixing [1] (\(\mathcal{O}(N^2)\))

Conditioning [1] (\(\mathcal{O}(N\cdot C)\))

\begin{bmatrix} {t} \\ \mathbf{x}\\ \mathbf{u}_{t-k:t} \end{bmatrix}

Token mixing

\times B
\begin{bmatrix} \mathbf{u}_{t+1} \end{bmatrix}
\begin{bmatrix} \mathbf{x}\\ \end{bmatrix}

Token mixing

\begin{bmatrix} \mathbf{u}_{t+1} \end{bmatrix}

Conditioning

\begin{bmatrix} t & \mathbf{u}_{t-k:t} \end{bmatrix}
\times B

31

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017  

Key idea: Modulate token-mixing with conditioning tokens

\begin{bmatrix} \mathbf{x}\\ \end{bmatrix}

Cross FLARE

\times B
\begin{bmatrix} \mathbf{u}_{t+1} \end{bmatrix}
\begin{bmatrix} t & \mathbf{u}_{t-k:t} \end{bmatrix}

32

Aim 1(b) conditioning mechanism for FLARE

We propose to handle token mixing and conditioning in one unified block

\(\mathcal{O}(2MN + MC) \) complexity
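Illustrative only, since this block is proposed work: one way to realize the unified block is to let the \(M\) latent queries pool from the concatenation of the \(N\) spatial tokens and the \(C\) conditioning tokens before broadcasting back to the spatial tokens, giving the \(\mathcal{O}(2MN + MC)\) cost quoted above. Projections and normalization are omitted; all shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

B, N, C, M, D = 2, 4096, 8, 64, 64
x_tok = torch.randn(B, N, D)      # spatial tokens (assumed already projected)
c_tok = torch.randn(B, C, D)      # conditioning tokens: time, previous states
lat_q = torch.randn(B, M, D)      # M learned latent queries

ctx = torch.cat([x_tok, c_tok], dim=1)                          # (B, N + C, D)
pooled = F.scaled_dot_product_attention(lat_q, ctx, ctx)        # encode: O(M(N + C))
out = F.scaled_dot_product_attention(x_tok, lat_q, pooled)      # decode: O(MN)
print(out.shape)   # torch.Size([2, 4096, 64])
```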

Aim 2: background on next-token prediction transformers [1]

y_t = \frac{\sum_{\tau = 1}^t\exp\left(q_t \cdot k_\tau \right) v_\tau}{\sum_{\tau = 1}^t \exp \left(q_t \cdot k_\tau \right)}

All previous key/value \(\{k_\tau, v_\tau \}_{\tau \leq t}\) must be cached on the GPU.

Major memory and latency bottleneck!

33

[1] Vaswani et al. — “Attention Is All You Need”, NeurIPS 2017  

Training algorithm (causal masking)

Inference algorithm (recurrence relation)

Dot-products need to be recomputed for every \(q_t\).

\(\mathcal{O}(N^2)\) complexity.
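The bottleneck is visible in a single-head sketch of standard autoregressive inference (random stand-ins replace the learned projections): the key/value cache grows with \(t\), and every step recomputes \(t\) dot products.

```python
import torch
import torch.nn.functional as F

D, steps = 64, 16
k_cache, v_cache = [], []
for t in range(steps):
    q_t, k_t, v_t = (torch.randn(1, D) for _ in range(3))   # stand-ins for projected token t
    k_cache.append(k_t); v_cache.append(v_t)
    K, V = torch.cat(k_cache), torch.cat(v_cache)            # (t+1, D): memory grows with t
    attn = F.softmax(q_t @ K.T / D**0.5, dim=-1)              # t+1 dot products recomputed
    y_t = attn @ V                                            # (1, D) output for step t
print(K.shape)   # torch.Size([16, 64]) -- cache length equals sequence length
```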

Aim 2: Develop decoder version of FLARE

Linear time auto-regressive attention.

Fixed memory footprint (only store \(\mathcal{O}(M)\) cache).

Flexible latent capacity.

Advantages

Required components

Fused GPU kernels for training and inference.

Bespoke training algorithm for causal FLARE.

Extensive benchmarking and evaluation.

34

Z_t = \text{online\_softmax}(Z_{t-1}, k_t, v_t)\\ y_t = \text{softmax}(Q^T \cdot k_t)^T\cdot Z_t

Inference algorithm (recurrence rule)

Next-token prediction with FLARE

Proposed timeline

Expected graduation: Summer 2026

35

Summary of this dissertation

PDE Surrogates

Fast and accurate latent space traversal in neural ROMs

Scalable and accurate self attention mechanism

Neural ROMs

(Planned) Transient PDE surrogates

Flexible and scalable cross-attention mechanism

Efficient and flexible decoder model.

36

Publications

  • Shankar, Varun, Vedant Puri, Ramesh Balakrishnan, Romit Maulik, and Venkatasubramanian Viswanathan. "Differentiable physics-enabled closure modeling for Burgers’ turbulence." Machine Learning: Science and Technology 4, no. 1 (2023): 015017.
     
  • Puri, Vedant, Aviral Prakash, Levent Burak Kara, and Yongjie Jessica Zhang. "SNF-ROM: Projection-based nonlinear reduced order modeling with smooth neural fields." Journal of Computational Physics 532 (2025): 113957.
     

  • Puri, Vedant, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, and Levent Burak Kara. "FLARE: Fast Low-rank Attention Routing Engine." arXiv preprint arXiv:2508.12594 (2025).
     

  • (In preparation)

37

Thank you

 

Questions?

Machine learning dominates several fields of scientific discovery

Enhancing PDE solvers with ML

Landscape of ML for PDEs

Mesh ansatz

PDE-Based

Neural Ansatz

Data-driven

FEM, FVM, IGA, Spectral

Fourier Neural Operator

Neural Field

DeepONet

Physics Informed NNs

Convolution NNs

Graph NNs

Adapted from Núñez, CEMRACS 2023

Neural ODEs

Universal Diff Eq

u =
\dfrac{du}{dt} =
\dfrac{d\tilde{u}}{dt} = \tilde{\mathcal{L}}_p(\tilde{u}) +
\dfrac{du}{dt} = \mathcal{L}_p(u) + \mathcal{N}_p(u)
\begin{cases} \dfrac{d u}{dt} = \mathcal{L}_p(u) + \mathcal{N}_p(u), & x\in\Omega\\ u|_{\partial\Omega} = g(t) \end{cases}

Reduced Order Modeling

Enhancing PDE solvers with ML

Newsflash: neural signal representations (almost) beat the curse of dimensionality

Orthogonal functions vs. deep neural networks






 

 

 
f = \tilde{f} + \mathcal{O}(h)

\( N \) parameters, \(M\) points

\( h \sim 1 / N \) (for shallow networks)

\( N \) points

\( \dfrac{d}{dx} \tilde{f}\sim \mathcal{O}(N^2) \) (exact)

\( \dfrac{d}{dx} \tilde{f} \sim \mathcal{O}(N) \) (exact, AD)

\( \int_\Omega \tilde{f} dx \sim \mathcal{O}(N) \) (exact)

(Weinan, 2020)

\( \int_\Omega \tilde{f} dx \sim \mathcal{O}(M) \) (approx)

Model size scales with signal complexity

Model size scales exponentially with dimension

\( N \sim h^{-d/c} \)

\tilde{u}(x) = \sum_{i=1}^N u_i \phi_i(x)
\tilde{u}(x) = (Z_L \circ \dotsc \circ Z_0)(x)

Efficient transformers models

Triple Attention or Multi-linear attention

FEATURES

  • Considers N-tuples of tokens at a time.
  • More expressive than standard attention, linear attention
  • Easily parallelizable across multiple GPUs
  • Kernel-based interpretation
  • As efficient and accurate as FLARE (SOTA)

DEMONSTRATIONS

  • Encoder transformer
    • PDE Surrogate modeling, Long-range arena
  • Decoder transformer
    • Next-token prediction/ language modeling

Triple Attention Scaling Study

Challenge: learn a PDE surrogate on 5–10M points on a multi-GPU cluster

Adaptive Layer Norm in Diffusion Transformer allows for token mixing + time-conditioning in one go

This is only possible with a single token as conditioning vector, and won't work when you want to condition on a sequence.

Linear transformers only store state \( S \in \mathbb{R}^{D \times D} \) but their performance is not on par with softmax attention

Linear transformers replace the softmax kernel with a feature map \(\phi(\cdot)\) such that

 



This factorization allows causal attention to be computed recurrently:

\mathrm{softmax}(QK^\top) V \approx \phi(Q)\,\big(\phi(K)^\top V\big)
S_t = S_{t-1} + \phi(k_t)^\top v_t, \qquad \mathbf{y}_t = \phi(q_t)\, S_t,
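A minimal sketch of this recurrence (feature map \(\phi = \mathrm{elu} + 1\) as in Katharopoulos et al.; the softmax-style normalizer is omitted for brevity, and all sizes are arbitrary): the state stays a fixed \(D \times D\) matrix regardless of sequence length.

```python
import torch

def phi(x):
    return torch.nn.functional.elu(x) + 1.0   # positive feature map

D, steps = 64, 16
S = torch.zeros(D, D)                          # fixed-size state, independent of sequence length
for t in range(steps):
    q_t, k_t, v_t = (torch.randn(D) for _ in range(3))
    S = S + torch.outer(phi(k_t), v_t)         # S_t = S_{t-1} + phi(k_t)^T v_t
    y_t = phi(q_t) @ S                         # y_t = phi(q_t) S_t
print(y_t.shape)   # torch.Size([64])
```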

Chunkwise training for linear transformers

https://manifestai.com/articles/linear-transformers-are-faster/

Premise: strong encoder model --> strong LLM

  • Next-token prediction model
    • FLARE, Triple Attention
    • Write CUDA kernels --> get scaling plots
    • Test on language tasks
  • Extend FLARE
    • Allow model to increase/ decrease \(M\) during training
    • Create efficient conditioning mechanism (time-series PDE problems)
  • FOCUS ON NEW CONTRIBUTIONS and how we can differentiate ourselves from SOTA
  • explain novelty compared to SOTA

Computer simulations are critical for industrial applications

Mesosphere

Wind farm

Turbine

Blade

1000\,\mathrm{km}
10\,\mathrm{km}
100\,\mathrm{m}
10\,\mathrm{m}

1

Modern engineering relies on computer simulations

Design space exploration

Predictive maintenance

[1]

[2]

FLARE decoder recurrence

Takeaways from FLARE

Under review at Int'l Conf. Learning Representations

Contributions

 Scalable and accurate self attention mechanism

Challenges in learning transient dynamics

Model must capture spatial structure and temporal evolution.

Increases training data by an order of magnitude.

Time-stepping logic may cause drift from ground truth.

\left( t,\, \mathbf{x},\, \mathbf{u}_{t-k},\, \cdots, \mathbf{u}_{t} \right) \mapsto \mathbf{u}_{t+1}

FLARE allows for tradeoff between accuracy and compute

Elasticity problem

Darcy problem

24

Low-rank structure allows for efficient eigenanalysis

Message-passing is fundamentally low-rank

25

SNF-ROM: Smooth Neural Field ROM

8

2D Viscous Burgers problem \( (\mathit{Re} = 1\text{k})\)

\(199\times\) speed-up

High freq. noise

Non-differentiable!

Accurate capture of dynamics with smooth neural fields

\textcolor{red}\times
\mathrm{NN}(x) \approx u(x) \implies \dfrac{\mathrm{d}^k}{\mathrm{d}x^k} \mathrm{NN}(x) \approx u^{(k)}(x)
\mathrm{NN}(x)
\frac{\mathrm{d}}{\mathrm{d}x} \mathrm{NN}(x)
\frac{\mathrm{d}^2}{\mathrm{d}x^2} \mathrm{NN}(x)

Large deviations!

Learning smooth latent space trajectories

\(\text{Autoencoder ROM}\)

\(\text{SNF-ROM (ours)}\)

\text{Projection}
\text{Online solve}
\text{Projection}
\text{Online solve}

Evolution of ROM states

No deviation

\text{FOM}
\text{POD-ROM}
\text{SNFL-ROM (ours)}
\text{CAE-ROM}
\text{SNFW-ROM (ours)}

Accurate capture of dynamics

\(\text{DoFs: }524~k \to 2\)

Primer on model order reduction

\frac{\partial \boldsymbol{u}}{\partial t} = \mathcal{L}(\boldsymbol{x}, t, \boldsymbol{u}; \boldsymbol{\mu})

2

Full order model (FOM)

\boldsymbol{u}(\boldsymbol{x}, t; \boldsymbol{\mu}) \approx g_\text{FOM}(\boldsymbol{x}, \textcolor{red}{\bar{u}(t; \boldsymbol{\mu})}) = \mathbf{\Phi} \cdot \textcolor{red}{\bar{u}(t; \boldsymbol{\mu})}

Linear POD-ROM

Nonlinear ROM

\textcolor{red}{\bar{u}(t; \boldsymbol{\mu})} \approx g'_\text{ROM}(\textcolor{orange}{\tilde{u}(t; \boldsymbol{\mu})}) = \bar{u}_0 + \mathbf{P} \cdot \textcolor{orange}{\tilde{u}(t; \boldsymbol{\mu})}
\boldsymbol{u}(\boldsymbol{x}, t; \boldsymbol{\mu}) \approx g_\text{ROM}(\boldsymbol{x}, \textcolor{blue}{\tilde{u}(t; \boldsymbol{\mu})}) = \mathrm{NN}_\theta\left(\boldsymbol{x}, \textcolor{blue}{\tilde{u}(t; \boldsymbol{\mu})} \right)

Learn low-order spatial representations

Time-evolution of reduced representation with Galerkin projection

\mathbb{R}^{N_\text{FOM}}
\bar{u}(0)
\tilde{u}(0)
\tilde{u}(T)
\mathcal{M}
\bar{u}(T)
h_\text{ROM}
g_\text{ROM}
\begin{pmatrix} \hspace{0.4em} \\ \\ \\ \end{pmatrix}
\begin{pmatrix} \hspace{0.4em} \\ \\ \\ \end{pmatrix}
\bar{u}(t=0)
\tilde{u}(t=0)
\tilde{u}(t=T)
\bar{u}(t=T)
\text{Manifold}\\ \text{projection}
\text{Model}\\ \text{inference}
\frac{\mathrm{d} \bar{u}}{\mathrm{d} t} = \mathcal{L}(\bar{u}, t)
\mathbf{J}_g\frac{\mathrm{d} \tilde{u}}{\mathrm{d} t} = \mathcal{L}(g_\text{ROM}(\tilde{u}), t)
\begin{pmatrix} \hspace{0.8em} \\ \\ \\ \\ \end{pmatrix}
\begin{pmatrix} \hspace{0.8em} \\ \\ \\ \\ \end{pmatrix}
\mathbb{R}^{N_\text{FOM}}
\bar{u}(0)
\bar{u}(T)
\textcolor{blue}{N_\text{Nl-ROM}} < \textcolor{orange}{N_\text{Lin-ROM}} \ll \textcolor{red}{N_\text{FOM}}
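A minimal sketch of the Galerkin-projected time step above (the decoder, sizes, and the FOM right-hand side are stand-ins): the decoder Jacobian is formed by automatic differentiation, and the reduced velocity is obtained from a least-squares solve.

```python
import torch

r, n = 2, 128
decoder = torch.nn.Sequential(torch.nn.Linear(r, 64), torch.nn.Tanh(), torch.nn.Linear(64, n))

def rom_rhs(u_tilde, t, fom_rhs):
    J = torch.autograd.functional.jacobian(decoder, u_tilde)               # J_g, shape (n, r)
    rhs = fom_rhs(decoder(u_tilde), t)                                     # L(g(u_tilde), t), (n,)
    return torch.linalg.lstsq(J, rhs.unsqueeze(-1)).solution.squeeze(-1)   # du_tilde/dt, (r,)

u_tilde = torch.zeros(r)
dudt = rom_rhs(u_tilde, 0.0, lambda u, t: -u)      # toy FOM right-hand side: du/dt = -u
u_tilde_next = u_tilde + 1e-2 * dudt               # one explicit Euler step in the reduced space
print(u_tilde_next.shape)   # torch.Size([2])
```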

Experiment: 2D Viscous Burgers problem \( (\mathit{Re} = 1~{k})\) 

\frac{\partial \boldsymbol{u}}{\partial t} + \boldsymbol{u} \cdot \boldsymbol{\nabla}\boldsymbol{u} = \nu \Delta \boldsymbol{u}

13

\(\text{CAE-ROM}\) [1]

\(\text{SNFL-ROM (ours)}\)

\(\text{SNFW-ROM (ours)}\)

\(\text{Relative error }\)

[1] Lee & Carlberg — Nonlinear manifold ROM via CNN autoencoders (JCP 2020)

\([1]\)
