Weekly Meeting

MAY 08, 2024

Vedant Puri

https://github.com/vpuri3

Mechanical Engineering, Carnegie Mellon University

Advisors: Prof. Burak Kara, Prof. Jessica Zhang

Attention Research Proposal

MAY 08, 2024

Vedant Puri

https://github.com/vpuri3

Mechanical Engineering, Carnegie Mellon University

Advisors: Prof. Burak Kara, Prof. Jessica Zhang

Motivation

ML surrogates for biomedical applications

  • Application 1
    •  
  • Application 2
    •  
  • Application 3
    •  

FLARE: Fast Low...

  • Highlights
    •  
  • What needs to be done with FLARE
    • Develop time-stepper, time-conditioning mechanism for FLARE
    •  

Timeline

  • PHASE 1: Understand and test/ benchmark on simple problems
    • Long Range Arena
    • Steady benchmarks
    • OCT - NOV
  • PHASE 2: Propose modifications for time-series calculations
    • Time conditioning
    • Use prefix-conditioning in Long Range Arena as precursor to time-conditioning problem
    • NOV - JAN
  • PHASE 3: Test on biomedical applications
    • Application 1: dataset, goals, benchmark model performance
    • Application 2:
    • FEB - MAY

 

 

  1. FLARE + Time
    1. transient calculations (LPBF, cylinder flow)
    2. diffusion (image generation, ...)
  2. Enhancements to FLARE/ Transformer
    1. PDE problems
    2. Long Range Arena problems

Update 4/02/25

Modeling dynamical deformation in LPBF with neural network surrogates

  • Contributions:
    • Novel transformer architecture for spatial slicing
    • Novel transformer architecture for time-series modeling
    • Time-series dataset for LPBF
  • Updates:
    • Testing architecture modifications on the Cylinder Flow benchmark dataset
  • Next steps:
    • Continue experimenting with architecture for steady state
    • Continue benchmarking for rollout

Data-flow arguments

  • Slicing
    • Larger key, value projection
    • Learned query [H, M, D]
      • Transolver applies a linear projection to the permuted x. From the Perceiver point of view, the weights of this linear layer act as a latent query embedding, so Transolver uses the same query vector for every head; we give each head a unique one (see the sketch after this list).
    • QK normalization
      • Normalize query and key embeddings for stable training (ref. NGPT paper)
    • Query - head mixing
      • Allow slice weights in different heads to communicate with each other (ref. Multi-Token attention, Talking-Head attention papers)
      • We cannot do key mixing because the key vector has size N (the point cloud), and we cannot apply any permutation-dependent convolution either.
  • Head-wise normalization - stability, breaks symmetry
  • Self-attention
    • Permute & QKV projection [H*D, H*D]
      • Transolver applies the same projection to each head in parallel, with no head mixing; we allow for more head mixing here.
      • The query-head mixing above acts only on the attention weights; here we mix token values across heads.
      • Think of Transolver as applying a block-diagonal matrix with identical blocks, whereas we apply a full matrix.
  • Deslicing
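
To make the slicing recipe above concrete, here is a minimal PyTorch sketch, assuming point-cloud features of shape [B, N, C] split into H heads of dimension D = C/H. The module name, the 1x1 convolution used for query-head mixing, and the learned temperature are illustrative choices, not the exact CAT/FLARE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceAttention(nn.Module):
    """Illustrative slicing step: per-head learned queries, QK normalization, query-head mixing."""
    def __init__(self, channels: int, heads: int, num_clusters: int):
        super().__init__()
        assert channels % heads == 0
        self.H, self.D, self.M = heads, channels // heads, num_clusters
        # Unlike a single shared query, each head owns its own latent query bank: [H, M, D].
        self.query = nn.Parameter(torch.randn(heads, num_clusters, channels // heads))
        self.key = nn.Linear(channels, channels)
        self.value = nn.Linear(channels, channels)
        self.scale = nn.Parameter(torch.tensor(10.0))        # learned temperature after L2 normalization
        # Query-head mixing: lets slice weights in different heads communicate (1x1 conv over heads).
        self.head_mix = nn.Conv2d(heads, heads, kernel_size=1)

    def forward(self, x):                                    # x: [B, N, C] point-cloud features
        B, N, _ = x.shape
        k = self.key(x).view(B, N, self.H, self.D).transpose(1, 2)    # [B, H, N, D]
        v = self.value(x).view(B, N, self.H, self.D).transpose(1, 2)  # [B, H, N, D]
        q = F.normalize(self.query, dim=-1)                  # QK normalization for stable training
        k = F.normalize(k, dim=-1)
        attn = torch.einsum('hmd,bhnd->bhmn', q, k) * self.scale      # slice weights [B, H, M, N]
        attn = self.head_mix(attn)                           # mix attention weights across heads
        attn = attn.softmax(dim=-1)                          # normalize over the N points
        z = torch.einsum('bhmn,bhnd->bhmd', attn, v)         # latent (sliced) tokens [B, H, M, D]
        return z, attn
```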

Experiments - standard PDE benchmarks

[Results table: relative errors on standard PDE benchmarks (including ShapeNet-Car) for LNO, CAT (ours), and Transolver with and without conv; the column-to-benchmark mapping was lost in the export.]

Cluster Attention Transformer

CAT Block

Elasticity: blocks vs layers scaling

Layers: Number of projections (latent encoding/decoding operations).

Blocks: Number of attention blocks in latent space in each layer.

Projection heads: Number of latent encoding/decoding projections happening in parallel in each layer.

Clusters: Projection dimension.

Elasticity: Scaling WRT Projection Heads


Darcy: blocks vs layers scaling


Darcy: blocks vs layers scaling for different batch sizes

Darcy overfitting: Channels=64, Clusters=64, Layers=1, Blocks=8

Configurations: MLP block in latent and pointwise space / in latent space only / in pointwise space only, each with Projection Heads = 4 and Projection Heads = 1.

(Train/test) rel errors across these six configurations: 5.765e-3 / 2.027e-2, 5.999e-3 / 1.440e-2, 7.363e-3 / 1.465e-2, 6.076e-3 / 1.144e-2, 6.776e-3 / 1.182e-2, 7.234e-3 / 1.176e-2 (the mapping of errors to configurations followed the slide layout).

Darcy overfitting: Channels=64, Clusters=64, Layers=8, Blocks=1

Configurations: MLP block in latent and pointwise space / in latent space only / in pointwise space only, each with Projection Heads = 4 and Projection Heads = 1.

(Train/test) rel errors across these six configurations: 1.915e-3 / 6.935e-3, 2.243e-3 / 7.581e-3, 2.078e-3 / 7.101e-3, 2.780e-3 / 6.956e-3, 2.999e-3 / 6.918e-3, 3.526e-3 / 7.109e-3 (the mapping of errors to configurations followed the slide layout).

Cluster Attention Transformer

CAT Block

Scaling study

Channel dim: Model working dimension.

Blocks: Number of CAT projection blocks (latent encoding/decoding operations).

Latent Blocks: Number of self-attention blocks in latent space in each CAT block.

Projection heads: Number of latent encoding/decoding projections happening in parallel in each layer.

Clusters: Projection dimension (see the example configuration below).
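
As a concrete example of how these knobs combine, a hypothetical configuration is sketched below; the names and values are illustrative only (they loosely echo the Darcy overfitting runs elsewhere in the deck) and are not a prescription from the study.

```python
# Hypothetical CAT configuration illustrating the hyperparameters above (values are examples only).
cat_config = dict(
    channels=64,          # Channel dim: model working dimension
    blocks=8,             # CAT projection blocks (latent encode/decode operations)
    latent_blocks=1,      # self-attention blocks in latent space inside each CAT block
    projection_heads=4,   # parallel encode/decode projections per layer
    clusters=64,          # projection dimension (latent tokens per head)
)
assert cat_config["channels"] % cat_config["projection_heads"] == 0   # per-head dim = 64 // 4 = 16
```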

Scaling study

OBSERVATIONS:

  • More CAT blocks, fewer latent blocks!
    • Expected
  • More projection heads work better!
    • Validates our approach of multiple parallel projections
  • With many projection heads, CAT works well with no latent blocks!
    • SURPRISING!


Scaling study

  • During projection, each cluster amasses information from hundreds or thousands of points.
  • This is like a pooling (averaging) operation (see the sketch after the Transolver comparison below).
  • The cluster values are then projected (copy-pasted) back to the point cloud.

CAT with Latent Blocks = 0

  • Project to many sets of cluster locations, with all-to-all interaction among the projection weights (before projection).
  • Project back to the point cloud.

Transolver

  • Project to 1 set of slices
  • Attention among slices
  • Project back to point cloud
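
A minimal sketch of this pooling reading (the function name and the assumption that the slice weights are precomputed are mine): with zero latent blocks, a CAT layer reduces to a weighted pooling of the point cloud onto cluster tokens followed by a broadcast back to the points via the transposed weights.

```python
import torch

def project_and_deslice(x, slice_weights):
    """Pooling view of a CAT layer with zero latent blocks (illustrative sketch).

    x:             [B, N, C]    point-cloud features
    slice_weights: [B, H, M, N] per-head cluster weights (each row normalized over the N points)
    """
    B, N, C = x.shape
    H = slice_weights.shape[1]
    xh = x.view(B, N, H, C // H).transpose(1, 2)                       # [B, H, N, D]
    clusters = torch.einsum('bhmn,bhnd->bhmd', slice_weights, xh)      # weighted pooling -> [B, H, M, D]
    y = torch.einsum('bhmn,bhmd->bhnd', slice_weights, clusters)       # broadcast back   -> [B, H, N, D]
    return y.transpose(1, 2).reshape(B, N, C)
```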

Scaling study


(Figure annotation) \( = \frac{\text{Channel dim}}{\text{Number of projection heads}}\)

Scaling study


Elasticity: Decay in singular values in different layers (EPOCH 0)

Elasticity: Decay in singular values in different layers (EPOCH 500)


ShapeNet-Car: Decay in singular values in different layers

Darcy: Decay in singular values in different layers

Latent Cross-Attention (DCA)

Expressive transformer architecture

TASKS

  • [~] Figures
  • [~] Method exposition
  • [~] Mathematical analysis, discussion
  • [X] CUDA kernel implementation
  • [ ] Scaling, ablations

APPLICATIONS

  • [X] PDEs
  • [~] Point cloud segmentation/ classification
    • [X] ModelNet40 Classification
    • [~] ScanNet semantic segmentation
    • [ ] S3DIS semantic segmentation
  • [~] Image classification, diffusion
    • Requires comprehensive hyperparameter tuning
    • Better to focus on this in future work

FUTURE WORK

  • DCA for CV tasks
  • Foundation model for 3D shape understanding
    • Ref. ShapeLLM

FLARE scaling study

Weekly meeting - 10/23/25

PROPOSAL

  • Waiting on Prof. Farimani for scheduling. Potential dates:




     

NEXT PAPER - ICML (International Conference on Machine Learning)

  • Triple attention method
    • Competitive with FLARE on PDE problems
  • Multilinear attention method (maybe)

WINTER BREAK PLAN - visit India Dec 11 - Jan 10

  • Attend wedding Dec 12 - 14
  • Work remotely the weeks of Dec 15
  • Take off the week of Dec 22
  • Work remotely week of Jan 5

PROGRESS

  • Setting up Long Range Arena benchmark problems for FLARE reviews.
  • Wrote CUDA kernels to make triple attention method as efficient as FLARE
  • Fixed numerical instabilities with FP16 training
  • Fixed issue with Darcy dataset (removed downsampling)
  • Testing triple attention with different kernels (parameter study)

NEXT STEPS

  • Decide on PDE experimental suite for Triple attention
  • Test on Long Range Arena benchmarks
  • Set up toy problems that demonstrate advantage of three-way attention.

Triple attention vs FLARE scaling

Triple attention vs FLARE on Darcy (58k): C=128, B=8. Models: Triple v1 (4.2M params), Triple v2 (1.9M), FLARE (2.4M); curves compare different depths of ResidualMLPs.

ROM Project

Nonlinear kernel parameterizations for Neural Galerkin

  • Equation-based, data-free numerical methods for solving PDEs
  • Fast PDE solve in comparison to FEM thanks to compact representation
  • Smaller representation, faster solve in comparison to ML-based ROMs

Status and plan

  • Promising preliminary results on a host of 1D problems
  • Need at least 1 semester of time to flesh out these ideas

Potential new contributions and timeline

  • Develop and finalize proposed parameterizations (1-2 months)
    • Test different parameterization ideas
    • Test on 1D, 2D test cases
  • Develop adaptive refinement/coarsening techniques (1 month)
\[
u(x, t) = \sum_{i=1}^{\textcolor{magenta}{N}} \frac{\textcolor{blue}{c_i(t)}}{2} \left( \tanh\big(\textcolor{blue}{\omega_{0,i}(t)}\,(x - \textcolor{blue}{x_{0,i}(t)})\big) - \tanh\big(\textcolor{blue}{\omega_{1,i}(t)}\,(x - \textcolor{blue}{x_{1,i}(t)})\big) \right)
\]

Parameterized Tanh kernels

ROM Project - Nonlinear parameterizations

Nonlinear kernel parameterizations for Neural Galerkin

  • Equation-based, data-free numerical methods for solving PDEs

Status and plan

  • Promising preliminary results on a host of 1D problems
  • Need at least 1 semester of time to flesh out these ideas

Potential new contributions and timeline

  • Develop novel parameterizations that have several benefits
    • Very expressive (handles shocks) with few parameters (speedup)
    • Fast hyper-reduction as parameterization is naturally sparse
    • Accurate integration as parameterization is sparse
    • In comparison, DNN parameterizations are large and do not result in a speedup; other kernelized parameterizations (e.g., Gaussian kernels) are not as expressive
  • Develop adaptive refinement techniques
  • Develop adaptive coarsening techniques
\[
u(x, t) = \sum_{i=1}^{\textcolor{magenta}{N}} \frac{\textcolor{blue}{c_i(t)}}{2} \left( \tanh\big(\textcolor{blue}{\omega_{0,i}(t)}\,(x - \textcolor{blue}{x_{0,i}(t)})\big) - \tanh\big(\textcolor{blue}{\omega_{1,i}(t)}\,(x - \textcolor{blue}{x_{1,i}(t)})\big) \right)
\]

Parameterized Tanh kernels
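
A small NumPy sketch of evaluating this parameterization at a fixed time t; the function and argument names are illustrative and the parameters are assumed to be given as length-N arrays.

```python
import numpy as np

def tanh_kernel_model(x, c, w0, x0, w1, x1):
    """u(x) = sum_i c_i/2 * (tanh(w0_i (x - x0_i)) - tanh(w1_i (x - x1_i))) at a fixed time.

    x:                 [P] evaluation / collocation points
    c, w0, x0, w1, x1: [N] kernel parameters (amplitudes, widths, edge locations) at time t
    """
    x = x[:, None]                                            # [P, 1] broadcast against the N kernels
    bumps = np.tanh(w0 * (x - x0)) - np.tanh(w1 * (x - x1))   # [P, N] one tanh "bump" per kernel
    return 0.5 * (bumps @ c)                                  # weighted sum over kernels -> [P]
```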

Neural Galerkin - Advection Diffusion problem

Parameterized Gaussian (OURS): 3 parameters, 8 collocation points.

Deep Neural Network (BASELINE): ~150 parameters, 256 collocation points.

Multiplicative filter network (MFN): ~210 parameters, 256 collocation points.

Panel notes from the figure: "Error due to limited expressivity of this simple model"; "FAILED TO CONVERGE".

Improve model fit by splitting kernels

  • At time t = 0, we fit the given initial condition with our nonlinear model. This is the projection step.
  • To improve the fit, we make the model more expressive via boosting.
  • We do this by repeatedly dividing each kernel in two and optimizing both halves (see the sketch below).
  • This is akin to adaptive refinement.

Figure panels: 1 kernel (6 params) vs 4 kernels (21 params).
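
A hedged sketch of the splitting step, assuming the tanh parameterization above: the slide does not fix the exact splitting rule, so halving each kernel's support at its midpoint is one plausible choice; both children should then be re-optimized against the target.

```python
import numpy as np

def split_kernels(c, w0, x0, w1, x1):
    """Split each tanh kernel supported on [x0_i, x1_i] into two half-width kernels (illustrative rule).

    Inputs are length-N parameter arrays; outputs have length 2N and are meant to be re-optimized
    (e.g., by refitting the initial condition) after the split.
    """
    mid = 0.5 * (x0 + x1)                     # midpoint of each kernel's support
    c_new  = np.concatenate([c, c])           # children inherit the parent's amplitude as a warm start
    w0_new = np.concatenate([w0, w0])
    w1_new = np.concatenate([w1, w1])
    x0_new = np.concatenate([x0, mid])        # left child covers [x0, mid]
    x1_new = np.concatenate([mid, x1])        # right child covers [mid, x1]
    return c_new, w0_new, x0_new, w1_new, x1_new
```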
