Weekly Meeting

MAY 08, 2024

Vedant Puri

https://github.com/vpuri3

Mechanical Engineering, Carnegie Mellon University

Advisors: Prof. Burak Kara, Prof. Jessica Zhang

Update 4/02/25

Modeling dynamical deformation in LPBF with neural network surrogates

  • Contributions:
    • Novel transformer architecture for spatial slicing
    • Novel transformer architecture for time-series modeling
    • Time-series dataset for LPBF
  • Updates:
    • Testing architecture modifications on the Cylinder Flow benchmark dataset
  • Next steps:
    • Continue experimenting with architecture for steady state
    • Continue benchmarking for rollout

Data-flow arguments

  • Slicing
    • Larger key, value projection
    • Learned query [H, M, D]
      • Transolver applies a linear projection to the permuted x. From a Perceiver point of view, the weights of that linear layer act as a latent query embedding, so Transolver uses the same query vectors for every head; we give each head its own.
    • QK normalization
      • Normalize query and key embeddings for stable training (ref. NGPT paper)
    • Query - head mixing
      • Allow slice weights in different heads to communicate with each other (ref. Multi-Token attention, Talking-Head attention papers)
      • Key mixing is not possible because the key axis has size N (the point cloud); we also cannot apply any permutation-dependent conv along it.
  • Head-wise normalization - stability, symmetry breaking
  • Self-attention
    • Permute & QKV projection [H*D, H*D]
      • Transolver applies the same projection to each head in parallel, with no head mixing; we allow more head mixing here.
      • The query-head mixing above acts only on the attention weights; here we mix token values across heads.
      • Think of Transolver as applying a block-diagonal matrix with identical blocks, whereas we apply a full [H*D, H*D] matrix (see the sketch after this list).
  • Deslicing
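
A minimal PyTorch sketch of the modified slicing step described above: per-head learned queries of shape [H, M, D], QK normalization, and query-head mixing applied to the slice logits. Module and parameter names are illustrative, not the actual CAT code; the larger key/value projections and the full [H*D, H*D] latent QKV projection are not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceProjection(nn.Module):
    def __init__(self, dim, heads=8, clusters=64):
        super().__init__()
        assert dim % heads == 0
        self.h, self.m, self.d = heads, clusters, dim // heads
        # learned latent queries, one set per head: [H, M, D]
        self.query = nn.Parameter(torch.randn(heads, clusters, self.d))
        self.to_k = nn.Linear(dim, heads * self.d, bias=False)
        self.to_v = nn.Linear(dim, heads * self.d, bias=False)
        # query-head mixing: lets slice weights in different heads communicate
        self.head_mix = nn.Parameter(torch.eye(heads))

    def forward(self, x):
        # x: [B, N, C] point-cloud features
        B, N, _ = x.shape
        k = self.to_k(x).view(B, N, self.h, self.d).transpose(1, 2)   # [B, H, N, D]
        v = self.to_v(x).view(B, N, self.h, self.d).transpose(1, 2)   # [B, H, N, D]
        q = F.normalize(self.query, dim=-1)                           # QK normalization (nGPT-style)
        k = F.normalize(k, dim=-1)
        logits = torch.einsum('hmd,bhnd->bhmn', q, k)                 # slice logits
        # mix logits across heads; no key mixing since the key axis is the N-point cloud
        logits = torch.einsum('gh,bhmn->bgmn', self.head_mix, logits)
        w = logits.softmax(dim=-2)                                    # each point -> distribution over M slices
        slices = torch.einsum('bhmn,bhnd->bhmd', w, v)                # [B, H, M, D] latent tokens
        slices = slices / (w.sum(dim=-1, keepdim=True) + 1e-6)        # weighted average per slice
        return slices, w

Initializing head_mix to the identity recovers independent per-head slicing (Transolver-like) at the start of training; anything learned beyond that is pure head-to-head communication.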

Experiments - standard PDE benchmarks

[Table: relative errors on standard PDE benchmarks (including ShapeNet-Car) comparing LNO, Transolver with and without conv, and CAT (ours).]
Cluster Attention Transformer

CAT Block

Elasticity: blocks vs layers scaling

Layers: Number of projections (latent encoding / decoding operations).

Blocks: Number of attention blocks in latent space in each layer.

Projection heads: Number of latent encoding / decoding projections happening in parallel in each layer.

Clusters: Projection dimension.

Elasticity: Scaling WRT Projection Heads


Darcy: blocks vs layers scaling


Darcy: blocks vs layers scaling for different batch sizes

Darcy overfitting: Channels=64, Clusters=64, Layers=1, Blocks=8

[Figure: MLP block in latent and pointwise space / latent space only / pointwise space only, each with Projection Heads = 4 and Projection Heads = 1.]

(Train/test) rel errors: 5.765e-3 / 2.027e-2, 5.999e-3 / 1.440e-2, 7.363e-3 / 1.465e-2, 6.076e-3 / 1.144e-2, 6.776e-3 / 1.182e-2, 7.234e-3 / 1.176e-2

Darcy overfitting: Channels=64, Clusters=64, Layers=8, Blocks=1

[Figure: MLP block in latent and pointwise space / latent space only / pointwise space only, with Projection Heads = 4 and Projection Heads = 1.]

Projection Heads=4 - (train/test) rel errors: 1.915e-3 / 6.935e-3, 2.243e-3 / 7.581e-3, 2.078e-3 / 7.101e-3

Projection Heads=1 - (train/test) rel errors: 2.780e-3 / 6.956e-3, 2.999e-3 / 6.918e-3, 3.526e-3 / 7.109e-3

Cluster Attention Transformer

CAT Block

Scaling study

Channel dim: Model working dimension.

 

Blocks: Number of CAT projection blocks (latent encoding / decoding operations).

 

Latent Blocks: Number of self-attention blocks in latent space in each CAT block.

 

Projection heads: Number of latent encoding/ decoding projections happening in parallel in each layer.

 

Clusters: Projection dimension. (An illustrative configuration is sketched below.)
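
An illustrative (hypothetical) hyperparameter bundle, only to pin down how the knobs above relate to each other; the default values mirror one of the Darcy overfitting configurations (Channels=64, Clusters=64, Layers=8, Blocks=1 in the earlier naming), not the swept grid.

from dataclasses import dataclass

@dataclass
class CATConfig:
    channel_dim: int = 64    # model working dimension
    blocks: int = 8          # CAT projection blocks (latent encoding / decoding operations)
    latent_blocks: int = 1   # self-attention blocks in latent space per CAT block
    proj_heads: int = 4      # parallel encoding / decoding projections per layer
    clusters: int = 64       # projection dimension

    @property
    def head_dim(self) -> int:
        # per-head working dimension = channel dim / number of projection heads
        return self.channel_dim // self.proj_heads

assert CATConfig().head_dim == 16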

Scaling study

OBSERVATIONS:

  • More CAT blocks, fewer latent blocks!
    • Expected
  • More projection heads work better!
    • Validates our approach of multiple parallel projections
  • With many projection heads, CAT works well with no latent blocks!
    • SURPRISING!


Scaling study

  • During projection, each cluster aggregates information from hundreds or thousands of points.
  • This is akin to a pooling (averaging) operation.
  • Cluster values are then projected (broadcast) back to the point cloud.

CAT with Latent Blocks = 0

  • Project to many sets of cluster locations, with all-to-all interaction among the projection weights (before projection).
  • Project back to the point cloud (contrasted with Transolver in the code sketch below).

Transolver

  • Project to 1 set of slices
  • Attention among slices
  • Project back to point cloud
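
A hedged, toy-shape sketch of the two pipelines above (names and shapes are illustrative, not the actual implementations): Transolver projects to one set of slices, attends among them, and projects back, while CAT with zero latent blocks runs many projections in parallel and lets them interact only through mixing of the projection weights.

import torch

def transolver_layer(x, slice_w, latent_attn):
    # x: [N, C] point features; slice_w: [M, N] softmax slice weights (one set of slices)
    z = slice_w @ x / slice_w.sum(-1, keepdim=True)   # project: pool points into M slice tokens
    z = latent_attn(z)                                # attention among slices
    return slice_w.t() @ z                            # deslice: broadcast slice values back to points

def cat_layer_no_latent_blocks(x, slice_w, head_mix):
    # slice_w: [H, M, N] slice weights, one set per projection head; head_mix: [H, H]
    w = torch.einsum('gh,hmn->gmn', head_mix, slice_w)               # all-to-all mixing of projection weights
    z = torch.einsum('hmn,nc->hmc', w, x) / w.sum(-1, keepdim=True)  # pool points into clusters (averaging)
    return torch.einsum('hmn,hmc->nc', w, z) / w.shape[0]            # broadcast back, average over heads

x = torch.randn(1000, 32)
out_a = transolver_layer(x, torch.rand(64, 1000).softmax(dim=0), lambda z: z)
out_b = cat_layer_no_latent_blocks(x, torch.rand(4, 64, 1000).softmax(dim=1), torch.eye(4))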

Scaling study


Per-head dimension \( = \frac{\text{Channel dim}}{\text{Number of projection heads}} \)


Elasticity: Decay in singular values in different layers (EPOCH 0)

Elasticity: Decay in singular values in different layers (EPOCH 500)


ShapeNet-Car: Decay in singular values in different layers

Darcy: Decay in singular values in different layers

Latent Cross-Attention (DCA)

Expressive transformer architecture

TASKS

  • [~] Figures
  • [~] Method exposition
  • [~] Mathematical analysis, discussion
  • [X] CUDA kernel implementation
  • [ ] Scaling, ablations

APPLICATIONS

  • [X] PDEs
  • [~] Point cloud segmentation/ classification
    • [X] ModelNet40 Classification
    • [~] ScanNet semantic segmentation
    • [ ] S3DIS semantic segmentation
  • [~] Image classification, diffusion
    • Requires comprehensive hyperparameter tuning
    • Better to focus on this in future work

FUTURE WORK

  • DCA for CV tasks
  • Foundation model for 3D shape understanding
    • Ref. ShapeLLM

FLARE scaling study

Kernel implementation of CAT

TASKS

  • [~] Figures
  • [~] Method exposition
  • [~] Mathematical analysis, discussion
  • [X] Kernel implementation
  • [ ] Scaling study on at least one dataset

APPLICATIONS

  • [X] PDEs
  • [~] Point cloud segmentation/ classification
  • [~] Image classification, diffusion

(CAT) [vedantpu@eagle GeomLearning.py]:python bench/models/cat.py

Forward Pass:
Time (ms): Vanilla=43.50, Flash=33.04, Speedup=1.32x
Memory (MB): Vanilla=399.62, Flash=692.66, Ratio=1.73
Value difference (mean abs): 0.000000

Backward Pass:
Time (ms): Vanilla=230.45, Flash=6.60, Speedup=34.93x
Memory (MB): Vanilla=1184.75, Flash=932.25, Ratio=0.79
Gradient difference (mean abs): q=0.000000, k=0.000000, v=0.000000
  • 1.3x faster inference
  • 35x faster training

ROM Project

Nonlinear kernel parameterizations for Neural Galerkin

  • Equation-based, data-free numerical methods for solving PDEs
  • Fast PDE solve in comparison to FEM thanks to compact representation
  • Smaller representation and faster solve in comparison to ML-based ROMs (a single time step is sketched below)
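
A hedged sketch of one Neural Galerkin time step, assuming the generic scheme (evolve the parameters theta(t) of u(x; theta) by least-squares matching of J(theta) * theta_dot to the PDE right-hand side f at collocation points) and a toy 3-parameter Gaussian ansatz like the one used in the advection-diffusion comparison later; the right-hand side, names, and the absence of regularization are all illustrative.

import torch
from torch.func import jacrev  # requires torch >= 2.0

def u_fn(theta, x):
    c, x0, w = theta[0], theta[1], theta[2]   # amplitude, center, inverse width
    return c * torch.exp(-w * (x - x0) ** 2)

def neural_galerkin_step(theta, x_col, rhs_fn, dt):
    J = jacrev(u_fn)(theta, x_col)            # [P, n_params] sensitivities du/dtheta
    f = rhs_fn(x_col, u_fn(theta, x_col))     # PDE right-hand side at the collocation points
    theta_dot = torch.linalg.lstsq(J, f.unsqueeze(-1)).solution.squeeze(-1)
    return theta + dt * theta_dot             # explicit Euler in time

# toy advection right-hand side u_t = -a * u_x (finite-difference u_x for brevity)
rhs = lambda x, u: -1.0 * torch.gradient(u, spacing=(x,))[0]
theta = torch.tensor([1.0, 0.0, 5.0])
x_col = torch.linspace(-2.0, 2.0, 8)          # 8 collocation points, as on the later slide
theta = neural_galerkin_step(theta, x_col, rhs, dt=1e-3)

Because the parameterization has only a handful of parameters and few collocation points, the least-squares system stays tiny, which is the compactness argument the bullets above make against FEM and DNN-based ROMs.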

Status and plan

  • Promising preliminary results on a host of 1D problems
  • Need at least 1 semester of time to flesh out these ideas

Potential new contributions and timeline

  • Develop and finalize proposed parameterizations (1-2 months)
    • Test different parameterization ideas
    • Test on 1D, 2D test cases
  • Develop adaptive refinement/ coarsening techniques (1 month)
u(x, t) = \sum_{i=1}^{\textcolor{magenta}{N}} \frac{\textcolor{blue}{c_i(t)}}{2} \left( \tanh(\textcolor{blue}{\omega_{i,0}(t)} (x - \textcolor{blue}{x_{i,0}(t)})) - \tanh(\textcolor{blue}{\omega_{i,1}(t)} (x - \textcolor{blue}{x_{i,1}(t)})) \right)

Parameterized Tanh kernels
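
A minimal evaluation of the parameterized tanh kernels above (function and variable names are illustrative): each kernel contributes a smoothed box between its two tanh fronts, so a handful of kernels can represent sharp features.

import torch

def tanh_kernel_u(x, c, w0, x0, w1, x1):
    # x: [P] evaluation points; c, w0, x0, w1, x1: [N] per-kernel coefficients at a fixed time t
    x = x[:, None]  # broadcast points against kernels
    return (0.5 * c * (torch.tanh(w0 * (x - x0)) - torch.tanh(w1 * (x - x1)))).sum(-1)

# one sharp kernel approximating a top-hat on [0.25, 0.75]
x = torch.linspace(0.0, 1.0, 256)
u = tanh_kernel_u(x,
                  c=torch.tensor([1.0]),
                  w0=torch.tensor([50.0]), x0=torch.tensor([0.25]),
                  w1=torch.tensor([50.0]), x1=torch.tensor([0.75]))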

ROM Project - Nonlinear parameterizations

Nonlinear kernel parameterizations for Neural Galerkin

  • Equation-based, data-free numerical methods for solving PDEs

Status and plan

  • Promising preliminary results on a host of 1D problems
  • Need at least 1 semester of time to flesh out these ideas

Potential new contributions and timeline

  • Develop novel parameterizations that have several benefits
    • Very expressive (handles shocks) with few parameters (speedup)
    • Fast hyper-reduction as parameterization is naturally sparse
    • Accurate integration as parameterization is sparse
    • In comparison, DNN parameterizations are large and do not result in a speedup; other kernelized parameterizations (e.g. Gaussian kernels) are not as expressive
  • Develop adaptive refinement techniques
  • Develop adaptive coarsening techniques
u(x, t) = \sum_{i=1}^{\textcolor{magenta}{N}} \frac{\textcolor{blue}{c_i(t)}}{2} \left( \tanh(\textcolor{blue}{\omega_{i,0}(t)} (x - \textcolor{blue}{x_{i,0}(t)})) - \tanh(\textcolor{blue}{\omega_{i,1}(t)} (x - \textcolor{blue}{x_{i,1}(t)})) \right)

Parameterized Tanh kernels

Neural Galerkin - Advection Diffusion problem

Parameterized Gaussian (OURS): 3 parameters, 8 collocation points

Deep Neural Network (BASELINE): ~150 parameters, 256 collocation points

Multiplicative Filter Network (MFN): ~210 parameters, 256 collocation points

[Panel annotations: "Error due to limited expressivity of this simple model"; "FAILED TO CONVERGE"]

Improve model fit by splitting kernels

  • At time t=0, we are fitting the initial condition given to us with our nonlinear model. This is the projection step.
  • To improve the fit, we are going to make the model more expressive with boosting.
  • We do this by repeatedly dividing each kernel in two and optimizing both.
  • This is akin to adaptive refinement (see the sketch below).

[Figure: fit with 1 kernel (6 params) vs. 4 kernels (21 params)]
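
A hedged sketch of the splitting step described above, assuming the per-kernel parameter layout (c, w0, x0, w1, x1) from the tanh formula: splitting at the midpoint makes the two children sum exactly to the parent, so the subsequent re-optimization starts from the current fit. The optimizer itself is not shown, and the exact per-kernel parameter count on the slides may differ.

import torch

def split_kernel(kernel):
    # kernel: [5] tensor (c, w0, x0, w1, x1) for one tanh kernel
    c, w0, x0, w1, x1 = kernel.unbind(-1)
    xm = 0.5 * (x0 + x1)                          # split point between the two fronts
    ws = 0.5 * (w0 + w1)                          # sharpness of the new interior fronts
    left = torch.stack([c, w0, x0, ws, xm], -1)   # covers [x0, xm]
    right = torch.stack([c, ws, xm, w1, x1], -1)  # covers [xm, x1]
    return left, right                            # left + right reproduces the parent exactly

def refine(kernels):
    # kernels: [N, 5] -> [2N, 5]; children are then jointly re-fit to the target
    return torch.cat([torch.stack(split_kernel(k)) for k in kernels], dim=0)

kernels = torch.tensor([[1.0, 50.0, 0.25, 50.0, 0.75]])  # 1 kernel
kernels = refine(kernels)                                 # -> 2 kernels, then re-optimize both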