Ramchandran Muthukumar
Mentors : Frank Permenter, Chenyang Yuan
Manager : Avinash Balachandran
Computer Science Ph.D. Defense
cat
Given an image, classify it
Pneumonia Detection from Chest X-ray
Traffic Sign Detection for
Autonomous Vehicles
Face Recognition
Digit Recognition
Machine learning has been effective in classification*
Our understanding remains incomplete
In this talk:
Evaluate machine learning models rigorously
\(^*\) We built flying machines before we fully understood the aerodynamics of flight.
Performance of machine learning models
Provable mathematical statements
Statistics transforms anecdotes into evidence.
cat
\(x\)
Input
Label
\( y \)
\( \{\textit{cat}, \textit{dog}, \textit{bird}, \ldots \} \)
Classify an input in \(\mathcal{X}\),
with an appropriate label in \(\mathcal{Y}\)
\( \mathcal{X}\): images of pets
\( \mathcal{Y}\): types of pets
Some inputs are more common than others
e.g. cats vs pandas
A distribution \( \mathcal{D} \) captures the probability of sampling an input-label pair
\(^*\) The symbol \(\sim\) denotes sampling
Unfortunately \( \mathcal{D} \) is unknown.
Instead, we have samples\(^\dagger\)
\(^\dagger\) i.i.d = independent and identically distributed
For random labeled data \((x,y) \sim \mathcal{D}\) \(^*\),
classify input \(x\) as the label \(y\)
\(S\)
\( \overset{\mathrm{i.i.d}}{\sim} (\mathcal{D})^m\)
\(= \{ (x_1, y_1), (x_2, y_2), \ldots (x_m, y_m) \} \)
Training Data
bird
dog
cat
(Proxy)
For training data \((x_i,y_i)\) in \(S\),
classify input \(x_i\) as the label \(y_i\)
\( \overset{\mathrm{i.i.d}}{\sim} (\mathcal{D})^m\)
\(S\)
Does doing homework \(\implies\) scoring well in the test?
(Proxy)
For random labeled data \((x,y) \sim \mathcal{D}\),
classify input \(x\) as the label \(y\)
For training data \((x_i,y_i)\) in \(S\),
classify input \(x_i\) as the label \(y_i\)
Does the ability to classify training data \(S\),
mean we can also classify data from \(\mathcal{D}\) ?
when do we generalize?
cat
\(^*\) \( \mathrm{label}(h, x) \coloneqq \underset{c}{\arg\max}\; [h (x)]_c \)
\(^\dagger\) \(\mathrm{margin}(h,(x, y)) \coloneqq [ h(x)]_{y} - \max_{j \neq y} [h(x)]_j\)
\(x\)
\( y \)
\( h \)
Fraction\(^\star\) of training data
where the margin is insufficient
Probability\(^\dagger\) of sampling data
where the margin is insufficient
For \( \gamma = 0 \), \(\mathrm{TestError}_{0}(h) \) is the probability of misclassification
\(^\star\) \(\mathrm{TrainingError}_{\gamma}(h) := \frac{1}{|\texttt{S}|} \sum_{(x_i, y_i) \text{ in } \texttt{S}} \mathbf{1}\{\mathrm{margin}(h, (x_i, y_i)) < \gamma\}\)
\(^\dagger\) \(\mathrm{TestError}_{\gamma}(h) := \underset{(x, y) \sim \mathcal{D}}{\mathbf{Prob}} \left\{ \mathrm{margin}(h, (x, y)) <\gamma \right\}\)
Training samples
Inputs
\( \mathcal{X} \subset \mathbb{R}^d\)
Labels
\( \mathcal{Y} := \{1, \ldots, C\} \)
Data Distribution
\( \mathcal{D} \) over \( \mathcal{X} \times \mathcal{Y} \) (unknown)
Hypothesis Class
Predicted Label
Margin
$$ \mathrm{label}(h, x) \coloneqq \underset{c}{\arg\max}\; [h (x)]_c$$
\(\mathcal{H} \subseteq \{ h : \mathcal{X} \rightarrow \mathbb{R}^C \}\)
\( \texttt{S} := \{ (x_i, y_i) \}_{i=1}^m \overset{\mathrm{i.i.d}}{\sim}\) \((\mathcal{D})^m \)
$$\mathrm{margin}( h,( x, y)) \coloneqq [ h(x)]_{y} - \max_{j \neq y} [ h(x)]_j $$
Training Error
Test Error
\( \frac{1}{|\texttt{S}|} \sum_{(x_i, y_i) \text{ in } \texttt{S}} \mathbf{1}\{\mathrm{margin}(h, (x_i, y_i)) < \gamma\}\)
\( \underset{(x, y) \sim \mathcal{D}}{\mathbf{Prob}} \left\{ \mathrm{margin}(h, (x, y)) <\gamma \right\} \)
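The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the talk's code: the score matrix `scores` (one row of \(h(x_i)\) per sample), the labels `y`, and the function names are all made up for the example, and the margin is taken as the true-class score minus the largest other score.

```python
import numpy as np

def predicted_label(scores):
    """label(h, x): index of the largest score in each row."""
    return np.argmax(scores, axis=1)

def margin(scores, y):
    """margin(h, (x, y)) = [h(x)]_y - max_{j != y} [h(x)]_j, one value per row."""
    scores = np.asarray(scores, dtype=float)
    rows = np.arange(scores.shape[0])
    true_score = scores[rows, y]
    others = scores.copy()
    others[rows, y] = -np.inf  # exclude the true class from the max
    return true_score - others.max(axis=1)

def training_error(scores, y, gamma):
    """Fraction of training samples whose margin is below gamma."""
    return float(np.mean(margin(scores, y) < gamma))

# Hypothetical scores h(x_i) for 4 samples and C = 3 classes.
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.3, 0.9],
                   [1.0, 1.2, 0.0],
                   [0.4, 0.1, 0.3]])
y = np.array([0, 2, 0, 0])

for gamma in (0.0, 0.5, 1.0):
    print(gamma, training_error(scores, y, gamma))
```

With \(\gamma = 0\) the training error counts plain misclassifications; raising \(\gamma\) also counts correctly classified samples whose margin is too thin, so the value can only grow with \(\gamma\).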
If \(\mathrm{TrainingError}_{\gamma}(h)\) is small,
how large can \(\mathrm{TestError}_{\gamma}(h)\) be?
A non-asymptotic, probabilistic bound on the test error of a model
With probability at least \(1-\delta\) over the sampling of training data, for any model \(h\) in \(\mathcal{H}\),
\({\bm{\kappa}(\cdot)}\) = capacity measure
valid for any finite training data S of size m
valid with high probability over randomly sampled training data \(S \overset{\textrm{i.i.d}}{\sim} (\mathcal{D})^m \)
Vacuous if the bound is larger than 1
\({ \kappa(\cdot)}\) can depend on several things: data distribution \(\mathcal{D}\), hypothesis class \(\mathcal{H}\), training data \(S\), learned model \(h\) etc.
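For orientation, margin-based bounds in the literature often take a shape like the one below. This is a schematic only, assuming an unspecified constant \(c\) and a generic capacity term; the exact dependence on \(\kappa\), \(\gamma\), \(m\) and \(\delta\) differs from result to result, and this is not the specific statement shown in the talk.

$$ \mathrm{TestError}_{0}(h) \;\leq\; \mathrm{TrainingError}_{\gamma}(h) \;+\; c\, \frac{\kappa(\cdot)}{\gamma \sqrt{m}} \;+\; \sqrt{\frac{\log(1/\delta)}{m}} $$

Since the left-hand side is a probability, a right-hand side above 1 carries no information, which is the sense in which a bound is vacuous.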
How large is \(\mathcal{H}\) ?
How expressive is \(\mathcal{H}\) on \(S\) ?
VC-dimension \(\kappa_{\mathrm{VC}}(\mathcal{H})\), Rademacher complexity \(\kappa_{\mathrm{RC}}(\mathcal{H}, S)\)
Capacity measures that only depend on \(\mathcal{H}\) result in bounds,
1. Uniform over \(\mathcal{H}\) - including the bad classifiers
2. Oblivious to learning process
Finding capacity measures that correlate with test error in practice is an active area of research
global
Theorem\(^\star\) (Bartlett et al. (2017), Neyshabur et al. (2017), etc.)
With high probability over the training data S, for any \( h \in \mathcal{H} \)
\( \tilde{\mathcal{O}} \) suppresses log factors, constants and failure probability.
\(^\star\) Simplified informal statement of results.
Global sensitivity depends on worst-case interaction between the model and data.
Capacity measures that only depend on \(\mathcal{H}\) result in uniform bounds
Can we do better with local information?
\(\gamma\) is a hyper-parameter chosen before observing data
Sensitivity of machine learning models
Within a local region
Radius within which linear approximation of \(h\) at \(x\) is exact.
The size \( \|\nabla_{\mathcal{H}} h(x) \|_2 \) of the first-order local linear approximation based on the Jacobian of \(h\) at \(x\)
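As a rough illustration of this quantity, the sketch below estimates the Jacobian of a single layer \(h(x) = \texttt{ReLU}(Wx)\) by finite differences and reports its spectral norm. It assumes that \(\nabla_{\mathcal{H}}\) means the derivative with respect to the model parameters (here the entries of \(W\)), which the slide does not spell out; the matrix `W`, the input `x`, and the helper names are made up for the example.

```python
import numpy as np

def relu_layer(W, x):
    """Single feedforward layer h(x) = ReLU(W x)."""
    return np.maximum(W @ x, 0.0)

def param_jacobian(W, x, eps=1e-6):
    """Finite-difference Jacobian of ReLU(W x) with respect to the entries of W.

    Returns an array of shape (k, k * d): one column per entry of W,
    in row-major order.
    """
    k, d = W.shape
    base = relu_layer(W, x)
    jac = np.zeros((k, k * d))
    for idx in range(k * d):
        W_pert = W.copy()
        W_pert.flat[idx] += eps  # perturb one parameter at a time
        jac[:, idx] = (relu_layer(W_pert, x) - base) / eps
    return jac

# Hypothetical weights and input (assumptions for illustration only).
W = np.array([[1.0, -1.0, 0.5],
              [-2.0, 0.3, 0.1],
              [0.2, 0.4, -1.0]])
x = np.array([1.0, 2.0, -1.0])

jac = param_jacobian(W, x)
jac_norm = np.linalg.norm(jac, 2)  # spectral norm of the Jacobian
print(jac_norm, np.linalg.norm(x))
```

For this one-layer map only the rows of \(W\) with a positive pre-activation contribute, so the Jacobian norm equals \(\|x\|_2\) whenever at least one unit is active. The linear approximation stays exact only while no unit switches between active and inactive.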
\(\gamma\) is a hyper-parameter chosen before observing data
With high probability over the training data S, for any \( h \in \mathcal{H} \)
Theorem\(^\star\) (Nagarajan et al. (2019), Wei et al. (2020), etc.)
\(^\star\) Simplified informal statement of results.
With high probability over the training data S, for any \( h \in \mathcal{H} \)
For some \( (x_i, y_i) \),
Theorem\(^\star\) (Nagarajan et al. (2019), Wei et al. (2020), etc.)
When the local linear approximation is poor
e.g. high curvature,
non-linearity, etc.
\(^\star\) Simplified informal statement of results.
Sensitivity\((h)\)
Bound on
\(\mathrm{TestError}_0(h)\)
Bartlett et al. (2017),
Neyshabur et al. (2017), etc.
Nagarajan et al. (2019),
Wei et al. (2020), etc.
Global
Jacobian
\(1\)
\(0\)
Best of both worlds?
Are there rigorous generalization bounds based on intermediate sensitivity?
A local sensitivity oracle\(^{\star}\) provides a radius \(\mathrm{r}_{\mathrm{local}}\) such that,
Model
Input
Desired Sensitivity Level
\(^\star\) An oracle is a black box assumed to answer queries, without revealing how.
We assume that the local sensitivity oracle is stable:
Local radius within which \(h\) exhibits the desired sensitivity at \(x\)
The desired level of local sensitivity \(\mathsf{L}\)
\(^\star\) Simplified informal statement of results.
\(\gamma, \mathsf{L}\) are hyper-parameters chosen before observing data
With high probability over the training data S, for any \( h \in \mathcal{H} \)
Theorem\(^\star\) (Stable Local Sensitivity Oracle)
With high probability over the training data S, for any \( h \in \mathcal{H} \)
\(^\star\) Simplified informal statement of results.
\(\gamma, \mathsf{L}\) are hyper-parameters chosen before observing data
Intermediate sensitivity can provide rigorous generalization bounds for all hypothesis classes!
Search for the optimal sensitivity level \(\mathsf{L}\)
for each model \(h\) and training data \(S\)
Theorem\(^\star\) (Stable Local Sensitivity Oracle)
\(^\star\) Exact computation is NP-hard even for shallow feedforward neural networks, as per Scaman et al. (2016).
Structure in the interactions between the model and data
aka Occam's razor
Start simple, add complexity only if essential.
When is \(\mathsf{L}\) large or small?
Interpretation\(^{\star}\) of \(\mathsf{L}\) depends
on the scale of the output: \(\|h(x)\|_2\)
\(^\star\) A salary increase of $1000 is insignificant to Jeff Bezos but significant to me.
\(\mathsf{L}_{\rm global} \propto \sup_{h \in \mathcal{H}} \;\sup_{x \in \mathcal{X}} \; \|h(x)\|_2\)
Misleading for a particular \(h\) and input \(x\) when the scale varies significantly
worst-case scale across \(\mathcal{H}\) and \(\mathcal{X}\)
Local sensitivity should be
proportional to the local scale:
\(\sup_{\hat{h}\; \mathrm{ nearby }\; h}\; \sup_{\tilde{x}\; \mathrm{ nearby }\; {x}} \|\hat{h}(\tilde{x})\|_2\).
My brain in full
Reading
The Local Parsimony Principle
Locally, complex models \(\approx\) simpler models
Different simple models of varying complexity for each \( (h, x) \)
Listening
Thinking
Only 3% of neurons are needed at any input.
Neural networks are not brains but do exhibit local parsimony
We will now show a systematic framework
linking parsimony and sensitivity.
Each step uses the example of a feedforward map
Observe parsimony in the interaction between model and data
The output \(h(x) \) is sparse with an index set J of size \(s\) containing only zero entries
Diagram: \(h(x) = \texttt{ReLU}(W x)\), with the entries of \(h(x)\) partitioned into the index set \(J\) (of size \(s\)) and its complement \(J^c\)
An observation has 3 parts
Form
Degree
Context
(sparsity, \(s\), \( J\))
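A minimal sketch of recording such an observation for one ReLU layer follows. It takes the slide's reading that \(J\) collects the zero entries of \(h(x)\) and \(s\) is its size; the weights `W`, the input `x`, and the function name `observe_sparsity` are assumptions made for the example.

```python
import numpy as np

def observe_sparsity(W, x):
    """Return the (form, degree, context) triple for h(x) = ReLU(W x).

    form    : the kind of parsimony observed (here always "sparsity")
    degree  : s, the number of zero entries in h(x)
    context : J, the indices of those zero entries
    """
    out = np.maximum(W @ x, 0.0)
    J = np.flatnonzero(out == 0.0)
    return ("sparsity", int(J.size), J)

# Hypothetical weights and input (assumptions for illustration only).
W = np.array([[1.0, -1.0, 0.5],
              [-2.0, 0.3, 0.1],
              [0.2, 0.4, -1.0]])
x = np.array([1.0, 2.0, -1.0])

form, s, J = observe_sparsity(W, x)
print(form, s, J)
```

Here two of the three units have negative pre-activations, so the observation is sparsity of degree 2 in the context of those two indices.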
Identify the active and inactive parts
\(W[J,:]\) is active and \(W[J^c,:]\) is inactive
Diagram: \(h(x) = \texttt{ReLU}(W x)\), with the rows of \(W\) and the entries of \(h(x)\) partitioned into \(J\) and \(J^c\)
Isolate the structural trigger of parsimony
Reduce the complexity of the model at an input
Diagram: the full map \(h(x) = \texttt{ReLU}(W x)\) beside the reduced map \(h_{J}(x) = \texttt{ReLU}(\mathcal{P}_{J,:}(W)\, x)\), each partitioned into \(J\) and \(J^c\)
\(\mathcal{P}_{J,:} (W)\) = rows of \(W\) in \(J^c\) are zeroed
Reduce the complexity of the model at an input
Diagram: \(h_{J}(x) = \texttt{ReLU}(\mathcal{P}_{J,:}(W)\, x)\), partitioned into \(J\) and \(J^c\)
At \(x\), the complex model \(h\) is equivalent to the simpler model \(h_{J}\)
\(\|h(x)\|_2 = \|h_J(x)\|_2 \leq \|\mathcal{P}_{J,:}(W)\|_2 \|x\|_2\)
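The equivalence and the norm bound can be checked numerically. The sketch below is an illustration under stated assumptions: it zeroes the rows of \(W\) whose outputs vanish at \(x\) and keeps the rest, using the neutral name `keep` for the retained rows rather than committing to which set the slides call \(J\); `W`, `x`, and the helper names are made up for the example.

```python
import numpy as np

def relu_layer(W, x):
    """Single feedforward layer h(x) = ReLU(W x)."""
    return np.maximum(W @ x, 0.0)

def reduce_rows(W, keep):
    """Copy of W in which every row outside `keep` is set to zero."""
    W_red = np.zeros_like(W)
    W_red[keep, :] = W[keep, :]
    return W_red

# Hypothetical weights and input (assumptions for illustration only).
W = np.array([[1.0, -1.0, 0.5],
              [-2.0, 0.3, 0.1],
              [0.2, 0.4, -1.0]])
x = np.array([1.0, 2.0, -1.0])

out = relu_layer(W, x)
keep = np.flatnonzero(out > 0.0)   # rows that are active at x
W_red = reduce_rows(W, keep)
out_red = relu_layer(W_red, x)     # output of the simpler model at x

out_norm = np.linalg.norm(out)
reduced_bound = np.linalg.norm(W_red, 2) * np.linalg.norm(x)
full_bound = np.linalg.norm(W, 2) * np.linalg.norm(x)
print(out_norm, reduced_bound, full_bound)
```

The reduced bound is never larger than the bound from the full matrix, because zeroing rows cannot increase the spectral norm. This is the sense in which the simpler model gives a tighter handle on the local scale.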
Diagram: the entries of \(h(x)\), partitioned into \(J\) and \(J^c\)
Localize the reduction in complexity to nearby\(^\star\) models
Diagram: \(\hat{h}_{J}(x) = \texttt{ReLU}(\mathcal{P}_{J,:}(\hat{W})\, x)\) for models \(\hat{h}\) within the local radius, partitioned into \(J\) and \(J^c\)
\(^\star\) For an appropriately chosen distance metric
Measure sensitivity locally within the neighborhood
Local radius
For nearby models \(\hat{h}\) within the local radius,
Local sensitivity
Local sensitivity is proportional to the local scale
Measure sensitivity locally within the neighborhood
\(\mathsf{L}_{\rm jacobian}(h,x) \leq \mathsf{L}_{\rm sparse} (h,x, J) \leq \mathsf{L}_{\rm global}\)
For all observations of parsimony with context \(J\)
\(\mathsf{r}_{\rm jacobian}(h,x) \leq \mathsf{r}_{\rm sparse} (h, x, J) \leq \mathrm{r}_{\rm global} = \infty\)
A larger local sensitivity holding within a larger neighborhood
Local radius
Local sensitivity
So far, we saw how a single observation of parsimony yields a local measure of sensitivity.
Collect and aggregate measurements across different contexts for a fixed degree of sparsity \(s\)
Vary \(s\) to interpolate between Jacobian and global sensitivity
Diagram: a three-layer feedforward map with weights \(W_1, W_2, W_3\) and index sets \(J_k\), \(J_k^c\) at each layer \(k = 1, 2, 3\)
From single layer feedforward map to multiple layers
\(\vec{s} = (s_1, s_2, \ldots, s_K)\)
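One way to read the vector \(\vec{s}\) is as a per-layer count of zero activations at a given input. The sketch below computes such a profile for a small feedforward map. It is an illustration only: it applies ReLU at every layer for simplicity, and the weights, the input, and the function name `layer_sparsity_profile` are assumptions made for the example.

```python
import numpy as np

def layer_sparsity_profile(weights, x):
    """Number of zero activations at each layer of a ReLU feedforward map.

    `weights` is a list [W_1, ..., W_K]; layer k computes ReLU(W_k a_{k-1})
    with a_0 = x. Returns the list (s_1, ..., s_K) and the final output.
    """
    counts = []
    a = x
    for W in weights:
        a = np.maximum(W @ a, 0.0)
        counts.append(int(np.sum(a == 0.0)))
    return counts, a

# Hypothetical three-layer map (assumptions for illustration only).
W1 = np.array([[1.0, -1.0, 0.5],
               [-2.0, 0.3, 0.1],
               [0.2, 0.4, -1.0]])
W2 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, -1.0]])
W3 = np.array([[-1.0, 5.0],
               [1.0, 1.0]])
x = np.array([1.0, 2.0, -1.0])

s_vec, output = layer_sparsity_profile([W1, W2, W3], x)
print(s_vec, output)
```

Zero activations at one layer make the matching columns of the next layer's weights irrelevant at that input, which is why the single-layer reduction can be chained layer by layer.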
This workflow can be reproduced for other \(\mathcal{H}\)
e.g. convolutional networks, transformers,
dictionary learning, center-based clustering etc.
Observe Parsimony
Collect
and Aggregate
Identify
and Isolate
Measure
Sensitivity
Reduce
and Localize
Chain
Sequentially
Radius within which \(h\) exhibits the desired stable sparsity at \(x\)
The sensitivity corresponding to the desired level of stable sparsity \(\mathsf{L}\)
Trade-off margin-threshold \(\gamma\) and sparsity levels \(\vec{s}\) for an optimal bound for each model \(h\) and data \(S\)!
Theorem (Sparse local sensitivity-normalized margin bounds\(^\star\))
With high probability over the training data S, for any \( h \in \mathcal{H} \)
Figure: Optimized generalization bound for overparameterized 3-layer feedforward networks on MNIST. Panels: Random Initialization, Pretrained Initialization. Horizontal axis: Size of Training Data (11k to 55k). Vertical axis ticks: 0.1, 1, 10.
Figure: Histogram of \( \tau(h, x, \gamma) \) across training data (axis ticks: 0.4 to 1.0 and 0 to 12)
Models with larger layer widths have
smaller effective dimensionality ratio
Results for general hypothesis classes \( \mathcal{H} \)
using a local sensitivity oracle
Systematic framework shown via feedforward neural networks
Applicable to other forms of parsimony (e.g. rank)
2017
2019
2021
2023
2025
Rising Star Award in ML
Start of Ph.D.
Conference on Neural Information Processing Systems (NeurIPS '20)
SIAM Journal on Mathematics of Data Science
(SI-MODS '22)
SIAM Journal on Optimization
(SI-OPT '21)
Conference on Learning Theory
(COLT '23)
Conference on Parsimony and Learning
(CPAL '24)
Today
(under preparation)
Conference on Computer Vision and Pattern Recognition (CVPR '25)
Jan 2023, SlowDNN @ Abu Dhabi
May 2022, NSF Grant Workshop @ Denver
July 2023, COLT @ Bangalore
Jan 2024, CPAL @ HK
July 2025, CVPR @ Nashville
Aug 2024, Learning Theory Workshop @ Aarhus, DK
June 2023, CCSI @ Boston
Nov 2023, DeepMath @ San Diego
For 1.5/2 hours, roughly every 2 weeks @ Baltimore