Biomedical Engineering Seminar
Yale University
"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. [...] Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. [...]
We want AI agents that can discover like we can, not which contain what we have discovered."
The Bitter Lesson, Rich Sutton 2019
measurements
reconstruction
measurements
reconstruction
magnetic susceptibility: the degree to which a material is magnetized when placed in a magnetic field
[from talk by S. Bollmann]
[Li et al, 2012]
dipole inversion via
Option A: One-shot methods
Given enough training pairs \(\{(x_i, y_i)\}\), train a network
\(f_\theta(y) = g_\theta(A^+y) \approx x\)
[Mousavi & Baraniuk, 2017]
[Ongie, Willett, et al, 2020]
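A minimal sketch of this recipe (PyTorch, with a toy random forward operator and a placeholder MLP; the names and architecture here are illustrative, not those of the cited works):

```python
import torch
import torch.nn as nn

# Toy forward operator A: R^n -> R^m (e.g., a random compressed-sensing matrix).
n, m = 64, 32
A = torch.randn(m, n) / m**0.5
A_pinv = torch.linalg.pinv(A)          # A^+ : crude "back-projection"

# Placeholder network g_theta acting on the back-projected estimate A^+ y.
g_theta = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))

def f_theta(y):
    """One-shot reconstruction: f_theta(y) = g_theta(A^+ y)."""
    return g_theta(y @ A_pinv.T)

# Supervised training on pairs (x_i, y_i): minimize ||f_theta(y_i) - x_i||^2.
opt = torch.optim.Adam(g_theta.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(16, n)             # stand-in for training images
    y = x @ A.T                        # measurements y = A x
    loss = ((f_theta(y) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```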
Option A: One-shot methods
Given enough training pairs \(\{(x_i, y_i)\}\), train a network
\(f_\theta(y) = g_\theta(A^+y) \approx x\)
Option B: data-driven regularizer
[Lunz, Öktem, Schönlieb, 2020][Bora et al, 2017][Romano et al, 2017][Ye Tan, ..., Schönlieb, 2024]
Proximal Gradient Descent: \( x^{k+1} = \text{prox}_R \left(x^k - \eta A^T(Ax^k-y)\right) \)
What if we don't know \(R(x)\), or \(\text{prox}_R\)?
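For reference, when \(R\) is known the iteration is easy to run. A NumPy sketch with \(R(x) = \lambda\|x\|_1\), whose prox is soft-thresholding (a toy compressed-sensing example, not the QSM setting):

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau*||.||_1: entrywise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pgd_l1(A, y, lam=0.1, n_iter=200):
    """Proximal gradient descent for 0.5||Ax - y||^2 + lam*||x||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L, L = sigma_max(A)^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - eta * grad, eta * lam)
    return x

# Example: recover a sparse vector from compressed measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100)) / np.sqrt(50)
x_true = np.zeros(100); x_true[rng.choice(100, 5, replace=False)] = 1.0
x_hat = pgd_l1(A, A @ x_true, lam=0.05)
```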
Train via:
Collect data:
pick a function class
[Lai, Aggarwal, van Zijl, Li, Sulam, Learned Proximal Networks for Quantitative Susceptibility Mapping, MICCAI 2020]
unseen angles during training
[Lai, Aggarwal, van Zijl, Li, Sulam, Learned Proximal Networks for Quantitative Susceptibility Mapping, MICCAI 2020]
What are these networks actually computing?
Proximal Gradient Descent: \( x^{t+1} = \text{prox}_R \left(x^t - \eta A^T(Ax^t-y)\right) \)
... a denoiser
any state-of-the-art neural network denoiser
[Venkatakrishnan et al., 2013; Zhang et al., 2017b; Meinhardt et al., 2017; Zhang et al., 2021; Kamilov et al., 2023b; Terris et al., 2023]
[Gilton, Ongie, Willett, 2019]
Proximal Gradient Descent: \( x^{t+1} = {\color{red}f_\theta} \left(x^t - \eta A^T(Ax^t-y)\right) \)
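Schematically, the PnP iteration only swaps the prox for a denoiser. A NumPy sketch where `denoise` is a crude smoothing stand-in for \(f_\theta\) (any pretrained denoiser could be dropped in its place):

```python
import numpy as np

def denoise(z, strength=0.5):
    # Stand-in for f_theta: shrinkage toward a local mean, purely for illustration.
    return (1 - strength) * z + strength * np.convolve(z, np.ones(3) / 3, mode="same")

def pnp_pgd(A, y, n_iter=100):
    """PnP-PGD: x^{t+1} = f_theta(x^t - eta * A^T (A x^t - y))."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = denoise(x - eta * A.T @ (A @ x - y))
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 80)) / np.sqrt(40)
x_hat = pnp_pgd(A, A @ rng.standard_normal(80))
```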
Question 1)
When will \(f_\theta(x)\) compute a \(\text{prox}_R(x)\)? And for what \(R(x)\)?
\(\mathcal H_\text{prox} = \{f : \text{prox}_R~ \text{for some }R\}\)
\(\mathcal H = \{f: \mathbb R^n \to \mathbb R^n\}\)
Question 1)
When will \(f_\theta(x)\) compute a \(\text{prox}_R(x)\)? And for what \(R(x)\)?
Question 2)
Can we estimate the "correct" prox?
\(\mathcal H_\text{prox} = \{f : \text{prox}_R~ \text{for some }R\}\)
\(\mathcal H = \{f: \mathbb R^n \to \mathbb R^n\}\)
Question 1)
When will \(f_\theta(x)\) compute a \(\text{prox}_R(x)\)?
Theorem [Gribonval & Nikolova, 2020]
\( f(x) \in \text{prox}_R(x) ~\Leftrightarrow~ \exists~ \text{convex l.s.c. } \psi: \mathbb R^n\to\mathbb R : f(x) \in \partial \psi(x)\)
Question 1)
When will \(f_\theta(x)\) compute a \(\text{prox}_R(x)\)?
\(R(x)\) need not be convex!
Theorem [Gribonval & Nikolova, 2020]
Take \(f_\theta(x) = \nabla \psi_\theta(x)\) for convex (and differentiable) \(\psi_\theta\)
\( f(x) \in \text{prox}_R(x) ~\Leftrightarrow~ \exists~ \text{convex l.s.c. } \psi: \mathbb R^n\to\mathbb R : f(x) \in \partial \psi(x)\)
Given \(f_\theta(x)\), we can compute \(R(x)\) via an LP
If so, can we know for which \(R(x)\)?
Yes
[Gribonval & Nikolova, 2020]
Easy! \[{\color{grey}y^* =} \arg\min_{y} \psi(y) - \langle y,x\rangle {\color{grey}= \hat{f}_\theta^{-1}(x)}\]
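A sketch of both directions in PyTorch: take a hand-written convex, differentiable potential \(\psi\) standing in for a learned \(\psi_\theta\), use \(f = \nabla\psi\) (hence a prox, by the theorem), and invert it with the convex program above (plain gradient descent is used here for simplicity):

```python
import torch

def psi(x):
    # A convex, differentiable toy potential (stand-in for a learned convex psi_theta).
    return 0.5 * (x ** 2).sum() + torch.logsumexp(x, dim=-1)

def f(x):
    """f(x) = grad psi(x): guaranteed to be a proximal operator of some R."""
    x = x.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(psi(x), x)
    return g

def f_inverse(z, n_steps=500, lr=0.1):
    """Invert f by solving  argmin_y  psi(y) - <y, z>  (convex in y)."""
    y = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(n_steps):
        loss = psi(y) - (y * z).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return y.detach()

x = torch.randn(5)
z = f(x)
x_rec = f_inverse(z)   # should approximately recover x, since grad psi is invertible
```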
Question 2)
Could we have \(R(x) = -\log p_x(x)\)?
(we don't know \(p_x\)!)
i.e. \(f_\theta(y) = \text{prox}_R(y) = \texttt{MAP}(x|y)\)
Which loss function?
i.e. \(f_\theta(y) = \text{prox}_R(y) = \texttt{MAP}(x|y)\)
Theorem (informal)
Training \(f_\theta\) with the proximal matching loss recovers, as its parameter \(\gamma \to 0\), the proximal operator of \(R(x) = -\log p_x(x)\), i.e. the MAP denoiser.
Question 2)
Could we have \(R(x) = -\log p_x(x)\)?
(we don't know \(p_x\)!)
Fang, Buchanan & S. What's in a Prior? Learned Proximal Networks for Inverse Problems. ICLR 2024.
What parts of the image are important for this prediction?
What are the subsets of the input so that
efficiency
nullity
symmetry
exponential complexity
Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
Let \((N, v)\) be an \(n\)-person cooperative game with characteristic function \(v: 2^N \to \mathbb{R}\)
How important is each player for the outcome of the game?
inputs
responses
predictor
How important is feature \(x_i\) for \(f(x)\)?
\(X_{S_j^c}\sim \mathcal D_{X_{S_j}={x_{S_j}}}\)
Scott Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions, NeurIPS, 2017
inputs
responses
How important is feature \(x_i\) for \(f(x)\)?
predictor
\(X_{S_j^c}\sim \mathcal D_{X_{S_j}={x_{S_j}}}\)
Scott Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions, NeurIPS, 2017
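For a handful of features the Shapley values can be computed exactly by enumerating coalitions, which also makes the exponential cost explicit. A NumPy sketch in which the set function `value` replaces out-of-coalition features with a zero baseline (a simplifying choice; the conditional distribution \(\mathcal D\) above is what one would ideally sample from):

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_values(value, n):
    """Exact Shapley values of an n-player game with set function `value`."""
    phi = np.zeros(n)
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for k in range(n):
            for S in combinations(others, k):
                w = 1.0 / (n * comb(n - 1, k))   # Shapley weight |S|!(n-|S|-1)!/n!
                phi[j] += w * (value(set(S) | {j}) - value(set(S)))
    return phi

# Toy game: f is a linear model, out-of-coalition features set to 0 (baseline).
rng = np.random.default_rng(0)
w_true, x = rng.standard_normal(4), rng.standard_normal(4)
def value(S):
    masked = np.array([x[i] if i in S else 0.0 for i in range(4)])
    return float(w_true @ masked)

print(shapley_values(value, 4))   # for a linear model with zero baseline: phi_j = w_j * x_j
```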
Question 1)
Can we resolve the computational bottleneck (and when)?
Question 2)
What do these coefficients mean, really?
Question 3)
How to go beyond input-features explanations?
We focus on data with certain structure:
Example:
\(f(x)=1\) if \(x\) contains a sick cell
Question 1) Can we resolve the computational bottleneck (and when)?
hierarchical Shap runs in linear time
Under A1, h-Shap \(\to\) Shapley
[Teneggi, Luster & S., IEEE TPAMI, 2022]
We focus on data with certain structure:
Example:
\(f(x)=1\) if \(x\) contains a sick cell
Question 1) Can we resolve the computational bottleneck (and when)?
hierarchical Shap runs in linear time
Under A1, h-Shap \(\to\) Shapley
[Teneggi, Luster & S., IEEE TPAMI, 2022]
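A schematic of the hierarchical idea (not the full h-Shap implementation): split the image into quadrants, treat them as the players of a small 4-player game, and recurse only into quadrants whose Shapley value exceeds a threshold. The masking convention here (zero baseline outside the revealed quadrants) is a simplification:

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_small(value, n):
    """Exact Shapley values for a small n-player game (n = 4 here)."""
    phi = np.zeros(n)
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for k in range(n):
            for S in combinations(others, k):
                w = 1.0 / (n * comb(n - 1, k))
                phi[j] += w * (value(set(S) | {j}) - value(set(S)))
    return phi

def h_shap(f, x, region, min_size=8, tau=0.0):
    """Recursively attribute importance to quadrants of `region` = (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = region
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [(r0, rm, c0, cm), (r0, rm, cm, c1), (rm, r1, c0, cm), (rm, r1, cm, c1)]

    def value(S):
        z = np.zeros_like(x)                       # baseline: everything masked
        for i in S:                                # reveal quadrants in the coalition
            a, b, c, d = quads[i]
            z[a:b, c:d] = x[a:b, c:d]
        return f(z)

    phi = shapley_small(value, 4)
    out = []
    for i, q in enumerate(quads):
        if phi[i] > tau:
            if q[1] - q[0] <= min_size:
                out.append((q, phi[i]))            # important leaf region
            else:
                out.extend(h_shap(f, x, q, min_size, tau))
    return out

# Toy example: f fires if any "bright" pixel is revealed (e.g., a sick cell).
x = np.zeros((64, 64)); x[10, 50] = 1.0
f = lambda z: float(z.max())
print(h_shap(f, x, (0, 64, 0, 64)))
```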
Question 2) What do these coefficients mean, really?
[Candes et al, 2018]
Question 2) What do these coefficients mean, really?
XRT: eXplanation Randomization Test
returns a \(\hat{p}_{i,S}\) for the test above
Theorem (informal)
Given the Shapley coefficient of any feature \(i\), and the (expected) p-values \(\hat{p}_{i,S}\) obtained for the tests above, the coefficient is directly related to these p-values: in particular, a large Shapley coefficient implies small expected p-values (i.e. rejection of \(H_0^{i,S}\)) for some coalitions \(S\).
Teneggi, Bharti, Romano, and S. "SHAP-XRT: The Shapley Value Meets Conditional Independence Testing." TMLR (2023).
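The XRT p-value can be approximated by resampling, in the spirit of the conditional randomization test; the sketch below replaces the conditional law of the held-out features with independent resampling from a reference set, which is a substantial simplification of the test in the paper:

```python
import numpy as np

def xrt_pvalue(f, x, j, S, X_ref, n_draws=200, rng=None):
    """Randomization-test-style p-value for H_0^{j,S}: adding feature j to the
    coalition S does not change f's output when the rest is resampled."""
    rng = rng or np.random.default_rng(0)

    def draw(keep):
        idx = sorted(keep)
        z = X_ref[rng.integers(len(X_ref))].copy()   # resample held-out features
        z[idx] = x[idx]                              # fix the observed coalition
        return z

    t_obs = f(draw(S | {j}))                          # statistic with feature j fixed
    t_null = np.array([f(draw(S)) for _ in range(n_draws)])   # feature j resampled
    return (1 + np.sum(t_null >= t_obs)) / (n_draws + 1)

# Toy usage: f depends only on feature 0.
rng = np.random.default_rng(1)
X_ref = rng.standard_normal((500, 5))
f = lambda z: z[0]
x = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
print(xrt_pvalue(f, x, j=0, S=set()))    # small p-value: feature 0 matters
print(xrt_pvalue(f, x, j=1, S={0}))      # large p-value: feature 1 does not
```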
Question 3)
How to go beyond input-features explanations?
Is the piano important for \(\hat Y = \text{cat}\) given that there is a cute mammal?
Question 3) How to go beyond input-features explanations?
semantics \(Z = c^TH\)
embeddings \(H = f(X)\)
predictions \(\hat{Y} = g(H)\)
Concept Bottleneck Models (CBM)
[Koh et al '20, Yang et al '23, Yuan et al '22, Yuksekgonul '22 ]
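A schematic of the pipeline in the notation above (PyTorch; the dimensions, the linear backbone, and the concept directions `c` are all placeholders):

```python
import torch
import torch.nn as nn

# Schematic concept pipeline: embeddings -> concept scores and predictions.
d_emb, n_concepts, n_classes = 512, 10, 5

f = nn.Linear(3 * 32 * 32, d_emb)           # embeddings  H = f(X)
c = torch.randn(d_emb, n_concepts)          # concept directions (e.g., text embeddings)
g = nn.Linear(d_emb, n_classes)             # predictions Y_hat = g(H)

X = torch.randn(8, 3 * 32 * 32)             # a batch of (flattened) images
H = f(X)                                    # embeddings
Z = H @ c                                   # semantics  Z = c^T H  (concept scores)
Y_hat = g(H)                                # class logits
```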
semantic XRT
\[H^{j,S}_0:~g(\widetilde{H}_{S \cup \{j\}}) \overset{d}{=} g(\widetilde{H}_S),\quad\widetilde{H}_C \sim P_{H | Z_C = z_C}\]
"The classifier (its distribution) does not change if we condition
on concepts \(S\) vs on concepts \(S\cup\{j\} \)"
semantics \(Z = c^TH\)
embeddings \(H = f(X)\)
predictions \(\hat{Y} = g(H)\)
semantic XRT
\[H^{j,S}_0:~g(\widetilde{H}_{S \cup \{j\}}) \overset{d}{=} g(\widetilde{H}_S),\quad\widetilde{H}_C \sim P_{H | Z_C = z_C}\]
"The classifier (its distribution) does not change if we condition
on concepts \(S\) vs on concepts \(S\cup\{j\} \)"
\(\hat{Y}_\text{gas pump}\)
\(Z_S\cup Z_{j}\)
\(Z_{S}\)
semantic XRT
\[H^{j,S}_0:~g(\widetilde{H}_{S \cup \{j\}}) \overset{d}{=} g(\widetilde{H}_S),\quad\widetilde{H}_C \sim P_{H | Z_C = z_C}\]
"The classifier (its distribution) does not change if we condition
on concepts \(S\) vs on concepts \(S\cup\{j\} \)"
\(\hat{Y}_\text{gas pump}\)
\(\hat{Y}_\text{gas pump}\)
\(Z_S\cup Z_{j}\)
\(Z_{S}\)
\(Z_S\cup Z_{j}\)
\(Z_{S}\)
semantic XRT
\[H^{j,S}_0:~g(\widetilde{H}_{S \cup \{j\}}) \overset{d}{=} g(\widetilde{H}_S),\quad\widetilde{H}_C \sim P_{H | Z_C = z_C}\]
[Shaer et al., 2023; Shekhar and Ramdas, 2023]
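A toy version of testing by betting: wager on the statistic computed with concept \(j\) exceeding the one without it, and reject once the wealth crosses \(1/\alpha\). The payoff and bet size below are simplified placeholders; under exchangeability of the two statistics the wealth is a nonnegative martingale, so Ville's inequality controls the type-I error:

```python
import numpy as np

def betting_test(pairs, alpha=0.05, bet=0.2):
    """Sequential test by betting. `pairs` yields (t_with, t_without): a statistic
    computed with concept j included vs. excluded. Wealth grows if including the
    concept systematically increases the statistic; reject H0 once wealth >= 1/alpha."""
    wealth = 1.0
    for t, (t_with, t_without) in enumerate(pairs, start=1):
        payoff = np.sign(t_with - t_without)     # in {-1, 0, +1}
        wealth *= 1.0 + bet * payoff             # fair game when the pair is exchangeable
        if wealth >= 1.0 / alpha:
            return "reject", t                   # rejection time
    return "fail to reject", None

# Toy usage: the "important concept" stream is biased upward, the other is not.
rng = np.random.default_rng(0)
important = [(x + 0.5, x) for x in rng.standard_normal(500)]
unimportant = [(x, y) for x, y in rng.standard_normal((500, 2))]
print(betting_test(important))      # rejects quickly (early rejection time)
print(betting_test(unimportant))    # typically fails to reject
```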
Important Semantic Concepts
(Reject \(H_0\))
Unimportant Semantic Concepts
(fail to reject \(H_0\))
rejection rate
rejection time
Exciting open problems in making AI tools safe, trustworthy, and interpretable
The importance of clear definitions and guarantees
* Fang, Z., Buchanan, S., & J.S. (2023). What's in a Prior? Learned Proximal Networks for Inverse Problems. International Conference on Learning Representations.
* Teneggi, J., Luster, A., & J.S. (2022). Fast hierarchical games for image explanations. IEEE Transactions on Pattern Analysis and Machine Intelligence.
* Teneggi, J., B. Bharti, Y. Romano, and J.S. (2023). SHAP-XRT: The Shapley Value Meets Conditional Independence Testing. Transactions on Machine Learning Research.
* Teneggi, J. and J.S. I Bet You Did Not Mean That: Testing Semantic Importance via Betting. NeurIPS 2024 (to appear).
Is the model fair?
Pneumonia
Clear
Does your model achieve a \(\Delta_{\text{TPR}}\) of at most (say) 6%?
Pneumonia
Clear
Tight upper bounds to fairness violations
(optimally) Actionable
Bharti, B., Yi, P., & Sulam, J. (2023). Estimating and Controlling for Equalized Odds via Sensitive Attribute Predictors. NeurIPS 2023.
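For concreteness, the quantity being controlled is the true-positive-rate gap across groups. A NumPy sketch with an observed binary group attribute (the cited work addresses the harder case where the attribute must itself be predicted):

```python
import numpy as np

def tpr_gap(y_true, y_pred, group):
    """Delta_TPR: |TPR(group=1) - TPR(group=0)| among truly positive cases."""
    tprs = []
    for g in (0, 1):
        pos = (y_true == 1) & (group == g)
        tprs.append(y_pred[pos].mean())
    return abs(tprs[1] - tprs[0])

# Toy check against a 6% tolerance.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
y_pred = (y_true & (rng.random(1000) > 0.1 * group)).astype(int)  # worse TPR in group 1
print(tpr_gap(y_true, y_pred, group) <= 0.06)
```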
Pneumonia
Clear
For an observation \(y\)
\[y = x + \epsilon,~\epsilon \sim \mathcal{N}(0, \sigma^2\mathbb{I})\]
reconstruct \(x\) with
\[\hat{x} = F(y) \sim \mathcal{Q}_y \approx p(x \mid y)\]
\(x\)
\(y\)
\(F(y)\)
Lemma
If \(\mathcal{I}(y)_j = \left[ \frac{\lfloor(m+1)Q_{\alpha/2}(y_j)\rfloor}{m} , \frac{\lceil(m+1)Q_{1-\alpha/2}(y_j)\rceil}{m}\right]\), then \(\mathcal I(y)\) provides entrywise coverage for pixel \(j\), i.e.
\[\mathbb{P}\left[\text{next sample}_j \in \mathcal{I}(y)_j\right] \geq 1 - \alpha\]
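Building the entrywise intervals from \(m\) posterior samples amounts to taking per-pixel sample quantiles. A NumPy sketch (using `np.quantile` rather than the lemma's exact order-statistic indices):

```python
import numpy as np

def entrywise_intervals(samples, alpha=0.1):
    """Per-pixel intervals from m posterior samples of shape (m, H, W):
    lower/upper are (roughly) the alpha/2 and 1 - alpha/2 sample quantiles."""
    lower = np.quantile(samples, alpha / 2, axis=0)
    upper = np.quantile(samples, 1 - alpha / 2, axis=0)
    return lower, upper

# Toy usage: m = 128 samples of a 16x16 "image" posterior.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=0.1, size=(128, 16, 16)).clip(0, 1)
low, up = entrywise_intervals(samples)
print(low.shape, up.shape, float((up - low).mean()))   # interval lengths |I(y)_j|
```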
[Diagram: entrywise interval \(\mathcal I(y) = [\,l(y),\, u(y)\,] \subset [0,1]\)]
(distribution free)
[Figure: ground truth \(x\), measurement \(y\), lower/upper bounds, and interval lengths \(|\mathcal I(y)_j|\)]
[Diagram: the ground truth \(x_j\) is contained in \(\mathcal I(y)_j \subset [0,1]\)]
Procedure: for pixel \(j\), set
\[\mathcal{I}_{\lambda}(y)_j = [\,\text{low} - \lambda,\ \text{up} + \lambda\,]\]
choose
\[\hat{\lambda} = \inf\{\lambda \in \mathbb{R}:~\forall \lambda' \geq \lambda,~\text{risk}(\lambda') \leq \epsilon\}\]
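A sketch of this calibration on held-out pairs: sweep \(\lambda\) from large to small and keep the smallest value whose upper-bounded risk stays below \(\epsilon\). The Hoeffding-style bound below is one standard choice and may differ from the one used in the cited work:

```python
import numpy as np

def calibrate_lambda(low, up, x, eps=0.1, delta=0.1, lambdas=np.linspace(0, 1, 201)):
    """RCPS-style calibration: low, up, x have shape (n_cal, d).
    risk(lambda) = expected fraction of pixels outside [low - lambda, up + lambda]."""
    n = low.shape[0]
    lam_hat = lambdas[-1]
    for lam in lambdas[::-1]:                       # scan from largest to smallest
        miss = (x < low - lam) | (x > up + lam)
        risk = miss.mean(axis=1).mean()             # empirical risk over calibration set
        ucb = risk + np.sqrt(np.log(1 / delta) / (2 * n))   # Hoeffding upper bound
        if ucb > eps:
            break                                   # stop before the bound is violated
        lam_hat = lam
    return lam_hat

# Toy usage with synthetic calibration data.
rng = np.random.default_rng(0)
x = rng.random((200, 256))
low, up = x - rng.random((200, 256)) * 0.05, x + rng.random((200, 256)) * 0.05
print(calibrate_lambda(low, up, x))
```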
[Diagram: after adding \(\lambda\), the ground truth \(x_j\) is contained in \(\mathcal I_\lambda(y)_j \subset [0,1]\)]
Definition For risk level \(\epsilon\) and failure probability \(\delta\), \(\mathcal{I}(y)\) is an RCPS if
\[\mathbb{P}\left[\mathbb{E}\left[\text{fraction of pixels not in intervals}\right] \leq \epsilon\right] \geq 1 - \delta\]
scalar \(\lambda \in \mathbb{R}\): \(\mathcal{I}_{\lambda}(y)_j = [\text{low} - \lambda, \text{up} + \lambda]\) \(~\rightarrow~\) vector \(\bm{\lambda} \in \mathbb{R}^d\): \(\mathcal{I}_{\bm{\lambda}}(y)_j = [\text{low} - \lambda_j, \text{up} + \lambda_j]\)
Guarantee: \(\mathcal{I}_{\bm{\lambda}}(y)_j = [\text{low} - \lambda_j, \text{up} + \lambda_j]\) are RCPS
For a \(K\)-partition of the pixels \(M \in \{0, 1\}^{d \times K}\)
(partitions shown for \(K=4\), \(K=8\), \(K=32\))
scalar \(\lambda \in \mathbb{R}\): \(\mathcal{I}_{\lambda}(y)_j = [\text{low} - \lambda, \text{up} + \lambda]\) \(~\rightarrow~\) vector \(\bm{\lambda} \in \mathbb{R}^d\): \(\mathcal{I}_{\bm{\lambda}}(y)_j = [\text{low} - \lambda_j, \text{up} + \lambda_j]\)
1. Solve
\[\tilde{\bm{\lambda}}_K = \arg\min~\sum_{k \in [K]}n_k\lambda_k~\quad\text{s.t. empirical risk} \leq \epsilon\]
2. Choose
\[\hat{\beta} = \inf\{\beta \in \mathbb{R}:~\forall \beta' \geq \beta,~\text{risk}(M\tilde{\bm{\lambda}}_K + \beta') \leq \epsilon\}\]
\(\hat{\bm{\lambda}}_K = M\tilde{\bm{\lambda}}_K + \hat{\beta}\)
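Step 2 is a one-dimensional calibration over the offset \(\beta\), just like the scalar case; the LP of step 1 is not reproduced here and \(\tilde{\bm\lambda}_K\) is taken as given. A NumPy sketch (again with a Hoeffding-style bound as an assumption):

```python
import numpy as np

def calibrate_beta(low, up, x, lam_tilde, M, eps=0.1, delta=0.1,
                   betas=np.linspace(0, 1, 201)):
    """Step 2 of the two-step procedure: choose a scalar offset beta on top of the
    per-group widths lam_tilde (length K), mapped to pixels by the partition M (d x K)."""
    lam_pix = M @ lam_tilde                          # per-pixel widths, shape (d,)
    n = low.shape[0]
    beta_hat = betas[-1]
    for beta in betas[::-1]:                         # scan from largest to smallest
        miss = (x < low - lam_pix - beta) | (x > up + lam_pix + beta)
        ucb = miss.mean(axis=1).mean() + np.sqrt(np.log(1 / delta) / (2 * n))
        if ucb > eps:
            break
        beta_hat = beta
    return beta_hat

# Toy usage: d = 64 pixels, K = 4 groups of 16 pixels each.
rng = np.random.default_rng(0)
d, K, n_cal = 64, 4, 100
M = np.kron(np.eye(K), np.ones((d // K, 1)))         # (d, K) partition mask
x = rng.random((n_cal, d))
low, up = x - 0.02, x + 0.02
lam_tilde = rng.random(K) * 0.01
print(calibrate_beta(low, up, x, lam_tilde, M))
```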
conformalized uncertainty maps (\(K=4\), \(K=8\))
\[\mathbb{P}\left[\mathbb{E}\left[\text{fraction of pixels not in intervals}\right] \leq \epsilon\right] \geq 1 - \delta\]
Teneggi, J., Tivnan, M., Stayman, W., & Sulam, J. (2023, July). How to trust your diffusion model: A convex optimization approach to conformal risk control. In International Conference on Machine Learning. PMLR.
Zhenghan Fang
JHU
Jacopo Teneggi
JHU
Beepul Bharti
JHU
Sam Buchanan
TTIC
Yaniv Romano
Technion
Convergence guarantees for PnP
Theorem (PGD with Learned Proximal Networks)
Let \(f_\theta = \text{prox}_{\hat{R}}\) {\color{grey}(with \(\alpha>0\))} with smooth activations, and \(0<\eta<1/\sigma_{\max}(A)\). Then the PnP-PGD iterates \(x^t\) converge.
(Analogous results hold for ADMM)
Convergence guarantees for PnP
Convergence guarantees for PnP
Theorem (PGD with Learned Proximal Networks)
Let \(f_\theta = \text{prox}_{\hat{R}}\) {\color{grey}(with \(\alpha>0\))} with smooth activations, and \(0<\eta<1/\sigma_{\max}(A)\). Then the PnP-PGD iterates \(x^t\) converge.
(Analogous results hold for ADMM)
Convergence guarantees for PnP
inputs
responses
predictor
inputs
responses
predictor
inputs
responses
predictor
Scott Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions, NeurIPS, 2017
efficiency
nullity
symmetry
exponential complexity
inputs
responses
predictor
We focus on data with certain structure:
h-Shap runs in linear time
Under A1, h-Shap \(\to\) Shapley
Teneggi, Luster & S. Fast hierarchical games for image explanations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
[Chattopadhyay et al, 2024]
Image-by-image supervision (strong learner)
true/false
Image-by-image supervision (strong learner)
Study/volume supervision (weak learner)
true/false
true/false
Both methods perform comparably for case-level screening
Teneggi, J., Yi, P. H., & Sulam, J. (2023). Examination-level supervision for deep learning–based intracranial hemorrhage detection at head CT. Radiology: Artificial Intelligence, e230159.
Weak learner is more efficient for detecting positive slices
training labels
Teneggi, J., Yi, P. H., & Sulam, J. (2023). Examination-level supervision for deep learning–based intracranial hemorrhage detection at head CT. Radiology: Artificial Intelligence, e230159.
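One common way to train with examination-level labels only is multiple-instance learning: score each slice, aggregate to a study-level prediction (here by a max), and backpropagate the study-level loss. A PyTorch sketch; the backbone and aggregation are illustrative choices, not necessarily those of the cited paper:

```python
import torch
import torch.nn as nn

# Per-slice scorer (a stand-in backbone; real models would use a CNN).
slice_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128),
                              nn.ReLU(), nn.Linear(128, 1))

def study_logit(slices):
    """slices: (n_slices, 1, 64, 64) -> one study-level logit via max pooling."""
    slice_logits = slice_encoder(slices).squeeze(-1)    # per-slice scores
    return slice_logits.max()                           # weak (study-level) aggregation

bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(slice_encoder.parameters(), lr=1e-4)
for _ in range(10):                                     # toy training loop
    slices = torch.randn(32, 1, 64, 64)                 # one study's CT slices
    label = torch.tensor(1.0)                           # study-level label only
    loss = bce(study_logit(slices), label)
    opt.zero_grad(); loss.backward(); opt.step()
```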