June 2025
first CT scan
ELECTRIC & MUSICAL INDUSTRIES
imaging
diagnostics
data-driven imaging
automatic analysis and recommendations
societal implications
Data
Compute & Hardware
Sensors & Connectivity
Research & Engineering
Inputs (features): \(X\in \mathcal X \subset \mathbb R^d\)
Responses (labels): \(Y\in \{0,1\}\)
Sensitive attributes: \(Z \in \mathbb R^k \) (sex, race, age, etc)
Random variables sampled: \((X,Y,Z) \sim \mathcal D\)
E.g.: \(Z_1: \) biological sex, \(X_1: \) BMI; then
\( g(Z,X) = \boldsymbol{1}\{Z_1 = 1 ~\texttt{and}~ X_1 > 35 \}: \) women with BMI > 35
Goal: ensure that \(f\) is fair w.r.t groups \(g \in \mathcal G\)
Group memberships \( \mathcal G = \{ g: (X,Z) \mapsto \{0,1\} \} \)
Predictor \( f(X) : \mathcal X \to [0,1]\) (e.g. likelihood of X having disease Y)
Equal Opportunity
\(\mathbb P[\hat Y=1 \mid Y=1, G_{\texttt{age}>60}=0] = \mathbb P[\hat Y = 1 \mid Y=1, G_{\texttt{age}>60}=1]\)
\(\Delta \text{TPR}_\text{age} = \left| \text{TPR}_{\texttt{age}\leq 60} - \text{TPR}_{\texttt{age}>60}\right| \leq \alpha \)
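The equal-opportunity check above can be sketched as a few lines of numpy (a minimal illustration; the function name and toy data are hypothetical):

```python
import numpy as np

def tpr_gap(y_true, y_pred, group):
    """Empirical Delta-TPR between group == 0 and group == 1 (binary labels)."""
    tprs = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)   # positives within the group
        tprs.append(y_pred[mask].mean())       # empirical TPR for the group
    return abs(tprs[0] - tprs[1])

# toy data: group = 1{age > 60}
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 1, 0])
group  = np.array([0, 0, 1, 1, 0, 1])
gap = tpr_gap(y_true, y_pred, group)   # |0.5 - 1.0| = 0.5
```

The fairness criterion then simply asks that `gap <= alpha`.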
Multiaccuracy
\(\text{MA} (f,g) = \big| \mathbb E [ g(X,Z) (f(X) - Y) ] \big| \)
\(f\) is \(\alpha\)-multiaccurate if \( \max_{g\in\mathcal G} \text{MA}(f,g) \leq \alpha \)
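The empirical multiaccuracy audit is a direct plug-in of the definition (a minimal sketch; function names and the toy subgroup are hypothetical):

```python
import numpy as np

def multiaccuracy(f_vals, y, g_vals):
    """Empirical MA(f, g) = | E[ g(X, Z) * (f(X) - Y) ] |."""
    return abs(np.mean(g_vals * (f_vals - y)))

def worst_group_ma(f_vals, y, groups):
    """max_{g in G} MA(f, g); f is alpha-multiaccurate if this is <= alpha."""
    return max(multiaccuracy(f_vals, y, g) for g in groups)

f_vals = np.array([0.5, 0.5, 0.5, 0.5])    # a constant predictor
y      = np.array([1.0, 0.0, 1.0, 1.0])
g_all  = np.ones(4)                         # whole population
g_sub  = np.array([1.0, 0.0, 1.0, 1.0])    # a subgroup indicator g(X, Z)
ma = worst_group_ma(f_vals, y, [g_all, g_sub])   # 0.375, driven by the subgroup
```

Note how the subgroup exposes a larger signed residual than the population average, which is exactly what multiaccuracy penalizes.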
Example: predicting high risk of complications from flu based on clinical features
Observation:
Evaluating fairness notions requires samples of \((X,Y,Z)\)
Problem: This is not always possible...
We observe samples of \((X,Y)\), with predictions \(\hat Y = f(X)\) of \(Y\)
\( \text{MSE}(f) = \mathbb E [(Y-f(X))^2 ] \)
A developer provides us with proxies \( \color{Red} \hat{g} : \mathcal X \to \{0,1\} \)
\( \text{err}(\hat g) = \mathbb P [{\color{Red}\hat g(X)} \neq {\color{blue}g(X,Z)} ] \)
Can we use \(\hat g\) to measure (and correct) for fairness metrics?
[Awasthi et al. '21][Kallus et al. '22][Zhu et al. '23][Bharti et al. '24]
Theorem [Bharti, Clemens-Sewall, Yi, Sulam]
With access to \((X,Y)\sim \mathcal D_{\mathcal{XY}}\), proxies \( \hat{\mathcal G}\) and predictor \(f\)
\[ \max_{\color{Blue}g\in\mathcal G} \text{MA}(f,{\color{blue}g}) ~\leq ~\max_{\color{red}\hat g\in \hat{\mathcal{G}} } \text{MA}(f,{\color{red}\hat{g}}) + B(f,{\color{red}\hat g}) \]
with \(B(f,\hat g) = \min \left( \text{err}(\hat g), \sqrt{\text{MSE}(f)\cdot \text{err}(\hat g)} \right) \)
true error
worst possible error
[Gopalan et al. (2022)][Roth (2022)][Bharti et al (2025)]
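The bound in the theorem is cheap to evaluate once the proxy audit is done; a minimal sketch (function name and numbers are hypothetical):

```python
import numpy as np

def proxy_bound(ma_proxy, err_proxy, mse):
    """Upper bound on the true worst-case multiaccuracy from proxy groups:
       max_g MA(f, g) <= max_ghat MA(f, ghat) + min(err, sqrt(MSE * err))."""
    return ma_proxy + min(err_proxy, np.sqrt(mse * err_proxy))

# e.g.: proxy audit gives MA <= 0.02, proxies err 1% of the time, MSE(f) = 0.09
bound = proxy_bound(ma_proxy=0.02, err_proxy=0.01, mse=0.09)
# min(0.01, sqrt(0.09 * 0.01)) = min(0.01, 0.03) = 0.01, so bound = 0.03
```

The `min` reflects the theorem's two regimes: for accurate predictors the \(\sqrt{\text{MSE}\cdot\text{err}}\) term can be much smaller than the raw proxy error.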
CheXpert: Predicting abnormal findings in chest X-rays
(not accessing race or biological sex)
\(f(X): \) likelihood of \(X\) having \(\texttt{pleural effusion}\)
Take-home message
"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. [...] Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. [...]
We want AI agents that can discover like we can, not which contain what we have discovered."
The Bitter Lesson, Rich Sutton 2019
Predictor \(f(x)\) trained to predict \(\texttt{sick/healthy}\)
efficiency
nullity
symmetry
exponential complexity
Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
Let \(G = ([n],f)\) be an \(n\)-person cooperative game with characteristic function \(f:\mathcal P([n])\to \mathbb R\)
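Exact Shapley values illustrate the exponential-complexity bullet above: each player requires a sum over all coalitions of the others. A minimal sketch (names hypothetical; the toy game mirrors the "contains a sick cell" example):

```python
from itertools import combinations
from math import comb

def shapley_values(n, f):
    """Exact Shapley values for the game ([n], f), f: frozenset(S) -> value.
       phi_i = sum over S subset of [n]\{i} of [f(S u {i}) - f(S)],
       each weighted by 1 / (n * C(n-1, |S|)). Cost: O(2^n) per player."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            w = 1.0 / (n * comb(n - 1, k))
            for S in combinations(others, k):
                S = frozenset(S)
                phi[i] += w * (f(S | {i}) - f(S))
    return phi

# toy game: value 1 iff player 0 participates (a single "sick cell")
f = lambda S: 1.0 if 0 in S else 0.0
phi = shapley_values(3, f)   # all credit to player 0
```

Efficiency holds by construction: the values sum to \(f([n]) - f(\emptyset)\).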
We focus on data with certain structure:
Example: \(f(x) = 1\) if \(x\) contains a sick cell
h-Shap runs in linear time
Under A1, h-Shap \(\to\) Shapley
\(\tilde{X}_i \sim \mathcal D_{X|X_i=x_i}\)
Fast hierarchical games for image explanations, Teneggi, Luster & S., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
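A heavily simplified sketch of the hierarchical idea (this is not the full h-Shap algorithm, which computes Shapley values of sibling regions at every split and conditions masked features on the data distribution; here we only show the recursion under an A1-style game where \(f(S)=1\) iff \(S\) contains an important feature):

```python
def hierarchical_search(f, features, tau=0.0):
    """Recursively split the feature set in half and descend only into halves
       whose game value exceeds tau. Under A1 (f(S) = 1 iff S contains an
       important feature), this finds important features with O(log n) game
       evaluations per feature instead of enumerating 2^n coalitions."""
    def recurse(region):
        if len(region) == 1:
            return list(region)                # leaf: attribute to this feature
        mid = len(region) // 2
        found = []
        for half in (region[:mid], region[mid:]):
            if f(frozenset(half)) > tau:       # this half explains the output
                found += recurse(half)
        return found
    return recurse(list(features))

f = lambda S: 1.0 if 2 in S else 0.0    # feature 2 is the "sick cell"
important = hierarchical_search(f, range(8))   # -> [2]
```

The linear-time claim on the slide corresponds to this pruned tree traversal: unimportant subtrees are discarded after a single game evaluation.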
Image-by-image supervision (strong learner)
true/false
Study/volume supervision (weak learner)
true/false
training labels
Teneggi, J., Yi, P. H., & Sulam, J. (2023). Examination-level supervision for deep learning–based intracranial hemorrhage detection at head CT. Radiology: Artificial Intelligence.
measurements
reconstruction
Proximal Gradient Descent: \( x^{t+1} = \text{prox}_R \left(x^t - \eta A^\top(Ax^t-y)\right) \)
... a denoiser
\({\color{red}f_\theta}\): off-the-shelf denoiser
[Venkatakrishnan et al., 2013; Zhang et al., 2017b; Meinhardt et al., 2017; Zhang et al., 2021; Gilton, Ongie, Willett, 2019; Kamilov et al., 2023b; Terris et al., 2023; S Hurault et al. 2021, Ongie et al, 2020; ...]
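The Plug-and-Play substitution is literally one line: replace the proximal step by a denoiser. A minimal numpy sketch (names and toy data are hypothetical; with the identity denoiser the scheme reduces to plain gradient descent on \(\tfrac12\|Ax-y\|^2\)):

```python
import numpy as np

def pnp_pgd(A, y, denoiser, eta, iters=500):
    """Plug-and-Play proximal gradient descent:
       x <- denoiser(x - eta * A^T (A x - y)),
       i.e. prox_R is replaced by an off-the-shelf denoiser f_theta."""
    x = A.T @ y                                   # simple initialization
    for _ in range(iters):
        x = denoiser(x - eta * A.T @ (A @ x - y))
    return x

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
y = np.array([4.0, 3.0])
x_hat = pnp_pgd(A, y, denoiser=lambda z: z, eta=0.2)
# with the identity "denoiser", iterates converge to the least-squares
# solution [2, 3]
```

Swapping `denoiser` for a learned network is exactly the PnP scheme of the citations above; the convergence question is what the next slides address.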
Question 1)
What are these black-box functions computing? and what have they learned about the data?
Theorem [Fang, Buchanan, S.]
When will \(f_\theta(x)\) compute a \(\text{prox}_R(x)\), and for what \(R(x)\)?
Let \(f_\theta : \mathbb R^n\to\mathbb R^n\) be a network with \(f_\theta (x) = \nabla \psi_\theta (x)\),
where \(\psi_\theta : \mathbb R^n \to \mathbb R\) is convex and differentiable (an ICNN).
Then,
1. Existence of regularizer
\(\exists ~R_\theta : \mathbb R^n \to \mathbb R\), not necessarily convex, such that \(f_\theta(x) \in \text{prox}_{R_\theta}(x)\),
2. Computability
We can compute \(R_{\theta}(x)\) by solving a convex problem
How do we find \(f(x) = \text{prox}_R(x)\) for the "correct" \(R(x) \propto -\log p_x(x)\)?
Theorem [Fang, Buchanan, S.]
Proximal Matching Loss (with parameter \(\gamma\))
Goal: train a denoiser \(f(y)\approx x\)
Then, as \(\gamma \to 0\), the minimizer of the proximal matching loss computes \(\text{prox}_R\) with \(R(x) \propto -\log p_x(x)\), almost surely.
Fang, Buchanan & S. What's in a Prior? Learned Proximal Networks for Inverse Problems, ICLR 2024.
Fang, Buchanan & S. What's in a Prior? Learned Proximal Networks for Inverse Problems, ICLR 2024.
\(R_\theta(x) = 0.0\)
\(R_\theta(x) = 127.37\)
\(R_\theta(x) = 274.13\)
\(R_\theta(x) = 290.45\)
Understanding the learned model provides new insights:
Take-home message 1
\(R(\tilde{x})\)
via
Theorem (PGD with Learned Proximal Networks)
Let \(f_\theta = \text{prox}_{\hat{R}} {\color{grey}\text{ with } \alpha>0}, \text{ and } 0<\eta<1/\sigma_{\max}(A) \) with smooth activations
(Analogous results hold for ADMM)
Convergence guarantees for PnP
\[y = Ax + \epsilon,~\epsilon \sim \mathcal{N}(0, \sigma^2\mathbb{I})\]
\[\hat{x} = F(y) \sim \mathcal{P}_y\]
Hopefully \(\mathcal{P}_y \approx p(x \mid y)\), but not needed!
Question 3)
How much uncertainty is there in the samples \(\hat x \sim \mathcal P_y?\)
Question 4)
How far will the samples \(\hat x \sim \mathcal P_y\) be from the true \(x\)?
Lemma
Given \(m\) samples from \(\mathcal P_y\), let
\[\mathcal{I}(y)_j = \left[ Q_{y_j}\left(\frac{\lfloor(m+1)\alpha/2\rfloor}{m}\right), Q_{y_j}\left(\frac{\lceil(m+1)(1-\alpha/2)\rceil}{m}\right)\right]\]
Then \(\mathcal I(y)\) provides entrywise coverage for a new sample \(\hat x \sim \mathcal P_y\), i.e.
\[\mathbb{P}\left[\hat{x}_j \in \mathcal{I}(y)_j\right] \geq 1 - \alpha\]
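The lemma's intervals are per-coordinate empirical quantiles of the \(m\) samples. A minimal sketch (names hypothetical; `np.quantile` is used as an approximation of the order-statistic quantile \(Q\) in the lemma):

```python
import numpy as np

def entrywise_interval(samples, alpha=0.1):
    """Per-entry interval from m samples of shape (m, d):
       [ Q(floor((m+1) * alpha/2) / m), Q(ceil((m+1) * (1 - alpha/2)) / m) ],
       computed independently for each coordinate j."""
    m = samples.shape[0]
    lo_q = np.floor((m + 1) * alpha / 2) / m
    hi_q = np.ceil((m + 1) * (1 - alpha / 2)) / m
    lo = np.quantile(samples, min(lo_q, 1.0), axis=0)
    hi = np.quantile(samples, min(hi_q, 1.0), axis=0)
    return lo, hi

rng = np.random.default_rng(0)
samples = rng.uniform(size=(99, 4))    # m = 99 samples of a 4-pixel "image"
lo, hi = entrywise_interval(samples, alpha=0.1)
```

Each new sample \(\hat x \sim \mathcal P_y\) then lands in \([\,\texttt{lo}_j, \texttt{hi}_j\,]\) with probability at least \(1-\alpha\), per coordinate.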
Interval \(\mathcal{I}(y) = [\,l(y),\, u(y)\,] \subseteq [0,1]\)
(distribution free)
cf [Feldman, Bates, Romano, 2023]
From \(y\), the lower and upper bounds define entrywise intervals \(\mathcal I(y)_j\) with lengths \(|\mathcal I(y)_j| \in [0,1]\); the ground truth \(x_j\) is contained in \(\mathcal I(y)_j\).
[Angelopoulos et al, 2022]
Risk Controlling Prediction Set
For risk level \(\epsilon\) and failure probability \(\delta\), \(\mathcal{I}(y)_j \) is an RCPS if
\[\mathbb{P}\left[\mathbb{E}\left[\text{fraction of pixels not in intervals}\right] \leq \epsilon\right] \geq 1 - \delta\]
[Angelopoulos et al, 2022]
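One standard way to calibrate such a set is an upper-confidence-bound scan over \(\lambda\); the sketch below uses a Hoeffding bound (the RCPS construction admits several concentration bounds, and all names and the toy risk curve here are hypothetical):

```python
import numpy as np

def calibrate_rcps(lambdas, emp_risk, n, eps=0.1, delta=0.05):
    """Smallest lambda whose Hoeffding upper confidence bound on the risk is
       <= eps for every larger lambda. emp_risk[i]: empirical risk in [0, 1]
       at lambdas[i] on n calibration points; lambdas sorted increasing."""
    ucb = emp_risk + np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    ok = ucb <= eps
    # require the bound to hold for all lambda' >= lambda (suffix condition)
    suffix_ok = np.flip(np.logical_and.accumulate(np.flip(ok)))
    if not suffix_ok.any():
        raise ValueError("no lambda controls the risk at this (eps, delta)")
    return lambdas[int(np.argmax(suffix_ok))]

lams = np.linspace(0.0, 1.0, 11)
risk = np.maximum(0.0, 0.3 - 0.5 * lams)   # risk shrinks as intervals grow
lam_hat = calibrate_rcps(lams, risk, n=2000, eps=0.1, delta=0.05)
```

With \(n=2000\) the Hoeffding slack is about \(0.027\), so the scan selects \(\hat\lambda = 0.5\), the first grid point whose inflated risk stays below \(\epsilon\).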
[Angelopoulos et al, 2022]
The intervals \(\mathcal I(y_j) \subseteq [0,1]\) are inflated by \(\lambda\) until the ground truth \(x_j\) is contained.
Procedure:
\[\hat{\lambda} = \inf\{\lambda \in \mathbb{R}:~ \hat{\text{risk}}_{(\mathcal S_{\text{cal}})}(\lambda') \leq \epsilon,~\forall \lambda' \geq \lambda \}\]
[Angelopoulos et al, 2022]
single \(\lambda\) for all \(\mathcal I(y_j)\)!
\(\mathcal{I}_{\bm{\lambda}}(y)_j = [\text{low}_j - \lambda_j, \text{up}_j + \lambda_j]\)
\[\tilde{\bm{\lambda}}_K = \underset{\bm\lambda \in \mathbb R^K}{\arg\min}~\sum_{k \in [K]}\lambda_k~\quad \text{s.t. }\quad \mathcal I_{\bm\lambda}(y) : \text{RCPS}\]
scalar \(\lambda \in \mathbb{R}\)
vector \(\bm{\lambda} \in \mathbb{R}^d\)
\(\mathcal{I}_{\lambda}(y)_j = [\text{low}_j - \lambda, \text{up}_j + \lambda]\)
\(\mathcal{I}_{\bm{\lambda}}(y)_j = [\text{low}_j - \lambda_j, \text{up}_j + \lambda_j]\)
\(\rightarrow\)
\(\rightarrow\)
Procedure:
1. Find anchor point
\[\tilde{\bm{\lambda}}_K = \underset{\bm{\lambda}}{\arg\min}~\sum_{k \in [K]}\lambda_k~\quad\text{s.t.}~~~\hat{\text{risk}}^+(\bm{\lambda})_{(S_{opt})} \leq \epsilon\]
2. Choose
\[\hat{\beta} = \inf\{\beta \in \mathbb{R}:~\hat{\text{risk}}_{S_{cal}}^+(\tilde{\bm{\lambda}}_K + \beta'\bf{1}) \leq \epsilon,~\forall~ \beta' \geq \beta\}\]
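Step 2 reduces the \(K\)-dimensional problem back to a one-dimensional scan: shift the anchor vector by a common offset \(\beta\) until the calibration risk is controlled. A minimal sketch (names, toy anchor, and toy risk function are hypothetical):

```python
import numpy as np

def calibrate_offset(risk_fn, lam_anchor, betas, eps=0.105):
    """Given an anchor vector lam_anchor (found on S_opt), pick the smallest
       scalar beta such that the empirical risk of lam_anchor + beta' * 1 on
       S_cal stays <= eps for all beta' >= beta. betas: candidate offsets,
       sorted increasing; risk_fn: vector lambda -> empirical risk."""
    risks = np.array([risk_fn(lam_anchor + b) for b in betas])
    ok = risks <= eps
    suffix_ok = np.flip(np.logical_and.accumulate(np.flip(ok)))
    return betas[int(np.argmax(suffix_ok))] if suffix_ok.any() else None

lam_anchor = np.array([0.1, 0.3, 0.2])                 # hypothetical anchor
risk_fn = lambda lam: float(np.mean(np.maximum(0.0, 0.5 - lam)))  # toy risk
betas = np.linspace(0.0, 1.0, 101)
beta_hat = calibrate_offset(risk_fn, lam_anchor, betas)
```

The scalar calibration preserves the RCPS guarantee while the anchor direction lets interval widths adapt per region of the image.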
\(\tilde{\bm{\lambda}}_K\)
\(\hat{R}^{\gamma}(\bm{\lambda}_{S_{opt}})\leq \epsilon\)
Guarantee: \(\mathcal{I}_{\tilde{\bm{\lambda}}_K,\hat{\beta}}(y)_j \) are \((\epsilon,\delta)\)-RCPS
\(\hat{\lambda}_K\)
conformalized uncertainty maps
\(K=4\)
\(K=8\)
\[\mathbb{P}\left[\mathbb{E}\left[\text{fraction of pixels not in intervals}\right] \leq \epsilon\right] \geq 1 - \delta\]
c.f. [Kiyani et al, 2024]
Teneggi, Tivnan, Stayman, S. How to trust your diffusion model: A convex optimization approach to conformal risk control. ICML 2023