Machine learning interpretability

to study biology?

J. Sulam

Institute for Computational Medicine

September 2nd, 2025

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. [...] Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. [...]
We want AI agents that can discover like we can, not which contain what we have discovered."

The Bitter Lesson, Rich Sutton 2019

Interpreting machine learning predictors for discovery in biology?

1. Interpreting Opaque ML models

2. Interpretable ML models

sick

healthy

(large \(n\), small Bayes error)

(small \(n\), non-negligable Bayes error)

Interpreting Opaque ML models

{f}\huge(

{\huge)} = \text{\texttt{sick}}

What parts of the image are important for this prediction?

What are the subsets of the input so that

{f}(x_C) \approx {f}(x) ?

C \subseteq [d]

Interpreting Opaque ML models

Sensitivity or Gradient-based perturbations
Shapley coefficients
Variational formulations
Counterfactual & causal explanations

LIME [Ribeiro et al, '16], CAM [Zhou et al, '16], Grad-CAM [Selvaraju et al, '17]

Shap [Lundberg & Lee, '17], ...

RDE [Macdonald et al, '19], ...

[Sani et al, 2020] [Singla et al '19],..

Shapley Values

efficiency
nullity
symmetry

exponential complexity

Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

Let be an -person cooperative game with characteristic function

G = ([n],v)

v : \mathcal P([n]) \mapsto \mathbb R

How important is each player for the outcome of the game?

\displaystyle \phi_i = \sum_{S_j\subseteq [n]\setminus \{i\} } w_{S_j} \left[ v(S_j\cup \{i\}) - v(S_j) \right]

marginal contribution of player i with coalition S

Shapley Values

Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

\displaystyle \phi_i(v) = \sum_{S\subseteq [n]\setminus \{i\} } \frac{|S|!(n-|S|-1)!}{n!} \left[ v(S\cup \{i\}) - v(S) \right]

Efficiency

\displaystyle v([n]) = v(\empty) + \sum_{i=1}^n \phi_i(v)

Linearity

\displaystyle \phi_i( \alpha_1 v_1 + \alpha_2 v_2) = \alpha_1 \phi_i(v_1) + \alpha_2 \phi_i(v_2) \newline \text{ for characteristic functions }v_1,v_2

Symmetry

\text{If}~~ v(S\cup i) = v(S\cup j) ~~ \forall S\subseteq [n]\setminus \{i,j\} \newline \text{Then}~~ \phi_i(v) = \phi_j(v)

\text{If}~~ v(S\cup i) = v(S) ~~ \forall S\subseteq [n]\setminus \{i\} \newline \text{Then}~~ \phi_i(v)=0

Nullity

Shapley Explanations for ML

sick

healthy

\displaystyle \phi_i = \sum_{S_j\subseteq [n]\setminus \{i\} } w_{S_j} \left[ v(S_j\cup \{i\}) - v(S_j) \right]

X \in \mathcal X \subset \mathbb R^n

inputs

responses

f:\mathcal X \to \mathcal Y

predictor

f(X) = \hat{Y} \approx Y

Y\in \mathcal Y = [C]

Shapley Explanations for ML

X \in \mathcal X \subset \mathbb R^n

inputs

responses

f:\mathcal X \to \mathcal Y

predictor

f(X) = \hat{Y} \approx Y

Y\in \mathcal Y = [C]

Question 2:

How can we, or when, compute \(\phi_i(v)\)?

Question 3:

What do \(\phi_i(v)\) say (and don't) about the problem?

Question 1:

How should we choose the function \(v_f(S)\)?

Question 4:

What can we do with them?

\displaystyle \phi_i = \sum_{S_j\subseteq [n]\setminus \{i\} } w_{S_j} \left[ v(S_j\cup \{i\}) - v(S_j) \right]

Question 1:

How should we choose the function \(v\)?

\(v_f(S,x) = \mathbb{E} [f(x_S,\tilde{X}_{\bar{S}})|X_S = x_S]\)

"True to the data"

Conditional Data Distribution

[Chen et al, Algorithms to estimate Shapley value feature attributions, 2022] [Aas et al, 2019] [Teneggi et al, 2023][Frye et al, 2021][Janzing et al, 2019][Chen et al, 2020]

Question 1:

How should we choose the function \(v\)?

(x_S,\tilde{X}_{\bar{S}})

x_S

\(v_f(S,x) = \mathbb{E} [f(x_S,\tilde{X}_{\bar{S}})|X_S = x_S]\)

Conditional Data Distribution

[Chen et al, Algorithms to estimate Shapley value feature attributions, 2022] [Aas et al, 2019] [Teneggi et al, 2023][Frye et al, 2021][Janzing et al, 2019][Chen et al, 2020]

Question 2:

How can we, or when, compute \(\phi_i(v)\)?

{\huge)} = 0

{f}\huge(

{\huge)} = 1

{f}\huge(

Example:

f(x) = 1

if contains a sick cell

{f}\huge(

{\huge)} = 0

Hierarchical Shapley (h-Shap)

[Teneggi et al, 2022]

Observation: \(f(x) = 1 \Leftrightarrow \exist~ i: f(x_i,\tilde{X}_{-i}) = 1\) (A1)

Hierarchical Shapley (h-Shap)

[Teneggi et al, 2022]

Under A1, \(\phi^\text{h-Shap}_i(f) = \phi_i(f)\)

Bounded approximation as deviating from A1

2. Correct approximation (informal)

1. Complexity \(\mathcal O(2^\gamma k \log n)\)

Observation: \(f(x) = 1 \Leftrightarrow \exist~ i: f(x_i,\tilde{X}_{-i}) = 1\) (A1)

\gamma = 2

Question 2:

How can we, or when, compute \(\phi_i(v)\)?

Hierarchical Shapley (h-Shap)

[Teneggi et al, 2022]

Fast hierarchical games for image explanations, Teneggi, Luster & S., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

Hierarchical Shapley (h-Shap)

[Teneggi et al, 2022]

Question 4:

What can we do with them?

Hemorrhage detection in head CT

Question 4:

What can we do with them?

Image-by-image supervision (strong learner)

one label per image!

Study/volume supervision (weak learner)

one label per study!

Question 4:

What can we do with them?

Weak learner is more efficient for detecting positive slices

Can find hemorrhages without labels

Image-by-image supervision (strong learner)

one label per image!

Study/volume supervision (weak learner)

one label per study!

Teneggi, J., Yi, P. H., & Sulam, J. (2023). Examination-level supervision for deep learning–based intracranial hemorrhage detection at head CT. Radiology: Artificial Intelligence

Interpreting machine learning predictors for discovery in biology?

1. Interpreting Opaque ML models

2. Interpretable ML models

sick

healthy

[Marilyn Bui]

Cancer biology/pathology

[Burke et al]

Protein-Protein interactions

[Herbst-Kralovetz Lab]

Host/microbe interactions

[Allen Institute]

Spatial transcriptomics

[MICrONs]

Connectomics

How can we learn descriptive local patterns for different phenotypes?

2. Interpretable ML models

Breast Cancer

Highly heterogeneous: multiple subtypes (e.g., HR+/HER2−, HER2+, TNBC), distinct biological and clinical behavior, subtype-specific treatment strategies.

HR+/HER2- (Hormone Receptor-Positive, HER2-Negative)
- Express estrogen/progesterone receptors but not HER2
- Generally favorable prognosis
HER2+ (HER2-positive, regardless of HR)
- Over-express HER2 (a receptor that promotes cell growth).
- Traditionally poor prognosis, now better with therapy
TNBC (Triple negative breast cancer)
- Don't express estrogen, progesterone, nor HER2 receptors
- Aggressive, poor prognosis - good immunotherapy in some cases!

Breast Cancer

Highly heterogeneous: multiple subtypes (e.g., HR+/HER2−, HER2+, TNBC), distinct biological and clinical behavior, subtype-specific treatment strategies.
Subtypes are limited clinical biomarkers for prognosis (or therapeutic response)
Tumor MicroEnvironment (TME) influences tumor progression & treatment [Hinshaw et al, Khalaf et al]

[Jackson et al, 2020]

Breast Cancer

Highly heterogeneous: multiple subtypes (e.g., HR+/HER2−, HER2+, TNBC), distinct biological and clinical behavior, subtype-specific treatment strategies.
Subtypes are limited clinical biomarkers for prognosis (or therapeutic response)
Tumor MicroEnvironment (TME) influences tumor progression & treatment [Hinshaw et al, Khalaf et al]
Imaging Mass Cytometry (IMC): "large" and new datasets of protein expression

[Danenberg et al, 2022]

Breast Cancer: data

37 proteins simultaneously (independent channels)
n = 693
32 cell phenotypes are identified from different protein expressions
- e.g. a cell expressing \(\texttt{CD3}\) and \(\texttt{CD8}\) is (likely) a \(\texttt{cytotoxic T cell}\)
a cell expressing \(\texttt{CD68}\) is likely a \(\texttt{macrophage}\)

[Danenberg et al, 2022]

Imaging Mass Cytometry (IMC): "large" and new datasets of protein expression

37 proteins simultaneously (independent channels)
n = 693
32 cell phenotypes are identified from different protein expressions
- e.g. a cell expressing \(\texttt{CD3}\) and \(\texttt{CD8}\) is (likely) a \(\texttt{cytotoxic T cell}\)
a cell expressing \(\texttt{CD68}\) is likely a \(\texttt{macrophage}\)

Breast Cancer: data

Imaging Mass Cytometry (IMC): "large" and new datasets of protein expression

Bi-level graph learning for breast cancer biomarkers

Undirected graph: \(G = (V,W)\)

Set of nodes \(v_i = (v^\text{loc}_i,x^\text{type}_i)\)

Set of edges \(W \subseteq \{(v_i,v_j)| v_i\neq v_j\} \)

cell phenotype \(x_i^\text{type} \in \{1,\dots,K\}\)

\(w_{i,j} = \exp(-\alpha \|v^\text{loc}_i-v^\text{loc}_j\|^2_2) \)

coordinates \(v_i^\text{loc} \in\mathbb R^2\)

Graph Kernels

\(\displaystyle x_i^{t} = x_i^{t-1} + \sum_{v_j\in V\setminus v_i} x_j^{t-1}A_{i,j}\)

Graph Kernels

generate
subtrees

Graph Kernels

\(\kappa(G_1,G_2) = \frac{\langle \phi(G_1), \phi(G_2) \rangle}{\|\phi(G_1)\|_2\|\phi(G_2\|_2}\)

\(=0.301\)

generate
subtrees

Graph Kernels

Soft WL-kernel:

[Shervashidze, et al JMLR (2011)][Schulz et al. ML (2022)]

Graph Kernels

Tumor niche

Immune niche

Stromal niche

Interface niche

Bi-level Graphs: BiGraph

Bi-level Graphs: BiGraph

Community detection via
Louvain algorithm [Biondel et al, 2008]

Bi-level Graphs: BiGraph

(validation)

Bi-level Graphs: BiGraph

(cytokeratins + chemokine)

(validation)

BiGraph provides complementary information to standard clinical characteristics

Tumor Stage

Tumor Grade

Tumor Subtype

BiGraph provides complementary information to standard clinical characteristics

Biomarkers for TNBC

279 early triple-negative breast cancer (TNBC) patients in a randomized pre-surgical neoadjuvant immunotherapy clinical trial (ICB)
pathology Complete Response (pCR) vs
residual disease (RD)

all subtypes

TNBC

Biomarkers for TNBC

TLS are known to promote local immune responses and are generally associated with favorable prognosis in immunologically “hot” tumors.
Prior studies have shown mixed results in breast cancer overall, but more consistent positive associations in TNBC.

all subtypes

TNBC

proportion of

TLS-like niche

proportion of

TLS-like niche

proportion of

TLS-like niche

proportion of

TLS-like niche

[Z Wang et al, Bi-level Graph Learning Unveils Prognosis-Relevant Tumor Microenvironment Patterns from Breast Multiplexed Digital Pathology, Patterns: Cell Press 2025]

Concluding Remarks

Zhenzhen Wang
BME

Carlos A. Santa-Maria
Oncology

Aleksander Popel
BME

Jacopo Teneggi

Paul Yi

St Jude Radiology

General, opaque interpretability ML can be useful in high SNR settings, especially for speeding up annotation

True biomarker discovery, in low SNR settings, like requires interpretable-by-construction methods

Thank you for inviting me

How can we learn descriptive local patterns for different phenotypes?

How can we learn descriptive local patterns for different phenotypes?

Phenotype A

Phenotype B

How can we learn descriptive local patterns for different phenotypes?

A background story

Dr. Jeremias Sulam
Johns Hopkins

PATTERNS-D-24-00095

Dear Dr. Sulam,
Thank you for submitting your paper to Patterns. I am enclosing the comments that the reviewers have made on your paper. In this case, we receive input from three expert reviewers. As you will see, the feedback are mixed. Reviewer#1 was unfortunately clearly negative about the paper and felt that it would not be well suited to the journal.

Reviewer#1 acknowledged the topic, but raised a series of technical and interpretable issues including the lack of comparison with other SOTA methods and the detailed clinical interpretation of several results. While Reviewers #2 and #3 were more positive,....
[..]
Best wishes,

Wanying Wang
Scientific Editor, Patterns

Reviewer #1: The paper introduces BiGraph, an unsupervised bi-level graph learning method designed to identify prognosis-relevant TME patterns from single-cell imaging data. BiGraph constructs both cellular graphs, representing the TME of individual patients, and population graphs, representing inter-patient similarities based on their TME patterns. By employing a novel relaxation of the Weisfeiler-Lehman subtree kernel, the method generates and analyzes TME patterns, facilitating robust risk stratification and uncovering prognostic biomarkers in breast cancer. However, there are some concerns which are mentioned below:

1. Could you explain how the BiGraph approach integrates cellular and population graphs to capture multi-scale information about the tumor microenvironment (TME)? How does this differ from conventional methods of patient stratification? Can you elaborate on how the Hodges-Lehmann statistic is used to identify characteristic TME patterns within patient subgroups?
2. What specific characteristics of the cellular graph and population graph are utilized in the BiGraph method to identify prognosis-relevant patterns in breast cancer patients?
3. How does the Soft-WL subtree kernel contribute to measuring the similarity between patients based on their TME patterns? Can you explain the process of generating and clustering TME patterns using this kernel?
4. In the analysis of the cellular graphs, what insights were gained from the clustering of subtrees into TME patterns? How did these patterns contribute to understanding the heterogeneity of the TME across patients?
5. Can you elaborate on how the Soft-WL subtree kernel quantifies inter-patient similarity based on the abundance of TME patterns? How is this similarity score calculated and interpreted? The process of combining multiple TMA cores into a single graph could be elaborated further.
6. How does the BiGraph-derived subtyping system compare with the standard clinical subtyping system for breast cancer? What are the main differences in their ability to capture inter-patient heterogeneity and predict survival outcomes?
7. How are the prognostic impacts of characteristic TME patterns correlated with clinical factors such as tumor grade, stage, and metastasis? Can you provide insights into the association between these patterns and tumor grade based on the validation results?
8. Can you explain the validation process used to assess the reproducibility and generalization of the BiGraph approach? How are patient risk stratification and the prognostic quality of TME patterns validated in both inner and external validation sets?
9. Based on the comparison with alternative methods, what insights can be drawn regarding the superiority of the Soft-WL subtree kernel approach in capturing patient subgroups with distinct survival outcomes and identifying prognosis-relevant TME patterns?
10. The methods described, particularly the generation and analysis of weighted cellular graphs and the Soft-WL subtree kernel, are computationally intensive. The paper could benefit from a discussion on computational efficiency and scalability, especially for large datasets.
11. The performance of the proposed methods is likely sensitive to several hyperparameters (e.g., α in the Gaussian kernel, h in the subtree generation, and τ in neighborhood boundary determination). While some parameter choices are justified, a more extensive discussion or empirical analysis of parameter sensitivity would strengthen the paper.
12. While the methodology is well-explained, the paper would benefit from more extensive validation. Comparisons with other state-of-the-art methods using benchmark datasets could highlight the advantages and potential limitations of the proposed approach.
13. A pseudocode will be helpful to understand the overall workflow of the paper.
14. Discuss the potential generalizability of the methods to other data types and possible adaptations required.

Dr. Jeremias Sulam
Johns Hopkins

PATTERNS-D-24-00095

Dear Dr. Sulam,
Thank you for submitting your paper to Patterns. I am enclosing the comments that the reviewers have made on your paper. In this case, we receive input from three expert reviewers. As you will see, the feedback are mixed. Reviewer#1 was unfortunately clearly negative about the paper and felt that it would not be well suited to the journal.

Reviewer#1 acknowledged the topic, but raised a series of technical and interpretable issues including the lack of comparison with other SOTA methods and the detailed clinical interpretation of several results. While Reviewers #2 and #3 were more positive,....
[..]
Best wishes,

Wanying Wang
Scientific Editor, Patterns

A background story

Dr. Jeremias Sulam
Johns Hopkins

PATTERNS-D-24-00095

Dear Dr. Sulam,
Thank you for submitting your paper to Patterns. I am enclosing the comments that the reviewers have made on your paper. In this case, we receive input from three expert reviewers. As you will see, the feedback are mixed. Reviewer#1 was unfortunately clearly negative about the paper and felt that it would not be well suited to the journal.

Reviewer#1 acknowledged the topic, but raised a series of technical and interpretable issues including the lack of comparison with other SOTA methods and the detailed clinical interpretation of several results. While Reviewers #2 and #3 were more positive,....
[..]
Best wishes,

Wanying Wang
Scientific Editor, Patterns

A background story

Dr. Jeremias Sulam
Johns Hopkins

PATTERNS-D-24-00095

Dear Dr. Sulam,
Thank you for submitting your paper to Patterns. I am enclosing the comments that the reviewers have made on your paper. In this case, we receive input from three expert reviewers. As you will see, the feedback are mixed. Reviewer#1 was unfortunately clearly negative about the paper and felt that it would not be well suited to the journal.

Reviewer#1 acknowledged the topic, but raised a series of technical and interpretable issues including the lack of comparison with other SOTA methods and the detailed clinical interpretation of several results. While Reviewers #2 and #3 were more positive,....
[..]
Best wishes,

Wanying Wang
Scientific Editor, Patterns

Dear Dr Sulam,   

Thanks for your continued interest in Patterns and your point-by-point response to reviewers’ comments.   

We also acknowledge your concerns regarding the report provided by Reviewer #1. Please understand that we do take accusations of this kind seriously. At the same time, please understand that we have substantial reason to believe that public AI-generated content detectors show significant false-positive rates, especially for content outside of the original training domain or for text written by non-native English speakers. For this reason, validating claims of this kind is inherently difficult. 
We have contacted our internal ethics team and we are advised to investigate this matter further. We just received feedback from the reviewers for your manuscript. Reviewer#1 admitted that they used ChatGPT [...] 

To avoid delay, we have decided to move forward with considering your point-by-point response on its merits.

Best wishes, 
Wanying     

Associate Scientific Editor, Patterns Cell Press, 125 London Wall, London, EC2Y 5AS, UK

A background story

Dr. Jeremias Sulam
Johns Hopkins

PATTERNS-D-24-00095

Dear Dr. Sulam,
Thank you for submitting your paper to Patterns. I am enclosing the comments that the reviewers have made on your paper. In this case, we receive input from three expert reviewers. As you will see, the feedback are mixed. Reviewer#1 was unfortunately clearly negative about the paper and felt that it would not be well suited to the journal.

Reviewer#1 acknowledged the topic, but raised a series of technical and interpretable issues including the lack of comparison with other SOTA methods and the detailed clinical interpretation of several results. While Reviewers #2 and #3 were more positive,....
[..]
Best wishes,

Wanying Wang
Scientific Editor, Patterns

A background story

Dear Dr Sulam,   

Thanks for your continued interest in Patterns and your point-by-point response to reviewers’ comments.   

We also acknowledge your concerns regarding the report provided by Reviewer #1. Please understand that we do take accusations of this kind seriously. At the same time, please understand that we have substantial reason to believe that public AI-generated content detectors show significant false-positive rates, especially for content outside of the original training domain or for text written by non-native English speakers. For this reason, validating claims of this kind is inherently difficult. 
We have contacted our internal ethics team and we are advised to investigate this matter further. We just received feedback from the reviewers for your manuscript. Reviewer#1 admitted that they used ChatGPT [...] 

To avoid delay, we have decided to move forward with considering your point-by-point response on its merits.

Best wishes, 
Wanying     

Associate Scientific Editor, Patterns Cell Press, 125 London Wall, London, EC2Y 5AS, UK

Machine learning interpretability

to study biology?

J. Sulam

Institute for Computational Medicine

Interpreting machine learning predictors for discovery in biology?

Interpreting Opaque ML models

Interpreting Opaque ML models

Sensitivity or Gradient-based perturbations

Shapley coefficients

Variational formulations

Counterfactual & causal explanations

LIME [Ribeiro et al, '16], CAM [Zhou et al, '16], Grad-CAM [Selvaraju et al, '17]

Shap [Lundberg & Lee, '17], ...

RDE [Macdonald et al, '19], ...

[Sani et al, 2020] [Singla et al '19],..

Shapley Values

marginal contribution of player i with coalition S

Shapley Values

Efficiency

Linearity

Symmetry

Nullity

Shapley Explanations for ML

Shapley Explanations for ML

Hemorrhage detection in head CT

one label per image!

one label per study!

one label per image!

one label per study!

Interpreting machine learning predictors for discovery in biology?

2. Interpretable ML models

Breast Cancer

Breast Cancer

Breast Cancer

Breast Cancer: data

Breast Cancer: data

Concluding Remarks