From random forests to regulation: interpreting supervised learners to guide biological discovery
Karl Kumbier
September 12, 2022
Natural phenomena
ML models
Domain insights
?
Prediction
Interpretation
"Accuracy generally requires more complex prediction methods."
- Leo Breiman, Statistical Modeling: The Two Cultures (2001)
From genomic to statistical interactions
Market baskets and genomics
Iterative random forests & signed iterative random forests
Case studies: Interaction discovery in Drosophila and UK Biobank cohort
0-1:20 hours
1:20-3:00 hours
3:00-3:40 hours
3:40-5:20 hours
5:20-9:00 hours
9:20-16:00 hours
image: Volker Hartenstein
images: BDGP
Kr expression
Enhancers: segments of the genome that coordinate transcription factor (TF) activity to regulate gene expression.
Pfeiffer et al. (2008)
even-skipped expression
wt
transgenic
Hiromi et al. (1985), Harding et al. (1989), Goto et al. (1989), Pfeiffer et al. (2008)
Problems
Goto et al. (1989), Harding et al. (1989), Small et al. (1992), Ilsley et al. (2013), Levine et al. (2013)
activators
repressors
Segment of the genome
DNA binding for p transcription factors (TFs)
Order-s interaction: s = #activators + #repressors
Computational costs grow as O(p^s) for an exhaustive search over order-s interactions among p TFs
Misses interactions with weak marginal effects
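To make the scaling concrete, the sketch below counts the candidate order-s interactions among p TFs as the binomial coefficient C(p, s), which is what an exhaustive search would have to visit. The values p = 24 and p = 307 are borrowed from the TF counts quoted in the Drosophila case study; the function name is illustrative only.

```python
from math import comb


def n_candidate_interactions(p: int, s: int) -> int:
    """Number of distinct order-s feature subsets among p features."""
    return comb(p, s)


# Illustrative values: 24 and 307 TFs, interaction orders 2 through 4.
for p in (24, 307):
    for s in (2, 3, 4):
        print(f"p={p:>3}, s={s}: {n_candidate_interactions(p, s):,} candidates")
```

Even at modest orders, the count for a few hundred TFs is far too large to test every subset one by one.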
image: Lee and Haber (2014)
Chopra and Levine (2009)
Dl +
Dl -
Wolpert (1968), Jaeger and Reinitz (2006), Chopra and Levine (2009), Zinzen et al. (2009), Knowles and Biggin (2013), Levine (2013), Staller et al. (2015), ...
Jaeger and Reinitz (2009)
(1) How precisely does an interaction predict class-1 observations?
(2) How prevalent is an interaction among class-1 observations?
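One simple reading of these two questions, on binary data, is sketched below. It assumes a 0/1 feature matrix X (rows are observations, columns are features), a 0/1 response y, and an interaction given as a set of column indices that must all be active. These names and the toy data are assumptions for illustration; the importance metrics used in the talk are computed on fitted random forests and may differ in detail.

```python
import numpy as np


def precision_and_prevalence(X, y, interaction):
    """Precision and prevalence of an interaction on binary data.

    precision  = P(y = 1 | every feature in the interaction is active)
    prevalence = P(every feature in the interaction is active | y = 1)
    """
    X = np.asarray(X, dtype=bool)
    y = np.asarray(y, dtype=bool)
    active = X[:, list(interaction)].all(axis=1)
    n_both = (active & y).sum()
    precision = n_both / active.sum() if active.any() else float("nan")
    prevalence = n_both / y.sum() if y.any() else float("nan")
    return precision, prevalence


# Toy data (hypothetical): 5 observations, 3 binary features.
X_toy = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
y_toy = [1, 1, 0, 0, 0]
prec, prev = precision_and_prevalence(X_toy, y_toy, interaction=(0, 1))
```

Here features 0 and 1 are jointly active in three observations, two of which are class 1, and in every class-1 observation.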
Interactions:
Responses:
?
What combinations of items do customers purchase together?
What combinations of items do different types of customers purchase together?
Feature-index sets
Leverage sparsity in market baskets to search for frequently co-occurring items in a computationally efficient manner
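A minimal sketch of this kind of search, in the spirit of random intersection trees (RIT), is given below. Each tree starts from the item set of one randomly sampled observation of the class of interest and repeatedly intersects it with further randomly sampled observations; item sets that are still non-empty at the final depth "survive". The function name, default parameters, and toy baskets are assumptions for illustration, not the implementation used in the talk.

```python
import random


def random_intersection_trees(baskets, n_trees=50, depth=3, n_children=2, seed=0):
    """Return the item sets that survive repeated random intersections.

    baskets: list of sets of active item (feature) indices, all from one class.
    """
    rng = random.Random(seed)
    survived = set()
    for _ in range(n_trees):
        # Root node: the item set of one randomly sampled observation.
        level = [frozenset(rng.choice(baskets))]
        for _ in range(depth):
            next_level = []
            for node in level:
                for _ in range(n_children):
                    # Child node: intersect with another random observation.
                    child = node & frozenset(rng.choice(baskets))
                    if child:  # empty intersections are dropped
                        next_level.append(child)
            level = next_level
        survived.update(level)
    return survived


# Toy baskets (hypothetical): items 0 and 1 co-occur in every basket,
# and each basket also holds one item that no other basket shares.
toy_baskets = [{0, 1, 10 + i} for i in range(20)]
toy_survived = random_intersection_trees(toy_baskets)
```

Because sparse item sets shrink quickly under intersection, only frequently co-occurring combinations tend to survive, and no enumeration of all subsets is needed.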
Randomly sampled
class-C observation
"survived" interaction
Genomic response
Genomic features
Challenges:
iterative Random Forests (iRF)
&
signed iterative Random Forests (siRF)
Joint work with Sumanta Basu, James B. Brown, and Bin Yu
Iterative Random Forests (iRF) build on predictability, computability, and stability to identify genomic interactions in developing Drosophila embryos
Open source R implementation: github.com/karlkumbier/iRF2.0
Breiman et al. (1984)
For current node:
Random forests modify CART to improve predictive accuracy:
Random forest modifications improve generalization but reduce stability!
Random forests:
At each node of the decision tree, uniformly sample a subset of features
Feature-weighted random forests:
At each node of the decision tree, sample a subset of features with probability proportional to feature weights
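The difference between the two sampling schemes can be sketched as follows. The helper name, the `mtry` parameter name, and the example weights are assumptions for illustration. Drawing without replacement with NumPy's `Generator.choice` is one simple way to favor highly weighted features and is not necessarily how the iRF software samples.

```python
import numpy as np


def sample_candidate_features(weights, mtry, rng):
    """Sample mtry distinct feature indices, favoring larger weights."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()  # normalize weights to probabilities
    return rng.choice(len(weights), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)

# Standard random forest: equal weights give uniform sampling.
uniform_draw = sample_candidate_features(np.ones(6), mtry=2, rng=rng)

# Feature-weighted random forest: zero-weight features are never candidates.
weighted_draw = sample_candidate_features(
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0], mtry=2, rng=rng
)
```

With uniform weights this reduces to the usual random forest behavior, so the weighted version is a strict generalization.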
Proportion positive responses
Number of observations
Gini impurity:
Decrease in Gini impurity:
Mean decrease in impurity:
On average, how much does splitting on a variable decrease the Gini impurity?
Gini importance
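The formulas on this slide did not survive extraction, so the sketch below uses the standard textbook definitions for a binary response: Gini impurity 2p(1 - p), where p is the proportion of positive responses in a node, and the decrease in impurity as the parent impurity minus the size-weighted impurities of the two children. Weighting each split by the fraction of observations reaching the node is a common convention and is assumed here; all function names are illustrative.

```python
import numpy as np


def gini(y):
    """Gini impurity of a node with binary responses y (0/1)."""
    y = np.asarray(y, dtype=float)
    if y.size == 0:
        return 0.0
    p = y.mean()  # proportion of positive responses
    return 2.0 * p * (1.0 - p)


def gini_decrease(y_parent, y_left, y_right):
    """Decrease in Gini impurity achieved by splitting a node in two."""
    n = len(y_parent)  # number of observations in the parent node
    return (
        gini(y_parent)
        - len(y_left) / n * gini(y_left)
        - len(y_right) / n * gini(y_right)
    )


def mean_decrease_in_impurity(splits, n_features, n_trees):
    """Average weighted impurity decrease per feature across trees.

    splits: iterable of (feature index, fraction of training observations
    reaching the node, decrease in impurity at that node).
    """
    mdi = np.zeros(n_features)
    for feature, node_fraction, decrease in splits:
        mdi[feature] += node_fraction * decrease
    return mdi / n_trees


# A perfect split of a balanced node removes all impurity.
perfect = gini_decrease([0, 0, 1, 1], [0, 0], [1, 1])
toy_mdi = mean_decrease_in_impurity([(0, 1.0, 0.5), (1, 0.5, 0.2)], 2, 1)
```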
Iteration 1
Iteration K
Feature weights
Active
Inactive
Continuous measurements
Binary features
Enriched
Depleted
1. Iteratively re-weighted random forests
2. Decision path feature transformation
3. RIT on random forest decision paths
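As a rough illustration of step 2, the sketch below turns a root-to-leaf decision path into a set of signed features, so that each leaf looks like a sparse "basket" that step 3 can search. The path representation (feature index, threshold, direction) and the sign convention (+1 for the high side of a split, -1 for the low side) are assumptions made for this example, not a description of the iRF code.

```python
def encode_decision_path(path):
    """Encode a decision path as a set of signed features.

    path: list of (feature_index, threshold, went_right) tuples from root
    to leaf. A feature is signed +1 when the path follows the
    x_j > threshold branch and -1 when it follows x_j <= threshold.
    Thresholds are dropped; only feature identity and direction are kept.
    """
    return frozenset(
        (j, +1 if went_right else -1) for j, _threshold, went_right in path
    )


def paths_to_baskets(paths):
    """Encode every decision path in a forest as one signed-feature basket."""
    return [encode_decision_path(p) for p in paths]


# Hypothetical path: feature 3 high, feature 5 low, feature 3 high again.
toy_path = [(3, 0.7, True), (5, 1.2, False), (3, 0.9, True)]
toy_basket = encode_decision_path(toy_path)
```

Repeated splits on the same feature in the same direction collapse to a single signed feature, which keeps the encoding sparse.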
Figure: continuous measurements, binary feature encoding, decision rules, RIT, prevalent interactions
(Yu and Kumbier, 2020)
Predictability (from ML, Stats) evaluates whether models/results reflect external reality
Computability (from ML) enables domain-inspired simulations to compare against known structure
Stability (from Stats) assesses the reproducibility of results relative to data and model perturbations
The PCS framework unifies and expands on ideas from statistics and machine learning
Precision:
Examples:
Prevalence:
Examples:
Biology: TF enrichment among active enhancers
Null model: prevalence among inactive elements
Biology: cooperative binding among TFs
Null model: expected prevalence under independent selection
Biology: functional binding vs. inactive binding
Null model: precision of interaction subsets
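The "independent selection" null model can be illustrated with a small calculation: if the features in an interaction were selected independently, the expected joint prevalence would be the product of their individual prevalences. The sketch below compares observed and expected values on a binary matrix. The function names and toy data are assumptions for illustration; the talk's metrics are defined on random forest decision paths rather than raw binary data.

```python
import numpy as np


def prevalence(X, interaction):
    """Fraction of rows in which every feature of the interaction is active."""
    X = np.asarray(X, dtype=bool)
    return X[:, list(interaction)].all(axis=1).mean()


def observed_vs_independent(X, interaction):
    """Observed joint prevalence and its expectation under independence."""
    observed = prevalence(X, interaction)
    expected = float(np.prod([prevalence(X, [j]) for j in interaction]))
    return observed, expected


# Toy data (hypothetical): the two features always appear together.
X_toy_null = [[1, 1], [1, 1], [0, 0], [0, 0]]
obs, exp = observed_vs_independent(X_toy_null, (0, 1))
```

An observed prevalence well above the independence expectation is the pattern one would expect from cooperative binding.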
1. Iteratively re-weighted RFs stabilize decision paths
2. gRIT searches for high-order interactions along decision paths
3. Importance metrics evaluate interactions in fitted RF
Outer layer bootstrap samples
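One way to summarize results across the outer-layer bootstrap samples is a stability score: the fraction of bootstrap samples in which a given interaction is recovered. The sketch below assumes the three steps above have already been run on each bootstrap sample, yielding one set of recovered interactions per sample. The function name and toy input are illustrative.

```python
from collections import Counter


def stability_scores(recovered_per_bootstrap):
    """Fraction of bootstrap samples in which each interaction is recovered.

    recovered_per_bootstrap: list with one collection of interactions
    (each a frozenset of feature indices) per bootstrap sample.
    """
    n_samples = len(recovered_per_bootstrap)
    counts = Counter()
    for recovered in recovered_per_bootstrap:
        counts.update(set(recovered))  # count each interaction once per sample
    return {interaction: c / n_samples for interaction, c in counts.items()}


# Toy input (hypothetical): interactions recovered in 4 bootstrap samples.
toy_runs = [
    {frozenset({0, 1}), frozenset({2})},
    {frozenset({0, 1})},
    {frozenset({0, 1}), frozenset({2})},
    set(),
]
toy_stability = stability_scores(toy_runs)
```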
Discovering "functional" TF binding and interactions in the Drosophila embryo
Joint work with Sumanta Basu, James B. Brown, Susan Celniker, Erwin Frise, and Bin Yu
Enhancers: Pfeiffer et al. 2008, Fisher et al. 2012, Kvon et al. 2014
ChIP: MacArthur et al. 2009, Li et al. 2008, modENCODE/modERN consortia
Early stage (not shown): 24 TFs; Basu, K., Brown, and Yu (2018)
All stages (shown): 307 TFs; K., Basu, Brown, Celniker, Frise and Yu
Binding of known regulators correctly identified by siRF and missed by IDR
Binding with no reported function identified by IDR and not called by siRF
siRF functional binding GO term enrichment
IDR binding GO term enrichment
Figure: Wotton et al. (2015)
Gap gene network as validation
EpiTree pipeline for detecting epistatic interactions in the UK Biobank
Joint work with: Merle Behr, Aldo Cordova-Palomera, Matthew Aguirre, Euan Ashley, Atul J. Butte, Rima Arnaout, Ben Brown, James Priest, Bin Yu
Positive control phenotype: red hair
UK Biobank
Learn models
Inference
Amyotrophic lateral sclerosis (ALS): a fatal neurodegenerative disease.
Over 25 known genetic causes. 90% of cases are sporadic (SALS); many of the familial (FALS) cases also have an unknown cause.
No effective treatments exist
Can we accelerate drug discovery by learning ALS subtypes and the patterns of dysregulation that define them?
Joint work with: Julia Lazzari-Dean, Maike Roth, Steven Altschuler, and Lani Wu
Figure: Zou, Z. Y. et al. (2017)
Drosophila
Sumanta Basu, Erwin Frise, Susan Celniker, James B. Brown, Bin Yu
UK Biobank
Merle Behr, Aldo Cordova-Palomera, Matthew Aguirre, Euan Ashley, Atul J. Butte, Rima Arnaout, Ben Brown, James Priest, Bin Yu
Thank You!