From random forests to regulatory rules: interpreting supervised learners to guide biological discovery
Karl Kumbier
UC Berkeley Statistics
Advisor: Prof. Bin Yu
Domain knowledge
Modeling/analysis
Experimental design/data collection
In collaboration with: Susan Celniker (LBNL), James B. Brown (LBNL)
From genomic to statistical interactions
Market baskets and genomics
Iterative Random Forests
Case studies in Drosophila development
0-1:20 hours
1:20-3:00 hours
3:00-3:40 hours
3:40-5:20 hours
5:20-9:00 hours
9:20-16:00 hours
image: Volker Hartenstein
images: BDGP
Kr expression
Enhancers: segments of the genome that coordinate transcription factor (TF) activity to regulate gene expression.
Pfeiffer et al. (2008)
even-skipped expression
wt
transgenic
Hiromi et al. (1985), Harding et al. (1989), Goto et al. (1989), Pfeiffer et al. (2008)
even-skipped expression
wt
transgenic
Goto et al. (1989), Harding et al. (1989), Small et al. (1992), Ilsley et al. (2013), Levine et al. (2013)
Experimentally validated enhancer elements.
Whole-embryo ChIP-chip/ChIP-seq measurements of transcription factor (TF) DNA binding
activators
repressors
Segment of the genome
DNA binding for p transcription factors (TFs)
Order-s interaction: s = #activators + #repressors
Chopra and Levine (2009)
Dl +
Dl -
Wolpert (1968), Jaeger and Reinitz (2006), Chopra and Levine (2009), Zinzen et al. (2009), Knowles and Biggin (2013), Levine (2013), Staller et al. (2015), ...
Jaeger and Reinitz (2009)
(1) How precisely does an interaction predict class-1 observations?
(2) How prevalent is an interaction among class-1 observations?
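The two questions above can be made concrete with a minimal sketch. It assumes binary features (1 = active) and treats an interaction as active in an observation when all of its features are active; the function names and the toy data are hypothetical, not taken from the iRF package.

```python
def interaction_active(x, interaction):
    """True if every feature index in `interaction` is active (== 1) in x."""
    return all(x[j] == 1 for j in interaction)


def precision(X, y, interaction):
    """Among observations where the interaction is active, the fraction in class 1."""
    labels = [label for x, label in zip(X, y) if interaction_active(x, interaction)]
    return sum(labels) / len(labels) if labels else 0.0


def prevalence(X, y, interaction):
    """Among class-1 observations, the fraction where the interaction is active."""
    class1 = [x for x, label in zip(X, y) if label == 1]
    if not class1:
        return 0.0
    return sum(interaction_active(x, interaction) for x in class1) / len(class1)


# Toy data (assumed): 5 observations, 3 binary features, binary response.
X = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]
y = [1, 1, 0, 0, 0]
S = (0, 1)  # candidate order-2 interaction between features 0 and 1

print(precision(X, y, S))   # 2 of the 3 observations with S active are class 1
print(prevalence(X, y, S))  # both class-1 observations have S active
```

An interaction can be highly prevalent but imprecise (as here), or precise but rare, which is why the two measures are reported separately.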
Interactions:
Responses:
?
Computational costs grow as
Misses interactions with weak marginal effects
image: Lee and Haber (2014)
What combinations of items do customers purchase together?
What combinations of items do different types of customers purchase together?
Feature-index sets
Leverage sparsity in market baskets to search for frequently co-occurring items in a computationally efficient manner
Randomly sampled
class-C observation
"survived" interaction
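A rough sketch of the random intersection idea labeled above: start from the feature-index set of a randomly sampled class-C observation, repeatedly intersect it with further randomly sampled class-C observations, and keep whatever non-empty sets survive to the final depth. The tree depth, branching factor, number of trees, and all function names are assumptions chosen for illustration.

```python
import random


def random_intersection_tree(active_sets, depth, branch, rng):
    """Grow one intersection tree; return the sets that survive to the last level."""
    # Root: feature-index set of one randomly sampled class-C observation.
    level = [frozenset(rng.choice(active_sets))]
    for _ in range(depth - 1):
        next_level = []
        for node in level:
            for _child in range(branch):
                # Child: parent intersected with another random class-C observation.
                child = node & frozenset(rng.choice(active_sets))
                if child:  # empty intersections do not survive
                    next_level.append(child)
        level = next_level
    return set(level)


def random_intersection_trees(active_sets, n_trees=50, depth=3, branch=2, seed=0):
    """Union of surviving sets over several independently grown trees."""
    rng = random.Random(seed)
    survived = set()
    for _ in range(n_trees):
        survived |= random_intersection_tree(active_sets, depth, branch, rng)
    return survived


# Toy class-C observations (assumed): features 0 and 1 always co-occur,
# every other feature appears in only one observation.
class_c = [{0, 1, 2}, {0, 1, 3}, {0, 1, 4}, {0, 1, 5}]
survived = random_intersection_trees(class_c)
print(sorted(sorted(s) for s in survived))
```

Because features that rarely co-occur drop out after a few intersections, the cost depends on the sparsity of the feature-index sets rather than on enumerating all candidate combinations.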
Genomic response
Genomic features
Challenges:
iterative Random Forests (iRF)
&
signed iterative Random Forests (s-iRF)
Joint work with Sumanta Basu, James B. Brown, Susan Celniker, and Bin Yu
iterative Random Forests (iRF) build on PCS to identify genomic interactions in developing Drosophila embryos
Open source R implementation: https://cran.r-project.org/web/packages/iRF/
Breiman et al. (1984)
For current node:
Proportion positive responses
Number of observations
Gini impurity:
Decrease in Gini impurity:
Mean decrease in impurity:
On average, how much does splitting on a variable decrease the Gini impurity?
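The quantities named above can be sketched for a binary response as follows. This is a minimal illustration, assuming the two-class Gini impurity 2p(1 - p), a decrease weighted by the share of observations sent to each child, and a mean taken over the splits that use a given variable; other weighting and normalization conventions exist, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node with binary labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # proportion of positive responses in the node
    return 2.0 * p * (1.0 - p)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def mean_decrease_in_impurity(splits):
    """Average decrease per variable, given (variable, decrease) for every split."""
    totals, counts = {}, {}
    for variable, decrease in splits:
        totals[variable] = totals.get(variable, 0.0) + decrease
        counts[variable] = counts.get(variable, 0) + 1
    return {v: totals[v] / counts[v] for v in totals}


parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]          # a split that separates the classes
print(gini(parent))                          # 0.5
print(gini_decrease(parent, left, right))    # 0.5

# Hypothetical (variable, decrease) records collected over a forest.
splits = [(0, 0.5), (1, 0.125), (0, 0.25)]
print(mean_decrease_in_impurity(splits))     # {0: 0.375, 1: 0.125}
```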
Random forests modify CART to improve predictive accuracy:
Random forests:
At each node of the decision tree, uniformly sample a subset of features
Feature-weighted random forests:
At each node of the decision tree, sample a subset of features with probability proportional to the feature weights
Feature weights
Gini importance
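The difference between uniform and weighted feature sampling can be sketched as below. The sketch draws features one at a time without replacement, each with probability proportional to its weight among the features not yet chosen; this particular sampling scheme and the name `sample_features` are assumptions. Equal weights recover the uniform sampling of a standard random forest.

```python
import random


def sample_features(weights, m, rng):
    """Draw up to m distinct feature indices, favoring features with larger weights."""
    candidates = [j for j, w in enumerate(weights) if w > 0]
    chosen = []
    for _ in range(min(m, len(candidates))):
        current = [weights[j] for j in candidates]
        pick = rng.choices(candidates, weights=current, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # sample without replacement
    return chosen


rng = random.Random(0)

# Standard random forest: every feature equally likely.
uniform = sample_features([1.0, 1.0, 1.0, 1.0], m=2, rng=rng)

# Feature-weighted random forest: weights could be, for example, the Gini
# importances from a previous iteration (values here are made up).
weighted = sample_features([0.0, 5.0, 0.0, 1.0], m=2, rng=rng)
print(uniform, weighted)
```

Features with zero weight are never offered to a node, so repeated re-weighting concentrates the splits on a shrinking set of features.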
Iteration 1
Iteration K
Feature weights
Active
Inactive
Continuous measurements
Binary features
Enriched
Depleted
1. Iteratively re-weighted random forests
2. Decision path feature transformation
3. RIT on random forest decision paths
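One way to picture the decision path feature transformation: each leaf of a tree is mapped to the set of features split on along its root-to-leaf path, with a sign recording whether the path followed the high (+) or low (-) side of each threshold. The nested-dictionary tree format and the function `leaf_paths` are hypothetical stand-ins for a fitted forest, used only to keep the sketch self-contained.

```python
def leaf_paths(node, path=()):
    """Yield (signed feature set, leaf prediction) for every leaf below `node`."""
    if "prediction" in node:  # leaf node
        yield frozenset(path), node["prediction"]
        return
    j = node["feature"]
    # Left child: feature value at or below the threshold ("-").
    yield from leaf_paths(node["left"], path + ((j, "-"),))
    # Right child: feature value above the threshold ("+").
    yield from leaf_paths(node["right"], path + ((j, "+"),))


# Toy tree (assumed): split on feature 0, then on feature 2.
tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"prediction": 0},
    "right": {
        "feature": 2, "threshold": 1.2,
        "left": {"prediction": 0},
        "right": {"prediction": 1},
    },
}

paths = list(leaf_paths(tree))
class1_sets = [features for features, pred in paths if pred == 1]
print(class1_sets)  # one class-1 leaf, reached via high feature 0 and high feature 2
```

The resulting feature sets for class-1 leaves are sparse binary representations, which is the kind of input an intersection-based search can work with.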
Importance measures:
Null importance measures:
Prevalence:
Examples:
Precision:
Examples:
1. Iteratively re-weighted RFs stabilize decision paths
2. gRIT searches for high-order interactions along decision paths
3. Importance metrics evaluate interactions in fitted RF
Outer layer bootstrap samples
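A sketch of how an outer layer of bootstrap samples can be used to assess interactions: rerun the interaction search on each resampled data set and record the fraction of resamples in which each interaction is recovered. The search routine is passed in as a function; `stability_scores`, `toy_finder`, and the toy data are hypothetical and stand in for the full procedure.

```python
import random
from itertools import combinations


def stability_scores(n, find_interactions, n_bootstrap=20, seed=0):
    """Fraction of bootstrap samples in which each interaction is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_bootstrap):
        # Bootstrap sample: n observation indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        for interaction in set(find_interactions(idx)):
            counts[interaction] = counts.get(interaction, 0) + 1
    return {s: c / n_bootstrap for s, c in counts.items()}


# Toy data (assumed): each row is the set of active features of one observation.
rows = [{0, 1, 2}, {0, 1}, {0, 1, 3}, {0, 1, 2}, {0, 1}, {0, 1, 3}]


def toy_finder(idx):
    """Stand-in search: report pairs active in at least half of the sampled rows."""
    sample = [rows[i] for i in idx]
    found = []
    for pair in combinations(range(4), 2):
        support = sum(set(pair) <= r for r in sample) / len(sample)
        if support >= 0.5:
            found.append(frozenset(pair))
    return found


scores = stability_scores(len(rows), toy_finder)
print(scores[frozenset({0, 1})])  # 1.0: this pair is active in every row
```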
Case studies in Drosophila
Enhancers: Pfeiffer et al. 2008, Fisher et al. 2012, Kvon et al. 2014
ChIP: MacArthur et al. 2009, Li et al. 2008
active enhancer
Rule predicted probability
Known target of order-3 interaction among Gt, Kr, and Hb
Figure: eve, with Gt, Kr, Hb binding
Known gap gene target
Figure: Kni, with Gt, Kr, Hb binding
Known gap gene target
Gt
Kr
Hb
Figure: panels Cad early, Cad late, Zld low, Zld high; axes Hb, Kr
Figure: panels Zld low, Zld high; axes Gt, Kr
Berkeley Drosophila Genome Project:
TF spatial gene expression patterns
Pre-organ region principal patterns (PP)
Wu et al. (2016)
Detection and registration
Registered
images
Stain
extraction
Data: Wu et al. (2016)
Principal patterns (PP)
activators
repressors
PP7 expression rule
Region of the embryo
Expression levels of p transcription factors (TFs)
Fruitfly embryo segmentation
Human embryo segmentation
Nobel Prize in Physiology or Medicine 1995
Edward B. Lewis, Christiane Nüsslein-Volhard, Eric F. Wieschaus
: interactions correctly predicted
: interactions missed
Figure: Blimp-1, with Cad, Tll, Ftz binding
Figure: h, with Cad, Tll, Ftz binding
Joint with: Bin Yu
Joint with: Runjing Liu, Erwin Frise, Susan Celniker, Bin Yu
Joint with: Reza Abbasi Asl, Jamie Murdoch, Chandan Singh, and Bin Yu
Biological challenges
Statistical challenges
S. Basu
J. Brown
B. Yu
S. Celniker
E. Frise