Iterative random forests (iRF) to guide biological discovery
Karl Kumbier
Chan Zuckerberg Meeting
March 14, 2019
The PCS framework for reliable and reproducible data science
- Predictability: Does my model reflect external reality?
- Computability: Can I tractably build/train my model?
- Stability: Are my results consistent with respect to "reasonable" perturbations of the data/model?
Joint work with Bin Yu
Generating domain insights through supervised learning



Natural phenomena
Supervised learning
Domain insights
Outline
- From genomic to statistical interactions
- Market baskets and genomics
- Iterative random forests (iRF)
- Case studies using iRF
- iRF-enabled genome-wide epistasis studies (GWES)
From genomic to statistical interactions
Embryonic development in Drosophila

0-1:20 hours
1:20-3:00 hours
3:00-3:40 hours
3:40-5:20 hours
5:20-9:00 hours
9:20-16:00 hours
image: Volker Hartenstein
images: BDGP
Kr expression
High-order interactions at enhancer elements drive embryonic development



Goto et al. (1989), Harding et al. (1989), Small et al. (1992), Ilsley et al. (2013), Levine et al. (2013)
Identifying regulatory interactions from high-throughput genomic data
Regulatory elements (e.g. enhancers)
Whole-embryo ChIP-chip/ChIP-seq measurements of transcription factor (TF) DNA binding

From genomic to statistical interactions
activators
repressors
Segment of the genome
DNA binding for p transcription factors (TFs)
Order-s interaction: s = #activators + #repressors
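A minimal sketch of how an order-s interaction can be read as a thresholding rule, assuming a Boolean AND over binding levels. The TF names, binding values, and thresholds below are hypothetical and chosen only for illustration.

```python
def interaction_order(activators, repressors):
    """Order s of an interaction: s = #activators + #repressors."""
    return len(activators) + len(repressors)


def rule_active(binding, activators, repressors, thresholds):
    """True when every activator is above, and every repressor at or below, its threshold.

    Assumption: the rule is a plain AND over per-TF threshold tests.
    """
    activators_bound = all(binding[tf] > thresholds[tf] for tf in activators)
    repressors_absent = all(binding[tf] <= thresholds[tf] for tf in repressors)
    return activators_bound and repressors_absent


# Hypothetical binding levels for one segment of the genome.
binding = {"TF_A": 0.9, "TF_B": 0.7, "TF_C": 0.1}
thresholds = {"TF_A": 0.5, "TF_B": 0.5, "TF_C": 0.5}
activators, repressors = ["TF_A", "TF_B"], ["TF_C"]

order = interaction_order(activators, repressors)
active = rule_active(binding, activators, repressors, thresholds)
```

Here two activators and one repressor give an order-3 interaction, and the segment is called active only when all three threshold tests pass.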

Thresholding rules define expression domains


Chopra and Levine (2009)
Dl +
Dl -
Wolpert (1968), Jaeger and Reinitz (2006), Chopra and Levine (2009), Zinzen et al. (2009), Knowles and Biggin (2013), Levine (2013), Staller et al. (2015), ...
Jaeger and Reinitz (2009)
RuleFit: rule-based interaction discovery (Friedman and Popescu, 2008)
- Identify a collection of marginally important features
- Search for predictive order-2 rules among marginally important features

Computational costs grow as

Misses interactions with weak marginal effects
image: Lee and Haber (2014)
Market baskets and genomics
Interactions in market baskets



















What combinations of items do customers purchase together?
What combinations of items do different types of customers purchase together?




Feature-index sets
Random intersection trees (RIT)
Shah and Meinshausen (2014)
Leverage sparsity in market baskets to search for frequently co-occurring items in a computationally efficient manner
- Randomly sample feature index sets from class-C observations:
- Intersect sampled feature index sets in a tree-like fashion up to depth D
- Return all feature combinations that "survive" intersection procedure up to depth D
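The three steps above can be sketched roughly as follows. This is a simplified illustration rather than the reference implementation of Shah and Meinshausen (2014): the fixed branching factor, the depth convention (D rounds of intersection below the root), and the function names are assumptions.

```python
import random


def random_intersection_tree(index_sets, depth, branching, rng):
    """Grow one intersection tree and return the non-empty sets left after `depth` rounds."""
    # Root: the feature-index set of one randomly sampled class-C observation.
    level = [frozenset(rng.choice(index_sets))]
    for _ in range(depth):
        next_level = []
        for node in level:
            for _ in range(branching):
                # Child: parent intersected with another randomly sampled observation.
                child = node & frozenset(rng.choice(index_sets))
                if child:  # empty intersections do not survive
                    next_level.append(child)
        level = next_level
    return set(level)


def rit(index_sets, n_trees=20, depth=3, branching=2, seed=0):
    """Union of surviving feature combinations over several random intersection trees."""
    rng = random.Random(seed)
    survivors = set()
    for _ in range(n_trees):
        survivors |= random_intersection_tree(index_sets, depth, branching, rng)
    return survivors


# Hypothetical class-C observations: every basket contains items 0 and 1.
baskets = [{0, 1, 2}, {0, 1, 3}, {0, 1, 4}, {0, 1, 5}]
survivors = rit(baskets, n_trees=10, depth=3, branching=2, seed=1)
```

Because items 0 and 1 co-occur in every basket, each surviving set contains both, while items that appear in only one basket tend to be intersected away.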
[Figure: random intersection tree. Labels: randomly sampled class-C observation; "survived" interaction]
Genomic response
Genomic features
Translating the market basket problem into genomics




Challenges:
- Genomic features are typically measured in concentrations/counts
- Binding does not imply regulation (Li et al. 2008)




Iterative random forests (iRF) & Signed iterative random forests (siRF)
Joint work with Sumanta Basu, James B. Brown, Susan Celniker, and Bin Yu
Iterative random forest to identify high-order interactions in genomic data
- Iteratively re-weighted random forests stabilize decision paths
- Generalized random intersection trees search for high-order interactions
- Stability bagging evaluates interactions
Iterative random forests (iRF) build on PCS to identify genomic interactions in developing Drosophila embryos
Open source R implementation: https://cran.r-project.org/web/packages/iRF/
Iteratively re-weighted random forests
Random Forests
Breiman (2001)

Random forests modify CART to improve predictive accuracy:
- CART trees are trained on bootstrap samples of the data
- CART criterion evaluated on subset of features sampled uniformly at random
Feature-weighted random forests
Amaratunga et al. (2008)
Random forests:
At each node of the decision tree, uniformly sample a subset of features
Feature-weighted random forests:
At each node of the decision tree, sample a subset of features with probability proportional to the feature weights
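A minimal sketch of the difference between the two sampling schemes, assuming features are drawn one at a time without replacement. The function name `sample_features` and the parameter `mtry` are illustrative, not taken from a particular package.

```python
import random


def sample_features(weights, mtry, rng):
    """Draw up to `mtry` distinct feature indices, each draw proportional to its weight.

    Assumption: sequential draws without replacement; zero-weight features are never drawn.
    """
    candidates = [j for j, w in enumerate(weights) if w > 0]
    chosen = []
    for _ in range(min(mtry, len(candidates))):
        pick = rng.choices(candidates, weights=[weights[j] for j in candidates], k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


rng = random.Random(0)
# Standard random forest: equal weights reduce to uniform sampling.
uniform_draw = sample_features([1.0, 1.0, 1.0], mtry=3, rng=rng)
# Feature-weighted random forest: only features with positive weight can be chosen.
weighted_draw = sample_features([0.0, 0.0, 1.0, 0.0], mtry=2, rng=rng)
```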
The CART criterion: Gini impurity

Proportion positive responses
Number of observations
Gini impurity:
Decrease in Gini impurity:
Mean decrease in impurity:
On average, how much does splitting on a variable decrease the Gini impurity?
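As a worked illustration, the sketch below uses the standard binary-class form of Gini impurity, 2p(1 - p), where p is the proportion of positive responses in a node, and the usual size-weighted decrease for a split. It is an assumption that this matches the notation on the slide. Mean decrease in impurity then aggregates these decreases over all splits on a given feature across the forest.

```python
def gini(labels):
    """Gini impurity of a node with binary (0/1) responses: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # proportion of positive responses
    return 2 * p * (1 - p)


def gini_decrease(parent, left, right):
    """Decrease in impurity from splitting `parent` into `left` and `right` child nodes."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


# A perfectly separating split removes all impurity from a balanced node.
balanced = gini([0, 0, 1, 1])
perfect_split = gini_decrease([0, 0, 1, 1], [0, 0], [1, 1])
```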
Iterative re-weighting stabilizes random forest decision paths

Gini importance
Iteration 1
Iteration K

Feature weights
Iterative re-weighting helps recover high-order interactions


Generalized random intersection trees
Encoding decision paths to extract active features

Active
Inactive
Continuous measurements
Binary features

Encoding decision paths to extract enriched and depleted features

Enriched
Depleted
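A rough sketch of the two encodings, assuming a decision path is stored as a list of (feature, threshold, went_right) splits from root to leaf. This representation and the convention that "enriched" means the path followed the above-threshold branch are assumptions made for illustration.

```python
def active_features(path):
    """Binary encoding: the set of features used anywhere on the decision path."""
    return frozenset(feature for feature, _, _ in path)


def signed_features(path):
    """Signed encoding: pair each feature with '+' (enriched) or '-' (depleted).

    Assumption: '+' marks a split followed on the x > threshold side,
    '-' marks a split followed on the x <= threshold side.
    """
    return frozenset(
        (feature, "+" if went_right else "-") for feature, _, went_right in path
    )


# Hypothetical path: TF_A above 0.4, then TF_B at or below 1.2.
path = [("TF_A", 0.4, True), ("TF_B", 1.2, False)]
active = active_features(path)
signed = signed_features(path)
```

Either encoding turns continuous measurements into the kind of feature-index sets that an intersection-based search can operate on.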
Generalized random intersection trees search for high-order interactions




1. Iteratively re-weighted random forests
2. Decision path feature transformation
3. RIT on random forest decision paths
Runtime comparison between iRF and RuleFit

Evaluating interactions
Summary of importance measures for high-order interactions
- Prevalence: how frequently is an interaction observed among positive responses?
- Precision: how accurately does an interaction predict positive responses?
- Stability: how frequently is an interaction recovered across outer layer bootstrap samples?
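The first two measures can be sketched for a single tree as follows. This is a simplified reading of the definitions above, not the exact estimator: each leaf is summarized as a hypothetical tuple (features on its decision path, number of positive responses, number of observations).

```python
def prevalence(interaction, leaves):
    """Share of positive responses that fall in leaves whose path contains the interaction."""
    pos_total = sum(n_pos for _, n_pos, _ in leaves)
    pos_with = sum(n_pos for feats, n_pos, _ in leaves if interaction <= feats)
    return pos_with / pos_total if pos_total else 0.0


def precision(interaction, leaves):
    """Share of positive responses among observations in leaves containing the interaction."""
    n_with = sum(n for feats, _, n in leaves if interaction <= feats)
    pos_with = sum(n_pos for feats, n_pos, _ in leaves if interaction <= feats)
    return pos_with / n_with if n_with else 0.0


# Hypothetical leaves: (path features, positive responses, observations).
leaves = [
    (frozenset({"A", "B"}), 8, 10),
    (frozenset({"A"}), 2, 10),
    (frozenset({"C"}), 0, 20),
]
ab = frozenset({"A", "B"})
a = frozenset({"A"})
```

In this toy example the pair {A, B} covers 8 of the 10 positive responses and is right 8 times out of 10, while {A} alone covers every positive response but is right only half the time.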
Prevalence measures the stability of an interaction across an RF
Prevalence:

Examples:
Precision measures the predictive accuracy of an interaction across an RF
Precision:


Examples:
Bagging evaluates stability of interactions across entire iRF workflow relative to resampling








1. Iteratively re-weighted RF stabilize decision paths
2. gRIT searches for high-order interactions along decision paths
3. Importance metrics evaluate interactions in fitted RF
Outer layer bootstrap samples
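The outer layer can be sketched as a loop that reruns the whole search on each bootstrap sample and records how often each interaction comes back. The callable `find_interactions` is a hypothetical stand-in for steps 1 to 3 above.

```python
import random


def stability_scores(data, find_interactions, n_bootstrap=30, seed=0):
    """Fraction of outer-layer bootstrap samples in which each interaction is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_bootstrap):
        # Resample observations with replacement.
        sample = [rng.choice(data) for _ in data]
        # `find_interactions` is assumed to return an iterable of frozensets.
        for interaction in set(find_interactions(sample)):
            counts[interaction] = counts.get(interaction, 0) + 1
    return {interaction: c / n_bootstrap for interaction, c in counts.items()}


# Toy stand-in that always reports the same interaction, so its stability is 1.0.
def always_ab(sample):
    return [frozenset({"A", "B"})]


scores = stability_scores(list(range(10)), always_ab, n_bootstrap=5)
```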
Case studies using iRF
Predicting enhancer activity in early stage Drosophila embryos
Enhancers: Pfeiffer et al. 2008, Fisher et al. 2012, Kvon et al. 2014
ChIP: MacArthur et al. 2009, Li et al. 2008
- 7809 loci representing ~14% of the non-coding genome
- Features: 24 TF ChIP assays, stage 4-6 blastoderm embryos
- Response: enhancer activity (0: inactive, 1: active)

Iterative random forests recover well-known pairwise interactions

Novel order-3 interactions exhibit AND-like behavior (Gt, Kr, Zld)
[Figure panels: Zld low, Zld high; axes: Gt, Kr]


Predicting gap gene interactions from spatial gene expression data


Nobel Prize in Physiology or Medicine (1995): Lewis, Nüsslein-Volhard, and Wieschaus

Fowlkes et al. (2008)
Drosophila 3D expression atlas
siRF recovers all spatially local gap gene interactions



iRF-enabled genome-wide epistasis studies (GWES)
Joint work with Merle Behr, James B. Brown, and Bin Yu
Acknowledgements




S. Basu
J. Brown
B. Yu
S. Celniker
E. Frise




