Iterative random forests (iRF) to guide biological discovery

Karl Kumbier

Chan Zuckerberg Meeting

March 14, 2019

The PCS framework for reliable and reproducible data science

Predictability: Does my model reflect external reality?
Computability: Can I tractably build/train my model?
Stability: Are my results consistent with respect to "reasonable" perturbations of the data/model?

Joint work with Bin Yu

Generating domain insights through supervised learning

Natural phenomena

Supervised learning

Domain insights

Outline

From genomic to statistical interactions
Market baskets and genomics
Iterative random forests (iRF)
Case studies using iRF
iRF-enabled genome-wide epistasis studies (GWES)

From genomic to statistical interactions

Embryonic development in Drosophila

Time windows shown: 0-1:20, 1:20-3:00, 3:00-3:40, 3:40-5:20, 5:20-9:00, and 9:20-16:00 hours

image: Volker Hartenstein

images: BDGP

Kr expression

High-order interactions at enhancer elements drive embryonic development

Goto et al. (1989), Harding et al. (1989), Small et al. (1992), Ilsley et al. (2013), Levine et al. (2013)

Identifying regulatory interactions from high-throughput genomic data

Regulatory elements (e.g. enhancers)

Whole-embryo ChIP-chip/ChIP-seq measurements of transcription factor (TF) DNA binding

From genomic to statistical interactions

r(\mathbf{x}) = \prod_{j\in\text{A}} 1(x_j > t_j) \cdot \prod_{j\in\text{R}} 1(x_j \le t_j)

A: activators, R: repressors

\mathbf{x}=(x_1, \dots, x_p): DNA binding for p transcription factors (TFs) at a segment of the genome

Order-s interaction: s = #activators + #repressors
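A minimal sketch of how such a thresholding rule could be evaluated. The feature names follow the slide's x_{bcd} and x_{cad}; the binding levels and thresholds are hypothetical values chosen only for illustration.

```python
from typing import Mapping


def rule(x: Mapping[str, float],
         activators: Mapping[str, float],
         repressors: Mapping[str, float]) -> int:
    """Evaluate r(x): 1 if every activator is above its threshold and
    every repressor is at or below its threshold, else 0."""
    above = all(x[j] > t for j, t in activators.items())
    below = all(x[j] <= t for j, t in repressors.items())
    return int(above and below)


def order(activators: Mapping[str, float],
          repressors: Mapping[str, float]) -> int:
    """Order s of the interaction: #activators + #repressors."""
    return len(activators) + len(repressors)


# Hypothetical binding levels and thresholds t_j (not from the talk).
x = {"bcd": 0.9, "cad": 0.1, "kr": 0.4}
A = {"bcd": 0.5}   # activator: requires x_bcd > 0.5
R = {"cad": 0.3}   # repressor: requires x_cad <= 0.3

active = rule(x, A, R)
s = order(A, R)
```

Here the rule fires because bcd binding exceeds its threshold while cad binding stays below its own; raising x["cad"] above 0.3 switches the rule off.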

Figure labels: r(\mathbf{x}) = 1; x_{bcd}, x_{cad}, \dots

Thresholding rules define expression domains

Chopra and Levine (2009)

Dl +

Dl -

Wolpert (1968), Jaeger and Reinitz (2006), Chopra and Levine (2009), Zinzen et al. (2009), Knowles and Biggin (2013), Levine (2013), Staller et al. (2015), ...

Jaeger and Reinitz (2009)

RuleFit: rule-based interaction discovery (Friedman and Popescu, 2008)

Identify a collection of marginally important features
Search for predictive order-2 rules among marginally important features

Computational costs grow as O(p^s)

Misses interactions with weak marginal effects
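To give a feel for the O(p^s) growth, the sketch below counts the order-s feature subsets an exhaustive search would face. It assumes every size-s subset of p features is a candidate, which is a simplification and not a description of RuleFit's actual search; p = 24 matches the number of TF ChIP assays in the enhancer case study.

```python
from math import comb


def n_candidate_subsets(p: int, s: int) -> int:
    """Number of distinct order-s feature subsets among p features."""
    return comb(p, s)


# Candidate counts for p = 24 features at increasing interaction order.
counts = {s: n_candidate_subsets(24, s) for s in (2, 3, 5)}
```

For fixed s the count C(p, s) grows on the order of p^s, so even a modest number of features makes exhaustive search over high-order rules expensive.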

image: Lee and Haber (2014)

Market baskets and genomics

Interactions in market baskets

What combinations of items do customers purchase together?

What combinations of items do different types of customers purchase together?

Interactions in market baskets

Z_1, \dots, Z_4: class labels

\mathcal{I}_1, \dots, \mathcal{I}_4: feature-index sets

Random intersection trees (RIT)

Shah and Meinshausen (2014)

Leverage sparsity in market baskets to search for frequently co-occurring items in a computationally efficient manner

Randomly sample feature index sets from class-C observations: \{\mathcal{I}_i \subseteq\{1, \dots, p\}: Z_i = C\}
Intersect sampled feature index sets in a tree-like fashion up to depth D
Return all feature combinations that "survive" the intersection procedure up to depth D
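The three steps above can be sketched as follows. This is a simplified illustration, not the reference implementation of Shah and Meinshausen (2014): the fixed branching factor `n_children`, the number of trees, and the convention that the root sits at depth 0 are all assumptions, and the toy data are hypothetical.

```python
import random
from typing import FrozenSet, Sequence, Set


def random_intersection_trees(
    index_sets: Sequence[FrozenSet[int]],
    labels: Sequence[int],
    target_class: int,
    depth: int = 3,
    n_children: int = 2,
    n_trees: int = 20,
    seed: int = 0,
) -> Set[FrozenSet[int]]:
    """Return feature combinations that survive `depth` rounds of
    intersection with randomly sampled class-C feature-index sets."""
    rng = random.Random(seed)
    # Step 1: restrict to feature-index sets of class-C observations.
    pool = [s for s, z in zip(index_sets, labels) if z == target_class]
    survivors: Set[FrozenSet[int]] = set()
    for _ in range(n_trees):
        level = [rng.choice(pool)]  # root node: one sampled index set
        # Step 2: intersect in a tree-like fashion, level by level.
        for _ in range(depth):
            next_level = []
            for node in level:
                for _ in range(n_children):
                    child = node & rng.choice(pool)
                    if child:  # empty intersections die out
                        next_level.append(child)
            level = next_level
        # Step 3: keep whatever is still non-empty at the final depth.
        survivors.update(level)
    return survivors


# Hypothetical baskets: items 1 and 4 co-occur in every class-1 basket.
index_sets = [
    frozenset({1, 4, 7}), frozenset({1, 2, 4}),
    frozenset({1, 3, 4}), frozenset({1, 4, 5}),
    frozenset({2, 6}),
]
labels = [1, 1, 1, 1, 0]
found = random_intersection_trees(index_sets, labels, target_class=1)
```

Because each intersection can only shrink a set, combinations that are rare in class C are eliminated quickly, while frequently co-occurring items such as {1, 4} in this toy example tend to survive.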

Random intersection trees (RIT)

Shah and Meinshausen (2014)

Figure legend: randomly sampled class-C observation; "survived" interaction

Genomic response

Genomic features

Translating the market basket problem into genomics

Challenges:

Genomic features are typically measured in concentrations/counts
Binding does not imply regulation (Li et al. 2008)

Iterative random forests (iRF)

Signed iterative random forests (siRF)

Joint work with Sumanta Basu, James B. Brown, Susan Celniker, and Bin Yu

Iterative random forest to identify high-order interactions in genomic data

Iteratively re-weighted random forests stabilize decision paths
Generalized random intersection trees search for high-order interactions
Stability bagging evaluates interactions

Iterative random forests (iRF) build on PCS to identify genomic interactions in developing Drosophila embryos

Open source R implementation: https://cran.r-project.org/web/packages/iRF/

Iteratively re-weighted random forests

Random Forests

Breiman (2001)

Random forests modify CART to improve predictive accuracy:

CART trees are trained on bootstrap samples of the data
CART criterion evaluated on subset of features sampled uniformly at random

Feature-weighted random forests

Amaratunga et al. (2008)

Random forests:

At each node of the decision tree, uniformly sample a subset of features

Feature-weighted random forests:

At each node of the decision tree, sample a subset of features with probability proportional to feature weights w\in\mathbb{R}^p_+

Figure: feature weights w_1, \dots, w_5
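A minimal sketch of weighted feature sampling at a single node. Drawing features one at a time without replacement is an assumption made for illustration; the exact sampling scheme in Amaratunga et al. (2008) or in the iRF package may differ, and the weights below are hypothetical.

```python
import random
from typing import List, Sequence


def sample_features(weights: Sequence[float], mtry: int,
                    rng: random.Random) -> List[int]:
    """Draw `mtry` distinct feature indices; each draw picks a remaining
    feature with probability proportional to its weight."""
    candidates = [j for j, w in enumerate(weights) if w > 0]
    if mtry > len(candidates):
        raise ValueError("mtry exceeds the number of positively weighted features")
    chosen: List[int] = []
    for _ in range(mtry):
        w = [weights[j] for j in candidates]
        pick = rng.choices(candidates, weights=w, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # sample without replacement
    return chosen


rng = random.Random(0)
uniform = [1.0] * 5                    # standard random forest: equal weights
weighted = [0.5, 0.0, 0.3, 0.2, 0.0]   # hypothetical feature weights w
draws = [sample_features(weighted, 2, rng) for _ in range(200)]
```

With uniform weights this reduces to the usual random forest feature subsampling; with zero weights on some features, those features are never offered as split candidates.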

The CART criterion: Gini impurity

Parent node: (\pi, N); left child: (\pi_l, N_l); right child: (\pi_r, N_r)

\pi: proportion of positive responses; N: number of observations

Gini impurity:

I_G(\pi) = \pi (1-\pi)

Decrease in Gini impurity:

I_G(\pi)-\frac{N_l}{N}I_G(\pi_l) - \frac{N_r}{N}I_G(\pi_r)

Mean decrease in impurity:

On average, how much does splitting on a variable decrease the Gini impurity?
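The two formulas above translate directly into code. The node proportions and sizes in the examples are hypothetical and chosen to show the two extremes: a split that separates the classes perfectly and one that does nothing.

```python
def gini(pi: float) -> float:
    """Gini impurity I_G(pi) = pi * (1 - pi) for a binary response."""
    return pi * (1.0 - pi)


def gini_decrease(pi: float, n: int,
                  pi_l: float, n_l: int,
                  pi_r: float, n_r: int) -> float:
    """Decrease in Gini impurity from splitting a parent node (pi, N)
    into children (pi_l, N_l) and (pi_r, N_r)."""
    return gini(pi) - (n_l / n) * gini(pi_l) - (n_r / n) * gini(pi_r)


# A balanced parent node of 10 observations, split two different ways.
perfect = gini_decrease(0.5, 10, 1.0, 5, 0.0, 5)   # children are pure
useless = gini_decrease(0.5, 10, 0.5, 5, 0.5, 5)   # children mirror the parent
```

Averaging these decreases over every node where a feature is used for splitting gives that feature's mean decrease in impurity.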

Iterative re-weighting stabilizes random forest decision paths

Figure: Gini importance and feature weights w_1, \dots, w_5, from Iteration 1 to Iteration K
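A sketch of the re-weighting loop, assuming that the normalized Gini importance from one iteration becomes the feature weights of the next and that the first iteration uses uniform weights. `fit_forest` is a hypothetical callable standing in for "fit a feature-weighted random forest and return its Gini importance"; `toy_fit` is a made-up stand-in, not a real forest.

```python
from typing import Callable, List, Sequence


def iterate_weights(
    fit_forest: Callable[[Sequence[float]], Sequence[float]],
    p: int,
    n_iter: int,
) -> List[List[float]]:
    """Run `n_iter` rounds of fitting and return the weights used in each."""
    weights = [1.0 / p] * p  # iteration 1: uniform, i.e. a standard RF
    history: List[List[float]] = []
    for _ in range(n_iter):
        history.append(list(weights))
        importance = fit_forest(weights)
        total = sum(importance)
        # Next iteration samples features in proportion to importance.
        weights = [g / total for g in importance]
    return history


def toy_fit(weights: Sequence[float]) -> List[float]:
    """Hypothetical stand-in: features 0 and 2 carry more signal, and
    importance scales with how often a feature is offered for splitting."""
    signal = [3.0, 1.0, 3.0, 1.0, 1.0]
    return [w * s for w, s in zip(weights, signal)]


history = iterate_weights(toy_fit, p=5, n_iter=4)
```

In this toy setting the weights concentrate on the two informative features over iterations, which is the behavior the slide attributes to iterative re-weighting.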

Iterative re-weighting helps recover high-order interactions

Generalized random intersection trees

Encoding decision paths to extract active features

Figure: continuous measurements are mapped to binary features (active / inactive)

Encoding decision paths to extract enriched and depleted features

Figure labels: enriched, depleted
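One way to picture the encoding, as a sketch under stated assumptions: a decision path is represented as a list of (feature, threshold) splits, a feature is "active" if it appears on the path, and it is "enriched" when the observation lies above the split threshold and "depleted" otherwise. The `Split` type and the "+" / "-" labels are hypothetical conventions for this example.

```python
from typing import FrozenSet, Mapping, NamedTuple, Sequence, Tuple


class Split(NamedTuple):
    feature: int      # index of the feature tested at this node
    threshold: float  # split threshold t_j


def active_features(path: Sequence[Split]) -> FrozenSet[int]:
    """Unsigned encoding: features that appear on the decision path."""
    return frozenset(s.feature for s in path)


def signed_features(path: Sequence[Split],
                    x: Mapping[int, float]) -> FrozenSet[Tuple[int, str]]:
    """Signed encoding: '+' (enriched) if x_j > t_j at the split,
    '-' (depleted) otherwise. A feature split on twice in opposite
    directions would appear with both signs."""
    return frozenset(
        (s.feature, "+" if x[s.feature] > s.threshold else "-")
        for s in path
    )


# Hypothetical path through one tree for one observation.
path = [Split(feature=1, threshold=0.5), Split(feature=4, threshold=0.2)]
x = {1: 0.9, 4: 0.1}

unsigned = active_features(path)
signed = signed_features(path, x)
```

The unsigned set is the kind of feature-index set that RIT intersects; the signed version additionally records the direction of each split.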

Generalized random intersection trees search for high-order interactions

1. Iteratively re-weighted random forests

2. Decision path feature transformation: \mathbf{x}_1 \rightarrow \mathcal{I}_1, \dots, \mathbf{x}_n \rightarrow \mathcal{I}_n

3. RIT on random forest decision paths

Figure labels: \cap, \{1,4\}, \emptyset

Runtime comparison between iRF and RuleFit

Evaluating interactions

Summary of importance measures for high-order interactions

Prevalence, P(S|C): how frequently is an interaction observed among positive responses?

Precision, P(C|S): how accurately does an interaction predict positive responses?

Stability, sta(S): how frequently is an interaction recovered across outer layer bootstrap samples?

Prevalence measures the stability of an interaction across an RF

Prevalence:

P(S|y=1)=\frac{1}{T} \sum_{t=1}^T \frac{\sum_{i=1}^n I(S\subseteq \mathcal{I}_{i_t}) \cdot I(y_i = 1)}{\sum_{i=1}^n I(y_i = 1)}

Examples:

P(\{1, 4\}|y = 1) = 4/5

P(\{1, 3, 4\}|y = 1) = 3/5

Precision measures the predictive accuracy of an interaction across an RF

Precision:

P(y=1|S)=\frac{1}{T} \sum_{t=1}^T \frac{\sum_{i=1}^n I(S\subseteq \mathcal{I}_{i_t}) \cdot I(y_i = 1)}{\sum_{i=1}^n I(S\subseteq \mathcal{I}_{i_t})}

Examples:

P(y=1|\{1, 4\}) = 4/6

P(y=1|\{1, 3, 4\}) = 3/4
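The prevalence and precision formulas can be computed as in the sketch below. The feature-index sets are hypothetical toy data for a single tree (T = 1), constructed so that the results match the example values on the slides; the slide's underlying figure data are not available. Treating a tree in which S never occurs as contributing 0 to precision is an assumption.

```python
from fractions import Fraction
from typing import FrozenSet, Sequence


def prevalence(S: FrozenSet[int],
               index_sets: Sequence[Sequence[FrozenSet[int]]],
               y: Sequence[int]) -> Fraction:
    """P(S | y=1): average over trees of the fraction of positive
    observations whose feature-index set contains S."""
    n_pos = sum(y)
    total = Fraction(0)
    for sets_t in index_sets:  # index_sets[t][i] is I_{i_t}
        hits = sum(1 for I, yi in zip(sets_t, y) if S <= I and yi == 1)
        total += Fraction(hits, n_pos)
    return total / len(index_sets)


def precision(S: FrozenSet[int],
              index_sets: Sequence[Sequence[FrozenSet[int]]],
              y: Sequence[int]) -> Fraction:
    """P(y=1 | S): average over trees of the fraction of observations
    containing S that are positive."""
    total = Fraction(0)
    for sets_t in index_sets:
        covered = sum(1 for I in sets_t if S <= I)
        hits = sum(1 for I, yi in zip(sets_t, y) if S <= I and yi == 1)
        if covered:  # assumption: trees without S contribute 0
            total += Fraction(hits, covered)
    return total / len(index_sets)


# Hypothetical toy data: one tree, five positives, four negatives.
tree = [frozenset(s) for s in (
    {1, 3, 4}, {1, 3, 4}, {1, 3, 4}, {1, 4}, {2},   # y = 1
    {1, 3, 4}, {1, 4}, {2, 5}, {3},                 # y = 0
)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0]

prev_14 = prevalence(frozenset({1, 4}), [tree], y)
prev_134 = prevalence(frozenset({1, 3, 4}), [tree], y)
prec_14 = precision(frozenset({1, 4}), [tree], y)
prec_134 = precision(frozenset({1, 3, 4}), [tree], y)
```

The example shows the trade-off between the two measures: the lower-order interaction {1, 4} is more prevalent, while the higher-order interaction {1, 3, 4} is more precise.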

Bagging evaluates stability of interactions across entire iRF workflow relative to resampling

For each outer layer bootstrap sample:

1. Iteratively re-weighted RF stabilize decision paths

2. gRIT searches for high-order interactions along decision paths

3. Importance metrics, e.g. P(S|y=1), evaluate interactions in fitted RF

Figure labels: \cap, \{1,4\}, \{1,3\}, \emptyset, \dots
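A minimal sketch of a stability score, assuming sta(S) is the fraction of outer layer bootstrap samples in which the interaction S is recovered exactly. The recovered sets below are hypothetical outputs of four bootstrap runs, not results from the talk.

```python
from fractions import Fraction
from typing import FrozenSet, Sequence, Set


def stability(S: FrozenSet[int],
              recovered: Sequence[Set[FrozenSet[int]]]) -> Fraction:
    """Fraction of bootstrap samples whose recovered interactions include S."""
    return Fraction(sum(1 for rec in recovered if S in rec), len(recovered))


# Hypothetical interactions recovered in four outer layer bootstrap samples.
recovered = [
    {frozenset({1, 4})},
    {frozenset({1, 3})},
    set(),                 # nothing survived in this sample
    {frozenset({1, 4})},
]

sta_14 = stability(frozenset({1, 4}), recovered)
sta_13 = stability(frozenset({1, 3}), recovered)
```

Each element of `recovered` would come from rerunning the full workflow (re-weighted forests, then gRIT) on one bootstrap sample, so the score reflects stability of the entire pipeline and not of a single fitted forest.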

Case studies using iRF

Predicting enhancer activity in early stage Drosophila embryos

Enhancers: Pfeiffer et al. 2008, Fisher et al. 2012, Kvon et al. 2014

ChIP: MacArthur et al. 2009, Li et al. 2008

7809 loci representing ~14% of the non-coding genome
\mathbf{x} = (x_1, x_2, \dots): 24 TF ChIP assays, stage 4-6 blastoderm embryos
y: enhancer activity (0: inactive, 1: active)

Iterative random forests recover well-known pairwise interactions

Novel order-3 interactions exhibit AND-like behavior (Gt, Kr, Zld)

Figure: P(y=1) for Zld low vs. Zld high

Predicting gap gene interactions from spatial gene expression data

Nobel prize in physiology or medicine (1995):

Lewis, Nüsslein-Volhard, and Wieschaus

Fowlkes et al. (2008)

Drosophila 3D expression atlas

siRF recovers all spatially local gap gene interactions

iRF-enabled genome-wide epistasis studies (GWES)

Joint work with Merle Behr, James B. Brown, and Bin Yu

Acknowledgements

S. Basu

J. Brown

B. Yu

S. Celniker

E. Frise
