Tree-based automated machine learning for analyzing biomedical big data

Trang Lê

University of Pennsylvania

Virtual GECCO 2020

@trang1618

doi:10.1093/bioinformatics/btz470

What is autoML?

Clean data

Select features

Preprocess features

Construct features

Select classifier

Optimize parameters

Validate model

Pre-processed data

Automate

Typical supervised ML pipeline

Automate

TPOT is a an AutoML system that uses GP to
optimize the pipeline with the objective of

maximizing cross-validated score
minimizing pipeline complexity

Gijsbers et al. 2019

Entire data set

Olson & Moore, 2016

Entire data set

PCA

Polynomial features

Combine features

Select 10% best features

Support vector machines

Multiple copies of the data set can enter the pipeline for analysis

Pipeline operators modify the features

Modified data set flows through the pipeline operators

Final classification is performed on the final feature set

an example individual

GP primitive Feature selector & preprocessor, Supervised classifier/regressor

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    train_size = 0.75, test_size = 0.25, random_state = 42)

tpot = TPOTClassifier(generations = 5,population_size = 50,
                      verbosity = 2, random_state = 42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv(
  'PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
  train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9826086956521738
exported_pipeline = make_pipeline(
  Normalizer(norm="l2"),
  KNeighborsClassifier(n_neighbors=5, p=2, weights="distance")
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

but... there's a caveat

primitives/operators

feature selectors
feature preprocessors
supervised classifiers/regressors
feature set selector (FSS)

parameters

generations
population_size
offspring_size
mutation_rate
...
...
template

new in this work

template:

FeatureSetSelector-Transformer-Classifier

Template + feature set selector

from tpot.config import classifier_config_dict

classifier_config_dict['tpot.builtins.FeatureSetSelector'] = {
    'subset_list': ['subsets.csv'],
    'sel_subset': range(19)
}

tpot = TPOTClassifier(
  generations = 100, population_size = 100,
  verbosity = 2, random_state = 42, early_stop = 10, 
  config_dict = classifier_config_dict,
  template = 'FeatureSetSelector-Transformer-Classifier')

For each training and testing set

m = 200 observations (Outcome: 100 cases and 100 controls)
p = 5 000 real-valued features
- 4% functional: true positive association with outcome
- Subsetting: functional features more likely to be included in earlier subsets (a lot of functional features in \(S_1\), some in \(S_2\), very few to 0 in \(S_{10}, ..., S_{20}\))

100 replicates of TPOT runs

Data simulation

template:

FeatureSetSelector-Transformer-Classifier

The optimal pipelines from TPOT-FSS significantly outperform those of XGBoost and standard TPOT.

In 100 replications, FSS correctly selects subset \(S_1\) 75 times.

78 individuals with MDD
79 healthy controls (HCs)
19 968 protein-coding genes → 5 912 transcripts
Feature sets: 23 depression gene modules (DGMs)

Real-world RNA-Seq expression data

template:

FeatureSetSelector-Transformer-Classifier

The outperformance is still statistically significant.

Pipelines that include DGM-5, on average, produce higher MDD prediction accuracies in the holdout set.

TPOT-FSS is the first AutoML tool to offer the option of feature selection at the group level.
FSS can identify the most meaningful group of features to include in the prediction pipeline.
Reducing pipeline complexity can lead to considerable increase in interpretability and generalizability.

Future works

extend to GWAS
re-formulate complexity
- number of features used in pipeline
- each operator: flexibility, runtime
- assess over-fitting

Jason Moore
Weixuan Fu
Alena Orlenko
Nadia Penrod
Elisabetta Manduchi
Bill La Cava
Ruowang Li

https://epistasislab.github.io/tpot/

https://slides.com/trang1618/tpot-fss/

Thanks!

Tree-based automated machine learning for analyzing biomedical big data

By Trang Le

Tree-based automated machine learning for analyzing biomedical big data

Presentation recorded on 2020-06-10 at GECCO, virtual conference

Trang Le

#math graduate. Postdoc fellow with Jason Moore.

Tree-based automated machine learning for analyzing biomedical big data

Trang Lê

What is autoML?

Typical supervised ML pipeline

an example individual

but... there's a caveat

primitives/operators

parameters

Template + feature set selector

Data simulation

In 100 replications, FSS correctly selects subset \(S_1\) 75 times.

Real-world RNA-Seq expression data

The outperformance is still statistically significant.

Pipelines that include DGM-5, on average, produce higher MDD prediction accuracies in the holdout set.

Future works

Thanks!

Tree-based automated machine learning for analyzing biomedical big data

More from Trang Le