Happy Day!!!

TPOT: A Tree-based Pipeline Optimization Tool

Trang Lê

  • mathematician
  • postdoctoral researcher @UPenn IBI
  • amateur runner
  • @trang1618

Typical ML pipeline

[Figure: Raw data → Clean data → Select features → Preprocess features → Construct features → Select classifier → Optimize parameters → Validate model. An "Automate" bracket marks the steps from feature selection through parameter optimization, the portion of the pipeline TPOT automates.]

Open source AutoML tools

  • auto-sklearn (Python) github.com/automl/auto-sklearn
    • Bayesian optimization over a fixed 3-step ML pipeline
  • Auto-WEKA (Java) github.com/automl/autoweka
    • similar to auto-sklearn, built on top of Weka
  • H2O.ai (Java, w/ Python, Scala, R, and web GUI interfaces) github.com/h2oai/h2o-3
    • basic data prep w/ grid/random search over ML algorithms
  • devol (Python) github.com/joeddav/devol
    • deep learning architecture search via GP

Randy Olson

  • built on DEAP (Distributed Evolutionary Algorithms in Python)

  • Objective:
    • maximize the pipeline's CV classification performance
    • minimize the pipeline's complexity
  • Pareto front maintained with NSGA-II
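
As a rough illustration of this two-objective setup (not TPOT's actual internals), here is how such a fitness and NSGA-II selection look in DEAP, the GP library TPOT builds on; the `evaluate` stub stands in for cross-validating a real pipeline:

```python
import random

from deap import base, creator, tools

# Two objectives: maximize CV score (+1.0), minimize complexity (-1.0).
creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

toolbox = base.Toolbox()
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 lambda: random.randint(0, 9), n=3)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def evaluate(ind):
    # Stand-in scores: in TPOT the first value would come from
    # cross-validation and the second from the pipeline's size.
    return random.random(), float(len(ind))

pop = toolbox.population(n=20)
for ind in pop:
    ind.fitness.values = evaluate(ind)

# NSGA-II ranks by Pareto dominance, keeping pipelines that trade
# accuracy against complexity rather than a single scalar winner.
survivors = tools.selNSGA2(pop, k=10)
```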

TPOT

Weixuan Fu

[Figure: an example tree-based pipeline. Two copies of the entire data set enter the pipeline; one branch applies PCA and the other constructs polynomial features; the two feature sets are combined, the k best features are selected, and logistic regression performs the final classification.]

  • Multiple copies of the data set can enter the pipeline for analysis
  • Pipeline operators modify the features
  • The modified data set flows through the pipeline operators
  • The final classification is performed on the final feature set
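
For concreteness, the pictured tree corresponds roughly to the following scikit-learn pipeline (the form TPOT exports to); the parameter values here are illustrative, not tuned:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two views of the same data set (PCA components and polynomial
# features) are combined, the k best features are kept, and
# logistic regression performs the final classification.
pipeline = Pipeline([
    ("combine_features", FeatureUnion([
        ("pca", PCA(n_components=5)),
        ("poly", PolynomialFeatures(degree=2)),
    ])),
    ("select_k_best", SelectKBest(score_func=f_classif, k=10)),
    ("classifier", LogisticRegression()),
])
```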

Genetic programming

  • GP primitives: dataset selectors, feature selectors & preprocessors, supervised classifiers
  • Population: sequences of pipeline operators
  • Evolved over generations via mutation and crossover

Mutation and crossover

[Figure: (a) insertion mutation, (b) deletion mutation, (c) swap mutation, (d) substitution mutation, (e) crossover]
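
A toy version of the five operators, on pipelines encoded simply as lists of operator names (a hypothetical encoding just to make the figure concrete; assumes each pipeline has at least two operators):

```python
import random

OPERATORS = ["PCA", "PolynomialFeatures", "SelectKBest",
             "StandardScaler", "LogisticRegression"]

def mutate(pipeline, kind):
    """Apply one of the pictured mutations; assumes len(pipeline) >= 2."""
    p = list(pipeline)
    if kind == "insertion":        # (a) add a random operator
        p.insert(random.randrange(len(p) + 1), random.choice(OPERATORS))
    elif kind == "deletion":       # (b) remove a random operator
        p.pop(random.randrange(len(p)))
    elif kind == "swap":           # (c) exchange two positions
        i, j = random.sample(range(len(p)), 2)
        p[i], p[j] = p[j], p[i]
    elif kind == "substitution":   # (d) replace one operator with another
        p[random.randrange(len(p))] = random.choice(OPERATORS)
    return p

def crossover(a, b):               # (e) exchange tails at a random cut point
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```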

TPOT configs

  • Default TPOT
  • TPOT light
  • TPOT sparse
  • TPOT-MDR (Multifactor Dimensionality Reduction)
  • each available for classification and regression
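
A built-in configuration is chosen by name through TPOT's documented config_dict option, e.g.:

```python
from tpot import TPOTClassifier

# 'TPOT light' restricts the search to fast, low-complexity operators;
# 'TPOT MDR' and 'TPOT sparse' are selected the same way.
tpot = TPOTClassifier(generations=5, population_size=20,
                      config_dict='TPOT light', verbosity=2)
```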

Template + Dataset Selector
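
A template constrains every evolved pipeline to a fixed shape, and the dataset selector (tpot.builtins.FeatureSetSelector) can be named in the template once its feature subsets are configured. A minimal sketch with illustrative settings:

```python
from tpot import TPOTClassifier

# Every candidate pipeline must be a feature selector, then a
# transformer, then a classifier; GP searches only within this shape.
tpot = TPOTClassifier(template='Selector-Transformer-Classifier',
                      generations=5, population_size=20, verbosity=2)
```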

Mutation restriction

Complexity reformulation

  • Number of pipeline operators
    • Flexibility of each operator
    • Runtime
  • Number of features used in pipeline
  • Number of parameters
  • By assessing over-fitting: stability of the covariance of predictors, rank differences of importance metrics

Integration with neural nets

Challenges

  • preprocessing
  • scalability
  • computational expense

Live demo!
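
In place of the live demo, a minimal end-to-end run (illustrative settings; scikit-learn's digits data set stands in for whatever was demoed):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)              # evolve pipelines with GP
print(tpot.score(X_test, y_test))       # held-out accuracy
tpot.export('tpot_digits_pipeline.py')  # best pipeline as sklearn code
```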

Jason Moore

Weixuan Fu
