Automatic Machine Learning
Challenge & Lessons
http://automl.chalearn.org
Data selection
Data cleaning/augmentation
Other pre-processing
Feature engineering
Model selection
Hyperparameter optimization
And quite a bit of time spent trying and failing
until reaching an "acceptable" solution
[Diagram: Training Data → AutoML box → Trained model, queried on New Data]
[Diagram: Training Data → Data Scientist → Trained model, queried on New Data]
[Diagram: Training Data → Crowd intelligence + AutoML box → Trained model, queried on New Data]
SMAC: Sequential Model-Based Algorithm Configuration
repeat
    construct RF model to predict performance
    use that model to select promising configurations
    compare each selected configuration against the best known
until time budget exhausted
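The loop above can be sketched in a few lines of Python. This is an illustrative toy, not the real SMAC: the objective, the config space, and the candidate-sampling strategy are all hypothetical stand-ins, with a scikit-learn random forest as the surrogate.

```python
# Minimal sketch of a SMAC-style loop (illustrative, not the actual SMAC
# implementation): a random-forest surrogate predicts the performance of
# configurations, and promising candidates are compared to the incumbent.
import random
from sklearn.ensemble import RandomForestRegressor

def evaluate(cfg):
    # Hypothetical objective: tune x to minimize (x - 0.3)^2.
    return (cfg["x"] - 0.3) ** 2

def sample_config():
    return {"x": random.uniform(0, 1)}

history = []                          # (config, observed cost) pairs
incumbent = sample_config()
incumbent_cost = evaluate(incumbent)
history.append((incumbent, incumbent_cost))

for _ in range(30):                   # stands in for "until budget exhausted"
    # Construct an RF model to predict performance.
    X = [[c["x"]] for c, _ in history]
    y = [cost for _, cost in history]
    surrogate = RandomForestRegressor(n_estimators=10).fit(X, y)

    # Use that model to select a promising configuration.
    candidates = [sample_config() for _ in range(100)]
    preds = surrogate.predict([[c["x"]] for c in candidates])
    challenger = candidates[preds.argmin()]

    # Compare the selected configuration against the best known.
    cost = evaluate(challenger)
    history.append((challenger, cost))
    if cost < incumbent_cost:
        incumbent, incumbent_cost = challenger, cost

print(incumbent, incumbent_cost)
```

The real SMAC additionally handles categorical/conditional parameters, races challengers across instances, and uses an acquisition function rather than the raw surrogate prediction.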
Hyperparameter optimization library: automl.org/hpolib
Benchmarks
Optimizers
Results
Feature selection
In total: 768 parameters, 10^47 configurations
The Auto-WEKA approach applied to scikit-learn
Improvements
Scikit-learn [Pedregosa et al, 2011-current]
instead of WEKA [Witten et al, 1999-current]
110 hyperparameters vs. 768 in Auto-WEKA
Meta-learning provides better models earlier
=> Ensembling can start being helpful earlier
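The post-hoc ensembling referred to here is greedy ensemble selection in the style of Caruana et al., built from the models found during the search. Below is a hedged sketch of that idea; the function name, toy labels, and probability arrays are all illustrative assumptions, not the actual auto-sklearn code.

```python
# Sketch of greedy ensemble selection (Caruana-style): repeatedly add, with
# replacement, the model that most reduces validation error of the averaged
# ensemble. Toy data only; not auto-sklearn's implementation.
import numpy as np

def ensemble_selection(predictions, y_true, n_rounds=10):
    """predictions: list of per-model class-1 probability arrays."""
    chosen = []
    for _ in range(n_rounds):
        best_i, best_err = None, float("inf")
        for i, p in enumerate(predictions):
            # Error if model i were added to the current ensemble.
            avg = np.mean([predictions[j] for j in chosen] + [p], axis=0)
            err = np.mean((avg > 0.5).astype(int) != y_true)
            if err < best_err:
                best_i, best_err = i, err
        chosen.append(best_i)
    return chosen

# Toy validation labels and three models' predicted probabilities.
y = np.array([0, 1, 1, 0, 1])
preds = [
    np.array([0.2, 0.9, 0.4, 0.1, 0.8]),  # decent model
    np.array([0.6, 0.7, 0.8, 0.7, 0.9]),  # biased toward class 1
    np.array([0.1, 0.2, 0.9, 0.3, 0.4]),  # weak model
]
selected = ensemble_selection(preds, y, n_rounds=5)
print(selected)
```

Because selection is with replacement, the same model can appear several times, which weights it more heavily in the averaged prediction.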
Sensible allocation of computation for
ensemble construction for multi class classification
An extension of Freeze-Thaw Bayesian Optimization to ensemble construction
Make use of the partial information gained during the training of a machine learning model in order to decide whether to continue training it ("thaw") or pause it ("freeze") and allocate the budget to other models.
An extension of Freeze-Thaw Bayesian Optimization to ensemble construction
Components of the algorithm:
- Infinite mixture of exponential decays GP
- Standard smooth GP
- Entropy search
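The key modeling assumption behind Freeze-Thaw is that loss curves decay roughly exponentially toward an asymptote, so a partially observed curve can be extrapolated to decide whether further training is worthwhile. The sketch below fits a single parametric decay with SciPy instead of the paper's infinite mixture-of-decays GP; the toy curve and starting values are assumptions for illustration.

```python
# Illustrative learning-curve extrapolation (a stand-in for the mixture of
# exponential decays GP): fit y(t) = a*exp(-b*t) + c to a partially observed
# loss curve and read off the predicted asymptote c.
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, a, b, c):
    return a * np.exp(-b * t) + c          # c is the predicted asymptote

t = np.arange(1, 11)                        # 10 observed epochs
loss = 2.0 * np.exp(-0.4 * t) + 0.5         # noiseless toy loss curve

params, _ = curve_fit(exp_decay, t, loss, p0=(1.0, 0.1, 0.0))
a, b, c = params
print(f"predicted asymptotic loss: {c:.3f}")
```

If the extrapolated asymptote is not competitive with other candidates, the run can be frozen and the budget reallocated; a promising frozen run can later be thawed.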
An extension of Freeze-Thaw Bayesian Optimization to ensemble construction
Components of the algorithm:
- Most of scikit-learn
- A time-pressured hack: decision trees
- Cross-validation
- Mixture of exponential decays GP
- Stacking
Hyperparameters, models, feature engineering, data pre-processing
But also 3 Nvidia Titan X GPUs
NIPS 2015
Hackathon team
Marc Boullé
Lukasz Romaszco
Sébastian Treger
Emilia Vaajoensuu
Philippe Vandermersch
Software development
Eric Carmichael
Ivan Judson
Christophe Poulain
Percy Liang
Arthur Pesah
Xavier Baro Solé
Lukasz Romaszco
Michael Zyskowski
Codalab management
Evelyne Viegas
Percy Liang
Erick Watson
Advisors and beta testers
Kristin Bennett
Marc Boullé
Cecile Germain
Cecile Capponi
Richard Caruana
Gavin Cawley
Gideon Dror
Sergio Escalera
Tin Kam Ho
Balasz Kégl
Hugo Larochelle
Víctor Ponce López
Nuria Macia
Simon Mercer
Florin Popescu
Michèle Sebag
Danny Silver
Many thanks to Isabelle Guyon and all contributors
Data providers
Yindalon Aphinyanaphongs
Olivier Chapelle
Hugo Jair Escalante
Sergio Escalera
Zainab Iftikhar Malhi
Vincent Lemaire
Chih Jen Lin
Meysam Madani
Bisakha Ray
Mehreen Saeed
Alexander Statnikov
Gustavo Stolovitzky
H-J. Thiesen
Ioannis Tsamardinos
http://automl.chalearn.org
Further details
Sébastien Treguer
@ST4Good
Contact
Participation
http://codalab.org/AutoML