Data science for lazy people, Automated Machine Learning
Diego Hueltes
Big Data Spain 2017
Original image from https://github.com/rhiever/tpot
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Data cleaning
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4297412/giphy__1_.gif)
Data cleaning
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4295992/Screen_Shot_2017-11-04_at_17.44.33.png)
Data cleaning
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4296079/__results___23_0.png)
Data cleaning
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4296065/Unknown-1.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Feature selection
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4307497/hm.png)
Feature selection
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4295609/download.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Feature preprocessing
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4296164/about_standardization_normalization_44_0.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Feature construction
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4302918/giphy__2_.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Model selection
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4302929/plot_ols.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4295606/svmimage.png)
Model selection
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4302928/Decision-Tree-Example.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Parameter optimization
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4297409/source.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
Model validation
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4303065/50interval.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
lazy Oxford dictionary
Unwilling to work or use energy
lazy Oxford dictionary
Unwilling to work or use energy
in repetitive tasks
Diego dictionary
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776955/ml-pipeline.png)
naive_bayes.GaussianNB
naive_bayes.BernoulliNB
naive_bayes.MultinomialNB
tree.DecisionTreeClassifier
ensemble.ExtraTreesClassifier
ensemble.RandomForestClassifier
ensemble.GradientBoostingClassifier
neighbors.KNeighborsClassifier
svm.LinearSVC
linear_model.LogisticRegression
xgboost.XGBClassifier
preprocessing.Binarizer
decomposition.FastICA
cluster.FeatureAgglomeration
cluster.FeatureAgglomeration
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
kernel_approximation.Nystroem
decomposition.PCA
preprocessing.PolynomialFeatures
kernel_approximation.RBFSampler
preprocessing.RobustScaler
preprocessing.StandardScaler
tpot.builtins.ZeroCount
feature_selection.SelectFwe
feature_selection.SelectPercentile
feature_selection.VarianceThreshold
feature_selection.RFE
feature_selection.SelectFromModel
linear_model.ElasticNetCV
linear_model.ElasticNetCV
ensemble.ExtraTreesRegressor
ensemble.GradientBoostingRegressor
ensemble.AdaBoostRegressor
tree.DecisionTreeRegressor
neighbors.KNeighborsRegressor
linear_model.LassoLarsCV
svm.LinearSVR
ensemble.RandomForestRegressor
linear_model.RidgeCV
xgboost.XGBRegressor
Automated Machine Learning
TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776984/tpot-logo.png)
auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction
auto-sklearn
Genetic programming
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3777002/pasted-from-clipboard.png)
Source: http://www.genetic-programming.org/gpbook4toc.html
Photo: Diego Hueltes
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3814332/29676718084_7de6d00d7a_o.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3777218/opsGP.png)
Crossover
Mutation
Source http://w3.onera.fr/smac/?q=tracker
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3777154/genetic-walker.gif)
Bayesian optimization
Bayesian optimization
source: https://advancedoptimizationatharvard.wordpress.com/2014/04/28/bayesian-optimization-part-ii/
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4303129/capture.png)
Bayesian optimization
source: https://advancedoptimizationatharvard.wordpress.com/2014/04/28/bayesian-optimization-part-ii/
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4303129/capture.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4306371/untitled.png)
TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776984/tpot-logo.png)
Scikit-learn pipelines
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4306432/Revitlink_Damien_WArnings.png)
Warning
Code is coming
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
count_vectorizer = CountVectorizer(ngram_range=(1, 4), analyzer='char')
X_train = count_vectorizer.fit_transform(train)
X_test = count_vectorizer.transform(test)
linear_svc = LinearSVC()
model = linear_svc.fit(X_train, y_train)
y_test = model.predict(X_test)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
pipeline = Pipeline([
('count_vectorizer', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
('linear_svc', LinearSVC())
])
model = pipeline.fit(train)
y_test = model.predict(test)
TPOT
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776984/tpot-logo.png)
from tpot import TPOTClassifier, TPOTRegressor
tpot = TPOTClassifier()
tpot.fit(X_train, y_train)
predictions = tpot.predict(X_test)
tpot = TPOTRegressor()
tpot.fit(X_train, y_train)
predictions = tpot.predict(X_test)
Basic usage
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776984/tpot-logo.png)
Config dict
TPOTClassifier(config_dict = {
'sklearn.ensemble.RandomForestClassifier': {
'n_estimators': [100],
'criterion': ["gini", "entropy"],
'max_features': np.arange(0.05, 1.01, 0.05),
'min_samples_split': range(2, 21),
'min_samples_leaf': range(1, 21),
'bootstrap': [True, False]
},
'sklearn.feature_selection.RFE': {
'step': np.arange(0.05, 1.01, 0.05),
'estimator': {
'sklearn.ensemble.ExtraTreesClassifier': {
'n_estimators': [100],
'criterion': ['gini', 'entropy'],
'max_features': np.arange(0.05, 1.01, 0.05)
}
}
}
})
auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction
auto-sklearn
auto-sklearn
Source: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4307374/Screen_Shot_2017-11-08_at_00.17.54.png)
import autosklearn.classification
import autosklearn.regression
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
automl = autosklearn.regression.AutoSklearnRegressor()
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
Basic usage
auto-sklearn
Custom configuration
auto-sklearn
- include_estimators
- exclude_estimators
- include_preprocessors
- exclude_preprocessors
Automated Machine Learning — A Paradigm Shift That Accelerates Data Scientist Productivity @ Airbnb
Automated Machine Learning
- Exploratory analysis
- Selective discovering
- New ideas for your model
- Model optimization
Thank you!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776931/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776935/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/3776940/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/707813/images/4307412/static_qr_code_without_logo.jpg)
Data science for lazy people, Automated Machine Learning
By J. Diego Hueltes Vega
Data science for lazy people, Automated Machine Learning
Data science is fun… right? Data cleaning, feature selection, feature preprocessing, feature construction, model selection, parameter optimization, model validation… oh wait… are you sure? What about automating 80% of the work even doing better choices than you? Automated Machine Learning has arrived to be your personal assistant in Data Science
- 2,748