Trang Le
#math graduate. Postdoc fellow with Jason Moore.
Trang Lê
R Ladies Philly Meetup
December 2, 2019
@trang1618
R packages::recipes
[Figure: data science Venn diagram, adapted from Drew Conway's Data Science Venn Diagram]
Circles: hacking skills, math and statistics, domain knowledge
Overlaps: machine learning, traditional research, Danger zone!
Center: data science
outcome: continuous/quantitative
e.g. How sick is the patient?
on a 10-point scale
outcome: discrete/categorical/class
e.g. Is the patient sick?
TRUE, FALSE
1, 0
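A minimal sketch of the two outcome types. R's built-in mtcars data stands in for patient data here (an assumption for illustration, not the workshop's dataset): mpg plays the continuous outcome and am the 1/0 outcome.

```r
# Continuous outcome -> regression: predict miles per gallon from weight.
reg_fit <- lm(mpg ~ wt, data = mtcars)
reg_pred <- predict(reg_fit, newdata = data.frame(wt = 3))

# Categorical outcome -> classification: predict transmission type (1/0) from weight.
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob <- predict(clf_fit, newdata = data.frame(wt = 3), type = "response")
clf_class <- clf_prob > 0.5  # TRUE/FALSE label derived from the predicted probability
```

The 0.5 cutoff is a common default, not a requirement; it can be moved depending on the cost of each kind of error.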
During training...
hyperparameters
parameters
We tune the model's hyperparameters.
As we fit the model to the data, the model learns its parameters from the data.
e.g. Elastic net is a type of model
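A minimal sketch of the difference, assuming ridge regression as a stand-in: ridge is a special case of elastic net and has a closed form in base R. The penalty lambda is a hyperparameter we choose; the coefficients beta are parameters learned from the data. The helper ridge_fit and the mtcars columns are hypothetical choices for illustration.

```r
# Ridge regression in closed form: beta = (X'X + lambda * I)^(-1) X'y
# lambda is a hyperparameter (we set it); beta holds the parameters (learned from data).
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

# Standardize the predictors and center the outcome so no intercept is needed.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

beta_small <- ridge_fit(X, y, lambda = 0.1)  # weak penalty
beta_large <- ridge_fit(X, y, lambda = 100)  # strong penalty shrinks the coefficients
```

Changing lambda changes which parameters the model learns, which is why lambda has to be tuned rather than fit.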
An Introduction to Statistical Learning, 2014
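5-fold cross-validation
[Figure: the data are split into a training set (Train) and a test set (Test). The training set is then split five ways; each split holds out a different validation fold (V) and trains on the rest.]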
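A minimal sketch of building the five train/validation splits in base R. It treats mtcars as the training portion of the data (an assumption for illustration); the test set would be set aside before this step.

```r
set.seed(1)  # reproducible shuffling
n <- nrow(mtcars)
k <- 5

# Assign each row to one of k folds, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the held-out rows form the validation set (V); the rest are training rows.
splits <- lapply(seq_len(k), function(i) {
  list(train = which(fold_id != i), validation = which(fold_id == i))
})
```

Every row lands in exactly one validation fold, so each observation is predicted once by a model that never saw it.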
observation / sample / individual / subject
variable / feature / attribute
1. Define sets of model hyperparameter values
2. for each hyperparameter set:
3.     for each resampling iteration:
4.         Hold out specific samples
5.         Fit model on the remaining training data
6.         Predict the held-out samples
7.     end
8.     Average performance across held-out predictions
9. end
10. Find optimal hyperparameter set
11. Fit final model to all training data with the optimal hyperparameter set
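The tuning loop can be sketched in base R as follows. This is an illustration under assumptions, not the workshop's code: the "model" is a polynomial regression on mtcars, the hyperparameter is the polynomial degree, and performance is measured by RMSE.

```r
set.seed(42)
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Step 1: define the candidate hyperparameter values (polynomial degree, a hypothetical choice).
degrees <- 1:4

# Steps 2-9: for each candidate, average the held-out RMSE over the k resampling iterations.
cv_rmse <- sapply(degrees, function(d) {
  fold_rmse <- sapply(seq_len(k), function(i) {
    train <- mtcars[fold_id != i, ]             # step 4: hold out fold i
    held_out <- mtcars[fold_id == i, ]
    fit <- lm(mpg ~ poly(wt, d), data = train)  # step 5: fit on the training rows
    pred <- predict(fit, newdata = held_out)    # step 6: predict the held-out rows
    sqrt(mean((held_out$mpg - pred)^2))
  })
  mean(fold_rmse)                               # step 8: average across folds
})

# Step 10: pick the value with the lowest cross-validated error.
best_degree <- degrees[which.min(cv_rmse)]

# Step 11: refit on all training data with the chosen hyperparameter.
final_fit <- lm(mpg ~ poly(wt, best_degree), data = mtcars)
```

The same fold assignment is reused for every candidate, so the candidates are compared on identical splits.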
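library(caret)
data(Boston, package = "MASS")  # the outcome column in Boston is medv (median home value)

rf_fit <- train(
  medv ~ .,
  data = Boston,
  method = "rf",                           # caret's random forest method (randomForest package)
  tuneGrid = expand.grid(mtry = 2:7),      # mtry is the only parameter caret tunes for "rf"
  ntree = 1000,                            # passed through to randomForest(), not tuned
  trControl = trainControl(method = "cv")  # k-fold cross-validation (10 folds by default)
)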
create a unified interface for predictive modeling with 238 models
Max Kuhn, useR-2010
h(g(f(x)))
x %>% f() %>% g() %>% h()
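A minimal sketch of the same idea with concrete functions. It uses base R's native pipe `|>` (R 4.1 or later) as a stand-in for magrittr's `%>%`, so it runs without extra packages; for simple chains like this one the two behave the same way.

```r
x <- c(1, 4, 9)

# Nested calls read inside-out: sqrt first, then sum, then round.
nested <- round(sum(sqrt(x)), 1)

# The piped form reads left to right, in the order the steps run.
piped <- x |> sqrt() |> sum() |> round(1)
```

Both forms compute the same value; the pipe only changes how the steps are written.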
beer/name: John Harvards Simcoe IPA
beer/ABV: 5.4
beer/style: India Pale Ale (IPA)
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/text: On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized white head. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.
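A minimal sketch of reading a record in this "field: value" layout into R. The helper score_fraction is hypothetical, and only a few of the fields above are included; the workshop repo may parse the data differently.

```r
record <- c(
  "beer/name: John Harvards Simcoe IPA",
  "beer/ABV: 5.4",
  "review/taste: 6/10",
  "review/overall: 13/20"
)

# Split each line at the first ": " into a field name and its value.
keys <- sub(": .*$", "", record)
values <- sub("^[^:]*: ", "", record)
fields <- setNames(as.list(values), keys)

# Hypothetical helper: turn a "score/max" string such as "13/20" into a fraction.
score_fraction <- function(s) {
  parts <- as.numeric(strsplit(s, "/", fixed = TRUE)[[1]])
  parts[1] / parts[2]
}

abv <- as.numeric(fields[["beer/ABV"]])
overall <- score_fraction(fields[["review/overall"]])
```

Putting the scores on a common 0 to 1 scale makes ratings with different maxima (5, 10, 20) comparable.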
Repo for the workshop: https://github.com/trang1618/rladies-caret
Beer ratings dataset: https://www.kaggle.com/c/beer-ratings/data
Emil Hvitfeldt's post on tidy text and caret: https://www.hvitfeldt.me/blog/binary-text-classification-with-tidytext-and-caret/
Standard hyperparameter grids: https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py
Elements of Statistical Learning: https://web.stanford.edu/~hastie/ElemStatLearn/
By Trang Le
Presentation on 2019-12-02 at R Ladies Philly Meetup