fun with kaggle
or
A MACHINE LEARNING PLAYGROUND
Outline
- motivation
- kaggle
- machine learning
- learning on kaggle
- vowpal wabbit
Motivation
- Interesting problems
- Interesting environment
Motivation 1a: What's this?
c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 21L, 130L, 190L, 254L, 254L, 250L, 175L, 135L, 96L, 96L, 16L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 26L, 102L, 186L, 254L, 254L, 248L, 222L, 222L, 225L, 254L, 254L, 254L, 254L, 254L, 206L, 112L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 207L, 254L, 254L, 177L, 117L, 39L, 0L, 0L, 56L, 248L, 102L, 48L, 48L, 103L, 192L, 254L, 135L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 91L, 111L, 36L, 0L, 0L, 0L, 0L, 0L, 72L, 92L, 0L, 0L, 0L, 0L, 12L, 224L, 210L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 50L, 139L, 240L, 254L, 66L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 7L, 121L, 220L, 254L, 244L, 194L, 15L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 8L, 107L, 112L, 112L, 112L, 87L, 112L, 141L, 218L, 248L, 177L, 68L, 20L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 77L, 221L, 254L, 254L, 254L, 254L, 254L, 225L, 104L, 39L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 32L, 32L, 32L, 32L, 130L, 215L, 195L, 47L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 111L, 231L, 174L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 47L, 18L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 40L, 228L, 205L, 35L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 22L, 234L, 42L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 212L, 226L, 38L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 96L, 157L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 30L, 215L, 188L, 9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 96L, 142L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 86L, 254L, 68L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 71L, 202L, 15L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 214L, 151L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 231L, 86L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 191L, 207L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 93L, 248L, 129L, 7L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 117L, 238L, 112L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 94L, 248L, 209L, 73L, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 42L, 147L, 252L, 136L, 9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 160L, 215L, 230L, 158L, 74L, 64L, 94L, 153L, 223L, 250L, 214L, 105L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 11L, 129L, 189L, 234L, 224L, 255L, 194L, 134L, 75L, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
Motivation 1b: What's this?
... pixel202 pixel203 pixel204 pixel205 pixel206 pixel207 ...
...       26      102      186      254      254      248 ...
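One way to answer "what's this?" is to fold the long vector back into a grid and look at it. The sketch below assumes the vector is a flattened square grayscale image of 784 = 28 x 28 values stored row by row, which is a guess based on the pixelNNN column names. The helpers `as_image` and `to_ascii` and the stand-in vector `demo_pixels` are made up for illustration.

```r
# Fold a flat pixel vector into a side x side matrix.
# ASSUMPTION: pixels are stored row by row (pixel0 is top-left).
as_image <- function(pixels, side = 28) {
  stopifnot(length(pixels) == side * side)
  matrix(pixels, nrow = side, ncol = side, byrow = TRUE)
}

# Render each image row as text: "#" for bright pixels, "." for dark ones.
to_ascii <- function(img, threshold = 127) {
  apply(img, 1, function(r) paste(ifelse(r > threshold, "#", "."), collapse = ""))
}

# Synthetic stand-in for the real vector: all zeros except the six
# values shown above for pixel202..pixel207 (R indices 203..208).
demo_pixels <- rep(0L, 784)
demo_pixels[203:208] <- c(26L, 102L, 186L, 254L, 254L, 248L)

img <- as_image(demo_pixels)
cat(to_ascii(img), sep = "\n")
```

With the full vector in place of `demo_pixels`, the printed grid should show the shape that the raw numbers hide.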
Motivation 2:
Try kaggle!
What is kaggle?
kaggle is not data science
kaggle is just the fun parts
- well-defined problems
- incredibly clean data
- prescribed success metrics
it's contrived
so you get to focus on interesting techniques
okay that's not quite fair
also:
- visualization competitions
- data cleansing competitions
- feature engineering, etc.
but mostly:
- machine learning competitions
What is machine learning?
Just machine:
Look at this data and report back!
e.g.,
- count these words
- get the mean of these numbers
- serve this web site (usually)
- be this shooting game (usually)
example: expert systems
(this is not machine learning)
I asked an expert and he said that witches float, so I wrote this program!
is_witch = function(row) {
  if (row['mass_in_kg'] <= row['volume_in_L']) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
mass_in_kg, volume_in_L  # this is 'row'
70, 65
FALSE # this is the report
Machine learning:
Look at this data, learn something, and then look at this other data and report back!
- what is learned is called a model
- the model has some pre-specified logic
- the model has some learned "state"
- goal: get good reports back in the end
example: mimic model
(this is probably not a good idea)
I don't know, so look at one labeled example and say everything is like that.
witchy_state = training_row['witchiness']
is_witch = function(row) {
  return(witchy_state)
}
mass_in_kg, volume_in_L, witchiness  # this is 'training_row'
55, 40, TRUE
mass_in_kg, volume_in_L  # this is 'row'
70, 65
TRUE # this is the report
note: depends on both algorithm and data
(if training data is different, performance is different)
I have different training data now!
witchy_state = training_row['witchiness']
is_witch = function(row) {
  return(witchy_state)
}
mass_in_kg, volume_in_L, witchiness  # this is 'training_row'
82, 90, FALSE
mass_in_kg, volume_in_L  # this is 'row'
70, 65
FALSE # this is the report
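The two mimic runs can be put side by side in one runnable sketch. The wrapper `train_mimic` is a hypothetical helper, not something from the slides: it takes a training row, stores the learned "state", and hands back a predictor with the same pre-specified logic. Rows are written as R lists here for convenience.

```r
# "Training": remember the label of the single training row.
# "Predicting": report that remembered label for every new row.
train_mimic <- function(training_row) {
  witchy_state <- training_row[["witchiness"]]  # the learned state
  function(row) witchy_state                    # the fixed logic
}

row <- list(mass_in_kg = 70, volume_in_L = 65)

# Same algorithm, two different training rows.
is_witch_a <- train_mimic(list(mass_in_kg = 55, volume_in_L = 40, witchiness = TRUE))
is_witch_b <- train_mimic(list(mass_in_kg = 82, volume_in_L = 90, witchiness = FALSE))

is_witch_a(row)  # TRUE
is_witch_b(row)  # FALSE
```

The new row is identical in both calls. Only the training data changed, and the report changed with it.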
Learning Techniques
popular categories of techniques - that you can do!
A distinction:
The machine generally won't be able to figure out what technique is most appropriate. You're smart!
Learn:
x y
2 5
7 15
1 3
3 7
10 21
So?
x y
5 ?
Linear
- you're pretty sure you're right, aren't you?
- machine-learnable several ways
- predicts a number (continuous)
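One of the "several ways" is ordinary least squares. A minimal sketch using base R's `lm` on the five points from the table; the names `train`, `fit`, and `prediction` are only for illustration.

```r
# The five labeled examples from the "Learn:" table.
train <- data.frame(x = c(2, 7, 1, 3, 10),
                    y = c(5, 15, 3, 7, 21))

# Fit a straight line y = intercept + slope * x by least squares.
fit <- lm(y ~ x, data = train)
coef(fit)  # intercept 1, slope 2 (up to floating point)

# Report back on the new x.
prediction <- predict(fit, newdata = data.frame(x = 5))
prediction  # 11
```

These points lie exactly on y = 2x + 1, so the fit is perfect. Real data would leave residuals.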
Learn:
x y
2 cat
7 dog
1 cat
3 cat
10 dog
So?
x y
6 ?
Options!
- choose a "cut point": decision tree
- look at similar point(s): k-Nearest Neighbors
- regression for log odds: logistic regression
- you probably didn't do this in your head
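The first two options are small enough to sketch in base R on the cat/dog table. This is a toy version under stated assumptions, not how a library would do it: the cut point is placed halfway between the largest "cat" x and the smallest "dog" x (which only works because the classes do not overlap here), and `predict_stump` and `predict_knn` are made-up helper names. Logistic regression is left out because this tiny, perfectly separated data set makes it awkward to fit.

```r
train <- data.frame(x = c(2, 7, 1, 3, 10),
                    y = c("cat", "dog", "cat", "cat", "dog"),
                    stringsAsFactors = FALSE)

# Option 1: a one-split decision tree (a "stump").
# ASSUMPTION: every cat has a smaller x than every dog.
cut_point <- (max(train$x[train$y == "cat"]) +
              min(train$x[train$y == "dog"])) / 2
predict_stump <- function(x) ifelse(x > cut_point, "dog", "cat")

# Option 2: k-Nearest Neighbors with a majority vote.
predict_knn <- function(x, k = 1) {
  nearest <- order(abs(train$x - x))[seq_len(k)]  # indices of the k closest points
  votes <- table(train$y[nearest])
  names(votes)[which.max(votes)]
}

cut_point           # 5
predict_stump(6)    # "dog"
predict_knn(6)      # "dog" (nearest point is x = 7)
predict_knn(6, 3)   # "dog" (neighbors 7, 3, 10 vote dog, cat, dog)
```

Both methods agree on x = 6 here, but they can disagree on other data, which is part of why choosing a technique is left to you.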
Other techniques:
- neural nets
- support vector machines
- get all Bayesian with everything
- throw in dimensionality reduction
- bagging/boosting/ensembles of all kinds
- learn features ("deep learning")
- make features (next)
Features
Features are important
home away winner
22 5 home
4 5 away
4 2 home
3 7 away
22 23 away
Say we want to predict 'winner'.
'home' and 'away' are features.
'winner' is the label.
How would a machine learn it?
home away winner
22 5 home
4 5 away
4 2 home
3 7 away
22 23 away
Engineering a new feature:
home away diff winner
22 5 17 home
4 5 -1 away
4 2 2 home
3 7 -4 away
22 23 -1 away
With just 'diff', it's machine-easier!
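A short sketch of the same step in R. The data frame `games` holds the table above; the rule "diff > 0 means home wins" is written by hand here to show how simple the problem becomes, whereas a learner would have to find a threshold somewhere between -1 and 2 from the data.

```r
games <- data.frame(home   = c(22, 4, 4, 3, 22),
                    away   = c(5, 5, 2, 7, 23),
                    winner = c("home", "away", "home", "away", "away"),
                    stringsAsFactors = FALSE)

# Engineer the new feature from the two raw ones.
games$diff <- games$home - games$away
games$diff  # 17 -1 2 -4 -1

# With 'diff', a single cut point separates the two labels.
predicted <- ifelse(games$diff > 0, "home", "away")
all(predicted == games$winner)  # TRUE
```

No single cut on 'home' alone or 'away' alone gets all five rows right: home = 22 appears with both outcomes, and so does away = 5.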
Often, domain expertise (like knowing how scores work) can improve performance.
(Deep Learning techniques allow the machine to do some feature-figuring.)
Supervision
- all labeled training data: supervised learning
- some labeled training data: semi-supervised learning
- no labeled training data: unsupervised learning
- you can request labels: active learning
- there are feedback loops: contextual learning
Learning on Kaggle
[demo]
vowpal wabbit
vw
- fast
- online
- linear learning
- hashed features
- command line (pretty much) only
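"Hashed features" refers to the hashing trick: instead of keeping a dictionary from feature names to column numbers, each name is hashed straight to a slot in a fixed-size vector. The sketch below shows only the idea. The toy hash and the helpers `hash_feature` and `hash_row` are invented for illustration and are not the hash function or the code that vw uses.

```r
# Toy string hash (NOT vw's hash): map a feature name to a bucket in 1..n_buckets.
hash_feature <- function(name, n_buckets = 16) {
  h <- 0
  for (code in utf8ToInt(name)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1  # R vectors are 1-indexed
}

# Turn a named row into a fixed-length numeric vector.
hash_row <- function(features, n_buckets = 16) {
  x <- numeric(n_buckets)
  for (name in names(features)) {
    i <- hash_feature(name, n_buckets)
    x[i] <- x[i] + features[[name]]  # names that collide simply add up
  }
  x
}

row <- c(mass_in_kg = 70, volume_in_L = 65)
hash_row(row)
```

The vector length is fixed up front, so new feature names never require rebuilding a dictionary. The price is that two names can land in the same bucket.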
There are many tools.
Outline
- motivation
- kaggle
- machine learning
- learning on kaggle
- vowpal wabbit
Hurrah!
fun with kaggle
By ajschumacher