fun with kaggle
or
A MACHINE LEARNING PLAYGROUND
Outline
- motivation
- kaggle
- machine learning
- learning on kaggle
- vowpal wabbit
Motivation
- Interesting problems
- Interesting environment
Motivation 1a: What's this?
c
Motivation 1b: What's this?
... pixel202 pixel203 pixel204 pixel205 pixel206 pixel207 ...
...       26      102      186      254      254      248 ...
Motivation 2:
Try kaggle!
What is kaggle?
kaggle is not data science
kaggle is just the fun parts
- well-defined problems
- incredibly clean data
- prescribed success metrics
it's contrived
so you get to focus on interesting techniques
okay that's not quite fair
also:
- visualization competitions
- data cleansing competitions
- feature engineering, etc.
but mostly:
- machine learning competitions
What is machine learning?
Just machine:
Look at this data and report back!
e.g.,
- count these words
- get the mean of these numbers
- serve this web site (usually)
- be this shooting game (usually)
example: expert systems
(this is not machine learning)
I asked an expert and he said that witches float, so I wrote this program!
is_witch = function(row) {
  # witches float: no more than 1 kg per liter, the density of water
  if (row['mass_in_kg'] <= row['volume_in_L']) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
mass_in_kg, volume_in_L  # this is 'row'
70, 65
FALSE  # this is the report
Machine learning:
Look at this data, learn something, and then look at this other data and report back!
- what is learned is called a model
- the model has some pre-specified logic
- the model has some learned "state"
- goal: get good reports back in the end
example: mimic model
(this is probably not a good idea)
I don't know, so look at one labeled example and say everything is like that.
witchy_state = training_row['witchiness']
is_witch = function(row) {
  return(witchy_state)
}
mass_in_kg, volume_in_L, witchiness  # this is 'training_row'
55, 40, TRUE
mass_in_kg, volume_in_L  # this is 'row'
70, 65
TRUE  # this is the report
note: depends on both algorithm and data
(if training data is different, performance is different)
I have different training data now!
witchy_state = training_row['witchiness']
is_witch = function(row) {
  return(witchy_state)
}
mass_in_kg, volume_in_L, witchiness  # this is 'training_row'
82, 90, FALSE
mass_in_kg, volume_in_L  # this is 'row'
70, 65
FALSE  # this is the report
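The two mimic slides can be folded into one sketch that makes the split between "algorithm" and "data" explicit. The name train_mimic and the use of lists for rows are assumptions made for illustration; the logic is the same as above.

```r
# Training: look at one labeled row and keep its label as the learned state.
train_mimic = function(training_row) {
  witchy_state = training_row[['witchiness']]
  # The returned function is the model: fixed logic plus learned state.
  function(row) {
    witchy_state
  }
}

# The same unlabeled row is scored by both models.
row = list(mass_in_kg = 70, volume_in_L = 65)

# Same algorithm, different training data, different models.
model_a = train_mimic(list(mass_in_kg = 55, volume_in_L = 40, witchiness = TRUE))
model_b = train_mimic(list(mass_in_kg = 82, volume_in_L = 90, witchiness = FALSE))

report_a = model_a(row)  # TRUE
report_b = model_b(row)  # FALSE
```

Only the training row changed between model_a and model_b, and the report flipped with it.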
Learning Techniques
popular categories of techniques - that you can do!
A distinction:
The machine generally won't be able to figure out what technique is most appropriate. You're smart!
Learn:
x y
2 5
7 15
1 3
3 7
10 21
So?
x y
5 ?
Linear
- you're pretty sure you're right, aren't you?
- machine-learnable several ways
- predicts a number (continuous)
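One of the "several ways", as a minimal sketch: ordinary least squares with base R's lm(). The variable names are made up for illustration; the numbers are the ones from the slide.

```r
# The five training points from the slide.
training = data.frame(x = c(2, 7, 1, 3, 10),
                      y = c(5, 15, 3, 7, 21))

# Fit a straight line y = intercept + slope * x.
model = lm(y ~ x, data = training)

# Ask the model about the unlabeled point x = 5.
prediction = predict(model, newdata = data.frame(x = 5))
# 11, because the training points all sit on y = 2x + 1
```

The learned state here is just two numbers, the intercept and the slope, available from coef(model).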
Learn:
x y
2 cat
7 dog
1 cat
3 cat
10 dog
So?
x y
6 ?
Options!
- choose a "cut point": decision tree
- look at similar point(s): k-Nearest Neighbors
- regression for log odds: logistic regression
- you probably didn't do this in your head
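A rough sketch of the first two options in base R, on the cat/dog data from the slide. The function names are made up, and the "tree" is a single hand-rolled cut that assumes every cat sits below every dog, which is true here but not in general.

```r
training = data.frame(x = c(2, 7, 1, 3, 10),
                      y = c('cat', 'dog', 'cat', 'cat', 'dog'),
                      stringsAsFactors = FALSE)

# Option 1, a one-split "decision tree": cut halfway between
# the largest cat and the smallest dog.
cut_point = (max(training$x[training$y == 'cat']) +
             min(training$x[training$y == 'dog'])) / 2
predict_stump = function(x) {
  ifelse(x > cut_point, 'dog', 'cat')
}

# Option 2, k-Nearest Neighbors with k = 1: copy the label
# of the closest training point.
predict_nearest = function(x) {
  training$y[which.min(abs(training$x - x))]
}

stump_says = predict_stump(6)      # 'dog' (the cut point is 5)
nearest_says = predict_nearest(6)  # 'dog' (the closest x is 7)
```

Both agree on x = 6 here, but they would disagree at x = 5: it sits on the cut, so the stump says cat, while its equidistant neighbors 3 and 7 are a tie that which.min breaks in favor of the earlier row (dog).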
Other techniques:
- neural nets
- support vector machines
- get all Bayesian with everything
- throw in dimensionality reduction
- bagging/boosting/ensembles of all kinds
- learn features ("deep learning")
- make features (next)
Features
Features are important
home away winner
22 5 home
4 5 away
4 2 home
3 7 away
22 23 away
Say we want to predict 'winner'.
'home' and 'away' are features.
'winner' is the label.
How would a machine learn it?
Engineering a new feature:
home away diff winner
22 5 17 home
4 5 -1 away
4 2 2 home
3 7 -4 away
22 23 -1 away
With just 'diff', it's machine-easier!
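To see why, here is a small sketch that builds 'diff' and then applies the simplest rule imaginable to it. The data frame name and the 'predicted' column are assumptions made for illustration.

```r
games = data.frame(home = c(22, 4, 4, 3, 22),
                   away = c(5, 5, 2, 7, 23),
                   winner = c('home', 'away', 'home', 'away', 'away'),
                   stringsAsFactors = FALSE)

# Engineer the new feature from the two raw ones.
games$diff = games$home - games$away

# A single cut at zero on 'diff' is now enough.
games$predicted = ifelse(games$diff > 0, 'home', 'away')

all_correct = all(games$predicted == games$winner)  # TRUE
```

No single cut on 'home' alone or 'away' alone gets all five rows right: a home score of 22 appears in both a win and a loss, and so does an away score of 5.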
Often, domain expertise (like knowing how scores work) can improve performance.
(Deep learning techniques allow the machine to do some feature-figuring.)
Supervision
- all labeled training data: supervised learning
- some labeled training data: semi-supervised learning
- no labeled training data: unsupervised learning
- you can request labels: active learning
- there are feedback loops: contextual learning
Learning on Kaggle
[demo]
vowpal wabbit
vw
- fast
- online
- linear learning
- hashed features
- command line (pretty much) only
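A toy sketch of what "online", "linear", and "hashed features" mean together. This is not how vw is implemented: the hash function, bucket count, learning rate, and number of passes are all made up for illustration, and the data is the x/y table from earlier.

```r
# Toy string hash: maps a feature name to one of n_buckets weight slots.
# (Hypothetical stand-in; a real tool would use a proper hash function.)
hash_feature = function(name, n_buckets) {
  codes = utf8ToInt(name)
  (sum(codes * seq_along(codes)) %% n_buckets) + 1
}

n_buckets = 16
weights = numeric(n_buckets)  # the learned state: one weight per slot
learning_rate = 0.01

# Linear: the prediction is a weighted sum over the hashed slots.
predict_row = function(features) {
  slots = sapply(names(features), hash_feature, n_buckets = n_buckets)
  sum(weights[slots] * features)
}

# Online: see one example, nudge the weights, move on.
learn_row = function(features, label) {
  slots = sapply(names(features), hash_feature, n_buckets = n_buckets)
  error = sum(weights[slots] * features) - label
  weights[slots] <<- weights[slots] - learning_rate * error * features
}

train_x = c(2, 7, 1, 3, 10)
train_y = c(5, 15, 3, 7, 21)

# Stream over the data many times, one row at a time.
for (pass in 1:2000) {
  for (i in seq_along(train_x)) {
    learn_row(c(x = train_x[i], bias = 1), train_y[i])
  }
}

prediction = predict_row(c(x = 5, bias = 1))  # close to 11
```

Because features are looked up by hashing their names, the learner never needs a dictionary of feature names up front; the price is that two names can land in the same slot and share a weight.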
There are many tools.
Outline
- motivation
- kaggle
- machine learning
- learning on kaggle
- vowpal wabbit
Hurrah!
fun with kaggle
By ajschumacher