fun with kaggle

or

A MACHINE LEARNING PLAYGROUND

Outline

motivation
kaggle
machine learning
learning on kaggle
vowpal wabbit

Motivation

Interesting problems
Interesting environment

Motivation 1a: What's this?

c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 21L, 130L, 190L, 254L, 254L, 250L, 175L, 135L, 96L, 96L, 
16L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
26L, 102L, 186L, 254L, 254L, 248L, 222L, 222L, 225L, 254L, 254L, 
254L, 254L, 254L, 206L, 112L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 207L, 254L, 254L, 177L, 117L, 39L, 0L, 0L, 56L, 
248L, 102L, 48L, 48L, 103L, 192L, 254L, 135L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 91L, 111L, 36L, 0L, 0L, 0L, 0L, 0L, 
72L, 92L, 0L, 0L, 0L, 0L, 12L, 224L, 210L, 5L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 50L, 139L, 240L, 254L, 66L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 7L, 121L, 
220L, 254L, 244L, 194L, 15L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 8L, 107L, 112L, 112L, 112L, 87L, 112L, 141L, 
218L, 248L, 177L, 68L, 20L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 77L, 221L, 254L, 254L, 254L, 254L, 254L, 
225L, 104L, 39L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 32L, 32L, 32L, 32L, 130L, 
215L, 195L, 47L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 111L, 
231L, 174L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 47L, 18L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 40L, 228L, 
205L, 35L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 22L, 
234L, 42L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 212L, 
226L, 38L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 96L, 157L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 30L, 215L, 188L, 
9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 96L, 142L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 86L, 254L, 68L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 71L, 202L, 15L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 214L, 151L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 231L, 86L, 2L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 191L, 207L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 93L, 248L, 129L, 7L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 117L, 238L, 112L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 94L, 248L, 209L, 73L, 12L, 0L, 0L, 
0L, 0L, 0L, 0L, 42L, 147L, 252L, 136L, 9L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 160L, 215L, 230L, 158L, 
74L, 64L, 94L, 153L, 223L, 250L, 214L, 105L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 11L, 129L, 
189L, 234L, 224L, 255L, 194L, 134L, 75L, 6L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)

Motivation 1b: What's this?

... pixel202 pixel203 pixel204 pixel205 pixel206 pixel207 ...
...       26      102      186      254      254      248 ...

Motivation 2:

Try kaggle!

What is kaggle?

kaggle is not data science

kaggle is just the fun parts

well-defined problems
incredibly clean data
prescribed success metrics

it's contrived

so you get to focus on interesting techniques

okay that's not quite fair

also:

visualization competitions
data cleansing competitions
feature engineering, etc.

but mostly:

machine learning competitions

What is machine learning?

Just machine:

Look at this data and report back!

e.g.,

count these words
get the mean of these numbers
serve this web site (usually)
be this shooting game (usually)

example: expert systems

(this is not machine learning)

I asked an expert and he said that witches float, so I wrote this program!

# expert rule: witches float, so report TRUE when density (kg per L) is at most 1
is_witch = function(row) {
  if (row[['mass_in_kg']] <= row[['volume_in_L']]) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

mass_in_kg, volume_in_L   # this is 'row'
70,         65

FALSE                     # this is the report

Machine learning:

Look at this data, learn something, and then look at this other data and report back!

what is learned is called a model
the model has some pre-specified logic
the model has some learned "state"
goal: get good reports back in the end

example: mimic model

(this is probably not a good idea)

I don't know, so look at one labeled example and say everything is like that.

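# learned "state": the label of the single training example
witchy_state = training_row['witchiness']
# pre-specified logic: ignore the new row and repeat the learned state
is_witch = function(row) {
  return(witchy_state)
}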

mass_in_kg, volume_in_L, witchiness   # this is 'training_row'
55,         40,          TRUE

mass_in_kg, volume_in_L               # this is 'row'
70,         65

TRUE                                  # this is the report

note: depends on both algorithm and data

(if training data is different, performance is different)

I have different training data now!

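# learned "state": the label of the single training example
witchy_state = training_row['witchiness']
# pre-specified logic: ignore the new row and repeat the learned state
is_witch = function(row) {
  return(witchy_state)
}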

mass_in_kg, volume_in_L, witchiness   # this is 'training_row'
82,         90,          FALSE

mass_in_kg, volume_in_L               # this is 'row'
70,         65

FALSE                                 # this is the report
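The split between pre-specified logic and learned state can be made explicit with a closure. This is only a sketch of the mimic model from the two slides above: `train_mimic`, `is_witch_a` and `is_witch_b` are hypothetical names, and the rows are written as lists for brevity.

```r
# "training": look at one labeled row and keep its label as the model's state
train_mimic <- function(training_row) {
  witchy_state <- training_row[["witchiness"]]
  # "prediction": the pre-specified logic ignores the new row entirely
  function(row) witchy_state
}

# the unlabeled row we want a report for
row <- list(mass_in_kg = 70, volume_in_L = 65)

# same algorithm, two different training rows
is_witch_a <- train_mimic(list(mass_in_kg = 55, volume_in_L = 40, witchiness = TRUE))
is_witch_b <- train_mimic(list(mass_in_kg = 82, volume_in_L = 90, witchiness = FALSE))

report_a <- is_witch_a(row)  # TRUE
report_b <- is_witch_b(row)  # FALSE
```

The algorithm is identical in both cases; only the training data, and so the learned state, differs.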

Learning Techniques

popular categories of techniques - that you can do!

A distinction:

The machine generally won't be able to figure out what technique is most appropriate. You're smart!

Learn:

              x      y
              2      5
              7     15
              1      3
              3      7
             10     21

So?

              x      y
              5

Linear

you're pretty sure you're right, aren't you?
machine-learnable several ways
predicts a number (continuous)
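One of the "several ways" is ordinary least squares, which base R provides as `lm`. The sketch below assumes the five (x, y) pairs of the linear example (2 and 5, 7 and 15, 1 and 3, 3 and 7, 10 and 21) and asks for a prediction at x = 5; the variable names are illustrative only.

```r
# the five training pairs from the linear example
x <- c(2, 7, 1, 3, 10)
y <- c(5, 15, 3, 7, 21)

# learn a line y = intercept + slope * x by least squares
fit <- lm(y ~ x)
learned <- coef(fit)  # the model's learned "state": intercept and slope

# report back for the unseen x = 5
prediction <- predict(fit, newdata = data.frame(x = 5))
print(learned)
print(prediction)
```

These points lie exactly on y = 2x + 1, so the fitted line predicts 11, probably the answer you had in your head.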

Learn:

              x      y
              2    cat
              7    dog
              1    cat
              3    cat
             10    dog

So?

              x      y
              6

Options!

choose a "cut point": decision tree
look at similar point(s): k-Nearest Neighbors
regression for log odds: logistic regression

you probably didn't do this in your head
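The first two options can be sketched in a few lines of base R. This assumes the five labeled points of the cat/dog example (2, 1 and 3 are cats; 7 and 10 are dogs) and a new point at x = 6. `learn_cut` is a hypothetical helper, and it bakes in the assumption that larger x means dog, which a real decision tree would learn for itself.

```r
x <- c(2, 7, 1, 3, 10)
y <- c("cat", "dog", "cat", "cat", "dog")
new_x <- 6

# decision stump: try each midpoint between neighboring x values and
# keep the cut with the fewest training mistakes
# (assumption: points above the cut are "dog", points below are "cat")
learn_cut <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  mistakes <- sapply(cuts, function(cut) {
    sum(ifelse(x > cut, "dog", "cat") != y)
  })
  cuts[which.min(mistakes)]
}

cut_point <- learn_cut(x, y)
by_cut <- ifelse(new_x > cut_point, "dog", "cat")

# 1-nearest neighbor: copy the label of the closest training point
by_neighbor <- y[which.min(abs(x - new_x))]
```

Both sketches report "dog" for x = 6: the learned cut falls at 5, and the nearest training point is 7.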

Other techniques:

neural nets
support vector machines
get all Bayesian with everything
throw in dimensionality reduction
bagging/boosting/ensembles of all kinds
learn features ("deep learning")
make features (next)

Features

Features are important

   home   away    winner
     22      5      home
      4      5      away
      4      2      home
      3      7      away
     22     23      away

Say we want to predict 'winner'.

'home' and 'away' are features.

'winner' is the label.

How would a machine learn it?

   home   away    winner
     22      5      home
      4      5      away
      4      2      home
      3      7      away
     22     23      away

Engineering a new feature:

   home   away   diff   winner
     22      5     17     home
      4      5     -1     away
      4      2      2     home
      3      7     -4     away
     22     23     -1     away

With just 'diff', it's machine-easier!
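As a rough illustration of why 'diff' makes things machine-easier, the sketch below rebuilds the score table, adds the engineered column, and applies a single cut at zero. The data frame name `games` and the column `predicted` are illustrative, not part of the original slides.

```r
# the score table from the slides
games <- data.frame(
  home   = c(22, 4, 4, 3, 22),
  away   = c(5, 5, 2, 7, 23),
  winner = c("home", "away", "home", "away", "away"),
  stringsAsFactors = FALSE
)

# engineered feature: the score difference
games$diff <- games$home - games$away

# with 'diff', one cut point at zero separates the two labels
games$predicted <- ifelse(games$diff > 0, "home", "away")

all_correct <- all(games$predicted == games$winner)
print(games)
```

No single cut on 'home' or on 'away' alone separates these five games, while the cut on 'diff' labels all of them correctly.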

Often, domain expertise (like knowing how scores work) can improve performance.

(Deep Learning techniques allow the machine to do some feature-figuring.)

Supervision

all labeled training data: supervised learning

some labeled training data: semi-supervised learning

no labeled training data: unsupervised learning

you can request labels: active learning

there are feedback loops: contextual learning

Learning on Kaggle

[demo]

vowpal wabbit

vw

fast
online
linear learning
hashed features
command line (pretty much) only
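To give a feel for "online", "linear" and "hashed features", here is a toy sketch in R. It is not how vw is implemented: `hash_slot` is a made-up string hash (not the one vw uses), `online_update` is a hypothetical plain gradient step for squared loss, and the learning rate and number of passes are arbitrary choices. The data are the five (x, y) pairs from the linear example.

```r
# toy hash, NOT the hash function vw uses:
# maps a feature name to a weight slot in 1..2^bits
hash_slot <- function(name, bits = 4) {
  h <- 0
  for (code in utf8ToInt(name)) {
    h <- (h * 31 + code) %% 2^bits
  }
  h + 1  # R vectors are 1-indexed
}

# one online step for squared loss: predict from the current weights,
# then nudge only the weights this example touched
online_update <- function(weights, features, label, rate = 0.01) {
  slots <- sapply(names(features), hash_slot)
  error <- sum(weights[slots] * features) - label
  weights[slots] <- weights[slots] - rate * error * features
  weights
}

weights <- numeric(2^4)  # fixed-size weight vector, no feature dictionary
x <- c(2, 7, 1, 3, 10)
y <- c(5, 15, 3, 7, 21)

# stream over the examples one at a time, for many passes
for (pass in 1:2000) {
  for (i in seq_along(x)) {
    weights <- online_update(weights, c(bias = 1, x = x[i]), y[i])
  }
}

slope <- weights[hash_slot("x")]
intercept <- weights[hash_slot("bias")]
```

The learner never holds the whole data set at once and never stores feature names, only a fixed number of weight slots. With more feature names than slots, some would collide and share a weight, which is the trade-off hashing makes.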

There are many tools.

Outline

motivation
kaggle
machine learning
learning on kaggle
vowpal wabbit

Hurrah!

More from ajschumacher
