fun with kaggle


or


A MACHINE LEARNING PLAYGROUND



Outline

  • motivation
  • kaggle
  • machine learning
  • learning on kaggle
  • vowpal wabbit


Motivation

  • Interesting problems
  • Interesting environment

Motivation 1a: What's this?

c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 21L, 130L, 190L, 254L, 254L, 250L, 175L, 135L, 96L, 96L, 
16L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
26L, 102L, 186L, 254L, 254L, 248L, 222L, 222L, 225L, 254L, 254L, 
254L, 254L, 254L, 206L, 112L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 207L, 254L, 254L, 177L, 117L, 39L, 0L, 0L, 56L, 
248L, 102L, 48L, 48L, 103L, 192L, 254L, 135L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 91L, 111L, 36L, 0L, 0L, 0L, 0L, 0L, 
72L, 92L, 0L, 0L, 0L, 0L, 12L, 224L, 210L, 5L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 50L, 139L, 240L, 254L, 66L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 7L, 121L, 
220L, 254L, 244L, 194L, 15L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 8L, 107L, 112L, 112L, 112L, 87L, 112L, 141L, 
218L, 248L, 177L, 68L, 20L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 77L, 221L, 254L, 254L, 254L, 254L, 254L, 
225L, 104L, 39L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 32L, 32L, 32L, 32L, 130L, 
215L, 195L, 47L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 111L, 
231L, 174L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 47L, 18L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 40L, 228L, 
205L, 35L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 22L, 
234L, 42L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 212L, 
226L, 38L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 96L, 157L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 30L, 215L, 188L, 
9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 96L, 142L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 86L, 254L, 68L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 71L, 202L, 15L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 214L, 151L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 231L, 86L, 2L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 191L, 207L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 93L, 248L, 129L, 7L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 117L, 238L, 112L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 94L, 248L, 209L, 73L, 12L, 0L, 0L, 
0L, 0L, 0L, 0L, 42L, 147L, 252L, 136L, 9L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 160L, 215L, 230L, 158L, 
74L, 64L, 94L, 153L, 223L, 250L, 214L, 105L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 11L, 129L, 
189L, 234L, 224L, 255L, 194L, 134L, 75L, 6L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)

Motivation 1b: What's this?


... pixel202 pixel203 pixel204 pixel205 pixel206 pixel207 ...
...       26      102      186      254      254      248 ...

Motivation 2:




Try kaggle!



What is kaggle?



kaggle is not data science

kaggle is just the fun parts


  • well-defined problems
  • incredibly clean data
  • prescribed success metrics

it's contrived

so you get to focus on interesting techniques


okay that's not quite fair


also:
  • visualization competitions
  • data cleansing competitions
  • feature engineering, etc.

but mostly:
  • machine learning competitions


What is machine learning?

Just machine:


Look at this data and report back!


e.g.,

  • count these words
  • get the mean of these numbers
  • serve this web site (usually)
  • be this shooting game (usually)

example: expert systems

(this is not machine learning)

I asked an expert and he said that witches float, so I wrote this program!


is_witch = function(row) {
  # witches float: mass (kg) at or below volume (L), i.e. no denser than water
  return(row[['mass_in_kg']] <= row[['volume_in_L']])
}
row = list(mass_in_kg = 70, volume_in_L = 65)   # this is 'row'
is_witch(row)                                   # this is the report
# [1] FALSE

Machine learning:


Look at this data, learn something, and then look at this other data and report back!


  • what is learned is called a model
  • the model has some pre-specified logic
  • the model has some learned "state"
  • goal: get good reports back in the end

example: mimic model

(this is probably not a good idea)

I don't know, so look at one labeled example and say everything is like that.


training_row = list(mass_in_kg = 55, volume_in_L = 40, witchiness = TRUE)   # this is 'training_row'
witchy_state = training_row[['witchiness']]     # this is the learned "state"
is_witch = function(row) {
  return(witchy_state)
}
row = list(mass_in_kg = 70, volume_in_L = 65)   # this is 'row'
is_witch(row)                                   # this is the report
# [1] TRUE

note: depends on both algorithm and data

(if training data is different, performance is different)

I have different training data now!


training_row = list(mass_in_kg = 82, volume_in_L = 90, witchiness = FALSE)   # this is 'training_row'
witchy_state = training_row[['witchiness']]     # this is the learned "state"
is_witch = function(row) {
  return(witchy_state)
}
row = list(mass_in_kg = 70, volume_in_L = 65)   # this is 'row'
is_witch(row)                                   # this is the report
# [1] FALSE

Learning Techniques



popular categories of techniques - that you can do!


A distinction:


The machine generally won't be able to figure out which technique is most appropriate. You're smart!

Learn:

              x      y
              2      5
              7     15
              1      3
              3      7
             10     21

So?

              x      y
              5      



Linear


  • you're pretty sure you're right, aren't you?
  • machine-learnable several ways
  • predicts a number (continuous)
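
One of the several ways a machine can learn this line is ordinary least squares. The following is a minimal sketch in R, assuming the five (x, y) rows from the "Learn:" slide; the names `train`, `fit`, and `prediction` are illustrative and not from the talk.

```r
# The five training rows from the "Learn:" slide
train <- data.frame(x = c(2, 7, 1, 3, 10),
                    y = c(5, 15, 3, 7, 21))

# Fit y = intercept + slope * x by ordinary least squares
fit <- lm(y ~ x, data = train)

# Predict the missing y from the "So?" slide, where x = 5
prediction <- predict(fit, newdata = data.frame(x = 5))
print(unname(prediction))
```

The training rows lie exactly on y = 2x + 1, so the fitted line should predict 11 for x = 5, up to floating-point error.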

Learn:

              x      y
              2    cat
              7    dog
              1    cat
              3    cat
             10    dog

So?

              x      y
              6      


Options!


  • choose a "cut point": decision tree
  • look at similar point(s): k-Nearest Neighbors
  • regression for log odds: logistic regression
    • you probably didn't do this in your head
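
The first two options are small enough to sketch by hand. What follows is an illustrative base-R version, assuming the cat/dog rows from the slide above: a one-split decision tree (a "stump") that picks the cut point with the fewest training errors, and a k-nearest-neighbors vote. The function names `fit_stump`, `predict_stump`, and `predict_knn` are made up for this sketch; real projects would more likely reach for a package.

```r
train <- data.frame(x = c(2, 7, 1, 3, 10),
                    y = c("cat", "dog", "cat", "cat", "dog"),
                    stringsAsFactors = FALSE)

# Decision stump: try the midpoint between each pair of neighbouring x values
# and keep the cut point that makes the fewest mistakes on the training rows.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- NULL
  for (cut in cuts) {
    left  <- names(which.max(table(y[x <= cut])))  # majority label at or below the cut
    right <- names(which.max(table(y[x > cut])))   # majority label above the cut
    errors <- sum(ifelse(x <= cut, left, right) != y)
    if (is.null(best) || errors < best$errors) {
      best <- list(cut = cut, left = left, right = right, errors = errors)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

# k-nearest neighbors: majority label among the k closest training points
predict_knn <- function(train_x, train_y, x, k = 1) {
  nearest <- order(abs(train_x - x))[seq_len(k)]
  names(which.max(table(train_y[nearest])))
}

stump <- fit_stump(train$x, train$y)
stump_answer <- predict_stump(stump, 6)
knn_answer <- predict_knn(train$x, train$y, 6, k = 1)
print(c(cut = stump$cut, stump = stump_answer, knn = knn_answer))
```

On these five rows the stump settles on a cut at x = 5, and both the stump and the single nearest neighbor (x = 7) answer "dog" for x = 6. A larger k can change the vote, which is part of why choosing the technique and its settings is left to you.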


Other techniques:


  • neural nets
  • support vector machines
  • get all Bayesian with everything
  • throw in dimensionality reduction
  • bagging/boosting/ensembles of all kinds
  • learn features ("deep learning")
  • make features (next)


Features

Features are important


   home   away    winner
     22      5      home
      4      5      away
      4      2      home
      3      7      away
     22     23      away

Say we want to predict 'winner'.

'home' and 'away' are features.

'winner' is the label.

How would a machine learn it?


   home   away    winner
     22      5      home
      4      5      away
      4      2      home
      3      7      away
     22     23      away


Engineering a new feature:


   home   away   diff   winner
     22      5     17     home
      4      5     -1     away
      4      2      2     home
      3      7     -4     away
     22     23     -1     away


With just 'diff', it's machine-easier!
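
To see why, here is a small sketch in R, assuming the five games from the table above. It builds the 'diff' column and checks that a single cut at zero on that one feature reproduces every label; the names `games` and `predicted` are illustrative.

```r
games <- data.frame(home = c(22, 4, 4, 3, 22),
                    away = c(5, 5, 2, 7, 23),
                    winner = c("home", "away", "home", "away", "away"),
                    stringsAsFactors = FALSE)

# Engineer the new feature: the score difference from the home side's view
games$diff <- games$home - games$away

# With 'diff' available, one cut point at zero is a complete rule
predicted <- ifelse(games$diff > 0, "home", "away")
all_correct <- all(predicted == games$winner)
print(all_correct)
```

Without 'diff', a learner that can only cut on 'home' or 'away' alone has a harder time: home scores of 22 appear in both a win and a loss.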


Often, domain expertise (like knowing how scores work) can improve performance.



(Deep Learning techniques allow the machine to do some feature-figuring.)



Supervision



  • all labeled training data: supervised learning

  • some labeled training data: semi-supervised learning

  • no labeled training data: unsupervised learning


  • you can request labels: active learning

  • there are feedback loops: contextual learning


Learning on Kaggle




[demo]









vowpal wabbit


vw


  • fast
  • online
  • linear learning
  • hashed features
  • command line (pretty much) only
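
To make "online", "linear", and "hashed features" concrete, here is a toy sketch in R of the general idea. It is not vw's implementation: the hash function, the squared-loss update, the learning rate, and every name below (`hash_bucket`, `predict_hashed`, `learn_one`) are assumptions chosen for illustration.

```r
# Toy string hash (hypothetical, not the hash vw uses):
# maps a feature name to a bucket in 1..n_buckets
hash_bucket <- function(name, n_buckets) {
  h <- 0
  for (code in utf8ToInt(name)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1
}

# Look up the weight bucket for each named feature
bucket_index <- function(features, n_buckets) {
  vapply(names(features), hash_bucket, numeric(1),
         n_buckets = n_buckets, USE.NAMES = FALSE)
}

# Linear prediction: sum of weight * value over the hashed features
predict_hashed <- function(weights, features) {
  idx <- bucket_index(features, length(weights))
  sum(weights[idx] * features)
}

# Online learning: look at one example, nudge the weights, move on
learn_one <- function(weights, features, label, rate = 0.05) {
  idx <- bucket_index(features, length(weights))
  error <- predict_hashed(weights, features) - label
  # Loop so that features colliding in one bucket both contribute
  for (i in seq_along(idx)) {
    weights[idx[i]] <- weights[idx[i]] - rate * error * features[[i]]
  }
  weights
}

weights <- numeric(8)             # fixed-size weight vector, however many features exist
features <- c(x = 2, bias = 1)    # one example: named feature values
idx <- bucket_index(features, length(weights))

before <- predict_hashed(weights, features)
weights <- learn_one(weights, features, label = 5)
after <- predict_hashed(weights, features)
print(c(before = before, after = after))
```

The weight vector never grows: feature names are hashed straight into a fixed number of buckets, and each example is used for one small update and then discarded, so the data never has to fit in memory at once.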





There are many tools.



Outline

  • motivation
  • kaggle
  • machine learning
  • learning on kaggle
  • vowpal wabbit



Hurrah!

fun with kaggle

By ajschumacher
