A more interesting example
kaggle.com Digit Recognizer competition
Classify handwritten digits using the MNIST dataset of 70k+ images
Kaggle competition
Training set: 42,000 images
Test set: 28,000 images
Graded on the accuracy of the predictions
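A minimal loading sketch, assuming the standard Kaggle Digit Recognizer layout: train.csv with a label column plus 784 pixel columns, and test.csv with 784 pixel columns, both in the working directory.

```python
import pandas as pd

train = pd.read_csv("train.csv")   # expected shape: (42000, 785)
test = pd.read_csv("test.csv")     # expected shape: (28000, 784)

y_train = train["label"].to_numpy()               # digit labels 0-9
X_train = train.drop(columns="label").to_numpy()  # pixel brightness 0-255
X_test = test.to_numpy()

print(X_train.shape, y_train.shape, X_test.shape)
```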
k-Nearest Neighbor Algorithm
Dataset
28x28 pixel images = 784 pixels, each with a value of 0-255 for brightness
784 pixels -> a 784-dimensional feature space
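A small sketch of that layout: each CSV row is a flattened 784-vector, and reshaping it to 28x28 recovers the image. X_train here is assumed to be the array from the loading sketch above.

```python
image = X_train[0].reshape(28, 28)   # one example back in 28x28 form

# Crude ASCII preview: brightness above 128 becomes '#', otherwise '.'
for row in image:
    print("".join("#" if px > 128 else "." for px in row))
```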
k-Nearest Neighbor
- Load your entire training set into memory (not actually necessary, just imagine).
- Get your query input, or test point.
- Place it in the same feature space as your training set.
- Take the 'distance' between your query input and all the training points, choose the k nearest neighbors, then 'poll' those neighbors and ask what label they have. Majority rules (see the sketch after this list).
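A minimal k-nearest-neighbor sketch with NumPy following the steps above. It uses plain Euclidean distance; X_train, y_train, and the query are assumed to be arrays like the ones loaded earlier, and k is a free choice.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # 'Distance' from the query to every training point (Euclidean here).
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 'Poll' the neighbors: the majority label wins.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Example: classify the first test image.
# prediction = knn_predict(X_train, y_train, X_test[0], k=5)
```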
The Digit Recognizer problem has a 784-dimensional feature space.
How do you define 784-dimensional distance?
How do you define 3-dimensional distance?
What is distance?
Metric
Discrete metric:
d(x, y) := 0 if x = y, 1 if x != y
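A direct sketch of the discrete metric. Note that under it every training image that isn't byte-for-byte identical to the query is equally far away, which is why a richer metric is needed.

```python
def discrete_metric(x, y):
    # Distance 0 to yourself, 1 to everything else.
    return 0 if x == y else 1
```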
Minkowski metric
Euclidean distance is just a special case of the Minkowski metric for p = 2
p = 1 corresponds to the Manhattan metric or taxicab metric
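A sketch of the Minkowski metric, d(x, y) = (sum_i |x_i - y_i|^p)^(1/p), showing the two special cases named above.

```python
import numpy as np

def minkowski(x, y, p=2):
    return (np.abs(x - y) ** p).sum() ** (1 / p)

# Sanity checks in 2D:
a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
print(minkowski(a, b, p=1))  # 7.0 (Manhattan / taxicab)
```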
Unit circles in different Minkowski spaces
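A sketch that reproduces roughly what the figure shows: the 'unit circle' {x : d(0, x) = 1} under different Minkowski metrics is a diamond for p = 1, a circle for p = 2, and approaches a square as p grows. Uses matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
for p in (1, 2, 4, 10):
    # Scale each direction vector so its Minkowski p-norm equals 1.
    x, y = np.cos(theta), np.sin(theta)
    norm = (np.abs(x) ** p + np.abs(y) ** p) ** (1 / p)
    plt.plot(x / norm, y / norm, label=f"p = {p}")

plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```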
Holmes metric (*)
Lambda is an n-dimensional vector that weights how 'important' a particular dimension is for the distance.
* Fails the triangle inequality, therefore not a real metric
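A hedged sketch of the weighting idea described above: lam is an n-dimensional vector saying how much each dimension contributes to the distance. The exact formula isn't reproduced on this slide, so this weighted Minkowski-style sum is only one plausible reading.

```python
import numpy as np

def weighted_distance(x, y, lam, p=2):
    # Each coordinate difference is scaled by its weight before summing.
    return (lam * np.abs(x - y) ** p).sum() ** (1 / p)

# Example: ignore dimension 1 entirely, emphasize dimension 0.
x = np.array([1.0, 5.0])
y = np.array([4.0, -5.0])
lam = np.array([2.0, 0.0])
print(weighted_distance(x, y, lam))  # only the first coordinate matters
```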
Back to digit recognition for now