Lucas Oliveira David
Universidade Federal de São Carlos
December 2015
Machine Learning can help us with many tasks:
classification, estimation, data analysis, and decision making.
Data is an important piece of the learning process.
Original Dermatology data set vs. reduced Dermatology data set.
Original Spam data set (2048.88 KB) vs. reduced Spam data set (107.84 KB).
Original R data set (linearly separable) vs. reduced R data set (still linearly separable).
Principal Component Analysis
1. Original data set.
2. Principal components of the covariance matrix K.
3. Basis change.
4. Component elimination.
Feature covariance matrix K.
V is an orthonormal change-of-basis matrix, formed by the eigenvectors of K.
Find V', the basis formed by the most important principal components.
Use V' to transform samples from X to Y.
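In symbols (a sketch; $X_c$ denotes the mean-centered data matrix, a name introduced here for illustration):

$$K = \frac{1}{n - 1} X_c^\top X_c, \qquad K = V \Lambda V^\top, \qquad Y = X_c V'$$

where $V'$ keeps only the eigenvectors in $V$ associated with the largest eigenvalues in $\Lambda$.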
SVD
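In practice, the principal components can be obtained from the SVD of the centered data without forming K explicitly. A minimal sketch with NumPy (the function name and parameters are illustrative, not the original implementation):

    import numpy as np

    def pca(X, n_components=2):
        # Center the data: PCA is defined over the mean-centered samples.
        X_c = X - X.mean(axis=0)
        # The rows of Vt are the eigenvectors of K, since X_c = U S Vt
        # implies X_c.T @ X_c = Vt.T @ diag(S**2) @ Vt.
        U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
        # Keep the components with the largest singular values and project.
        return X_c @ Vt[:n_components].T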
Multidimensional Scaling
1. Compute pairwise distances.
2. Full space reconstruction: recover the inner-product (Gram) matrix from the distances.
3. Sort eigenvalues and eigenvectors by the value of the eigenvalues.
4. Embed it!
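A minimal sketch of classical MDS with NumPy (the name mds mirrors the helper used in the ISOMAP code below; its exact behavior there is an assumption):

    import numpy as np

    def mds(delta, n_components=2):
        # Double centering recovers the Gram matrix B = -1/2 J delta^2 J
        # from the pairwise distances.
        n = delta.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (delta ** 2) @ J
        # Sort eigenvalues and eigenvectors in decreasing eigenvalue order.
        w, v = np.linalg.eigh(B)
        order = np.argsort(w)[::-1][:n_components]
        w, v = w[order], v[:, order]
        # Embed: coordinates are eigenvectors scaled by sqrt(eigenvalues).
        return v * np.sqrt(np.maximum(w, 0))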
Q: What happens when data that follows a nonlinear distribution is reduced with linear methods?
A: Very dissimilar samples become mixed as they are crushed onto lower dimensions.
Isometric Feature Mapping (ISOMAP)
1. Original data set.
2. Compute neighborhood graph.
3. Compute geodesic distances.
4.A. Reduction with MDS (2 dimensions).
4.B. Reduction with MDS (1 dimension).
    def isomap(data_set, n_components=2, k=10, epsilon=1.0, use_k=True,
               path_method='dijkstra'):
        # 1. Compute the neighborhood graph (k nearest neighbors, or an
        #    epsilon-ball around each sample).
        delta = nearest_neighbors(data_set, k if use_k else epsilon)

        # 2. Approximate geodesic distances by shortest paths in the graph.
        if path_method == 'dijkstra':
            delta = all_pairs_dijkstra(delta)  # 2.A
        else:
            delta = floyd_warshall(delta)      # 2.B

        # 3. Embed the data with MDS.
        embedding = mds(delta, n_components)
        return embedding
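Here nearest_neighbors, all_pairs_dijkstra and floyd_warshall are helpers from the implementation linked at the end; mds is classical MDS, as sketched above. The path_method choice trades complexity: all-pairs Dijkstra runs in O(n(E + n log n)) on a sparse neighborhood graph, while Floyd-Warshall is Θ(n³) regardless of sparsity.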
Experiment 1: Digits, 1797 samples and 64 dimensions.
Grid search was performed using a Support Vector Classifier.
| | 64 dimensions | 10 dimensions |
|---|---|---|
| Accuracy | 98% | 96% |
| Data size | 898.5 KB | 140.39 KB |
| Grid time | 11.12 sec | 61.66 sec |
| Best parameters | 'kernel': 'rbf', 'gamma': 0.001, 'C': 10 | 'kernel': 'linear', 'C': 1 |
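For reference, a sketch of how such a grid search might be set up with scikit-learn (the parameter grid below is illustrative; only the winning parameters reported above come from the original experiments):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    digits = load_digits()  # 1797 samples, 64 dimensions.

    # Hypothetical grid; the slides only report the best parameters found.
    grid = [
        {'kernel': ['linear'], 'C': [1, 10, 100]},
        {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': [.001, .01, .1]},
    ]
    search = GridSearchCV(SVC(), grid).fit(digits.data, digits.target)
    print(search.best_params_, search.best_score_)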
Experiment 2: Leukemia, 72 samples and 7130 features.
Grid search was performed using a Support Vector Classifier.
| | 7130 dimensions | 10 dimensions |
|---|---|---|
| Accuracy | 99% | 88% |
| Data size | 4010.06 KB | 16.88 KB |
| Grid time | 2.61 sec | 0.36 sec |
| Best parameters | 'degree': 2, 'coef0': 10, 'kernel': 'poly' | 'C': 1, 'kernel': 'linear' |
Experiment 3: Spam, 4601 samples and 57 features.
Reducing it to 3 dimensions took...
Obviously, we can do better!
Studying scikit-learn's implementation:
ISOMAP as Kernel PCA
"Kernel Trick"
Sample Covariance Matrix.
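The connection: the double-centered matrix of squared geodesic distances behaves as a kernel (Gram) matrix, so the final embedding can be computed as kernel PCA over it. A minimal sketch of this view (scikit-learn's KernelPCA performs the centering internally; the function name is illustrative):

    from sklearn.decomposition import KernelPCA

    def isomap_embedding(geodesic_distances, n_components=2):
        # -1/2 * D**2 plays the role of a precomputed kernel; centering it
        # (done inside KernelPCA) is exactly the double centering of MDS.
        kpca = KernelPCA(n_components=n_components, kernel='precomputed')
        return kpca.fit_transform(-0.5 * geodesic_distances ** 2)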
L-ISOMAP
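Landmark ISOMAP: shortest paths are computed only from a small set of landmark samples, and the remaining points are embedded relative to the landmarks, cutting the cost of the geodesic and embedding steps.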
Manifold Assumption
Disconnected graphs.
Incorrect reductions.
Possible work-around: increasing the parameters k or epsilon.
Manifold Convexity
Deformation around the "hole".
Noise
Incorrect unfolding.
Possible work-around: reducing the parameters k or epsilon.
For the ISOMAP implementation and experiments, check out my GitHub: github.com/lucasdavid/manifold-learning.