Mohamad Amin Mohamadi
Prof. Pasi Franti
Summer 2019
1. Centroid Step:
Calculate the probability of each category of each attribute within each cluster: $p_j(d, c) = n_j(d, c) / n_j$ for every cluster $j \in \{1, \dots, k\}$, attribute $d \in \{1, \dots, D\}$, and category $c \in \{1, \dots, C\}$, where $n_j(d, c)$ is the number of datapoints in cluster $j$ whose attribute $d$ has category $c$, $n_j$ is the size of cluster $j$, and $C$ is the maximum number of categories in attributes.
2. Assignment step:
Assign each datapoint $x$ to the cluster $j$ that minimizes the entropy cost $$\mathrm{cost}(x, j) = -\sum_{d=1}^{D} \log p_j(d, x_d) \tag{1}$$ (see the sketch below).
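As a quick illustration of the two steps, here is a minimal Python sketch on a toy single-attribute dataset. The function names and the example cluster contents are our assumptions, not from the slides; the resulting numbers match the cost tables below.

```python
# Minimal sketch of the centroid and assignment steps (D = 1 for brevity).
import math
from collections import Counter

def centroid_probabilities(cluster_items):
    """Centroid step: relative frequency of each category in the cluster."""
    counts = Counter(cluster_items)
    n = len(cluster_items)
    return {cat: c / n for cat, c in counts.items()}

def entropy_cost(point, probs):
    """Assignment cost of eq. (1), here with a single attribute."""
    p = probs.get(point, 0.0)
    return -math.log(p) if p > 0 else math.inf

cluster1 = ["A", "A", "A", "A", "A", "C"]    # assumed cluster contents
probs1 = centroid_probabilities(cluster1)    # {'A': 0.83, 'C': 0.17}
print(entropy_cost("A", probs1))             # ~0.18
print(entropy_cost("C", probs1))             # ~1.79
print(entropy_cost("B", probs1))             # inf -> the log-of-zero problem
```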
[Figure: example clusterings with per-cluster entropies 0, 0, 0.45, 0, 0.69, 0]
| | P(A) | P(B) | P(C) | P(D) |
|---|---|---|---|---|
| Cluster 1 | 0.83 | - | 0.17 | - |
| Cluster 2 | - | 1.00 | - | - |
| Cluster 3 | - | - | - | 1.00 |
| Data | Cluster 1 | Cluster 2 | Cluster 3 | Selected |
|---|---|---|---|---|
| | 0.18 | ∞ | ∞ | 1 |
| | ∞ | 0 | ∞ | 2 |
| | ∞ | ∞ | 1.79 | 3 |
| | 0 | ∞ | ∞ | 1 |
We select the best cluster for each datapoint according to equation (1). For example, a datapoint whose category has probability 0.83 in its own cluster costs -ln(0.83) ≈ 0.18 there, while any cluster where that category has probability zero costs -ln(0) = ∞.
Problem:
log(0) is infinite, so a datapoint will never move as long as even one attribute has zero probability in the target cluster, no matter how strongly the other D − 1 attributes favor the move!
Motivation:
Decrease the influence of categories with zero probability, and anticipate newcomers in less crowded clusters!
No datapoint moves! Each datapoint has its best situation (according to eq. 1) in its own cluster!
[Figure: example datapoints with categorical values (A–F) and unassigned datapoints marked "?", arranged into three clusters]
* Costs calculated according to equation (1)
Let's start with a simple epsilon:
| | P(A) | P(B) | P(C) | P(D) |
|---|---|---|---|---|
| Cluster 1 | 0.67 | 0.17 | 0.17 | 0.17 |
| Cluster 2 | 0.25 | 0.75 | 0.25 | 0.25 |
| Cluster 3 | 0.5 | 0.5 | 0.5 | 0.5 |
| Data | Cluster 1 | Cluster 2 | Cluster 3 | Selected |
|---|---|---|---|---|
| | 0.40 | 1.38 | 0.69 | 1 |
| | 1.77 | 0.28 | 0.69 | 2 |
| | 1.77 | 1.38 | 0.69 | 3 |
| | 1.77 | 1.38 | 0.69 | 3 |
We select the best cluster for each datapoint according to equation (1).
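A minimal sketch of how such a smoothed centroid table could be computed, assuming the 1/(n_j + 1) scheme used in the pseudocode below; the cluster contents and function names here are illustrative assumptions. For instance, Cluster 1 above is consistent with four A's and one C: A → 4/6 ≈ 0.67, and each empty or singleton category → 1/6 ≈ 0.17.

```python
# Sketch of epsilon smoothing for centroid probabilities, assuming the
# 1/(n_j + 1) scheme from the pseudocode below.
from collections import Counter

def smoothed_probabilities(cluster_items, all_categories):
    """Give empty categories a small probability instead of zero."""
    counts = Counter(cluster_items)
    n = len(cluster_items)
    if len(counts) < len(all_categories):        # some category is empty
        return {cat: (counts[cat] if counts[cat] > 0 else 1) / (n + 1)
                for cat in all_categories}
    return {cat: counts[cat] / n for cat in all_categories}

cluster1 = ["A", "A", "A", "A", "C"]             # assumed cluster contents
print(smoothed_probabilities(cluster1, ["A", "B", "C", "D"]))
# {'A': 0.67, 'B': 0.17, 'C': 0.17, 'D': 0.17}  -> matches Cluster 1 above
```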
| Strategy | Average Impurity | Impurity SE |
|---|---|---|
| | 7.59* | 0.31* |
| | 7.49* | 0.26* |
* Over 100 tests on the test dataset
Input
-------------
X (N × D)
k
update_proportion_criterion
Algorithm
-------------
assign datapoints randomly to k clusters
do
    centroids = calculate_centroids()
    movements = assign_datapoints(centroids)
while movements > update_proportion_criterion * N
function calculate_centroids()
    for j in 1..k
        cluster_data = items in cluster j // n_j × D matrix
        for d in 1..D
            if there is an empty category in dimension d of cluster_data:
                for each category cat in dimension d:
                    if count(cat) in cluster_data > 0:
                        centroids[j][d][cat] = count(cat) / (n_j + 1)
                    else:
                        centroids[j][d][cat] = 1 / (n_j + 1)
            else:
                for each category cat in dimension d:
                    centroids[j][d][cat] = count(cat) / n_j
    return centroids
function assign_datapoints(centroids)
    movements = 0
    for i in 1..N:
        selected_cluster = -1
        lowest_cost = +infinite
        for j in 1..k:
            cluster_cost = 0
            for d in 1..D:
                category = X[i][d]
                cluster_cost -= log(centroids[j][d][category])
            if cluster_cost < lowest_cost: // eq. (1): choose the lowest cost
                lowest_cost = cluster_cost
                selected_cluster = j
        if selected_cluster != prev_cluster[i]:
            movements++
            move datapoint i to selected_cluster
    return movements
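For reference, here is a runnable Python transcription of the pseudocode above; a sketch under our own naming, not the original implementation. X is a list of N rows of D categorical values.

```python
# Runnable sketch of the algorithm; helper names are ours.
import math
import random
from collections import Counter

def calculate_centroids(X, labels, k, categories):
    """Centroid step: per-cluster, per-attribute category probabilities,
    with 1/(n_j + 1) smoothing whenever some category is empty."""
    D = len(X[0])
    centroids = [[dict() for _ in range(D)] for _ in range(k)]
    for j in range(k):
        cluster = [X[i] for i in range(len(X)) if labels[i] == j]
        n_j = len(cluster)
        for d in range(D):
            counts = Counter(row[d] for row in cluster)
            if len(counts) < len(categories[d]):   # an empty category exists
                for cat in categories[d]:
                    centroids[j][d][cat] = (counts[cat] or 1) / (n_j + 1)
            else:
                for cat in categories[d]:
                    centroids[j][d][cat] = counts[cat] / n_j
    return centroids

def assign_datapoints(X, labels, centroids):
    """Assignment step: move each datapoint to the cluster with the
    lowest cost according to eq. (1); returns the number of moves."""
    movements = 0
    for i, row in enumerate(X):
        best_j, best_cost = labels[i], math.inf
        for j, centroid in enumerate(centroids):
            cost = -sum(math.log(centroid[d][row[d]]) for d in range(len(row)))
            if cost < best_cost:
                best_cost, best_j = cost, j
        if best_j != labels[i]:
            labels[i] = best_j
            movements += 1
    return movements

def k_entropy(X, k, update_proportion_criterion=0.0, seed=None):
    """Alternate centroid and assignment steps until few enough points move."""
    rng = random.Random(seed)
    N, D = len(X), len(X[0])
    categories = [sorted({row[d] for row in X}) for d in range(D)]
    labels = [rng.randrange(k) for _ in range(N)]
    while True:
        centroids = calculate_centroids(X, labels, k, categories)
        movements = assign_datapoints(X, labels, centroids)
        if movements <= update_proportion_criterion * N:
            return labels
```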
For testing the algorithm, we used the mushroom dataset.
Mushroom dataset properties: 8124 datapoints, 22 categorical attributes, 2 classes (edible / poisonous).
We used 16 clusters for the mushroom dataset.
Results of running the algorithm on the dataset are available in the next slides.
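As a hypothetical usage example (not from the slides), the UCI mushroom data could be fetched and clustered with the k_entropy sketch above; the URL and column layout follow the standard UCI distribution of the dataset.

```python
# Hypothetical usage with the UCI mushroom dataset (URL assumed).
import urllib.request

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "mushroom/agaricus-lepiota.data")
rows = [line.decode("ascii").strip().split(",")
        for line in urllib.request.urlopen(URL)]
y = [row[0] for row in rows]     # first column: edible (e) / poisonous (p)
X = [row[1:] for row in rows]    # remaining 22 categorical attributes
labels = k_entropy(X, k=16, seed=0)
```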
Results of 100 runs:
Statistic | Value |
---|---|
Average impurity | 7.49 |
Best (lowest) impurity | 7.04 |
Worst (highest) impurity | 8.12 |
Standard error of impurity | 0.26 |
Empty cluster situations* | 9 |
* By an empty cluster situation we mean that at some point the algorithm produced an empty cluster and could not continue from there; such runs were excluded from the statistics.
[Figure: 2D visualization using t-SNE; impurity: 7.41]
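One plausible way to produce such a plot (an assumption, not the authors' script) is to one-hot encode the categorical data and embed it with scikit-learn's t-SNE, reusing X and labels from the snippet above:

```python
# Sketch: 2D t-SNE view of the clustering (our assumption of the method).
from sklearn.manifold import TSNE
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt

X_onehot = OneHotEncoder().fit_transform(X).toarray()  # one-hot the categories
embedding = TSNE(n_components=2).fit_transform(X_onehot)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4)
plt.title("Clusters in 2D (t-SNE)")
plt.show()
```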
Results of 100 runs:
Statistic | Value |
---|---|
Average impurity | 7.13 |
Best (lowest) impurity | 6.98 |
Worst (highest) impurity | 7.90 |
Standard error of impurity | 0.13 |
Empty cluster situations* | 4 |
* By an empty cluster situation we mean that at some point the algorithm produced an empty cluster and could not continue from there; such runs were excluded from the statistics.
[Figure: 2D visualization using t-SNE; impurity: 7.10]
* where cats denotes the set of categories in an attribute
Here are the statistics of algorithm runs on the mushroom dataset using the SE assignment strategy.
* By an empty cluster situation we mean that at some point the algorithm produced an empty cluster and could not continue from there; such runs were excluded from the statistics.
| | K-Means (100 Runs) | K-Means + 100 Random Swaps (50 Runs) |
|---|---|---|
| Average impurity | 7.51 | 6.99 |
| Best (lowest) impurity | 7.00 | 6.95 (new best) |
| Worst (highest) impurity | 8.79 | 7.07 |
| Standard error of impurity | 0.13 | 0.02 |
| Empty cluster situations* | 27 | 2 |
[Figure: 2D visualization using t-SNE; impurity: 6.95]
Impurity | Epsilon = 1/N | SE Strategy |
---|---|---|
Average | 7.013 | 6.99 |
Best (lowest) | 6.98 | 6.95 |
Worst (highest) | 7.90 | 7.07 |
Standard error | 0.13 | 0.02 |
[Figure: convergence comparison for the SE strategy (SASD) and initial epsilon = 1/N]
According to these experiments, the SE strategy converges much faster than the epsilon strategy, probably because a fixed epsilon is a poor anticipation of the changes in the next iterations of the algorithm. In general, though, a well-chosen epsilon should give the best convergence rate.
| Algorithm | Soybean (M = 4) Cost | Std | Mushroom (M = 16) Cost | Std | SPECT (M = 8) Cost | Std |
|---|---|---|---|---|---|---|
| ACE | 7.83 | 0 | 7.01 | n/a | 8.43 | 0.07 |
| ROCK | 9.20 | 0 | n/a | n/a | 10.82 | 0 |
| K-medoids | 10.94 | 1.57 | 13.46 | 0.71 | 10.06 | 0.44 |
| K-modes | 8.66 | 1.06 | 10.19 | 1.45 | 9.25 | 0.29 |
| K-distributions | 9.00 | 0.93 | 7.87 | 0.44 | 8.25 | 0.13 |
| K-representatives | 8.30 | 0.83 | 7.51 | 0.30 | 8.64 | 0.13 |
| K-histograms | 8.04 | 0.53 | 7.31 | 0.18 | 8.64 | 0.15 |
| K-entropy | 8.15 | 0.56 | 7.31 | 0.18 | 8.06 | 0.07 |
| Proposed | 7.83 | 0 | 6.95 | 0 | 7.97 | 0 |