Clustering

UCI heart disease

Clustering

Clustering is an unsupervised learning technique used to group similar data points together based on their characteristics.

Why we cluster data

1. Discover hidden patterns;

2. Simplify complex data;

3. Improve decision-making;

4. Preprocessing for other models.

UCI heart disease

Categorical

1. sex

2. cp

3. fbs

4. restecg

5. exang

6. slope

7. thal

Continuous

8. age

9. trestbps

10. chol

11. thalach

12. oldpeak

13. ca
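Because the features mix categorical codes and continuous measurements, they are usually brought onto a comparable scale before a distance-based algorithm sees them. The following is a minimal sketch, assuming one-hot encoding for categorical columns and z-score standardization for continuous ones; the slides do not state which preprocessing was used. The function name `preprocess` and the three records are hypothetical, not taken from the dataset, and only four of the thirteen columns are shown to keep the example short.

```python
from statistics import mean, pstdev

def preprocess(records, categorical, continuous):
    """One-hot encode categorical columns and z-score continuous ones."""
    # Sorted set of values observed in each categorical column.
    levels = {c: sorted({r[c] for r in records}) for c in categorical}
    # Mean and population standard deviation of each continuous column.
    stats = {}
    for c in continuous:
        values = [r[c] for r in records]
        stats[c] = (mean(values), pstdev(values))
    rows = []
    for r in records:
        row = []
        for c in categorical:
            row.extend(1.0 if r[c] == v else 0.0 for v in levels[c])
        for c in continuous:
            mu, sigma = stats[c]
            # A constant column carries no information; map it to 0.
            row.append((r[c] - mu) / sigma if sigma > 0 else 0.0)
        rows.append(row)
    return rows

# Made-up records for illustration only.
records = [
    {"sex": 1, "cp": 3, "age": 63, "chol": 233},
    {"sex": 0, "cp": 2, "age": 41, "chol": 204},
    {"sex": 1, "cp": 3, "age": 57, "chol": 192},
]
rows = preprocess(records, categorical=["sex", "cp"], continuous=["age", "chol"])
```

Each output row has one column per observed categorical value followed by one standardized column per continuous feature.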

Goals

1. To group samples into categories with correlated features.

2. To predict possible heart disease.

Algorithm used

K-Means

One of the best-known clustering algorithms
Tries to group samples with correlated features

Input: number of clusters (k)
Output: a cluster label for each sample, visualized in a scatterplot
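The idea behind K-Means can be sketched with the classic assign-and-update loop (Lloyd's algorithm). This is an illustrative pure-Python version, not the implementation used for the slides, which most likely relied on a library. The names `k_means` and `squared_distance`, the `seed` parameter, and the six sample points are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Basic Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct samples
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Made-up 2-D points for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = k_means(points, k=2)
```

The result depends on the random starting centroids, which is why libraries typically run several initializations and keep the best one.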

K-Means

Elbow-method

Finds a suitable k for the dataset by plotting the within-cluster sum of squares against k
Where the curve flattens and becomes nearly linear (the "elbow"),
we have the optimal k
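The curve behind the elbow method can be computed by running K-Means for several values of k and recording the within-cluster sum of squares (often called inertia) each time. The sketch below is self-contained and uses a deliberately simple K-Means. The function names, the range of k, and the sample points are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inertia_for_k(points, k, iterations=50, seed=0):
    """Run a basic K-Means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Move each non-empty cluster's centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # Inertia: squared distance from each point to its closest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Made-up 2-D points for illustration only.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.2),
    (1.0, 9.0), (1.5, 8.5), (0.8, 9.4),
]
inertias = {k: inertia_for_k(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting `inertias` against k gives the elbow curve; the k after which additional clusters bring only small, steady improvements is the usual choice.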

Scatterplot

PCA plot

Easier to read
Good for datasets with few features
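A PCA plot draws each sample at its coordinates along the two directions of greatest variance. As a rough sketch of what happens underneath, the code below finds those two directions by power iteration on the covariance matrix. This is a teaching version under stated assumptions; in practice a library routine such as scikit-learn's `PCA` would normally be used. The name `pca_2d` and the sample rows are hypothetical.

```python
import math
import random

def pca_2d(rows, iterations=200, seed=0):
    """Project rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)

    def top_eigenvector(matrix):
        # Power iteration from a random start vector.
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        return v

    components = []
    for _ in range(2):
        v = top_eigenvector(cov)
        # Eigenvalue estimate (Rayleigh quotient for a unit vector).
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    # Coordinates of every centered row along the two components.
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]

# Made-up 3-feature rows for illustration only.
rows = [(-3, -1, 0.1), (-2, 1, -0.1), (-1, -1, -0.1),
        (1, 1, 0.1), (2, -1, 0.1), (3, 1, -0.1)]
projected = pca_2d(rows)
```

The two columns of `projected` are the x and y coordinates for the scatterplot; coloring each point by its cluster label gives the PCA plot.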

t-SNE plot

More informative
Good for datasets with many features

Feature importance

Each feature has a different level of importance in determining the cluster assignment of a given record.
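The slides do not say how feature importance was computed. One simple measure, assumed here purely for illustration, is the share of each feature's variance that is explained by the cluster labels (between-cluster sum of squares divided by total sum of squares). The function name `feature_importance` and the sample values are hypothetical.

```python
def feature_importance(rows, labels, names):
    """Share of each feature's variance explained by the cluster labels."""
    n = len(rows)
    scores = {}
    for j, name in enumerate(names):
        column = [r[j] for r in rows]
        overall = sum(column) / n
        total = sum((x - overall) ** 2 for x in column)
        between = 0.0
        for cluster in set(labels):
            members = [x for x, lab in zip(column, labels) if lab == cluster]
            cluster_mean = sum(members) / len(members)
            between += len(members) * (cluster_mean - overall) ** 2
        # A constant feature cannot separate clusters; score it 0.
        scores[name] = between / total if total > 0 else 0.0
    # Most important feature first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Made-up values for illustration only: (chol, age) per sample.
rows = [(200, 50), (210, 60), (300, 50), (310, 60)]
labels = [0, 0, 1, 1]
scores = feature_importance(rows, labels, names=["chol", "age"])
```

A score near 1 means the clusters differ strongly on that feature, while a score near 0 means the feature looks the same in every cluster.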

Credits

Pietro Mondini

Nicolò Moroni

Alessandro Crippa
