Clustering
UCI heart disease
Clustering
Clustering is an unsupervised learning technique used to group similar data points toghether based on their characteristic.
Clustering
Why we cluster data
1. Discover hidden patterns;
2. Simplify complex data;
3. Improve decision-making;
4. Preprocessing for other models.
UCI heart disease
UCI heart disease
UCI heart disease
1. sex
2. cp
3. fbs
4. restecg
5. exang
6. slope
7. thal
8. age
9. trestbps
10. chol
11. thalach
12. oldpeak
13. ca
Categorical
Continuous
UCI heart disease
1. sex
2. cp
3. fbs
4. restecg
5. exang
6. slope
7. thal
Categorical
Continuous
8. age
9. trestbps
10. chol
11. thalach
12. oldpeak
13. ca
1. To group samples into categories with correlated features.
2. To predict eventual heart disease
Goals
Used algorithm
Used algorithm
K-Means
- One of most famous algorithm
- Try to find correlations
- Input: number of cluster (k)
- Output: visualization in a scatterplot
K-Means
Elbow-method
K-Means
Elbow-method
- Find the suitable k for the dataset
-
When the curve get more linear,
we have the optimal k

Scatterplot
Scatterplot

PCA plot
- Easier to read
- Good for low featured
dataset
Scatterplot

PCA plot
- Easier to read
- Good for low featured
dataset

t-SNE plot
- More informative
- Good for high featured datasets
Feature importance
Feature importance
Each feature has a different level of importance in determining the cluster assignment of a given record.
Feature importance

Feature importance

Feature importance

Feature importance

Feature importance

Feature importance

Credits
Pietro Mondini
Nicolò Moroni
Alessandro Crippa
Clustering
By Pietro Mondini
Clustering
- 38
