Reduce Dimensions Using PCA for Better Insights

Business Scenario

Welcome!

Today is your ninth day as a Junior Data Scientist at MediCare Analytics.

Our company has recently partnered with a network of hospitals and healthcare research centers to improve the understanding of heart disease risk factors using patient health records.

The client has shared a medical dataset containing patient information such as age, cholesterol levels, blood pressure, heart rate, chest pain type, ECG results, and other clinical measurements.

Although the dataset contains valuable information, it contains multiple health indicators that make visualization and interpretation difficult. The healthcare team wants to simplify the dataset while preserving the most important information so that researchers can better understand patient health patterns.

Before building advanced predictive models, your manager wants you to understand how Dimensionality Reduction can simplify complex datasets and improve data visualization.

As part of this project, you will:

  • Understand Dimensionality Reduction concepts.
  • Learn Principal Component Analysis (PCA).
  • Reduce feature dimensions using PCA.
  • Visualize patient groups using principal components.
  • Interpret hidden patterns in healthcare data.

Pre-Lab Preparation

Topic : Unsupervised Learning 

1) PCA

git pull origin branchName

Git Pull

Task 1: Understanding Dimensionality Reduction

Before applying PCA, it is important to understand why dimensionality reduction is needed in machine learning projects.

What is Dimensionality Reduction?

Dimensionality Reduction is the process of reducing the number of features in a dataset while preserving as much useful information as possible.

Datasets with many variables are often difficult to analyze and visualize. Dimensionality reduction helps simplify the dataset and improves computational efficiency.

Benefits of Dimensionality Reduction

  • Reduces data complexity
  • Improves visualization
  • Reduces computational cost
  • Removes redundant information
  • Helps improve model performance

What is PCA?

Principal Component Analysis (PCA) is an unsupervised machine learning technique used for dimensionality reduction.

PCA transforms the original features into a smaller set of new variables called Principal Components. These components capture the maximum variance present in the dataset.

The first principal component captures the highest variance, followed by the second principal component, and so on.

Advantages of PCA

  • Reduces feature count
  • Preserves important information
  • Removes multicollinearity
  • Improves visualization
  • Speeds up model training

Task 2: Applying PCA for Better Insights

Now that you understand dimensionality reduction, the healthcare analytics team wants you to apply PCA on patient health records.

The objective is to:

  • Standardize the dataset.

  • Apply PCA.

  • Analyze explained variance.

  • Reduce the dataset into principal components.

  • Visualize patient distributions.

While performing PCA, think about:

  • How much variance is explained by each component?

  • How many principal components should be selected?

  • Does PCA simplify the dataset?
  • Can patient groups be visualized more clearly?

Open a new file in Google Colab and import libraries

1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

Load and explore dataset

2

Click to download Dataset : heart.csv

df = pd.read_csv("heart.csv")

df.head()

Explore Dataset

3

df.shape
df.info()
df.describe()

Separate Input and Target Variables

4

X = df.drop("target", axis=1)

Y = df["target"]

Split Dataset

5

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X,
    Y,
    test_size=0.3, random_state=1)

Standardize the Dataset

6

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)

X_test = ss.transform(X_test)

Import PCA

7

from sklearn.decomposition import PCA

Apply PCA

8

pc1 = PCA(
    n_components=None,
    random_state=1
)

X_train1 = pc1.fit_transform(X_train)

X_test1 = pc1.transform(X_test)

Analyze Explained Variance

9

pc1.explained_variance_ratio_

Calculate Cumulative Variance

10

np.cumsum(pc1.explained_variance_ratio_)

Select Optimal Principal Components

11

pc2 = PCA(
    n_components=6,
    random_state=1
)

X_train1 = pc2.fit_transform(X_train)

X_test1 = pc2.transform(X_test)

Build Logistic Regression Model

12

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train1, Y_train)

Y_pred = lr.predict(X_test1)

Evaluate Model

13

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print(classification_report(Y_test, Y_pred))

print(confusion_matrix(Y_test, Y_pred))

Reduce Data to Two Principal Components

14

pca_vis = PCA(n_components=2)

X_pca = pca_vis.fit_transform(
    StandardScaler().fit_transform(X)
)

Create PCA DataFrame

15

pca_df = pd.DataFrame(
    data=X_pca,
    columns=["PC1", "PC2"]
)

pca_df["target"] = Y

pca_df.head()

Visualize Clusters

16

plt.figure(figsize=(8,6))

sns.scatterplot(
    x="PC1",
    y="PC2",
    hue="target",
    data=pca_df
)

plt.title("PCA Cluster Visualization")

plt.show()

 

Great job!

You have successfully completed lab 15: Reduce Dimensions Using PCA for Better Insights.

In this lab, you have: Understood Dimensionality Reduction, Learned Principal Component Analysis (PCA), Applied PCA on the Heart Disease dataset, Analyzed explained variance, Reduced multiple features into principal components, Evaluated model performance using PCA-transformed data, and Visualized patient clusters using PCA.

You are now ready to move to the next stage of machine learning. In the next lab, we will learn how to Build an ML Web App Using Streamlit and deploy machine learning models through an interactive web interface.

Checkpoint

   Git Push

git push origin branchName

Next-Lab Preparation

Topic : Model Deployment Basics

1) Introduction to Streamlit for ML Model Deployment

2) Web Application for ML Predictions

 

         

ML Lab 15: Reduce Dimensions Using PCA for Better Insights

By Content ITV

ML Lab 15: Reduce Dimensions Using PCA for Better Insights

  • 13