EDA - Automobile Data

Handling Imbalanced Data (SMOTE, Undersampling, Oversampling)

Learning Outcome

Visual Cue: A clear, bulleted list next to a "target/bullseye" icon.

1. Recognize the "Accuracy Paradox" and why 99% accuracy isn't always a good thing.
2. Define class imbalance in classification problems.
3. Apply Undersampling and Oversampling to balance datasets.
4. Understand how SMOTE synthetically generates new data points without duplicating them.

The EDA Progress Checklist:

  • Step 1: Handled Missing Data.

  • Step 2: Removed Outliers.

  • Step 3: Encoded Categorical Text.

  • Step 4: Scaled the Data.

  • Step 5: Selected & Engineered the Best Features.

The Final Problem: Our data is perfectly clean, but what if we are trying to predict something that almost never happens?

 

Predicting if a car engine will explode.
Out of 10,000 cars, only 10 explode.

You hire a mechanic to inspect 100 cars to find the 1 with a faulty engine.

The mechanic doesn't even look at the cars.
He just blindly stamps "PASSED" on all 100 cars.

He is right 99 times out of 100. Statistically, he has 99% Accuracy, yet he missed the one faulty car he was hired to find.

The Accuracy Paradox & Imbalance

An imbalanced dataset is one where the target class has an uneven distribution of observations.

Class Type        Number of Observations    Percentage
Majority Class    9900                      99%
Minority Class    100                       1%

Machine learning algorithms try to maximize overall accuracy.

  • If the model predicts only the majority class (Predicted: all = Class 0), the accuracy becomes:

Accuracy = 9900 / 10000 = 99%

But the model failed to detect a single minority case.

This is called the Accuracy Paradox.

Total Samples = 10000

Actual \ Predicted          Class 0    Class 1
Class 0 (Majority, 9900)    9900       0
Class 1 (Minority, 100)     100        0
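The paradox is easy to reproduce. The sketch below is a minimal illustration (assuming NumPy and scikit-learn are available; the labels are made up to match the 9,900 / 100 split) of a model that blindly predicts Class 0 for everything:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 9,900 majority samples (Class 0) and 100 minority samples (Class 1)
y_true = np.array([0] * 9_900 + [1] * 100)

# A "lazy" model that predicts the majority class for every sample
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))   # 0.99 -> looks impressive
print(recall_score(y_true, y_pred))     # 0.0  -> every minority case is missed
```

Metrics such as recall, precision, or the F1-score on the minority class expose what plain accuracy hides.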


Before training the model, we balance the dataset.

Class Type        Number of Observations
Majority Class    5000
Minority Class    5000

Now the model learns patterns from both classes.

Visual Cue: Side-by-side charts of the Imbalanced Dataset vs. the Balanced Dataset.

Technique 1: Undersampling

Mechanism

Randomly delete rows from the Majority Class until it matches the size of the Minority Class.

Dataset before balancing:

Class          Number of Observations
Safe Cars      9900
Faulty Cars    100

After Undersampling:

Class          Number of Observations
Safe Cars      100
Faulty Cars    100

We remove 9,800 Safe Car samples to create a balanced dataset (100 vs 100).
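As a rough sketch of this mechanism (using pandas on a toy dataset; the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy imbalanced dataset: 9,900 safe cars vs 100 faulty cars (illustrative)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "weight": rng.normal(1500, 300, 10_000),
    "label":  ["safe"] * 9_900 + ["faulty"] * 100,
})

majority = df[df["label"] == "safe"]
minority = df[df["label"] == "faulty"]

# Randomly delete majority rows until both classes are the same size
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])

print(balanced["label"].value_counts())   # safe 100, faulty 100
```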

Pros
Very fast training because the dataset becomes much smaller.

Cons
We discard a huge amount of useful information from the majority class.

Visual Cue

Interpretation

  •  Large Majority Class → cut down
  •  Random removal of samples
  •  Minority Class → remains the same

Result → Balanced dataset

Technique 2: Oversampling

Mechanism

Randomly duplicate rows from the Minority Class until it matches the size of the Majority Class.

Dataset before balancing:

Class          Number of Observations
Safe Cars      9900
Faulty Cars    100

After Oversampling:

Class          Number of Observations
Safe Cars      9900
Faulty Cars    9900

We copy the 100 Faulty Car samples repeatedly until they reach 9,900 observations.
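A comparable pandas sketch (again on an illustrative toy dataset), where the minority rows are sampled with replacement so the same faulty cars appear many times:

```python
import numpy as np
import pandas as pd

# Same toy dataset: 9,900 safe cars vs 100 faulty cars (illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "weight": rng.normal(1500, 300, 10_000),
    "label":  ["safe"] * 9_900 + ["faulty"] * 100,
})

majority = df[df["label"] == "safe"]
minority = df[df["label"] == "faulty"]

# Randomly duplicate minority rows (sampling WITH replacement)
# until they match the majority class size
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up])

print(balanced["label"].value_counts())   # safe 9900, faulty 9900
```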

Pros
No information from the dataset is lost.

Cons
The model may memorize those same 100 faulty cars instead of learning general patterns.

Visual Cue

Interpretation

  •  Small Minority Class → copied many times
  •  Copy / duplication process
  •  Majority Class → remains unchanged

Result → Balanced dataset without removing data


Technique 3: SMOTE (Synthetic Minority Over-sampling Technique)

Mechanism

SMOTE creates new synthetic samples instead of duplicating existing ones.

  • It uses K-Nearest Neighbors (KNN) to generate new data points between existing minority samples.

Existing Minority Samples:

Car Type        Weight    Speed     Label
Faulty Car A    Heavy     Medium    Faulty
Faulty Car B    Medium    Fast      Faulty

SMOTE creates a synthetic sample between them:

Car Type                Weight          Speed          Label
Synthetic Faulty Car    Heavy-Medium    Medium-Fast    Faulty

The algorithm blends features of nearby minority points to create realistic new data.
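A minimal sketch of SMOTE in practice, assuming the third-party imbalanced-learn (imblearn) package is installed; the numeric feature matrix below is synthetic and only for illustration:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Synthetic numeric features for 9,900 safe (0) and 100 faulty (1) cars
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(9_900, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 9_900 + [1] * 100)

# k_neighbors controls how many nearby minority points are used for interpolation
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print(np.bincount(y_res))   # [9900 9900] -> both classes balanced
```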

Pros

  • Reduces overfitting compared to simple oversampling.
  • Provides more diverse minority class examples.

Visual Cue

Interpretation

  •  Existing minority data points
  •  Nearest neighbors identified
  •  New synthetic point generated between them

Result → Balanced dataset with realistic synthetic samples

Comparison Table

Technique        Mechanism                                                     Main Risk             Best Suited When
Undersampling    Reduces the majority class by removing samples                Information loss      You have millions of rows of data (Big Data)
Oversampling     Duplicates minority class samples                             Overfitting           You have very little data overall
SMOTE            Creates synthetic minority samples using nearest neighbors    Noise introduction    Standard go-to technique for most imbalanced ML tasks

Summary

1. Accuracy is a trap: never trust standard accuracy on imbalanced data.
2. Undersampling shrinks the big class.
3. Oversampling clones the small class.
4. SMOTE enriches the small class by generating synthetic data.

Quiz

Which technique creates entirely new data points by interpolating between existing minority class instances?

A. Random Undersampling

B. Normalization

C. Random Oversampling

D. SMOTE

Quiz-Answer

Which technique creates entirely new data points by interpolating between existing minority class instances?

Answer: D. SMOTE

By Content ITV