DM6 LAB: Clean and Prepare Dataset for Analysis

Clean and Prepare Dataset for Analysis

Business Scenario

Welcome!

Today is your sixth day as a Junior Data Analyst at a retail analytics company.

In real-world business environments, datasets often contain missing values, inconsistent data types, and duplicate records. Before performing analysis, data analysts must clean and prepare the data to ensure accurate and reliable results.

Data cleaning improves data quality, reduces errors, and helps organizations make better business decisions based on trustworthy information.

Pre-Lab Preparation

Click here to download previous lab file: DM LAB 5

git pull origin branchName

Git Pull

Click to download Dataset : Retail_Dataset

Topic: Data Manipulation with Pandas

1) Polishing Data: From Messy To Meaningful

Task 1: Fix data types

Retail datasets are collected from multiple sources such as billing systems, inventory software, supplier databases, and online stores. Sometimes numerical fields like Revenue, Unit Price, or Quantity Available may be stored as text instead of numbers.

When data types are incorrect:

Revenue calculations may fail.
Sales reports become inaccurate.
Statistical analysis cannot be performed.
Business dashboards may display incorrect results.

To ensure accurate analysis, data analysts must identify and correct data types before performing calculations.

What are Data Types?

Data types define the kind of values stored in each column. Common data types include integers, floating-point numbers, strings, and dates.

Open Google Colab

Import Required Libraries

import pandas as pd
import numpy as np

Upload the Retail Dataset

Load Dataset Using Pandas

df = pd.read_csv("/content/Retail_Dataset_Modified.csv")

print("Dataset Loaded Successfully")

Display First Five Records

df.head()

Display Data Types

print("Data Types:")
print(df.dtypes)

Check Dataset Information

df.info()

After examining the dataset using df.dtypes, the following columns were found to have incorrect data types:

Column Name	Current Data Type	Expected Data Type
Quantity_Available	object	float64
Units_Sold	object	float64

Convert Quantity_Available to Numeric

Column Name	Current Data Type	Expected Data Type
Revenue	object	float64
Shipping_Cost	object	float64

df['Quantity_Available'] = pd.to_numeric(
    df['Quantity_Available'],
    errors='coerce'
)

Here,

pd.to_numeric()

Converts the selected column into a numeric data type.

errors='coerce'

If Pandas finds any invalid value (such as text, symbols, or blank spaces), it converts that value into NaN (Not a Number) instead of showing an error.

print(df['Quantity_Available'].dtypes)

Convert Units_Sold to Numeric

df['Units_Sold'] = pd.to_numeric(
    df['Units_Sold'],
    errors='coerce'
)

print(df['Units_Sold'].dtypes)

Convert Revenue to Numeric

df['Revenue'] = pd.to_numeric(
    df['Revenue'],
    errors='coerce'
)

print(df['Revenue'].dtypes)

Convert Shipping_Cost to Numeric

df['Shipping_Cost'] = pd.to_numeric(
    df['Shipping_Cost'],
    errors='coerce'
)

print(df['Shipping_Cost'].dtypes)

Verify Updated Data Types

print(df[['Quantity_Available',
          'Units_Sold',
          'Revenue',
          'Shipping_Cost']].dtypes)

Task 2: Handle missing values

In retail datasets, missing values can occur due to data entry errors, system failures, or unavailable information. Pandas provides several methods to identify and handle missing values.

What are Missing Values?

Missing values occur when information is unavailable for certain records. These values must be handled before analysis.

Check Missing Values

print("Missing Values:")
print(df.isnull().sum())

Display Columns Having Missing Values

missing_cols = df.isnull().sum()
print(
    missing_cols[missing_cols > 0]
)

To fill missing values, we used:

Mean: Used for numerical data; replaces missing values with the average.
Median: Used for numerical data with outliers; replaces missing values with the middle value.
Mode: Used for categorical/text data; replaces missing values with the most frequent value.

Fill Missing Values in City Using Mode

df['City'].fillna(
    df['City'].mode()[0],
    inplace=True
)

Here:

fillna(): Used to replace missing values.
mode()[0]: Returns the most frequent value in the column.
inplace=True: Applies the changes directly to the original DataFrame.

Fill Missing Values in Product Name Using Mode

df['Product_Name'].fillna(
    df['Product_Name'].mode()[0],
    inplace=True
)

df['Units_Sold'].fillna(
    df['Units_Sold'].mean(), 
    inplace=True
    )

Fill Missing Values in Quantity Available Using Mean

df['Quantity_Available'].fillna(
    df['Quantity_Available'].mean(), 
    inplace=True
    )

Fill Missing Values in Units Sold Using Mean

Fill Missing Values in Revenue Using Mean

df['Revenue'].fillna(
    df['Revenue'].mean(), 
    inplace=True
    )

df['Supplier'].fillna(
    df['Supplier'].mode()[0],
    inplace=True
    )

Fill Missing Values in Supplier Using Mode

df['Customer_Satisfaction'].fillna(
    df['Customer_Satisfaction'].mean(),
    inplace=True
    )

Fill Missing Values in Customer Satisfaction Using Mean

Fill Missing Values in Delivery Time Using Mean

df['Delivery_Time'].fillna(
    df['Delivery_Time'].mean(), 
    inplace=True
    )

Fill Missing Values in Shipping Cost Using Mean

df['Shipping_Cost'].fillna(
    df['Shipping_Cost'].mean(), 
    inplace=True
    )

Verify Missing Values After Cleaning

print(df.isnull().sum())

Display Dataset Information

df.info()

Task 3: Remove Duplicates

Duplicate records can lead to incorrect calculations and misleading business insights.

What are Duplicate Records?

Duplicate records are repeated rows that can lead to inaccurate analysis and incorrect business insights.

Count Duplicate Records

Display Duplicate Count

duplicates = df.duplicated().sum()

print(
    "Duplicate Records:",
    duplicates
)

print(df.duplicated().value_counts())

Remove Duplicate Records

drop_duplicates() is a Pandas function used to remove duplicate rows from a DataFrame.

df = df.drop_duplicates()

Verify Dataset Shape After Removing Duplicates

print(df.shape)

Confirm No Duplicate Records Remain

print(
    "Remaining Duplicates:",
    df.duplicated().sum()
)

Display First Five Records

df.head()

Great job!

You have successfully completed your lab on Clean and Prepare Dataset for Analysis.

In this lab, you have: Fixed incorrect data types, Identified and handled missing values, Filled numerical and categorical missing data, Detected duplicate records, Removed duplicate entries, Prepared a clean dataset for analysis

You are now ready to move to the next stage of Junior Data Analyst.

Checkpoint

Git Push

git push origin branchName

Next-Lab Preparation

Topic: Data Manipulation with Pandas

1) Data Preparation Essentials