Data Preprocessing

Alireza Afzal Aghaei

M.Sc at SBU

Welcome to the real world!

Introduction

  • What's data preprocessing?
  • Garbage In, Garbage Out

Preprocessing Targets

  • Improve accuracy
  • Improve efficiency

An overview: Data Quality

Some factors of Data Quality

  • accuracy
  • completeness
  • consistency
  • timeliness
  • believability
  • interpretability

Data quality depends on the intended use of the data.

An overview

 Major techniques in Data Preprocessing

  • Data cleaning
  • Data integration
  • Data reduction
  • Data transformations

Data cleaning

  • Missing Values
    1. Ignore the tuple
    2. Fill in manually
    3. Fill in with global constant
    4. Fill in with Mean/Median
    5. Fill in with Mean/Median of same class
    6. Fill in with most probable value
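Strategies 3–5 above can be sketched in plain Python on a hypothetical data set of (class, value) records, where `None` marks a missing value; the global mean and the same-class mean fills are shown:

```python
# Toy records: (class_label, value); None marks a missing value (hypothetical data).
records = [("A", 10.0), ("A", None), ("A", 14.0),
           ("B", 2.0),  ("B", 4.0),  ("B", None)]

def fill_with_global_mean(rows):
    """Strategy 4: replace every missing value with the attribute's overall mean."""
    known = [v for _, v in rows if v is not None]
    mean = sum(known) / len(known)
    return [(c, v if v is not None else mean) for c, v in rows]

def fill_with_class_mean(rows):
    """Strategy 5: replace a missing value with the mean of its own class."""
    sums, counts = {}, {}
    for c, v in rows:
        if v is not None:
            sums[c] = sums.get(c, 0.0) + v
            counts[c] = counts.get(c, 0) + 1
    return [(c, v if v is not None else sums[c] / counts[c]) for c, v in rows]

print(fill_with_global_mean(records))  # missing values become 7.5
print(fill_with_class_mean(records))   # class A -> 12.0, class B -> 3.0
```

The class-conditional fill usually distorts the data less, since class A and class B here have very different typical values.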

Data cleaning (cont.)

  • Noisy Values
    1. Binning by Mean/Median/Boundary
    2. Regression
    3. Outlier analysis by clustering
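Smoothing by bin means (technique 1) can be sketched as follows, using the classic sorted-prices example with equal-frequency bins of size 3:

```python
def smooth_by_bin_means(values, bin_size):
    """Sort, partition into equal-frequency bins, replace each value by its bin mean."""
    data = sorted(values)
    out = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))  # every value in the bin gets the bin mean
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted toy data
print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries works the same way, only the replacement value per bin changes.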

Data Integration

  • Entity Identification Problem
  • Redundancy and Correlation Analysis
    • Correlation analysis
      • χ² Correlation Test
      • Correlation Coefficient
      • Covariance
  • Tuple Duplication
  • Data Value Conflict Detection
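For numeric attributes, redundancy can be detected with the (Pearson) correlation coefficient; a from-scratch sketch on hypothetical paired attributes:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient: covariance / (std_x * std_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # b = 2a, so the two attributes carry the same information
c = [5, 1, 4, 2, 3]
print(pearson(a, b))   # close to 1.0 -> one of the two attributes is redundant
print(pearson(a, c))   # weak correlation -> keep both
```

A coefficient near +1 or −1 suggests one attribute can be derived from the other and dropped during integration.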

Data reduction 

  • Dimensionality reduction
  • Numerosity reduction
  • Data compression

Data reduction: Dimensionality reduction

  • Wavelet Transforms
  • Principal Components Analysis
  • Attribute subset selection

Data reduction: Numerosity reduction 

  • Parametric methods
    • Regression
    • Log-linear
  • Nonparametric methods
    • Histogram
    • Clustering
    • Sampling
    • Data cube aggregation

Data reduction: Data compression

The data reduction is lossless if the original data can be reconstructed from the compressed data without any loss of information; otherwise, it is lossy.

Dimensionality reduction: Wavelet Transforms

  • Transforms a data vector X into a numerically different vector X′ of wavelet coefficients
  • The two vectors are of the same length
  • Coefficients of small magnitude can be discarded, keeping only the strongest ones
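A one-level Haar wavelet transform, the simplest wavelet, illustrates both points in plain Python: the coefficient vector has the same length as the input, and the transform is exactly invertible until coefficients are dropped.

```python
def haar_step(x):
    """One level of the Haar transform: pairwise averages, then pairwise details."""
    avgs = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    diffs = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return avgs + diffs  # same length as the input vector

def inverse_haar_step(coeffs):
    """Reconstruct the original vector from averages and details."""
    half = len(coeffs) // 2
    avgs, diffs = coeffs[:half], coeffs[half:]
    out = []
    for a, d in zip(avgs, diffs):
        out.extend([a + d, a - d])
    return out

x = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
coeffs = haar_step(x)
print(coeffs)  # [2.0, 1.0, 4.0, 4.0, 0.0, -1.0, -1.0, 0.0]
assert inverse_haar_step(coeffs) == x  # lossless while no coefficient is dropped
```

Zeroing the small detail coefficients (here the 0.0 entries are already free) and storing only the rest is what makes the transform useful for reduction.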

Dimensionality reduction: PCA

Principal Components Analysis

 

Searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
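A minimal PCA sketch with NumPy on hypothetical data (100 records, 3 attributes, one of them redundant): center the data, take eigenvectors of the covariance matrix, and project onto the top k components.

```python
import numpy as np

def pca(X, k):
    """Project X onto its k strongest orthogonal directions (principal components)."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    components = eigvecs[:, order]          # k orthogonal n-dimensional vectors
    return Xc @ components                  # data expressed in k dimensions

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]  # third attribute is a linear combination (redundant)
Z = pca(X, k=2)
print(Z.shape)  # (100, 2) -- dimensionality reduced from 3 to 2
```

Because the third attribute is an exact combination of the first two, two principal components capture essentially all of the variance here.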

Dimensionality reduction: Attribute Subset Selection

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes.

Numerosity reduction: Parametric Data Reduction

  • linear regression
  • Log-linear models

Numerosity reduction: Nonparametric Data Reduction

  • Histograms
    • Equal-width
    • Equal-frequency
  • Clustering
  • Sampling
    • without replacement
    • with replacement
    • Cluster sample
    • Stratified sample
  • Data Cube Aggregation
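The sampling variants above can be sketched with Python's standard `random` module on a hypothetical data set (here, stratified by value parity as a stand-in for class labels):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable
data = list(range(100))

# Simple random sample without replacement: no value appears twice.
srswor = random.sample(data, 10)

# Simple random sample with replacement: a value may be drawn more than once.
srswr = [random.choice(data) for _ in range(10)]

def stratified_sample(rows, stratum_of, frac):
    """Draw the same fraction from every stratum (e.g., from each class)."""
    strata = {}
    for r in rows:
        strata.setdefault(stratum_of(r), []).append(r)
    out = []
    for group in strata.values():
        out.extend(random.sample(group, max(1, round(len(group) * frac))))
    return out

# Stratify by parity: 50 evens and 50 odds, sampling 10% from each stratum.
sample = stratified_sample(data, lambda v: v % 2, 0.1)
print(sorted(srswor), sorted(sample))
```

Stratified sampling guarantees every stratum is represented, which matters when some classes are rare.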

Data Transformation & Data Discretization

  • Smoothing
  • Attribute construction
  • Aggregation
  • Normalization
  • Discretization
  • Concept hierarchy generation

Data Transformation by Normalization

  • Min-max normalization
  • z-score normalization
  • Normalization by decimal scaling
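All three schemes can be sketched in plain Python on a toy income attribute:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """z-score normalization: subtract the mean, divide by the std deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that maps all values into (-1, 1)."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

income = [12000, 14000, 16000, 73600, 98000]  # toy data
print(min_max(income))          # min -> 0.0, max -> 1.0
print(decimal_scaling(income))  # divide by 10^5: [0.12, 0.14, 0.16, 0.736, 0.98]
```

Min-max is sensitive to outliers (they define the range), while z-score is the usual choice when the true min and max are unknown.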

Data Transformation by Discretization

  • Binning
  • Clustering
  • Decision Tree
  • Correlation Analysis
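Equal-width binning, the simplest discretization, can be sketched as follows (toy ages; k = 3 intervals is an arbitrary choice):

```python
def equal_width_bins(values, k):
    """Map each value to the index of its equal-width interval in [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        labels.append(idx)
    return labels

ages = [13, 15, 16, 19, 20, 22, 25, 35, 40, 45, 55, 70]
print(equal_width_bins(ages, 3))
# -> [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2]   (intervals [13,32), [32,51), [51,70])
```

The bin indices can then be replaced by interval labels such as "young" / "middle-aged" / "senior" to build a concept hierarchy.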

Data Transformation for Nominal Data

 Concept Hierarchy Generation

Let's see some examples

Any Questions?

Thank you
