Data Quality

Cville Data Science Meetup

January 2024

Rachel House

Validating the Data Supply Chain

What is data quality and why do we care about it?

it correctly represents the real-world construct to which it refers.

it is fit for its intended uses in
operations, planning, and decision making.

Data is generally deemed high quality if:

The Real World Cost of Bad Data

2017, Gartner

$12.9 Million

$3.1 Trillion

2016, IBM

annual cost to the US economy

annual cost to an organization

Unity Technologies

Data Quality Wall of Shame

Public Health England

Samsung Securities

Amsterdam City Council

$110M

$300M

€188M

50k cases

Informed Decisions

Operational Efficiency

Trust

Regulation Compliance

High
Quality Data Drives

Financial Accuracy

Customer Satisfaction

Scalability

Competitive Advantage

Machine Learning

Data Science

Artificial Intelligence

High Quality Data

Data-centric AI

Systematically engineering the data needed to build successful AI systems.

Dimensions of Data Quality

The Big

Accuracy

Completeness

Validity

Consistency

Uniqueness

Timeliness

6

Completeness

All required data values are present.

Missing

Missing

Uniqueness

Distinct values appear only once.

Not unique

Not unique

Timeliness

Data represent the reality from a required point in time.

t

Late data

Consistency

Data characteristics are the same across instances.

Inconsistent

Validity

Data conforms to the format, type, or range of its definition.

Invalid

Invalid

Accuracy

Data values are as close as possible to real-world values.

Inaccurate

Freshness

Distribution

Volume

Schema

Timeliness

Is the data recent?

Validity, Consistency

Completeness, Accuracy, Validity

Has all the data arrived?

Is the data within accepted ranges? Is it complete?

Completeness, Timeliness, Consistency

What is the schema, and has it changed?

Data Quality Fundamentals, Barr Moses, Lior Gavish, Molly Vorwerck, O'Reilly Media, Inc., 2022

Related Data Observability Terms

Lineage

What are the upstream sources and downstream assets impacted by this data?

Who generates the data, and who relies on it for decision making?

Data Quality Fundamentals, Barr Moses, Lior Gavish, Molly Vorwerck, O'Reilly Media, Inc., 2022

Related Data Observability Terms

Data quality is context-dependent and multidimensional.

The Data Supply Chain

Information is manufactured from raw data.

Data is a resource.

material

supplier

manufacturer

distributor

retailer

consumer

decision maker

raw data

Data as a Product

warehouse

data lake

dashboard

Tangible Products

Example Supply Chains

Data Products

data analyst

informational report

Source Data Store

Data Pipeline

Destination Data Store

Transformation

Ingestion

Storage

data in transit

Data Stores

data at rest

Pipeline

User

Source Data

Pipeline

Pipeline

Upstream

Downstream

Target Data

Upstream

Downstream

Data Quality
Across the Organization

Data quality dimensions vary in definition
and importance between stakeholders.

schema

completeness

timeliness

Data Developer

Data Analyst

Executive

validity

Data Scientist

distribution

lineage

freshness

accuracy

validity

consistency

Data quality is a collaborative effort.

Data

Subject Matter Expertise

Business

Subject Matter Expertise

🤓

🤠

common languages
to express
data quality

shared, human-friendly artifacts

data quality democratization

🥸

trust in organization data

Assessing Data Quality

Data Profiling

Data Testing

Anomaly Detection

Data Contracts

Open Core Python Tooling
for Data Quality

ydata-profiling

  • Formerly pandas-profiling

github.com/ydataai/ydata-profiling

profiling

  • Produces human-friendly, shareable reports
  • Pandas and Spark dataframe support
  • "One line data quality profiling and EDA"

great-expectations

github.com/great-expectations/great_expectations

testing & validation

  • Expectations: unit tests for data
    • Library of common, core Expectations
    • Ability to create custom Expectations
  • User-friendly validation results hosted in shared "Data Docs" interface
  • Support for pandas, Spark, Snowflake, Postgres, and other SQL data sources

pandera

github.com/unionai-oss/pandera

testing & validation

  • Variety of supported dataframe formats: pandas, pyspark, dask, modin, fugue
  • Schema definition and validation for data
  • Testing for data transformation functions
  • Data science and ML focus

pydeequ

github.com/awslabs/python-deequ

testing & validation

  • Python API for Deequ (built in Scala, runs on Spark)
  • "Unit tests for data"
  • Built in anomaly detection
  • Measure quality across very large datasets

Live Demo!

In Summary

Data quality has a direct effect on an organization's success.

Data-centric AI is an emerging field focused on model data instead of model algorithms.

Raw data is a resource transformed into information
via the data supply chain.

Data quality is multidimensional and context-dependent.

Achieving high data quality within an organization is a collaborative effort.

1

2

3

4

5

Thank You!

Data Quality: Validating the Data Supply Chain

By Rachel House

Data Quality: Validating the Data Supply Chain

Data fuels the modern world. Organizations increasingly leverage data to build complex ecosystems, infrastructure, and products. Whether your interest is analytics-informed decision making, training the latest model, building a data-intensive application, or something in between, the quality of your data is fundamental to your outcomes. In this introductory talk, we’ll discuss the importance of data quality, common data quality dimensions, data quality across the organization, mechanisms to assess data quality, and open source Python tooling for data quality.

  • 111