Cville Data Science Meetup
January 2024
Rachel House
Data is generally deemed high quality if:
it correctly represents the real-world construct to which it refers.
it is fit for its intended uses in operations, planning, and decision making.
The Real World Cost of Bad Data
2016, IBM: annual cost to the US economy
2017, Gartner: annual cost to an organization
Unity Technologies
Public Health England
Samsung Securities
Amsterdam City Council
Informed Decisions
Operational Efficiency
Trust
Regulatory Compliance
Financial Accuracy
Customer Satisfaction
Scalability
Competitive Advantage
Machine Learning
Data Science
Artificial Intelligence
Data-centric AI
Systematically engineering the data needed to build successful AI systems.
The Big Six
Accuracy: Data values are as close as possible to real-world values. (Violation: Inaccurate)
Completeness: All required data values are present. (Violation: Missing)
Validity: Data conforms to the format, type, or range of its definition. (Violation: Invalid)
Consistency: Data characteristics are the same across instances. (Violation: Inconsistent)
Uniqueness: Distinct values appear only once. (Violation: Not unique)
Timeliness: Data represents reality as of a required point in time. (Violation: Late data)
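The dimension definitions above can be turned into simple record-level checks. The sketch below is one possible illustration, not a method from the talk: the records, field names, and thresholds are all invented. It covers completeness, uniqueness, validity, and timeliness only; accuracy and consistency usually need a trusted reference or a second data source to compare against, so they are left out.

```python
from datetime import date

# Hypothetical customer records; every field name and value is invented for illustration.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34, "updated": date(2024, 1, 10)},
    {"id": 2, "email": None, "age": 29, "updated": date(2024, 1, 11)},
    {"id": 2, "email": "bo@example.com", "age": -5, "updated": date(2023, 6, 1)},
]

def missing(rows, field):
    """Completeness: rows where a required value is absent."""
    return [r for r in rows if r.get(field) is None]

def duplicates(rows, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[key]
        (dupes if value in seen else seen).add(value)
    return sorted(dupes)

def invalid(rows, field, low, high):
    """Validity: non-missing values outside the defined range."""
    return [r for r in rows if r[field] is not None and not low <= r[field] <= high]

def late(rows, field, cutoff):
    """Timeliness: rows last updated before the required point in time."""
    return [r for r in rows if r[field] < cutoff]

report = {
    "missing_email": len(missing(records, "email")),
    "duplicate_ids": duplicates(records, "id"),
    "invalid_age": len(invalid(records, "age", 0, 120)),
    "late_rows": len(late(records, "updated", date(2024, 1, 1))),
}
print(report)
```

Each check returns the offending rows or keys rather than a bare pass/fail, which makes the result easier to share with the people who own the data.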
Freshness
Is the data recent?
Related dimensions: Timeliness
Distribution
Is the data within accepted ranges? Is it complete?
Related dimensions: Completeness, Accuracy, Validity
Volume
Has all the data arrived?
Related dimensions: Completeness, Timeliness, Consistency
Schema
What is the schema, and has it changed?
Related dimensions: Validity, Consistency
Lineage
What are the upstream sources and downstream assets impacted by this data?
Who generates the data, and who relies on it for decision making?
Data Quality Fundamentals, Barr Moses, Lior Gavish, Molly Vorwerck, O'Reilly Media, Inc., 2022
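The freshness, volume, schema, and distribution questions can each be phrased as an automated check on a batch of rows. The following is a minimal sketch under stated assumptions: the column names, the expected schema, and every threshold are hypothetical, and it does not come from the cited book or from any monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical expected schema for a table load; column names are assumptions.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "loaded_at": datetime}

def check_batch(rows, now, max_age=timedelta(hours=24), min_rows=2,
                amount_range=(0.0, 10_000.0)):
    """Return a dict of check name -> True when the check passes."""
    # Volume: has all the data arrived?
    results = {"volume": len(rows) >= min_rows}
    # Schema: do the columns and types match what is expected?
    results["schema"] = all(
        set(r) == set(EXPECTED_SCHEMA)
        and all(isinstance(r[c], t) for c, t in EXPECTED_SCHEMA.items())
        for r in rows
    )
    if not rows or not results["schema"]:
        return results  # the remaining checks need well-formed rows
    # Freshness: is the data recent?
    newest = max(r["loaded_at"] for r in rows)
    results["freshness"] = now - newest <= max_age
    # Distribution: is the data within accepted ranges?
    low, high = amount_range
    results["distribution"] = all(low <= r["amount"] <= high for r in rows)
    return results

now = datetime(2024, 1, 15, 12, 0)
batch = [
    {"order_id": 1, "amount": 19.99, "loaded_at": datetime(2024, 1, 15, 9, 0)},
    {"order_id": 2, "amount": 25000.0, "loaded_at": datetime(2024, 1, 15, 9, 5)},
]
result = check_batch(batch, now)
print(result)
```

Here the batch is recent, large enough, and well-typed, but one amount falls outside the assumed range, so only the distribution check fails.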
Data quality is context-dependent and multidimensional.
Information is manufactured from raw data.
Data is a resource.
Example Supply Chains
Tangible Products: material, supplier, manufacturer, distributor, retailer, consumer
Data Products: raw data, warehouse, data lake, dashboard, data analyst, informational report, decision maker
Source Data Store, Data Pipeline, Destination Data Store
Data Pipeline: Ingestion, Transformation, Storage (data in transit)
Data Stores: data at rest
[Diagram: Source Data, Pipeline, Target Data, User, with Upstream and Downstream marked at two points along the flow]
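Upstream and downstream relationships can be represented as a directed graph and walked to answer the lineage questions. This is a small sketch only: the asset names and edges are hypothetical, and a real lineage graph would typically come from a catalog or orchestration tool rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "source_db": ["ingest_pipeline"],
    "ingest_pipeline": ["data_lake"],
    "data_lake": ["transform_pipeline"],
    "transform_pipeline": ["warehouse"],
    "warehouse": ["dashboard", "report"],
}

def impacted(asset, graph):
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

def upstream_of(asset, graph):
    """Invert the edges, then walk the same way to find upstream sources."""
    inverted = {}
    for parent, children in graph.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return impacted(asset, inverted)

print(impacted("data_lake", DOWNSTREAM))
print(upstream_of("data_lake", DOWNSTREAM))
```

If a quality issue is found in `data_lake`, the first call lists what may be affected and the second lists where the issue may have originated.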
Data quality dimensions vary in definition and importance among stakeholders.
Stakeholders: Data Developer, Data Analyst, Data Scientist, Executive
Dimensions: schema, completeness, timeliness, validity, distribution, lineage, freshness, accuracy, consistency
Data Subject Matter Expertise
Business Subject Matter Expertise
🤓
🤠
common languages to express data quality
shared, human-friendly artifacts
data quality democratization
🥸
trust in organization data
github.com/ydataai/ydata-profiling: profiling
github.com/great-expectations/great_expectations: testing & validation
github.com/unionai-oss/pandera: testing & validation
github.com/awslabs/python-deequ: testing & validation
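To give a rough sense of what "profiling" means here, the sketch below computes a few per-column summary statistics using only the standard library. It is an illustration of the idea and does not use or mirror the API of any tool listed above; the `profile` function and the sample rows are hypothetical, and it assumes the values within a column are hashable and mutually comparable.

```python
def profile(rows):
    """Summarize each column: row count, missing count, distinct count, min, max."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        # Treat both absent keys and None as missing values.
        present = [row[col] for row in rows if row.get(col) is not None]
        summary[col] = {
            "count": len(rows),
            "missing": len(rows) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return summary

# Invented sample data for illustration.
rows = [
    {"city": "Charlottesville", "temp_f": 41},
    {"city": "Richmond", "temp_f": None},
    {"city": "Charlottesville", "temp_f": 38},
]
stats = profile(rows)
for column, values in stats.items():
    print(column, values)
```

A summary like this is often the starting point for writing explicit tests, since it shows which columns have gaps or unexpected ranges.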
Data quality has a direct effect on an organization's success.
Data-centric AI is an emerging field that focuses on improving the data used to build models rather than the model algorithms.
Raw data is a resource transformed into information via the data supply chain.
Data quality is multidimensional and context-dependent.
Achieving high data quality within an organization is a collaborative effort.