Data quality, MLOps, and Great Expectations

Rachel House
Senior Developer Advocate
Great Expectations

Agenda

  • Data quality + MLOps (25 min)
  • Great Expectations (8 min)
  • Live demonstration (6 min)
  • Q&A (21 min)

Data quality 101

High quality data is fit for its intended use.

Data quality is
  • Product quality
  • Outcome quality
  • Insight quality
  • Prediction quality
  • Decision quality

Data as a product

The data supply chain

Example tangible product supply chain (upstream → downstream):
Raw material → Supplier → Processing facility → Warehouse → Distributor → Retailer → Consumer

Example data product supply chain (upstream → downstream):
Raw data → Data lake → Data warehouse → Dashboard → Data analyst → Informational report → Decision maker

Data quality: the content and quality of the data product
(accuracy, completeness, consistency, validity, timeliness, uniqueness)

Data observability: the health of the data supply chain
(freshness, distribution, schema, lineage, volume)

Data quality and MLOps

MLOps workflow

MLOps automates and streamlines the lifecycle of machine learning models.

Prepare data → Develop model → Deploy model → Monitor and adjust

Refresher: ML models and data

Before ML: the function f(x) that maps inputs x to outputs y is programmed using explicit instructions.

Using ML: f(x) is trained on historical data and must then generalize and perform on new, unseen data. "The data is the algorithm."
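To make the contrast concrete, here is a minimal sketch (not from the talk; the rule and the historical data are illustrative):

    import numpy as np

    # Before ML: f(x) is programmed using explicit instructions.
    def f_explicit(x):
        return 2 * x + 1  # rule chosen by a programmer

    # Using ML: f(x) is learned from historical (x, y) pairs ("the data is the algorithm").
    x_hist = np.array([0, 1, 2, 3, 4])
    y_hist = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # illustrative historical observations
    slope, intercept = np.polyfit(x_hist, y_hist, deg=1)  # train on historical data

    def f_learned(x):
        return slope * x + intercept

    # Both are then applied to new, unseen input.
    print(f_explicit(10), f_learned(10))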

Prepare data: source data and knowledge about the data feed data collection, EDA, and data cleaning, producing cleaned data that is fit for modeling, along with a definition of data quality, decisions and assumptions about data quality, and the steps used to prepare and clean the data.

Develop model: the prepared dataset is split into train/test/holdout sets with a consistent distribution; feature engineering, experimentation, and evaluation produce candidate models and a selected model, along with the steps used to prepare the train/test/holdout sets.

Deploy model: model handoff passes the trained development model to deployment, sometimes with potential reimplementation, yielding a deployed model in production; implicit knowledge about data dependencies and required quality, the data quality definition and assumptions from development, and differences between training data and live data all surface at deployment time.

Monitor and adjust: the newly deployed model is watched for data drift and changes in model output through input drift detection, ground truth evaluation, and evaluation and comparison; retraining produces a retrained model that replaces the deployed model, resulting in a maintained model.
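As one illustration of input drift detection (not from the talk), a live feature sample can be compared against the training sample with a two-sample Kolmogorov-Smirnov test; the feature, data, and threshold below are illustrative:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train_ages = rng.normal(45, 10, size=1_000)  # feature values seen at training time
    live_ages = rng.normal(52, 10, size=1_000)   # feature values arriving in production

    statistic, p_value = ks_2samp(train_ages, live_ages)
    if p_value < 0.01:  # alert threshold is a deployment-specific choice
        print(f"Possible input drift detected (KS statistic={statistic:.3f})")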

Data quality themes for MLOps

  • Capture data quality standards applied during data preparation.
  • Codify data quality standards in a way that can be shared and applied by all MLOps personas.
  • Document assumptions about data quality.

Models are not deployed in a vacuum: the definition of data quality from the preparation phase affects, and is used in, all other MLOps phases.

Data quality and collaboration

MLOps personas: Data Scientist, Data Engineer, ML Engineer, Subject Matter Expert, Software Engineer, DevOps Engineer, Model Governance Manager, Business Analyst, Executive.

Data quality is a collaborative effort.

Personas span technical and business roles, and subject matter expertise comes from both sides.

Defining data quality

Driven by problem domains and desired outcomes

Informed by technical and business personas

Multidimensional

Each persona emphasizes different dimensions. Personas shown: Data Scientist, Data Engineer, Subject Matter Expert, Executive, ML Engineer. Dimensions shown: distribution, validity, timeliness, schema, completeness, accuracy, relevancy, bias, consistency.

Sharing and applying definitions of data quality

Share:
  • Common language to express data quality
  • Single source of truth for creating and maintaining data quality definitions

Apply:
  • Tooling to apply shared data quality definitions through data testing and validation, and data contracts

Data quality and GX

Great Expectations (GX)

  • Expectation-based approach to data validation
  • Comprehensive solution for data quality monitoring and management
  • Collaborative tooling that enables a single source of truth for data quality definition and application
  • Flexible integration with data stacks

The GX Expectation

"I expect that all patient records contain the patient's age."

import great_expectations.expectations as gxe

gxe.ExpectColumnValuesToNotBeNull(column="age")

A verifiable assertion about data, expressed in plain-language terms.
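The same pattern extends to other plain-language statements. For example, "I expect every patient's age to fall between 0 and 120" could be expressed with another built-in Expectation, reusing the gxe import above (the bounds here are illustrative, not from the talk):

    # Illustrative follow-on Expectation: values in "age" must fall within a plausible range.
    gxe.ExpectColumnValuesToBeBetween(column="age", min_value=0, max_value=120)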

Great Expectations data quality platform

GX Cloud: fully hosted SaaS platform with UI workflows
+
GX Core: open source Python framework with Python workflows

GX platform workflow flexibility

Example modes of working together with GX Cloud and GX Core.

GX can enable creation and sharing of data quality definitions and data quality validation across the MLOps workflow.

Prepare data:
  • Collaboratively define a shared data quality definition for modeling
  • Validate prepared data

Develop model:
  • Validate train, test, and holdout sets for quality consistency

Deploy model:
  • Enforce shared data quality standards when the model is changed for deployment
  • Compare and validate initial live data against training data quality (sketched below)

Monitor and adjust:
  • Monitor new input data for data quality standards
  • Alert on issues
  • Evolve the data quality definition as the model and trends change
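As a hedged illustration of comparing initial live data against training data quality, the sketch below applies one shared Expectation Suite to both a training batch and an initial live batch with GX Core; the column name, bounds, and dataframes are illustrative, not from the demo:

    import pandas as pd
    import great_expectations as gx
    import great_expectations.expectations as gxe

    context = gx.get_context()

    # Shared data quality definition for model inputs.
    suite = context.suites.add(
        gx.ExpectationSuite(
            name="model_input_quality",
            expectations=[
                gxe.ExpectColumnValuesToNotBeNull(column="age"),
                gxe.ExpectColumnValuesToBeBetween(column="age", min_value=0, max_value=120),
            ],
        )
    )

    # One Data Asset; each dataframe validated below is a separate Batch.
    asset = context.data_sources.add_pandas(name="model_inputs").add_dataframe_asset(name="inputs")
    batch_definition = asset.add_batch_definition_whole_dataframe("current")

    train_df = pd.DataFrame({"age": [34, 51, 47]})
    live_df = pd.DataFrame({"age": [29, None, 150]})  # missing and out-of-range values

    for label, df in [("training", train_df), ("initial live", live_df)]:
        batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
        result = batch.validate(suite)
        print(f"{label} data meets shared standards: {result.success}")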

GX demonstration

GX terminology 101

Define your data
  • Data Source: the GX representation of a data store.
  • Data Asset: a collection of records in a Data Source.
  • Batch Definition: defines how Data Asset records are split into Batches.
  • Batch: a collection of records in a Data Asset.

Define your quality
  • Expectation: a verifiable assertion about data.
  • Expectation Suite: a collection of Expectations.

Validate your data
  • Validation Definition: pairs a Batch Definition to an Expectation Suite for validation.
  • Checkpoint: executes data validation using the Validation Definition.
  • Validation Result: returns the results of data validation and related metadata.
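These terms map onto a GX Core workflow roughly as follows. This is a minimal sketch against an in-memory pandas DataFrame; the names ("demo_pandas", "patients", and so on) and the sample data are illustrative, not from the demo:

    import pandas as pd
    import great_expectations as gx
    import great_expectations.expectations as gxe

    context = gx.get_context()

    # Define your data: Data Source -> Data Asset -> Batch Definition.
    data_source = context.data_sources.add_pandas(name="demo_pandas")
    data_asset = data_source.add_dataframe_asset(name="patients")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("all_records")

    # Define your quality: Expectations collected into an Expectation Suite.
    suite = context.suites.add(gx.ExpectationSuite(name="patient_quality"))
    suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="age"))

    # Validate your data: a Validation Definition run by a Checkpoint.
    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(name="patient_validation", data=batch_definition, suite=suite)
    )
    checkpoint = context.checkpoints.add(
        gx.Checkpoint(name="patient_checkpoint", validation_definitions=[validation_definition])
    )

    # The Batch is the concrete set of records validated; the run returns a Validation Result.
    df = pd.DataFrame({"age": [34, 51, None]})
    result = checkpoint.run(batch_parameters={"dataframe": df})
    print(result.success)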

Time to demonstrate!

Q & A

Resources

GX resources
  • Great Expectations website
  • GX docs
  • Try GX Cloud
  • GX Core GitHub repo

Demo and other code resources
  • Code used for the demonstration (maintained and growing)
  • Tutorial series on integrating GX into your data pipeline (not currently maintained)