Rachel House | Great Expectations
10 October 2024
Senior Developer Advocate
Great Expectations
https://greatexpectations.io
Stakeholder data literacy was cited as an ongoing challenge by almost 50% of respondents.
...there remains a considerable opportunity to enhance data understanding among data consumers.
dbt Labs, 2024 State of Analytics Engineering
Analytics Engineering
Data Science
Machine Learning
Data Engineering
Data Analysis
Discipline basics
Subject matter expertise
Foundation for data literacy
The data supply chain
ML in a nutshell
Data professional
Data stakeholder
or consumer
Aspiring data professional
Raw data
Insights
Raw material
Distributor
Retailer
Consumer
Raw data
Data lake
Data warehouse
Dashboard
Data analyst
Informational report
Decision maker
A tangible product supply chain
A data product supply chain
Processing facility
Warehouse
Supplier
Ingest
Transform
Store
Source
data store
Destination
data store
Upstream
Downstream
Pipeline
Phase
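The ingest, transform, and store phases above can be sketched as three small functions chained into one pipeline. This is a minimal illustration only: the CSV content, the column names, and the in-memory `warehouse` dictionary are all made up for the example, and a real pipeline would read from and write to actual data stores.

```python
import csv
import io

# Hypothetical raw data sitting in the source data store.
RAW_CSV = "order_id,amount\n1,19.99\n2,5.00\n3,\n"


def ingest(raw_text):
    """Ingest: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert types."""
    cleaned = []
    for row in rows:
        if row["amount"] == "":
            continue  # upstream gap: no amount recorded
        cleaned.append(
            {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        )
    return cleaned


def store(rows, destination):
    """Store: write rows to the destination (here, a dict keyed by order_id)."""
    for row in rows:
        destination[row["order_id"]] = row
    return destination


# Each phase's output is the next phase's input: upstream flows downstream.
warehouse = store(transform(ingest(RAW_CSV)), {})
print(warehouse)
```

Whatever the transform phase drops or reshapes is all that downstream consumers ever see, which is why decisions made upstream matter to everyone further along the chain.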
Data
Profit
Data scientist
Data analyst
Data stakeholder
Upstream
Downstream
Supervised learning: Learn from past experiences to generate predictions.
Unsupervised learning: Surface patterns in data, without examples.
Reinforcement learning: Attain a complex goal over many steps.
Program with examples, not instructions.
Supervised learning: Map an input to an output, based on example pairs.
Classification: "Label a thing". Predict a discrete category.
Regression: "Assign a number". Predict a continuous value.
Classification examples: → square, → square, → not a square, → square
Regression examples: → 324k, → 599k, → 202k
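To make the classification and regression distinction concrete, here is a small sketch that uses one very simple technique, nearest neighbor, for both tasks. The only difference is the type of output in the example pairs: a label or a number. The input values (width-to-height ratios and square footage) are hypothetical and chosen for illustration, not taken from the slides.

```python
def nearest_neighbor_predict(examples, x):
    """Return the output of the example whose input is closest to x."""
    _, closest_output = min(examples, key=lambda pair: abs(pair[0] - x))
    return closest_output


# Classification: input is a made-up width/height ratio, output is a label.
shape_examples = [
    (1.00, "square"),
    (0.98, "square"),
    (2.50, "not a square"),
    (1.02, "square"),
]

# Regression: input is a made-up square footage, output is a price.
price_examples = [
    (1500, 324_000),
    (3200, 599_000),
    (900, 202_000),
]

print(nearest_neighbor_predict(shape_examples, 1.05))  # a discrete category
print(nearest_neighbor_predict(price_examples, 1400))  # a continuous value
```

In both cases no rule for "squareness" or pricing was written by hand; the prediction comes entirely from the example pairs.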
Test set: X1 → Y1, X2 → Y2, X3 → Y3
Train set: X4 → Y4, X5 → Y5, X6 → Y6, X7 → Y7, X8 → Y8, X9 → Y9, X0 → Y0
Labeled data: X1 → Y1, X2 → Y2, X3 → Y3, X4 → Y4, X0 → Y0, X8 → Y8, X9 → Y9, X5 → Y5, X7 → Y7, X6 → Y6
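Splitting labeled data into a train set and a test set can be sketched in a few lines. The 70/30 ratio matches the ten pairs shown above (seven for training, three for testing), but the shuffling, the seed, and the function name `train_test_split` are assumptions made for this example.

```python
import random

# Ten labeled pairs, mirroring X0 → Y0 through X9 → Y9.
labeled_data = [(f"X{i}", f"Y{i}") for i in range(10)]


def train_test_split(pairs, test_fraction=0.3, seed=0):
    """Shuffle a copy of the pairs, then hold out a fraction as the test set."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(labeled_data)
print(len(train_set), "training pairs,", len(test_set), "test pairs")
```

The model only ever learns from the train set; the test set is kept aside so that its labels can be used later to check how the model handles data it has not seen.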
cat
Hey, that cat ate my sandwich.
noun
verb
noun
"Best cat video ever."
Labeled data examples
Training data input
Parameters
Model predictions
Machine learning model
Weights & biases
Learning algorithm
Evaluate predictions against training data labels
Tweak and iterate
Test set input
Machine learning model
Model predictions
Evaluate predictions against test set labels
Model performance
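The training and evaluation loop above can be sketched with the smallest possible model: one weight and one bias. The data below is made up (it follows y = 2x + 1), and gradient descent is used as the learning algorithm purely as an illustrative assumption; the slides do not name a specific algorithm.

```python
# Hypothetical labeled data following y = 2x + 1.
train_set = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
test_set = [(5.0, 11.0), (6.0, 13.0)]


def predict(w, b, x):
    """Machine learning model: parameters are a weight w and a bias b."""
    return w * x + b


def mean_squared_error(w, b, data):
    """Evaluate predictions against the labels in `data`."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)


def train(data, learning_rate=0.05, steps=2000):
    """Learning algorithm: tweak w and b a little on every iteration."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(train_set)
test_error = mean_squared_error(w, b, test_set)
print(f"weight={w:.3f} bias={b:.3f} test error={test_error:.6f}")
```

Model performance is measured on the test set, not the train set, because a model that merely memorized its training examples would look perfect on them and still fail on new inputs.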
Data quality is context-dependent and multidimensional.
Garbage is garbage.
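One way to picture "context-dependent and multidimensional" is to measure two dimensions of the same dataset and then judge them against different thresholds for different consumers. Everything here is hypothetical: the records, the completeness and validity functions, and the thresholds are plain Python written for illustration, not the API of any data quality tool.

```python
# Made-up customer records with one missing email and one impossible age.
records = [
    {"email": "a@example.com", "age": 34},
    {"email": None, "age": 29},
    {"email": "c@example.com", "age": -1},
    {"email": "d@example.com", "age": 41},
]


def completeness(rows, column):
    """Dimension 1: share of rows where the column is present."""
    return sum(row[column] is not None for row in rows) / len(rows)


def validity(rows, column, is_valid):
    """Dimension 2: share of present values that pass a business rule."""
    present = [row[column] for row in rows if row[column] is not None]
    return sum(is_valid(value) for value in present) / len(present)


email_completeness = completeness(records, "email")
age_validity = validity(records, "age", lambda age: 0 <= age <= 120)

# Same measurement, different contexts (thresholds are assumptions).
good_enough_for_newsletter = email_completeness >= 0.70
good_enough_for_billing = email_completeness >= 0.99

print(email_completeness, age_validity)
print(good_enough_for_newsletter, good_enough_for_billing)
```

The data did not change between the two checks; only the use case did, and that is what decides whether the same data counts as good.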
Business expertise
Data expertise
?
Upstream
Downstream
Data quality
Data is a product delivered via the data supply chain.
Input
Output
?