Boost your data literacy with key concepts

Rachel House | Great Expectations

10 October 2024

Rachel House

Senior Developer Advocate

Great Expectations

https://greatexpectations.io

Data literacy and you

Stakeholder data literacy

was cited as an ongoing challenge

by almost 50% of respondents.

...there remains a considerable opportunity to

enhance data understanding among data consumers.

dbt Labs, 2024 State of Analytics Engineering

Analytics Engineering

Data Science

Machine Learning

Data Engineering

Data Analysis

Discipline basics

Subject matter expertise

Discipline basics

Subject matter expertise

"Data"

?

Foundation for data literacy

The data supply chain

ML in a nutshell

Data professional

Data stakeholder or consumer

Aspiring data professional

DATA.

DATA!

DATA?

The data supply chain

Raw data

Insights

Data as a product

Raw material

Distributor

Retailer

Consumer

Raw data

Data lake

Data warehouse

Dashboard

Data analyst

Informational report

Decision maker

A tangible product supply chain

A data product supply chain

Processing facility

Warehouse

Supplier

Ingest

Transform

Store

Source data store

Destination data store
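
The ingest, transform, and store stages can be sketched as a tiny pipeline that moves records from an upstream source to a downstream destination. This is a minimal illustration, not a production design: the record fields (`order_id`, `amount`) and the function names are hypothetical, and in-memory lists stand in for the source and destination data stores.

```python
# Hypothetical in-memory stand-ins for a source and a destination data store.
source_store = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "5.00"},
    {"order_id": "3", "amount": ""},  # missing amount in the raw data
]
destination_store = []


def ingest(store):
    """Read raw records from the upstream source."""
    return list(store)


def transform(records):
    """Convert types and drop records that cannot be used downstream."""
    cleaned = []
    for record in records:
        if not record["amount"]:
            continue  # skip records with a missing amount
        cleaned.append(
            {"order_id": int(record["order_id"]), "amount": float(record["amount"])}
        )
    return cleaned


def store(records, destination):
    """Write transformed records to the downstream destination."""
    destination.extend(records)


def run_pipeline(source, destination):
    """Run the three stages in order: ingest, transform, store."""
    store(transform(ingest(source)), destination)


run_pipeline(source_store, destination_store)
```

Anything that goes wrong upstream, such as the missing amount here, shapes what downstream consumers receive.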

Upstream

Downstream

Pipeline

Pipeline

Pipeline

DATA!

Phase

Phase

Phase

Data

?

Profit

1

2

3

Data scientist

Data analyst

Data stakeholder

Upstream

Downstream

ML in a nutshell

Program computers to...

Supervised learning: Learn from past experiences to generate predictions.

Unsupervised learning: Surface patterns in data, without examples.

Reinforcement learning: Attain a complex goal over many steps.

Program with examples, not instructions.

Supervised learning

Map an input to an output, based on example pairs.

Classification: "Label a thing". Predict a discrete category.

Regression: "Assign a number". Predict a continuous value.
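
To make the difference between the two tasks concrete, here is a small sketch in plain Python. It assumes one numeric feature per example, and every value is made up for illustration: an aspect ratio for shapes, and a floor area with a price in thousands for houses. Nearest-neighbor lookup and a least-squares line are simply two easy-to-read techniques chosen for the sketch.

```python
def classify(x, examples):
    """Classification: return the label of the nearest labeled example."""
    _, nearest_label = min(examples, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def fit_line(examples):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    variance = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example pairs: (aspect ratio, label).
shape_examples = [
    (1.0, "square"),
    (1.02, "square"),
    (2.0, "not a square"),
    (0.5, "not a square"),
]
predicted_label = classify(1.05, shape_examples)  # a discrete category

# Made-up example pairs: (floor area, price in thousands).
price_examples = [(50, 150.0), (100, 300.0), (150, 450.0)]
slope, intercept = fit_line(price_examples)
predicted_price = slope * 120 + intercept  # a continuous value
```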

→   square

→   square

→   not a square

→   square

→   324k

→   599k

→   202k

Labeled data

X0 → Y0, X1 → Y1, X2 → Y2, X3 → Y3, X4 → Y4, X5 → Y5, X6 → Y6, X7 → Y7, X8 → Y8, X9 → Y9

Train set

X0 → Y0, X4 → Y4, X5 → Y5, X6 → Y6, X7 → Y7, X8 → Y8, X9 → Y9

Test set

X1 → Y1, X2 → Y2, X3 → Y3
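
Splitting labeled data into a train set and a test set can be sketched as a shuffle followed by a cut. The helper name `train_test_split`, the 30% test fraction, and the fixed seed are assumptions made for this illustration.

```python
import random


def train_test_split(labeled_data, test_fraction=0.3, seed=0):
    """Shuffle the labeled pairs, then hold out a fraction as the test set."""
    shuffled = list(labeled_data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Ten labeled pairs, X0 → Y0 through X9 → Y9.
labeled_data = [(f"X{i}", f"Y{i}") for i in range(10)]
train_set, test_set = train_test_split(labeled_data)
```

Each pair ends up in exactly one of the two sets, so the test set stays unseen during training.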

cat

Hey, that cat ate my sandwich.

noun

verb

noun

"Best cat video ever."

Labeled data examples

Training data input

Parameters

Model predictions

Machine learning model

Weights & biases

Learning algorithm

Evaluate predictions against training data labels

Tweak and iterate
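
The training loop can be sketched for the simplest possible model, a line with one weight and one bias. Gradient descent on mean squared error is an assumed choice of learning algorithm for this sketch, and the training examples are made up so that the right answer (weight 2, bias 1) is known in advance.

```python
def train(examples, learning_rate=0.05, epochs=2000):
    """Fit y = weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0  # the model's parameters
    n = len(examples)
    for _ in range(epochs):
        # Model predictions for the training inputs.
        predictions = [weight * x + bias for x, _ in examples]
        # Evaluate predictions against the training data labels.
        errors = [pred - y for pred, (_, y) in zip(predictions, examples)]
        # Tweak the parameters to reduce the error, then iterate.
        weight_gradient = 2 * sum(e * x for e, (x, _) in zip(errors, examples)) / n
        bias_gradient = 2 * sum(errors) / n
        weight -= learning_rate * weight_gradient
        bias -= learning_rate * bias_gradient
    return weight, bias


# Made-up labeled examples that follow y = 2x + 1.
training_examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train(training_examples)
```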

Test set input

Machine learning model

Model predictions

Evaluate predictions against test set labels

Model performance
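
Evaluating against test set labels can be sketched as a single accuracy calculation. The classifier below is a hypothetical stand-in for a trained model, and the held-out examples are made up; accuracy is only one of several possible performance measures.

```python
def accuracy(model, test_set):
    """Fraction of test inputs whose prediction matches the test label."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)


def is_square_model(aspect_ratio):
    """Hypothetical trained classifier: an aspect ratio near 1 means square."""
    return "square" if abs(aspect_ratio - 1.0) < 0.1 else "not a square"


# Made-up held-out pairs: (aspect ratio, label).
held_out_test_set = [
    (1.0, "square"),
    (1.05, "square"),
    (2.0, "not a square"),
    (1.08, "not a square"),  # the model gets this one wrong
]
model_performance = accuracy(is_square_model, held_out_test_set)
```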

Data quality is context-dependent and multidimensional.
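
One way to picture "multidimensional" and "context-dependent" is a check that counts failures per quality dimension, with a threshold that depends on who uses the data. The dimension names, record fields, and limits below are assumptions chosen for illustration, not a standard or a specific tool's API.

```python
def check_quality(records, max_amount):
    """Return the failure count for each quality dimension checked."""
    seen_ids = set()
    failures = {"completeness": 0, "uniqueness": 0, "validity": 0}
    for record in records:
        if record.get("amount") is None:
            failures["completeness"] += 1  # a required value is missing
        elif not 0 <= record["amount"] <= max_amount:
            failures["validity"] += 1  # value falls outside the accepted range
        if record["order_id"] in seen_ids:
            failures["uniqueness"] += 1  # identifier appears more than once
        seen_ids.add(record["order_id"])
    return failures


# Made-up records with one missing amount and one repeated identifier.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5000.0},
]

# Same data, two contexts: a 5000.0 order is invalid for one and fine for the other.
retail_report = check_quality(orders, max_amount=1000.0)
wholesale_report = check_quality(orders, max_amount=100000.0)
```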

Garbage is garbage.

Business expertise

Data expertise

f(x)

?

Wrap up

Data supply chain takeaways

Upstream

Downstream

Data quality

Data is a product delivered via the data supply chain.

ML in a nutshell takeaways

Input

Output

?

Data literacy takeaways

DATA

Thank you