Take care of your data
Nastasia Saby
@saby_nastasia
Unit tests for data
Data versioning
Monitoring
Data are very very very important in data systems
Data are the food
Data
Function
Program
Results
Data
Function
Results
Program
Data > Code
Data versioning
Manually
- year = 2019
- month = 11
- month = 12
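Manual versioning by partition can be sketched as follows. This is only an illustration: the `partition_path` helper and the `datasets/sales` root are hypothetical names, and the `year=.../month=...` layout is assumed from the bullets above.

```python
from pathlib import PurePosixPath


def partition_path(root: str, year: int, month: int) -> str:
    """Build a partition path such as root/year=2019/month=11 (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return str(PurePosixPath(root) / f"year={year}" / f"month={month:02d}")


# Each monthly snapshot gets its own directory and is never overwritten,
# so an older version of the data can always be read again.
snapshots = [partition_path("datasets/sales", 2019, m) for m in (11, 12)]
print(snapshots)
```

Reading a past version then amounts to pointing the job at the matching directory.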
Automatically
Data will change
Reproducibility
Unit tests for your data
What is a unit test for data?
Tests for code
Example 1
Titanic dataset
Check
DataKitchen @saby_nastasia
Assert
It means that you must know your data
Examples of tests you can do
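A minimal sketch of what such checks might look like, assuming Titanic-style columns named `Survived`, `Pclass` and `Age`. The `check_rows` function and the sample rows are made up for illustration; they are not the checks shown in the talk.

```python
def check_rows(rows):
    """Return a list of violations; an empty list means the data passed."""
    errors = []
    if not rows:
        errors.append("dataset is empty")
    for i, row in enumerate(rows):
        if row.get("Survived") not in (0, 1):
            errors.append(f"row {i}: Survived must be 0 or 1")
        if row.get("Pclass") not in (1, 2, 3):
            errors.append(f"row {i}: Pclass must be 1, 2 or 3")
        age = row.get("Age")
        # A missing age is tolerated here; an impossible age is not.
        if age is not None and not 0 <= age <= 120:
            errors.append(f"row {i}: Age out of range: {age}")
    return errors


# Illustrative rows, not taken from the real dataset.
good = [{"Survived": 1, "Pclass": 1, "Age": 38.0},
        {"Survived": 0, "Pclass": 3, "Age": None}]
bad = [{"Survived": 2, "Pclass": 3, "Age": -4.0}]

assert check_rows(good) == []
print(check_rows(bad))
```

Writing the rules down is only possible if you know which values are legitimate, which is the point about knowing your data.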
And then what to do?
It depends.
Data monitoring
Classic Monitoring
Dashboards
Alerts
Once upon a time, a small virus was born in Wuhan
And data analysis and machine learning crashed
Everything changes
Different forms of data drift
- gradual
- abrupt
- seasonal
How to prevent data drift?
- Unit tests
- Distance measures
- Statistical tests
- Distance measures
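One possible distance is the total variation distance between the category frequencies of a reference sample and a current sample. The talk does not say which distance it uses, so this choice is an assumption, and the samples below are invented.

```python
from collections import Counter


def total_variation(reference, current):
    """Total variation distance between two categorical samples (0 = identical, 1 = disjoint)."""
    n_ref, n_cur = len(reference), len(current)
    if n_ref == 0 or n_cur == 0:
        raise ValueError("both samples must be non-empty")
    p, q = Counter(reference), Counter(current)
    categories = set(p) | set(q)
    # Half the sum of absolute differences between the two frequency tables.
    return 0.5 * sum(abs(p[c] / n_ref - q[c] / n_cur) for c in categories)


# Made-up samples: the share of category "C" grows from 20% to 50%.
reference = ["S"] * 70 + ["C"] * 20 + ["Q"] * 10
current = ["S"] * 40 + ["C"] * 50 + ["Q"] * 10

distance = total_variation(reference, current)
print(round(distance, 3))
```

The alert threshold on such a distance is a judgment call that depends on the data and on the cost of a false alarm.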
- Statistical tests
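As one example of a statistical test, here is a sketch of the two-sample Kolmogorov-Smirnov test written in plain Python. The choice of this test is an assumption, since the talk does not name one. The `drifted` helper is hypothetical and uses the usual large-sample critical value, so it is only a rough guide for small samples.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drifted(a, b, alpha=0.05):
    """Return True when 'same distribution' is rejected at level alpha."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    # Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m)).
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))


# Made-up samples: the second one is shifted by 50.
before = list(range(100))
after = [x + 50 for x in range(100)]
print(drifted(before, before), drifted(before, after))
```

In practice a library implementation would normally be preferred; the point here is only to show what the test compares.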
- In conclusion: monitor
Post-mortems
Unit tests for data
Data versioning
Monitoring
Take care of your data