Take care of your data
Nastasia Saby
@saby_nastasia
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8067037/path_learn.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8067051/1fd536e0-9dea-496f-a5c6-b9b5d9ae614a.png)
Unit tests for data
Data versioning
Monitoring
Data are very very very important in data systems
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8067037/path_learn.jpg)
Data are the food
Data
Function
Program
Results
Data
Function
Results
Program
Data > Code
Data versioning
Manually
- year = 2019
- month = 11
- month = 12
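A minimal sketch of what manual versioning by partition can look like, assuming each batch is written under a directory named after its year and month. The `partition_path` helper and the `datasets/sales` root are hypothetical, chosen only for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, day: date) -> PurePosixPath:
    """Build a partition directory such as root/year=2019/month=11."""
    return PurePosixPath(root) / f"year={day.year}" / f"month={day.month:02d}"


# Hypothetical dataset root; each month lands in its own directory,
# so an older month can still be read back unchanged later on.
november = partition_path("datasets/sales", date(2019, 11, 3))
december = partition_path("datasets/sales", date(2019, 12, 24))
print(november)
print(december)
```

The version is only as reliable as the discipline of never overwriting an existing partition.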
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8069661/pasted-from-clipboard.png)
Automatically
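One possible reading of "automatically", sketched under the assumption that a version is derived from the content itself rather than from a name someone picks by hand. The `dataset_version` helper and the sample CSV bytes are hypothetical; this is not the method of any particular tool.

```python
import hashlib


def dataset_version(raw: bytes) -> str:
    """Derive a short version identifier from the bytes of a dataset."""
    return hashlib.sha256(raw).hexdigest()[:12]


# Two made-up snapshots that differ by a single value.
v1 = dataset_version(b"id,amount\n1,10\n2,20\n")
v2 = dataset_version(b"id,amount\n1,10\n2,25\n")

# Any change in the data yields a different identifier,
# while identical data always maps to the same one.
print(v1 != v2)
```

Storing the identifier next to each result makes it possible to tell which data produced it.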
Data can and will change
Reproducibility
Unit tests for your data
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8067037/path_learn.jpg)
What is a unit test for data?
Tests for code
Example 1
Titanic dataset
Check
DataKitchen @saby_nastasia
Assert
It means that you must know your data
Examples of tests you can do
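A few such tests, sketched in plain Python on made-up rows shaped like the Titanic dataset. The rows, the `check_rows` helper, and every threshold (age range, share of missing ages) are assumptions for illustration; real limits come from knowing your own data.

```python
# Made-up rows shaped like the Titanic dataset; not real passengers.
rows = [
    {"PassengerId": 1, "Survived": 0, "Pclass": 3, "Age": 22.0},
    {"PassengerId": 2, "Survived": 1, "Pclass": 1, "Age": 38.0},
    {"PassengerId": 3, "Survived": 1, "Pclass": 3, "Age": None},
]


def check_rows(rows):
    """Return a list of problems; an empty list means the batch passes."""
    problems = []

    # Uniqueness: an identifier must not appear twice.
    ids = [r["PassengerId"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate PassengerId")

    for r in rows:
        # Allowed values: binary flag and three ticket classes.
        if r["Survived"] not in (0, 1):
            problems.append(f"bad Survived for id {r['PassengerId']}")
        if r["Pclass"] not in (1, 2, 3):
            problems.append(f"bad Pclass for id {r['PassengerId']}")
        # Range: the bounds are an assumption, not a rule from the dataset.
        age = r["Age"]
        if age is not None and not 0 <= age <= 120:
            problems.append(f"Age out of range for id {r['PassengerId']}")

    # Completeness: tolerate some missing ages, but not most of them.
    missing_share = sum(r["Age"] is None for r in rows) / len(rows)
    if missing_share > 0.5:
        problems.append("too many missing Age values")

    return problems


assert check_rows(rows) == []
```

Returning a list of problems instead of raising on the first one leaves the decision (fail, warn, or quarantine the batch) to the caller.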
And then what to do?
It depends.
Data monitoring
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8067037/path_learn.jpg)
Classic Monitoring
Dashboards
Alerts
Once upon a time, a small virus was born in Wuhan
And data analysis and machine learning crashed
Everything changes
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072355/pasted-from-clipboard.png)
Different forms of data drift
- gradual
- abrupt
- seasonal
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072313/pasted-from-clipboard.png)
How to prevent data drift?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072317/pasted-from-clipboard.png)
- Unit tests
- Distance measures
- Statistical tests
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072317/pasted-from-clipboard.png)
- Distance measures
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072317/pasted-from-clipboard.png)
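One distance that can be used here, as a sketch: the total variation distance between the category frequencies of a reference batch and a current batch. The choice of this particular distance, the `total_variation` helper, and the payment-method values are assumptions for illustration.

```python
from collections import Counter


def frequencies(values):
    """Relative frequency of each category in a batch."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation(reference, current):
    """Half the sum of absolute frequency differences: 0 is identical, 1 is disjoint."""
    p, q = frequencies(reference), frequencies(current)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Made-up batches: the share of "card" payments drops from 70% to 40%.
reference = ["card"] * 70 + ["cash"] * 30
current = ["card"] * 40 + ["cash"] * 60
drift = total_variation(reference, current)
print(round(drift, 2))
```

The distance only becomes a signal once it is compared with a threshold, and that threshold has to be tuned on the history of the data.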
- Statistical tests
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072355/pasted-from-clipboard.png)
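One statistical test that fits numeric columns, as a sketch: the two-sample Kolmogorov-Smirnov test, written here with the standard library only. The choice of test, the helper names, and the sample values are assumptions; the critical value uses the usual large-sample approximation, so it is rough for small batches.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at an observed value, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drifted(a, b, alpha=0.05):
    """True when the statistic exceeds the approximate critical value at level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c * math.sqrt((len(a) + len(b)) / (len(a) * len(b)))
    return ks_statistic(a, b) > critical


# Made-up batches: the second one is the first one shifted by 50.
last_month = list(range(100))
this_month = [x + 50 for x in range(100)]
print(drifted(last_month, this_month))
```

With very large batches even tiny shifts become statistically significant, so the result is best read together with a distance measure.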
- In conclusion: monitor
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8072317/pasted-from-clipboard.png)
Post-mortems
Unit tests for data
Data versioning
Monitoring
![](https://s3.amazonaws.com/media-p.slid.es/uploads/285298/images/8067037/path_learn.jpg)
Take care of your data
By nastasiasaby