Take care of your data

Nastasia Saby

@saby_nastasia

Unit tests for data

Data versioning

Monitoring

Data are very very very important in data systems

Data are the food

Data

Function

Program

Results

Data

Function

Results

Program

Data > Code

Data versioning

Manually

- year = 2019

   - month = 11

   - month = 12
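
The manual layout above can be sketched in Python. This is only an illustration, assuming `key=value` folder names like the ones on the slide; the root folder `datasets/events` and the helper `partition_path` are hypothetical names, not taken from the talk.

```python
from pathlib import PurePosixPath


def partition_path(root: str, year: int, month: int) -> PurePosixPath:
    """Build a partition path such as <root>/year=2019/month=11."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return PurePosixPath(root) / f"year={year}" / f"month={month}"


# Each monthly snapshot gets its own folder, so older versions stay readable.
snapshots = [partition_path("datasets/events", 2019, m) for m in (11, 12)]
for path in snapshots:
    print(path)
```

Writing a new folder per period instead of overwriting is what makes this a (manual) form of versioning.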

Automatically

Data will change

Reproducibility
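
One possible way to tie a result to the exact data that produced it is to record a content hash next to each run. The sketch below is an assumption about how this could look, not the approach shown in the talk; `dataset_fingerprint`, `run_metadata` and the two CSV payloads are hypothetical.

```python
import hashlib


def dataset_fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying this exact version of the data."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical CSV payloads standing in for two versions of the same file.
v1 = b"id,age\n1,22\n2,38\n"
v2 = b"id,age\n1,22\n2,39\n"

# Store the fingerprint with the run so the run can be matched to its data.
run_metadata = {"data_version": dataset_fingerprint(v1)}
print(run_metadata["data_version"] == dataset_fingerprint(v1))  # True
print(run_metadata["data_version"] == dataset_fingerprint(v2))  # False
```

If the fingerprint stored with a run no longer matches the file on disk, the data changed and the run cannot be reproduced as is.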

Unit tests for your data

What is a unit test for data?

Tests for code

Example 1

Titanic dataset

Check

DataKitchen                        @saby_nastasia

Assert
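
The slides do not survive in this export, so the following is one possible reading of "check" versus "assert": a check reports the offending rows, an assert stops the pipeline. The rows are hand-written toy values shaped like the Titanic dataset (columns `Survived` and `Pclass`), and both function names are hypothetical.

```python
# Toy rows shaped like the Titanic dataset (values invented for illustration).
rows = [
    {"PassengerId": 1, "Survived": 0, "Pclass": 3, "Age": 40.0},
    {"PassengerId": 2, "Survived": 1, "Pclass": 1, "Age": 31.0},
    {"PassengerId": 3, "Survived": 1, "Pclass": 2, "Age": None},
]


def check_survived_is_binary(rows):
    """Check: return the offending rows instead of failing."""
    return [r for r in rows if r["Survived"] not in (0, 1)]


def assert_pclass_is_valid(rows):
    """Assert: stop the pipeline as soon as the expectation is broken."""
    for r in rows:
        assert r["Pclass"] in (1, 2, 3), f"unexpected Pclass in row {r}"


print(check_survived_is_binary(rows))  # [] when every value is 0 or 1
assert_pclass_is_valid(rows)
```

Both tests encode knowledge about the data: survival is a yes/no flag and there were only three passenger classes.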

It means that you must know your data

Examples of tests you can do
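
A few generic tests of this kind, written as a minimal sketch. The helper names, the sample columns and the 30% threshold are assumptions chosen for illustration; real thresholds depend on what you know about your data.

```python
def null_ratio(values):
    """Share of missing (None) values in a column."""
    return sum(v is None for v in values) / len(values)


def is_unique(values):
    """True when no value appears twice."""
    return len(set(values)) == len(values)


def all_in_range(values, low, high):
    """True when every non-missing value lies in [low, high]."""
    return all(low <= v <= high for v in values if v is not None)


# Hypothetical columns, invented for illustration.
passenger_ids = [1, 2, 3, 4]
ages = [22.0, None, 35.0, 54.0]

assert is_unique(passenger_ids)      # identifiers must not repeat
assert null_ratio(ages) <= 0.3       # tolerate up to 30% missing ages
assert all_in_range(ages, 0, 120)    # ages must be plausible
```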

And then what to do?

It depends.

Data monitoring

Classic Monitoring

Dashboards

Alerts

Once upon a time, a small virus was born in Wuhan

And data analysis and machine learning crashed

Everything changes

Different forms of data drift

- gradual

- abrupt

- seasonal
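
The three forms can be pictured with synthetic series. This is a toy sketch: the functions, slopes, change point and period are all made-up parameters, not measurements.

```python
import math


def gradual(t, slope=0.1):
    """The mean moves a little at every step."""
    return slope * t


def abrupt(t, change_point=50, jump=5.0):
    """The mean jumps once, at change_point."""
    return jump if t >= change_point else 0.0


def seasonal(t, period=12, amplitude=2.0):
    """The mean oscillates and comes back to the same level every period."""
    return amplitude * math.sin(2 * math.pi * t / period)


steps = range(100)
series = {f.__name__: [f(t) for t in steps] for f in (gradual, abrupt, seasonal)}
```

Plotting the three lists in `series` shows a ramp, a step and a wave, which is why a single fixed threshold rarely catches all three.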

How to prevent data drift?

- Unit tests

- Distance measures

- Statistical tests

- Distance measures
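
The talk does not say which distance it uses. As one example, the sketch below computes the total variation distance between the category frequencies of a reference sample and a current sample; the samples, the function name and the threshold are assumptions.

```python
from collections import Counter


def total_variation_distance(reference, current):
    """Half the L1 distance between two empirical category distributions.

    0.0 means identical frequencies, 1.0 means no category in common.
    """
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = set(ref_counts) | set(cur_counts)
    return 0.5 * sum(
        abs(ref_counts[c] / len(reference) - cur_counts[c] / len(current))
        for c in categories
    )


# Hypothetical samples of one categorical feature, before and after a change.
reference = ["web"] * 70 + ["mobile"] * 30
current = ["web"] * 40 + ["mobile"] * 60

distance = total_variation_distance(reference, current)
THRESHOLD = 0.2  # assumption: tune per feature
if distance > THRESHOLD:
    print(f"possible drift: distance={distance:.2f}")
```

A distance gives a number to track over time; deciding how large is too large remains a judgment call per feature.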

- Statistical tests
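
The talk does not name a specific test. As an example, here is a minimal two-sample Kolmogorov-Smirnov sketch using only the standard library. It relies on the large-sample critical value, so treat it as an approximation for small samples; the function names and sample data are hypothetical.

```python
import bisect
import math


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drift_detected(a, b, alpha=0.05):
    """Reject 'same distribution' using the large-sample critical value."""
    n, m = len(a), len(b)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


# Synthetic numeric feature: the second sample is the first one shifted by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]
print(drift_detected(reference, reference))  # False
print(drift_detected(reference, shifted))    # True
```

Unlike a raw distance, a test comes with a significance level, which gives a principled starting point for an alert threshold.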

- In conclusion: monitor

Post-mortems

Unit tests for data

Data versioning

Monitoring

Thank you!

@saby_nastasia

https://mlinreallife.github.io/
