Hidden Technical Debt in ML Systems
Symanto Research
Reading Group
Friday, 8 February 2019
presented by Angelo Basile
What this presentation is not about
A disclaimer on Meta
What this presentation IS about
source: https://en.wikipedia.org/wiki/Technical_debt
ML? Really?
Problems and possible solutions
Complex Models Erode Boundaries
traditional SE practice
ML
Data Dependencies Cost More than Code Dependencies
import pandas vs. import numpy
UNSTABLE DATA DEPENDENCIES
PROBLEM
- tf/idf scores
- word embeddings
- sentence encoders
SOLUTION
create a versioned copy of your data signals
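A minimal sketch of what a versioned copy of a data signal could look like, assuming the signal fits in a JSON-serializable dict. The helper names (snapshot_signal, load_signal), the file naming scheme, and the example scores are hypothetical, not taken from the talk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def snapshot_signal(signal: dict, store: Path) -> str:
    """Write a content-addressed copy of a signal and return its version id."""
    # Serialize deterministically so identical content always yields the same id.
    payload = json.dumps(signal, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    path = store / f"signal-{version}.json"
    if not path.exists():  # never overwrite an existing snapshot
        path.write_bytes(payload)
    return version


def load_signal(version: str, store: Path) -> dict:
    """Load exactly the snapshot a model was trained against."""
    path = store / f"signal-{version}.json"
    return json.loads(path.read_text(encoding="utf-8"))


store = Path(tempfile.mkdtemp())
idf_v1 = {"debt": 2.3, "model": 0.7}  # made-up idf scores
v1 = snapshot_signal(idf_v1, store)

# The upstream producer later changes its scores; the pinned copy is unaffected.
idf_v2 = {"debt": 1.9, "model": 0.7}
v2 = snapshot_signal(idf_v2, store)
```

A model that records v1 next to its weights keeps reading the same scores until someone deliberately moves it to v2.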
UNDERUTILIZED DATA DEPENDENCIES
PROBLEM
- legacy features
- bundled features
- ϵ-Features
- correlated features
SOLUTION
do a feature ablation test
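A feature ablation test can be sketched as a leave-one-feature-out loop. This is an illustration under assumptions: evaluate stands for "retrain and score the model on this feature subset", and toy_evaluate with its numbers is a made-up stand-in so the example runs on its own.

```python
from typing import Callable, Dict, Sequence


def ablate(features: Sequence[str],
           evaluate: Callable[[Sequence[str]], float]) -> Dict[str, float]:
    """Return, per feature, the score lost when that feature alone is removed."""
    baseline = evaluate(features)
    drops = {}
    for name in features:
        remaining = [f for f in features if f != name]
        drops[name] = baseline - evaluate(remaining)
    return drops


def toy_evaluate(feats: Sequence[str]) -> float:
    # Hypothetical per-feature contributions, standing in for a real training run.
    gains = {"tfidf": 0.20, "embeddings": 0.15, "legacy_flag": 0.0}
    return 0.5 + sum(gains[f] for f in feats)


drops = ablate(["tfidf", "embeddings", "legacy_flag"], toy_evaluate)
# Features whose removal costs (almost) nothing are candidates for deletion.
removable = [name for name, drop in drops.items() if drop < 0.001]
```

Removing one feature at a time can hide correlated features: each of a correlated pair may look removable on its own, so it is worth re-running the test after every deletion.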
ML-System Anti-Patterns
GLUE CODE
PROBLEM
- expensive changes
SOLUTION
wrap black-box packages in common APIs.
Exactly what I did with BERT for EmoContext
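The idea of hiding a black-box package behind a common API can be sketched with a small adapter. All names here (TextClassifier, KeywordBackend, KeywordClassifier, label_batch) are hypothetical; this is not the BERT wrapper mentioned above, only an illustration of the pattern.

```python
from typing import List, Protocol


class TextClassifier(Protocol):
    """The one interface the rest of the system is allowed to depend on."""

    def predict(self, texts: List[str]) -> List[str]: ...


class KeywordBackend:
    """Hypothetical stand-in for a black-box package with its own calling convention."""

    def run(self, doc: str) -> dict:
        return {"label": "positive" if "great" in doc.lower() else "neutral"}


class KeywordClassifier:
    """Adapter: hides the backend's quirks behind the common API."""

    def __init__(self, backend: KeywordBackend) -> None:
        self._backend = backend

    def predict(self, texts: List[str]) -> List[str]:
        return [self._backend.run(t)["label"] for t in texts]


def label_batch(model: TextClassifier, texts: List[str]) -> List[str]:
    # Caller code only knows about TextClassifier, so swapping backends is cheap.
    return model.predict(texts)


labels = label_batch(KeywordClassifier(KeywordBackend()), ["Great talk!", "ok"])
```

Replacing the backend then means writing one new adapter rather than touching every caller.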
PIPELINE JUNGLES
PROBLEM
- data preparation can be tricky
- possible intermediate file output
- expensive to test
SOLUTION
think holistically about data, work closely with engineering team
Dead Experimental Codepaths
PROBLEM
- low cost to branch and experiment
- high cost to merge after some time
SOLUTION
see what you need and prune the unused branches
Common Smells
PROBLEM
SOLUTION
- Data-type smell
- multiple-language (framework) smell
- prototype smell
Configuration Debt
PROBLEM
SOLUTION
- any large system has a wide range of configurable options
- messiness makes configuration hard to modify correctly, and hard to reason about
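One possible way to keep configuration easier to reason about, offered as an assumption rather than something stated on the slide: put the options in a single validated, immutable object and express each variant as a small change to a base. TrainConfig and its fields are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-3
    batch_size: int = 32
    max_epochs: int = 10

    def __post_init__(self) -> None:
        # Fail fast on values that can never be valid.
        if not 0.0 < self.learning_rate < 1.0:
            raise ValueError(f"learning_rate out of range: {self.learning_rate}")
        if self.batch_size <= 0 or self.max_epochs <= 0:
            raise ValueError("batch_size and max_epochs must be positive")


base = TrainConfig()
# A variant is a small, explicit change to the base config.
experiment = replace(base, batch_size=64)
```

Because the dataclass is frozen, a config cannot be mutated halfway through a run, and the difference between two configs is visible in the one line that creates the variant.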
Dealing with Changes in the External World
Other
- Data testing
- Reproducibility
- Process management
- Cultural debt
Conclusions