Hidden Technical Debt in ML Systems

Symanto Research

Reading Group

Friday, 8 February 2019

presented by Angelo Basile

What this presentation is not about

A disclaimer on Meta

What this presentation IS about

source: https://en.wikipedia.org/wiki/Technical_debt

ML? Really?

Problems and possible solutions

Complex Models Erode Boundaries

traditional SE practice

ML

Data Dependencies Cost More than Code Dependencies

import pandas vs. import numpy

UNSTABLE DATA DEPENDENCIES

PROBLEM

tf-idf scores
word embeddings
sentence encoders

SOLUTION

create a versioned copy of your data signals
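A minimal sketch of the versioning idea, assuming the data signal is a small table of tf-idf scores that fits in memory. The names `freeze_signal` and `load_signal` and the file layout are hypothetical, not from the talk; the version id here is simply a hash of the content.

```python
import hashlib
import json
from pathlib import Path


def freeze_signal(signal: dict, store: Path) -> str:
    """Write an immutable, content-addressed copy of a data signal.

    Returns the version id, which the consuming model pins instead of
    reading the live (and possibly changing) signal.
    """
    payload = json.dumps(signal, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"signal-{version}.json"
    if not path.exists():  # a frozen version is never overwritten
        path.write_bytes(payload)
    return version


def load_signal(version: str, store: Path) -> dict:
    """Load exactly the pinned version of the signal."""
    path = store / f"signal-{version}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

The consumer stores the returned version id next to the trained model, so an upstream change to the signal produces a new id rather than silently changing the model's input.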

UNDERUTILIZED DATA DEPENDENCIES

PROBLEM

legacy features
bundled features
ϵ-Features
correlated features

SOLUTION

do a feature ablation test
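A feature ablation test can be sketched as a leave-one-feature-out loop. This is an illustration under assumptions: the data is synthetic, the model is plain least squares via NumPy, and the error is measured on the training set for brevity (a real test would use held-out data). The names `fit_mse` and `ablation_report` are hypothetical.

```python
import numpy as np


def fit_mse(X: np.ndarray, y: np.ndarray) -> float:
    """Fit ordinary least squares and return the mean squared error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ coef - y) ** 2))


def ablation_report(X: np.ndarray, y: np.ndarray, names: list) -> dict:
    """For each feature, report how much the error grows when it is removed."""
    baseline = fit_mse(X, y)
    report = {}
    for i, name in enumerate(names):
        X_without = np.delete(X, i, axis=1)  # drop one column
        report[name] = fit_mse(X_without, y) - baseline
    return report


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic target: only the first two features matter; "legacy" is noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.01 * rng.normal(size=200)
report = ablation_report(X, y, ["useful_a", "useful_b", "legacy"])
```

Features whose removal barely changes the error (here, `legacy`) are candidates for deletion.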

ML-System Anti-Patterns

GLUE CODE

PROBLEM

expensive changes

SOLUTION

wrap black-box packages in common APIs

Exactly what I did with BERT for EmoContext
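A minimal sketch of a common API over a black-box package, assuming a text-classification setting. Everything here is hypothetical and is not the presenter's BERT code: `TextClassifier` is the shared interface, `KeywordBackend` stands in for a third-party package with its own calling convention, and `KeywordAdapter` is the only place that knows about it.

```python
from typing import Protocol, Sequence


class TextClassifier(Protocol):
    """The one interface the rest of the system is allowed to depend on."""

    def predict(self, texts: Sequence[str]) -> list[str]: ...


class KeywordBackend:
    """Stand-in for a black-box package with its own native call."""

    def run(self, text: str) -> dict:
        return {"label": "happy" if "great" in text.lower() else "other"}


class KeywordAdapter:
    """Adapter: translates the backend's native call into the common API."""

    def __init__(self, backend: KeywordBackend) -> None:
        self._backend = backend

    def predict(self, texts: Sequence[str]) -> list[str]:
        return [self._backend.run(t)["label"] for t in texts]


def label_all(model: TextClassifier, texts: Sequence[str]) -> list[str]:
    """Application code sees only TextClassifier, never the backend."""
    return model.predict(texts)
```

Swapping the backend then means writing one new adapter; `label_all` and its callers stay unchanged.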

PIPELINE JUNGLES

PROBLEM

data preparation can be tricky
possible intermediate file output
expensive to test

SOLUTION

think holistically about data, work closely with engineering team

Dead Experimental Codepaths

PROBLEM

low cost to branch and experiment
high cost to merge after some time

SOLUTION

see what you need and prune the unused branches

Common Smells

PROBLEM

Data-type smell
multiple-language (framework) smell
prototype smell

SOLUTION

-

Configuration Debt

PROBLEM

SOLUTION

any large system has a wide range of configurable options
messiness makes configuration hard to modify correctly, and hard to reason about
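One way to make configuration harder to get wrong and easier to reason about, sketched under assumptions: options live in a single validated, immutable object, unknown keys are rejected, and two configurations can be diffed. The option names and the helpers `load_config` and `diff_config` are hypothetical, not from the talk.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-3
    batch_size: int = 32
    feature_set: str = "v1"

    def __post_init__(self) -> None:
        # Validate at construction so a bad value fails before training starts.
        if not 0.0 < self.learning_rate < 1.0:
            raise ValueError(f"learning_rate out of range: {self.learning_rate}")
        if self.batch_size <= 0:
            raise ValueError(f"batch_size must be positive: {self.batch_size}")


def load_config(raw: dict) -> TrainConfig:
    """Build a config from a dict, rejecting unknown keys such as typos."""
    known = {f.name for f in fields(TrainConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return TrainConfig(**raw)


def diff_config(a: TrainConfig, b: TrainConfig) -> dict:
    """Return the options that differ between two configs as (old, new) pairs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(TrainConfig)
        if getattr(a, f.name) != getattr(b, f.name)
    }
```

For example, `diff_config(load_config({}), load_config({"batch_size": 64}))` shows only the one option that changed, which keeps experiment configurations reviewable.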

Dealing with Changes in the External World

Other

Data testing
Reproducibility
Process management
Cultural debt

Conclusions
