Hidden Technical Debt in ML Systems

Symanto Research

Reading Group

Friday, 08 February 2019

presented by Angelo Basile

What this presentation is not about

A disclaimer on Meta

What this presentation IS about

source: https://en.wikipedia.org/wiki/Technical_debt

ML? Really?

Problems and possible solutions

Complex Models Erode Boundaries

traditional SE practice

ML

Data Dependencies Cost More than Code Dependencies

import pandas vs. import numpy

UNSTABLE DATA DEPENDENCIES

  • tf/idf scores
  • word embeddings
  • sentence encoders

PROBLEM

SOLUTION

create a versioned copy of your data signals
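
A minimal sketch of what a versioned copy could look like, assuming the signal (e.g. an idf table) fits in a JSON-serialisable dict. The function names `snapshot_signal` and `load_signal` and the file layout are hypothetical, not from the paper. The model pins a version id instead of reading "latest", so an upstream change cannot silently alter its inputs.

```python
import hashlib
import json
from pathlib import Path


def snapshot_signal(signal: dict, store: Path, name: str) -> str:
    """Write a content-addressed copy of a signal and return its version id."""
    # sort_keys makes the serialisation, and therefore the hash, deterministic
    payload = json.dumps(signal, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    target = store / f"{name}-{version}.json"
    if not target.exists():  # identical content maps to the same file
        target.write_bytes(payload)
    return version


def load_signal(store: Path, name: str, version: str) -> dict:
    """Load exactly the pinned version; there is deliberately no 'latest'."""
    path = store / f"{name}-{version}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

The trade-off: frozen copies can go stale, so upgrading to a new version becomes an explicit, reviewable step.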

UNDERUTILIZED DATA DEPENDENCIES

  • legacy features
  • bundled features
  • ϵ-Features
  • correlated features

PROBLEM

SOLUTION

do a feature ablation test
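
A leave-one-feature-out ablation can be sketched as below. The nearest-centroid classifier is only a stand-in chosen to keep the example dependency-free, and the function names are hypothetical. For brevity it scores on the training rows; a real test would use held-out data.

```python
from statistics import mean


def centroid_accuracy(rows, labels, features):
    """Fit and score a nearest-centroid classifier using only `features` (column indices)."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(m[f] for m in members) for f in features]

    def predict(row):
        # pick the class whose centroid is closest in squared Euclidean distance
        return min(
            centroids,
            key=lambda c: sum((row[f] - v) ** 2 for f, v in zip(features, centroids[c])),
        )

    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def ablation(rows, labels, n_features):
    """Return {feature index: accuracy drop when that feature is removed}."""
    all_features = list(range(n_features))
    baseline = centroid_accuracy(rows, labels, all_features)
    return {
        f: baseline - centroid_accuracy(rows, labels, [g for g in all_features if g != f])
        for f in all_features
    }


# Toy data: column 0 separates the classes, column 1 is constant.
rows = [(0.0, 5.0), (0.1, 5.0), (1.0, 5.0), (0.9, 5.0)]
labels = ["a", "a", "b", "b"]
drops = ablation(rows, labels, n_features=2)
```

Features whose removal costs (almost) nothing are candidates for deletion.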

ML-System Anti-Patterns

GLUE CODE

  • expensive changes

PROBLEM

SOLUTION

wrap black-box packages in common APIs.

Exactly what I did with BERT for EmoContext
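
A generic sketch of the idea, not the BERT/EmoContext code itself: callers depend on one small interface, and each black-box package gets a thin adapter. `TextClassifier`, `CallableAdapter`, `keyword_scores` and `label_all` are hypothetical names; `keyword_scores` merely stands in for a real library call.

```python
from typing import Callable, Protocol, Sequence


class TextClassifier(Protocol):
    """The one interface the rest of the system is allowed to depend on."""

    def predict(self, texts: Sequence[str]) -> list[str]: ...


class CallableAdapter:
    """Adapts any per-text scoring function to the TextClassifier interface."""

    def __init__(self, score_fn: Callable[[str], dict[str, float]]):
        self._score_fn = score_fn

    def predict(self, texts: Sequence[str]) -> list[str]:
        labels = []
        for text in texts:
            scores = self._score_fn(text)
            labels.append(max(scores, key=scores.get))  # highest-scoring label
        return labels


def keyword_scores(text: str) -> dict[str, float]:
    """Stand-in for a black-box package (assumption: a real backend goes here)."""
    return {"happy": float(text.count(":)")), "sad": float(text.count(":("))}


def label_all(model: TextClassifier, texts: Sequence[str]) -> list[str]:
    """Caller code sees only the common API, so backends can be swapped."""
    return model.predict(texts)
```

Swapping the backend then means writing one new adapter, not touching every call site.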

PIPELINE JUNGLES

  • data preparation can be tricky
  • possible intermediate file output
  • expensive to test

PROBLEM

SOLUTION

think holistically about data; work closely with the engineering team

Dead Experimental Codepaths

  • low cost to branch and experiment
  • high cost to merge after some time

PROBLEM

SOLUTION

see what you need and prune the unused branches

Common Smells

PROBLEM

SOLUTION

-

  • Data-type smell
  • multiple-language (framework) smell
  • prototype smell

Configuration Debt

PROBLEM

SOLUTION

  • any large system has a wide range of configurable options
  • messiness makes configuration hard to modify correctly, and hard to reason about
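
One possible way to keep configuration easy to reason about, sketched under assumptions: a typed, immutable config that checks basic facts on construction, plus a helper to see what changed between two configs. `TrainConfig`, its fields and `config_diff` are hypothetical examples.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-3
    batch_size: int = 32
    n_layers: int = 2

    def __post_init__(self):
        # Check basic facts so a bad value fails at load time, not mid-training.
        if not 0.0 < self.learning_rate < 1.0:
            raise ValueError(f"learning_rate out of range: {self.learning_rate}")
        if self.batch_size <= 0 or self.n_layers <= 0:
            raise ValueError("batch_size and n_layers must be positive")


def config_diff(old: TrainConfig, new: TrainConfig) -> dict:
    """Return {field: (old value, new value)} for every field that changed."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(old)
        if getattr(old, f.name) != getattr(new, f.name)
    }


base = TrainConfig()
experiment = replace(base, batch_size=64)  # a small change from a known-good config
changed = config_diff(base, experiment)
```

Because `replace` re-runs the validation, a derived config cannot bypass the checks.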

Dealing with Changes in the External World

Other

  • Data testing
  • Reproducibility
  • Process management
  • Cultural debt

Conclusions
