The MLOps tooling landscape

TensorFlow Belgium

is an absolute mess

ljvmiranda921.github.io

and a marketing battle

MLOps providers dominate information

Every tool does everything perfectly

Constantly changing: no list can keep up

What can we do?

Exploration phase

Filter phase

Selection phase

Most articles are written by vendors

mlops.toys: nice website, but no longer updated

mlops.community Slack channel

GitHub Awesome Lists

EthicalML/awesome-production-machine-learning

visenger/awesome-mlops

kelvins/awesome-mlops

Filter phase

ml-ops.org Stack Canvas: https://ml-ops.org/content/mlops-stack-canvas

AI Infrastructure Alliance Blueprints: https://github.com/ai-infrastructure-alliance/blueprints

Selection phase

Try the tools yourself, do not count on the descriptions

Head-to-head battle

Why is it all so messy?

In essence, all these tools are trying to solve the challenges you'll encounter when doing ML

"Premature optimization is the root of all evil"

~ Gandhi

just kidding, it was Donald Ervin Knuth, in his 1974 paper "Structured Programming with go to Statements"

What challenges exactly are we solving for anyway?

Challenge 0: Buy or Build

No data scientists?

AutoML Platforms

Low-code

Usually include every step

https://twimlai.com/solutions/introducing-twiml-ml-ai-solutions-guide/

Challenge 1: Data Management

Issues

Dataset size outgrows personal machines (+backup)

No overview, no metadata, no insights

Data accessibility

Versioning and Lineage

Level 1

Git LFS

Cloud Bucket

(Cloud) Database

BigQuery

Level 2

DVC

ClearML Data

LakeFS

Dolt

Pachyderm

Level 3

Feature Store

FAIS
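
The versioning issue on this slide can be illustrated without any of the tools listed. What follows is only a minimal sketch of the general idea of content-based versioning: it is not how DVC, LakeFS or any other listed tool is implemented, and the function name `dataset_version` is hypothetical.

```python
import hashlib
from pathlib import Path


def dataset_version(path, chunk_size=1 << 20):
    """Return a short ID derived from the file's bytes (hypothetical helper).

    The same content always gives the same ID, and any change to the
    content gives a different one.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read in chunks so datasets larger than memory still work.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]
```

Storing this ID next to every trained model is the simplest form of lineage: it records exactly which data a model saw.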

Challenge 2: Prototyping Phase

Issues

Chaotic by nature, not triggered by commits

Track output files as well

Experiment Comparison

Reproducibility

Experiment Manager

Weights & Biases

ClearML

MLflow

Sacred

Guild.AI

Self Labeling

Label Studio
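
To make the job of an experiment manager concrete, here is a rough sketch of the bare minimum: store parameters, metrics and output file paths per run, then compare runs. The names `log_run` and `best_run` and the JSON layout are assumptions made for illustration. None of the listed tools works exactly like this, and they all do far more.

```python
import json
import time
import uuid
from pathlib import Path


def log_run(log_dir, params, metrics, outputs=()):
    """Write one experiment record to its own JSON file and return its ID."""
    run_id = uuid.uuid4().hex[:8]
    record = {
        "run_id": run_id,
        "time": time.time(),
        "params": params,
        "metrics": metrics,
        "outputs": [str(p) for p in outputs],  # track output files as well
    }
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    (Path(log_dir) / f"{run_id}.json").write_text(json.dumps(record, indent=2))
    return run_id


def best_run(log_dir, metric):
    """Compare all logged runs and return the record with the highest metric."""
    runs = [json.loads(p.read_text()) for p in Path(log_dir).glob("*.json")]
    return max(runs, key=lambda r: r["metrics"][metric])
```

Because nothing here depends on a commit, it fits the chaotic nature of prototyping: every run is recorded, whether or not the code was committed first.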

Challenge 3: Remote Compute

Issues

Local PC doesn't cut it

Privacy / Management concerns

Better hardware utilization

Unstable usage / demand

Remote Machine

Every cloud ever

Jupyter / remote VSCode

Google Colab

Task-Based

More overhead

Cloud training jobs

Requires orchestration!

Challenge 4: Orchestration

Issues

Managing multiple users is hard

Multi-task scheduling is hard

GPU sharing and utilization is hard (thanks Nvidia)

Chain becomes complex, need pipelining

Cloud Gang

Not only native tools!

All VMs in the end

Onprem Gang

ClearML Orchestrate: Queues and workers

Slurm: Queues and workers

Apache Airflow: Queues and workers

Metaflow: Kubernetes

Kubeflow Pipelines: Kubernetes

Kedro: backend agnostic!
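
"Queues and workers" is a simple pattern at heart: tasks wait in a queue, and whichever worker is free picks up the next one. The toy version below uses threads on one machine, purely as an illustration. The function `run_workers` is hypothetical and does not reflect how ClearML, Slurm or Airflow schedule work across machines.

```python
import queue
import threading


def run_workers(tasks, num_workers=2):
    """Run (name, callable) tasks with a fixed pool of worker threads."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker keeps pulling tasks until the queue is empty.
        while True:
            try:
                name, fn = task_queue.get_nowait()
            except queue.Empty:
                return
            value = fn()
            with lock:
                results[name] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The hard parts named on this slide (multiple users, priorities, GPU sharing) are exactly what this sketch leaves out, and that is where real orchestrators earn their keep.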

Challenge 5: Deployment

Issues

Production: don't half-ass this

How to make the model accessible?

Seamless model updates

Maximise hardware utilization

Optimize Model

ONNX

TensorRT

TensorFlow Lite

ML Kit

Model Serving

Nvidia Triton

ClearML Serving*

BentoML

Seldon Core

Edge AI

~ Hardware

TensorFlow Lite

Challenge 6: Monitoring

Issues

Go to production and be ready to go back. Things will break.

Serving visibility (drift, latency, etc.)

Traceability throughout the whole system

Alerting to be proactive on issues

DIY

Prometheus

Grafana

All-in-ones

Vertex/SageMaker

ClearML

Comet

DataRobot

...

Batteries Included

Data versioning system

Experiment manager

Orchestrator
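
As an illustration of what "drift" plus "alerting" can mean in practice, here is a deliberately naive sketch: compare the mean of a live feature against the training baseline and raise a flag when it moves too far. The function names and the threshold of 3 standard deviations are assumptions made for this example. Production monitoring usually relies on more robust statistics.

```python
from statistics import mean, stdev


def mean_shift(baseline, live):
    """Shift of the live mean from the baseline mean, in baseline std units."""
    spread = stdev(baseline)
    if spread == 0:
        # Constant baseline: any change in the mean counts as infinite shift.
        return 0.0 if mean(live) == mean(baseline) else float("inf")
    return abs(mean(live) - mean(baseline)) / spread


def drift_alert(baseline, live, threshold=3.0):
    """Return True when the live data has drifted past the threshold."""
    return mean_shift(baseline, live) > threshold
```

A check like this would typically run on a schedule and feed an alerting channel, so that a drifting input is noticed before users notice the bad predictions.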

A Final Argument for End-To-End

The click-through effect

Key Takeaways

Watch out for the marketing

GitHub Awesome lists are awesome

Don't prematurely optimize, solve problems as they present themselves

There's more than cloud native tools

Thank you!

https://app.clear.ml

GitHub: https://github.com/allegroai/clearml

Slack: clearml.slack.com

Twitter (goodest memes): @clearMLapp

Me: @VictorSonck

Try it yourself for free! It’s open-source!
