The MLOps tooling landscape

TensorFlow Belgium

ljvmiranda921.github.io

The MLOps tooling landscape

is an absolute mess

and a marketing battle

MLOps providers dominate information

Every tool does everything perfectly

Constantly changing: no list can keep up

What can we do?

Exploration phase

Filter phase

Selection phase

Most articles are written by vendors

mlops.toys: nice website, but no longer updated

mlops.community: Slack channel

GitHub Awesome Lists

EthicalML/awesome-production-machine-learning

visenger/awesome-mlops

kelvins/awesome-mlops

Filter phase

ml-ops.org Stack Canvas: https://ml-ops.org/content/mlops-stack-canvas

AI Infrastructure Alliance Blueprints: https://github.com/ai-infrastructure-alliance/blueprints

Selection phase

Try the tools yourself, do not rely on their descriptions

Head-to-head battle

Why is it all so messy?

In essence, all these tools are trying to solve the challenges you'll encounter when doing ML

Premature optimization is the root of all evil

~ Gandhi

Just kidding, it was Donald Ervin Knuth, in "Structured Programming with go to Statements" (1974)

What challenges exactly are we solving for anyway?

Challenge 0: Buy or Build

No data scientists?

AutoML Platforms

Low-code

Usually include every step

https://twimlai.com/solutions/introducing-twiml-ml-ai-solutions-guide/

Challenge 1: Data Management

Issues

Dataset size outgrows personal machines (+backup)

No overview, no metadata, no insights

Data accessibility

Versioning and lineage

Level 1

Git LFS

Cloud Bucket

(Cloud) Database

BigQuery

Level 2

DVC

ClearML Data

LakeFS

Dolt

Pachyderm

Level 3

Feature Store

FAIS
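
The "versioning and lineage" issue from Challenge 1 can be illustrated with plain content hashing: if any file in a dataset changes, the dataset gets a new version id. This is only a minimal sketch using the Python standard library, not a description of how DVC, LakeFS, Dolt or any other tool listed here is implemented. The names dataset_version and demo are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_version(root: Path) -> str:
    """Return a short id that changes whenever any file under root changes."""
    digest = hashlib.sha256()
    # Sort paths so the id does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Hash the relative file name and the file contents.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:12]


def demo() -> tuple:
    """Hash a toy dataset, change it, then restore it (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "train.csv").write_text("x,y\n1,2\n")
        v1 = dataset_version(root)
        (root / "train.csv").write_text("x,y\n1,2\n3,4\n")
        v2 = dataset_version(root)
        (root / "train.csv").write_text("x,y\n1,2\n")
        v3 = dataset_version(root)
    return v1, v2, v3
```

Storing this id next to every experiment gives a crude form of lineage: a model can always be traced back to the exact data it was trained on.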

Challenge 2: Prototyping Phase

Issues

Chaotic by nature, so tracking cannot depend on a commit trigger

Track output files as well

Experiment Comparison

Reproducibility

Experiment Manager

Weights & Biases

ClearML

MLflow

Sacred

Guild AI

Self Labeling

Label Studio
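
What an experiment manager has to do at minimum (record parameters, metrics and output files per run, then compare runs) can be sketched in a few lines. This is an illustration under simple assumptions (one local JSON-lines file, no concurrency), not the API of Weights & Biases, ClearML, MLflow or any other tool above. ExperimentLog and demo are hypothetical names.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


class ExperimentLog:
    """Append-only log of runs, one JSON line per run (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def log_run(self, params: dict, metrics: dict, outputs: list) -> str:
        """Record parameters, metrics and output file names; return the run id."""
        run_id = uuid.uuid4().hex[:8]
        record = {
            "run_id": run_id,
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
            "outputs": outputs,
        }
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return run_id

    def runs(self) -> list:
        """Load every recorded run."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def best(self, metric: str, higher_is_better: bool = True) -> dict:
        """Compare all runs on one metric and return the winning run."""
        candidates = [r for r in self.runs() if metric in r["metrics"]]
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r["metrics"][metric])


def demo() -> dict:
    """Log two toy runs and return the one with the best accuracy."""
    with tempfile.TemporaryDirectory() as tmp:
        log = ExperimentLog(Path(tmp) / "runs.jsonl")
        log.log_run({"lr": 0.1}, {"accuracy": 0.81}, ["model_a.pkl"])
        log.log_run({"lr": 0.01}, {"accuracy": 0.88}, ["model_b.pkl"])
        return log.best("accuracy")
```

Note that logging happens on every run, not on every commit, which matches the chaotic nature of prototyping.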

Challenge 3: Remote Compute

Issues

Local PC doesn't cut it

Privacy / Management concerns

Better hardware utilization

Unstable usage / demand

Remote Machine

Every cloud ever

Jupyter / remote VS Code

Google Colab

Task-Based

More overhead

Cloud training jobs

Requires orchestration!

Challenge 4: Orchestration

Issues

Managing multiple users is hard

Multi-task scheduling is hard

GPU sharing and utilization is hard (thanks Nvidia)

Task chains become complex and need pipelining

Cloud Gang

Not only native tools!

All VMs in the end

On-prem Gang

ClearML Orchestrate: Queues and workers

Slurm: Queues and workers

Apache Airflow: Queues and workers

Metaflow: Kubernetes

Kubeflow Pipelines: Kubernetes

Kedro: backend agnostic!
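
The "queues and workers" pattern from Challenge 4 boils down to this: tasks go into a shared queue, and a fixed pool of workers pulls from it until it is empty. The sketch below uses threads on one machine purely for illustration. Real orchestrators distribute workers over machines and GPUs, and nothing here reflects the internals of ClearML, Slurm or Airflow. run_tasks is a hypothetical name.

```python
import queue
import threading


def run_tasks(tasks: dict, n_workers: int = 2) -> dict:
    """Run named callables from a shared queue on n_workers threads."""
    todo = queue.Queue()
    results = {}
    lock = threading.Lock()

    # Enqueue everything up front; workers decide who picks up what.
    for name, fn in tasks.items():
        todo.put((name, fn))

    def worker() -> None:
        while True:
            try:
                name, fn = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker shuts down
            value = fn()
            with lock:
                results[name] = value
            todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Usage: run_tasks({"train_a": job_a, "train_b": job_b}, n_workers=2), where job_a and job_b are any zero-argument functions. Adding capacity means adding workers, not changing the tasks.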

Challenge 5: Deployment

Issues

Production: don't half-ass this

How to make the model accessible?

Seamless model updates

Maximise hardware utilization

Optimize Model

ONNX

TensorRT

TensorFlow Lite

ML Kit

Model Serving

Nvidia Triton

ClearML Serving*

BentoML

Seldon Core

Edge AI

~ Hardware

TensorFlow Lite
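
"Seamless model updates" from Challenge 5 means swapping the model behind an endpoint without dropping requests, and refusing the swap if the new model is broken. The sketch below shows the idea in-process with a lock and a smoke test. It is an assumption-laden illustration, not how Triton, ClearML Serving, BentoML or Seldon Core work. ModelEndpoint, swap and canary_input are hypothetical names.

```python
import threading
from typing import Callable, Tuple

Model = Callable[[float], float]


class ModelEndpoint:
    """Serve predictions while allowing the model to be replaced at runtime."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def predict(self, x: float) -> Tuple[str, float]:
        # Take a consistent (model, version) pair, then predict outside the lock
        # so slow predictions never block an update.
        with self._lock:
            model, version = self._model, self._version
        return version, model(x)

    def swap(self, model: Model, version: str, canary_input: float = 1.0) -> None:
        """Replace the live model, but only if the new one passes a smoke test."""
        model(canary_input)  # raises if the new model is broken; old one stays live
        with self._lock:
            self._model, self._version = model, version
```

Returning the version with every prediction also helps with the traceability issue raised under monitoring.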

Challenge 6: Monitoring

Issues

Go to production and be ready to go back. Things will break.

Serving visibility (drift, latency, etc.)

Traceability throughout the whole system

Alerting to be proactive on issues

DIY

Prometheus

Grafana

All-in-ones

Vertex AI / SageMaker

ClearML

Comet

DataRobot

.............

Batteries Included

Data versioning system

Experiment manager

Orchestrator
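
Drift visibility and proactive alerting from Challenge 6 can start very small: compare a window of live feature values against the training data and alert when the mean moves too far. The sketch below uses a simple standard-error check with an assumed threshold of 3. It is one of many possible drift measures and is not what Prometheus, Grafana or the all-in-one platforms compute. drift_score and check_drift are hypothetical names.

```python
import statistics


def drift_score(reference: list, live: list) -> float:
    """How many standard errors the live mean sits from the reference mean."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)  # assumes the reference is not constant
    standard_error = ref_std / len(live) ** 0.5
    return abs(statistics.fmean(live) - ref_mean) / standard_error


def check_drift(reference: list, live: list, threshold: float = 3.0) -> dict:
    """Return the score and whether it crosses the (assumed) alert threshold."""
    score = drift_score(reference, live)
    return {"score": score, "alert": score > threshold}
```

Usage: call check_drift(training_values, recent_values) on a schedule and route any result with "alert" set to True into whatever alerting channel is already in place.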

A Final Argument for End-To-End

The click-through effect

Key Takeaways

Watch out for the marketing

GitHub Awesome lists are awesome

Don't prematurely optimize, solve problems as they present themselves

There's more out there than cloud-native tools

Thank you!

https://app.clear.ml

GitHub: https://github.com/allegroai/clearml

Slack: clearml.slack.com

Twitter (goodest memes): @clearMLapp

Me: @VictorSonck

Try it yourself for free! It’s open-source!