How to deploy and scale an ML model in production using

  • Freelance Senior Data Scientist
  • +7 years experience in Consulting, Tech, and Startups
  • Interests in NLP, MLOps, and AI products
  • ML trainer
  • Content creator on Medium


Follow me on

What we'll learn today

​This presentation will cover:

  • APIs and production machine learning
  • BentoML: an Open Source Model Serving framework
  • Overview of ML-related features
  • Demo
  • Resources to go further

Don't hesitate to interrupt and ask questions!

1. APIs and production machine learning

Because there's a life after jupyter notebooks


What happens when your model is done training?

An after-life

  • The infra team needs a minimum of code packaging and dependency management to deploy your model
  • DevOps needs to know about resource consumption
  • Business and product teams  need to stress-test the model (i.e. API needs to scale to multiple concurrent queries)
  • Developers need to access your API documentation to know how to consume it


API Requirements

As a data scientist, you want

  • Support for multiple frameworks (torch, TF, scikit learn)
  • Micro batching
  • Performance and scalability: parallelization, high throughput
  • Ability to use accelerated runtimes (GPUs)

As a thoughtful colleague, you want

  • Reproducibility
  • Dependency management 
  • Documentation
  • Monitoring 
  • Debugging
  • Data validation

2. BentoML


Open Source Model Serving

  • Simplifies model serving and deployment
  • Packages everything you need in a distribution format called a bento
  • Enables data science agility
    • Integrates Pre/Post processing
    • Supports many popular ML frameworks
    • Automatically generates Docker container
    • Deploy to any cloud infrastructure
  • Improves the inference performance

A Bento is like  Docker container, but for ML

Bento is a file archive with all the source code, models, data files, and dependency configurations required for running a user-defined bentoml.Service, packaged into a standardized format.

A Bento is Self-contained and


How to save a model with Bento, create an API and deploy it?

Step 1: save a model

import bentoml

from sklearn import svm
from sklearn import datasets

# Load training data set
iris = datasets.load_iris()
X, y =,

# Train the model
clf = svm.SVC(gamma='scale'), y)

# Save model to the BentoML local model store
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(f"Model saved: {saved_model}")

# Model saved: Model(tag="iris_clf:hrcxybszzsm3khqa")

Step 2: Create a service

 import numpy as np
 import bentoml
 from import NumpyNdarray

 iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

 svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

 @svc.api(input=NumpyNdarray(), output=NumpyNdarray())
 def classify(input_series: np.ndarray) -> np.ndarray:
     result =
     return result
import requests
  headers={"content-type": "application/json"},
  data="[[5.9, 3, 5.1, 1.8]]"

Step 3: build a Bento 🍱

Define a bentofile.yaml

service: "service:svc"  # Same as the argument passed to `bentoml serve`
   owner: bentoml-team
   stage: dev
- "*.py"  # A pattern for matching which files to include in the bento
   packages:  # Additional pip packages required by the service
   - scikit-learn
   - pandas

Step 4: containerize

 bentoml containerize iris_classifier:latest
docker run -it --rm -p 3000:3000 iris_classifier:jclapisz2s6qyhqa serve --production

Step 5: deploy

We can do it in three ways


  1. 🐳 Generate container images from Bento for custom docker deployment
  2. 🚀 bentoctl: Fast model deployment on any cloud platform
  3. 🦄️ Yatai: Model Deployment at scale on Kubernetes

3. Super-charged ML features 

to make your life easier

1. Micro Batching

Dynamically group prediction requests in real-time into batches for model inference

Increases performance of your app


Increases throughput leverages acceleration hardware

✅  Multiple input requests are run in parallel
✅  A proxy (i.e. a load balancer) distributes requests between workers (a worker is a running instance of an API server)
✅  Each worker distributes the requests to the model runners that are in charge of inference
✅  Each runner dynamically groups the requests in batches by finding a tradeoff between latency and throughput
✅  Runners make predictions on each batch
✅  Batch predictions are then split and released as individual responses

- How to enable batching?

        "__call__": {
            "batchable": True,
            "batch_dim": (0, 0),

send your parallel requests

Bentoml will take care of the rest

2. Parallel inference

Inference graph

Customizable control flows

Combine multiple models 

gpt2_generator = (bentoml

distilgpt2_generator = (bentoml

distilbegpt2_medium_generator = (bentoml

bert_base_uncased_classifier = (bentoml
svc = bentoml.Service(

Load runners

Define service

Define inference workflow

@svc.api(input=Text(), output=JSON())
async def classify_generated_texts(original_sentence: str) -> dict:
    generated_sentences = [
        for result in await asyncio.gather(

    results = []
    for sentence in generated_sentences:
        score = (await bert_base_uncased_classifier.async_run(sentence))[0]["score"]
                "generated": sentence,
                "score": score,

    return results

3. Accelerated runtime

service: "service:svc"
- "*.py"
    - torch
    - torchvision
    - torchaudio
    - ""
    distro: debian
    python_version: "3.8.12"
    cuda_version: "11.6.2"

Use GPU and declare it when building the Bento

4. Other cool features

  • API Documentation
  • Data validation
  • gRPC
  • Monitoring


3. Demo


4. Resources

to learn more and become a Bento expert


Interesting reads

How to deploy an ML model in production using BentoML

By Ahmed Besbes

How to deploy an ML model in production using BentoML

  • 2,380