How to deploy and scale an ML model in production using BentoML
- Freelance Senior Data Scientist
- 7+ years of experience in consulting, tech, and startups
- Interests in NLP, MLOps, and AI products
- ML trainer
- Content creator on Medium
Ahmed BESBES
What we'll learn today
This presentation will cover:
- APIs and production machine learning
- BentoML: an Open Source Model Serving framework
- Overview of ML-related features
- Demo
- Resources to go further
Don't hesitate to interrupt and ask questions!
1. APIs and production machine learning
Because there's a life after Jupyter notebooks
What happens when your model is done training?
An after-life
- The infra team needs a minimum of code packaging and dependency management to deploy your model
- DevOps needs to know about resource consumption
- Business and product teams need to stress-test the model (i.e. API needs to scale to multiple concurrent queries)
- Developers need to access your API documentation to know how to consume it
API Requirements
As a data scientist, you want
- Support for multiple frameworks (PyTorch, TensorFlow, scikit-learn)
- Micro batching
- Performance and scalability: parallelization, high throughput
- Ability to use accelerated runtimes (GPUs)
As a thoughtful colleague, you want
- Reproducibility
- Dependency management
- Documentation
- Monitoring
- Debugging
- Data validation
2. BentoML
Open Source Model Serving
- Simplifies model serving and deployment
- Packages everything you need in a distribution format called a bento
- Enables data science agility
- Integrates Pre/Post processing
- Supports many popular ML frameworks
- Automatically generates a Docker container
- Deploys to any cloud infrastructure
- Improves inference performance
A Bento is like a Docker container, but for ML
A Bento is a file archive with all the source code, models, data files, and dependency configurations required for running a user-defined bentoml.Service, packaged into a standardized format.
A Bento is self-contained and deployable everywhere.
How do you save a model with BentoML, create an API, and deploy it?
Step 1: save a model
import bentoml
from sklearn import svm
from sklearn import datasets
# Load training data set
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Train the model
clf = svm.SVC(gamma='scale')
clf.fit(X, y)
# Save model to the BentoML local model store
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(f"Model saved: {saved_model}")
# Model saved: Model(tag="iris_clf:hrcxybszzsm3khqa")
Step 2: create a service
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = iris_clf_runner.predict.run(input_series)
    return result
import requests

requests.post(
    "http://127.0.0.1:3000/classify",
    headers={"content-type": "application/json"},
    data="[[5.9, 3, 5.1, 1.8]]",
).text
# '[2]'
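Hand-writing the JSON body works for a single row but gets error-prone with larger inputs. Here is a minimal sketch, using only the standard library, of building the same request from a Python list. It assumes the service from Step 2 is listening on 127.0.0.1:3000; `build_classify_request` is a hypothetical helper name, not part of BentoML.

```python
import json
from urllib import request


def build_classify_request(rows, url="http://127.0.0.1:3000/classify"):
    """Build (but do not send) a POST request carrying `rows` as a JSON array."""
    body = json.dumps(rows).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


req = build_classify_request([[5.9, 3, 5.1, 1.8]])

# To actually call the running service:
# with request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Serializing with `json.dumps` produces the same body as the hand-written string above, and scales to any number of rows.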
Step 3: build a Bento 🍱
Define a bentofile.yaml
service: "service:svc"  # Same as the argument passed to `bentoml serve`
labels:
  owner: bentoml-team
  stage: dev
include:
  - "*.py"  # A pattern for matching which files to include in the bento
python:
  packages:  # Additional pip packages required by the service
    - scikit-learn
    - pandas
Step 4: containerize
bentoml containerize iris_classifier:latest
docker run -it --rm -p 3000:3000 iris_classifier:jclapisz2s6qyhqa serve --production
Step 5: deploy
3. Super-charged ML features
to make your life easier
1. Micro Batching
Dynamically groups prediction requests in real time into batches for model inference
Increases the performance of your app
Increases throughput and leverages acceleration hardware
✅ Multiple input requests are run in parallel
✅ A proxy (i.e. a load balancer) distributes requests between workers (a worker is a running instance of an API server)
✅ Each worker distributes the requests to the model runners that are in charge of inference
✅ Each runner dynamically groups the requests in batches by finding a tradeoff between latency and throughput
✅ Runners make predictions on each batch
✅ Batch predictions are then split and released as individual responses
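The flow above can be illustrated with a toy batcher. This is a minimal sketch using only asyncio, not BentoML's actual implementation; `MicroBatcher`, `max_batch_size`, and `max_wait_s` are hypothetical names chosen for illustration, and the "model" is a stand-in that doubles its inputs.

```python
import asyncio


class MicroBatcher:
    """Toy dynamic batcher: collects concurrent requests and runs them as one batch."""

    def __init__(self, predict_batch, max_batch_size=4, max_wait_s=0.01):
        self.predict_batch = predict_batch  # function: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded for inspection only

    async def submit(self, item):
        """Called once per request; returns when that request's result is ready."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: take a first request, then fill the batch until full or timed out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            outputs = self.predict_batch(inputs)  # one model call for the whole batch
            self.batch_sizes.append(len(batch))
            # Split the batch result back into individual responses
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    # Eight concurrent "requests", each submitting a single input
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

The latency/throughput tradeoff lives in the two knobs: a larger `max_batch_size` or `max_wait_s` favors throughput, smaller values favor latency. Each caller still gets back only its own result.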
- How to enable batching?
bentoml.pytorch.save_model(
    name="mnist",
    model=model,
    # The keyword argument is `signatures` (plural)
    signatures={
        "__call__": {
            "batchable": True,
            "batch_dim": (0, 0),
        },
    },
)
Send your requests in parallel
BentoML will take care of the rest
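On the client side, sending requests in parallel can be as simple as a thread pool. This is a minimal sketch with the standard library; `send_parallel` is a hypothetical helper, and `send` is whatever function posts one payload to your endpoint (for example a wrapper around `requests.post`).

```python
from concurrent.futures import ThreadPoolExecutor


def send_parallel(payloads, send, max_workers=8):
    """Call `send` on every payload concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))


# Stand-in for a real HTTP call, so the sketch runs without a server
fake_send = lambda payload: payload * 10
responses = send_parallel([1, 2, 3], fake_send)
```

Because the requests overlap in time, the server has a chance to group them into a batch; requests sent strictly one after another would each be processed alone.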
2. Parallel inference
Inference graph
Customizable control flows
Combine multiple models
import asyncio

import bentoml
from bentoml.io import JSON, Text

# Generation settings; these values are illustrative, the slides do not give them
MAX_LENGTH = 128
NUM_RETURN_SEQUENCE = 1

# Load runners
gpt2_generator = (
    bentoml.transformers.get("gpt2-generation:latest").to_runner()
)
distilgpt2_generator = (
    bentoml.transformers.get("distilgpt2-generation:latest").to_runner()
)
distilbegpt2_medium_generator = (
    bentoml.transformers.get("gpt2-medium-generation:latest").to_runner()
)
bert_base_uncased_classifier = (
    bentoml.transformers.get("bert-base-uncased-classification:latest").to_runner()
)

# Define service
svc = bentoml.Service(
    "inference_graph",
    runners=[
        gpt2_generator,
        distilgpt2_generator,
        distilbegpt2_medium_generator,
        bert_base_uncased_classifier,
    ],
)

# Define inference workflow
@svc.api(input=Text(), output=JSON())
async def classify_generated_texts(original_sentence: str) -> list:
    # Run the three generators concurrently and keep the first sequence of each
    generated_sentences = [
        result[0]["generated_text"]
        for result in await asyncio.gather(
            gpt2_generator.async_run(
                original_sentence,
                max_length=MAX_LENGTH,
                num_return_sequences=NUM_RETURN_SEQUENCE,
            ),
            distilgpt2_generator.async_run(
                original_sentence,
                max_length=MAX_LENGTH,
                num_return_sequences=NUM_RETURN_SEQUENCE,
            ),
            distilbegpt2_medium_generator.async_run(
                original_sentence,
                max_length=MAX_LENGTH,
                num_return_sequences=NUM_RETURN_SEQUENCE,
            ),
        )
    ]

    # Score each generated sentence with the classifier
    results = []
    for sentence in generated_sentences:
        score = (await bert_base_uncased_classifier.async_run(sentence))[0]["score"]
        results.append(
            {
                "generated": sentence,
                "score": score,
            }
        )

    return results
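The fan-out pattern in this workflow can be tried without any models. This is a minimal sketch with only asyncio; `fake_generator` and `fake_classifier` are hypothetical stand-ins for the runners' `async_run` calls, and the delays and scores are made up.

```python
import asyncio


async def fake_generator(name, prompt, delay_s):
    """Stand-in for a generator runner: sleeps, then returns a 'generated' sentence."""
    await asyncio.sleep(delay_s)
    return f"{prompt} [{name}]"


async def fake_classifier(sentence):
    """Stand-in for the classifier runner: scores a sentence by its length."""
    await asyncio.sleep(0.01)
    return {"generated": sentence, "score": len(sentence)}


async def pipeline(prompt):
    # Fan out: the three generators run concurrently, so this stage takes
    # about as long as the slowest one rather than the sum of all three.
    sentences = await asyncio.gather(
        fake_generator("a", prompt, 0.05),
        fake_generator("b", prompt, 0.05),
        fake_generator("c", prompt, 0.05),
    )
    # Score the sentences concurrently as well, instead of one by one
    return await asyncio.gather(*(fake_classifier(s) for s in sentences))


scored = asyncio.run(pipeline("hello"))
```

`asyncio.gather` returns results in the order the awaitables were passed, so each score stays aligned with its sentence. Whether gathering the classifier calls is worthwhile in a real service depends on how the classifier runner batches requests.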
3. Accelerated runtime
service: "service:svc"
include:
  - "*.py"
python:
  packages:
    - torch
    - torchvision
    - torchaudio
  extra_index_url:
    - "https://download.pytorch.org/whl/cu113"
docker:
  distro: debian
  python_version: "3.8.12"
  cuda_version: "11.6.2"
To use a GPU, declare the CUDA version in the bentofile when building the Bento
4. Other cool features
- API Documentation
- Data validation
- gRPC
- Monitoring
https://docs.bentoml.org/en/latest/guides/index.html
4. Demo
5. Resources
to learn more and become a Bento expert
Interesting reads
- https://towardsdatascience.com/comprehensive-guide-to-deploying-any-ml-model-as-apis-with-python-and-aws-lambda-b441d257f1ec
- https://towardsdatascience.com/bentoml-create-an-ml-powered-prediction-service-in-minutes-23d135d6ca76
- https://neptune.ai/blog/ml-model-serving-best-tools
- https://www.reddit.com/r/mlops/comments/w4vl6r/hello_from_bentoml/
- https://docs.bentoml.org/en/latest/concepts/service.html#runners
- https://github.com/bentoml/BentoML/tree/main/examples
- https://modelserving.com/blog/breaking-up-with-flask-amp-fastapi-why-ml-model-serving-requires-a-specialized-framework
How to deploy an ML model in production using BentoML
By Ahmed Besbes