Frameworks for serving machine learning models on Kubernetes
Having worked in the ML field for several years now, I keep seeing the same problems when it comes to professionalising ML workloads: data and model lineage, drift detection, monitoring gaps, CI/CD struggles, security risks and team silos. Model deployment, however, is one of the things teams seem to struggle with the most, especially at scale. Quite a few frameworks are available these days that can make serving machine learning models on Kubernetes a lot easier. In this post we'll look into three of the most popular ones: Seldon, KServe and BentoML.
TL;DR: Use Seldon for large-scale and advanced features, KServe for serverless ML model serving and BentoML for startups, small teams, and fast-moving ML projects.
One of the reasons model serving is such a difficult topic is the variety of options available when it comes to model deployment. There is no one solution that works best at all times: as usual the best approach to take depends on the situation your team is in. Generally speaking however, containerised deployments are the standard and recommended way to host any model.
This still leaves us with a lot of options. When we have a model that we want to run at scale, hosting a single Docker container won't suffice. More scalable options are a Kubernetes cluster, or a container hosting solution provided by a cloud platform (AWS Fargate, Azure Container Instances and Cloud Run on GCP being the major players here).
I want to focus on Kubernetes, because it provides more customisation options and prevents vendor lock-in. There are some interesting frameworks that can be utilised to deploy models when going the Kubernetes route. These frameworks all try to make the lives of ML developers easier by providing an abstraction layer over the API that you would otherwise have to write yourself. Below is an overview of the three most promising options, with their strengths and weaknesses.
Seldon
Seldon is a model serving framework that is highly customisable and has an advanced set of features. This includes A/B testing, canary deployments, explainability and drift detection. There are possibilities to create so called inference graphs, where you can implement pre- and post processing in your inference process, or combine multiple models. It integrates with many other tools, including monitoring tools (Prometheus, Grafana) and most popular orchestrators (Airflow, Dagster, Kubeflow).
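To give an idea of what an inference graph looks like: in a SeldonDeployment, the graph section chains components by nesting children. A hedged sketch (component names here are hypothetical, and each graph node name must match a container name in the pod spec):

```yaml
graph:
  name: preprocessor      # a transformer container, runs before the model
  type: TRANSFORMER
  children:
    - name: iris-model    # the model container receiving transformed input
      type: MODEL
      children: []
```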
The framework has functionality that can easily turn your model artifacts and prediction code into a containerised API that can be used to serve your model. To do so, all that is required is to wrap your model initialisation and prediction code in a class.
import joblib

class Iris:
    def __init__(self):
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        return self.model.predict(X)
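Before building the image, the wrapper can be sanity-checked locally. The snippet below is a sketch: it assumes scikit-learn is installed and trains a throwaway model in place of a real training pipeline.

```python
# Train a small model, save it the way the wrapper expects, and exercise
# the wrapper class locally before containerising it.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Throwaway model standing in for the real training pipeline
X, y = load_iris(return_X_y=True)
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), "model.joblib")

class Iris:
    def __init__(self):
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        return self.model.predict(X)

# Call the wrapper the same way the generated API would
preds = Iris().predict(X[:3])
print(preds)
```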
A predefined Dockerfile can be used to build a container that includes a functional API without having to write any additional code. For deployments you use a Custom Resource Definition (CRD): SeldonDeployment. CRDs are an extension of the default Kubernetes resource types and can be deployed easily using kubectl. In the CRD you specify the Docker image and any advanced features you want to include. An example can be found below.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris
  namespace: dev
spec:
  name: iris-spec
  predictors:
    - componentSpecs:
        - spec:
            containers:
              - name: iris-container
                image: iris-image:v0.1
                imagePullPolicy: IfNotPresent
      graph:
        name: iris-container  # must match the container name above
      name: default
      replicas: 1
This setup is more complex than the other frameworks mentioned in this post, but it does provide a very robust and scalable solution with advanced features and integrations. Whether this is required largely depends on the use case. There is functionality that allows you to create model servers using only model artifacts (so no need to even write the prediction part), but the Python version it pins (3.8) is very outdated, so I would not recommend this.
KServe
KServe was formerly known as KFServing and was part of Kubeflow before the projects split in 2021. The two still integrate well. KServe's big perk is serverless model inference with autoscaling: using Knative, it scales with the number of incoming requests and can scale down to zero if necessary.
KServe uses a CRD named InferenceService, similar to the approach Seldon takes. For the most common model frameworks, prebuilt model servers are available, which means only a model artifact has to be specified in order to run the model. As an example, this is what the CRD for our Iris sklearn model looks like:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      runtime: kserve-sklearnserver
      storageUri: "gs://example_bucket/model.joblib"
Here the storageUri points to a joblib model artifact on external cloud storage (a GCS bucket in this example). You can also set a target number of concurrent requests per replica and allow the service to scale down to zero when no requests are coming in; the autoscaling is based on the number of requests to your service.
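As a sketch of how that tuning looks, the predictor spec accepts fields like minReplicas, scaleMetric and scaleTarget (field names as in recent KServe releases; check the API reference for your version):

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "iris"
spec:
  predictor:
    minReplicas: 0          # allow scale-to-zero when idle
    maxReplicas: 5
    scaleMetric: concurrency
    scaleTarget: 10         # target concurrent requests per replica
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://example_bucket/model.joblib"
```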
If you have a custom model, you can create your own custom model server. This is less straightforward than the Seldon model server, but an example model.py is available in the KServe documentation. The resulting image then needs to be built before it can be deployed using the CRD.
If you would like more advanced functionality such as A/B testing and canary deployments, you would have to arrange this through the Istio service mesh, as it's not natively supported.
BentoML
The name of this framework comes from a typical Japanese lunch box: a bento box. It's a compartmentalised box that holds multiple types of food in a single lunch box: rice, fish, beans etc. The idea of BentoML is that you package your model in a container-like structure, called a Bento. This Bento contains everything required to serve the model.
To specify how to serve the model, decorate a service class with @bentoml.service; the class should contain at least one model endpoint, created by decorating a method with @bentoml.api. For multi-model serving, multiple endpoints can be defined.
import bentoml
import joblib
import numpy as np

@bentoml.service(
    resources={"cpu": "2"},
    traffic={"timeout": 30},
)
class Iris:
    def __init__(self) -> None:
        # Load the model artifact once at service start-up
        self.model = joblib.load("model.joblib")

    @bentoml.api
    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict(X)
One of the nice features of BentoML is the possibility to run your service locally before containerising it. It can be served locally using:
bentoml serve service:Iris
Packaging this into a Bento can be done by including a yaml configuration, describing the service class, what files to include and which packages to install. You could also alter the resulting Docker container by providing settings here, or define runtime options such as the timeout.
service: "service:Iris"
include:
  - "model.joblib"
  - "*.py"
python:
  requirements_txt: "requirements.txt"
Creating the Bento and containerising it can now be done using the CLI:
# Build bentoml
bentoml build
# List Bentos
bentoml list
# Create Docker container
bentoml containerize iris:latest
This Docker container can be served in a Kubernetes cluster using a Kubernetes Deployment for the container and a Kubernetes Service to do the load balancing. This does require some additional setting up. As mentioned earlier, an alternative for this would be using a container hosting service as offered by one of the cloud platforms.
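A minimal sketch of that additional setup, assuming the image built above is pushed to a registry as iris:latest (names are hypothetical; BentoML serves on port 3000 by default):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris
spec:
  replicas: 2
  selector:
    matchLabels:
      app: iris
  template:
    metadata:
      labels:
        app: iris
    spec:
      containers:
        - name: iris
          image: iris:latest
          ports:
            - containerPort: 3000   # BentoML's default serving port
---
apiVersion: v1
kind: Service
metadata:
  name: iris
spec:
  selector:
    app: iris
  ports:
    - port: 80
      targetPort: 3000
```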
The main perk of BentoML is the speed and ease of the development process: a model can be deployed quickly compared to the other two frameworks. It is however less suitable for large production workloads, and the end result of BentoML is a container, not a full-fledged Kubernetes deployment. There is a paid cloud platform solution available for fast scalable deployments, if you are lacking in-house Kubernetes expertise.
Guide to choosing the right tool
All these frameworks are valid options. Which framework to choose mostly depends on your team’s infrastructure, scalability needs, and deployment complexity.
Seldon: Best for Large-Scale, Enterprise Deployments
Choose Seldon if you need:
- Complex ML model serving with support for multiple models and pre/post-processing.
- Built-in A/B testing and canary deployments.
- Deep integration with Kubernetes, including Istio-based security and Prometheus/Grafana monitoring.
- High scalability, especially for enterprise AI platforms that require robust governance and control.
Reasons not to pick Seldon:
- The setup is quite complex and requires some Kubernetes expertise.
- The documentation is not always clear; some of the more advanced features lack good examples.
- The resource overhead is relatively high and can be heavy for small-scale deployments.
Seldon is ideal for enterprises and organisations running large-scale ML pipelines that need advanced deployment strategies and monitoring.
It's overkill for simple model serving or if you don't have the required Kubernetes expertise.
KServe: Best for Scalable, Kubernetes-Native Model Serving
Choose KServe if you need:
- A Kubernetes-native model serving solution that supports multiple ML frameworks (TensorFlow, PyTorch, XGBoost, ONNX, Sklearn, etc) without the need to even build a container.
- Dynamic autoscaling with Knative, running serverless by scaling with the number of requests (can scale down to zero).
- A flexible and production-ready deployment system, especially if you’re using Kubeflow.
Reasons not to pick KServe:
- The setup is quite complex and, again, requires some Kubernetes expertise.
- Limited flexibility for custom pre/post-processing compared to Seldon.
KServe is a great fit for teams already working with Kubernetes and Kubeflow that are looking for a flexible and scalable (serverless) ML model serving solution, but don't need advanced model inference pipelines.
KServe again is overkill for simple model serving or if you don’t have the required Kubernetes expertise. It also misses some features that Seldon does have.
BentoML: Best for Fast and Simple Model Deployment
Choose BentoML if you need:
- A lightweight and easy-to-use framework for packaging and serving ML models.
- Minimal setup and fast iterations, without deep Kubernetes expertise.
- Broad framework support, including TensorFlow, PyTorch, Scikit-learn, and more.
- Easier local testing and containerization, with an emphasis on developer-friendly workflows.
Reasons not to pick BentoML:
- Not suitable for massive-scale model serving, as it’s not optimised for extreme workloads.
- There is no native Kubernetes support like with Seldon and KServe.
- There is no built-in monitoring and scaling; these require external tools.
BentoML is best suited for startups, small teams, and fast-moving ML projects that prioritise simplicity, quick deployments, and easy integration over complex Kubernetes-based infrastructures.
It is less suitable for large-scale model serving and lacks the native Kubernetes support the other two frameworks have.
Further reading
Hopefully this overview can provide you with enough information to pick the right framework for your use case. If you would like to get some hands-on experience, each framework has a quick start guide available in its documentation.
Want to learn more about the wonderful world of MLOps? Check out our other blogpost on how to create an MLOps platform.