Proof-of-concept scenarios using Red Hat OpenShift AI to serve variations of the Stable Diffusion text-to-image model. An exercise undertaken as a reason to go hands-on with model serving via KServe & TorchServe, in order to get a better understanding of their makeup, function & configuration alongside Red Hat OpenShift AI.
It's fairly trivial to find notebooks in the wild showcasing HuggingFace diffusers pipelines that handle everything from model load through to image generation; however, I've found fewer examples showcasing an "Ensemble (or Mixture) of Experts" type model pipeline served with trained weights, in particular via KServe & Red Hat OpenShift AI.
Project done for educational purposes. Content herein not intended for production use.
This repo includes several directories, each representing a scenario I assembled to understand various aspects of model serving with some common tooling alongside OpenShift AI.
- The stable-diffusion-pipelines directory showcases a simplistic approach, using a few lines of code inside a notebook to load the Stable Diffusion model and generate an image from a text prompt (see the first sketch after this list). If you're simply interested in seeing Stable Diffusion at work, this will cover it.
- The stable-diffusion-2 directory introduces model serving to the mix. If you want to be able to make some sense of what goes on in the next few scenarios below, you'll want to make a stop through here. Partly from a desire to learn a few items common to the problem space, and partly because it was the first solution I came across for working with a pipeline-type model with multiple components (ok, mostly the latter), I decided to bundle the Stable Diffusion model as a .mar archive to be served with TorchServe via KServe (see the second sketch after this list). The scenario includes all the files and notebooks needed for downloading the SD2 model, creating the model archive, setting up the model in OpenShift AI & hitting the inference endpoint.
- The previous example focused on pulling a 'lighter' version of Stable Diffusion (SD2) & setting up model serving just to see & understand the flow of things while minimizing time lost between iterations as best I could (generating, downloading, uploading, and loading 11GB archives isn't speedy). With that pattern established, I turned my attention to figuring out how to get things working with larger, more resource-intensive setups, such as using the Stable Diffusion XL base & refiner models together. The stable-diffusion-xlr directory supplies archival & query notebooks, as before, as well as the custom handler class needed for the scenario.
- After walking through the previous exercises to understand SD model serving on OpenShift AI with TorchServe & KServe, and then working toward the most modern, capable version of the model, I next wanted to understand the more likely scenario wherein a user wishes to tune their model using their own images to better accommodate a specific line of queries. The stable-diffusion-xl-train directory includes notebooks & images for such training, the model archival, & querying the inference endpoint as in the previous scenarios.
- If you're after a look at the latest & greatest that StabilityAI has made available, the stable-diffusion-3 directory offers notebooks emulating the previous examples for serving the base SD3 model, or for just using the diffusers pipeline with provided tuning weights. The handler class for tuned-weights serving has been included but, as before, will require updates should you alter the pipeline. If you decide to try and serve the base model, know what you're in for - the .mar archive file is 12GB.
  - When attempting training on a GPU with 48GB of memory available, accelerate still maxed out and failed. Offloading also failed, due to a different cause.
  - SD3 offers some suggestions for memory optimization; however, I haven't had a chance to pursue updating the dreambooth training scripts to accommodate the change(s), particularly around dropping the T5-XXL encoder.
- If you're interested in seeing a scenario similar to the XL-refiner scenario above, but without the added noise of figuring out TorchServe & model archives, I have a recommendation. For a neatly-defined, containerized solution that offers the Stable Diffusion XL base model with an optional refiner via the v1 KServe Inference Protocol, have a look at the igm-on-openshift repository put together by fellow Red Hatter Guillaume Moutier.
- If you're less interested in the investigative side of things and instead seek a more robust, performant setup, then I'd recommend taking a look at the Stable Diffusion distributed workload example, which showcases fine-tuning Stable Diffusion with Ray and DreamBooth on OpenShift AI. This will introduce some complexity around Ray clustering and shared file storage (OpenShift Data Foundation); however, if you want a look at something more closely approximating proper scalability, this is a good scenario to follow.
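As a point of reference for the first scenario above, a minimal text-to-image flow with the HuggingFace diffusers library looks roughly like the following. This is a sketch rather than a copy of the repo's notebooks; the model ID and prompt are illustrative.

```python
# Minimal diffusers text-to-image sketch (illustrative; not copied from the notebooks).
import torch
from diffusers import StableDiffusionPipeline

# Pull the Stable Diffusion 2.1 weights from the HuggingFace Hub
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # assumes a GPU-backed workbench

# Generate and save a single image from a text prompt
image = pipe("a lighthouse on a rocky coast at dawn, oil painting").images[0]
image.save("lighthouse.png")
```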
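And for the second scenario, the gist of packaging a downloaded pipeline into a .mar archive is a single torch-model-archiver invocation along these lines. The file names below are placeholders rather than the exact names used in the scenario notebooks.

```python
# Rough sketch of building a TorchServe model archive (.mar) for a bundled
# diffusers pipeline. File names are placeholders.
import subprocess

subprocess.run(
    [
        "torch-model-archiver",
        "--model-name", "stable-diffusion",
        "--version", "1.0",
        # Custom handler implementing initialize/preprocess/inference/postprocess
        "--handler", "stable_diffusion_handler.py",
        # Zipped pipeline weights to be unpacked by the handler at load time
        "--extra-files", "sd-model.zip",
        # Python dependencies; TorchServe installs these only when
        # install_py_dep_per_model=true is set in config.properties
        "--requirements-file", "requirements.txt",
        "--export-path", "model-store",
    ],
    check=True,
)
```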
- OpenShift version 4.14.35
- OpenShift AI version 2.13.0
- Node Feature Discovery Operator 4.14.0 (enables GPU accelerators)
- NVIDIA GPU Operator 24.6.2 (enables GPU accelerators)
- Single-model serving cluster configuration
- Cluster backed by Red Hat OpenShift Dedicated
- S3 bucket utilized for model storage, though Minio can easily do the same (see the sketch after this list)
- Additional worker Machine Pool added consisting of 2x EC2 g5.8xlarge instances
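For completeness, getting a generated archive into that bucket can be as simple as the boto3 sketch below. Credentials, bucket, and key names are placeholders, and the endpoint_url is only needed when pointing at Minio rather than AWS S3.

```python
# Sketch of uploading a model archive to the S3 bucket backing the
# OpenShift AI data connection. All names/credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    # endpoint_url="https://minio.example.com",  # uncomment when using Minio
)

s3.upload_file(
    Filename="model-store/stable-diffusion.mar",
    Bucket="my-model-bucket",
    Key="stable-diffusion/model-store/stable-diffusion.mar",
)
```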
The AI/ML Platform in use.
Red Hat OpenShift AI is an artificial intelligence (AI) platform that provides tools to rapidly develop, train, serve, and monitor machine learning models on-site, in the public cloud, or at the edge. It's a flexible, scalable AI and machine learning (ML) platform that enables enterprises to create and deliver AI-enabled applications at scale across hybrid cloud environments.
Red Hat OpenShift AI helps you build out an enterprise-grade AI and MLOps platform to create and deliver GenAI and predictive models by providing supported AI tooling on top of OpenShift. It's based on Red Hat OpenShift, a container-based application platform that efficiently scales to handle the workload demands of AI operations and models. You can run your AI workloads across the hybrid cloud, including edge and disconnected environments.
Additionally, Red Hat actively partners with leading AI vendors, such as Anaconda, Intel, IBM, and NVIDIA to make their tools available for data scientists as OpenShift AI components. You can optionally activate these components to expand the OpenShift AI capabilities. Depending on the component that you want to activate, you might need to install an operator, use the Red Hat Ecosystem, or add a workbench image.
The following figure depicts the general architecture of an OpenShift AI deployment, including the most important concepts and components. The DataScienceCluster components for this setup were configured as follows:
```yaml
spec:
  components:
    codeflare:
      managementState: Managed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    trustyai:
      managementState: Removed
    ray:
      managementState: Managed
    kueue:
      managementState: Managed
    workbenches:
      managementState: Managed
    dashboard:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    trainingoperator:
      managementState: Removed
```
The model-serving mechanism in use.
KServe is a standard, cloud-agnostic Model Inference Platform for serving predictive and generative AI models on Kubernetes, built for highly scalable use cases. It:

- Provides a performant, standardized inference protocol across ML frameworks, including the OpenAI specification for generative models.
- Supports modern serverless inference workloads with request-based autoscaling, including scale-to-zero on CPU and GPU.
- Provides high scalability, density packing and intelligent routing using ModelMesh.
- Offers simple and pluggable production serving for inference, pre/post processing, monitoring and explainability, as well as advanced deployments for canary rollout, pipeline, and ensembles with InferenceGraph.
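To ground the "standardized inference protocol" point, this is roughly what querying one of the scenarios' endpoints over the v1 inference protocol looks like from Python. The URL, token, and response shape are assumptions (it presumes the custom handler returns base64-encoded images), so adjust to match the deployed InferenceService.

```python
# Sketch of hitting a KServe v1 inference endpoint exposed by OpenShift AI.
# Host, model name, token, and payload fields are placeholders.
import base64
import requests

INFER_URL = "https://sd-demo.apps.example.com/v1/models/stable-diffusion:predict"
TOKEN = "sha256~replace-me"

payload = {"instances": [{"prompt": "a watercolor painting of an orca"}]}
headers = {"Authorization": f"Bearer {TOKEN}"}

# verify=False only because this setup uses a self-signed ingress certificate
resp = requests.post(INFER_URL, headers=headers, json=payload, verify=False, timeout=600)
resp.raise_for_status()

# Assuming the handler returns one base64-encoded PNG per instance
image_b64 = resp.json()["predictions"][0]
with open("generated.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```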
The serving runtime in use.
PyTorch is an open-source machine learning framework, originally created by Facebook, that has become popular among ML researchers and data scientists for its ease of use and “Pythonic” interface. However, deploying and managing models in production is often the most difficult part of the machine learning process, requiring users to write prediction APIs and scale them.
TorchServe makes it easy to deploy PyTorch models at scale in production environments. It delivers lightweight serving with low latency, so you can deploy your models for high performance inference. It provides default handlers for the most common applications such as object detection and text classification, so you don’t have to write custom code to deploy your models. With powerful TorchServe features including multi-model serving, model versioning for A/B testing, metrics for monitoring, and RESTful endpoints for application integration, you can take your models from research to production quickly.
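Since the scenarios above lean on custom handler classes, here's a heavily condensed sketch of what a diffusers-backed TorchServe handler can look like. It assumes the pipeline weights were bundled into the archive and unpacked under the model directory; the actual handlers in the scenario directories differ in their details.

```python
# Condensed sketch of a custom TorchServe handler wrapping an SDXL pipeline.
# Payload shape and pipeline options are assumptions, not this repo's exact code.
import base64
import json
from io import BytesIO

import torch
from diffusers import StableDiffusionXLPipeline
from ts.torch_handler.base_handler import BaseHandler


class DiffusersHandler(BaseHandler):
    def initialize(self, context):
        # TorchServe unpacks the archive's extra files into model_dir
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        dtype = torch.float16 if self.device.type == "cuda" else torch.float32
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            model_dir, torch_dtype=dtype
        ).to(self.device)
        self.initialized = True

    def preprocess(self, requests):
        # Pull a text prompt out of each request body
        prompts = []
        for req in requests:
            payload = req.get("body") or req.get("data") or {}
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            prompts.append(payload.get("prompt", ""))
        return prompts

    def inference(self, prompts):
        # One generated image per prompt
        return self.pipe(prompt=prompts, num_inference_steps=30).images

    def postprocess(self, images):
        # Return base64-encoded PNGs, one entry per request
        results = []
        for img in images:
            buf = BytesIO()
            img.save(buf, format="PNG")
            results.append(base64.b64encode(buf.getvalue()).decode("utf-8"))
        return results
```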
The model in use.
The Stable Diffusion models from Stability AI are deep-learning, text-to-image generative models originating from CompVis's latent diffusion project out of Germany. Though often loosely referred to as a "model", Stable Diffusion is more an "ensemble of experts" model pipeline composed of various models that, when combined, achieve the target goal. The number and nature of the individual models making up the composite vary from release to release.
Stable Diffusion, being a Latent Diffusion Model (LDM), is trained by using a Markov chain to gradually add noise to the training images. The model is then trained to reverse this process, starting with a noisy image and gradually removing the noise until it recovers the original image. More specifically, the training process can be described as follows:
- Forward diffusion process: Given a real image X_0, a sequence of latent variables X_1, ..., X_T is generated by gradually adding Gaussian noise to the image, according to a pre-determined "noise schedule".
- Reverse diffusion process: Starting from a Gaussian noise sample X_T, the model learns to predict the noise added at each step, in order to reverse the diffusion process and obtain a reconstruction of the original image X_0.
The model is trained to minimize the difference between the predicted noise and the actual noise added at each step. This is typically done using a mean squared error (MSE) loss function.
Once the model is trained, it can be used to generate new images by simply running the reverse diffusion process starting from a random noise sample. The model gradually removes the noise from the sample, guided by the learned noise distribution, until it generates a final image.
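As a rough illustration of that training procedure in code, the sketch below shows a single training step using diffusers/PyTorch pieces. It is only a sketch: real fine-tuning scripts (such as the dreambooth scripts mentioned earlier) add conditioning details, gradient accumulation, EMA, and more.

```python
# Sketch of a single latent-diffusion training step: add noise (forward process),
# predict it with the U-Net, and minimize the MSE between predicted and actual noise.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

def training_step(unet, vae, text_encoder, images, input_ids, optimizer):
    # 1. Compress images into the latent space with the VAE encoder
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

    # 2. Forward diffusion: add Gaussian noise at a randomly chosen timestep t
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # 3. Reverse diffusion (training): the U-Net predicts the added noise,
    #    guided by the text embedding from the prompt
    text_embeds = text_encoder(input_ids)[0]
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeds).sample

    # 4. MSE loss between predicted and actual noise, then an optimizer step
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```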
There are multiple versions of Stable Diffusion currently available, including 1.0, 2.0, XL, TurboXL, & 3.0. Each version builds on the experience of previous iterations, and thus components & methods can vary between them. Generally, the makeup of the Stable Diffusion model pipeline (see the model repository) is as follows:
- Example: Stable Diffusion XL Base 1.0
```
├── scheduler/
├── text_encoder/
├── text_encoder_2/
├── tokenizer/
├── tokenizer_2/
├── unet/
├── vae/
├── vae_1_0/
├── vae_decoder/
├── vae_encoder/
├── model_index.json
```
The following sections break down the purpose & general methodology for the various components.
To compress the image data, a variational autoencoder (VAE) is first trained on a dataset of images. The encoder part of the VAE takes an image as input and outputs a lower-dimensional latent representation of the image. This latent representation is then used as input to the U-Net. Once the model is trained, the encoder is used to encode images into latent representations, and the decoder is used to decode latent representations back into images.
The decoder part of the VAE transforms the latent representation back into an image. Denoised latents generated by the reverse diffusion process are converted into final images using the VAE decoder.
The U-Net predicts a denoised image representation of the noisy latents. The noisy latents act as input to the U-Net, and its output is the noise present in those latents; the actual latents are then obtained by subtracting the predicted noise from the noisy latents.
The U-Net is a conditional model: it takes in the noisy latents (x) along with the timestep (t) and our text embedding as guidance, and predicts the noise.
The text encoder is responsible for transforming the provided input prompt into an embedding space that serves as input for the U-Net neural network; these embeddings act as guidance for the noisy latents when training the U-Net for denoising purposes. The tokenizer handles the conversion of the input into collections of characters (tokens) which have semantic meaning for the model.
Stable Diffusion uses a pretrained Text Encoder known as CLIP.
The scheduler works in tandem with the U-Net; it dictates the denoising process while balancing image quality against speed of inference.
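To make the mapping concrete, the sketch below loads each component from its subfolder and assembles them by hand with diffusers (class and repository names taken from the SDXL base model). In practice, StableDiffusionXLPipeline.from_pretrained() on the repository root does this automatically, driven by the model_index.json shown below.

```python
# Sketch: assembling the SDXL pipeline from its individual components.
from diffusers import (
    AutoencoderKL,
    EulerDiscreteScheduler,
    StableDiffusionXLPipeline,
    UNet2DConditionModel,
)
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

model_id = "stabilityai/stable-diffusion-xl-base-1.0"

pipe = StableDiffusionXLPipeline(
    scheduler=EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler"),
    tokenizer=CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer"),
    tokenizer_2=CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer_2"),
    text_encoder=CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder"),
    text_encoder_2=CLIPTextModelWithProjection.from_pretrained(model_id, subfolder="text_encoder_2"),
    unet=UNet2DConditionModel.from_pretrained(model_id, subfolder="unet"),
    vae=AutoencoderKL.from_pretrained(model_id, subfolder="vae"),
)

# The assembled pipeline behaves the same as one loaded via from_pretrained()
pipe = pipe.to("cuda")  # assumes a GPU
image = pipe("a photo of an astronaut riding a horse").images[0]
```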
Content of the model_index.json
is somewhat indicative of the pipeline makeup:
```json
{
  "_class_name": "StableDiffusionXLPipeline",
  "_diffusers_version": "0.19.0.dev0",
  "force_zeros_for_empty_prompt": true,
  "add_watermarker": null,
  "scheduler": [
    "diffusers",
    "EulerDiscreteScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "text_encoder_2": [
    "transformers",
    "CLIPTextModelWithProjection"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "tokenizer_2": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```