This model service is intended to be used as the basis for a chat application. It can hold arbitrarily long conversations with users, retaining a history of the conversation until it reaches the model's maximum context length. At that point, the service removes the earliest portions of the conversation from its memory.
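The exact truncation logic lives in the model service code, but the idea is a simple sliding window over the chat history. The sketch below is illustrative only (the function name, `count_tokens` callable, and `reserve_for_reply` parameter are placeholders, not the service's actual API):

```python
# Illustrative sketch of sliding-window history truncation; not the service's actual code.
def trim_history(messages, count_tokens, max_context, reserve_for_reply=256):
    """Drop the oldest messages until the prompt fits in the context window.

    messages: list of {"role": ..., "content": ...} dicts, oldest first
    count_tokens: callable that returns the token count of a string
    """
    budget = max_context - reserve_for_reply
    trimmed = list(messages)
    while trimmed and sum(count_tokens(m["content"]) for m in trimmed) > budget:
        trimmed.pop(0)  # remove the earliest turn first
    return trimmed
```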
To use this model service, please follow the steps below:
This example assumes that the developer has already downloaded a copy of the model they would like to use onto their host machine and placed it in the /models
directory of this repo.
The two models that we have tested and recommend for this example are Llama2 and Mistral. Please download any of the GGUF variants you'd like to use.
- Llama2 - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main
- Mistral - https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main
For a full list of supported model variants, please see the "Supported models" section of the llama.cpp repository.
cd models
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf
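If you'd like to sanity-check the downloaded file before building the image, a quick local smoke test with llama-cpp-python works well (this assumes you have run `pip install llama-cpp-python`; the model path matches the download above):

```python
# Optional smoke test of the downloaded GGUF file.
# Assumes llama-cpp-python is installed: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q5_K_S.gguf", n_ctx=2048)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```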
To build the image we use the build.sh
script, which temporarily copies the desired model and shared code into the build directory. This prevents large, unused model files in the repo from being loaded into the podman build context, which can cause a significant slowdown.
cd chatbot/model_services/builds
sh build.sh llama-2-7b-chat.Q5_K_S.gguf arm locallm
The user should provide the model name, the architecture, and the image name they want to use for the build.
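For readers who prefer Python, the sketch below is a rough equivalent of the staging-then-build flow described above; the real build.sh does this in shell, and the function name and paths here are placeholders only:

```python
# Illustration only: stage the chosen model into the build context, build, then clean up.
import shutil
import subprocess
from pathlib import Path

def build_image(model_file: str, arch: str, image_name: str,
                models_dir: Path = Path("../../models")) -> None:
    build_dir = Path(arch)                      # e.g. the arm or x86 build directory
    staged_model = build_dir / model_file
    shutil.copy(models_dir / model_file, staged_model)  # stage only the chosen model
    try:
        # build with only the staged files in the podman build context
        subprocess.run(["podman", "build", "-t", image_name, str(build_dir)], check=True)
    finally:
        staged_model.unlink()                   # keep large model files out of the build context
```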
Once the model service image is built, it can be run with the following:
podman run -it -p 7860:7860 locallm
You can now interact with the service by navigating to 0.0.0.0:7860
in your browser.
You can also use the ask.py
script under /ai_applications
to run the chat application in a terminal. If the --prompt
argument is left blank, it defaults to "Hello".
cd chatbot/ai_applications
python ask.py --prompt <YOUR-PROMPT>
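If you'd rather script your own client, the service exposed on port 7860 can also be called programmatically. The snippet below is a minimal sketch assuming the service is a Gradio app; the `api_name` value is an assumption and may differ in your build:

```python
# Minimal client sketch, assuming the service is a Gradio app on port 7860.
# The api_name is a guess; check the running app's API docs if the call fails.
from gradio_client import Client

client = Client("http://0.0.0.0:7860")
reply = client.predict("What is a GGUF file?", api_name="/chat")
print(reply)
```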
Now that we've developed an application locally that leverages an LLM, we'll want to share it with a wider audience. Let's get it off our machine and run it on OpenShift.
We'll need to rebuild the image for the x86 architecture for most use cases outside of our Mac. Since this is an AI workload, we will also want to take advantage of NVIDIA GPUs available outside our local machine. Therefore, this image's base image contains CUDA and builds llama.cpp specifically for a CUDA environment.
cd chatbot/model_services/builds
sh build.sh llama-2-7b-chat.Q5_K_S.gguf x86 locallm
Before building the image, if you'd prefer NOT to use CUDA and GPU acceleration, you can change line 6 of builds/x86/Containerfile
and set -DLLAMA_CUBLAS
to off:
ENV CMAKE_ARGS="-DLLAMA_CUBLAS=off"
Once you log in to quay.io, you can push your newly built version of this LLM application to your repository for others to use.
podman login quay.io
podman push localhost/locallm quay.io/<YOUR-QUAY_REPO>/locallm
Now that your image lives in a remote repository, we can deploy it. Go to your OpenShift developer dashboard and select "+Add" to use the OpenShift UI to deploy the application.
Select "Container images"
Then fill out the form on the Deploy page with your quay.io image name and make sure to set the "Target port" to 7860.
Hit "Create" at the bottom and watch your application start.
Once the pods are up and the application is working, navigate to the "Routes" section and click on the link created for you to interact with your app.