Local Dev Service #505
Replies: 5 comments 5 replies
- I initially had a hard time accepting that we should not download and even start Ollama on our own, but I think I have come to terms with it by thinking of …
- I think this is probably a good idea regardless, as it's highly unlikely you would want to use a different model in production.
- I've started looking into this, and one of the things I am unsure of is whether we should handle OpenAI compatibility from the get-go.
- #557 is pretty much the implementation of what is described here (although not yet complete).
- There is another user experience issue when trying to move this to OpenAI: how would the user opt to use the local inference?
Local Dev Service
Introduction
This document describes the local development service for running the application's model(s) locally.
The overall idea is to provide a zero-config approach to developing and testing your AI-infused application locally and to facilitate the transition to the model-serving infrastructure used in production.
Requirements
Local model serving tools
There are several ways to implement the local development service. The most common ones are:
- Docker is not the best option. While it is pretty standard, running models in it is slow.
- Podman Desktop with the AI Studio is a good option. However, the downside is the lack of an API to pull and list models. This option might be considered in the future.
- Ollama is the best option. It is a lightweight service that can run models locally on the three main operating systems. It is fast and easy to use, offers an API to pull and list models, and exposes an OpenAI-like API, which is a plus.
The dev service will not be able to run every model, but it should be able to support the most common ones.
Also, the dev service should not install Ollama but redirect the user to the download / install page.
Overall experience
This section describes the overall experience when such a development service is used.
Usage of a chat model
The user wants to develop a chatbot using the Mistral model.
The user can use one of the available extensions.
In the configuration, the model name is set to mistral (or whatever the model's name is).
When the application starts, if the model serving URL is not configured and the dev service is enabled, it is started.
The dev service extracts the model name and pulls the model if unavailable.
An error is thrown if the model cannot be pulled (the user should install the model manually).
Once the model is pulled or already installed, the dev service produces the configuration required by the application to use the local model.
Note that this configuration is dependent on the extension used.
The application can now use the local model to develop and test the chatbot.
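As an illustration, the configuration mentioned above could look like the snippet below. The property key is purely hypothetical (the exact name depends on the extension in use); the important part is that only the model name is set and no serving URL is configured, so the dev service takes over.

```properties
# Hypothetical property key -- the exact name depends on the extension in use.
# No base URL is configured, so the dev service is expected to provide one.
quarkus.langchain4j.ollama.chat-model.model-id=mistral
```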
Usage of an embedding model
If the application uses an embedding model, the experience is similar to the chat model.
However, if the embedding model cannot be pulled, the dev service should propose to pull the nomic-embed-text model.
Unfortunately, the behavior may differ from the actual application in production, but it is a good compromise.
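A minimal sketch of that fallback, assuming a hypothetical tryPull hook that stands in for whatever pull mechanism the dev service ends up using:

```java
import java.util.function.Predicate;

// Hypothetical fallback logic for embedding models; 'tryPull' stands in for the
// actual pull mechanism (e.g. a call to the Ollama /api/pull endpoint).
public class EmbeddingModelFallback {

    static String resolveEmbeddingModel(String requested, Predicate<String> tryPull) {
        if (tryPull.test(requested)) {
            return requested;
        }
        // The requested embedding model cannot be pulled: fall back to the
        // general-purpose nomic-embed-text model (a real implementation might
        // prompt the user before pulling it).
        if (tryPull.test("nomic-embed-text")) {
            return "nomic-embed-text";
        }
        throw new IllegalStateException(
                "No embedding model could be pulled; run 'ollama pull <model>' manually.");
    }
}
```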
Some design ideas and questions
Should the dev service be automatically enabled?
I think enabling the dev service by default would be a good idea when the model serving URL is not configured, Ollama is running, and the model is supported.
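A sketch of that heuristic, assuming Ollama runs on localhost with the default port; the root endpoint of the Ollama server answers a simple liveness message, which is enough for the check:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.Set;

// Sketch of the enablement heuristic: start the dev service only when no serving
// URL is configured, the model is one we can handle, and Ollama is reachable.
public class DevServiceEnablement {

    static boolean shouldStartDevService(Optional<String> configuredBaseUrl,
                                         String modelName,
                                         Set<String> supportedModels,
                                         int ollamaPort) {
        if (configuredBaseUrl.isPresent()) {
            return false; // the user already points to a model-serving endpoint
        }
        if (!supportedModels.contains(modelName)) {
            return false; // unsupported model: fail later with a clear error instead
        }
        return isOllamaRunning(ollamaPort);
    }

    static boolean isOllamaRunning(int port) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:" + port + "/"))
                    .timeout(Duration.ofSeconds(1))
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200; // Ollama replies "Ollama is running"
        } catch (Exception e) {
            return false;
        }
    }
}
```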
Ollama REST API and availability
Ollama exposes a REST API to pull and list models.
This API is exposed on port 11434 by default.
This port should be configurable just in case the user changes it (I have no idea how to do that, but well).
The Ollama API documentation is available at: https://github.com/ollama/ollama/blob/main/docs/api.md.
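For illustration, listing the locally available models is a single GET against the documented /api/tags endpoint (a real implementation would parse the JSON instead of printing it):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal example against the Ollama REST API: GET /api/tags lists local models.
public class OllamaListModels {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/tags"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // e.g. {"models":[{"name":"mistral:latest", ...}]}
        System.out.println(response.body());
    }
}
```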
How to handle when the model is not available?
If the model is not available, the dev service should propose to pull the model.
If the model cannot be pulled, an error should be thrown.
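A minimal sketch of that pull-or-fail behaviour against the documented /api/pull endpoint. It is simplified: the endpoint actually streams progress as JSON lines, which a real dev service would report to the user.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Pull the model through the Ollama API, or fail with an actionable error.
public class OllamaPull {

    static void pullOrFail(String model) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/pull"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"" + model + "\"}"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("Could not pull model '" + model
                    + "'; run 'ollama pull " + model + "' manually.");
        }
    }
}
```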
How to handle the case where the model is not supported?
If the chat model is not supported, an error should be thrown.
If the embedding model is not supported, the dev service should propose to pull the nomic-embed-text model.
How to detect the model used
The dev service should detect the model used.
An idea is to make the configuration of these models "build time," and a BuildItem with the model name would be produced.
Thus, the dev service processor can consume these build items and start the dev service accordingly.
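A rough sketch of that idea; the build item name and processor are hypothetical, only the Quarkus MultiBuildItem/@BuildStep machinery is taken as-is:

```java
import java.util.List;

import io.quarkus.builder.item.MultiBuildItem;
import io.quarkus.deployment.annotations.BuildStep;

// Hypothetical build item: one instance per model name found in the build-time configuration.
public final class ModelNameBuildItem extends MultiBuildItem {

    private final String modelName;

    public ModelNameBuildItem(String modelName) {
        this.modelName = modelName;
    }

    public String getModelName() {
        return modelName;
    }
}

// The dev service processor consumes every ModelNameBuildItem and pulls/serves
// the corresponding models. A real build step would also produce something
// (e.g. the build item sketched in the next section) so it takes part in the build chain.
class LocalDevServiceProcessor {

    @BuildStep
    void startDevService(List<ModelNameBuildItem> models) {
        for (ModelNameBuildItem model : models) {
            // pull model.getModelName() through the Ollama API if needed, then serve it
        }
    }
}
```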
How to configure the extensions
Once the dev service has started the model, it should emit a custom build item indicating how to invoke the model.
Extensions should be able to retrieve this build item and configure themselves accordingly.
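One possible shape for that custom build item and a consuming extension processor. The class and property names are made up, and turning the information into a default configuration value via RunTimeConfigurationDefaultBuildItem is just one option among others:

```java
import io.quarkus.builder.item.SimpleBuildItem;
import io.quarkus.deployment.annotations.BuildStep;
import io.quarkus.deployment.builditem.RunTimeConfigurationDefaultBuildItem;

// Hypothetical build item emitted once the dev service is up, describing how to
// reach the locally served model.
final class LocalModelServingBuildItem extends SimpleBuildItem {

    private final String baseUrl;

    LocalModelServingBuildItem(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    String getBaseUrl() {
        return baseUrl;
    }
}

// An extension processor consumes the build item and turns it into a default
// configuration value (in practice the parameter would be optional, since the
// dev service may not have started at all).
class MyModelExtensionProcessor {

    @BuildStep
    RunTimeConfigurationDefaultBuildItem configureBaseUrl(LocalModelServingBuildItem serving) {
        // Hypothetical property key; each extension would map the URL to its own configuration.
        return new RunTimeConfigurationDefaultBuildItem("quarkus.my-model.base-url", serving.getBaseUrl());
    }
}
```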