Local Dev Service #505
Replies: 5 comments 5 replies
- I initially had a hard time accepting that we should not download and even start Ollama on our own, but I think I have come to terms with it by thinking of …
- I think this is probably a good idea regardless, as it's highly unlikely you would want to use a different model in production.
- I've started looking into this, and one of the things I am unsure of is whether we should handle OpenAI compatibility from the get-go.
- #557 is pretty much the implementation of what is described here (although not yet complete).
- There is another user experience issue when trying to move this to OpenAI: how would the user opt to use the local inference?
Local Dev Service
Introduction
This document describes the local development service for running the application's model(s) locally.
The overall idea is to provide a zero-config approach to developing and testing your AI-infused application locally and to facilitate the transition to the model-serving infrastructure used in production.
Requirements
Local model serving tools
There are several ways to implement the local development service. The most common ones are:
- Docker is not the best option. While it is pretty standard, running models in it is slow.
- Podman Desktop with the AI Studio is a good option. However, the downside is the lack of an API to pull and list models. This option might be considered in the future.
- Ollama is the best option. It is a lightweight service that can run models locally on the three main operating systems. It is fast and easy to use, offers an API to pull and list models, and exposes an OpenAI-like API, which is a plus.
The dev service will not be able to run every model, but it should be able to support the most common ones.
Also, the dev service should not install Ollama but redirect the user to the download / install page.
Overall experience
This section describes the overall experience when such a development service is used.
Usage of a chat model
The user wants to develop a chatbot using the Mistral model.
The user can use one of the available extensions.
In the configuration, the model name is set to mistral (or whatever the model's name is).
When the application starts, if the model serving URL is not configured and the dev service is enabled, it is started.
The dev service extracts the model name and pulls the model if unavailable.
An error is thrown if the model cannot be pulled (the user should install the model manually).
Once the model is pulled or already installed, the dev service produces the configuration required by the application to use the local model.
Note that this configuration is dependent on the extension used.
The application can now use the local model to develop and test the chatbot.
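As an illustration, the configuration mentioned above could look like the snippet below. The property key is purely hypothetical (the exact name depends on the extension in use); the important part is that only the model name is set and no serving URL is configured, so the dev service takes over.

```properties
# Hypothetical property key -- the exact name depends on the extension in use.
# No base URL is configured, so the dev service is expected to provide one.
quarkus.langchain4j.ollama.chat-model.model-id=mistral
```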
Usage of an embedding model
If the application uses an embedding model, the experience is similar to the chat model.
However, if the embedding model cannot be pulled, the dev service should propose to pull the nomic-embed-text model.
Unfortunately, the behavior may differ from the actual application in production, but it is a good compromise.
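A minimal sketch of that fallback, assuming a hypothetical tryPull hook that stands in for whatever pull mechanism the dev service ends up using:

```java
import java.util.function.Predicate;

// Hypothetical fallback logic for embedding models; 'tryPull' stands in for the
// actual pull mechanism (e.g. a call to the Ollama /api/pull endpoint).
public class EmbeddingModelFallback {

    static String resolveEmbeddingModel(String requested, Predicate<String> tryPull) {
        if (tryPull.test(requested)) {
            return requested;
        }
        // The requested embedding model cannot be pulled: fall back to the
        // general-purpose nomic-embed-text model (a real implementation might
        // prompt the user before pulling it).
        if (tryPull.test("nomic-embed-text")) {
            return "nomic-embed-text";
        }
        throw new IllegalStateException(
                "No embedding model could be pulled; run 'ollama pull <model>' manually.");
    }
}
```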
Some design ideas and questions
Should the dev service be automatically enabled?
I think enabling the dev service by default would be a good idea when the model serving URL is not configured, Ollama is running, and the model is supported.
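A sketch of that heuristic, assuming Ollama runs on localhost with the default port; the root endpoint of the Ollama server answers a simple liveness message, which is enough for the check:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.Set;

// Sketch of the enablement heuristic: start the dev service only when no serving
// URL is configured, the model is one we can handle, and Ollama is reachable.
public class DevServiceEnablement {

    static boolean shouldStartDevService(Optional<String> configuredBaseUrl,
                                         String modelName,
                                         Set<String> supportedModels,
                                         int ollamaPort) {
        if (configuredBaseUrl.isPresent()) {
            return false; // the user already points to a model-serving endpoint
        }
        if (!supportedModels.contains(modelName)) {
            return false; // unsupported model: fail later with a clear error instead
        }
        return isOllamaRunning(ollamaPort);
    }

    static boolean isOllamaRunning(int port) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:" + port + "/"))
                    .timeout(Duration.ofSeconds(1))
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200; // Ollama replies "Ollama is running"
        } catch (Exception e) {
            return false;
        }
    }
}
```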
Ollama REST API and availability
Ollama exposes a REST API to pull and list models.
This API is exposed on port 11434 by default.
This port should be configurable just in case the user changes it (I have no idea how to do that, but well).
The Ollama API documentation is available at: https://github.com/ollama/ollama/blob/main/docs/api.md.
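For illustration, listing the locally available models is a single GET against the documented /api/tags endpoint (a real implementation would parse the JSON instead of printing it):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal example against the Ollama REST API: GET /api/tags lists local models.
public class OllamaListModels {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/tags"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // e.g. {"models":[{"name":"mistral:latest", ...}]}
        System.out.println(response.body());
    }
}
```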
How to handle when the model is not available?
If the model is not available, the dev service should propose to pull the model.
If the model cannot be pulled, an error should be thrown.
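A minimal sketch of that pull-or-fail behaviour against the documented /api/pull endpoint. It is simplified: the endpoint actually streams progress as JSON lines, which a real dev service would report to the user.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Pull the model through the Ollama API, or fail with an actionable error.
public class OllamaPull {

    static void pullOrFail(String model) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/pull"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"" + model + "\"}"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("Could not pull model '" + model
                    + "'; run 'ollama pull " + model + "' manually.");
        }
    }
}
```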
How to handle the case where the model is not supported?
If the chat model is not supported, an error should be thrown.
If the embedding model is not supported, the dev service should propose to pull the nomic-embed-text model.
How to detect the model used
The dev service should detect the model used.
An idea is to make the configuration of these models "build time," and a BuildItem with the model name would be produced.
Thus, the dev service processor can consume these build items and start the dev service accordingly.
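A rough sketch of that idea; the build item name and processor are hypothetical, only the Quarkus MultiBuildItem/@BuildStep machinery is taken as-is:

```java
import java.util.List;

import io.quarkus.builder.item.MultiBuildItem;
import io.quarkus.deployment.annotations.BuildStep;

// Hypothetical build item: one instance per model name found in the build-time configuration.
public final class ModelNameBuildItem extends MultiBuildItem {

    private final String modelName;

    public ModelNameBuildItem(String modelName) {
        this.modelName = modelName;
    }

    public String getModelName() {
        return modelName;
    }
}

// The dev service processor consumes every ModelNameBuildItem and pulls/serves
// the corresponding models. A real build step would also produce something
// (e.g. the build item sketched in the next section) so it takes part in the build chain.
class LocalDevServiceProcessor {

    @BuildStep
    void startDevService(List<ModelNameBuildItem> models) {
        for (ModelNameBuildItem model : models) {
            // pull model.getModelName() through the Ollama API if needed, then serve it
        }
    }
}
```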
How to configure the extensions
Once the dev service has started the model, it should emit a custom build item indicating how to invoke the model.
Extensions should be able to retrieve this build item and configure themselves accordingly.
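One possible shape for that custom build item and a consuming extension processor. The class and property names are made up, and turning the information into a default configuration value via RunTimeConfigurationDefaultBuildItem is just one option among others:

```java
import io.quarkus.builder.item.SimpleBuildItem;
import io.quarkus.deployment.annotations.BuildStep;
import io.quarkus.deployment.builditem.RunTimeConfigurationDefaultBuildItem;

// Hypothetical build item emitted once the dev service is up, describing how to
// reach the locally served model.
final class LocalModelServingBuildItem extends SimpleBuildItem {

    private final String baseUrl;

    LocalModelServingBuildItem(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    String getBaseUrl() {
        return baseUrl;
    }
}

// An extension processor consumes the build item and turns it into a default
// configuration value (in practice the parameter would be optional, since the
// dev service may not have started at all).
class MyModelExtensionProcessor {

    @BuildStep
    RunTimeConfigurationDefaultBuildItem configureBaseUrl(LocalModelServingBuildItem serving) {
        // Hypothetical property key; each extension would map the URL to its own configuration.
        return new RunTimeConfigurationDefaultBuildItem("quarkus.my-model.base-url", serving.getBaseUrl());
    }
}
```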