feat: load multiple models #495

vansangpfiev · 2024-04-10T06:49:09Z

Issue link: #467

Use std::unordered_map to store all llama_server_context (lsc) instances
refactor: move the background thread into lsc, so each lsc has its own background thread
TODO: update documentation
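
The layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ServerContext` is a hypothetical stand-in for llama_server_context, `ModelRegistry` and its method names are invented for the example, and the background loop only idles where the real one would process inference work.

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for llama_server_context (lsc). Each instance owns
// its own background thread: started in the constructor, joined in the
// destructor.
struct ServerContext {
  explicit ServerContext(std::string path)
      : model_path(std::move(path)), worker([this] { Run(); }) {}

  ~ServerContext() {
    running = false;
    if (worker.joinable()) worker.join();
  }

  // Placeholder loop; a real context would process inference tasks here.
  void Run() {
    while (running) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::string model_path;
  std::atomic<bool> running{true};  // declared before worker on purpose
  std::thread worker;
};

// Hypothetical registry keyed by a unique model id.
// Not thread-safe; a real server would guard the map with a mutex.
class ModelRegistry {
 public:
  // Returns false if a model with this id is already loaded.
  bool Load(const std::string& model_id, const std::string& path) {
    if (contexts_.count(model_id) > 0) return false;
    contexts_.emplace(model_id, std::make_unique<ServerContext>(path));
    return true;
  }

  // Destroys the context, which also stops its background thread.
  bool Unload(const std::string& model_id) {
    return contexts_.erase(model_id) > 0;
  }

  // Returns nullptr when no model is registered under this id.
  ServerContext* Find(const std::string& model_id) {
    auto it = contexts_.find(model_id);
    return it == contexts_.end() ? nullptr : it->second.get();
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<ServerContext>> contexts_;
};
```

Storing `std::unique_ptr` values keeps each context at a stable address even when the map rehashes, which matters because the background thread captures `this`.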

We use model_id as the key to look up a model, so each one has to be unique. That requires some changes to the request parameters:
For completions:
Set model_alias for inferences/llamaCPP/loadmodel and inferences/llamaCPP/unloadmodel; it has to match the model parameter in inferences/llamaCPP/chat_completion.
loadmodel/unloadmodel:

{
...
  "llama_model_path": "file_to_location",
  "model_alias": "model1"
...
}

chat_completion:

{
...
  "model": "model1"  
...
}

For embeddings:
The same applies to loadmodel/unloadmodel. For an embedding request, we need to add the model parameter.
loadmodel/unloadmodel:

{
...
    "llama_model_path": "e:/workspace/model/nomic-embed-text-v1.5.f16.gguf",
    "model_alias": "model1",
    "model_type": "embedding"
...
}

embedding

{
...  
  "input": "how are you",
  "model": "model1"
...
}

louis-menlo · 2024-04-16T09:13:19Z

controllers/llamaCPP.cc

+    task_queue_thread_num = std::max(task_queue_thread_num, params.n_parallel);
+    LOG_INFO << "Start inference task queue, num threads: "
+             << task_queue_thread_num;
+    inference_task_queue = std::make_unique<trantor::ConcurrentTaskQueue>(


The inference task queue is constructed with one model's parameters. It is shared among the models' operations, which is incorrect, right? Let's say it is recreated every time a request for a new model comes in with a larger n_parallel:

How will pending tasks be processed once the queue is recreated?

Will later requests change thread_num, causing side effects for previously loaded models?

One solution (though it is not optimal) is to use a task queue for each model. Will change the code to make it work correctly first.
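
A rough sketch of the "one task queue per model" idea, under stated assumptions: `TaskQueue` here is a small hand-written stand-in for trantor::ConcurrentTaskQueue, and `ServerContext`, `Post`, `Submit`, and `completed` are hypothetical names used only for illustration. Because each context sizes its own queue from its own n_parallel, loading another model never resizes or replaces a queue that already has pending work.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Simplified stand-in for a concurrent task queue with a fixed thread count.
class TaskQueue {
 public:
  explicit TaskQueue(std::size_t num_threads) {
    for (std::size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  // Drains the tasks that are still pending, then joins the workers.
  ~TaskQueue() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopping and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::vector<std::thread> workers_;
};

// Hypothetical per-model context: it owns its queue, sized from its own
// n_parallel, so other models cannot affect it.
struct ServerContext {
  explicit ServerContext(int n_parallel)
      : queue(static_cast<std::size_t>(std::max(1, n_parallel))) {}

  void Submit(std::function<void()> task) {
    queue.Post([this, task = std::move(task)] {
      task();
      ++completed;
    });
  }

  std::atomic<int> completed{0};  // finished tasks, for illustration only
  TaskQueue queue;                // declared last, so it is destroyed first
};
```

The trade-off is more threads overall (the sum of each model's n_parallel rather than the maximum), which is presumably why this is described as not optimal.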

louis-menlo · 2024-04-16T09:43:32Z

utils/nitro_utils.h

+    auto s = input.asString();
+    std::replace(s.begin(), s.end(), '\\', '/');
+    auto const pos = s.find_last_of('/');
+    // We only if file name has gguf extension or nothing


Missing some words
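
For context, a helper along the lines of the quoted snippet might look like the sketch below. It assumes the intent is to derive a model identifier from the file name of the model path; the function name `ModelIdFromPath` and the exact handling of the extension are assumptions, not the code in this PR.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: derive a model id from a model file path.
// Assumes the id is the file name with a trailing ".gguf" removed.
std::string ModelIdFromPath(std::string s) {
  // Normalize Windows separators so one search handles both path styles.
  std::replace(s.begin(), s.end(), '\\', '/');
  const auto pos = s.find_last_of('/');
  std::string name = (pos == std::string::npos) ? s : s.substr(pos + 1);

  // Strip the extension only when the file name ends with ".gguf".
  const std::string ext = ".gguf";
  if (name.size() > ext.size() &&
      name.compare(name.size() - ext.size(), ext.size(), ext) == 0) {
    name.erase(name.size() - ext.size());
  }
  return name;
}
```

Under these assumptions, the embedding path from the description would map to `nomic-embed-text-v1.5.f16` when no model_alias is given.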

louis-menlo

LGTM

vansangpfiev · 2024-05-08T01:27:20Z

Moved PR to janhq/cortex.llamacpp#14

vansangpfiev self-assigned this Apr 10, 2024

vansangpfiev force-pushed the feat-load-multiple-models branch 2 times, most recently from 99ef8e5 to fc8cabd Compare April 12, 2024 03:01

vansangpfiev marked this pull request as ready for review April 12, 2024 03:26

feat: multiple models for llama

4dc2ccf

vansangpfiev force-pushed the feat-load-multiple-models branch from fc8cabd to 4dc2ccf Compare April 16, 2024 01:25

vansangpfiev requested review from tikikun, louis-menlo and hiro-v April 16, 2024 02:04

louis-menlo reviewed Apr 16, 2024

vansangpfiev added 2 commits April 17, 2024 08:19

Merge branch 'main' of github.com:janhq/nitro into feat-load-multiple-models

efed8d6

fix: update comments, use a task queue for each server context

30881bb

louis-menlo approved these changes Apr 19, 2024

vansangpfiev closed this May 8, 2024

vansangpfiev deleted the feat-load-multiple-models branch July 8, 2024 05:40
