feat: load multiple models #495
Conversation
controllers/llamaCPP.cc
task_queue_thread_num = std::max(task_queue_thread_num, params.n_parallel);
LOG_INFO << "Start inference task queue, num threads: "
         << task_queue_thread_num;
inference_task_queue = std::make_unique<trantor::ConcurrentTaskQueue>(
The inference task queue is constructed with a single model's parameters but is shared across all models' operations, which is incorrect, right? Say it gets renewed every time a new load request comes in with a larger n_parallel:
- How will pending tasks be processed if the queue is renewed?
- Will later requests change the thread count, causing side effects for previously loaded models?
One solution (though it is not optimal) is to use a task queue for each model (see the sketch below). Will change the code to make it work correctly first.
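A minimal sketch of that per-model-queue idea, assuming queues are keyed by model_id (the InferenceQueues helper and its method names are illustrative, not the PR's actual code; locking around the map is omitted for brevity):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

#include <trantor/utils/ConcurrentTaskQueue.h>

// One ConcurrentTaskQueue per model: a later load with a larger n_parallel
// gets its own queue instead of resizing or replacing a shared one, so
// previously loaded models keep their pending tasks and thread count.
class InferenceQueues {
 public:
  trantor::ConcurrentTaskQueue& GetOrCreate(const std::string& model_id,
                                            int n_parallel) {
    auto it = queues_.find(model_id);
    if (it == queues_.end()) {
      it = queues_
               .emplace(model_id,
                        std::make_unique<trantor::ConcurrentTaskQueue>(
                            n_parallel, "inference_" + model_id))
               .first;
    }
    return *it->second;
  }

  // Unloading a model drops only that model's queue; other models' queues
  // are untouched.
  void Remove(const std::string& model_id) { queues_.erase(model_id); }

 private:
  std::unordered_map<std::string,
                     std::unique_ptr<trantor::ConcurrentTaskQueue>>
      queues_;
};
```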
utils/nitro_utils.h
auto s = input.asString();
std::replace(s.begin(), s.end(), '\\', '/');
auto const pos = s.find_last_of('/');
// We only if file name has gguf extension or nothing
Missing some words
LGTM
Moved PR to janhq/cortex.llamacpp#14
Issue link: #467
We use model_id as a key to find the model, so model IDs have to be unique. That requires some changes in the request parameters:
For completions:
Set the model_alias value for inferences/llamaCPP/loadmodel and inferences/llamaCPP/unloadmodel; it has to be the same as the model parameter in inferences/llamaCPP/chat_completion.
loadmodel/unloadmodel:
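For illustration, a hypothetical load request might look like the following (fields other than model_alias, such as llama_model_path and ctx_len, are taken from the existing loadmodel usage and are assumptions here, not shown in this PR excerpt):

```json
{
  "llama_model_path": "/path/to/model.gguf",
  "ctx_len": 2048,
  "model_alias": "my-model"
}
```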
chat_completion:
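A matching chat_completion request would then reference the same identifier via model (a sketch; the other fields follow the usual OpenAI-style shape):

```json
{
  "model": "my-model",
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}
```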
For embeddings:
The same applies for loadmodel/unloadmodel; for the embedding request, we need to add the model parameter.
loadmodel/unloadmodel:
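A sketch of the load request for an embedding model (field names other than model_alias are assumptions; the embedding flag mirrors the existing option for enabling embeddings and may differ in the actual API):

```json
{
  "llama_model_path": "/path/to/embedding-model.gguf",
  "embedding": true,
  "model_alias": "my-embedding-model"
}
```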
embedding:
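The embedding request then selects that model via the new model parameter (a sketch):

```json
{
  "input": "Hello, world",
  "model": "my-embedding-model"
}
```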