server: Bring back multimodal support #8010
Comments
Can someone advise me on what is the latest (working) commit before multimodal was removed? |
It is rather annoying that multimodal support was removed from the server and has not been re-implemented for such a long time now (4 months?). Multimodal LLMs and interleaved image and text models have been growing in capability recently, and not being able to run models that used to work before is unfortunate. Seemingly, the only way to restore this functionality is to downgrade to a version that loses support for most new models and improvements. I am not trying to demand that multimodal/llava support return, but to show that this feature on the server is missed. |
Hello, is there still no multimodal support in llama-server? According to the README of the LLaMA.cpp HTTP Server, it should be supported? How can it be used with the OpenAI API format? |
Have there been any updates on this? |
So far, it appears that there haven't been any updates. This really stinks because there were updates to llava recently to support new models. |
So, this functionality seems to have been unavailable for months and there is no hope of getting it running? With all the amazing new models we could work with, such as MiniCPM, even Pixtral, etc. Can someone point us to working server software that supports the newer multimodal models? We just need something like llama-server that can run these multimodal GGUFs. Perhaps one of the llama.cpp files allows running a server? It's also important to have a standard (OpenAI) API to support standard interactions. It's so frustrating to wait months and months for such an important feature with no one even bothering to reply! |
Not much has changed since the issue was created. We need contributions to improve the existing vision code and people to maintain it. There is interest in reintroducing full multimodal support, but there are other things with higher priority that are currently being worked on by the core maintainers of the project. |
Just to remind: Currently, llama-cpp-python has a server implementation that supports vision models (with OAI compat API). You can use it as an alternative. Of course it's much better to bring vision support into llama.cpp itself (instead of staying as |
@ggerganov, Meta released Llama-3.2 with multimodal capabilities. Does this affect the priority for core maintainers? I hope this question doesn’t come across as entitled... |
@chigkim My PoV is that adding multimodal support is a great opportunity for new people with good software architecture skills to get involved in the project. The general low to mid level patterns and details needed for the implementation are already available in the codebase - from model conversion, to data loading, backend usage and inference. It would take some high-level understanding of the project architecture in order to implement support for the vision models and extend the API in the correct way. We really need more people with this sort of skillset, so at this point I feel it is better to wait and see if somebody will show up and take the opportunity to help out with the project long-term. Otherwise, I'm afraid we won't be able to sustain the quality of the project. |
Great! Good opportunities, from a developer perspective everyone loves to dive into the code. I would love to help but don't know where to start, is there a list of requirements for the implementation or just make something work for now? What would the finished implementation look like? |
Correct me if I'm wrong, but current multimodal open-source models are essentially just like a usual LLM, plus accepting images as input.
|
IMO the
|
It's hard to make a list of requirements - I personally don't have the expertise and experience needed to decide what is the best way to integrate multimodal. It mainly depends on the ways that the functionality is used - what is the input and output. The core implementation should be 99% the same as every transformer based model.
Likely,
That's my understanding as well. In a similar way, Whisper is an LLM that instead of text tokens, accepts raw audio, that is encoded and passed to the decoder.
Yes, I agree.
Yes, something along these lines, though I don't really have a good picture. Maybe even consider to reuse |
CLIP is quite different from whisper because it doesn't use cross attention. Instead, the vision model outputs embeddings that can be taken as input for language model. It also depends on the chat template to know where to put the embedding. A common pattern that I observe looks like this:
So, my current solution in #9687 is to first call Not sure if this is the best way to do it, though, as I'm still new to vision models. Feedback is welcome on this subject. |
Yes, actually I'm looking at AFAICT, we need llama.cpp/examples/llava/clip.cpp Lines 621 to 626 in 8277a81
I'm wondering if the right approach is to try to abstract only the part where the raw input becomes embeddings. For example, we currently support 2 input types with the existing
The natural continuation seems to be to extend support for images by adding them to |
I would suggest designing the API so that it is at least possible to implement it without copying the embeddings to the CPU and then back to the GPU. It doesn't have to be implemented that way in the first iteration, but at some point, if this becomes an important feature, we will want to optimize it and remove these copies, and being able to do that without changing the API would make it much less painful.
I have thought that |
Yes, that makes sense. Adding image input directly to
int i = 0;
for (auto tok : token_list) {
    if (llama_token_image(ctx) == tok) {
        llama_batch_add(batch, img, i, { 0 }, false);
        i += llama_n_tokens_image(ctx, img); // keep track of how many tokens an image occupies
    } else {
        llama_batch_add(batch, tok, i, { 0 }, false);
        i++;
    }
}
@slaren This implementation should allow us to keep the current method for now (copy embeddings GPU --> CPU --> GPU), then eliminate the copy in the future without a breaking change. My initial idea was to make Having |
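The position bookkeeping in the loop above can be sketched in isolation. This is only an illustration under stated assumptions: `token_id`, `batch_entry` and `assign_positions` are hypothetical names, not part of llama.cpp, and the sketch assumes every image occupies the same fixed number of positions, which the proposal leaves to a per-image query.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; none of these exist in llama.cpp.
using token_id = int32_t;

struct batch_entry {
    token_id token; // token id (the image placeholder id for image entries)
    int32_t  pos;   // first position the entry occupies in the sequence
    int32_t  n_pos; // number of positions the entry occupies
};

// Walk a tokenized prompt and assign positions. Text tokens take one position;
// each image placeholder takes n_img_tokens positions (assumed constant here).
std::vector<batch_entry> assign_positions(const std::vector<token_id> & tokens,
                                          token_id image_token,
                                          int32_t n_img_tokens) {
    assert(n_img_tokens > 0);
    std::vector<batch_entry> out;
    out.reserve(tokens.size());
    int32_t pos = 0;
    for (token_id tok : tokens) {
        const int32_t width = (tok == image_token) ? n_img_tokens : 1;
        out.push_back({tok, pos, width});
        pos += width; // the next entry starts after everything this one occupies
    }
    return out;
}
```

The point of the sketch is that text following an image must start after all of the image's positions, which is why the counter advances by the image width rather than by one.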
I have to say that I don't really see any good reason to put images in the batch. Images will need to be processed separately anyway, and it will make a data structure that is already quite complicated and rather unpleasant to use even more so. I also expect that the code to deal with this will have a significant amount of complexity. |
@ggerganov Sorry, I misunderstood this part. I don't think we should separate the part where inp_raw becomes embeddings, since it will increase the complexity of the code without providing much benefit to API users. Edit: To explain a bit more, in such a case, the user will have to call something like:
Re @slaren yeah I see your point. I thought that if But in this case, it no longer matches the behavior of For now, I think I will stick with the separated
// assuming we have a batch of 2 images
int img_status = llama_vision_encode(ctx, &img_batch);
// maybe check img_status
// position of image tokens in language batch, based on tokenized text input
std::vector<llama_pos> img_pos = { 12, 14 };
// batch for language part
llama_batch batch = llama_batch_init(512, 0, 1);
llama_batch_add_decoded_imgs(ctx, img_pos.data(), img_pos.size()); |
The main reason I had in mind is to reuse
llama_batch_clear();
llama_batch_add_img(batch, img);
llama_decode(ctx, batch);
embd_img = llama_get_embeddings(ctx);
llama_batch_clear();
llama_batch_add_token(batch, id, ...);
llama_batch_add_embd(batch, embd_img, ...);
llama_decode(ctx, batch);
An alternative that also sounds good to me is to have a batch type for each input type:
struct llama_batch_embd;
struct llama_batch_text;
struct llama_batch_img;
struct llama_batch_audio;
...
In this case we would have the corresponding:
llama_decode_embd (ctx, batch_embd);
llama_decode_text (ctx, batch_text);
llama_decode_img  (ctx, batch_img);
llama_decode_audio(ctx, batch_audio);
...
And internally all these will end up calling a common
No, I didn't mean for the user to manually do the conversion to embeddings. Just to extend the existing internal implementation of I'm thinking along the lines of having just |
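The "one batch type per input type" alternative can be sketched as thin typed entry points that all funnel into one shared routine. Everything here is hypothetical (`toy_context`, `batch_text`, `batch_embd`, `decode_impl` and the `decode_*` wrappers are illustrative names, not llama.cpp symbols), and the shared routine only counts positions instead of running a model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; not the llama.cpp API.
enum class input_kind { text, embd, image };

struct batch_text { std::vector<int> tokens; };
struct batch_embd { std::vector<float> data; size_t n_embd; };

struct toy_context {
    size_t     n_positions = 0;               // positions consumed so far
    input_kind last_kind   = input_kind::text; // kind of the last decoded batch
};

// Shared internal path: every typed entry point ends up here.
// Returns 0 on success and 1 for an empty batch.
static int decode_impl(toy_context & ctx, input_kind kind, size_t n_items) {
    if (n_items == 0) {
        return 1;
    }
    ctx.n_positions += n_items;
    ctx.last_kind    = kind;
    return 0;
}

// Typed wrappers: each validates its own input shape, then delegates.
int decode_text(toy_context & ctx, const batch_text & b) {
    return decode_impl(ctx, input_kind::text, b.tokens.size());
}

int decode_embd(toy_context & ctx, const batch_embd & b) {
    assert(b.n_embd > 0);
    assert(b.data.size() % b.n_embd == 0); // data must hold whole embedding rows
    return decode_impl(ctx, input_kind::embd, b.data.size() / b.n_embd);
}
```

With this shape, adding an image or audio entry point later would mean adding one more wrapper, while the shared routine stays the single place where the context is updated.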
Personally I would prefer having multiple separate functions instead of a single Side story, I've been doing some llama.cpp intro for co-workers at HF, and I found it's very handy to say "for decoder-only model, use Something like |
I had a quick look at the Llama 3.2 vision implementation. They use cross attention instead of using embeddings. So I need to take into account usage with cross attention that will come in the future. I think an API like Compared to my initial proposal, now, the position for image must be added to the batch too:
typedef struct llama_batch_img {
    int32_t      n_imgs;
    llama_img ** imgs;
    llama_pos *  pos;
} llama_batch_img;
// usage:
llama_batch_img ibatch = llama_batch_img_init(2); // for example, batch of 2 images
...
llama_vision_decode(ctx, ibatch); // output is saved to ctx or KV cache, depending on implementation
llama_decode(ctx, batch); // decode language batch
// then, get logits
... |
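For illustration, an init/free pair for a struct shaped like the proposed image batch could look as follows. This is a minimal sketch, not the implementation from #9687: `toy_img`, `toy_batch_img`, `toy_batch_img_init` and `toy_batch_img_free` are hypothetical names, and positions are plain `int32_t` here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical mirror of the proposed struct; not part of llama.cpp.
struct toy_img; // opaque image handle, never defined in this sketch

struct toy_batch_img {
    int32_t    n_imgs; // number of image slots
    toy_img ** imgs;   // one handle per image (not owned by the batch)
    int32_t *  pos;    // position of each image in the language sequence
};

// Allocate zero-initialized storage for n_imgs image slots.
toy_batch_img toy_batch_img_init(int32_t n_imgs) {
    assert(n_imgs > 0);
    toy_batch_img b;
    b.n_imgs = n_imgs;
    b.imgs   = static_cast<toy_img **>(std::calloc(n_imgs, sizeof(toy_img *)));
    b.pos    = static_cast<int32_t *>(std::calloc(n_imgs, sizeof(int32_t)));
    return b;
}

// Release the arrays and reset the struct so a double free is harmless.
void toy_batch_img_free(toy_batch_img & b) {
    std::free(b.imgs);
    std::free(b.pos);
    b.imgs   = nullptr;
    b.pos    = nullptr;
    b.n_imgs = 0;
}
```

Keeping the positions in the image batch itself means the vision step already knows where each image sits in the sequence, which matters for the cross attention case mentioned above.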
Chill out man, I've just come back from vacation |
Any updates on this? 9 months without multimodal |
Actually I lost my motivation to work on #9687 because I'm still not very happy about my proposal for the API. The problem is that placing the image embedding in the correct place can be quite tricky, as it's often controlled by the chat template. Thinking again about it today, it will be better to make the API more explicit. The main assumption is that we have these similar terms between language and vision:
So now my proposal for the flow is: Vision part:
Now the tricky part: how can we indicate in the language batch ( We can reserve the 2 most significant bits to mark the token as “vision”, Then add both language tokens (tokenized from text) and vision tokens to the Upon receiving the batch, |
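The idea of reserving the 2 most significant bits of a token id can be sketched as below. The exact encoding is an assumption made for illustration: a 32-bit token id, tag value 0 for text and 1 for vision, and the helper names `tag_token`, `kind_of` and `id_of` are hypothetical and not part of llama.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding for illustration; the thread does not fix these values.
using token_id = int32_t;

constexpr uint32_t KIND_SHIFT = 30;                // top 2 of 32 bits hold the tag
constexpr uint32_t KIND_MASK  = 0x3u << KIND_SHIFT;
constexpr uint32_t ID_MASK    = ~KIND_MASK;        // low 30 bits hold the id

enum token_kind : uint32_t { KIND_TEXT = 0, KIND_VISION = 1 };

// Pack a raw id and a kind into one token value.
inline token_id tag_token(uint32_t id, token_kind kind) {
    assert((id & KIND_MASK) == 0); // the raw id must fit in the low 30 bits
    return static_cast<token_id>((static_cast<uint32_t>(kind) << KIND_SHIFT) | id);
}

// Recover the kind from a tagged token.
inline token_kind kind_of(token_id t) {
    return static_cast<token_kind>((static_cast<uint32_t>(t) & KIND_MASK) >> KIND_SHIFT);
}

// Recover the raw id from a tagged token.
inline uint32_t id_of(token_id t) {
    return static_cast<uint32_t>(t) & ID_MASK;
}
```

With tag 0 for text, existing text token ids keep their values unchanged, so only code that handles vision tokens would need to look at the tag bits. The cost is that the usable id range shrinks to 30 bits.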
Is |
Yes, The difference is that there is an extra projection step to make sure the vision embd and text embd dimensions are the same. P/s: probably |
I think you could simplify all of this by working directly with |
@slaren indeed, my suggestion is already (mostly) what you said. Instead of returning Upon doing |
Also, technically say,
|
I still think it would be good to modify |
Modifying With my last proposal, I think migrating from token-based batch to sequence-based batch won't be too complicated in the future. As said, Not sure if there're other things to worry about here, feel free to comment if you can think of any. On my side, I'll start working on this in the next few days - I should start when I'm still having the motivation 😆 |
Personally I wouldn't like adding special values to tokens, or coupling the |
Hmm, with the approach you described, it will be tricky to support batched images, because But I think we could skip it for now, given that it will also be complicated for me to add batching to So for now we can simplify it to:
Please note that,
Then, we need to call
|
llama_batch batch = {
    /*n_tokens    =*/ 0, // implied from embd_tensor dimensions
    /*tokens      =*/ nullptr,
    /*embd        =*/ nullptr,
    /*embd_tensor =*/ llama_vision_get_embeddings(vision_ctx),
    /*pos         =*/ nullptr,
    /*n_seq_id    =*/ nullptr,
    /*seq_id      =*/ nullptr,
    /*logits      =*/ nullptr,
};
To support batching later, |
Hey @ngxson I noticed you were working on bringing vision capabilities back to llama.cpp and your last commit shows you were working on minicpm. Not sure if it helps, but here is an implementation of this model to work with llama.cpp - Open BMB llama.cpp fork - minicpm-v2.5 Thank you for your work! |
Multimodal has been removed since #5882.
Depending on the refactoring of llava, we will be able to bring back the support: #6027
This issue is created mostly for tracking purposes. If someone wants to take this task, feel free to comment below.
Currently, there is not yet any plan for this task.