server: Bring back multimodal support #8010

Open
ngxson opened this issue Jun 19, 2024 · 37 comments · May be fixed by #9687 or #11292
Labels: enhancement (New feature or request), llava (LLaVa and multimodal server)

Comments

@ngxson
Collaborator

ngxson commented Jun 19, 2024

Multimodal has been removed since #5882

Depending on the refactoring of llava (#6027), we will be able to bring back support.

This issue is created mostly for tracking purposes. If someone wants to take on this task, feel free to comment below.

Currently, there is not yet any plan for this task.

@ZhenyaPav

Can someone advise me on what is the latest (working) commit before multimodal was removed?

@github-actions github-actions bot removed the stale label Jul 21, 2024
@HAV0X1014

It is rather annoying that multimodal support was removed from the server and has not been re-implemented for such a long time now (4 months?). Multimodal LLMs and interleaved image-and-text models have been growing in capability recently, and not being able to run models that used to work is unfortunate. Seemingly, the only way to restore this functionality is to downgrade to a version that loses support for most new models and improvements.

I am not trying to demand that multimodal/llava support return, only to show that this server feature is missed.

@micoraweb

Hello, is there still no multimodal support in llama-server? According to the README for the llama.cpp HTTP server, it should be supported.

How can it be used with the OpenAI API format?

@its-ven

its-ven commented Aug 22, 2024

Have there been any updates on this?

@HAV0X1014

So far, it appears that there haven't been any updates. This really stinks, because llava was recently updated to support new models.

@micoraweb

So, this functionality has been unavailable for months, and there is no hope of getting it running? With all the amazing new models we could work with, such as MiniCPM and even Pixtral, can someone point us to working server software that can run the newer multimodal models? We just need something like llama-server that can run these multimodal GGUFs. Perhaps one of the llama.cpp programs can run a server? It's also important to have a standard (OpenAI) API to support standard interactions. It's frustrating to wait months and months for such an important feature with no one even bothering to reply!

@ggerganov
Owner

Not much has changed since the issue was created. We need contributions to improve the existing vision code and people to maintain it. There is interest in reintroducing full multimodal support, but there are other, higher-priority things that the core maintainers of the project are currently working on.

@ngxson
Collaborator Author

ngxson commented Sep 12, 2024

Just a reminder: currently, llama-cpp-python has a server implementation that supports vision models (with an OAI-compatible API). You can use it as an alternative.

Of course, it's much better to bring vision support into llama.cpp itself (instead of keeping it as the llava example). The problem is that the current code requires a big cleanup. We will eventually do that, as vision capability is becoming more and more mainstream.

@chigkim

chigkim commented Sep 26, 2024

@ggerganov, Meta released Llama-3.2 with multimodal capabilities. Does this affect the priority for core maintainers? I hope this question doesn’t come across as entitled...

@ggerganov
Owner

@chigkim My PoV is that adding multimodal support is a great opportunity for new people with good software architecture skills to get involved in the project. The general low to mid level patterns and details needed for the implementation are already available in the codebase - from model conversion, to data loading, backend usage and inference. It would take some high-level understanding of the project architecture in order to implement support for the vision models and extend the API in the correct way.

We really need more people with this sort of skillset, so at this point I feel it is better to wait and see if somebody will show up and take the opportunity to help out with the project long-term. Otherwise, I'm afraid we won't be able to sustain the quality of the project.

@hidden1nin

Great! Good opportunities - from a developer's perspective, everyone loves to dive into the code. I would love to help but don't know where to start. Is there a list of requirements for the implementation, or should we just make something work for now? What would the finished implementation look like?

@mattepiu

Correct me if I'm wrong, but current multimodal open-source models are essentially just the usual LLM plus the ability to accept images as input.
If so, keeping in mind that it should be modular (i.e. derivable to accept audio), it would essentially need a base class to load an image, convert/rescale/normalize it with a library like OpenCV, and then produce an output tensor that is used as input for the inference, after which the output would not change and would be purely text based (no image generation, no automatic target recognition with frames around objects). A minimal sketch of that preprocessing step is shown below the list.
I can thus see a couple of issues:

  1. dependency on external libraries (like OpenCV, not the slimmest of dependencies)
  2. each LLM has its own way and its own code, so all the conversion operations from the image to the tensor should be optional or even swappable (a rearrangeable/programmable pipeline to allow compatibility with as many models as possible)
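For reference, here is a minimal sketch of what such a preprocessing step could look like without heavy dependencies - plain C++, nearest-neighbour resize for brevity, and the normalization constants commonly used by CLIP-style encoders. This is only an illustration of the idea, not code from llama.cpp:

#include <cstdint>
#include <vector>

// Hypothetical, dependency-free preprocessing sketch: nearest-neighbour resize
// of an interleaved RGB8 bitmap to target x target, followed by per-channel
// normalization into a [3, target, target] float buffer.
static std::vector<float> preprocess_rgb8(const uint8_t * rgb, int w, int h, int target) {
    const float mean[3] = {0.48145466f, 0.4578275f, 0.40821073f}; // CLIP-style constants
    const float stdv[3] = {0.26862954f, 0.26130258f, 0.27577711f};
    std::vector<float> out(3 * target * target);
    for (int y = 0; y < target; y++) {
        for (int x = 0; x < target; x++) {
            const int sx = x * w / target; // nearest source pixel
            const int sy = y * h / target;
            const uint8_t * px = rgb + 3 * (sy * w + sx);
            for (int c = 0; c < 3; c++) {
                const float v = px[c] / 255.0f;
                out[c * target * target + y * target + x] = (v - mean[c]) / stdv[c];
            }
        }
    }
    return out; // ready to copy into an input tensor
}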

@ngxson
Collaborator Author

ngxson commented Sep 27, 2024

IMO the llava example and the clip.cpp implementation are already a good start. Basically, what we need now is:

  • Define the list of API calls and data structs that must be added to llama.h. For example, something like clip_image_encode could be moved to llama_clip_image_encode (rough sketch below).
  • Get rid of the model surgery. If llama.cpp has native vision support, it's better to have both the language and vision models in one single gguf. The convert_hf_to_gguf.py script must be modified to handle this.
  • Expand llama_model and llama_context to hold vision data (model weights, temporary tensors, etc.). This will also allow using mmap and n_gpu_layers when loading the vision model, which is currently missing from clip.cpp.
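To make the first point concrete, a hypothetical sketch of what such additions to llama.h might look like - the names, types, and signatures below are illustrative assumptions, not an agreed-upon API:

// Hypothetical sketch only - none of these symbols exist in llama.h today.
struct llama_img;        // raw input image (pixels + dimensions)
struct llama_img_embd;   // embeddings produced by the vision encoder

// Encode an image with the vision tower stored in the same gguf as the LLM
// (the counterpart of clip_image_encode in the current llava example).
int32_t llama_clip_image_encode(
        struct llama_context   * ctx,
        const struct llama_img * img,
        struct llama_img_embd  * out);

// Number of embedding vectors ("image tokens") the encoded image will occupy.
int32_t llama_clip_n_img_tokens(
        const struct llama_context * ctx,
        const struct llama_img     * img);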

@ggerganov
Owner

I would love to help but don't know where to start. Is there a list of requirements for the implementation, or should we just make something work for now?

It's hard to make a list of requirements - I personally don't have the expertise and experience needed to decide what the best way to integrate multimodality is. It mainly depends on the way the functionality will be used - what the input and output are. The core implementation should be 99% the same as every transformer-based model.

What would the finished implementation look like?

Likely, libllama extended in a way that supports CLIP and multimodality. When this is ready, llama-server should be extended to accept images.

Correct me if I'm wrong, but current multimodal open-source models are essentially just the usual LLM plus the ability to accept images as input.

That's my understanding as well. In a similar way, Whisper is an LLM that, instead of text tokens, accepts raw audio, which is encoded and passed to the decoder.

dependency on external libraries (like OpenCV, not the slimmest of dependencies)

libllama should not depend on external libraries. The examples can depend on very lightweight, STB-like libraries. Particularly, OpenCV is a no-go.

IMO the llava example and the clip.cpp implementation are already a good start

Yes, I agree.

Define the list of API calls and data structs that must be added to llama.h. For example, something like clip_image_encode could be moved to llama_clip_image_encode

Yes, something along these lines, though I don't really have a good picture. Maybe even consider reusing llama_encode instead of a new CLIP-specific encoding API. After all, AFAIK, all encoders take embeddings as input and produce cross-KV + new embeddings, regardless of whether the input is text, audio, images, etc.

@ngxson ngxson linked a pull request Sep 29, 2024 that will close this issue
@ngxson
Collaborator Author

ngxson commented Sep 30, 2024

Yes, something along these lines, though I don't really have a good picture. Maybe even consider reusing llama_encode instead of a new CLIP-specific encoding API. After all, AFAIK, all encoders take embeddings as input and produce cross-KV + new embeddings, regardless of whether the input is text, audio, images, etc.

CLIP is quite different from Whisper because it doesn't use cross attention. Instead, the vision model outputs embeddings that can be taken as input by the language model. It also depends on the chat template to know where to put the embeddings. A common pattern that I observe looks like this:

<|im_start|>user
<image><put_embeddings_here></image>
what do you see in the image?
<|im_end|>

So, llama_encode may not be a good fit here, because we expect encode to enable cross attention (please correct if I'm wrong here).

My current solution in #9687 is to first call llama_vision_encode, then call llama_vision_get_embeddings to get the embeddings. After that, add the image embeddings to the llama_batch, then decode the batch to generate text.

Not sure if this is the best way to do it, though, as I'm still new to vision models. Feedback is welcome on this subject.
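For illustration, the flow described above might look roughly like the sketch below. The llama_vision_* names come from this comment, while the helper names and exact signatures are assumptions, not the actual #9687 API:

// Rough sketch of the proposed flow (hypothetical signatures).
llama_img * img = load_image("photo.png");               // assumed helper

// 1. run the vision encoder on the image
llama_vision_encode(ctx, img);

// 2. fetch the resulting embeddings: n_img_tok vectors of size n_embd
const float * img_embd  = llama_vision_get_embeddings(ctx);
const int32_t n_embd    = llama_n_embd(model);
const int32_t n_img_tok = llama_vision_n_tokens(ctx, img); // assumed helper

// 3. place the embeddings into a llama_batch at the position dictated by the
//    chat template (between <image> and </image>), then decode as usual
llama_batch batch = llama_batch_init(n_img_tok, /*embd =*/ n_embd, /*n_seq_max =*/ 1);
// ... copy img_embd into batch.embd and set pos/seq_id for each image token ...
llama_decode(ctx, batch);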

@ggerganov
Owner

So, llama_encode may not be a good fit here, because we expect encode to enable cross attention (please correct if I'm wrong here).

Yes, actually I'm looking at llama_encode and now I'm not sure it really needs to exist. It seems that all it does is set is_encoding = true; the rest is the same as llama_decode.

AFAICT, we need llama_vision_encode because the input for CLIP does not fit into the existing struct llama_batch and so we add the new struct llama_img_batch. Looking at clip.cpp, this is where the raw input images become embeddings:

struct ggml_tensor * inp_raw = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, image_size_width, image_size_height, 3, batch_size);
ggml_set_name(inp_raw, "inp_raw");
ggml_set_input(inp_raw);
struct ggml_tensor * inp = ggml_conv_2d(ctx0, model.patch_embeddings, inp_raw, patch_size, patch_size, 0, 0, 1, 1);

I'm wondering if the right approach is to try to abstract only the part where the raw input becomes embeddings. For example, we currently support 2 input types with the existing llama_batch + llm_build_inp_embd:

  • Raw embeddings through llama_batch.embd. Simply copy the input into the embedding tensor:

    llama.cpp/src/llama.cpp, lines 9103-9105 (commit 8277a81):

    lctx.inp_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, batch.n_tokens);
    inpL = lctx.inp_embd;
    ggml_set_input(lctx.inp_embd);

  • Text tokens with learned embeddings. Get the embeddings for each token from a model weight tensor:

    llama.cpp/src/llama.cpp, lines 9097-9101 (commit 8277a81):

    lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
    cb(lctx.inp_tokens, "inp_tokens", -1);
    ggml_set_input(lctx.inp_tokens);
    inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);

The natural continuation seems to be to extend support for images by adding them to llama_batch and update llm_build_inp_embd to do the convolution similar to current clip.cpp in order to obtain the embeddings. Obviously, in this approach llama_batch has to be updated in a way that would allow us to extend it in the future with more types of input such as audio, video, touch, smell, etc. So it's something to consider, though regardless of what we decide, llama_batch could use some refactoring in the style of llama_sampling, since it's already showing some deficiencies (e.g. #9668).
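As a thought experiment, the branch inside llm_build_inp_embd could look roughly like the fragment below, reusing the conv2d patch embedding quoted from clip.cpp above. The img_* batch fields are assumptions for illustration, not existing code:

struct ggml_tensor * inpL;
if (batch.img_data) {
    // raw pixels -> patch embeddings, as clip.cpp does today (hypothetical batch fields)
    struct ggml_tensor * inp_raw = ggml_new_tensor_4d(ctx, GGML_TYPE_F32,
            batch.img_w, batch.img_h, 3, batch.n_imgs);
    ggml_set_input(inp_raw);
    inpL = ggml_conv_2d(ctx, model.patch_embeddings, inp_raw, patch_size, patch_size, 0, 0, 1, 1);
} else if (batch.embd) {
    // raw embeddings supplied by the user (existing path)
    lctx.inp_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, batch.n_tokens);
    ggml_set_input(lctx.inp_embd);
    inpL = lctx.inp_embd;
} else {
    // text tokens -> learned embeddings (existing path)
    lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
    ggml_set_input(lctx.inp_tokens);
    inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);
}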

@slaren
Collaborator

slaren commented Sep 30, 2024

I would suggest designing the API so that it is at least possible to implement it in a way that avoids copying the embeddings to the CPU and then back to the GPU. It doesn't have to be implemented that way in the first iteration, but at some point, if this becomes an important feature, we will want to optimize it and remove these copies, and being able to do that without changing the API would make it much less painful.

llama_batch could use some refactoring in the style of llama_sampling

I have thought that llama_batch would be more intuitive if it were a collection of sequences rather than a collection of tokens. Each sequence would only need to have a seq_id and a collection of tokens. pos should not be necessary at all; it can just be taken from the KV cache.
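Purely as an illustration of this idea (the names and layout below are made up, not a proposal from this thread), a sequence-based batch could be shaped something like:

// Illustrative only - a batch as a collection of sequences rather than tokens.
struct llama_seq_input {
    llama_seq_id        seq_id;    // which sequence these tokens extend
    int32_t             n_tokens;
    const llama_token * tokens;    // or, alternatively, embeddings for this sequence
};

struct llama_batch_seqs {
    int32_t                        n_seqs;
    const struct llama_seq_input * seqs;
    // no per-token pos: positions are derived from the current state of the KV cache
};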

@ngxson
Collaborator Author

ngxson commented Oct 1, 2024

The natural continuation seems to be to extend support for images by adding them to llama_batch and update llm_build_inp_embd to do the convolution similar to current clip.cpp in order to obtain the embeddings.

Yes, that makes sense. Adding image input directly to llama_batch is feasible, but with the small cost of having the user calculate the token positions after inserting an image. An implementation that I'm thinking about is:

int i = 0;
for (auto tok : token_list) {
    if (llama_token_image(ctx) == tok) {
        llama_batch_add(batch, img, i, { 0 }, false);
        i += llama_n_tokens_image(ctx, img); // keep track of how many tokens an image occupies
    } else {
        llama_batch_add(batch, tok, i, { 0 }, false);
        i++;
    }
}

@slaren This implementation should allow us to keep the current method for now (copy embeddings GPU --> CPU --> GPU), then eliminate the copy in the future without a breaking change.

My initial idea was to make llama-vision.cpp quite a black box when viewed from llama.cpp (image in, embeddings out).

Having llm_build_inp_embd set the image into inp_raw would expose more of its internals (i.e. the cgraph). For now that would be quite messy, given that the current llava implementation doesn't even support batching. So, I think I'll continue with the initial approach (but expose the API using llama_batch as proposed above), then improve it in the future.

@slaren
Collaborator

slaren commented Oct 1, 2024

I have to say that I don't really see any good reason to put images in the batch. Images will need to be processed separately anyway, and it will make a data structure that is already quite complicated and rather unpleasant to use even more so. I also expect that the code to deal with this will have a significant amount of complexity.

@ngxson
Collaborator Author

ngxson commented Oct 1, 2024

I'm wondering if the right approach is to try to abstract only the part where the raw input becomes embeddings. For example, we currently support 2 input types with the existing llama_batch + llm_build_inp_embd:

@ggerganov Sorry, I misunderstood this part. I don't think we should separate out the part where inp_raw becomes embeddings, since it would increase the complexity of the code without providing much benefit to API users.

Edit: To explain a bit more, in such a case the user would have to call something like:

  1. llama_img_to_embeddings to convert img to inp embeddings
  2. Put the embeddings into a batch
  3. Call llama_vision_encode
  4. Get the output embeddings after encode
  5. Put the embeddings into language batch
  6. llama_decode, then get logits

Re @slaren: yeah, I see your point. I thought that if llama_batch contained both image and text, then we would expect llama_decode to internally call both vision encode and language decode, then output the next token. Such an API call would be very transparent to the end user.

But in this case, it no longer matches the behavior of llama_encode. And worse, doing it this way would increase the complexity of llama_decode, which is bad for returning a status on error.

For now, I think I will stick with the separate llama_img_batch for images, probably with a better API to copy image tokens to the language batch. Something like this:

// assuming we have batch of 2 images
int img_status = llama_vision_encode(ctx, &img_batch);
// maybe check img_status

// position of image tokens in language batch, based on tokenized text input
std::vector<llama_pos> img_pos = { 12, 14 };

// batch for language part
llama_batch batch = llama_batch_init(512, 0, 1);
// copy the encoded image embeddings into the language batch at the given positions
llama_batch_add_decoded_imgs(ctx, batch, img_pos.data(), img_pos.size());

@ggerganov
Owner

I have to say that I don't really see any good reason to put images in the batch.

The main reason I had in mind is to reuse llama_decode for everything. Something like:

llama_batch_clear();
llama_batch_add_img(batch, img);
llama_decode(ctx, batch);
embd_img = llama_get_embeddings(ctx);

llama_batch_clear();
llama_batch_add_token(batch, id, ...);
llama_batch_add_embd(batch, embd_img, ...);
llama_decode(ctx, batch);

An alternative that also sounds good to me is to have a batch type for each input type:

struct llama_batch_embd;
struct llama_batch_text;
struct llama_batch_img;
struct llama_batch_audio;
...

In this case we would have the corresponding:

llama_decode_embd (ctx, batch_embd);
llama_decode_text (ctx, batch_text);
llama_decode_img  (ctx, batch_img);
llama_decode_audio(ctx, batch_audio);
...

And internally all these will end up calling a common llama_decode_impl().

llama_img_to_embeddings to convert img to inp embeddings

No, I didn't mean for the user to manually do the conversion to embeddings - just to extend the existing internal implementation of llm_build_inp_embd to convert the input to embeddings.

I'm thinking along the lines of having just llama_decode() as the API and avoiding llama_encode() and llama_vision_encode(). Just throwing out some thoughts, feel free to ignore, as I could be missing something.

@ngxson
Collaborator Author

ngxson commented Oct 1, 2024

I'm thinking along the lines of having just llama_decode() as the API and avoiding llama_encode() and llama_vision_encode(). Just throwing out some thoughts, feel free to ignore, as I could be missing something.

Personally, I would prefer having multiple separate functions instead of a single llama_decode(), even if they are just wrappers around a single llama_decode_impl() under the hood. The reason is that this is much easier to understand. In addition, a breaking change in one API will not affect the others.

Side story: I've been doing some llama.cpp intros for co-workers at HF, and I found it very handy to say "for decoder-only models, use llama_decode(); for encoder-decoder models, you just need to add llama_encode()". So having different functions is very good for explaining the code.

Something like llama_encode_img() will also be useful for error codes in the future. For example, in the post-processing phase, different errors can be reported, something like: "image is too small", "dimension not supported", etc.
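As an illustration only (none of these symbols exist), the kind of status codes such a function could return might look like:

// Purely illustrative status codes for a hypothetical llama_encode_img().
enum llama_vision_status {
    LLAMA_VISION_OK                  =  0,
    LLAMA_VISION_ERR_IMG_TOO_SMALL   = -1, // "image is too small"
    LLAMA_VISION_ERR_BAD_DIMENSIONS  = -2, // "dimension not supported"
    LLAMA_VISION_ERR_PREPROCESS_FAIL = -3, // failed during the preprocessing phase
};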

@ngxson
Collaborator Author

ngxson commented Oct 2, 2024

I had a quick look at the Llama 3.2 vision implementation. It uses cross attention instead of embeddings.

So I need to take into account cross-attention usage that will come in the future. I think an API like llama_encode would be the best choice.

Compared to my initial proposal, the positions of the images must now be added to the batch too:

typedef struct llama_batch_img {
    int32_t      n_imgs;
    llama_img ** imgs;
    llama_pos *  pos;
} llama_batch_img;

// usage:
llama_batch_img ibatch = llama_batch_img_init(2); // for example, a batch of 2 images
...
llama_vision_decode(ctx, ibatch); // output is saved to ctx or the KV cache, depending on the implementation
llama_decode(ctx, batch);         // decode the language batch
// then, get logits
...

Contributor

github-actions bot commented Nov 16, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

Chill out man, I've just come back from vacation

@Cheesper

Any updates on this? 9 months without multimodal.

@ngxson
Collaborator Author

ngxson commented Jan 17, 2025

Actually, I lost my motivation to work on #9687 because I'm still not very happy with my proposal for the API. The problem is that placing the image embeddings in the correct place can be quite tricky, as it's often controlled by the chat template.

Thinking about it again today, I believe it would be better to make the API more explicit.

The main assumption is that we have these similar terms between language and vision:

|          | input  | symbol  | output     |
|----------|--------|---------|------------|
| language | text   | tokens  | logits     |
| vision   | bitmap | patches | embeddings |

So now my proposal for the flow is:

Vision part:

  1. The user creates an image, llama_vision_bitmap
  2. llama_vision_patches_init(llama_vision_bitmap bmp) breaks the image into patches and preprocesses it --> this can only be done on the CPU anyway. It returns a llama_vision_patches.
    The reason why this should be a separate step is because:
    • This step is the equivalent of "tokenize" in the language part.
    • Sometimes the number of patches is not fixed. It's important to know the number of projected tokens, so the user can leave room at the right positions when tokenizing the prompt.
    • It's easier to maintain error codes if the function does just one job.
  3. Construct a llama_vision_batch containing multiple llama_vision_patches
  4. llama_vision_encode(llama_vision_batch ibatch) runs the inference; the result will be embeddings, saved to a tensor in ctx.output_vision, avoiding copying the embeddings into user space.

Now the tricky part: how can we indicate in the language batch (llama_batch) that we want to use certain embeddings from ctx.output_vision? My proposal is to use special token IDs that contain the row number in ctx.output_vision.

We can reserve the 2 most significant bits to mark the token as "vision": vision_token_id = row_num | 0xC0 << 24 (see the sketch below).

Then add both language tokens (tokenized from text) and vision tokens to the llama_batch

Upon receiving the batch, llama_decode will ggml_get_rows(ctx, tok_vision, ctx.output_vision) and concat it with inpL
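A small sketch of the tagging arithmetic described above - only an illustration of the bit layout, with hypothetical helper names:

#include <stdbool.h>
#include <stdint.h>

// Reserve the two most significant bits of a 32-bit token id to mark "vision"
// tokens whose remaining bits index a row of ctx.output_vision.
#define VISION_TOKEN_MASK 0xC0000000u          // i.e. 0xC0 << 24

static inline uint32_t make_vision_token(uint32_t row_num) {
    return row_num | VISION_TOKEN_MASK;
}

static inline bool is_vision_token(uint32_t id) {
    return (id & VISION_TOKEN_MASK) == VISION_TOKEN_MASK;
}

static inline uint32_t vision_token_row(uint32_t id) {
    return id & ~VISION_TOKEN_MASK;            // recover the row in ctx.output_vision
}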


@ggerganov
Owner

Is llama_vision_encode effectively the CLIP model - is that correct? Does it convert the image patches to embedding vectors?

@ngxson
Collaborator Author

ngxson commented Jan 17, 2025

Yes, llama_vision_encode is CLIP. It works like an embedding model with LLAMA_POOLING_TYPE_NONE. The number of output embeddings == the number of input patches.

The difference is that there is an extra projection step to make sure the vision and text embedding dimensions are the same.

P.S.: probably llama_vision_decode is a better name (not "encode")

@slaren
Collaborator

slaren commented Jan 17, 2025

I think you could simplify all of this by working directly with ggml_tensor. Make llama_vision_encode return a ggml_tensor with the embeddings, and allow passing embeddings in llama_batch as a ggml_tensor (rather than as a plain pointer to float). This will remove the need for copies (between CPU and GPU) in the future and keep the API generic.

@ngxson
Collaborator Author

ngxson commented Jan 17, 2025

@slaren indeed, my suggestion is already (mostly) what you said. Instead of returning a ggml_tensor from llama_vision_encode, I'm proposing that we temporarily store it in ctx.output_vision.

When doing llama_decode for text tokens, it retrieves the vision embeddings directly from ctx.output_vision without doing any copy. This allows us not to modify llama_batch for now, which saves us from having to change too many things inside llama-batch.cpp (which is, I think, far too complicated to do atm).

@ngxson
Collaborator Author

ngxson commented Jan 17, 2025

Also, technically speaking, llama_vision_encode should only return a status code. However, we can have llama_vision_get_embeddings, much like what we're already doing with llama_get_embeddings_ith.

llama_vision_get_embeddings can simply return ctx.output_vision in this case.

@slaren
Collaborator

slaren commented Jan 17, 2025

I still think it would be good to modify llama_batch to represent a list of sequences. This would allow mixing sequences of embeddings and sequences of tokens, then the vision embeddings could just be added as additional sequences in the batch.

@ngxson
Collaborator Author

ngxson commented Jan 17, 2025

Modifying llama_batch is quite outside my knowledge (I've read the code of llama-batch.cpp but still can't deeply understand it), so I think I will leave it mostly as-is. But I agree that a sequence-based batch would be more intuitive for downstream projects.

With my last proposal, I think migrating from a token-based batch to a sequence-based batch won't be too complicated in the future. As said, llama_vision_get_embeddings can easily be added to help with this.

Not sure if there are other things to worry about here; feel free to comment if you can think of any. On my side, I'll start working on this in the next few days - I should start while I still have the motivation 😆

@slaren
Collaborator

slaren commented Jan 17, 2025

Personally I wouldn't like adding special values to tokens, or coupling the llama_context with the vision context. If you don't want to modify llama_batch at this point, then doing multiple calls to llama_decode, first with the vision embeddings and then with the tokens, should also be fine. llama_batch can always be changed later to allow everything in the same batch.

@ngxson
Collaborator Author

ngxson commented Jan 17, 2025

Hmm, with the approach you describe, it will be tricky to support batched images, because ctx.output_vision can contain embeddings from multiple images.

But I think we can skip that for now, given that it would also be complicated for me to add batching to llama_vision_encode.

So for now we can simplify it to:

  • llama_vision_encode accepts only single image (a.k.a llama_vision_patches)
  • After calling vision_encode, user calls llama_batch_init_from_vision to get a llama_batch that "points" to ctx.output_vision tensor. No tokens can be added to this batch (exclusive for vision embd)
  • llama_decode is called on the batch

then doing multiple calls to llama_decode, first with the vision embeddings and then with the tokens, should also be fine.

Please note that llama_decode will need to be called more than twice. For example, with a formatted chat like this:

<s>user\n describe this photo:<image_tokens_here> then tell me what color is the balloon </s>

Then, we need to call llama_decode 3 times (roughly as sketched after the list below):

  • One for <s>user\n describe this photo:
  • One for <image_tokens_here>, with the batch from llama_batch_init_from_vision
  • One for then tell me what color is the balloon </s>
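A rough sketch of those three calls, using the hypothetical helpers discussed above (tokenize_and_batch and the position bookkeeping are assumptions for illustration):

llama_pos pos = 0;

// 1. "<s>user\n describe this photo:"
llama_batch text1 = tokenize_and_batch(ctx, prefix_text, pos);    // assumed helper
llama_decode(ctx, text1);
pos += text1.n_tokens;

// 2. the image embeddings, in an exclusive batch built from ctx.output_vision
llama_vision_encode(ctx, patches);
llama_batch ibatch = llama_batch_init_from_vision(ctx, pos);      // proposed above
llama_decode(ctx, ibatch);
pos += n_image_tokens;                                            // number of projected image tokens

// 3. " then tell me what color is the balloon </s>"
llama_batch text2 = tokenize_and_batch(ctx, suffix_text, pos);
llama_decode(ctx, text2);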

@slaren
Collaborator

slaren commented Jan 17, 2025

llama_batch_init_from_vision may even be overkill; building the batch should only require something like this:

llama_batch batch = {
    /*n_tokens       =*/ 0, // implied from embd_tensor dimensions
    /*tokens         =*/ nullptr,
    /*embd           =*/ nullptr,
    /*embd_tensor    =*/ llama_vision_get_embeddings(vision_ctx),
    /*pos            =*/ nullptr,
    /*n_seq_id       =*/ nullptr,
    /*seq_id         =*/ nullptr,
    /*logits         =*/ nullptr,
};

To support batching later, llama_vision_get_embeddings_ith could be added to return the embeddings of the ith image.

@ngxson ngxson linked a pull request Jan 18, 2025 that will close this issue
@tellsiddh

Hey @ngxson, I noticed you were working on bringing vision capabilities back to llama.cpp, and your last commit shows you were working on MiniCPM. Not sure if it helps, but here is an implementation of this model that works with llama.cpp - the Open BMB llama.cpp fork - minicpm-v2.5

Thank you for your work!
