llama : first attempt to implement vision API (WIP) #9687
base: master
Conversation
gguf-py/gguf/constants.py (Outdated)

```diff
@@ -178,6 +178,28 @@ class Adapter:
     TYPE       = "adapter.type"
     LORA_ALPHA = "adapter.lora.alpha"

+class Vision:
+    # only support vision.type = "clip" for now
```
Probably better to be very specific here that you are supporting ViT. HuggingFace has also moved away from calling everything by generic terms. Also, I don't think the purpose is to support actual CLIP inference.
Yeah, agree. I think for now it's safer to call this `clip-vit` to reflect that the base implementation is `openai/clip-vit-*`. Atm it's quite complicated for me to drop the `clip_` prefix in all functions. But hey, at least the file name is now `llama-vision.{cpp|h}` instead of `llama-clip`, which should reflect that we can support ViT and also other things to come in the future.
```cpp
    image_output = std::move(padded_image);
}

static void normalize_image_u8_to_f32(const clip_image_u8 src, clip_image_f32 dst, const std::array<float, 3> & mean, const std::array<float, 3> & std) {
```
This might be a root cause for some bugs ;)
Yes, probably missing `clip_image_f32 &`; thanks for pointing this out.
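For illustration, here is a minimal, self-contained sketch of the kind of fix being discussed, passing both images by reference. The struct layouts are assumptions made only to keep the example compilable, not the PR's actual definitions:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Minimal stand-ins for the image types used in the PR (assumed layout).
struct clip_image_u8  { int nx = 0; int ny = 0; std::vector<uint8_t> buf; };
struct clip_image_f32 { int nx = 0; int ny = 0; std::vector<float>   buf; };

// Corrected signature: src is taken by const reference and dst by reference,
// so the caller's dst is actually filled and src is not copied on every call.
static void normalize_image_u8_to_f32(const clip_image_u8 & src, clip_image_f32 & dst,
                                      const std::array<float, 3> & mean,
                                      const std::array<float, 3> & std) {
    dst.nx = src.nx;
    dst.ny = src.ny;
    dst.buf.resize(src.buf.size());
    for (size_t i = 0; i < src.buf.size(); i++) {
        const size_t c = i % 3; // channel index, assuming interleaved RGB data
        dst.buf[i] = (static_cast<float>(src.buf[i]) / 255.0f - mean[c]) / std[c];
    }
}
```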
Forgive me for (probably) going off topic, but would it be possible to use a CLIP model with llama.cpp to compute text embeddings for the "classic" purpose of matching text with images?
I'm not 100% sure, but in theory, everything is possible with the correct model weights and the correct compute graph.
I took a shot at rebasing this branch, as there have been quite a few changes upstream that affect this PR and I wanted to try this API out. I wanted to share the rebased repo in case it would be useful/save time. I moved the example to a new example named simple-vision as the
@ngxson are you still planning on working on this? Having this as a first step towards llama-3.2-vision would be very useful.
I'm not actively working on it, but will continue soon. Ref discussion: #10381
@ngxson I've been taking a look at this and tried adding some initial support for Llama 3.2 Vision Instruct. I've added a simple-vision-mllama example that shows the usage and also contains some more details. The code needs more work and possibly integration with the existing vision API, but my goal was only to get something working as a first step. I've rebased the above linked branch (which builds upon this PR's code) with master, and after the latest code refactoring there I need to revisit some of my changes. But if this looks like it would be worth pursuing, I'd be happy to continue working on it. If nothing else, perhaps the model conversion and quantization could be used from this.
(Hopefully) fix #8010
Important
This is still WIP, only the `simple` example is working. Collaborators are encouraged to discuss and give feedback on this.
Motivation
Currently, the vision capability is provided by the `llava` example, which is a CLIP implementation in ggml. While it's a good start, the API needs some refactoring to be cleaner and more future-proof.
Inspired by the current rework of the sampling API, I propose to move the CLIP implementation into the main `libllama`, providing users a stable, easy-to-use API like what we did for `llama_encode`.
The goals of this refactoring are:
- `llama-cli` accepts image input

The no-goals:
- `llama-server` support. It will be another PR.

Plan
- `libllama`
- `convert_hf_to_gguf.py` to support llava --> not an ideal implementation, but kinda works
- `llama_model` and `llama_context` to hold vision-related data
- `llama-vision.{cpp|h}`
- `llama-cli`

Implementation
Naming scheme
For metadata, we will add the `vision.*` namespace:
- `vision.type`: the type of vision encoder. We only support `"clip"` for now (not sure if there are any other implementations out there)
- `vision.*`: other params for vision encoding, for example patch size, image size, etc.
- `vision.clip.*`: CLIP-related params

Example:
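As an illustration only (apart from `vision.type`, the key names and values below are assumptions, not taken from this PR):

```
vision.type                  = "clip"
vision.image_size            = 336
vision.patch_size            = 14
vision.clip.block_count      = 24
vision.clip.embedding_length = 1024
```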
For the tensor naming scheme, we will prefix all vision-related tensors with `v.enc.*`. For example:
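As an illustration, assuming a ViT-style encoder (the specific tensor names below are assumptions, not taken from this PR):

```
v.enc.embd.patch           # patch embedding
v.enc.embd.pos             # position embedding
v.enc.blk.0.attn_q.weight  # attention weights of the first encoder block
v.enc.blk.0.ffn_up.weight  # FFN weights of the first encoder block
```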
API
`libllama` will be responsible for: