
Add LLaVA Support #484

Merged
merged 44 commits on Jul 5, 2024
Conversation

chenwanqq
Contributor

@chenwanqq chenwanqq commented Jun 27, 2024

Introduction

This implementation is based on my work for candle. However, it incorporates some notable differences:

  • I have completely removed support for the model format used in the original liuhaotian/llava repo. Instead, I adopted the model format from the llava-hf repo, which provides a more standardized way of using the model, particularly the tokenizer.
  • Similar to the Python transformers library, I have split the original project into LLaVA (1.5) and LLaVANext (1.6). This reduces the number of if-else branches and simplifies the code.
  • I have revamped the process of concatenating text and image features. The updated method aligns closely with phi3v's implementation in this repo, ensuring a more uniform coding style (a minimal sketch follows after this list).
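
A minimal sketch of the feature-concatenation step described in the last point, assuming candle tensors; the function and variable names (merge_text_and_images, text_embeds, image_features, image_positions) are illustrative, not this PR's actual identifiers:

use candle_core::{Result, Tensor};

// Splice projected image features into the text embeddings at the image
// placeholder positions (phi3v-style merge). Illustrative only.
//
// text_embeds:     [seq_len, hidden]
// image_features:  one [num_img_tokens, hidden] tensor per image
// image_positions: index of each placeholder token in the sequence
fn merge_text_and_images(
    text_embeds: &Tensor,
    image_features: &[Tensor],
    image_positions: &[usize],
) -> Result<Tensor> {
    let mut chunks = Vec::new();
    let mut cursor = 0usize;
    for (feat, &pos) in image_features.iter().zip(image_positions) {
        // Text embeddings before this image placeholder.
        if pos > cursor {
            chunks.push(text_embeds.narrow(0, cursor, pos - cursor)?);
        }
        // The image features take the place of the placeholder token.
        chunks.push(feat.clone());
        cursor = pos + 1;
    }
    // Remaining text after the last image.
    let seq_len = text_embeds.dim(0)?;
    if cursor < seq_len {
        chunks.push(text_embeds.narrow(0, cursor, seq_len - cursor)?);
    }
    Tensor::cat(&chunks, 0)
}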

Still Working!

Some Notes

Status

  • Implement model
    • LLaVANext (1.6)
    • LLaVA 1.5
      • Input processor
      • Model structure
  • Support for different LLM Backends
    • LLaMA-like models (vicuna, Nous-Hermes-2-Yi-34B)
    • Mistral
  • Cleanup of some debug code
  • Documentation
  • Examples


github-actions bot commented Jun 27, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   10           66           65            0            1
 Python                 36         1412         1220           38          154
 TOML                   18          518          458           11           49
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               21         1554            0         1173          381
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           98           88            0           10
 |- Rust                 4          221          198           10           13
 |- TOML                 2           75           63            0           12
 (Total)                           2061          459         1183          419
-------------------------------------------------------------------------------
 Rust                  132        44214        40090          775         3349
 |- Markdown            77          723           13          671           39
 (Total)                          44937        40103         1446         3388
===============================================================================
 Total                 221        48240        42227         1997         4016
===============================================================================
  

@EricLBuehler EricLBuehler added the labels new feature (New feature or request) and models (Additions to model or architectures) on Jun 27, 2024
@EricLBuehler
Owner

Hi @chenwanqq! Please let me know when this is ready for review.

@chenwanqq chenwanqq marked this pull request as ready for review July 1, 2024 14:58
@chenwanqq
Contributor Author

Hi @chenwanqq! Please let me know when this is ready for review.

Hhh, I think it's ready. Some of the docs and examples are a bit rushed, but I think we can review it 🤣

@chenwanqq
Contributor Author

chenwanqq commented Jul 1, 2024

Great, I'll take a look. I think that the recent merge of #476 caused some merge conflicts here, can you please take a look?

Edit: Seems like they are trivial to fix.

The addition of AnyMoE (which modifies the dependencies of NormalModel) affects the LLaVA LLM backend. However, the current solution based on copied code is only temporary, so I have removed the NormalModel trait from LLaVALLM altogether. Once the fused RoPE is fixed, we should switch LLaVA back to relying on an ordinary model as soon as possible.

Edit: I added AnyMoE support code for LLaVA, but never really tested it. Looking at Phi3V, it seems AnyMoE only affects the LLM part and has nothing to do with the image embedding?

@chenwanqq
Contributor Author

And LLaVANext is just LLaVA 1.5 plus a multi-resolution trick (anyres), so they share some infrastructure. If this causes any confusion, we can modify it.

Owner

@EricLBuehler EricLBuehler left a comment


Hi @chenwanqq! This looks amazing, thanks for adding it.

I think there are some inconsistencies (probably mistakes?) with the image tokens in the examples: some look like they are from Idefics (no image token) and some like they are from Phi 3 (which uses <|image_1|>), but according to the HF model card, LLaVA expects <image>. This also extends to the regexes in the inputs processor. I have commented where the changes should be made; can you please update them?
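
For illustration only, a hedged sketch of what matching the <image> token in the inputs processor could look like; the regex crate usage and the helper name count_image_placeholders are assumptions, not this PR's actual code:

use regex::Regex;

// Count <image> placeholders in a prompt so the processor knows how many
// images to expect (illustrative helper, not the PR's implementation).
fn count_image_placeholders(prompt: &str) -> usize {
    // The llava-hf model cards use the literal <image> token.
    let re = Regex::new(r"<image>").expect("valid regex");
    re.find_iter(prompt).count()
}

fn main() {
    let prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:";
    assert_eq!(count_image_placeholders(prompt), 1);
}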

Review comments were left on the following files:
  • docs/LLaVA.md
  • mistralrs-core/src/vision_models/llava/llava_next.rs
  • mistralrs/Cargo.toml
  • mistralrs/examples/llava/main.rs
  • mistralrs/examples/llava_next/main.rs
@chenwanqq
Contributor Author

Seems like quite a lot for me to check! 🧐 I will look at it later; just to explain a few things:
(1) As I mentioned, I did not use llava's original image-tag processing logic, but instead adopted processing logic similar to phi3v's. This is for compatibility with this repo's input handling. For example, llava's original logic does not calculate num_img_tokens before encode_image; additionally, phi3v's logic is better at handling multiple images. Since <image> in llava is just a placeholder, if we set up the chat_template correctly (the same as the backend LLM), the effects of the two implementations will be the same. I think we should add a special note about this? (A minimal sketch follows at the end of this comment.)
(2) llava supports multiple images, but this is only practically meaningful in version 1.5. In version 1.6, the multi-resolution processing leads to longer feature representations for each image, while the context length is only 4K, so multiple images will be truncated.
(3) Yes, it is only because of the RoPE problem that I copied the LLM code.
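
To make point (1) concrete, here is a minimal sketch of the phi3v-style flow in which the number of image tokens is known before encoding and the <image> placeholder is expanded accordingly; the 576-token figure (24 x 24 patches for CLIP ViT-L/14 at 336px in LLaVA 1.5) and all names here are illustrative assumptions, not this PR's code:

// Expand each <image> placeholder into num_img_tokens copies so the tokenized
// prompt already reserves one position per image feature (illustrative sketch
// of the phi3v-style preprocessing described in point (1)).
fn expand_image_placeholders(prompt: &str, num_img_tokens: usize) -> String {
    // For LLaVA 1.5 with CLIP ViT-L/14-336 this would be 24 * 24 = 576 tokens
    // per image; LLaVA-Next (anyres) produces even more per image, which is
    // why multiple images quickly overflow a 4K context (point (2) above).
    prompt.replace("<image>", &"<image>".repeat(num_img_tokens))
}

fn main() {
    let expanded = expand_image_placeholders("USER: <image>\nDescribe the image. ASSISTANT:", 576);
    assert_eq!(expanded.matches("<image>").count(), 576);
}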

@EricLBuehler
Owner

As I mentioned, I did not use llava's original image-tag processing logic, but instead adopted processing logic similar to phi3v's. [...] Since <image> in llava is just a placeholder, if we set up the chat_template correctly (the same as the backend LLM), the effects of the two implementations will be the same. I think we should add a special note about this?

I see. I think we should use the Llava <image> token to avoid confusion, even though it is a placeholder. It should be an easy change; most of the comments I made are about that.

(2) llava supports multiple images, but this is only practically meaningful in version 1.5. In version 1.6, the multi-resolution processing leads to longer feature representations for each image, while the context length is only 4K, so multiple images will be truncated.

That makes sense. Given that we support multiple images, it would be great if we could document that! 😃

@chenwanqq
Contributor Author

I've fixed the image tag problem and the AnyMoE problem accordingly. Maybe we can have another round of review?

@EricLBuehler
Owner

Hi @chenwanqq! I made some edits for some small updates to the model support matrix and model architecture docs.

This looks good, but it crashes when I run:

cargo run --release --features cuda -- --port 1234 --isq Q4K vision-plain -m llava-hf/llava-v1.6-mistral-7b-hf -a llava_next

And

python3 examples/server/llava_next.py

With

prompt step - Model failed with error: WithBacktrace { inner: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "conv2d" }, backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "candle_core::storage::Storage::same_device" }, { fn: "candle_core::storage::Storage::conv2d" }, { fn: "candle_core::conv::<impl candle_core::tensor::Tensor>::conv2d_single_group" }, { fn: "candle_core::conv::<impl candle_core::tensor::Tensor>::conv2d" }, { fn: "<candle_nn::conv::Conv2d as candle_core::Module>::forward" }, { fn: "<mistralrs_core::vision_models::clip::ClipVisionEmbeddings as candle_core::Module>::forward" }, { fn: "mistralrs_core::vision_models::clip::ClipVisionTransformer::forward_get_hidden_states" }, { fn: "mistralrs_core::vision_models::llava::llava_next::Model::encode_images" }, { fn: "<mistralrs_core::vision_models::llava::llava_next::Model as mistralrs_core::pipeline::VisionModel>::forward" }, { fn: "<mistralrs_core::pipeline::vision::VisionPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "tokio::runtime::runtime::Runtime::block_on" }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "start_thread", file: "./nptl/pthread_create.c", line: 442 }, { fn: "__GI___clone3", file: "./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S", line: 81 }] }

It seems to come from the CLIP embeddings. Note that for ISQ models, the VBs are on the CPU and are cast using mapper.set_nm_device. Pass loading_isq as false if this is not going to be touched by ISQ, and otherwise, be sure to pass the correct parameter from the normal loading metadata.

I would recommend that you check out the Idefics 2 model, where I do .set_device(text_model.device().clone()) several times, which sets it to the primary device. This may be what you want. Please let me know if you have any questions!

When I run without ISQ, it successfully produces the correct output though - great work here!

@chenwanqq
Contributor Author

That's exactly the problem! With the new commit, I believe I have fixed it (tested with llava-v1.6-vicuna-7b-hf, llava-v1.6-mistral-7b-hf, and llava-1.5-7b-hf). However, when I deal with Llama-like models (such as Vicuna), there is a device mismatch problem with rmsnorm:

prompt step - Model failed with error: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "rms-norm" }
^C

When I checked the code, I found that the Llama and Mistral code have slight differences in initialization:

// Llama: passes `loading_isq` to mapper.set_device
let rms_1 = RmsNorm::new(
    cfg.hidden_size,
    cfg.rms_norm_eps,
    mapper.set_device(layer_idx, vb.pp("input_layernorm"), loading_isq),
)?;
let rms_2 = RmsNorm::new(
    cfg.hidden_size,
    cfg.rms_norm_eps,
    mapper.set_device(layer_idx, vb.pp("post_attention_layernorm"), loading_isq),
)?;

// Mistral: passes `false` instead
let input_layernorm = RmsNorm::new(
    cfg.hidden_size,
    cfg.rms_norm_eps,
    mapper.set_device(layer_idx, vb.pp("input_layernorm"), false),
)?;
let post_attention_layernorm = RmsNorm::new(
    cfg.hidden_size,
    cfg.rms_norm_eps,
    mapper.set_device(layer_idx, vb.pp("post_attention_layernorm"), false),
)?;

I changed it to the Mistral way in my copied version of the backend Llama LLM. I don't have Llama weights right now (and it's not easy for me to download them), so maybe you can check whether the Llama text model has this issue.

Owner

@EricLBuehler EricLBuehler left a comment


Looks good! Thank you for adding this. I'll need to check the RoPE instability in a future PR but I think the current solution + LlavaModel is good enough.

@EricLBuehler EricLBuehler merged commit 62146d9 into EricLBuehler:master Jul 5, 2024
15 checks passed