Add LLaVA Support #484

Merged: 44 commits, Jul 5, 2024

Commits
301a829
first commit
chenwanqq Jun 14, 2024
5bf944b
continue construction
chenwanqq Jun 18, 2024
cf516ec
Merge branch 'EricLBuehler:master' into llava
chenwanqq Jun 19, 2024
0a4e601
continue working
chenwanqq Jun 19, 2024
3f188a0
Merge branch 'llava' of https://github.com/chenwanqq/mistral.rs into …
chenwanqq Jun 19, 2024
975347e
finish models
chenwanqq Jun 19, 2024
cb7833b
keep working
chenwanqq Jun 19, 2024
18ed03f
keep working
chenwanqq Jun 22, 2024
9572f89
Merge branch 'llava' into master
chenwanqq Jun 22, 2024
81851fe
keep working
chenwanqq Jun 22, 2024
ac3de15
meet serious problem
chenwanqq Jun 23, 2024
93aadda
debugging
chenwanqq Jun 23, 2024
ba1a398
debugging
chenwanqq Jun 25, 2024
b3b46e2
Merge branch 'EricLBuehler:master' into llava
chenwanqq Jun 27, 2024
6bdf529
debugging
chenwanqq Jun 27, 2024
ede233d
Merge branch 'llava' of https://github.com/chenwanqq/mistral.rs into …
chenwanqq Jun 27, 2024
04161ef
restore inrelate codes
chenwanqq Jun 27, 2024
f7a619b
restore change
chenwanqq Jun 27, 2024
ac268d2
finish llama-llavanext
chenwanqq Jun 27, 2024
eab6de2
modify num_image_tokens and num_image_patches
chenwanqq Jun 28, 2024
c87db09
modify code structure
chenwanqq Jun 28, 2024
da84418
update llava1.5
chenwanqq Jul 1, 2024
376343b
add some examples and docs
chenwanqq Jul 1, 2024
3768c5b
Update main.rs. remove a redundant comment
chenwanqq Jul 1, 2024
86219a7
update some deps
chenwanqq Jul 1, 2024
7ffafa1
Merge branch 'llava' of https://github.com/chenwanqq/mistral.rs into …
chenwanqq Jul 1, 2024
3bf1b59
Merge branch 'master' into llava
chenwanqq Jul 1, 2024
6c8126c
fmt
chenwanqq Jul 1, 2024
106b23c
modify llavallm. not support anymoe yet
chenwanqq Jul 1, 2024
bcae4ad
Merge branch 'EricLBuehler:master' into llava
chenwanqq Jul 2, 2024
6b3fb78
add moe support
chenwanqq Jul 2, 2024
2262e95
Merge branch 'master' into llava
chenwanqq Jul 3, 2024
7543319
rust fmt
chenwanqq Jul 3, 2024
d5bd3f8
update image tag
chenwanqq Jul 4, 2024
51ffe76
typo
chenwanqq Jul 4, 2024
03cd36d
Merge branch 'EricLBuehler:master' into llava
chenwanqq Jul 4, 2024
a2c1e72
add anymoe for llava
chenwanqq Jul 4, 2024
e5eaf85
delete redundant image
chenwanqq Jul 4, 2024
cb7d4a1
Update docs and fromstr for llava and llavanext
EricLBuehler Jul 4, 2024
7d1ae69
Add fromstr impl
EricLBuehler Jul 4, 2024
00d3618
Fix examples
EricLBuehler Jul 4, 2024
e636a3c
fix device proble related to isq
chenwanqq Jul 5, 2024
f4f76f3
typo(which not generated by me?)
chenwanqq Jul 5, 2024
41ee9a1
Update LLaVA documentation and chat template options
chenwanqq Jul 5, 2024
14 changes: 12 additions & 2 deletions README.md
@@ -112,10 +112,12 @@ https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-90
|Phi 2|✅|✅|✅|✅|
|Phi 3|✅|✅|✅|✅|
|Qwen 2|✅| |✅|✅|
|Phi 3 Vision|✅| |✅||
|Idefics 2|✅| |✅||
|Gemma 2|✅|✅|✅|✅|
|Starcoder 2|✅|✅|✅|✅|
|LLaVa Next|✅| |✅|✅|
|LLaVa|✅| |✅|✅|

## APIs and Integrations

@@ -356,6 +358,8 @@ Additionally, for models without quantization, the model architecture should be

- `phi3v`
- `idefics2`
- `llava_next`
- `llava`
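
For example, an unquantized LLaVA model could be launched as follows (a sketch based on the commands in the LLaVA documentation; the feature flags and port are assumptions, adjust them to your build):

```
cargo run --release --features cuda -- --port 1234 vision-plain -m llava-hf/llava-1.5-7b-hf -a llava
```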

**Interactive mode:**

@@ -443,6 +447,8 @@ Example:
|Idefics 2| | |✅|
|Gemma 2| | |✅|
|Starcoder 2| | |✅|
|LLaVa Next| | |✅|
|LLaVa| | |✅|

**Device mapping support**
|Model category|Supported|
@@ -466,6 +472,8 @@ Example:
|Idefics 2| | | |
|Gemma 2|✅| | |
|Starcoder 2|✅| | |
|LLaVa Next| | | |
|LLaVa| | | |

**AnyMoE support**
|Model|AnyMoE|
@@ -481,6 +489,8 @@ Example:
|Idefics 2| |
|Gemma 2|✅|
|Starcoder 2|✅|
|LLaVa Next|✅|
|LLaVa|✅|


### Using derivative model
36 changes: 36 additions & 0 deletions chat_templates/vicuna.json
@@ -0,0 +1,36 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = 'A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user\\'s questions.' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 %}{{ system_message }}{% endif %}{% if message['role'] == 'user' %}{{ ' USER: ' + message['content'].strip() }}{% elif message['role'] == 'assistant' %}{{ ' ASSISTANT: ' + message['content'].strip() + eos_token }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ ' ASSISTANT:' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "legacy": false,
  "model_max_length": 4096,
  "pad_token": null,
  "padding_side": "right",
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
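
This template renders conversations into the Vicuna ` USER: ... ASSISTANT: ...` format. Below is a minimal sketch of inspecting that rendering with Python and `jinja2` (mistral.rs applies the template internally when it is passed via `--chat-template`; this snippet only shows what the final prompt looks like, and the `raise_exception` helper is supplied here because the template calls it):

```py
import json

from jinja2 import Environment


def raise_exception(message):
    # The chat template calls raise_exception() when roles do not alternate.
    raise ValueError(message)


with open("chat_templates/vicuna.json") as f:
    cfg = json.load(f)

env = Environment()
env.globals["raise_exception"] = raise_exception
template = env.from_string(cfg["chat_template"])

prompt = template.render(
    messages=[{"role": "user", "content": "<image>What is shown in this image?"}],
    eos_token=cfg["eos_token"]["content"],
    add_generation_prompt=True,
)
print(prompt)
# Expected shape: "<system message> USER: <image>What is shown in this image? ASSISTANT:"
```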
219 changes: 219 additions & 0 deletions docs/LLaVA.md
@@ -0,0 +1,219 @@
# LLaVA and LLaVANext Models: `llava-hf` model family

[LLaVA](https://arxiv.org/abs/2310.03744) and [LLaVANext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) are multimodal models that can handle both text and vision inputs.

This implementation supports both LLaVA and LLaVANext (which adds multi-resolution image processing) and two types of LLM base models: Llama and Mistral. It is currently tested on:
* llava-hf/llava-v1.6-mistral-7b-hf
* llava-hf/llava-v1.6-vicuna-7b-hf
* llava-hf/llava-1.5-7b-hf


The LLaVA and LLaVANext models are supported in the Rust, Python, and HTTP APIs, and they also support ISQ for increased performance.

The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- [Base64](https://en.wikipedia.org/wiki/Base64) encoded string

The Rust API takes an image from the [image](https://docs.rs/image/latest/image/index.html) crate.
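
For the base64 option, the encoded string can be produced with the standard library. This is a minimal sketch with a placeholder file name; whether the server expects the raw string or a `data:` URI is not assumed here, see the base64 example linked in the sections below.

```py
import base64

# Placeholder path; any local image works.
with open("my_image.jpg", "rb") as image_file:
    encoded = base64.b64encode(image_file.read()).decode("utf-8")

# The encoded string goes into the "image_url" field of the request;
# see the linked base64 example for the exact field format.
print(encoded[:60], "...")
```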

## HTTP server
You can find this example [here](../examples/server/llava_next.py).

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

> Note: The image_url may be either a path, URL, or a base64 encoded string.

---

**Image:**
<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "1000" height = "666">
<h6><a href = "https://www.nhmagazine.com/mount-washington/">Credit</a></h6>

**Prompt:**
```
<image>What is shown in this image?
```

**Output:**
```
Text: The image shows a steep, snow-covered hillside with a pine tree on the right side, close to the top. The landscape appears to be a mountainous area with winter conditions. There are no visible humans or permanent structures in the immediate vicinity that suggest this is a summer or recreational location. It's likely a cold, snowy day or season, and the slopes might be part of a mountainous region.
```

---

1) Start the server
```
cargo run --release --features ... -- --port 1234 --isq Q4K vision-plain -m llava-hf/llava-v1.6-mistral-7b-hf -a llava_next
# Or, when using Vicuna as the backend LLM, the chat template must be specified explicitly:
cargo run --features cuda -- --port 1234 --isq Q4K --chat-template ./chat_templates/vicuna.json vision-plain -m /root/autodl-tmp/llava-v1.6-vicuna-7b-hf -a llava_next
```

2) Send a request

```py
import openai

completion = openai.chat.completions.create(
    model="llava_next",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "<image>What is shown in this image?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
```

- You can find an example of encoding the [image via base64 here](../examples/server/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/server/phi3v_local_img.py).

---

## Rust
You can find this example [here](../mistralrs/examples/llava_next/main.rs).

This is a minimal example of running a LLaVA or LLaVANext model with a dummy image.

```rust
use either::Either;
use image::{ColorType, DynamicImage};
use indexmap::IndexMap;
use std::sync::Arc;
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, Device, DeviceMapMetadata, MistralRs, MistralRsBuilder, NormalRequest, Request,
    RequestMessage, Response, SamplingParams, SchedulerMethod, TokenSource, VisionLoaderBuilder,
    VisionLoaderType, VisionSpecificConfig,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    // Select a Mistral model
    let loader = VisionLoaderBuilder::new(
        VisionSpecificConfig {
            use_flash_attn: false,
            repeat_last_n: 64,
        },
        None,
        None,
        Some("llava-hf/llava-v1.6-mistral-7b-hf".to_string()),
    )
    .build(VisionLoaderType::LLaVANext);
    // Load, into a Pipeline
    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::CacheToken,
        None,
        &Device::cuda_if_available(0)?,
        false,
        DeviceMapMetadata::dummy(),
        None,
    )?;
    // Create the MistralRs, which is a runner
    Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::VisionChat {
            images: vec![DynamicImage::new(1280, 720, ColorType::Rgb8)],
            messages: vec![IndexMap::from([
                ("role".to_string(), Either::Left("user".to_string())),
                (
                    "content".to_string(),
                    Either::Left("<image>What is shown in this image?".to_string()),
                ),
            ])],
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    mistralrs.get_sender()?.blocking_send(request)?;

    let response = rx.blocking_recv().unwrap();
    match response {
        Response::Done(c) => println!("Text: {}", c.choices[0].message.content),
        _ => unreachable!(),
    }
    Ok(())
}
```

## Python
You can find this example [here](../examples/python/llava_next.py).

This example demonstrates loading and sending a chat completion request with an image.

> Note: the image_url may be either a path, URL, or a base64 encoded string.

```py
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="llava-hf/llava-v1.6-mistral-7b-hf",
        tokenizer_json=None,
        repeat_last_n=64,
        arch=VisionArchitecture.LLaVANext,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="llava_next",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<image>What is shown in this image?",
                    },
                ],
            },
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```

- You can find an example of encoding the [image via base64 here](../examples/python/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/python/phi3v_local_img.py).
1 change: 1 addition & 0 deletions docs/VISION_MODELS.md
@@ -6,6 +6,7 @@ Please see docs for the following model types:

- Phi 3 Vision: [PHI3V.md](PHI3V.md)
- Idefics2: [IDEFICS2.md](IDEFICS2.md)
- LLaVA and LLaVANext: [LLAVA.md](LLAVA.md)

> Note for the Python and HTTP APIs:
> We follow the OpenAI specification for structuring the image messages and allow both base64 encoded images as well as a URL/path to the image. There are many examples of this, see [this Python example](../examples/python/phi3v.py).
39 changes: 39 additions & 0 deletions examples/python/llava_next.py
@@ -0,0 +1,39 @@
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="llava-hf/llava-v1.6-mistral-7b-hf",
        tokenizer_json=None,
        repeat_last_n=64,
        arch=VisionArchitecture.LLaVANext,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="llava_next",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<image>What is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)