How to use UQFF File locally without sending requests to Hugging Face? #821

solaoi opened this issue Oct 3, 2024 · 19 comments · Labels: bug

solaoi commented Oct 3, 2024

Describe the bug

I'm trying to use a UQFF file in a local-only environment, but my sample code still sends requests to Hugging Face.
I would like to know how to prevent these external requests and use the UQFF file entirely locally.

Sample Code

use std::{env::current_dir, sync::Arc};
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, DefaultSchedulerMethod, Device, DeviceMapMetadata, IsqType, MistralRs,
    MistralRsBuilder, ModelDType, NormalLoaderBuilder, NormalRequest, NormalSpecificConfig,
    Request, RequestMessage, ResponseOk, SamplingParams, SchedulerConfig, TokenSource,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    let path_buf = current_dir()?;
    let loader = NormalLoaderBuilder::new(
        NormalSpecificConfig {
            use_flash_attn: false,
            prompt_batchsize: None,
            topology: None,
            organization: Default::default(),
            write_uqff: None,
            from_uqff: Some(path_buf.join("honyaku13B/Honyaku-13b-q4_0.uqff")),
        },
        Some("honyaku13B/llama2.json".to_string()),
        None,
        Some("aixsatoshi/Honyaku-13b".to_string()),
    )
    .build(None)?;

    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::None,
        &ModelDType::Auto,
        &Device::new_metal(0)?,
        false,
        DeviceMapMetadata::dummy(),
        Some(IsqType::Q4_0),
        None,
    )?;

    Ok(MistralRsBuilder::new(
        pipeline,
        SchedulerConfig::DefaultScheduler {
            method: DefaultSchedulerMethod::Fixed(5.try_into().unwrap()),
        },
    )
    .build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    let (tx, mut rx) = channel(10_000);
    let text = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "Hello world!".to_string());
    let prompt = format!("<english>: {} <NL>\n\n<japanese>: ", text);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: prompt,
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::deterministic(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
        tools: None,
        tool_choice: None,
        logits_processors: None,
    });
    mistralrs.get_sender()?.blocking_send(request)?;

    let response = rx.blocking_recv().unwrap().as_result().unwrap();
    match response {
        ResponseOk::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}

I tried changing the model_id from aixsatoshi/Honyaku-13b to ./honyaku13B/Honyaku-13b-q4_0.uqff, but this resulted in the following error:

File "tokenizer.json" not found at model id "./honyaku13B/Honyaku-13b-q4_0.uqff"

Latest commit or version

329e0e8

solaoi added the bug label Oct 3, 2024

Oracuda commented Oct 7, 2024

Hey, I'm not working with UQFF so I can't say for sure, but I think this might be possible. Here's the command-line syntax that worked for me (I'm completely new to this project and haven't used it from code yet, so forgive my naivety):

--quantized-filename "my_modelgguf" --quantized-model-id "drive:\mymodels\"

So try changing the model ID to ./honyaku13B/, or whichever folder contains the .uqff file.

solaoi commented Oct 9, 2024

@Oracuda
Thank you for your advice.
When I write code similar to the GGUF examples, it still requires the original config.json, tokenizer.json, and safetensors files.

It seems I'm encountering a problem similar to these issues:
#828
#836

Alternatively, my understanding of UQFF might be incorrect to begin with.

EricLBuehler (Owner) commented

@solaoi @Oracuda at the moment, UQFF is a storage format for ISQ artifacts and simply removes the step of doing the quantization locally. #849 will generate all the necessary files when you create a UQFF model, though!
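
For reference, a rough sketch of producing a UQFF artifact rather than reading one, reusing the NormalSpecificConfig fields from the sample code above (the output path here is just a placeholder):

use std::path::PathBuf;

use mistralrs::{NormalLoaderBuilder, NormalSpecificConfig};

fn build_uqff_writer() -> anyhow::Result<()> {
    // Same loader setup as `setup()` above, but with `write_uqff` set so the
    // in-situ-quantized (ISQ) weights are saved to a local .uqff file instead
    // of being read from one.
    let _loader = NormalLoaderBuilder::new(
        NormalSpecificConfig {
            use_flash_attn: false,
            prompt_batchsize: None,
            topology: None,
            organization: Default::default(),
            write_uqff: Some(PathBuf::from("honyaku13B/Honyaku-13b-q4_0.uqff")),
            from_uqff: None,
        },
        Some("honyaku13B/llama2.json".to_string()),
        None,
        Some("aixsatoshi/Honyaku-13b".to_string()),
    )
    .build(None)?;
    // Loading the model with `Some(IsqType::Q4_0)` (as in `setup()` above) should
    // then quantize the weights and write the artifact to the `write_uqff` path.
    Ok(())
}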

solaoi commented Oct 14, 2024

@EricLBuehler

Thank you for the explanation.
I see now that UQFF is a storage format for ISQ artifacts that simplifies the process by removing the need for local quantization.

I also appreciate the work you're doing on #849 to generate all the necessary files when creating a UQFF model.
I've already tried out the uqff_standalone branch from PR #849.

However, I encountered the same error as described in this issue:
#845

For context, I was trying to enable metal and use UQFF with the following command:

cargo run --features metal -- --isq Q4_0 -i plain -m aixsatoshi/Honyaku-13b --write-uqff Honyaku-13b-q4_0.uqff 

Interestingly, I was able to build without errors using the metal_f8_buildfix branch from PR #846.

Would it be possible to incorporate the changes from PR #846 into the uqff_standalone branch of PR #849?
I'd like to test if that resolves the issue.

EricLBuehler (Owner) commented

@solaoi I merged both #846 and #849, so as of #849 you can now load UQFF models without downloading the full weights!

For example (https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF):

./mistralrs-server -i vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff

More models can be found here: https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c.

jiabochao commented Oct 16, 2024

@EricLBuehler Hi, I have a question about the new UQFF API. After specifying local paths for the .uqff file and tokenizer.json, I noticed that the program still tries to download config.json from Hugging Face. Is there a way to specify a local path for the config.json file?

2024-10-16T05:29:19.553770Z  INFO mistralrs_core::pipeline::vision: Using tokenizer.json at `~/Downloads/tokenizer.json`
2024-10-16T05:29:19.553829Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`

Here is my code:

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

const MODEL_ID: &str = "EricB/Llama-3.2-11B-Vision-Instruct-UQFF";

#[tokio::main]
async fn main() -> Result<()> {

    let model = VisionModelBuilder::new(MODEL_ID, VisionLoaderType::VLlama)
        .with_isq(IsqType::Q4K)
        .with_logging()
        .from_uqff("~/Downloads/llam3.2-vision-instruct-q4k.uqff".into())
        .with_tokenizer_json("~/Downloads/tokenizer.json")
        .build()
        .await?;

    let bytes = match reqwest::blocking::get(
        "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_vllama_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

EricLBuehler (Owner) commented

Hi @jiabochao! If you specify a Hugging Face model ID, it will always source the tokenizer from there. If you want to avoid downloading files, it would be best to download the model locally and then use a local model ID.

jiabochao commented Oct 16, 2024

@EricLBuehler How can I use a local model ID? Could you please provide me with some examples?

solaoi commented Oct 16, 2024

@EricLBuehler
Thank you for your assistance.
I've started recreating the UQFF model using the latest master branch as per this commit:
751be3d

I executed the following command:

cargo run --features metal -- --isq Q4_0 -i plain -m aixsatoshi/Honyaku-13b --write-uqff Honyaku-13b-q4_0.uqff

However, after execution, I couldn't find the UQFF file.
Only the following files were newly created:

config.json
generation_config.json
residual.safetensors
tokenizer.json
tokenizer_config.json

Could you please look into this?
I'm particularly concerned about the following warning that appeared when running the command:

warning: unused variable: `quantized_values`
   --> mistralrs-core/src/pipeline/isq.rs:506:25
    |
506 |                     let quantized_values = if silent {
    |                         ^^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_quantized_values`
    |
    = note: `#[warn(unused_variables)]` on by default

warning: `mistralrs-core` (lib) generated 1 warning

EricLBuehler (Owner) commented

Hi @solaoi! I merged #857 and this works locally on Metal for me now.

solaoi commented Oct 17, 2024

@EricLBuehler
Thank you for the update. However, I'm encountering the following error on my Mac (M2, macOS Sonoma):

-[AGXG14GFamilyCommandBuffer blitCommandEncoderCommon:]:757: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort      cargo run --features metal -- --isq Q4_0 -i plain -m aixsatoshi/Honyaku-13b

EricLBuehler (Owner) commented

@solaoi thanks for catching that. On my Metal machine (M3 Max, macOS Sonoma), it worked during testing but now fails intermittently. This seems to be caused by something in our Candle backend and warrants further investigation!

This was actually a regression from a recently merged PR (#857), but I just merged #861 which seems to work now.

If you get any errors during the loading phase, please let me know.

solaoi commented Oct 17, 2024

@EricLBuehler
Thanks for the follow-up.
I've tested the current master branch as well, and I'm hitting the same intermittent failures you mentioned.

The error message is the same one we discussed earlier: "A command encoder is already encoding to this command buffer".
It seems this issue requires further investigation.

On a positive note, when a run does succeed, the generated files load and the model works without any issues. That's encouraging, but the intermittent failures still need to be addressed.

solaoi commented Oct 17, 2024

@jiabochao
To use this model, you'll need to download the following files from the Hugging Face repository (https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF/tree/main):

  • The specific .uqff file you want to use
  • config.json
  • residual.safetensors
  • tokenizer.json
  • tokenizer_config.json

Once you've downloaded these files, place them in a directory (e.g., llama3.2-vision/).
Then, you can set the model_id to the path of this directory (e.g., llama3.2-vision/).
This should allow the model to run entirely locally.
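
For reference, a rough sketch of what this could look like with the VisionModelBuilder API from your earlier snippet (the directory and image paths are placeholders, and with_isq is left out since the .uqff already holds the quantized weights):

use anyhow::Result;
use mistralrs::{TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // "llama3.2-vision/" is a placeholder for the local directory that holds the
    // files listed above (config.json, residual.safetensors, tokenizer.json,
    // tokenizer_config.json, and the .uqff file).
    let model = VisionModelBuilder::new("llama3.2-vision/", VisionLoaderType::VLlama)
        .with_logging()
        // The .uqff filename appears to be resolved relative to the model directory.
        .from_uqff("llama3.2-vision-instruct-q4k.uqff".into())
        .build()
        .await?;

    // Any local image works here; "example.jpg" is a placeholder.
    let image = image::open("example.jpg")?;
    let messages = VisionMessages::new().add_vllama_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}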

jiabochao commented

@solaoi Thank you for your reply, it works! But another error has occurred:

2024-10-17T16:32:52.574363Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/`
2024-10-17T16:32:52.574471Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/tokenizer.json`
2024-10-17T16:32:52.574493Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/`
2024-10-17T16:32:52.574508Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/config.json`
2024-10-17T16:32:52.589274Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["residual.safetensors"]
2024-10-17T16:32:52.589345Z  INFO mistralrs_core::pipeline::paths: Loading `residual.safetensors` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/residual.safetensors`
2024-10-17T16:32:52.589717Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/`
2024-10-17T16:32:52.589736Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/tokenizer_config.json`
2024-10-17T16:32:52.589943Z  INFO mistralrs_core::pipeline::vision: Loading `llam3.2-vision-instruct-q4k.uqff` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/llam3.2-vision-instruct-q4k.uqff`
2024-10-17T16:32:52.590295Z  INFO mistralrs_core::pipeline::vision: Loading model `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/` on cpu.
2024-10-17T16:32:52.591102Z  INFO mistralrs_core::pipeline::vision: Model config: MLlamaConfig { vision_config: MLlamaVisionConfig { hidden_size: 1280, hidden_act: Gelu, num_hidden_layers: 32, num_global_layers: 8, num_attention_heads: 16, num_channels: 3, intermediate_size: 5120, vision_output_dim: 7680, image_size: 560, patch_size: 14, norm_eps: 1e-5, max_num_tiles: 4, intermediate_layers_indices: [3, 7, 15, 23, 30], supported_aspect_ratios: [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (4, 1)] }, text_config: MLlamaTextConfig { rope_scaling: Some(MLlamaRopeScaling { rope_type: Llama3, factor: Some(8.0), original_max_position_embeddings: 8192, attention_factor: None, beta_fast: None, beta_slow: None, short_factor: None, long_factor: None, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0) }), vocab_size: 128256, hidden_size: 4096, hidden_act: Silu, num_hidden_layers: 40, num_attention_heads: 32, num_key_value_heads: 8, intermediate_size: 14336, rope_theta: 500000.0, rms_norm_eps: 1e-5, max_position_embeddings: 131072, tie_word_embeddings: false, cross_attention_layers: [3, 8, 13, 18, 23, 28, 33, 38], use_flash_attn: false, quantization_config: None } }
2024-10-17T16:32:52.592577Z  INFO mistralrs_core::utils::normal: DType selected is F16.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 682/682 [00:21<00:00, 172.64it/s]
Error: cannot find tensor vision_model.transformer.layers.0.self_attn.q_proj.weight

EricLBuehler (Owner) commented

@solaoi are you still running into intermittent errors?

EricLBuehler (Owner) commented

@jiabochao the following works for me:

./mistralrs-server -i vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff 

Perhaps you downloaded the wrong residual.safetensors? I would recommend loading from HF Hub if possible.

jiabochao commented

@EricLBuehler Hi, the issue occurs when specifying .with_isq(IsqType::Q4K), but it works fine if that call is omitted:

let model = VisionModelBuilder::new(MODEL_ID, VisionLoaderType::VLlama)
    .with_isq(IsqType::Q4K) // here
    .with_logging()
    .from_uqff("llama3.2-vision-instruct-q4k.uqff".into())
    .build()
    .await?;

solaoi commented Oct 18, 2024

@EricLBuehler
Yes, I'm still encountering them. The frustrating part is that I can't pinpoint the exact conditions that trigger these errors.
I've tried everything from running cargo clean to even rebooting my OS, but I still can't reliably reproduce the issue.
It's quite puzzling.
