How to use UQFF File locally without sending requests to Hugging Face? #821

solaoi opened this issue Oct 3, 2024 · 19 comments · Labels: bug

solaoi commented Oct 3, 2024

Describe the bug

I'm trying to use a UQFF file in a local-only environment, but my sample code still sends requests to Hugging Face.
I would like to know how to prevent these external requests and use the UQFF file entirely locally.

Sample Code

use std::{env::current_dir, sync::Arc};
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, DefaultSchedulerMethod, Device, DeviceMapMetadata, IsqType, MistralRs,
    MistralRsBuilder, ModelDType, NormalLoaderBuilder, NormalRequest, NormalSpecificConfig,
    Request, RequestMessage, ResponseOk, SamplingParams, SchedulerConfig, TokenSource,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    let path_buf = current_dir()?;
    let loader = NormalLoaderBuilder::new(
        NormalSpecificConfig {
            use_flash_attn: false,
            prompt_batchsize: None,
            topology: None,
            organization: Default::default(),
            write_uqff: None,
            from_uqff: Some(path_buf.join("honyaku13B/Honyaku-13b-q4_0.uqff")),
        },
        Some("honyaku13B/llama2.json".to_string()),
        None,
        Some("aixsatoshi/Honyaku-13b".to_string()),
    )
    .build(None)?;

    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::None,
        &ModelDType::Auto,
        &Device::new_metal(0)?,
        false,
        DeviceMapMetadata::dummy(),
        Some(IsqType::Q4_0),
        None,
    )?;

    Ok(MistralRsBuilder::new(
        pipeline,
        SchedulerConfig::DefaultScheduler {
            method: DefaultSchedulerMethod::Fixed(5.try_into().unwrap()),
        },
    )
    .build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    let (tx, mut rx) = channel(10_000);
    let text = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "Hello world!".to_string());
    let prompt = format!("<english>: {} <NL>\n\n<japanese>: ", text);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: prompt,
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::deterministic(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
        tools: None,
        tool_choice: None,
        logits_processors: None,
    });
    mistralrs.get_sender()?.blocking_send(request)?;

    let response = rx.blocking_recv().unwrap().as_result().unwrap();
    match response {
        ResponseOk::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}

I tried changing the model_id from aixsatoshi/Honyaku-13b to ./honyaku13B/Honyaku-13b-q4_0.uqff, but this resulted in the following error:

File "tokenizer.json" not found at model id "./honyaku13B/Honyaku-13b-q4_0.uqff"

Latest commit or version

329e0e8

solaoi added the bug label Oct 3, 2024

Oracuda commented Oct 7, 2024

Hey, I'm not working with UQFF so I can't say for sure, but I think this might be possible. Here's the command-line syntax that worked for me (I'm completely new to this project and haven't used it from code yet, so forgive my naivety):

--quantized-filename "my_modelgguf" --quantized-model-id "drive:\mymodels\"

So try changing the model ID to ./honyaku13B/, or whichever folder contains the .uqff file.

solaoi commented Oct 9, 2024

@Oracuda
Thank you for your advice.
When I write code similar to the GGUF examples, it still requires the original config.json, tokenizer.json, and safetensors files.

It seems I'm encountering a problem similar to these issues:
#828
#836

Alternatively, my understanding of UQFF might be incorrect to begin with.

EricLBuehler (Owner) commented

@solaoi @Oracuda at the moment, UQFF is a storage format for ISQ artifacts and simply removes the step of doing the quantization locally. #849 will generate all the necessary files when you create a UQFF model, though!
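
For reference, a rough sketch of producing a UQFF artifact rather than reading one, reusing the NormalSpecificConfig fields from the sample code above (the output path here is just a placeholder):

use std::path::PathBuf;

use mistralrs::{NormalLoaderBuilder, NormalSpecificConfig};

fn build_uqff_writer() -> anyhow::Result<()> {
    // Same loader setup as `setup()` above, but with `write_uqff` set so the
    // in-situ-quantized (ISQ) weights are saved to a local .uqff file instead
    // of being read from one.
    let _loader = NormalLoaderBuilder::new(
        NormalSpecificConfig {
            use_flash_attn: false,
            prompt_batchsize: None,
            topology: None,
            organization: Default::default(),
            write_uqff: Some(PathBuf::from("honyaku13B/Honyaku-13b-q4_0.uqff")),
            from_uqff: None,
        },
        Some("honyaku13B/llama2.json".to_string()),
        None,
        Some("aixsatoshi/Honyaku-13b".to_string()),
    )
    .build(None)?;
    // Loading the model with `Some(IsqType::Q4_0)` (as in `setup()` above) should
    // then quantize the weights and write the artifact to the `write_uqff` path.
    Ok(())
}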

solaoi commented Oct 14, 2024

@EricLBuehler

Thank you for the explanation.
I see now that UQFF is a storage format for ISQ artifacts that simplifies the process by removing the need for local quantization.

I also appreciate the work you're doing on #849 to generate all the necessary files when creating a UQFF model.
I've already tried out the uqff_standalone branch from PR #849.

However, I encountered the same error as described in this issue:
#845

For context, I was trying to enable metal and use UQFF with the following command:

cargo run --features metal -- --isq Q4_0 -i plain -m aixsatoshi/Honyaku-13b --write-uqff Honyaku-13b-q4_0.uqff 

Interestingly, I was able to build without errors using the metal_f8_buildfix branch from PR #846.

Would it be possible to incorporate the changes from PR #846 into the uqff_standalone branch of PR #849?
I'd like to test if that resolves the issue.

EricLBuehler (Owner) commented

@solaoi I merged both #846 and #849, so as of #849 you can now load UQFF models without downloading the full weights!

For example (https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF):

./mistralrs-server -i vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff

More models can be found here: https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c.

jiabochao commented Oct 16, 2024

@EricLBuehler Hi, I have a question about the new UQFF API. After specifying local paths for the .uqff file and tokenizer.json, I noticed that the program still tries to download config.json from Hugging Face. Is there a way to specify a local path for the config.json file?

2024-10-16T05:29:19.553770Z  INFO mistralrs_core::pipeline::vision: Using tokenizer.json at `~/Downloads/tokenizer.json`
2024-10-16T05:29:19.553829Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`

Here is my code:

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

const MODEL_ID: &str = "EricB/Llama-3.2-11B-Vision-Instruct-UQFF";

#[tokio::main]
async fn main() -> Result<()> {

    let model = VisionModelBuilder::new(MODEL_ID, VisionLoaderType::VLlama)
        .with_isq(IsqType::Q4K)
        .with_logging()
        .from_uqff("~/Downloads/llam3.2-vision-instruct-q4k.uqff".into())
        .with_tokenizer_json("~/Downloads/tokenizer.json")
        .build()
        .await?;

    let bytes = match reqwest::blocking::get(
        "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_vllama_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

EricLBuehler (Owner) commented

Hi @jiabochao! If you specify a Hugging Face model ID, it will always source the tokenizer from there. If you want to avoid downloading files, it would be best to download the model locally and then use a local model ID.

jiabochao commented Oct 16, 2024

@EricLBuehler How can I use a local model ID? Could you please provide me with some examples?

solaoi commented Oct 16, 2024

@EricLBuehler
Thank you for your assistance.
I've started recreating the UQFF model using the latest master branch as per this commit:
751be3d

I executed the following command:

cargo run --features metal -- --isq Q4_0 -i plain -m aixsatoshi/Honyaku-13b --write-uqff Honyaku-13b-q4_0.uqff

However, after execution, I couldn't find the UQFF file.
Only the following files were newly created:

config.json
generation_config.json
residual.safetensors
tokenizer.json
tokenizer_config.json

Could you please look into this?
I'm particularly concerned about the following warning that appeared when running the command:

warning: unused variable: `quantized_values`
   --> mistralrs-core/src/pipeline/isq.rs:506:25
    |
506 |                     let quantized_values = if silent {
    |                         ^^^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_quantized_values`
    |
    = note: `#[warn(unused_variables)]` on by default

warning: `mistralrs-core` (lib) generated 1 warning

EricLBuehler (Owner) commented

Hi @solaoi! I merged #857 and this works locally on Metal for me now.

solaoi commented Oct 17, 2024

@EricLBuehler
Thank you for the update. However, I'm encountering the following error on my Mac (M2, macOS Sonoma):

-[AGXG14GFamilyCommandBuffer blitCommandEncoderCommon:]:757: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort      cargo run --features metal -- --isq Q4_0 -i plain -m aixsatoshi/Honyaku-13b

EricLBuehler (Owner) commented

@solaoi thanks for catching that. On my Metal machine (M3 Max, macOS Sonoma), it worked during testing but now fails intermittently. This seems to be caused by something in our Candle backend and warrants further investigation!

This was actually a regression from a recently merged PR (#857), but I just merged #861 which seems to work now.

If you get any errors during the loading phase, please let me know.

solaoi commented Oct 17, 2024

@EricLBuehler
Thanks for the follow-up.
I've tested the current master branch as well, and I'm hitting the same intermittent failures you mentioned.

The error message is the same one we discussed earlier: "A command encoder is already encoding to this command buffer".
It seems this issue requires further investigation.

On a positive note, when a run does succeed, the generated files load and the model works without any issues. That's encouraging, but the intermittent failures still need to be addressed.

solaoi commented Oct 17, 2024

@jiabochao
To use this model, you'll need to download the following files from the Hugging Face repository (https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF/tree/main):

  • The specific .uqff file you want to use
  • config.json
  • residual.safetensors
  • tokenizer.json
  • tokenizer_config.json

Once you've downloaded these files, place them in a directory (e.g., llama3.2-vision/).
Then, you can set the model_id to the path of this directory (e.g., llama3.2-vision/).
This should allow the model to run entirely locally.
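
For reference, a rough sketch of what this could look like with the VisionModelBuilder API from your earlier snippet (the directory and image paths are placeholders, and with_isq is left out since the .uqff already holds the quantized weights):

use anyhow::Result;
use mistralrs::{TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // "llama3.2-vision/" is a placeholder for the local directory that holds the
    // files listed above (config.json, residual.safetensors, tokenizer.json,
    // tokenizer_config.json, and the .uqff file).
    let model = VisionModelBuilder::new("llama3.2-vision/", VisionLoaderType::VLlama)
        .with_logging()
        // The .uqff filename appears to be resolved relative to the model directory.
        .from_uqff("llama3.2-vision-instruct-q4k.uqff".into())
        .build()
        .await?;

    // Any local image works here; "example.jpg" is a placeholder.
    let image = image::open("example.jpg")?;
    let messages = VisionMessages::new().add_vllama_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
    );

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}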

jiabochao commented

@solaoi Thank you for your reply, it works! But another error has occurred:

2024-10-17T16:32:52.574363Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/`
2024-10-17T16:32:52.574471Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/tokenizer.json`
2024-10-17T16:32:52.574493Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/`
2024-10-17T16:32:52.574508Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/config.json`
2024-10-17T16:32:52.589274Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["residual.safetensors"]
2024-10-17T16:32:52.589345Z  INFO mistralrs_core::pipeline::paths: Loading `residual.safetensors` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/residual.safetensors`
2024-10-17T16:32:52.589717Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/`
2024-10-17T16:32:52.589736Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/tokenizer_config.json`
2024-10-17T16:32:52.589943Z  INFO mistralrs_core::pipeline::vision: Loading `llam3.2-vision-instruct-q4k.uqff` locally at `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/llam3.2-vision-instruct-q4k.uqff`
2024-10-17T16:32:52.590295Z  INFO mistralrs_core::pipeline::vision: Loading model `/Users/bochao/Downloads/Llama-3.2-11B-Vision-Instruct-UQFF/` on cpu.
2024-10-17T16:32:52.591102Z  INFO mistralrs_core::pipeline::vision: Model config: MLlamaConfig { vision_config: MLlamaVisionConfig { hidden_size: 1280, hidden_act: Gelu, num_hidden_layers: 32, num_global_layers: 8, num_attention_heads: 16, num_channels: 3, intermediate_size: 5120, vision_output_dim: 7680, image_size: 560, patch_size: 14, norm_eps: 1e-5, max_num_tiles: 4, intermediate_layers_indices: [3, 7, 15, 23, 30], supported_aspect_ratios: [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (4, 1)] }, text_config: MLlamaTextConfig { rope_scaling: Some(MLlamaRopeScaling { rope_type: Llama3, factor: Some(8.0), original_max_position_embeddings: 8192, attention_factor: None, beta_fast: None, beta_slow: None, short_factor: None, long_factor: None, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0) }), vocab_size: 128256, hidden_size: 4096, hidden_act: Silu, num_hidden_layers: 40, num_attention_heads: 32, num_key_value_heads: 8, intermediate_size: 14336, rope_theta: 500000.0, rms_norm_eps: 1e-5, max_position_embeddings: 131072, tie_word_embeddings: false, cross_attention_layers: [3, 8, 13, 18, 23, 28, 33, 38], use_flash_attn: false, quantization_config: None } }
2024-10-17T16:32:52.592577Z  INFO mistralrs_core::utils::normal: DType selected is F16.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 682/682 [00:21<00:00, 172.64it/s]
Error: cannot find tensor vision_model.transformer.layers.0.self_attn.q_proj.weight

EricLBuehler (Owner) commented

@solaoi are you still running into intermittent errors?

EricLBuehler (Owner) commented

@jiabochao the following works for me:

./mistralrs-server -i vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff 

Perhaps you downloaded the wrong residual.safetensors? I would recommend loading from HF Hub if possible.

jiabochao commented

@EricLBuehler Hi, the issue occurs when specifying .with_isq(IsqType::Q4K), but it works fine if that call is omitted:

let model = VisionModelBuilder::new(MODEL_ID, VisionLoaderType::VLlama)
    .with_isq(IsqType::Q4K) // here
    .with_logging()
    .from_uqff("llama3.2-vision-instruct-q4k.uqff".into())
    .build()
    .await?;

solaoi commented Oct 18, 2024

@EricLBuehler
Yes, I'm still encountering them. The frustrating part is that I can't pinpoint the exact conditions that trigger these errors.
I've tried everything from running cargo clean to even rebooting my OS, but I still can't reliably reproduce the issue.
It's quite puzzling.
