unknown dtype for tensor (BF16?) #663

Closed
oldgithubman opened this issue Aug 1, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@oldgithubman

Describe the bug

My Q8_0 quant of Athene-70B loads fine. I have another quant that is identical except the output and embedding tensors are BF16:

$ RUST_BACKTRACE=full ./mistralrs_server --interactive-mode --num-device-layers 13 --pa-ctxt-len 8192 gguf -m PATH -f Athene-70B-Q8_0-BF16.gguf
2024-08-01T17:08:17.446889Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-01T17:08:17.446907Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-01T17:08:17.446917Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-08-01T17:08:17.446981Z  INFO mistralrs_core::pipeline::paths: Loading `Athene-70B-Q8_0-BF16.gguf` locally at `/PATH/Athene-70B-Q8_0-BF16.gguf`
2024-08-01T17:08:17.447033Z  WARN mistralrs_core::pipeline::gguf: Device mapping and PagedAttention are incompatible, disabling PagedAttention.
Error: path: "/PATH/Athene-70B-Q8_0-BF16.gguf" unknown dtype for tensor 30
   0: candle_core::error::Error::bt
   1: candle_core::quantized::GgmlDType::from_u32
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   4: mistralrs_server::main::{{closure}}
   5: mistralrs_server::main
   6: std::sys_common::backtrace::__rust_begin_short_backtrace
   7: std::rt::lang_start::{{closure}}
   8: std::rt::lang_start_internal
   9: main
  10: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  11: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  12: _start


Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_server::main::{{closure}}
   4: mistralrs_server::main
   5: std::sys_common::backtrace::__rust_begin_short_backtrace
   6: std::rt::lang_start::{{closure}}
   7: std::rt::lang_start_internal
   8: main
   9: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  10: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  11: _start

Latest commit or version

0.2.4

oldgithubman added the bug (Something isn't working) label on Aug 1, 2024
@EricLBuehler
Owner

EricLBuehler commented Aug 2, 2024

@oldgithubman yes, this is the problem. Please see huggingface/candle#2387. This will enable support for BF16 and more descriptive errors!
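
For context: a GGUF file tags every tensor with a little-endian u32 type code, and BF16 is code 30 in the GGML type enum, which is why the original failure reads "unknown dtype for tensor 30". The sketch below is a minimal, self-contained illustration of the kind of mapping the candle PR extends; it is not candle's actual code, and the variant set is trimmed to a few illustrative codes.

```rust
// Illustrative only: a trimmed version of a GGUF dtype-code mapping.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum GgmlDType {
    F32,  // GGUF type code 0
    F16,  // code 1
    Q8_0, // code 8 (the bulk of this Q8_0 quant)
    BF16, // code 30 (the output/embedding tensors in this file)
}

impl GgmlDType {
    fn from_u32(v: u32) -> Result<Self, String> {
        Ok(match v {
            0 => Self::F32,
            1 => Self::F16,
            8 => Self::Q8_0,
            // The arm a BF16-aware reader needs; without it, code 30 is "unknown".
            30 => Self::BF16,
            other => return Err(format!("unknown dtype for tensor {other}")),
        })
    }
}

fn main() {
    assert_eq!(GgmlDType::from_u32(8), Ok(GgmlDType::Q8_0));
    assert_eq!(GgmlDType::from_u32(30), Ok(GgmlDType::BF16));
    println!("{:?}", GgmlDType::from_u32(999)); // Err("unknown dtype for tensor 999")
}
```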

@EricLBuehler
Owner

@oldgithubman given that the Candle PR hasn't been merged, I have mirrored my changes onto our Candle fork so we can proceed. Please see #691, which should enable this to work.

To test:

git pull
git switch gguf_bf16
<test command here>

@oldgithubman
Author

> @oldgithubman given that the Candle PR hasn't been merged, I have mirrored my changes onto our Candle fork so we can proceed. Please see #691, which should enable this to work.
>
> To test:
>
> git pull
> git switch gguf_bf16
> <test command here>
$ RUST_BACKTRACE=full ./mistralrs_server -i -n 13 --pa-ctxt-len 8192 gguf -m PATH -f Athene-70B-Q8_0-BF16.gguf
2024-08-20T04:09:57.420536Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-20T04:09:57.420558Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-20T04:09:57.420561Z  INFO mistralrs_server: Using flash attention.
2024-08-20T04:09:57.420569Z  WARN mistralrs_server: Using flash attention with a quantized model has no effect!
2024-08-20T04:09:57.420572Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-08-20T04:09:57.430607Z  INFO mistralrs_core::pipeline::paths: Loading `Athene-70B-Q8_0-BF16.gguf` locally at `/media/j/72B264BFB2648A05/Athene-70B-Q8_0-BF16.gguf`
2024-08-20T04:09:57.430858Z  WARN mistralrs_core::pipeline::gguf: Device mapping and PagedAttention are incompatible, disabling PagedAttention.
2024-08-20T04:09:57.655808Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.basename: Athene
general.file_type: 7
general.languages: en
general.license: cc-by-nc-4.0
general.name: Athene 70B
general.organization: Nexusflow
general.quantization_version: 2
general.size_label: 70B
general.tags: RLHF, Nexusflow, Athene, Chat Model
general.type: model
llama.attention.head_count: 64
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 80
llama.context_length: 8192
llama.embedding_length: 8192
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
quantize.imatrix.entries_count: 560
quantize.imatrix.file: FILE
2024-08-20T04:09:57.883996Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-08-20T04:09:57.893896Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}`
Error: quantized type BF16 is not supported yet
   0: candle_core::error::Error::bt
   1: candle_core::quantized::ggml_file::qtensor_from_ggml
   2: candle_core::quantized::gguf_file::Content::tensor
   3: <mistralrs_core::models::quantized_llama::ModelWeights as mistralrs_core::utils::model_config::FromGGUF>::from_gguf
   4: mistralrs_core::utils::model_config::<impl core::convert::TryFrom<mistralrs_core::utils::model_config::ModelParams<mistralrs_core::utils::model_config::ParamsGGUF>> for mistralrs_core::models::quantized_llama::ModelWeights>::try_from
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   6: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   7: mistralrs_server::main::{{closure}}
   8: mistralrs_server::main
   9: std::sys_common::backtrace::__rust_begin_short_backtrace
  10: std::rt::lang_start::{{closure}}
  11: std::rt::lang_start_internal
  12: main
  13: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  14: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  15: _start


Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_server::main::{{closure}}
   4: mistralrs_server::main
   5: std::sys_common::backtrace::__rust_begin_short_backtrace
   6: std::rt::lang_start::{{closure}}
   7: std::rt::lang_start_internal
   8: main
   9: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  10: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  11: _start
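
A note on this second failure: recognizing type code 30 in the header is only the first step. The `qtensor_from_ggml` frame above is where the raw per-tensor bytes are turned into something the runtime can use, and that path did not yet handle BF16. Widening BF16 to f32 is cheap because BF16 is simply the upper 16 bits of an IEEE-754 f32; here is a minimal sketch of that conversion (illustrative only, not mistral.rs or candle code):

```rust
// BF16 keeps the sign bit, the full 8-bit exponent, and the top 7 mantissa
// bits of an f32, so widening is a 16-bit left shift of the raw bits.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

// Widen a little-endian BF16 byte buffer (as a GGUF tensor stores it) to f32.
fn widen_bf16_slice(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|c| bf16_bits_to_f32(u16::from_le_bytes([c[0], c[1]])))
        .collect()
}

fn main() {
    // 0x3F80 is the BF16 bit pattern for 1.0; 0xC000 is -2.0.
    assert_eq!(bf16_bits_to_f32(0x3F80), 1.0);
    assert_eq!(bf16_bits_to_f32(0xC000), -2.0);
    assert_eq!(widen_bf16_slice(&[0x80, 0x3F, 0x00, 0xC0]), vec![1.0, -2.0]);
    println!("ok");
}
```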

@EricLBuehler
Owner

@oldgithubman thanks, that should be fixed now if you git pull again and retry!

@oldgithubman
Author

> @oldgithubman thanks, that should be fixed now if you git pull again and retry!

ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: Msg("unsupported dtype for quantized matmul BF16"), backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "<candle_core::quantized::QMatMul as candle_core::Module>::forward" }, { fn: "<mistralrs_quant::gguf::GgufMatMul as mistralrs_quant::QuantMethod>::forward" }, { fn: "mistralrs_core::models::quantized_llama::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "start_thread", file: "./nptl/pthread_create.c", line: 442 }, { fn: "__GI___clone3", file: "./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S", line: 81 }] }
2024-08-20T18:09:40.158461Z ERROR mistralrs_server::interactive_mode: Got a model error: "unsupported dtype for quantized matmul BF16\n   0: candle_core::error::Error::bt\n   1: <candle_core::quantized::QMatMul as candle_core::Module>::forward\n   2: <mistralrs_quant::gguf::GgufMatMul as mistralrs_quant::QuantMethod>::forward\n   3: mistralrs_core::models::quantized_llama::ModelWeights::forward\n   4: <mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs\n   5: mistralrs_core::pipeline::Pipeline::step::{{closure}}\n   6: mistralrs_core::engine::Engine::run::{{closure}}\n   7: std::sys_common::backtrace::__rust_begin_short_backtrace\n   8: core::ops::function::FnOnce::call_once{{vtable.shim}}\n   9: std::sys::pal::unix::thread::Thread::new::thread_start\n  10: start_thread\n             at ./nptl/pthread_create.c:442:8\n  11: __GI___clone3\n             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81\n", response: ChatCompletionResponse { id: "0", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1724177337, model: "PATH", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 48, total_tokens: 48, avg_tok_per_sec: 1.1136891, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 43.1, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

@EricLBuehler
Owner

@oldgithubman can you please run with RUST_BACKTRACE=1?

@oldgithubman
Author

> @oldgithubman can you please run with RUST_BACKTRACE=1?

That was run with RUST_BACKTRACE=full. Do you still want me to rerun it with 1?

@EricLBuehler
Owner

Ah ok thanks, I'll take a look.

@EricLBuehler
Owner

@oldgithubman I just updated the branch to correctly set up the QMatMul (#691).
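
For readers following along: the last failure was in the matmul itself, since the quantized matmul kernel only understands block formats like Q8_0 and rejects BF16 weights. One way to handle such tensors, sketched below purely as an illustration (this is not the actual QMatMul code in #691), is to widen BF16 weights once at load time and dispatch them to an ordinary dense matmul path.

```rust
// Illustrative dispatch only; the real QMatMul lives in candle / mistralrs_quant.
#[allow(non_camel_case_types)]
enum Weight {
    Q8_0(Vec<u8>),      // raw quantized blocks, consumed by the quantized kernel
    DenseF32(Vec<f32>), // BF16 tensors widened to f32 when the model is loaded
}

fn forward(w: &Weight, x: &[f32]) -> Vec<f32> {
    match w {
        // The quantized kernel path (omitted here) handles block formats.
        Weight::Q8_0(_blocks) => unimplemented!("quantized kernel"),
        // BF16-origin weights take a plain dense path; shown as a single
        // dot product so the dispatch stays readable.
        Weight::DenseF32(row) => vec![row.iter().zip(x).map(|(a, b)| a * b).sum()],
    }
}

fn main() {
    let w = Weight::DenseF32(vec![1.0, -2.0, 0.5]);
    let y = forward(&w, &[2.0, 1.0, 4.0]);
    assert_eq!(y, vec![2.0]);
    println!("{y:?}");
}
```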

@oldgithubman
Author

> @oldgithubman I just updated the branch to correctly set up the QMatMul (#691).

Works!

@EricLBuehler
Owner

@oldgithubman thanks for confirming! I just merged #691, so this feature is available on master and will be in 0.2.6 in a few days.

@EricLBuehler
Owner

@oldgithubman closing this issue as it works, please feel free to reopen!
