You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading dequantize_block_q4_K_f32
I have used this same model file with llama.cpp on the same platform, so I don't think the file is the problem.
Click to see full output
Finished `release` profile [optimized] target(s) in 0.36s
Running `target/release/mistralrs-server -i gguf -m /external/bradley/llama.cpp/models -f llama-31-70B-Q4-K-M.gguf`
2024-10-19T20:08:24.006809Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-19T20:08:24.007090Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-19T20:08:24.007133Z INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2024-10-19T20:08:24.007232Z INFO mistralrs_core::utils::tokens: Could not load token at "/home/bradley/.cache/huggingface/token", using no HF token.
2024-10-19T20:08:24.007350Z INFO mistralrs_core::utils::tokens: Could not load token at "/home/bradley/.cache/huggingface/token", using no HF token.
2024-10-19T20:08:24.007379Z INFO mistralrs_core::pipeline::paths: Loading `llama-31-70B-Q4-K-M.gguf` locally at `/external/bradley/llama.cpp/models/llama-31-70B-Q4-K-M.gguf`
2024-10-19T20:08:24.007642Z INFO mistralrs_core::pipeline::gguf: Loading model `/external/bradley/llama.cpp/models` on cuda[0].
2024-10-19T20:08:24.573480Z INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.base_model.0.name: Meta Llama 3.1 70B
general.base_model.0.organization: Meta Llama
general.base_model.0.repo_url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B
general.base_model.count: 1
general.file_type: 15
general.finetune: 33101ce6ccc08fa6249c10a543ebfcac65173393
general.languages: en, de, fr, it, pt, hi, es, th
general.license: llama3.1
general.name: 33101ce6ccc08fa6249c10a543ebfcac65173393
general.quantization_version: 2
general.size_label: 71B
general.tags: facebook, meta, pytorch, llama, llama-3, text-generation
general.type: model
llama.attention.head_count: 64
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 80
llama.context_length: 131072
llama.embedding_length: 8192
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-10-19T20:08:24.982454Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-10-19T20:08:24.993793Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = "26 Jul 2024" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- "Environment: ipython\n" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}\n{%- endif %}\n{{- "Cutting Knowledge Date: December 2023\n" }}\n{{- "Today Date: " + date_string + "\n\n" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}\n {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- "<|eot_id|>" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}\n {{- "Given the following functions, please respond with a JSON for a function call " }}\n {{- "with its proper arguments that best answers the given prompt.\n\n" }}\n {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {{- first_user_message + "<|eot_id|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception("This model only supports single tool-calls at once!") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}\n {{- "<|python_tag|>" + tool_call.name + ".call(" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '="' + arg_val + '"' }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- endif %}\n {%- endfor %}\n {{- ")" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}\n {{- '{"name": "' + tool_call.name + '", ' }}\n {{- '"parameters": ' }}\n {{- tool_call.arguments | tojson }}\n {{- "}" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- "<|eom_id|>" }}\n {%- else %}\n {{- "<|eot_id|>" }}\n {%- endif %}\n {%- elif message.role == "tool" or message.role == "ipython" %}\n {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- "<|eot_id|>" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}\n{%- endif %}\n`
Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading dequantize_block_q4_K_f32
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Orin"
CUDA Driver Version / Runtime Version 12.2 / 12.4
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 62841 MBytes (65893945344 bytes)
(016) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.4, NumDevs = 1
Result = PASS
Could this be a problem with ARM support? Or maybe a build.rs in some dependency is using the wrong version of nvcc somehow? Also possible that this is a usage error on my part.
Latest commit or version
32e8945, current master as of writing. Also tried v0.3.1 with the same result.
The text was updated successfully, but these errors were encountered:
Describe the bug
When I run this command:
cargo run --bin mistralrs-server --release --features "cuda" -- -i gguf -m /external/bradley/llama.cpp/models -f llama-31-70B-Q4-K-M.gguf
I get the following error:
I have used this same model file with
llama.cpp
on the same platform, so I don't think the file is the problem.Click to see full output
My system is an Nvidia Jetson AGX Orin 64 GB Developer Kit.
Click to show output of deviceQuery
Could this be a problem with ARM support? Or maybe a
build.rs
in some dependency is using the wrong version ofnvcc
somehow? Also possible that this is a usage error on my part.Latest commit or version
32e8945
, currentmaster
as of writing. Also triedv0.3.1
with the same result.The text was updated successfully, but these errors were encountered: