Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.
Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
Please use English, otherwise it will be closed.
Describe the bug

I used the AutoAWQ tool to quantize the Deepseek-V2 model. The quantization script is as follows and produces a quantized checkpoint. I expect to obtain a model in the awq_marlin quantization format.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'path/to/Deepseek-V2'
quant_path = 'path/to/Deepseek-V2_marlin'
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
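To double-check what AutoAWQ actually recorded, the quantization_config section of the saved config.json can be printed with a small sketch like the one below (it assumes the quant_path from the script above; the exact keys written depend on the AutoAWQ version):

import json
import os

quant_path = "path/to/Deepseek-V2_marlin"  # same output path as in the script above

# Print the quantization_config that AutoAWQ wrote into config.json.
# The exact keys (quant_method, version, bits, ...) depend on the AutoAWQ version.
with open(os.path.join(quant_path, "config.json")) as f:
    saved_config = json.load(f)
print(json.dumps(saved_config.get("quantization_config", {}), indent=2))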
The config.json corresponding to the quantized model is as follows.

Then, I used SGLang to run the quantized model with the following command and got the error below.
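A command of roughly this form triggers the first error; the model path and the extra flags here are assumptions, with --quantization awq_marlin being the argument referenced in the traceback:

python -m sglang.launch_server --model-path path/to/Deepseek-V2_marlin \
    --quantization awq_marlin --trust-remote-code  # reconstructed; actual flags may differ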
[2024-10-25 18:12:30 TP0] Traceback (most recent call last):
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1115, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 146, in __init__
self.tp_worker = TpModelWorker(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 58, in __init__
self.model_runner = ModelRunner(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 147, in __init__
self.load_model()
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 234, in load_model
self.vllm_model_config = VllmModelConfig(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/config.py", line 227, in __init__
self._verify_quantization()
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/config.py", line 296, in _verify_quantization
raise ValueError(
ValueError: Quantization method specified in the model config (awq) does not match the quantization method specified in the `quantization` argument (awq_marlin).
If I change the quantization argument to --quantization awq, I also get an error:
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1115, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 146, in __init__
self.tp_worker = TpModelWorker(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 58, in __init__
self.model_runner = ModelRunner(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 147, in __init__
self.load_model()
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 251, in load_model
self.model = get_model(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
model = _initialize_model(model_config, self.load_config,
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 170, in _initialize_model
return build_model(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 155, in build_model
return model_class(config=hf_config,
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 648, in __init__
self.model = DeepseekV2Model(config, cache_config, quant_config)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 608, in __init__
[
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 609, in <listcomp>
DeepseekV2DecoderLayer(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 551, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 113, in __init__
self.experts = FusedMoE(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in __init__
assert self.quant_method is not None
AssertionError
So how can I run an awq_marlin or marlin quantized model with SGLang?
Reproduction
1. Quantize the Deepseek-V2 model (in fact, a smaller model can be used to reproduce).
2. Run the quantized model with SGLang.
Environment