Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.
Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
Please use English, otherwise it will be closed.
Describe the bug

I used the AutoAWQ tool to quantize the Deepseek-V2 model. The quantization script is as follows and produces a quantized checkpoint. I expect to obtain a model in the awq_marlin quantization format.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'path/to/Deepseek-V2'
quant_path = 'path/to/Deepseek-V2_marlin'
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
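To double-check what AutoAWQ actually recorded, the quantization_config section of the saved config.json can be printed with a small sketch like the one below (it assumes the quant_path from the script above; the exact keys written depend on the AutoAWQ version):

import json
import os

quant_path = "path/to/Deepseek-V2_marlin"  # same output path as in the script above

# Print the quantization_config that AutoAWQ wrote into config.json.
# The exact keys (quant_method, version, bits, ...) depend on the AutoAWQ version.
with open(os.path.join(quant_path, "config.json")) as f:
    saved_config = json.load(f)
print(json.dumps(saved_config.get("quantization_config", {}), indent=2))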
The config.json corresponding to the quantized model is as follows.

Then, I used SGLang to run the quantized model with the following command and got the error below.
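A command of roughly this form triggers the first error; the model path and the extra flags here are assumptions, with --quantization awq_marlin being the argument referenced in the traceback:

python -m sglang.launch_server --model-path path/to/Deepseek-V2_marlin \
    --quantization awq_marlin --trust-remote-code  # reconstructed; actual flags may differ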
[2024-10-25 18:12:30 TP0] Traceback (most recent call last):
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1115, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 146, in __init__
self.tp_worker = TpModelWorker(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 58, in __init__
self.model_runner = ModelRunner(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 147, in __init__
self.load_model()
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 234, in load_model
self.vllm_model_config = VllmModelConfig(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/config.py", line 227, in __init__
self._verify_quantization()
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/config.py", line 296, in _verify_quantization
raise ValueError(
ValueError: Quantization method specified in the model config (awq) does not match the quantization method specified in the `quantization` argument (awq_marlin).
If I change the quantization argument to --quantization awq, I also get an error:
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1115, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 146, in __init__
self.tp_worker = TpModelWorker(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 58, in __init__
self.model_runner = ModelRunner(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 147, in __init__
self.load_model()
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 251, in load_model
self.model = get_model(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
model = _initialize_model(model_config, self.load_config,
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 170, in _initialize_model
return build_model(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 155, in build_model
return model_class(config=hf_config,
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 648, in __init__
self.model = DeepseekV2Model(config, cache_config, quant_config)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 608, in __init__
[
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 609, in <listcomp>
DeepseekV2DecoderLayer(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 551, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 113, in __init__
self.experts = FusedMoE(
File "/opt/conda/envs/sglang_py310/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in __init__
assert self.quant_method is not None
AssertionError
So how can I run an awq_marlin or marlin quantized model with SGLang?
Reproduction
1. Quantize the Deepseek-V2 model (in fact, a smaller model can be used to reproduce).
2. Run the quantized model with SGLang.
Environment