[Feature]: Support Deepseek-r1 671B #809

Open
1 task done
Bihan opened this issue Feb 10, 2025 · 4 comments
Labels
external Issues or PRs submitted by external users

Comments

@Bihan

Bihan commented Feb 10, 2025

🚀 The feature, motivation and pitch

An 8xGaudi2 node with 768 GB of HBM can support DeepSeek-R1 671B. The model weights load successfully using the vLLM Habana fork, but the server fails to deploy due to the error below.
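
As a rough sanity check (my own arithmetic, not part of the original report): with the checkpoint stored in FP8 (about 1 byte per parameter), the weights alone come to roughly 625 GiB, i.e. about 78 GiB per device under tensor parallelism over 8 HPUs, which matches the ~79.4 GiB per device reported in the log below and leaves only ~15 GiB per device for the KV cache.

# Back-of-the-envelope memory estimate (assumptions: ~671B parameters in FP8,
# ~1 byte/param, and the 94.62 GiB of usable HBM per Gaudi2 reported in the log).
params = 671e9
weights_gib = params * 1.0 / 2**30        # FP8 ~= 1 byte per parameter
per_device_gib = weights_gib / 8          # --tensor-parallel-size 8
print(f"~{weights_gib:.0f} GiB of weights total, "
      f"~{per_device_gib:.1f} GiB per HPU of 94.62 GiB usable")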

Alternatives

No response

Additional context

When trying to serve:

MODEL_ID=deepseek-ai/DeepSeek-R1
vllm serve $MODEL_ID --tensor-parallel-size 8

INFO 02-07 14:49:04 api_server.py:839] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-R1', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-R1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir='/data', load_format='auto', weights_load_device=None, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, use_padding_aware_scheduling=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_prefill_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f98a7566320>)
INFO 02-07 14:49:04 api_server.py:204] Started engine process with PID 577
config.json: 100% 1.73k/1.73k [00:00<00:00, 19.8MB/s]
configuration_deepseek.py: 100% 10.6k/10.6k [00:00<00:00, 112MB/s]
A new version of the following files was downloaded from https://huggingface.co/deepseek-ai/DeepSeek-R1:
- configuration_deepseek.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
INFO 02-07 14:49:05 config.py:134] Replacing legacy 'type' key with 'rope_type'
INFO 02-07 14:49:07 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:08 config.py:134] Replacing legacy 'type' key with 'rope_type'
INFO 02-07 14:49:10 config.py:526] This model supports multiple tasks: {'reward', 'classify', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
INFO 02-07 14:49:11 config.py:1344] Defaulting to use mp for distributed inference
WARNING 02-07 14:49:11 fp8.py:53] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
tokenizer_config.json: 100% 3.58k/3.58k [00:00<00:00, 42.6MB/s]
tokenizer.json: 100% 7.85M/7.85M [00:00<00:00, 12.3MB/s]
INFO 02-07 14:49:13 config.py:526] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
INFO 02-07 14:49:14 config.py:1344] Defaulting to use mp for distributed inference
WARNING 02-07 14:49:14 fp8.py:53] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 02-07 14:49:14 llm_engine.py:232] Initializing a V0 LLM engine (v0.6.3.dev2207+g397ec534) with config: model='deepseek-ai/DeepSeek-R1', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir='/data', load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-ai/DeepSeek-R1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
generation_config.json: 100% 171/171 [00:00<00:00, 2.13MB/s]
WARNING 02-07 14:49:15 multiproc_worker_utils.py:316] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-07 14:49:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 02-07 14:49:15 hpu.py:81] Pin memory is not supported on HPU.
INFO 02-07 14:49:15 hpu.py:32] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
INFO 02-07 14:49:17 __init__.py:192] Automatically detected platform hpu.
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
(VllmWorkerProcess pid=879) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
(VllmWorkerProcess pid=878) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=881) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=876) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
(VllmWorkerProcess pid=877) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=880) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=882) INFO 02-07 14:49:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=879) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=879) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=879) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=879) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=879) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=879) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=879) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=879) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=879) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=879) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=879) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=879) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=879) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=879) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=879) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=879) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
(VllmWorkerProcess pid=878) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=878) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=881) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=876) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=881) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=876) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=878) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=878) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=878) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=878) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=878) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=878) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=878) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=878) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=878) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=878) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=878) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=878) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=878) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=878) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
(VllmWorkerProcess pid=881) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=881) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=881) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=881) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=881) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=881) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=881) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=881) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=881) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=881) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=876) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=881) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=881) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=876) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=881) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=876) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=881) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
(VllmWorkerProcess pid=876) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=876) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=876) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=876) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=876) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=876) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=876) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=876) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=876) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=876) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=876) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
(VllmWorkerProcess pid=877) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=877) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=880) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=880) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=877) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=877) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=877) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=877) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=877) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=877) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=877) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=877) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=877) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=877) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=877) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=877) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=877) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=877) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
(VllmWorkerProcess pid=882) WARNING 02-07 14:49:19 hpu.py:81] Pin memory is not supported on HPU.
(VllmWorkerProcess pid=882) INFO 02-07 14:49:19 hpu.py:32] Using HPUAttention backend.
(VllmWorkerProcess pid=880) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=880) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=880) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=880) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=880) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=880) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=880) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=880) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=880) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=880) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=880) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=880) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=880) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=880) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
(VllmWorkerProcess pid=882) VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=882) VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=882) VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=882) VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
(VllmWorkerProcess pid=882) VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(VllmWorkerProcess pid=882) VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(VllmWorkerProcess pid=882) VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=882) VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=882) VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(VllmWorkerProcess pid=882) VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(VllmWorkerProcess pid=882) VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(VllmWorkerProcess pid=882) VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(VllmWorkerProcess pid=882) Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
(VllmWorkerProcess pid=882) Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM       : 1056439504 KB
------------------------------------------------------------------------------
INFO 02-07 14:49:25 shm_broadcast.py:256] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_fa40797d'), local_subscribe_port=42631, remote_subscribe_port=None)
WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
              consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=879) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=879)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=877) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=877)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0

(VllmWorkerProcess pid=880) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=880)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=882) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=882)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=876) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=876)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=878) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=878)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=881) WARNING: The experimental weight sharing feature is enabled and may cause larger device memory
(VllmWorkerProcess pid=881)               consumption in quantized models. Please disable it by setting PT_HPU_WEIGHT_SHARING=0
(VllmWorkerProcess pid=882) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=880) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=877) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=879) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=881) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=876) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=878) Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
(VllmWorkerProcess pid=880) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
(VllmWorkerProcess pid=877) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
(VllmWorkerProcess pid=882) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
(VllmWorkerProcess pid=879) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
(VllmWorkerProcess pid=881) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
(VllmWorkerProcess pid=876) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
(VllmWorkerProcess pid=878) INFO 02-07 14:49:27 loader.py:392] Loading weights on hpu...
INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=880) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=882) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=877) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=881) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=879) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=876) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% 0/163 [00:00<?, ?it/s]
(VllmWorkerProcess pid=878) INFO 02-07 14:49:28 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% 163/163 [01:12<00:00,  2.25it/s]
(VllmWorkerProcess pid=880) INFO 02-07 14:50:41 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.706 GiB of host memory (98.92 GiB/1007 GiB used)
(VllmWorkerProcess pid=882) INFO 02-07 14:50:41 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.706 GiB of host memory (98.92 GiB/1007 GiB used)
(VllmWorkerProcess pid=878) INFO 02-07 14:50:41 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.59 GiB of host memory (98.92 GiB/1007 GiB used)
INFO 02-07 14:50:41 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.703 GiB of host memory (98.92 GiB/1007 GiB used)
(VllmWorkerProcess pid=882) INFO 02-07 14:50:42 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and -8.086 MiB of host memory (98.9 GiB/1007 GiB used)
(VllmWorkerProcess pid=880) INFO 02-07 14:50:42 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and -8.086 MiB of host memory (98.9 GiB/1007 GiB used)
(VllmWorkerProcess pid=881) INFO 02-07 14:50:42 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.628 GiB of host memory (98.93 GiB/1007 GiB used)
(VllmWorkerProcess pid=878) INFO 02-07 14:50:42 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and 16.77 MiB of host memory (98.93 GiB/1007 GiB used)
INFO 02-07 14:50:42 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and 36.21 MiB of host memory (98.93 GiB/1007 GiB used)
(VllmWorkerProcess pid=882) INFO 02-07 14:50:42 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.726 GiB of host memory (98.94 GiB/1007 GiB used)
(VllmWorkerProcess pid=880) INFO 02-07 14:50:42 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.727 GiB of host memory (98.94 GiB/1007 GiB used)
(VllmWorkerProcess pid=878) INFO 02-07 14:50:42 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.686 GiB of host memory (98.94 GiB/1007 GiB used)
INFO 02-07 14:50:42 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.728 GiB of host memory (98.94 GiB/1007 GiB used)
(VllmWorkerProcess pid=881) INFO 02-07 14:50:43 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and 1008 KiB of host memory (98.95 GiB/1007 GiB used)
(VllmWorkerProcess pid=877) INFO 02-07 14:50:43 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.742 GiB of host memory (98.95 GiB/1007 GiB used)
(VllmWorkerProcess pid=876) INFO 02-07 14:50:43 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.646 GiB of host memory (98.95 GiB/1007 GiB used)
(VllmWorkerProcess pid=881) INFO 02-07 14:50:43 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.734 GiB of host memory (98.95 GiB/1007 GiB used)
(VllmWorkerProcess pid=877) INFO 02-07 14:50:44 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and 4.887 MiB of host memory (98.96 GiB/1007 GiB used)
(VllmWorkerProcess pid=876) INFO 02-07 14:50:44 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and 3.133 MiB of host memory (98.97 GiB/1007 GiB used)
(VllmWorkerProcess pid=877) INFO 02-07 14:50:44 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.755 GiB of host memory (98.96 GiB/1007 GiB used)
(VllmWorkerProcess pid=876) INFO 02-07 14:50:44 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.745 GiB of host memory (98.96 GiB/1007 GiB used)
(VllmWorkerProcess pid=879) INFO 02-07 14:50:45 hpu_model_runner.py:712] Pre-loading model weights on hpu:0 took 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.737 GiB of host memory (98.96 GiB/1007 GiB used)
(VllmWorkerProcess pid=879) INFO 02-07 14:50:46 hpu_model_runner.py:790] Wrapping in HPU Graph took 0 B of device memory (79.41 GiB/94.62 GiB used) and -284 KiB of host memory (98.96 GiB/1007 GiB used)
(VllmWorkerProcess pid=879) INFO 02-07 14:50:46 hpu_model_runner.py:794] Loading model weights took in total 79.41 GiB of device memory (79.41 GiB/94.62 GiB used) and 4.771 GiB of host memory (98.98 GiB/1007 GiB used)
ERROR 02-07 14:50:47 engine.py:387] 0 active drivers ([]). There should only be one.
ERROR 02-07 14:50:47 engine.py:387] Traceback (most recent call last):
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/engine/multiprocessing/engine.py", line 378, in run_mp_engine
ERROR 02-07 14:50:47 engine.py:387]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/engine/multiprocessing/engine.py", line 121, in from_engine_args
ERROR 02-07 14:50:47 engine.py:387]     return cls(ipc_path=ipc_path,
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/engine/multiprocessing/engine.py", line 73, in __init__
ERROR 02-07 14:50:47 engine.py:387]     self.engine = LLMEngine(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/engine/llm_engine.py", line 274, in __init__
ERROR 02-07 14:50:47 engine.py:387]     self._initialize_kv_caches()
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/engine/llm_engine.py", line 414, in _initialize_kv_caches
ERROR 02-07 14:50:47 engine.py:387]     self.model_executor.determine_num_available_blocks())
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/executor/executor_base.py", line 99, in determine_num_available_blocks
ERROR 02-07 14:50:47 engine.py:387]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/executor/executor_base.py", line 305, in collective_rpc
ERROR 02-07 14:50:47 engine.py:387]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/executor/mp_distributed_executor.py", line 187, in _run_workers
ERROR 02-07 14:50:47 engine.py:387]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/utils.py", line 2305, in run_method
ERROR 02-07 14:50:47 engine.py:387]     return func(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-07 14:50:47 engine.py:387]     return func(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/worker/hpu_worker.py", line 310, in determine_num_available_blocks
ERROR 02-07 14:50:47 engine.py:387]     self.model_runner.profile_run()
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/worker/hpu_model_runner.py", line 1597, in profile_run
ERROR 02-07 14:50:47 engine.py:387]     self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/worker/hpu_model_runner.py", line 1673, in warmup_scenario
ERROR 02-07 14:50:47 engine.py:387]     self.execute_model(inputs, kv_caches, warmup_mode=True)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-07 14:50:47 engine.py:387]     return func(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/worker/hpu_model_runner.py", line 2295, in execute_model
ERROR 02-07 14:50:47 engine.py:387]     hidden_states = self.model.forward(
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 726, in forward
ERROR 02-07 14:50:47 engine.py:387]     return wrapped_hpugraph_forward(
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 576, in wrapped_hpugraph_forward
ERROR 02-07 14:50:47 engine.py:387]     return orig_fwd(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/worker/hpu_model_runner.py", line 410, in forward
ERROR 02-07 14:50:47 engine.py:387]     hidden_states = self.model(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 02-07 14:50:47 engine.py:387]     return self._call_impl(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1565, in _call_impl
ERROR 02-07 14:50:47 engine.py:387]     return forward_call(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/models/deepseek_v3.py", line 532, in forward
ERROR 02-07 14:50:47 engine.py:387]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 02-07 14:50:47 engine.py:387]     return self._call_impl(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 02-07 14:50:47 engine.py:387]     result = forward_call(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/models/deepseek_v3.py", line 488, in forward
ERROR 02-07 14:50:47 engine.py:387]     hidden_states, residual = layer(positions, hidden_states,
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 02-07 14:50:47 engine.py:387]     return self._call_impl(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 02-07 14:50:47 engine.py:387]     result = forward_call(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/models/deepseek_v3.py", line 407, in forward
ERROR 02-07 14:50:47 engine.py:387]     hidden_states = self.self_attn(
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 02-07 14:50:47 engine.py:387]     return self._call_impl(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 02-07 14:50:47 engine.py:387]     result = forward_call(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/models/deepseek_v3.py", line 288, in forward
ERROR 02-07 14:50:47 engine.py:387]     q = self.q_a_proj(hidden_states)[0]
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 02-07 14:50:47 engine.py:387]     return self._call_impl(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 02-07 14:50:47 engine.py:387]     result = forward_call(*args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/layers/linear.py", line 247, in forward
ERROR 02-07 14:50:47 engine.py:387]     output = self.quant_method.apply(self, x, bias)
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/layers/quantization/fp8.py", line 359, in apply
ERROR 02-07 14:50:47 engine.py:387]     return apply_w8a8_block_fp8_linear(
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 24, in apply_w8a8_block_fp8_linear
ERROR 02-07 14:50:47 engine.py:387]     q_input, x_scale = per_token_group_quant_fp8(input_2d, block_size[1])
ERROR 02-07 14:50:47 engine.py:387]   File "/workflow/vllm-fork/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 174, in per_token_group_quant_fp8
ERROR 02-07 14:50:47 engine.py:387]     _per_token_group_quant_fp8[(M, )](
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in <lambda>
ERROR 02-07 14:50:47 engine.py:387]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 607, in run
ERROR 02-07 14:50:47 engine.py:387]     device = driver.active.get_current_device()
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 23, in __getattr__
ERROR 02-07 14:50:47 engine.py:387]     self._initialize_obj()
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 20, in _initialize_obj
ERROR 02-07 14:50:47 engine.py:387]     self._obj = self._init_fn()
ERROR 02-07 14:50:47 engine.py:387]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 8, in _create_driver
ERROR 02-07 14:50:47 engine.py:387]     raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
ERROR 02-07 14:50:47 engine.py:387] RuntimeError: 0 active drivers ([]). There should only be one.
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 33, in <module>
    sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
  File "/workflow/vllm-fork/vllm/scripts.py", line 202, in main
    args.dispatch_function(args)
  File "/workflow/vllm-fork/vllm/scripts.py", line 42, in serve
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/workflow/vllm-fork/vllm/entrypoints/openai/api_server.py", line 873, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/workflow/vllm-fork/vllm/entrypoints/openai/api_server.py", line 134, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/workflow/vllm-fork/vllm/entrypoints/openai/api_server.py", line 228, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
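
The root failure above is that the fp8 linear path falls back to a Triton kernel, and Triton cannot find an active GPU driver on an HPU-only host. A minimal way to reproduce just that check (a sketch, assuming the Triton build installed in this container, whose runtime/driver.py appears in the traceback):

# Reproduces only the driver lookup that fails above
# (triton/runtime/jit.py -> driver.active.get_current_device()).
# On a host with no CUDA/ROCm driver this raises:
#   RuntimeError: 0 active drivers ([]). There should only be one.
from triton.runtime.driver import driver

print(driver.active.get_current_device())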

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@PatrykWo

@xuechendi please check this.

PatrykWo added the external (Issues or PRs submitted by external users) label on Feb 10, 2025
@PatrykWo

PatrykWo commented Feb 10, 2025

@Bihan Please run the following and paste the output below.

wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py

For security purposes, please feel free to check the contents of collect_env.py before running it.

python collect_env.py

@xuechendi

@Bihan, are you using "https://github.com/HabanaAI/vllm-fork/tree/deepseek_r1"?
Based on your error message, it seems you're using Habana_main, where fp8 linear is still handled by Triton instead of HPU.

For a quick test, you can follow the instructions below:

deploy:
0. git clone https://github.com/HabanaAI/vllm-fork.git; git checkout deepseek_r1
1. sudo docker run -it --runtime=habana --name deepseek-vllm -v pwd:/workspace/vllm/ -v /software/data:/software/data -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --net=host -e HF_HOME=/software/data/ vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest /bin/bash
2. cd vllm; pip install -r requirements-hpu.txt; VLLM_TARGET_DEVICE=hpu pip install -e . --no-build-isolation
3. python vllm/scripts/run_example_tp.py
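
A consolidated, copy-pasteable form of the steps above (a sketch, not from the original comment; it assumes the bare pwd in the docker command was originally a $(pwd) substitution whose backticks were lost in rendering):

# Host side: clone the fork and switch to the deepseek_r1 branch.
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork && git checkout deepseek_r1 && cd ..

# Start the Gaudi container, mounting the current directory (assumed $(pwd)).
sudo docker run -it --runtime=habana --name deepseek-vllm \
  -v $(pwd):/workspace/vllm/ -v /software/data:/software/data \
  -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --ipc=host --net=host -e HF_HOME=/software/data/ \
  vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest /bin/bash

# Inside the container: install the fork for HPU and run the example.
cd vllm
pip install -r requirements-hpu.txt
VLLM_TARGET_DEVICE=hpu pip install -e . --no-build-isolation
python vllm/scripts/run_example_tp.py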

@Bihan
Author

Bihan commented Feb 12, 2025

@Bihan, are you using "https://github.com/HabanaAI/vllm-fork/tree/deepseek_r1"? Based on your error message, it seems you're using Habana_main, where fp8 linear is still handled by Triton instead of HPU.

For a quick test, you can follow the instructions below:


deploy:
0. git clone https://github.com/HabanaAI/vllm-fork.git; git checkout deepseek_r1
1. sudo docker run -it --runtime=habana --name deepseek-vllm -v pwd:/workspace/vllm/ -v /software/data:/software/data -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --net=host -e HF_HOME=/software/data/ vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest /bin/bash
2. cd vllm; pip install -r requirements-hpu.txt; VLLM_TARGET_DEVICE=hpu pip install -e . --no-build-isolation
3. python vllm/scripts/run_example_tp.py

@xuechendi

Below is the error I got

INFO 02-12 01:40:46 __init__.py:192] Automatically detected platform hpu.
Traceback (most recent call last):
  File "/workspace/vllm/vllm-fork/scripts/run_example_tp.py", line 179, in <module>
    llm = LLM(
  File "/workspace/vllm/vllm-fork/vllm/utils.py", line 1110, in inner
    return fn(*args, **kwargs)
  File "/workspace/vllm/vllm-fork/vllm/entrypoints/llm.py", line 240, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
  File "/workspace/vllm/vllm-fork/vllm/engine/llm_engine.py", line 479, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/workspace/vllm/vllm-fork/vllm/engine/arg_utils.py", line 1098, in create_engine_config
    model_config = self.create_model_config()
  File "/workspace/vllm/vllm-fork/vllm/engine/arg_utils.py", line 1021, in create_model_config
    return ModelConfig(
  File "/workspace/vllm/vllm-fork/vllm/config.py", line 286, in __init__
    hf_config = get_config(self.model, trust_remote_code, revision,
  File "/workspace/vllm/vllm-fork/vllm/transformers_utils/config.py", line 182, in get_config
    if is_gguf or file_or_path_exists(
  File "/workspace/vllm/vllm-fork/vllm/transformers_utils/config.py", line 91, in file_or_path_exists
    cached_filepath = try_to_load_from_cache(repo_id=model,
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data/models/DeepSeek-R1/'. Use `repo_type` argument if needed.
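
For reference, the validation that raises this can be reproduced on its own (a minimal sketch using the helper named in the traceback above):

# validate_repo_id only accepts 'repo_name' or 'namespace/repo_name', so a local
# filesystem path like '/data/models/DeepSeek-R1/' is rejected.
from huggingface_hub.utils._validators import validate_repo_id

validate_repo_id("/data/models/DeepSeek-R1/")  # raises HFValidationError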

Also, I had to pip install datasets before running /vllm/scripts/run_example_tp.py.

Is there a way to use it as we normally do with vLLM, e.g. vllm serve $MODEL_ID --tensor-parallel-size 8 --download-dir /data --trust-remote-code?
