Hi, thanks a lot for your work on Arctic Inference.
I tried the new version of Arctic Inference (v0.1.2) and hit a CUDA device-side assertion failure (see the log below) when sending two or more structured-output requests in parallel.
With speculative decoding disabled, the same requests succeed.
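One way to confirm that concurrency is the trigger is to send the same payloads first one at a time and then all together. The following is a minimal sketch, not part of the original reproducer: `send` is a hypothetical async callable (for example a wrapper around the HTTP POST used in the script below), and `run_sequential`, `run_parallel`, and `count_failures` are illustrative names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence

# Hypothetical type: any coroutine function that sends one payload
# and returns the decoded response.
Send = Callable[[dict], Awaitable[Any]]


async def run_sequential(send: Send, payloads: Sequence[dict]) -> list:
    """Send the payloads one at a time, so only one request is in flight."""
    results = []
    for payload in payloads:
        try:
            results.append(await send(payload))
        except Exception as exc:  # keep going so every payload is tried
            results.append(exc)
    return results


async def run_parallel(send: Send, payloads: Sequence[dict]) -> list:
    """Send all payloads at once; failures are returned instead of raised."""
    return await asyncio.gather(
        *(send(p) for p in payloads), return_exceptions=True
    )


def count_failures(results: Sequence[Any]) -> int:
    """Count results that are exceptions rather than responses."""
    return sum(isinstance(r, BaseException) for r in results)
```

Running the sequential pass first matters here, because a device-side assertion usually leaves the engine unusable for any request that follows. If the sequential pass succeeds and the parallel pass fails, that points at batching of several structured-output requests rather than at either schema on its own.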
Configuration:
Model: Llama-3.1-70B-Instruct (Fine-tuned on our tasks)
HW: 2xH200
Arctic Inference version: 0.1.2
vLLM version: 0.11.0
RUN:
vllm serve \
--model /mnt/models/11/049e9a0368a944959aaaa54216e832bd/artifacts/model \
--chat-template template_250823.jinja \
--quantization fp8 \
--port 8080 \
--host 0.0.0.0 \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.91 \
--max-num-batched-tokens 512 \
--max-num-seqs 35 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--generation-config auto \
--speculative-config '{"method": "arctic", "model": "/mnt/models/11/43cb25cf80c7439f80e46fa9ceafb695/artifacts/model", "num_speculative_tokens": 3, "enable_suffix_decoding": false }'

Example error.py:
import asyncio
import aiohttp
MODEL = "llama-70b"
requests_data = [
{"session_id": "1_1000", "model": MODEL, "messages": [{"content": "Úkolem je identifikovat všechny entity s inzertním potenciálem v zadaném textu: O víkendu jsem si koupil nový iPhone 15 v Alze, protože můj starý Samsung už přestal fungovat. Po cestě domů jsem se zastavil v Starbucks pro cappuccino a večer jsem si pustil seriál Zaklínač na Netflixu. Přemýšlím, že si na léto objednám dovolenou přes Booking.com, nejspíš do Chorvatska.", "role": "user"}], "response_format": {"json_schema": {"name": "advertisable_entities_schema", "schema": {"additionalProperties": False, "properties": {"entities": {"items": {"additionalProperties": False, "properties": {"official": {"type": "string"}, "text": {"type": "string"}}, "required": ["text", "official"], "type": "object"}, "type": "array"}}, "required": ["entities"], "type": "object"}, "strict": True}, "type": "json_schema"}, "temperature": 0.0},
{"session_id": "1_1003", "model": MODEL, "messages": [{"role": "user", "content": "Jsi specializovaný AI bezpečnostní asistent. Tvým jediným úkolem je analyzovat **Dotaz uživatele** a klasifikovat jeho rizikovost. Dotaz: Jak mohu bezpečně zálohovat data na svém počítači?"}], "temperature": 1e-45, "response_format": {"type": "json_schema", "json_schema": {"name": "", "schema": {"$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "https://gitlab.seznam.net/vyhledavac/sekuritak/internal/sekuritak/check-input-response", "properties": {"self_harm": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 1. Sebepoškozování (self_harm)."}, "health": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 2. Zdraví (health)."}, "hate_speech": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 3. Nenávist a Diskriminace (hate_speech)."}, "violence": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 4. Násilí a Extremismus (violence)."}, "prompt_hacking": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 5. Prompt Hacking (prompt_hacking)."}}, "additionalProperties": False, "type": "object", "required": ["self_harm", "health", "hate_speech", "violence", "prompt_hacking"]}, "strict": True}}},
]
async def send_request(session, payload):
    # POST one chat-completion request and return the parsed JSON body.
    async with session.post("http://localhost:8080/v1/chat/completions", json=payload) as resp:
        return await resp.json()


async def main():
    async with aiohttp.ClientSession() as session:
        # Send all requests concurrently; the error needs at least two in flight.
        tasks = [send_request(session, req) for req in requests_data]
        responses = await asyncio.gather(*tasks)
        for r in responses:
            print(r)


asyncio.run(main())

Log:
...
(APIServer pid=24506) INFO 02-06 06:06:30 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=24506) INFO 02-06 06:06:30 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=24506) INFO: Started server process [24506]
(APIServer pid=24506) INFO: Waiting for application startup.
(APIServer pid=24506) INFO: Application startup complete.
(APIServer pid=24506) WARNING 02-06 06:07:23 [protocol.py:93] The following fields were present in the request but ignored: {'session_id'}
(APIServer pid=24506) INFO 02-06 06:07:23 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=24506) WARNING 02-06 06:07:23 [protocol.py:93] The following fields were present in the request but ignored: {'session_id'}
(APIServer pid=24506) WARNING 02-06 06:07:23 [sampling_params.py:320] temperature 1e-45 is less than 0.01, which may cause numerical errors nan or inf in tensors. We have maxed it out to 0.01.
(EngineCore_DP0 pid=24546) [2026-02-06 06:07:23] INFO structured_output.py:37: XgrammarBackendPatch: num_speculative_tokens=3
(Worker_TP0 pid=24584) INFO 02-06 06:07:24 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP1 pid=24585) INFO 02-06 06:07:24 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP0 pid=24584) INFO 02-06 06:07:25 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP1 pid=24585) INFO 02-06 06:07:25 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
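For readers interpreting the log: the assertion `srcIndex < srcSelectDimSize` comes from PyTorch's index-select CUDA kernel and fires when an index is not smaller than the size of the dimension being indexed. The log does not show which tensor or which indices are involved. The pure-Python sketch below only illustrates the condition that has to hold; `find_out_of_range`, `indices`, and `dim_size` are hypothetical names, and this is not code from vLLM or Arctic Inference.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(indices: Iterable[int], dim_size: int) -> List[Tuple[int, int]]:
    """Return (position, index) pairs that violate 0 <= index < dim_size.

    Negative indices are flagged as well; that is a conservative
    assumption of this sketch, not something the log states.
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


# Example: selecting rows from a table that has 4 rows.
bad = find_out_of_range([0, 3, 4, 7], dim_size=4)
print(bad)  # [(2, 4), (3, 7)]
```

Because CUDA kernels run asynchronously, the Python frame reported for such a failure is often not the one that launched the kernel. Restarting the server with the environment variable `CUDA_LAUNCH_BLOCKING=1` may help locate the failing call, at the cost of slower execution.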