Structured output error in parallel - v0.1.2 #246

@petrpechman

Hi, thanks a lot for your work on Arctic Inference.

I tried the new version of Arctic Inference (v0.1.2) and encountered an error when sending at least two requests with structured output in parallel.

Without speculative decoding, everything works.

Configuration:
Model: Llama-3.1-70B-Instruct (Fine-tuned on our tasks)
HW: 2xH200
Arctic Inference version: 0.1.2
vLLM version: 0.11.0

RUN:

vllm serve \
     --model /mnt/models/11/049e9a0368a944959aaaa54216e832bd/artifacts/model \
     --chat-template template_250823.jinja \
     --quantization fp8 \
     --port 8080 \
     --host 0.0.0.0 \
     --tensor-parallel-size 2 \
     --max-model-len 16384 \
     --gpu-memory-utilization 0.91 \
     --max-num-batched-tokens 512 \
     --max-num-seqs 35 \
     --enable-auto-tool-choice \
     --tool-call-parser llama3_json \
     --generation-config auto \
     --speculative-config '{"method": "arctic", "model": "/mnt/models/11/43cb25cf80c7439f80e46fa9ceafb695/artifacts/model", "num_speculative_tokens": 3, "enable_suffix_decoding": false }'

Example error.py:

import asyncio
import aiohttp

MODEL = "llama-70b"

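# The prompts are in Czech and are kept verbatim so the reproduction is unchanged.
# Request 1: "The task is to identify all entities with advertising potential in the
#   given text", followed by a short text mentioning iPhone 15, Alza, Samsung,
#   Starbucks, The Witcher (Zaklínač), Netflix, Booking.com and Croatia.
# Request 2: "You are a specialised AI safety assistant. Your only task is to analyse
#   the User query and classify its riskiness. Query: How can I safely back up the
#   data on my computer?" The schema descriptions read "Rating for area N".
requests_data = [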
    {"session_id": "1_1000", "model": MODEL, "messages": [{"content": "Úkolem je identifikovat všechny entity s inzertním potenciálem v zadaném textu: O víkendu jsem si koupil nový iPhone 15 v Alze, protože můj starý Samsung už přestal fungovat. Po cestě domů jsem se zastavil v Starbucks pro cappuccino a večer jsem si pustil seriál Zaklínač na Netflixu. Přemýšlím, že si na léto objednám dovolenou přes Booking.com, nejspíš do Chorvatska.", "role": "user"}], "response_format": {"json_schema": {"name": "advertisable_entities_schema", "schema": {"additionalProperties": False, "properties": {"entities": {"items": {"additionalProperties": False, "properties": {"official": {"type": "string"}, "text": {"type": "string"}}, "required": ["text", "official"], "type": "object"}, "type": "array"}}, "required": ["entities"], "type": "object"}, "strict": True}, "type": "json_schema"}, "temperature": 0.0},
    {"session_id": "1_1003", "model": MODEL, "messages": [{"role": "user", "content": "Jsi specializovaný AI bezpečnostní asistent. Tvým jediným úkolem je analyzovat **Dotaz uživatele** a klasifikovat jeho rizikovost. Dotaz: Jak mohu bezpečně zálohovat data na svém počítači?"}], "temperature": 1e-45, "response_format": {"type": "json_schema", "json_schema": {"name": "", "schema": {"$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "https://gitlab.seznam.net/vyhledavac/sekuritak/internal/sekuritak/check-input-response", "properties": {"self_harm": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 1. Sebepoškozování (self_harm)."}, "health": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 2. Zdraví (health)."}, "hate_speech": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 3. Nenávist a Diskriminace (hate_speech)."}, "violence": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 4. Násilí a Extremismus (violence)."}, "prompt_hacking": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 5. Prompt Hacking (prompt_hacking)."}}, "additionalProperties": False, "type": "object", "required": ["self_harm", "health", "hate_speech", "violence", "prompt_hacking"]}, "strict": True}}},
]

async def send_request(session, payload):
    async with session.post("http://localhost:8080/v1/chat/completions", json=payload) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, req) for req in requests_data]
        responses = await asyncio.gather(*tasks)
        for r in responses:
            print(r)

asyncio.run(main())
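
Since the failure is reported only when at least two structured-output requests are in flight, sending the same payloads one at a time is a natural control run. Below is a minimal sketch, assuming a caller-supplied async `send` callable (for example a wrapper around `send_request` from error.py). `run_sequential` and `run_parallel` are hypothetical helper names, not part of vLLM or Arctic Inference.

```python
import asyncio


async def run_sequential(send, payloads):
    """Send payloads one at a time; each request finishes before the next starts."""
    results = []
    for payload in payloads:
        results.append(await send(payload))
    return results


async def run_parallel(send, payloads):
    """Send all payloads concurrently, as error.py does with asyncio.gather."""
    # return_exceptions=True keeps one failed request from hiding the others.
    return await asyncio.gather(
        *(send(payload) for payload in payloads),
        return_exceptions=True,
    )
```

Inside `main()` from error.py, `send` could be `lambda p: send_request(session, p)`. If the sequential run succeeds and the parallel run fails, that would support the observation that the problem needs concurrent structured-output requests.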

Log:

...
(APIServer pid=24506) INFO 02-06 06:06:30 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=24506) INFO 02-06 06:06:30 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=24506) INFO:     Started server process [24506]
(APIServer pid=24506) INFO:     Waiting for application startup.
(APIServer pid=24506) INFO:     Application startup complete.
(APIServer pid=24506) WARNING 02-06 06:07:23 [protocol.py:93] The following fields were present in the request but ignored: {'session_id'}
(APIServer pid=24506) INFO 02-06 06:07:23 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=24506) WARNING 02-06 06:07:23 [protocol.py:93] The following fields were present in the request but ignored: {'session_id'}
(APIServer pid=24506) WARNING 02-06 06:07:23 [sampling_params.py:320] temperature 1e-45 is less than 0.01, which may cause numerical errors nan or inf in tensors. We have maxed it out to 0.01.
(EngineCore_DP0 pid=24546) [2026-02-06 06:07:23] INFO structured_output.py:37: XgrammarBackendPatch: num_speculative_tokens=3
(Worker_TP0 pid=24584) INFO 02-06 06:07:24 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP1 pid=24585) INFO 02-06 06:07:24 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP0 pid=24584) INFO 02-06 06:07:25 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP1 pid=24585) INFO 02-06 06:07:25 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
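
The assertion `srcIndex < srcSelectDimSize` in `Indexing.cu` is the bounds check of PyTorch's CUDA index-select kernel: it fires when an index is not smaller than the size of the dimension being indexed. The log above does not show which tensor is involved. The condition itself can be sketched in plain Python; `out_of_range` is a hypothetical helper for illustration only and is not taken from vLLM, Arctic Inference, or PyTorch.

```python
def out_of_range(indices, dim_size):
    """Return (position, index) pairs that an index-select over a
    dimension of length dim_size would reject."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


# Example: a dimension of size 4 accepts indices 0..3 only.
bad = out_of_range([0, 3, 5, -1], dim_size=4)
print(bad)
```

Because CUDA errors are reported asynchronously, starting the server with the environment variable `CUDA_LAUNCH_BLOCKING=1` usually makes the Python stack trace point at the call that actually failed.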
