Structured output error in parallel - v0.1.2

Hi, thanks a lot for your work on Arctic Inference.

I tried the new version of Arctic Inference (v0.1.2) and encountered an error when sending at least two requests with structured output in parallel.

Without speculative decoding, everything works.

Configuration:
Model: [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (Fine-tuned on our tasks)
HW: 2xH200
Arctic Inference version: 0.1.2
vLLM version: 0.11.0

RUN:
```bash
vllm serve \
     --model /mnt/models/11/049e9a0368a944959aaaa54216e832bd/artifacts/model \
     --chat-template template_250823.jinja \
     --quantization fp8 \
     --port 8080 \
     --host 0.0.0.0 \
     --tensor-parallel-size 2 \
     --max-model-len 16384 \
     --gpu-memory-utilization 0.91 \
     --max-num-batched-tokens 512 \
     --max-num-seqs 35 \
     --enable-auto-tool-choice \
     --tool-call-parser llama3_json \
     --generation-config auto \
     --speculative-config '{"method": "arctic", "model": "/mnt/models/11/43cb25cf80c7439f80e46fa9ceafb695/artifacts/model", "num_speculative_tokens": 3, "enable_suffix_decoding": false }'
```
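Since the failure only shows up with speculative decoding enabled, it may help to launch the same server twice, once with and once without `--speculative-config`, and run the reproduction against each. The following is a minimal sketch of that A/B setup; the flags and paths are taken from the command above, while `build_command` and `SPECULATIVE_CONFIG` are hypothetical names used only for illustration.

```python
import json
import shlex

# Flags taken from the command in this report (paths are the reporter's own).
BASE_ARGS = [
    "vllm", "serve",
    "--model", "/mnt/models/11/049e9a0368a944959aaaa54216e832bd/artifacts/model",
    "--chat-template", "template_250823.jinja",
    "--quantization", "fp8",
    "--port", "8080",
    "--host", "0.0.0.0",
    "--tensor-parallel-size", "2",
    "--max-model-len", "16384",
    "--gpu-memory-utilization", "0.91",
    "--max-num-batched-tokens", "512",
    "--max-num-seqs", "35",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "llama3_json",
    "--generation-config", "auto",
]

# Same speculative settings as in the report.
SPECULATIVE_CONFIG = {
    "method": "arctic",
    "model": "/mnt/models/11/43cb25cf80c7439f80e46fa9ceafb695/artifacts/model",
    "num_speculative_tokens": 3,
    "enable_suffix_decoding": False,
}


def build_command(speculative: bool) -> list[str]:
    """Return the serve command, optionally with the speculative config appended."""
    cmd = list(BASE_ARGS)
    if speculative:
        cmd += ["--speculative-config", json.dumps(SPECULATIVE_CONFIG)]
    return cmd


if __name__ == "__main__":
    # Print both variants as copy-pasteable shell commands.
    for flag in (True, False):
        print(shlex.join(build_command(flag)))
```

Keeping every other flag identical between the two runs means any difference in behavior can be attributed to the speculative configuration alone.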

Example error.py:
```python
import asyncio
import aiohttp

MODEL = "llama-70b"

requests_data = [
    {"session_id": "1_1000", "model": MODEL, "messages": [{"content": "Úkolem je identifikovat všechny entity s inzertním potenciálem v zadaném textu: O víkendu jsem si koupil nový iPhone 15 v Alze, protože můj starý Samsung už přestal fungovat. Po cestě domů jsem se zastavil v Starbucks pro cappuccino a večer jsem si pustil seriál Zaklínač na Netflixu. Přemýšlím, že si na léto objednám dovolenou přes Booking.com, nejspíš do Chorvatska.", "role": "user"}], "response_format": {"json_schema": {"name": "advertisable_entities_schema", "schema": {"additionalProperties": False, "properties": {"entities": {"items": {"additionalProperties": False, "properties": {"official": {"type": "string"}, "text": {"type": "string"}}, "required": ["text", "official"], "type": "object"}, "type": "array"}}, "required": ["entities"], "type": "object"}, "strict": True}, "type": "json_schema"}, "temperature": 0.0},
    {"session_id": "1_1003", "model": MODEL, "messages": [{"role": "user", "content": "Jsi specializovaný AI bezpečnostní asistent. Tvým jediným úkolem je analyzovat **Dotaz uživatele** a klasifikovat jeho rizikovost. Dotaz: Jak mohu bezpečně zálohovat data na svém počítači?"}], "temperature": 1e-45, "response_format": {"type": "json_schema", "json_schema": {"name": "", "schema": {"$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "https://gitlab.seznam.net/vyhledavac/sekuritak/internal/sekuritak/check-input-response", "properties": {"self_harm": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 1. Sebepoškozování (self_harm)."}, "health": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 2. Zdraví (health)."}, "hate_speech": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 3. Nenávist a Diskriminace (hate_speech)."}, "violence": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 4. Násilí a Extremismus (violence)."}, "prompt_hacking": {"type": "integer", "enum": [1, 2, 3, 4], "description": "Hodnocení pro oblast 5. Prompt Hacking (prompt_hacking)."}}, "additionalProperties": False, "type": "object", "required": ["self_harm", "health", "hate_speech", "violence", "prompt_hacking"]}, "strict": True}}},
]

async def send_request(session, payload):
    async with session.post("http://localhost:8080/v1/chat/completions", json=payload) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, req) for req in requests_data]
        responses = await asyncio.gather(*tasks)
        for r in responses:
            print(r)

asyncio.run(main())
```
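To confirm that concurrency is the trigger, the same payloads can be sent one at a time and then all at once. This is a rough sketch, not part of the original script: `run_sequential`, `run_parallel`, and `failures` are hypothetical helpers, and `send` stands for any coroutine function that takes a payload and returns the parsed response.

```python
import asyncio
from typing import Any, Awaitable, Callable

Send = Callable[[dict], Awaitable[Any]]


async def run_sequential(send: Send, payloads: list[dict]) -> list[Any]:
    """Send requests one by one, so at most one request is in flight."""
    results: list[Any] = []
    for payload in payloads:
        try:
            results.append(await send(payload))
        except Exception as exc:  # keep going so every payload is tried
            results.append(exc)
    return results


async def run_parallel(send: Send, payloads: list[dict]) -> list[Any]:
    """Send all requests concurrently, collecting exceptions instead of raising."""
    return await asyncio.gather(
        *(send(p) for p in payloads), return_exceptions=True
    )


def failures(results: list[Any]) -> list[Exception]:
    """Return only the entries that are exceptions."""
    return [r for r in results if isinstance(r, Exception)]
```

With the script above, `send` could be `functools.partial(send_request, session)` inside the `aiohttp.ClientSession` block. If the sequential run completes cleanly and only the parallel run fails, that points at batching of structured-output requests rather than at either schema on its own.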

Log:
```bash
...
(APIServer pid=24506) INFO 02-06 06:06:30 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=24506) INFO 02-06 06:06:30 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=24506) INFO:     Started server process [24506]
(APIServer pid=24506) INFO:     Waiting for application startup.
(APIServer pid=24506) INFO:     Application startup complete.
(APIServer pid=24506) WARNING 02-06 06:07:23 [protocol.py:93] The following fields were present in the request but ignored: {'session_id'}
(APIServer pid=24506) INFO 02-06 06:07:23 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=24506) WARNING 02-06 06:07:23 [protocol.py:93] The following fields were present in the request but ignored: {'session_id'}
(APIServer pid=24506) WARNING 02-06 06:07:23 [sampling_params.py:320] temperature 1e-45 is less than 0.01, which may cause numerical errors nan or inf in tensors. We have maxed it out to 0.01.
(EngineCore_DP0 pid=24546) [2026-02-06 06:07:23] INFO structured_output.py:37: XgrammarBackendPatch: num_speculative_tokens=3
(Worker_TP0 pid=24584) INFO 02-06 06:07:24 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP1 pid=24585) INFO 02-06 06:07:24 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP0 pid=24584) INFO 02-06 06:07:25 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
(Worker_TP1 pid=24585) INFO 02-06 06:07:25 [custom_all_reduce.py:203] Registering 0 cuda graph addresses
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [23,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
```
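The assertion `srcIndex < srcSelectDimSize` is the bounds check in PyTorch's CUDA index-select kernel: it fires when an index is not smaller than the size of the dimension being indexed. The log does not show which tensor or which indices are involved, so the snippet below is only a host-side sketch of the same condition; `find_out_of_range` and the sample values are hypothetical.

```python
def find_out_of_range(indices: list[int], dim_size: int) -> list[tuple[int, int]]:
    """Return (position, value) pairs for indices outside [0, dim_size)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


if __name__ == "__main__":
    # Hypothetical values: one index equals the dimension size, so it is invalid.
    print(find_out_of_range([0, 5, 128, 3], dim_size=128))
```

Because CUDA kernels run asynchronously, the Python frame reported after such an assertion is often not the one that launched the failing kernel. Rerunning with the environment variable `CUDA_LAUNCH_BLOCKING=1` usually makes the reported stack trace point at the actual call site.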


Structured output error in parallel - v0.1.2 #246
