feat: add vLLM 0.16 mistral option parity for exllamav3#415

Open
lesj0610 wants to merge 18 commits into theroyallab:main from lesj0610:feat/vllm-mistral-parity
Conversation

@lesj0610 lesj0610 commented Feb 24, 2026

Goal

Bring TabbyAPI's ExLlama integration closer to the current vLLM-era parser/tooling behavior where that makes sense, while keeping the ExLlama-specific runtime architecture intact.

What Changed

Parser / tokenizer parity work

  • Added / aligned Mistral parser and tokenizer-mode behavior for ExLlama-backed serving.
  • Expanded reasoning/tool parser parity coverage.
  • Restored qwen3 reasoning parser parity for Qwen3-Next so plain assistant text is emitted as content again when the tokenizer template does not expose the enable_thinking switch.
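The fallback invariant above can be sketched as follows (function and variable names are illustrative, not the PR's actual parser code): when no `<think>` block appears in the output, the text must always surface as `content`, regardless of whether the template exposes `enable_thinking`.

```python
# Minimal sketch of the qwen3 fallback invariant (hypothetical helper,
# not the actual parser implementation in this PR).
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str, template_has_enable_thinking: bool):
    """Return (reasoning, content) for a raw assistant completion."""
    match = THINK_RE.search(text)
    if match:
        reasoning = match.group(1).strip()
        content = THINK_RE.sub("", text, count=1).strip()
        return reasoning, content
    # No <think> block: plain assistant text is ordinary content,
    # whether or not the template exposes enable_thinking.
    return None, text

print(split_reasoning("OK", template_has_enable_thinking=False))
# -> (None, 'OK')
```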

DeepSeek-VL2 support in TabbyAPI

  • Added a built-in DeepSeek-VL2 chat serializer so the model can serve chat completions without requiring an ad-hoc Jinja template.
  • Preserved the official conversation semantics for:
    • <|User|> / <|Assistant|> roles
    • <image> placeholders
    • interleaved image-text content
    • grounding tags such as <|ref|>, <|det|>, and <|grounding|>
  • Allowed DeepseekVLV2ForCausalLM chat-completions flow even without a generic prompt template.
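The serialization described above can be illustrated with a minimal sketch (the exact separator and role formatting here are assumptions; the PR's built-in serializer follows the official conversation semantics):

```python
# Hedged sketch of a DeepSeek-VL2-style chat serializer: <|User|> /
# <|Assistant|> roles, <image> placeholders for image parts, and
# interleaved image-text content. Formatting details are illustrative.
def serialize_deepseek_vl2(messages):
    role_map = {"user": "<|User|>", "assistant": "<|Assistant|>"}
    parts = []
    for msg in messages:
        tag = role_map.get(msg["role"])
        if tag is None:
            continue  # system-message handling omitted in this sketch
        content = msg["content"]
        if isinstance(content, list):
            # Interleaved content: image parts become <image> placeholders,
            # text parts pass through (grounding tags left untouched).
            content = "".join(
                "<image>" if part.get("type") == "image_url" else part.get("text", "")
                for part in content
            )
        parts.append(f"{tag}: {content}")
    parts.append("<|Assistant|>:")  # generation prompt
    return "\n\n".join(parts)

prompt = serialize_deepseek_vl2([
    {"role": "user",
     "content": [{"type": "image_url", "image_url": {"url": "..."}},
                 {"type": "text", "text": "What color is this?"}]},
])
print(prompt)
```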

Input robustness

  • Hardened image loading so broken image streams fail cleanly with a request error instead of taking down the process.
  • Included the earlier exllamav3 startup-lock mitigation on the Tabby side (retained in this PR branch).
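The "fail cleanly" behavior for broken image streams can be sketched like this, assuming Pillow for decoding and a generic request-error exception (both names are stand-ins, not the PR's actual classes):

```python
# Sketch of hardened image loading: a truncated or corrupt stream is
# rejected with a per-request error instead of crashing the process.
import io

from PIL import Image  # assumes Pillow is installed

class RequestError(Exception):
    """Stand-in for the server's 400-level request error type."""

def load_image(raw: bytes) -> Image.Image:
    try:
        image = Image.open(io.BytesIO(raw))
        image.load()  # force a full decode so truncated streams fail here
        return image
    except Exception as exc:
        # Broken stream: fail this request cleanly rather than propagate.
        raise RequestError(f"Could not decode image: {exc}") from exc
```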

Validation

Unit tests

  • Existing parser/tooling suite retained from earlier branch state
  • Additional targeted tests:
    • pytest -q tests/qwen3_reasoning_parser_test.py -> 8 passed
    • pytest -q tests/deepseek_vl2_chat_serializer_test.py -> 2 passed

Runtime smoke

Qwen3-Next

Direct TabbyAPI serving (curl, non-stream + stream):

  • non-stream: content = "OK"
  • stream: delta.content = "OK"

This confirms the qwen3 parser parity regression is fixed for Qwen3-Next.

DeepSeek-VL2

Validated through actual serving:

  • text chat: correct English responses
  • single image: correct color/object responses
  • multi-image interleaved prompts: correct per-image answers
  • grounding tags remain intact through serialization

Reviewer Focus

Please focus review on:

  • reasoning parser fallback invariants for Qwen3-Next
  • DeepSeek-VL2 serializer correctness
  • image decode error handling
  • non-Mistral / non-DeepSeek regression risk (must remain default path)

Note

This PR intentionally keeps the ExLlama-specific runtime model and does not attempt to mirror vLLM's backend orchestration wholesale.

@lesj0610 (Author) commented:
Update pushed on branch feat/vllm-mistral-parity (commit 88014e7).

What was added in this update

  • Extended tokenizer_mode compatibility to vLLM-style keys: auto, hf, slow, mistral, deepseek_v32
  • Normalized slow -> hf for ExLlama backends
  • Added explicit allowlist config for mistral mode: model.mistral_tokenizer_models
  • Enforced Mistral-only activation for tokenizer_mode=mistral (non-Mistral models fall back to default path)
  • Added a safer load-lock release path for cancelled loads (guards against the "Lock is not acquired" error)
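The tokenizer-mode handling above can be sketched as a small resolver (the config key `model.mistral_tokenizer_models` is from the PR text; the function itself and its behavior for unknown modes are illustrative):

```python
# Sketch of tokenizer_mode normalization and Mistral-only activation.
# Mode names follow the vLLM-style keys listed in the PR; the resolver
# itself is a hypothetical helper, not the PR's actual code.
VALID_MODES = {"auto", "hf", "slow", "mistral", "deepseek_v32"}

def resolve_tokenizer_mode(mode: str, model_name: str,
                           mistral_allowlist: list[str]) -> str:
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown tokenizer_mode: {mode!r}")
    if mode == "slow":
        return "hf"  # slow is normalized to hf for ExLlama backends
    if mode == "mistral":
        # Mistral-only activation: models not on the allowlist fall
        # back to the default path instead of the mistral tokenizer.
        if any(name in model_name for name in mistral_allowlist):
            return "mistral"
        return "auto"
    return mode

allowlist = ["Magistral", "Mistral"]
print(resolve_tokenizer_mode("mistral", "Magistral-Small-2509-EXL3-4.0bpw", allowlist))
# -> mistral
print(resolve_tokenizer_mode("mistral", "Llama-3.1-8B-Instruct-exl3-4.0bpw", allowlist))
# -> auto
```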

Validation

  • Unit tests:

    • PYTHONPATH=. pytest -q tests/mistral_tokenizer_mode_test.py tests/parser_options_test.py tests/tool_parser_test.py tests/mistral_reasoning_parser_test.py
    • Result: 35 passed
  • Wheel check:

    • python tests/wheel_test.py
    • flashinfer/torch/jinja2 import checks passed
  • Runtime model checks (EXL3):

    • Magistral-Small-2509-EXL3-4.0bpw
    • test/Llama-3.1-8B-Instruct-exl3-4.0bpw
    • test/Qwen2.5-7B-Instruct-EXL3-4.0bpw
    • test/gemma-2-2b-it-EXL3-4.0bpw
    • test/AFM-4.5B-EXL3-4.0bpw

    Verified:

    • tokenizer mode routing (Magistral -> mistral, others -> auto)
    • EN/KO short + complex prompts
    • THINK/reasoning field presence with mistral reasoning parser and system prompt

    Note: one initial Llama KO-complex sample contained a Unicode replacement character (U+FFFD) in its output, but immediate re-runs of the same case produced clean outputs.

Report artifact path from runtime check: /tmp/tabby_flashinfer_model_test_report.json

@lesj0610 (Author) commented:
Final readiness update (post-fix / post-retest).

Branch head: 88014e7 (clean working tree).

Re-validation with model-specific official sampling settings

Retested with each model's official/recommended sampling values (instead of using Magistral values globally):

  • Magistral-Small-2509-EXL3-4.0bpw
    • temperature=0.7, top_p=0.95, max_tokens=131072
  • Llama-3.1-8B-Instruct-exl3-4.0bpw
    • temperature=0.6, top_p=0.9, max_tokens=256
  • Qwen2.5-7B-Instruct-EXL3-4.0bpw
    • temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05, max_tokens=512
  • gemma-2-2b-it-EXL3-4.0bpw
    • max_tokens=256 (no explicit top_p/temperature in primary official examples)
  • AFM-4.5B-EXL3-4.0bpw
    • temperature=0.5, top_k=50, top_p=0.95, repetition_penalty=1.1, max_tokens=256

Runtime checks performed

  • EN/KO short + complex prompts per model
  • tokenizer mode routing check
    • Magistral -> mistral
    • others -> auto
  • THINK/reasoning field check on Magistral with required system prompt

Result

  • all_tokenizer_mode_ok: true
  • all_prompt_ok: true
  • think_reasoning_present: true

Artifacts:

  • /tmp/tabby_flashinfer_model_test_report_official_params.json
  • /tmp/tabby_flashinfer_model_test_report_round2.json

Note: per latest clarification, Qwen2.5 remains included in test scope (the skip note was about Qwen3, not Qwen2.5).

@lesj0610 (Author) commented:
Review request update:

I updated the PR description with explicit fallback invariants and final validation results (43 passed, plus official-parameter model smoke summary).

Please review with focus on:

  1. Parser dispatch correctness for tool_call_parser=mistral
  2. tokenizer_mode normalization/fallback (auto|hf|slow|mistral|deepseek_v32)
  3. mistral_tokenizer_models allowlist behavior
  4. Non-Mistral default-path regression risk

@lesj0610 (Author) commented:
Added startup freeze fix for ExLlamaV3 JIT lock behavior.

Commit:

  • e52fde2 fix: surface and avoid exllamav3 startup lock deadlock

Root cause confirmed:

  • torch.utils.cpp_extension.load() uses a file baton lock at
    ~/.cache/torch_extensions/*/exllamav3_ext/lock.
  • If the lock file remains stale (e.g., interrupted build/shutdown), startup can wait indefinitely in baton.wait() and appears frozen with no output.

What changed:

  • common/model.py
    • backend registry init changed to lazy init (_ensure_backend_registry())
    • explicit startup log before exllamav3 import
    • lock hint logger (_log_exllamav3_lock_hint) that warns about stale lock path(s)
  • common/multimodal.py
    • removed eager exllama imports at module import time
    • switched to lazy vision imports in add() to avoid pre-start blocking during import graph initialization

Validation:

  • Forced repro by creating lock file:
    • ~/.cache/torch_extensions/py312_cu128/exllamav3_ext/lock
  • Before this patch: python main.py appeared to hang with no logs.
  • After this patch: immediate warning logs identify lock file and cause.
  • Removing lock file allowed normal startup and model load.
