feat: add vLLM 0.16 mistral option parity for exllamav3#415

Open
lesj0610 wants to merge 18 commits into theroyallab:main from lesj0610:feat/vllm-mistral-parity
Conversation

@lesj0610 lesj0610 commented Feb 24, 2026

Goal

Bring TabbyAPI's ExLlama integration closer to the current vLLM-era parser/tooling behavior where that makes sense, while keeping the ExLlama-specific runtime architecture intact.

What Changed

Parser / tokenizer parity work

  • Added / aligned Mistral parser and tokenizer-mode behavior for ExLlama-backed serving.
  • Expanded reasoning/tool parser parity coverage.
  • Restored qwen3 reasoning parser parity for Qwen3-Next so plain assistant text is emitted as content again when the tokenizer template does not expose the enable_thinking switch.
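The fallback invariant above can be sketched as follows (function and variable names are illustrative, not the PR's actual parser code): when no `<think>` block appears in the output, the text must always surface as `content`, regardless of whether the template exposes `enable_thinking`.

```python
# Minimal sketch of the qwen3 fallback invariant (hypothetical helper,
# not the actual parser implementation in this PR).
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str, template_has_enable_thinking: bool):
    """Return (reasoning, content) for a raw assistant completion."""
    match = THINK_RE.search(text)
    if match:
        reasoning = match.group(1).strip()
        content = THINK_RE.sub("", text, count=1).strip()
        return reasoning, content
    # No <think> block: plain assistant text is ordinary content,
    # whether or not the template exposes enable_thinking.
    return None, text

print(split_reasoning("OK", template_has_enable_thinking=False))
# -> (None, 'OK')
```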

DeepSeek-VL2 support in TabbyAPI

  • Added a built-in DeepSeek-VL2 chat serializer so the model can serve chat completions without requiring an ad-hoc Jinja template.
  • Preserved the official conversation semantics for:
    • <|User|> / <|Assistant|> roles
    • <image> placeholders
    • interleaved image-text content
    • grounding tags such as <|ref|>, <|det|>, and <|grounding|>
  • Allowed DeepseekVLV2ForCausalLM chat-completions flow even without a generic prompt template.
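The serialization described above can be illustrated with a minimal sketch (the exact separator and role formatting here are assumptions; the PR's built-in serializer follows the official conversation semantics):

```python
# Hedged sketch of a DeepSeek-VL2-style chat serializer: <|User|> /
# <|Assistant|> roles, <image> placeholders for image parts, and
# interleaved image-text content. Formatting details are illustrative.
def serialize_deepseek_vl2(messages):
    role_map = {"user": "<|User|>", "assistant": "<|Assistant|>"}
    parts = []
    for msg in messages:
        tag = role_map.get(msg["role"])
        if tag is None:
            continue  # system-message handling omitted in this sketch
        content = msg["content"]
        if isinstance(content, list):
            # Interleaved content: image parts become <image> placeholders,
            # text parts pass through (grounding tags left untouched).
            content = "".join(
                "<image>" if part.get("type") == "image_url" else part.get("text", "")
                for part in content
            )
        parts.append(f"{tag}: {content}")
    parts.append("<|Assistant|>:")  # generation prompt
    return "\n\n".join(parts)

prompt = serialize_deepseek_vl2([
    {"role": "user",
     "content": [{"type": "image_url", "image_url": {"url": "..."}},
                 {"type": "text", "text": "What color is this?"}]},
])
print(prompt)
```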

Input robustness

  • Hardened image loading so broken image streams fail cleanly with a request error instead of taking down the process.
  • Included the earlier exllamav3 startup-lock mitigation on the Tabby side (retained in this PR branch).
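The "fail cleanly" behavior for broken image streams can be sketched like this, assuming Pillow for decoding and a generic request-error exception (both names are stand-ins, not the PR's actual classes):

```python
# Sketch of hardened image loading: a truncated or corrupt stream is
# rejected with a per-request error instead of crashing the process.
import io

from PIL import Image  # assumes Pillow is installed

class RequestError(Exception):
    """Stand-in for the server's 400-level request error type."""

def load_image(raw: bytes) -> Image.Image:
    try:
        image = Image.open(io.BytesIO(raw))
        image.load()  # force a full decode so truncated streams fail here
        return image
    except Exception as exc:
        # Broken stream: fail this request cleanly rather than propagate.
        raise RequestError(f"Could not decode image: {exc}") from exc
```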

Validation

Unit tests

  • Existing parser/tooling suite retained from earlier branch state
  • Additional targeted tests:
    • pytest -q tests/qwen3_reasoning_parser_test.py -> 8 passed
    • pytest -q tests/deepseek_vl2_chat_serializer_test.py -> 2 passed

Runtime smoke

Qwen3-Next

Direct TabbyAPI serving (curl, non-stream + stream):

  • non-stream: content = "OK"
  • stream: delta.content = "OK"

This confirms the qwen3 parser parity regression is fixed for Qwen3-Next.

DeepSeek-VL2

Validated through actual serving:

  • text chat: correct English responses
  • single image: correct color/object responses
  • multi-image interleaved prompts: correct per-image answers
  • grounding tags remain intact through serialization

Reviewer Focus

Please focus review on:

  • reasoning parser fallback invariants for Qwen3-Next
  • DeepSeek-VL2 serializer correctness
  • image decode error handling
  • non-Mistral / non-DeepSeek regression risk (must remain default path)

Note

This PR intentionally keeps the ExLlama-specific runtime model and does not attempt to mirror vLLM's backend orchestration wholesale.

@lesj0610 (Author) commented:
Update pushed on branch feat/vllm-mistral-parity (commit 88014e7).

What was added in this update

  • Extended tokenizer_mode compatibility to vLLM-style keys: auto, hf, slow, mistral, deepseek_v32
  • Normalized slow -> hf for ExLlama backends
  • Added explicit allowlist config for mistral mode: model.mistral_tokenizer_models
  • Enforced Mistral-only activation for tokenizer_mode=mistral (non-Mistral models fall back to default path)
  • Added a safer load-lock release path for cancelled loads (guards against the "Lock is not acquired" error)
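The tokenizer-mode handling above can be sketched as a small resolver (the config key `model.mistral_tokenizer_models` is from the PR text; the function itself and its behavior for unknown modes are illustrative):

```python
# Sketch of tokenizer_mode normalization and Mistral-only activation.
# Mode names follow the vLLM-style keys listed in the PR; the resolver
# itself is a hypothetical helper, not the PR's actual code.
VALID_MODES = {"auto", "hf", "slow", "mistral", "deepseek_v32"}

def resolve_tokenizer_mode(mode: str, model_name: str,
                           mistral_allowlist: list[str]) -> str:
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown tokenizer_mode: {mode!r}")
    if mode == "slow":
        return "hf"  # slow is normalized to hf for ExLlama backends
    if mode == "mistral":
        # Mistral-only activation: models not on the allowlist fall
        # back to the default path instead of the mistral tokenizer.
        if any(name in model_name for name in mistral_allowlist):
            return "mistral"
        return "auto"
    return mode

allowlist = ["Magistral", "Mistral"]
print(resolve_tokenizer_mode("mistral", "Magistral-Small-2509-EXL3-4.0bpw", allowlist))
# -> mistral
print(resolve_tokenizer_mode("mistral", "Llama-3.1-8B-Instruct-exl3-4.0bpw", allowlist))
# -> auto
```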

Validation

  • Unit tests:

    • PYTHONPATH=. pytest -q tests/mistral_tokenizer_mode_test.py tests/parser_options_test.py tests/tool_parser_test.py tests/mistral_reasoning_parser_test.py
    • Result: 35 passed
  • Wheel check:

    • python tests/wheel_test.py
    • flashinfer/torch/jinja2 import checks passed
  • Runtime model checks (EXL3):

    • Magistral-Small-2509-EXL3-4.0bpw
    • test/Llama-3.1-8B-Instruct-exl3-4.0bpw
    • test/Qwen2.5-7B-Instruct-EXL3-4.0bpw
    • test/gemma-2-2b-it-EXL3-4.0bpw
    • test/AFM-4.5B-EXL3-4.0bpw

    Verified:

    • tokenizer mode routing (Magistral -> mistral, others -> auto)
    • EN/KO short + complex prompts
    • THINK/reasoning field presence with mistral reasoning parser and system prompt

    Note: one initial Llama KO-complex sample contained a Unicode replacement character (U+FFFD) in its output, but immediate re-runs of the same case produced clean outputs.

Report artifact path from runtime check: /tmp/tabby_flashinfer_model_test_report.json

@lesj0610 (Author) commented:
Final readiness update (post-fix / post-retest).

Branch head: 88014e7 (clean working tree).

Re-validation with model-specific official sampling settings

Retested with each model's official/recommended sampling values (instead of using Magistral values globally):

  • Magistral-Small-2509-EXL3-4.0bpw
    • temperature=0.7, top_p=0.95, max_tokens=131072
  • Llama-3.1-8B-Instruct-exl3-4.0bpw
    • temperature=0.6, top_p=0.9, max_tokens=256
  • Qwen2.5-7B-Instruct-EXL3-4.0bpw
    • temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05, max_tokens=512
  • gemma-2-2b-it-EXL3-4.0bpw
    • max_tokens=256 (no explicit top_p/temperature in primary official examples)
  • AFM-4.5B-EXL3-4.0bpw
    • temperature=0.5, top_k=50, top_p=0.95, repetition_penalty=1.1, max_tokens=256

Runtime checks performed

  • EN/KO short + complex prompts per model
  • tokenizer mode routing check
    • Magistral -> mistral
    • others -> auto
  • THINK/reasoning field check on Magistral with required system prompt

Result

  • all_tokenizer_mode_ok: true
  • all_prompt_ok: true
  • think_reasoning_present: true

Artifacts:

  • /tmp/tabby_flashinfer_model_test_report_official_params.json
  • /tmp/tabby_flashinfer_model_test_report_round2.json

Note: per latest clarification, Qwen2.5 remains included in test scope (the skip note was about Qwen3, not Qwen2.5).

@lesj0610 (Author) commented:
Review request update:

I updated the PR description with explicit fallback invariants and final validation results (43 passed, plus official-parameter model smoke summary).

Please review with focus on:

  1. Parser dispatch correctness for tool_call_parser=mistral
  2. tokenizer_mode normalization/fallback (auto|hf|slow|mistral|deepseek_v32)
  3. mistral_tokenizer_models allowlist behavior
  4. Non-Mistral default-path regression risk

@lesj0610 (Author) commented:
Added startup freeze fix for ExLlamaV3 JIT lock behavior.

Commit:

  • e52fde2 fix: surface and avoid exllamav3 startup lock deadlock

Root cause confirmed:

  • torch.utils.cpp_extension.load() uses a file baton lock at
    ~/.cache/torch_extensions/*/exllamav3_ext/lock.
  • If the lock file remains stale (e.g., interrupted build/shutdown), startup can wait indefinitely in baton.wait() and appears frozen with no output.

What changed:

  • common/model.py
    • backend registry init changed to lazy init (_ensure_backend_registry())
    • explicit startup log before exllamav3 import
    • lock hint logger (_log_exllamav3_lock_hint) that warns about stale lock path(s)
  • common/multimodal.py
    • removed eager exllama imports at module import time
    • switched to lazy vision imports in add() to avoid pre-start blocking during import graph initialization

Validation:

  • Forced repro by creating lock file:
    • ~/.cache/torch_extensions/py312_cu128/exllamav3_ext/lock
  • Before this patch: python main.py appeared to hang with no logs.
  • After this patch: immediate warning logs identify lock file and cause.
  • Removing lock file allowed normal startup and model load.
