feat: add vLLM 0.16 mistral option parity for exllamav3 #415

lesj0610 wants to merge 18 commits into theroyallab:main from
Conversation
…c fix, inference abort fix
Update pushed on branch.

What was added in this update:
Validation
Report artifact path from runtime check:
Final readiness update (post-fix / post-retest). Branch head:

Re-validation with model-specific official sampling settings

Retested with each model's official/recommended sampling values (instead of using Magistral values globally):
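The per-model re-validation described above amounts to selecting each model's own preset instead of one global preset. A minimal sketch of that selection, with hypothetical helper names and placeholder values (these are not the models' actual official recommendations):

```python
# Hypothetical per-model sampling presets; the numeric values below are
# placeholders, not the models' real official/recommended settings.
OFFICIAL_SAMPLING = {
    "magistral": {"temperature": 0.7, "top_p": 0.95},
    "qwen2.5": {"temperature": 0.7, "top_p": 0.8},
}

# Previously, the Magistral values were applied globally.
DEFAULT_SAMPLING = OFFICIAL_SAMPLING["magistral"]

def sampling_for(model_name: str) -> dict:
    """Pick the model's own preset instead of applying one preset globally."""
    key = model_name.lower()
    for prefix, preset in OFFICIAL_SAMPLING.items():
        if key.startswith(prefix):
            return preset
    return DEFAULT_SAMPLING
```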
Runtime checks performed
Result
Artifacts:
Note: per latest clarification, Qwen2.5 remains included in test scope (the skip note was about Qwen3, not Qwen2.5). |
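For context on what the runtime checks assert, a sketch of the kind of response validation involved, assuming the standard OpenAI-compatible `/v1/chat/completions` response shapes that TabbyAPI serves (helper names are hypothetical):

```python
import json

def check_nonstream(response_text: str, expected: str = "OK") -> bool:
    """Validate a non-stream /v1/chat/completions response body."""
    body = json.loads(response_text)
    content = body["choices"][0]["message"]["content"]
    return content == expected

def check_stream_chunk(sse_line: str, expected: str = "OK") -> bool:
    """Validate one 'data: {...}' SSE chunk from a streamed completion."""
    payload = json.loads(sse_line.removeprefix("data: "))
    delta = payload["choices"][0]["delta"]
    return delta.get("content") == expected
```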
Review request update: I updated the PR description with explicit fallback invariants and final validation results. Please review with focus on:
Added startup freeze fix for ExLlamaV3 JIT lock behavior. Commit:
Root cause confirmed:
What changed:
Validation:
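To illustrate the lock behavior the fix targets: when several workers race to JIT-compile the same extension at startup, the build can deadlock or freeze. A minimal sketch of the compile-once-under-a-lock pattern such a fix implies (hypothetical names; not ExLlamaV3's actual API):

```python
import threading

_jit_lock = threading.Lock()
_jit_ready = False

def ensure_jit_compiled(compile_fn) -> None:
    """Run the expensive JIT compile exactly once, even under concurrent
    startup, instead of letting multiple threads race on the same build."""
    global _jit_ready
    if _jit_ready:           # fast path: already built, no lock needed
        return
    with _jit_lock:          # slow path: one thread compiles, the rest wait
        if not _jit_ready:   # double-check after acquiring the lock
            compile_fn()
            _jit_ready = True
```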
Goal
Bring TabbyAPI's ExLlama integration closer to the current vLLM-era parser/tooling behavior where that makes sense, while keeping the ExLlama-specific runtime architecture intact.
What Changed
Parser / tokenizer parity work
- `qwen3` reasoning parser parity for `Qwen3-Next`, so plain assistant text is emitted as `content` again when the tokenizer template does not expose the `enable_thinking` switch.
- DeepSeek-VL2 support in TabbyAPI:
  - `<|User|>` / `<|Assistant|>` roles
  - `<image>` placeholders
  - `<|ref|>`, `<|det|>`, and `<|grounding|>` tokens
  - `DeepseekVLV2ForCausalLM` chat-completions flow, even without a generic prompt template.
- Input robustness
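The `qwen3` parity behavior above boils down to a fallback: if the model emits no thinking block (e.g. the template has no `enable_thinking` switch), the whole answer must land in `content` rather than being swallowed as reasoning. A minimal sketch under that assumption (hypothetical function name, not TabbyAPI's actual parser API):

```python
def parse_assistant_text(text: str) -> dict:
    """Split '<think>...</think>answer' output into reasoning vs. content.

    Fallback invariant: with no <think> block present, the entire text is
    emitted as plain `content` (reasoning_content stays None).
    """
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in text and close_tag in text:
        start = text.index(open_tag) + len(open_tag)
        end = text.index(close_tag)
        return {
            "reasoning_content": text[start:end].strip(),
            "content": text[end + len(close_tag):].strip(),
        }
    # No thinking block: plain assistant text goes to content.
    return {"reasoning_content": None, "content": text.strip()}
```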
Validation
Unit tests
- `pytest -q tests/qwen3_reasoning_parser_test.py` -> `8 passed`
- `pytest -q tests/deepseek_vl2_chat_serializer_test.py` -> `2 passed`

Runtime smoke
Qwen3-Next

Direct TabbyAPI serving (`curl`, non-stream + stream):
- non-stream: `content = "OK"`
- stream: `delta.content = "OK"`

This confirms the `qwen3` parser parity regression is fixed for `Qwen3-Next`.

DeepSeek-VL2
Validated through actual serving:
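For reviewers, the DeepSeek-VL2 role/placeholder serialization exercised here can be sketched roughly as follows (hypothetical helper; the real serializer also handles `<|ref|>`, `<|det|>`, and `<|grounding|>` spans):

```python
def serialize_deepseek_vl2(messages: list[dict]) -> str:
    """Render chat messages into DeepSeek-VL2's native prompt roles,
    used when no generic prompt template is available."""
    role_tags = {"user": "<|User|>", "assistant": "<|Assistant|>"}
    parts = []
    for msg in messages:
        tag = role_tags.get(msg["role"], "")
        # Image attachments become <image> placeholders ahead of the text.
        placeholders = "<image>" * msg.get("num_images", 0)
        parts.append(f"{tag}: {placeholders}{msg['content']}")
    parts.append("<|Assistant|>:")  # generation prompt for the reply
    return "\n\n".join(parts)
```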
Reviewer Focus
Please focus review on:
- `Qwen3-Next`

Note
This PR intentionally keeps the ExLlama-specific runtime model and does not attempt to mirror vLLM's backend orchestration wholesale.