Bug report
Gemma3 Model Token Logits Mismatch with Different max_prefill_predict_length
Description
When running prefill with the Gemma3 4B model, the output token logits differ depending on the max_prefill_predict_length value.
In contrast, other models (e.g., Qwen3, LLaMA 2, LLaMA 3) produce consistent logits regardless of the prefill length.
Steps to Reproduce
- Load the Gemma3 4B model.
- Use the same input prompt: The capital of France is
- Run inference with the following two configurations (a reproduction skeleton follows this list):
- Case 1: max_prefill_predict_length = 1024, max_target_length = 2048
- Case 2: max_prefill_predict_length = 4096, max_target_length = 8192
Observed Behavior (Gemma3 4B)
Case 1:
[null, {"The": -1.68888}, {" capital": -20.57046}, {" of": -0.15805}, {" France": -3.24782}, {" is": -1.67389}]
Full output: The capital of France is Paris.\n\nParis is a global center for art
Case 2:
[null, {"The": -1.68888}, {" capital": -20.30916}, {" of": -0.16284}, {" France": -3.24021}, {" is": -1.64998}]
Full output: The capital of France is Paris�Ar¡ 3<start_of_image>--- 3
Note: The logits for the very first token match exactly; the remaining tokens start to diverge. This is strongly suspected to be related to the KV cache.
Note: These small logit differences accumulate and lead to a large divergence in the generated output for longer sequences (quantified in the snippet below).
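To make the note above concrete, the per-token differences can be computed directly from the log-probabilities reported in the two cases (NumPy only):

```python
import numpy as np

# Log-probabilities copied verbatim from Case 1 and Case 2 above
# (tokens: "The", " capital", " of", " France", " is").
case1 = np.array([-1.68888, -20.57046, -0.15805, -3.24782, -1.67389])
case2 = np.array([-1.68888, -20.30916, -0.16284, -3.24021, -1.64998])

diff = np.abs(case1 - case2)
print("per-token abs diffs:", diff)  # first token: exactly 0.0
print("max abs diff:", diff.max())   # ~0.2613 on " capital"
```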
Expected Behavior
Token logits for the same input prompt should remain identical across different max_prefill_predict_length values, as observed with the Qwen3, LLaMA 2, and LLaMA 3 models.
Additional Information
- Affected Model: Gemma3 4B
- Verified Models (working as expected): Qwen3, LLaMA 2, LLaMA 3
Logs/Output
No response
Environment Information
No response
Additional Context
No response