[Feature] In SGLang, does chunked prefill use a fused (prefill + decode) batch? #1163
Checklist
Motivation
https://arxiv.org/abs/2308.16369
Based on this article, I formed the following hypothesis:
Related resources
No response
Replies: 3 comments 3 replies
I can't figure out where prefill and decode are combined into one batch. Even with chunked prefill enabled, prefill and decode still seem to run separately. Also, the default chunked-prefill-size=8192 seems like a very unreasonable value, and I'm confused by it. Maybe I'm misunderstanding something?
By default, it does not mix prefill and decode. However, you can turn on this flag to mix/fuse them:
sglang/python/sglang/srt/server_args.py
Lines 418 to 422 in 30b4f77
It can help reduce inter-token latency, as described in that paper.
We pick a chunk size of 8192 to favor throughput.
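To make the difference concrete, here is a toy sketch of how one scheduling step could be planned with and without mixing. It is not SGLang's scheduler code: `Request`, `plan_step`, and the `mix` argument are hypothetical names made up for illustration, and counting decode tokens against the chunk budget is an assumption, not something confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record for illustration; not an SGLang type."""
    name: str
    prompt_len: int     # total prompt tokens
    prefilled: int = 0  # prompt tokens already prefilled


def plan_step(prefilling, decoding, chunk_size, mix):
    """Plan one forward pass as a list of (request name, token count) pairs.

    chunk_size caps the prefill tokens taken in a single step.
    mix=False: a step is prefill-only, or decode-only when no prefill is left.
    mix=True:  every running decode request adds one token to the same batch
               as the prefill chunk (assumed here to share the token budget).
    """
    budget = chunk_size
    decode_part = [(r.name, 1) for r in decoding]
    if mix:
        budget -= len(decode_part)

    prefill_part = []
    for r in prefilling:
        take = min(budget, r.prompt_len - r.prefilled)
        if take <= 0:
            continue  # nothing left to prefill, or budget exhausted
        r.prefilled += take
        budget -= take
        prefill_part.append((r.name, take))

    if mix:
        return decode_part + prefill_part
    return prefill_part if prefill_part else decode_part


long_prompt = Request("A", prompt_len=20000)
running = Request("B", prompt_len=16, prefilled=16)

# Separate batches: B produces no token while A's chunk is being prefilled.
print(plan_step([long_prompt], [running], chunk_size=8192, mix=False))
# Mixed batch: B decodes one token alongside A's next chunk.
print(plan_step([long_prompt], [running], chunk_size=8192, mix=True))
```

In this sketch, a smaller `chunk_size` shortens each step that a decode request has to wait behind (or share), while a larger one such as 8192 means fewer steps per long prompt, which matches the throughput-oriented choice described above.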
Is there any documentation describing the chunked-prefill-size option? What is it for? I'm having a hard time finding anything that explains it.