[Feature] In SGLang, does chunked prefill use a fused (prefill + decode) batch? #1163

CSEEduanyu · 2024-08-20T14:07:44Z

Checklist

1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
2. Please use English, otherwise it will be closed.

Motivation

https://arxiv.org/abs/2308.16369 Based on this paper, I have the following hypotheses:
1. There are no longer separate prefill batches or decode batches, only fused batches (each containing both prefill and decode).
2. The ideal chunk size depends on the hardware and on the model's computational intensity. On A100 + LLaMA-13B, is this number about 512?

Related resources

No response

CSEEduanyu · 2024-08-20T14:09:36Z

Author

I can't figure out where prefill and decode are combined into one batch; with chunked prefill enabled, prefill and decode still seem to run separately. Also, the default chunked-prefill-size=8192 seems like an unreasonable value to me, and I'm confused by it. Maybe I am misunderstanding something?

0 replies

merrymercy · 2024-08-25T19:31:22Z

Maintainer

By default, it does not mix prefill and decode. However, you can turn on this flag to mix/fuse them:

sglang/python/sglang/srt/server_args.py

Lines 418 to 422 in 30b4f77

    parser.add_argument(
        "--enable-mixed-chunk",
        action="store_true",
        help="Enabling mixing prefill and decode in a chunked batch.",
    )

It can help reduce the inter-token latency as described in that paper.

We pick chunk size 8192 to favor throughput.
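
For readers trying to picture the difference, here is a toy sketch of how batches might be composed with and without mixing. It is not SGLang's actual scheduler: `Batch`, `plan_batches`, and the request names are hypothetical, and it assumes the chunk size limits prefill tokens only and that a single new prompt arrives while other requests are already decoding.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """One forward pass: prefill tokens plus the decode requests it carries."""
    prefill_tokens: int = 0
    decode_requests: list = field(default_factory=list)


def plan_batches(prompt_len, running, chunk_size=8192, mixed=False, new_req="new"):
    """Toy plan for admitting one prompt while `running` requests decode.

    Hypothetical helper for illustration only; not SGLang's scheduler.
    """
    batches = []
    remaining = prompt_len
    while remaining > 0:
        # Split the prompt into prefill chunks of at most `chunk_size` tokens.
        take = min(chunk_size, remaining)
        remaining -= take
        # Mixed mode: running decode requests ride along with each chunk.
        # Default mode: the chunk runs alone and decodes wait.
        riders = list(running) if mixed else []
        batches.append(Batch(prefill_tokens=take, decode_requests=riders))
    # Once prefill is done, the new request joins the decode batch.
    batches.append(Batch(decode_requests=list(running) + [new_req]))
    return batches


def stalled_steps(batches):
    """Count forward passes in which no decode request makes progress."""
    return sum(1 for b in batches if not b.decode_requests)


if __name__ == "__main__":
    for mixed in (False, True):
        plan = plan_batches(20000, ["a", "b"], chunk_size=8192, mixed=mixed)
        print(mixed, [b.prefill_tokens for b in plan], stalled_steps(plan))
```

In this toy model, the default mode leaves the running requests without a new token for three consecutive passes while the 20000-token prompt is prefilled, whereas mixed mode leaves none. That is the intuition behind the inter-token latency benefit mentioned above; it says nothing about real timings.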

3 replies

CSEEduanyu Sep 1, 2024
Author

Thanks for the reply, I've seen this change in the latest version!

Desmond819 Sep 1, 2024

Does it increase throughput?

merrymercy Sep 4, 2024
Maintainer

In our tests, it can reduce inter-token latency for some workloads. It does not increase peak throughput.

KylinMountain · 2024-09-01T08:28:18Z

KylinMountain
Sep 1, 2024

Is there any documentation describing the chunked-prefill-size option? What is it for? I could not find anything that explains it.

0 replies
