[Feature] Throughput-aware speculative decoding #1979

vkc1vk · 2024-11-10T01:06:59Z

Checklist

1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
2. Please use English, otherwise it will be closed.

Motivation

Speculative decoding techniques such as EAGLE are mostly effective in low-throughput scenarios, so I was wondering whether it makes sense to shrink the draft trees, or to turn speculative decoding off completely, in a throughput-aware manner.

What do you think?
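
To make the idea concrete, here is a rough sketch of one possible policy: use the full draft tree while the running batch is small, shrink it linearly as the batch grows, and disable speculation past a cutoff. Every name and threshold here (`SpecConfig`, `choose_draft_tokens`, the batch-size limits) is hypothetical and chosen only for illustration; none of it is an existing SGLang option, and running batch size is just one assumed proxy for load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecConfig:
    """Hypothetical tuning knobs; these are NOT real SGLang settings."""
    max_draft_tokens: int = 16     # draft tree size when the server is lightly loaded
    min_draft_tokens: int = 2      # smallest draft tree before speculation is disabled
    shrink_above_batch: int = 4    # start shrinking once the batch exceeds this size
    disable_above_batch: int = 32  # disable speculation at or above this batch size


def choose_draft_tokens(running_batch_size: int, cfg: SpecConfig = SpecConfig()) -> int:
    """Return the draft-token budget per request; 0 means speculation is off."""
    if running_batch_size < 0:
        raise ValueError("running_batch_size must be non-negative")
    if running_batch_size >= cfg.disable_above_batch:
        return 0  # high load: fall back to plain decoding
    if running_batch_size <= cfg.shrink_above_batch:
        return cfg.max_draft_tokens  # low load: speculate with the full tree
    # In between: interpolate linearly from the largest to the smallest tree.
    span = cfg.disable_above_batch - cfg.shrink_above_batch
    frac = (running_batch_size - cfg.shrink_above_batch) / span
    size = cfg.max_draft_tokens - frac * (cfg.max_draft_tokens - cfg.min_draft_tokens)
    return max(cfg.min_draft_tokens, int(size))


if __name__ == "__main__":
    for batch in (1, 4, 18, 31, 32, 64):
        print(batch, choose_draft_tokens(batch))
```

A real scheduler would probably want to measure acceptance rate and step latency rather than rely on batch size alone, and would need some hysteresis so that the setting does not flip back and forth near a threshold.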

Related resources

No response

zhyncs · 2024-11-10T11:34:28Z

You're talking about Automatic Speculative Decoding, which is feasible.
