Development Roadmap (2024 Q4) #1487

Ying1123 · 2024-09-21T22:38:00Z

Here is the development roadmap for 2024 Q4. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q3 roadmap can be found in #634.

Performance

Hide CPU overhead with overlapped scheduler (Faster overlap mode scheduler #1738, Enable overlap by default #2067)
Support speculative decoding
- EAGLE: Speculative EAGLE2. New PR #2150
- Reference-based. Reference speculative decoding #270
- Medusa head [Feature] plan to support medusa? #859
- Draft model based.
Sparse Attention Support double sparsity #1459
Faster grammar parsing library for constrained decoding [Performance] Support both xgrammar and outlines for constrained decoding #1752
Multi-layer radix cache (GPU/CPU/Disk) @xiezhq-hermann
Improve the performance of mixed chunked prefill. See the draft: Rewrite mixed chunked prefill #1383
Integrate CuDNN paged attention kernels
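
For readers new to the draft-model-based variant of speculative decoding listed above, here is a minimal sketch under greedy decoding. It is a toy illustration, not SGLang's implementation: `target` and `draft` are hypothetical stand-ins for real models (plain functions from a token prefix to the next token), and `speculative_step` and `generate` are names invented for this example. The draft proposes `k` tokens, the target keeps the longest prefix it agrees with, and the target then adds one token of its own, so the output matches plain greedy decoding with the target.

```python
from typing import Callable, List

# Hypothetical model interface: token prefix -> next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Return the tokens produced by one propose-and-verify round."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) The target model verifies them in order and stops at the first
    #    disagreement. A real engine would do this in one batched pass.
    accepted: List[int] = []
    for tok in proposed:
        if target(prefix + accepted) != tok:
            break
        accepted.append(tok)

    # 3) The target always contributes one token, so every round makes
    #    progress even when all draft tokens are rejected.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Run rounds until max_new_tokens tokens have been generated."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]
```

With a draft that agrees with the target most of the time, each round yields several tokens for roughly the cost of one target pass, which is where the speedup comes from.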

Parallelism

Support sequence parallelism [Feature] Add initial support for sequence parallelism #1436. Related paper
Support pipeline parallelism.
Support expert parallelism + data parallelism for DeepSeek/MoE models. @ispobock
- Data parallelism Support DP MLA #1970
- Expert parallelism [Feature] Expert parallelism support #1435
Implement a better cache-aware load balancer for data parallelism. [router] cache-aware load-balancing router v1 #2114 [Feature] Cache-aware Data Parallel Router #1732 @ByronHsu @yichuan520030910320
Overlap communication in tensor parallelism. @ZhuohaoL
Support disaggregated serving to separate prefill and decoding.
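
To illustrate the idea behind the cache-aware load balancer item above, here is a rough sketch that sends a request to the data-parallel worker most likely to already hold its prefix in cache, and falls back to the least-loaded worker otherwise. Everything here is an assumption made for illustration: `CacheAwareRouter`, `min_match`, and the linear scan over past requests are not taken from the linked PRs, and a production router would more likely track prefixes in a radix tree and account for finished requests.

```python
from typing import Dict, List, Optional, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class CacheAwareRouter:
    """Toy router: prefer the worker with the longest cached prefix."""

    def __init__(self, workers: List[str], min_match: int = 1):
        # Token sequences previously routed to each worker (approximates
        # what that worker's prefix cache may contain).
        self.history: Dict[str, List[List[int]]] = {w: [] for w in workers}
        # Requests routed so far; this sketch never decrements it.
        self.load: Dict[str, int] = {w: 0 for w in workers}
        self.min_match = min_match

    def route(self, tokens: List[int]) -> str:
        best_worker: Optional[str] = None
        best_len = 0
        for worker, seqs in self.history.items():
            match = max((shared_prefix_len(tokens, s) for s in seqs),
                        default=0)
            if match > best_len:
                best_worker, best_len = worker, match

        # Too little reuse to matter: balance load instead.
        if best_worker is None or best_len < self.min_match:
            best_worker = min(self.load, key=self.load.get)

        self.history[best_worker].append(list(tokens))
        self.load[best_worker] += 1
        return best_worker
```

The `min_match` threshold captures the trade-off: routing purely by cache hits can overload one worker, while routing purely by load discards prefix reuse.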

Hardware Coverage

AMD optimizations. cc @HaiShaw
- CK kernels
- Setup CI (accuracy/performance) for AMD
Intel XPU support.
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch #1480
- Add initial support for intel Gaudi accelerators #2121

Model Coverage

Multi-modal models
- Llama 3.2 Vision Llama3.2 vision model support #1551
- QWen2-VL Support qwen2 vl model #1546
- GLM 4V Add GLM-4v Multimodal Model support for SGLang #1641
- VILA
- Phi-vision
- FishSpeech audio model support
- Ultravox
Language models
- Mamba models
Reward models
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B #1525
- Gemma2 reward model support #1954

New features

Integrate with LMCache https://github.com/LMCache/LMCache
A padded batch mode to make results more deterministic (see sglang/docs/references/faq.md, line 3 in 8912b76: "## The results are not deterministic, even with a temperature of 0")
Performance optimizations for multi-LoRA serving [LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728

Quantization

@HaiShaw @zhyncs @ispobock

Torchao integration Add llama implementation with no tensor parallel linears #1561
Turbomind operators integration
More CUTLASS mixed precision gemm integration
KV cache quantization (more formats + scaling factor)

Server API

Support directly taking embedding as inputs. [Feature] Generation Inputs: input_embeds #745
Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- Provide an offline engine API #1567
Support endpoints other than OpenAI (Anthropic, Mistral) in the language frontend.
Better APIs to support RL trainers, including https://github.com/huggingface/trl and https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
Support generalized reward API (adding linear layers to any Causal LM to get the reward) https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
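
As a rough picture of what "adding linear layers to any Causal LM to get the reward" could mean, the sketch below applies a single linear layer to the hidden state of the final token and returns a scalar. This is only an assumption about the intended design: the function name `linear_reward_head`, the last-token pooling, and the single-layer head are illustrative choices, not an API from SGLang or OpenRLHF.

```python
from typing import Sequence


def linear_reward_head(hidden_states: Sequence[Sequence[float]],
                       weight: Sequence[float],
                       bias: float = 0.0) -> float:
    """Map per-token hidden states to one scalar reward: w . h_last + b."""
    # Assumption: pool by taking the hidden state of the final token.
    last = hidden_states[-1]
    if len(last) != len(weight):
        raise ValueError("hidden size and weight size differ")
    return sum(h * w for h, w in zip(last, weight)) + bias


# Usage: two tokens with hidden size 2; only the last token is scored.
reward = linear_reward_head([[0.0, 0.0], [1.0, 2.0]], [0.5, -1.0], 0.25)
```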

Observability

Integrate Grafana / Prometheus
- support prometheus metrics #1853 [WIP] Prometheus Metrics #1461

Others

Notebook-style interactive tutorials. @zhaochenyang20
Compiler mode optimizations for the language (e.g. support sending a full serialized SGL program to the server). @hnyls2002
Memory pool refactor to better support mixing different attention layers (e.g., interleaved window attention). @Ying1123
Make vLLM an optional dependency. @zhyncs @ByronHsu @yizhang2077 [Feature] Make vLLM optional in model code #1673

fengyang95 · 2024-09-22T02:02:41Z

Are there any plans to optimize long context latency?

lumiere-ml · 2024-10-17T02:24:33Z

Hi, can I help with the multi-layer radix cache (GPU/CPU/Disk)? I am really interested in that.

tanzelin430 · 2024-10-17T11:58:58Z

I am interested in contributing to the P-D split inference architecture, and I have machines available to develop it on. If you have any related development plans, please let me know. Thank you @Ying1123 @zhyncs @fengyang95

merrymercy · 2024-10-19T13:58:47Z

@lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

zhyncs · 2024-10-20T06:01:03Z

@lumiere-ml @tanzelin430 You are welcome to join our Slack channel: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

tanzelin430 · 2024-10-20T06:14:54Z

Thanks for the invitation, I am in Slack now. Looking forward to collaborating with you.

lumiere-ml · 2024-10-20T09:01:30Z

Thanks for your invitation!

Edenzzzz · 2024-11-11T03:30:14Z

@lumiere-ml @zhyncs I'm also very interested, could you share which channel you're using to discuss?
Perhaps we can combine radix tree prefix matching with P-D disaggregation similar to Mooncake?

mfdj2002 · 2024-11-21T07:40:18Z

If no one is actively working on supporting pipeline parallelism, I'm down to help

Ying1123 changed the title ~~[WIP] Development Roadmap (2024 Q4)~~ Development Roadmap (2024 Q4) Sep 22, 2024

zhyncs pinned this issue Sep 22, 2024

zhyncs mentioned this issue Sep 22, 2024

[Feature] Are there plans to implement a prefill-decode split inference architecture? #1080

Closed

ByronHsu mentioned this issue Oct 4, 2024

Provide an offline engine API #1567

Merged

3 tasks

ByronHsu mentioned this issue Oct 15, 2024

Support vLLM-style rope flashinfer-ai/flashinfer#530

Closed

zhaochenyang20 mentioned this issue Oct 20, 2024

Add documentations for Installation #1733

Closed

3 tasks

zhyncs mentioned this issue Nov 1, 2024

Development Roadmap (2024 Q3) #634

Closed

29 tasks

liangzelang mentioned this issue Nov 15, 2024

[Feature] Expert parallelism support #1435

Open

2 tasks
