Development Roadmap (2024 Q3) #634

Ying1123 · 2024-07-17T02:15:39Z

Here is the development roadmap for 2024 Q3. Contributions and feedback are welcome.

Server API

Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- [RFC] Add an LLM engine #1127
Support most OpenAI APIs: Batch, completion, chat, embedding
Support taking embeddings directly as inputs. [Feature] Generation Inputs: input_embeds #745
Support updating the model weights without relaunching the server. @shanyu-sys
- [Feat] Support update weights without restart server #1157
Support Mistral endpoint in the language frontend

Performance

Improve time-to-first-token in streaming mode with better scheduling.
- Optimize schedule #1339
- Fix some online scheduling delay #1345
Implement chunked prefill. @hnyls2002 @vikranth22446
Implement speculative decoding. See also a prototype.
- [Feature] plan to support medusa? #859
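As a rough illustration of the chunked prefill item above, the sketch below splits a long prompt into fixed-size pieces so that each scheduling step handles a bounded number of prompt tokens. It is not SGLang's implementation: the function name `prefill_chunks` and the `chunk_size` parameter are hypothetical, and the scheduler that would interleave these chunks with decoding requests is not shown.

```python
from typing import Iterator, List


def prefill_chunks(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of the prompt, each at most chunk_size tokens long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        # Each slice would be one prefill step; the KV cache built by
        # earlier slices is reused by later ones (not modeled here).
        yield token_ids[start:start + chunk_size]


if __name__ == "__main__":
    prompt = list(range(10))  # stand-in for real token ids
    for step, chunk in enumerate(prefill_chunks(prompt, chunk_size=4)):
        print(step, chunk)
```

Bounding the tokens per step is what lets a scheduler slot decoding work between chunks, which is one way a long prompt can avoid stalling other requests.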

Parallelism

Support sequence parallelism for long context models.
- Sequence Parallel #1041

Quantization

Support W8A16, W4A16 weight-only integer quantization. @zhyncs
- Add torchao quant (int4/int8/fp8) to llama models #1341
Support W4A8 quantization with fp8 activation and int4 weight.
Support fp8/fp4 KV cache quantization. int4/int8 is low priority. fp8 e5m2 is already supported, and we should also support fp8 e4m3. @ispobock
- [Feature] Support fp8 e5m2 kv cache with flashinfer #1204
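To make the e5m2 versus e4m3 distinction in the KV cache item above concrete, here is a small sketch that computes the largest finite value and the spacing just above 1.0 for each format. It assumes the commonly used definitions (e5m2 reserves its top exponent for inf/NaN as IEEE formats do; e4m3 in its "fn" variant reserves only one NaN bit pattern). The helper `fp8_stats` is hypothetical and unrelated to any SGLang or flashinfer API.

```python
def fp8_stats(exp_bits: int, man_bits: int, bias: int, ieee_like: bool):
    """Return (largest finite value, spacing between 1.0 and the next value)."""
    top_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # Top exponent field is reserved for inf/NaN (e5m2 behaves this way).
        max_exp = (top_exp_field - 1) - bias
        max_significand = 2 - 2 ** -man_bits
    else:
        # "fn" variant: only the all-ones mantissa at the top exponent is NaN,
        # so the largest finite significand is one step below all-ones.
        max_exp = top_exp_field - bias
        max_significand = 2 - 2 * 2 ** -man_bits
    return max_significand * 2 ** max_exp, 2 ** -man_bits


if __name__ == "__main__":
    print("e5m2:", fp8_stats(exp_bits=5, man_bits=2, bias=15, ieee_like=True))
    print("e4m3:", fp8_stats(exp_bits=4, man_bits=3, bias=7, ieee_like=False))
```

Under these assumptions e5m2 trades precision for range and e4m3 does the opposite, which is the usual reason for wanting both options for cached keys and values.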

Observability

Integrate Grafana / Prometheus
- [WIP] Prometheus Metrics #1461

Model Coverage

Support interleaved window attention (gemma). @Ying1123
- [Feat] Add window attention for gemma-2 #1056
- [Fix] Compatibility of window attention and cuda graph #1090
- [Fix] Window attention compatible with RadixAttention and chunked prefill #1112
- Save memory from interleaved attention #1151 (delayed to Q4; depends on the new memory manager)
Language Models
- Mamba models
- Deepseek models
Vision Language Models
Embedding models

Hardware Coverage

AMD support
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm #1420

Language API

Function calling. Add a tools argument to sgl.gen. See also guidance and Claudette. For OpenAI models, we can translate to their function calling API (https://platform.openai.com/docs/guides/function-calling). For local models, we can use SGLang primitives (regex, select) and constrained decoding to implement a similar workflow. Alternatively, we can interrupt the decoding process and replace it with function calls. @Yiyun-Liang
- Function calling for OpenAI backend #573
Support sending a full serialized SGL program to the server.
Constrained decoding
- [FEAT] JSON constrained support #1125
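The local-model workflow mentioned in the function calling item above (restrict the tool name to a fixed set, constrain the arguments to a known shape, then dispatch) can be sketched without any model at all. The code below does not call SGLang. It only shows the kind of pattern a constrained decoder could be asked to follow and how the resulting text might be dispatched. The tool registry, the JSON layout, and the function names are all hypothetical.

```python
import json
import re


# Hypothetical tools, for illustration only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


def add(a: float, b: float) -> float:
    return a + b


TOOLS = {"get_weather": get_weather, "add": add}

# The name alternation plays the role of a "select" over tool names;
# the overall shape is what a regex constraint would enforce at decode time.
CALL_PATTERN = re.compile(
    r'\{"name": "(?P<name>'
    + "|".join(re.escape(name) for name in TOOLS)
    + r')", "arguments": (?P<args>\{.*\})\}'
)


def dispatch(model_output: str):
    """Validate a tool-call string against the pattern and run the tool."""
    match = CALL_PATTERN.fullmatch(model_output)
    if match is None:
        raise ValueError("output does not match the tool-call pattern")
    arguments = json.loads(match.group("args"))
    return TOOLS[match.group("name")](**arguments)


if __name__ == "__main__":
    print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

With real constrained decoding the pattern would be enforced while tokens are generated, so the validation step here would never fail. In this sketch it is checked after the fact.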

LoRA Support

Port multi-LoRA serving from S-LoRA (Full optimizations will be in Q4 planning). @Ying1123
- [Feature] Initial support for multi-LoRA serving #1307
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks #1433

Usage examples

Add more usage examples (e.g., parallel JSON decoding, auto parallel decoding, Self-Discover: Large Language Models Self-Compose Reasoning Structures).

Others

Set up CI. @zhyncs @hnyls2002 @merrymercy @Ying1123
Documentation website.
Compiler mode optimizations for the language. (Delayed to Q4)


zhyncs · 2024-07-17T02:20:31Z

> Support W8A4 quantization with fp8 activation and int4 weight.

typo: W8A4 -> W4A8

Ying1123 · 2024-07-17T02:21:27Z

> Support W8A4 quantization with fp8 activation and int4 weight.

> typo: W8A4 -> W4A8

Thanks! Changed.

LinqingZhong · 2024-07-17T13:27:23Z

May I ask if there is an example of using llava-next-interleave with multiple images?

anatoli26 · 2024-07-26T07:50:15Z

I guess ROCm support is under Hardware Coverage - AMD support. Any ETA for this?

usaxena-asapp · 2024-07-26T20:22:14Z

Hey @Ying1123 - are you okay with open source contributions from developers outside the core team? Looking to find more places I can contribute and I'm excited about SGLang. Just wondering.

Ying1123 · 2024-07-26T20:58:44Z

> Hey @Ying1123 - are you okay with open source contributions from developers outside the core team? Looking to find more places I can contribute and I'm excited about SGLang. Just wondering.

Hi @usaxena-asapp, definitely! There is no strict definition of a "core team," and I'm just a volunteer to coordinate. If you contribute a lot, you are a core member! Let me know if you need any help from people with experience. My suggestion is to start with small issues and PRs and join discussions. If you want to start a big one, you can start with a simple proposal to trigger collaborations from the community.

Ying1123 · 2024-07-26T21:00:54Z

> I guess ROCm support is under Hardware Coverage - AMD support. Any ETA for this?

Hi @anatoli26, thanks for the question. It is listed in the roadmap, but we might just start with some basic tests. Optimizations will depend on how many people and resources we can get.

anatoli26 · 2024-07-26T21:20:13Z

> we list it in the roadmap, but we might just start with some basic tests. Optimizations will depend on how many people and resources we can get.

Have you tried talking to AMD about hardware samples (e.g. a pair of W7900) and software collaboration? They are trying hard to be on par with NVIDIA in the software stack: "AMD is Becoming a Software Company. Here's the Plan." The author of the article has some great connections with the AMD people, so maybe you could write to him (W1zzard under the title) to ask for contacts at AMD responsible for relations with FOSS projects?

ghchris2021 · 2024-08-12T20:50:30Z

IDK if there's any potential interest to broaden the concepts involved in "Hardware Coverage" but in case it may raise some ideas to consider in the future:

You mention CPU support, AMD support, but there are higher level frameworks that MAY considerably help with supporting different hardware backends (CPU, GPU) so you don't necessarily have to put as much work / focus into supporting a SPECIFIC backend -- they may ease / largely solve running on more than one for the same effort.
For instance OpenCL, SYCL, Vulkan compute, maybe OpenACC, and others are somewhat portable parallel computing frameworks and support some CPU(s) and some GPU(s) typically at least a couple if not several.

IIRC OpenCL can run on NVIDIA, AMD, and Intel GPUs as well as Intel and AMD CPUs, and I think some ARM CPUs.

IIRC SYCL runs on Intel GPUs, Intel / AMD CPUs, and I believe also NVIDIA GPUs. It may run on AMD GPUs but I'm not so sure about that.

There are still higher-level frameworks / implementations that can encapsulate / provide some of the tools / implementations for such open standards, e.g.

https://github.com/AdaptiveCpp/AdaptiveCpp

targets SYCL but also provides C++ std:: parallelism programming models.

POCL, RustiCL, and several other (Intel, AMD, NVIDIA, ...) development packages / solutions support particular instances of platforms with functionally compatible OpenCL support.

Besides NVIDIA and AMD GPUs, Intel has generations of data center / enterprise / business / consumer grade GPUs which are strong in their capabilities, and they've got the same tooling / documentation / etc. across the product line insofar as supporting stuff like SYCL, OneAPI, OpenVINO, DPC++, libraries like OneDNN, etc. for GPU families and CPUs.

There exist Vulkan wrappers and higher-level middleware that encapsulate the details of Vulkan compute programming and expose easier-to-use developer interfaces / solutions for general parallel compute, math / arithmetic / matrix / vector / NN etc. stuff.

IIRC all major GPUs (NVIDIA / AMD / Intel) have Vulkan-compatible runtimes and development options available, and several ARM SoC GPUs do as well. So, as a middleware layer, it could help support numerous platforms for a single quantum of effort by targeting Vulkan-based operations for the primary memory / NN / linear algebra etc. calculations that can be accelerated.

So I'm just suggesting trying to reach for tools to support multiple standards based platforms if that eases your work and also broadens / accelerates the support of more platforms.

CSEEduanyu · 2024-08-23T08:40:08Z

I noticed that speculative decoding has been implemented in the branch https://github.com/sgl-project/sglang/pull/270/commits. Why was this PR closed? How long will it take to support speculative decoding? Thank you for your reply.

TimDettmers · 2024-08-27T18:52:43Z

This is an awesome project! Thank you for this. @Ying1123 I am interested in using SGLang for multi-LoRA deployments for a project. The alternative is currently vLLM, but I like SGLang better. I am curious about the current state and timeline for supporting S-LoRA-like deployment.

Ying1123 · 2024-08-29T02:51:30Z

> This is an awesome project! Thank you for this. @Ying1123 I am interested in using SGLang for multi-LoRA deployments for a project. The alternative is currently vLLM, but I like SGLang better. I am curious about the current state and timeline for supporting S-LoRA-like deployment.

Hi @TimDettmers, happy to hear from you! I got the same request from another SGLang user, so I am actively working on the multi-LoRA module, which is expected to have a first runnable version in a week. You are welcome to join our Slack and send me your sample script!

hxer7963 · 2024-09-07T06:42:13Z

Is there no plan to support W8A8 quantization?

brotherchen · 2024-09-13T06:51:23Z

Can the current excellent performance (compared to vLLM) be attributed to excellent engineering (such as using multiple processes to reduce CPU overhead) and more efficient scheduling strategies? I would also like to know whether support for pipeline parallelism is being considered.

merrymercy · 2024-09-13T07:42:57Z

@hxer7963 fp8 W8A8 quantization is supported.

@brotherchen yes. Pipeline parallelism is planned for Q4.

CedricHwong · 2024-09-14T03:22:03Z

I am planning to use SGLang on Intel Gaudi 2, but I have not tried it yet. What is the current level of support?

mingfeima · 2024-09-14T03:26:06Z

> Planning to use Sglang on Intel Gaudi 2, but I have not tried it yet. Would like to know the current support level?

@xinyu-intel we don't have binding for Gaudi 2 yet, right?

xinyu-intel · 2024-09-14T13:27:08Z

@CedricHwong @mingfeima Hi, glad to see the requests for sglang on Intel Gaudi. Currently, it's not implemented and we are evaluating the feasibility.

zhyncs · 2024-11-01T05:56:56Z

Most of the list in this 2024 Q3 roadmap has been completed, and the unfinished parts have been migrated to the 2024 Q4 roadmap. This issue is now closed. For those interested in the latest roadmap, please follow #1487

Ying1123 pinned this issue Jul 17, 2024

Ying1123 mentioned this issue Jul 17, 2024

Development Roadmap (Deprecated) #157

Closed

zhyncs mentioned this issue Jul 31, 2024

[Feature] Support for Inference with LoRA Adapter #847

Closed

zhyncs mentioned this issue Aug 13, 2024

[Feature] Are there plans to implement a prefill-decode split inference architecture? #1080

Closed

shanyu-sys mentioned this issue Aug 20, 2024

[Feat] Support update weights without restart server #1157

Merged

4 tasks

Ying1123 mentioned this issue Sep 21, 2024

Development Roadmap (2024 Q4) #1487

Open

37 tasks

zhyncs unpinned this issue Oct 11, 2024

zhyncs closed this as completed Nov 1, 2024
