[Feature] Expert parallelism support #1435
Comments
sglang/python/sglang/srt/models/mixtral_quant.py, lines 86 to 150 in 441c22d:
This is an early example.
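The referenced snippet spreads the experts across the tensor-parallel ranks and sums the partial outputs with an all-reduce. Below is a minimal, self-contained sketch of that pattern, not the SGLang code itself: the names `SimpleExpertParallelMoE` and `ExpertMLP` are illustrative stand-ins, and raw `torch.distributed` is used where the real model relies on the SGLang/vLLM parallel utilities.

```python
# Illustrative sketch of expert-parallel MoE via expert partitioning + all-reduce.
# Class and helper names are hypothetical; this is not the SGLang implementation.
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """One expert: a standard gated MLP (stand-in for a Mixtral-style expert)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SimpleExpertParallelMoE(nn.Module):
    """Each rank instantiates only its slice of the experts; an all-reduce at the
    end sums the partial outputs so every rank ends up with the full MoE output."""

    def __init__(self, num_experts: int, top_k: int, hidden_size: int,
                 intermediate_size: int, rank: int, world_size: int):
        super().__init__()
        self.top_k = top_k
        # Contiguous split of the expert ids across ranks.
        per_rank = (num_experts + world_size - 1) // world_size
        self.local_expert_ids = list(
            range(rank * per_rank, min((rank + 1) * per_rank, num_experts)))
        # Router is replicated: assumed to hold identical weights on every rank.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleDict({
            str(i): ExpertMLP(hidden_size, intermediate_size)
            for i in self.local_expert_ids
        })

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_tokens, hidden_size]
        router_logits = self.gate(hidden_states)
        routing_weights = F.softmax(router_logits, dim=-1)
        routing_weights, selected_experts = torch.topk(
            routing_weights, self.top_k, dim=-1)
        routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

        output = torch.zeros_like(hidden_states)
        for expert_id in self.local_expert_ids:
            mask = (selected_experts == expert_id)  # [num_tokens, top_k]
            if not mask.any():
                continue
            # Weight is zero for tokens not routed to this expert.
            weight = (routing_weights * mask).sum(dim=-1, keepdim=True)
            output += self.experts[str(expert_id)](hidden_states) * weight

        # Experts on other ranks contributed to other tokens; sum the partials.
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(output, op=dist.ReduceOp.SUM)
        return output
```

Note that in this sketch every rank still runs each of its local experts over the full token batch and masks unrouted tokens with a zero weight, so the gain is mainly in memory (each rank stores only its experts), not in compute.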
@merrymercy Hi, has any progress been made on this issue? The example you provided previously used a plain MLP rather than FusedMoE. How can we enable expert parallelism with the current Mixtral/DeepSeek-V2 models now that they use FusedMoE? Do you have a modified example?
Related: #1970
@merrymercy I see that this issue is mainly related to TP and DP. I noticed that the SGLang Q4 roadmap #1487 mentions supporting this feature.
@liangzelang DP has already been merged (only for DeepSeek right now, #1970), and EP will be supported soon. cc @ispobock
@zhyncs Is there any support for MoE-EP yet? I have implemented MoE-EP myself.
@xiaobochen123 We are going to implement it with a DP + EP approach for throughput gains. Currently, DP attention is implemented. Before we start on EP, some updates to the MoE codebase need to be done. I am interested in what kind of MoE-EP you implemented and which codebase you used. How much performance gain did you see compared to TP?
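For contrast with the all-reduce-based partitioning sketched above, here is a minimal sketch of what an all-to-all style EP dispatch/combine step can look like. It assumes top-1 routing, experts evenly divisible across ranks, a single process group, and a backend that supports all-to-all (e.g. NCCL with all tensors on the correct device). The function name `ep_moe_forward` and its signature are hypothetical; this is not the SGLang implementation.

```python
# Hedged sketch of all-to-all token dispatch for expert parallelism (top-1 routing).
import torch
import torch.distributed as dist


def ep_moe_forward(tokens, expert_ids, local_experts, num_experts, group=None):
    """tokens: [n, hidden]; expert_ids: [n] with the top-1 expert per token.
    local_experts: {global_expert_id: nn.Module} owned by this rank.
    Assumes num_experts is evenly divisible by the world size."""
    world_size = dist.get_world_size(group)
    experts_per_rank = num_experts // world_size

    # 1. Sort tokens by the rank that owns their expert.
    dest_rank = expert_ids // experts_per_rank
    order = torch.argsort(dest_rank)
    send_counts = torch.bincount(dest_rank, minlength=world_size)

    # 2. Dispatch: exchange per-rank counts, then the tokens and their expert ids.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)
    in_splits, out_splits = send_counts.tolist(), recv_counts.tolist()

    recv_tokens = tokens.new_empty(sum(out_splits), tokens.shape[1])
    dist.all_to_all_single(recv_tokens, tokens[order],
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)
    recv_ids = expert_ids.new_empty(sum(out_splits))
    dist.all_to_all_single(recv_ids, expert_ids[order],
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)

    # 3. Run each locally owned expert on the tokens routed to it.
    out = torch.zeros_like(recv_tokens)
    for expert_id, expert in local_experts.items():
        mask = recv_ids == expert_id
        if mask.any():
            out[mask] = expert(recv_tokens[mask])

    # 4. Combine: send results back (split sizes swapped) and undo the sort.
    combined = tokens.new_empty(tokens.shape)
    dist.all_to_all_single(combined, out,
                           output_split_sizes=in_splits,
                           input_split_sizes=out_splits, group=group)
    result = torch.empty_like(combined)
    result[order] = combined
    return result
```

The trade-off versus the partition-plus-all-reduce layer above is communication volume: the all-reduce exchanges the full [num_tokens, hidden] output on every rank, while the two all-to-alls move each token roughly once to the rank that owns its expert and once back, which is part of why EP is typically paired with DP attention when optimizing for throughput.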
Checklist
Motivation
Hi team,
First of all, thanks so much for such a great project. I am wondering whether there is a plan to support Expert Parallelism for MoE models?
Related resources
https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html