This is a draft roadmap for DeepSpeed Q2 2026. Feedback is welcome — please leave comments on this issue or join the #2026q2-roadmap channel on the DeepSpeed Slack.
New features and enhancements
AutoEP support
AutoEP enables Expert Parallelism (EP) for major Mixture-of-Experts (MoE) models out of the box, eliminating the need for users to write model-specific parallelization code. By automatically distributing expert layers across devices, AutoEP allows users to scale MoE training with minimal configuration changes.
A prototype implementation has been validated on 8xH100, achieving roughly a 5x throughput improvement over ZeRO-3 baselines. We will build on this work to bring AutoEP to production readiness in Q2; a sketch of the expert placement it automates follows the list below.
- Convergence validation: Verify that training convergence matches non-EP baselines across the latest MoE model architectures
- Model coverage: Add support for additional MoE architectures (e.g., Qwen-MoE)
- ZeRO-3 support: Extend AutoEP to work with ZeRO Stage 3
- AutoTP integration: Combine AutoEP with AutoTP for hybrid expert/tensor parallelism
- Benchmarking: Publish throughput, memory, and scaling efficiency numbers across model sizes and GPU counts
- Universal Checkpoint support: Enable saving and resuming from Universal Checkpoints with AutoEP
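For context, the sketch below shows the kind of expert placement that expert parallelism implies and that AutoEP is meant to set up automatically. It is illustrative only: the class and helper names are invented for this example, it is not the AutoEP API, and the all-to-all dispatch/combine step is omitted.

```python
# Illustrative sketch of expert-parallel placement (not the AutoEP API).
# Each rank in the EP group materializes only its own slice of the experts;
# token hidden states routed to remote experts are exchanged with all-to-all
# before/after the expert MLPs (dispatch/combine omitted for brevity).
import torch.distributed as dist
from torch import nn


def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> list:
    """Contiguous block of expert ids owned by this EP rank."""
    per_rank = num_experts // ep_size
    return list(range(ep_rank * per_rank, (ep_rank + 1) * per_rank))


class ShardedExperts(nn.Module):
    """Holds only this rank's shard of an MoE layer's experts."""

    def __init__(self, hidden: int, num_experts: int, ep_group=None):
        super().__init__()
        ep_size = dist.get_world_size(group=ep_group)
        ep_rank = dist.get_rank(group=ep_group)
        assert num_experts % ep_size == 0, "experts must divide evenly across EP ranks"
        self.expert_ids = local_expert_ids(num_experts, ep_size, ep_rank)
        self.experts = nn.ModuleDict({
            str(eid): nn.Sequential(
                nn.Linear(hidden, 4 * hidden),
                nn.GELU(),
                nn.Linear(4 * hidden, hidden),
            )
            for eid in self.expert_ids
        })
```

The point of AutoEP is that this placement, the dispatch/combine collectives, and the interaction with ZeRO sharding are derived from the model definition rather than hand-written for each MoE architecture.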
AutoTP extension
AutoTP was significantly revamped in Q1 (PR #7806), introducing a flexible, configuration-driven API for custom layer partitioning patterns. In Q2, we will extend this foundation to support a broader range of models and scales.
- HuggingFace `tp_plan` support: Leverage the `base_model_tp_plan` metadata provided by HuggingFace Transformers models to automatically derive partitioning configurations, enabling out-of-the-box TP for any model that ships with a `tp_plan` (see the sketch after this list)
- Combination with AutoEP: Support parallel folding for hybrid expert/tensor parallelism
- Universal Checkpoint support: Enable saving and resuming from Universal Checkpoints with AutoTP
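As a rough illustration of the `tp_plan` item above, the snippet below reads the partitioning metadata that recent Hugging Face Transformers configs expose. The attribute name and the "colwise"/"rowwise" style strings follow recent transformers releases; how AutoTP will consume this metadata is exactly what the work item is about, so the grouping here is only a sketch.

```python
# Hedged sketch: inspecting the tp_plan metadata shipped with recent
# Hugging Face Transformers configs (attribute name and value format taken
# from recent transformers releases; guarded with getattr in case a model
# does not provide a plan).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")
tp_plan = getattr(config, "base_model_tp_plan", None) or {}

# tp_plan maps glob-like module patterns to partition styles, e.g.
#   {"layers.*.self_attn.q_proj": "colwise",
#    "layers.*.mlp.down_proj": "rowwise", ...}
column_parallel = sorted(p for p, s in tp_plan.items() if "colwise" in s)
row_parallel = sorted(p for p, s in tp_plan.items() if "rowwise" in s)

print("column-parallel modules:", column_parallel)
print("row-parallel modules:", row_parallel)
```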
AutoSP Integration
AutoSP (ICLR 2026) is a compiler-based approach that automatically applies sequence parallelism via DeepSpeed Ulysses, removing the need for manual partitioning of sequence dimensions.
- Initial integration: The initial PR ([WIP] Merging AutoSP into DeepSpeed #7860) is ready
- Model coverage: Improve coverage for major model families (e.g., Qwen, Llama)
- Multimodal model support: Multimodal models involve significantly longer sequence lengths, making sequence parallelism critical for training efficiency (blog post). However, existing frameworks such as Megatron-LM do not support sequence parallelism for ViT encoders, and implementing it manually requires substantial engineering effort. AutoSP aims to automate this, enabling DeepSpeed Ulysses-based sequence parallelism for multimodal architectures out of the box (a sketch of the manual wiring appears right after this list).
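For reference, this is the manual DeepSpeed-Ulysses wiring that AutoSP aims to insert automatically. `DistributedAttention` is the existing Ulysses wrapper in `deepspeed.sequence.layer`; the sequence-parallel process group is assumed to be created elsewhere, and plain PyTorch SDPA stands in for whatever attention kernel the model uses.

```python
# Sketch of manual DeepSpeed-Ulysses integration (what AutoSP automates).
import torch
from deepspeed.sequence.layer import DistributedAttention


def local_attention(query, key, value):
    # Per-rank attention on the local sequence shard; any attention kernel works.
    return torch.nn.functional.scaled_dot_product_attention(query, key, value)


def build_ulysses_attention(seq_parallel_group):
    # DistributedAttention wraps local_attention with the all-to-alls that swap
    # sequence and head dimensions, so each rank only holds its slice of a
    # potentially very long (e.g. multimodal) sequence.
    return DistributedAttention(local_attention, seq_parallel_group)
```

AutoSP's compiler pass is meant to perform this wrapping (and the corresponding partitioning of inputs along the sequence dimension) automatically, which is exactly the step that is missing today for ViT encoders in multimodal models.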
Compiler Integration Enhancement (Optional)
- "DTensor mode" for less graph break and stable graph tracing
- DeepCompile enhancement
- Support multi-stage optimization passes for PyTorch v2.9+
- Compiler pass enhancement
- AutoTP support
- AutoEP support
- AMD support
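To make the "DTensor mode" item concrete, here is a minimal sketch using only the public PyTorch DTensor API (`torch.distributed.tensor`, available in recent PyTorch releases; run under torchrun). How DeepSpeed will expose such a mode is an open design question for this work item.

```python
# Minimal DTensor sketch: expressing a sharded parameter as a DTensor so the
# sharding is visible to torch.compile tracing, instead of living in
# Python-side hooks that cause graph breaks.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group()                       # launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Shard the parameter along dim 0 across the mesh; ops on the DTensor carry
# placement metadata that the compiler can trace and optimize.
weight = torch.randn(4096, 4096, device="cuda")
sharded = distribute_tensor(weight, mesh, [Shard(0)])
print(sharded.placements, sharded.to_local().shape)
```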
New Accelerator Support (Q2)
- Planning (scope, target accelerators)
RL-training-specific optimization for DeepSpeed-Inference
- System design, prototyping, and benchmarking
Stability (Q2)
- Performance regression tests
- Enable nightly full tests on:
  - CUDA
  - AMD
  - Intel XPU
  - Intel Gaudi
  - NPU