(Draft) [Roadmap] DeepSpeed Roadmap Q2 2026 #7861

@tohtana

This is a draft roadmap for DeepSpeed Q2 2026. Feedback is welcome — please leave comments on this issue or join the #2026q2-roadmap channel on the DeepSpeed Slack.

New features and enhancements

AutoEP support

AutoEP enables Expert Parallelism (EP) for major Mixture-of-Experts (MoE) models out of the box, eliminating the need for users to write model-specific parallelization code. By automatically distributing expert layers across devices, AutoEP allows users to scale MoE training with minimal configuration changes.

A prototype implementation has been validated on 8x H100 GPUs, achieving a ~5x throughput improvement over ZeRO-3 baselines. We will build on this work to bring AutoEP to production readiness in Q2.
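To make the "minimal configuration changes" goal concrete, here is a small conceptual sketch of the expert placement AutoEP automates: each rank in an expert-parallel group instantiates and executes only the experts it owns. The class and helper names are illustrative only and are not the DeepSpeed AutoEP API.

```python
# Conceptual sketch only: the names below (ExpertParallelMoE, local_expert_ids, ...)
# are illustrative and NOT the DeepSpeed AutoEP API.
import torch
import torch.nn as nn


class ExpertParallelMoE(nn.Module):
    """Holds only the experts assigned to this rank of the expert-parallel group."""

    def __init__(self, hidden: int, num_experts: int, ep_rank: int, ep_size: int):
        super().__init__()
        assert num_experts % ep_size == 0
        experts_per_rank = num_experts // ep_size
        # Global expert ids owned by this rank, e.g. rank 1 of 4 with 8 experts -> [2, 3]
        self.local_expert_ids = list(
            range(ep_rank * experts_per_rank, (ep_rank + 1) * experts_per_rank)
        )
        self.local_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in self.local_expert_ids
        )
        self.router = nn.Linear(hidden, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Top-1 routing for brevity; a real EP implementation would all-to-all tokens
        # to the ranks that own their selected experts and combine the results.
        expert_choice = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for slot, expert_id in enumerate(self.local_expert_ids):
            mask = expert_choice == expert_id
            if mask.any():
                out[mask] = self.local_experts[slot](x[mask])
        return out


# Single-process illustration: rank 1 of an EP group of size 4 owns experts 2 and 3.
moe = ExpertParallelMoE(hidden=64, num_experts=8, ep_rank=1, ep_size=4)
print(moe.local_expert_ids)            # [2, 3]
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```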

  • Convergence validation: Verify that training convergence matches non-EP baselines across the latest MoE model architectures
  • Model coverage: Add support for additional MoE architectures (e.g., Qwen-MoE)
  • ZeRO-3 support: Extend AutoEP to work with ZeRO Stage 3
  • AutoTP integration: Combine AutoEP with AutoTP for hybrid expert/tensor parallelism
  • Benchmarking: Publish throughput, memory, and scaling efficiency numbers across model sizes and GPU counts
  • Universal Checkpoint support: Enable saving and resuming from Universal Checkpoints with AutoEP

AutoTP extension

AutoTP was significantly revamped in Q1 (PR #7806), introducing a flexible, configuration-driven API for custom layer partitioning patterns. In Q2, we will extend this foundation to support a broader range of models and scales.

  • HuggingFace tp_plan support: Leverage the base_model_tp_plan metadata provided by HuggingFace Transformers models to automatically derive partitioning configurations, enabling out-of-the-box TP for any model that ships with a tp_plan (see the sketch after this list)
  • Combination with AutoEP: Support parallel folding for hybrid expert/tensor parallelism
  • Universal Checkpoint support: Enable saving and resuming from Universal Checkpoints with AutoTP
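To illustrate the tp_plan item above: HuggingFace Transformers ships a base_model_tp_plan mapping (module-name patterns to "colwise"/"rowwise" styles) on many model configs, and the sketch below simply reads and summarizes it. How AutoTP will consume this mapping is still open; the autotp_config dictionary at the end is a hypothetical shape, not the implemented configuration format.

```python
# Sketch only: read HuggingFace's base_model_tp_plan and summarize it by sharding
# style. The downstream "autotp_config" hand-off is an assumption, not DeepSpeed's
# actual configuration format.
from collections import defaultdict

from transformers import AutoConfig

# Any Transformers model id works here; this one is just a public example.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")
tp_plan = getattr(config, "base_model_tp_plan", None) or {}

# tp_plan maps module-name patterns to sharding styles, e.g.
#   "layers.*.self_attn.q_proj" -> "colwise"
#   "layers.*.self_attn.o_proj" -> "rowwise"
by_style = defaultdict(list)
for pattern, style in tp_plan.items():
    by_style[style].append(pattern)

for style, patterns in by_style.items():
    print(f"{style}: {patterns}")

# Hypothetical hand-off to a configuration-driven TP API (illustrative only):
autotp_config = {
    "tensor_parallel": {
        "partition_patterns": dict(tp_plan),  # pattern -> "colwise" / "rowwise"
    }
}
```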

AutoSP Integration

AutoSP (ICLR 2026) is a compiler-based approach that automatically applies sequence parallelism via DeepSpeed Ulysses, removing the need for manual partitioning of sequence dimensions.
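For context on what AutoSP removes from user code, here is a conceptual sketch of the Ulysses-style all-to-all that trades a sequence-sharded layout for a head-sharded one before attention. It illustrates the communication pattern only; it is not the DeepSpeed Ulysses implementation or the code AutoSP generates, and it assumes an NCCL setup launched with torchrun where the world size divides the head count.

```python
# Conceptual sketch: Ulysses-style exchange that AutoSP would insert automatically.
# Each rank starts with a shard of the sequence and all heads; after the all-to-all
# it holds the full sequence for a shard of the heads, so standard attention can run
# unchanged. Launch with, e.g.: torchrun --nproc_per_node=4 ulysses_sketch.py
import os

import torch
import torch.distributed as dist


def seq_shard_to_head_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    """[local_seq, num_heads, head_dim] -> [full_seq, num_heads // sp, head_dim]."""
    sp = dist.get_world_size(group)
    local_seq, num_heads, head_dim = x.shape
    assert num_heads % sp == 0, "world size must divide the number of heads"
    # Chunk the heads: chunk i is sent to rank i.
    send = x.reshape(local_seq, sp, num_heads // sp, head_dim).permute(1, 0, 2, 3).contiguous()
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send, group=group)
    # recv[j] is rank j's sequence shard for our head chunk; stitch along the sequence.
    return recv.reshape(sp * local_seq, num_heads // sp, head_dim)


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    q_local = torch.randn(128, 16, 64, device="cuda")  # 128 local tokens, 16 heads
    q_heads = seq_shard_to_head_shard(q_local)          # full sequence, 16 / world_size heads
    print(dist.get_rank(), tuple(q_heads.shape))
    dist.destroy_process_group()
```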

  • Initial integration: The initial PR ([WIP] Merging AutoSP into DeepSpeed #7860) is ready
  • Model coverage: Improve coverage for major model families (e.g., Qwen, Llama)
  • Multimodal model support: Multimodal models involve significantly longer sequence lengths, making sequence parallelism critical for training efficiency (blog post). However, existing frameworks such as Megatron-LM do not support sequence parallelism for ViT encoders, and manually implementing it requires substantial engineering effort. AutoSP aims to automate this, enabling DeepSpeed Ulysses-based sequence parallelism for multimodal architectures out of the box.

Compiler Integration Enhancement (Optional)

  • "DTensor mode" for less graph break and stable graph tracing
  • DeepCompile enhancement
    • Support multi-stage optimization passes for PyTorch v2.9+
    • Compiler pass enhancement
      • AutoTP support
      • AutoEP support
    • AMD support
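As a rough illustration of the "DTensor mode" idea in the list above: representing sharded parameters as DTensors gives torch.compile a single tensor subclass to trace instead of hand-written collective calls, which is where the reduction in graph breaks would come from. The sketch below is a minimal DTensor + torch.compile example assuming a recent PyTorch (where torch.distributed.tensor is public) and a multi-GPU torchrun launch; it is not DeepCompile code.

```python
# Minimal DTensor sketch: a column-sharded weight traced through torch.compile.
# Launch with, e.g.: torchrun --nproc_per_node=2 dtensor_sketch.py
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

torch.manual_seed(0)  # same full tensors on every rank before sharding
weight = torch.randn(1024, 1024, device="cuda")
x = torch.randn(8, 1024, device="cuda")

# Column-shard the weight; replicate the activation.
w_dt = distribute_tensor(weight, mesh, [Shard(1)])
x_dt = distribute_tensor(x, mesh, [Replicate()])


@torch.compile
def proj(inp, w):
    # Ordinary tensor code: DTensor's sharding propagation decides that a replicated
    # input times a column-sharded weight yields a column-sharded output.
    return inp @ w


out = proj(x_dt, w_dt)
print(dist.get_rank(), out.placements, tuple(out.shape))
dist.destroy_process_group()
```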

New Accelerator Support (Q2)

  • Planning (scope, target accelerators)

RL Training-Specific Optimizations for DeepSpeed-Inference

  • System design, prototyping, and benchmarking

Stability (Q2)

  • Performance regression tests
  • Enable nightly full test runs for:
    • CUDA
    • AMD
    • Intel XPU
    • Intel Gaudi
    • NPU
