Conversation
Pull request overview
This PR improves documentation across multiple README files in the examples directory, transforming minimal stubs into comprehensive, well-structured guides that use consistent markdown tables and hierarchical sections.
Key Changes:
- Added comprehensive structure to all README files with tables showing available recipes, tasks, and configurations
- Standardized formatting with markdown tables, proper headers, and bullet points
- Enhanced descriptions across all example categories (Automodel, Megatron, and top-level examples)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Summary per file:

| File | Description |
|---|---|
| examples/README.md | Transformed from a minimal 2-line description into a comprehensive guide with a Quick Start section and a categorized examples table |
| examples/automodel/README.md | Added a new introductory section with a task-based table and removed the checkmark from the "Distributed" feature item |
| examples/megatron/README.md | Expanded from a basic description into a detailed guide with a recipes table covering key scripts and the directory structure |
| examples/megatron/recipes/README.md | Enhanced from a simple description into a structured overview with a recipes comparison table |
| examples/megatron/recipes/wan/README.md | Created as a new comprehensive README with a file listing and a performance testing reference |
| examples/megatron/override_configs/README.md | Expanded from a minimal description into a detailed guide with a usage section and a files table |
* First commit; workable code; workable thd
* Clean up: remove all CP for sbhd; CP is now only for thd
* Run outside of Mbridge
* Update example scripts and add a new data module for multimodal datasets:
  - Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py
  - Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm
  - Created SequentialMegatronSampler for efficient sequential sampling in large datasets
  - Added new files for DiT attention and base data handling
* Refactor attention submodules and reorder file locations; reorganize files; refactor code
* Add README for perf test
* Use VAE, T5, and scheduler from Diffusers; update repo and remove Wan's GitHub modules
* Repeated Ruff, lint, and copyright fixes
* Merge main and address comments
* Remove example_commands.md (Google waits until mid Nov)
* Refactor inference_configs and mock data module; add dit_embeddings.py
* Add 'average_gradients_across_tp_domain' to torch.nn for when running sequence_parallelism
* Add English negative prompt (later updated)
* Update uv.lock for deps: diffusers==0.35.1, easydict, imageio
* Update dfm/src/megatron/data/dit
* Workable sequence packing; refactor with Sajad's PR, moving DiT data to the common dir
* Workable mock data module (no longer needs a path setting); update training algorithm and hyper-parameters to align with Linnan; test training with anime dataset fine-tuning
* Bring wan_task encoder features to common, shared with DiT
* Fix CP error (input of thd_split_inputs_cp should be cu_seqlens_q_padded instead of cu_seqlens_q)
* Update README_perf_test.md; update uv.lock and merge main
* Performance improvements and optimizations for Wan; remove CP disable, as packed sequences are not supported
* Minor fixes; revert video_latent comparison; fix missed check
* H100 mock pretraining perf config; rename config file
* Add GB200 and GB300 perf configs
* Refactor Energon data module to return wrapped dataloaders and add an EnergonDataloader class for cyclic iteration; introduce a WAN pretrain mock data configuration for testing
* Enhance DiffusionTaskEncoder to handle None attributes in stacking and concatenation methods
* Refactor data processing in dit_data_step to simplify batch retrieval and update the WAN pretrain configuration to include train_iters

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
* Add op fusions; update H100 config
* Fix lint; resolve conflict
* Fix for mock dataloader test; fix Dummyiter; fix test
* Make RoPE test GPU-only; RoPE CUDA fix

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: root <root@eos0025.eos.clusters.nvidia.com>
Co-authored-by: root <root@eos0558.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Add docs; update README for Wan; relocate README

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Add DiT README; update DiT README; minor wording update

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Initial commit, workable code; add example; fix lint
* Bring all Wan-related code to DFM; add tests; lint

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Initial README commit
* Update README and add performance summary documentation:
  - Corrected the README link for the performance summary to point to the correct file
  - Introduced a new `performance-summary.md` document detailing performance benchmarks for large language models using DFM, including nomenclature, performance metrics, and system configurations
* Add DiT Megatron links
* Performance docs updates
* Update README to enhance clarity and accuracy:
  - Removed redundant description of the framework
  - Clarified the relationship between Megatron Bridge and Megatron Core in the Dual-Path Architecture section
* Enhance README with detailed performance optimizations and parallelism descriptions:
  - Updated the Megatron Bridge Path section to include 6D parallelism details
  - Added state-of-the-art performance optimizations to the Dual Training Paths section
  - Clarified parallelism terminology in the comparison table
* Update README with fine-tuning command: removed a TODO comment and added a command for fine-tuning a video diffusion model
* Apply suggestions from @akoumpa; fix typo; fix automodel section
* Update DFM-specific README
* Update performance-summary.md (thanks @linnanwang for the bench numbers)
* Update README for Wan: updated command syntax and improved clarity
* Further README.md updates (co-authored with Wenwen Gao)
* Refactor README.md and performance-summary.md for clarity and conciseness:
  - Simplified descriptions of the Megatron Bridge and AutoModel paths
  - Removed outdated comparison table to streamline content
  - Updated performance-summary.md to generalize model references and improve clarity
* Fix typo in README.md: changed "Built" to "Build" in the container section header for consistency
Signed-off-by: sajadn <snorouzi@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: linnan wang <wangnan318@gmail.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: sajadn <snorouzi@nvidia.com>
Co-authored-by: Parth Mannan <pmannan@nvidia.com>
Co-authored-by: linnan wang <wangnan318@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Huy Vu <86480512+huvunvidia@users.noreply.github.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Report for public version; fix image size
* Update report.md for the Wan 2.1 convergence comparison, correcting formatting and ensuring clarity in the experiment overview and in caveats regarding training loss fluctuations between the Diffusers and Megatron-Core implementations

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
- Introduced a new document detailing the comparison between Diffusers (Automodel path) and Megatron-Core (Megatron-Bridge path) for Wan 2.1.
- Included experiment overview, dataset specifications, training setup, and results with visual training curves.
- Added two binary images illustrating loss vs. steps for both the text-to-image and text-to-video stages.

This documentation aims to provide insights into the model's performance and training dynamics during the partial convergence test.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
* EDM and data preprocess tests; minor cleanups for DiT; add DiT unit test
* Add iter to the DiffusionDataModule; add missing copyright
* Use 'no caption' if a caption is not present
* Fix DiT inference bug; add wandb to the inference code
* Update the DiT configs to be aligned with the original paper
* Add wandb[video] and mediapy to uv
* Adjust pos_ids in mock_dataset to have a batch dimension, fuse adaLN layers, use DiTSelfAttention
* Fix the diffusion sample size bug; fix broken tests

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Force-pushed from d28e003 to 4ce8373
jgerh left a comment
Completed a tech pubs review and added a few edits.
> ### 📋 Overview
>
> An open-source implementation of [Diffusion Transformers (DiTs)](https://github.com/facebookresearch/DiT) for training text-to-image/video models with [EDMPipeline](https://arxiv.org/abs/2206.00364). The implementation is based on [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) to bring both scalability and efficiency. Various parallelization techniques such as tensor, sequence, and context parallelism are currently supported.
Perhaps add a link to the full tutorial when the /tutorials directory is available.
> **Full Tutorial**: For detailed configuration options and advanced topics, see [Training from Scratch](../../../tutorials/training-from-scratch.md).
> Once you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.
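As an aside for readers of this thread, here is a minimal sketch (plain PyTorch, not this repository's data pipeline; the sample lengths are made up) of what the packed `thd` layout implies: sample boundaries travel as cumulative sequence lengths instead of padding, which is why `micro_batch_size` stays at 1 and `qkv_format` is set to `thd`.

```python
import torch

# Illustrative only: three variable-length samples packed into one sequence.
sample_lengths = [4000, 2500, 1200]

# Boundaries are carried as cumulative sequence lengths (cu_seqlens) rather than
# padding each sample to a fixed length; Transformer Engine consumes these when
# qkv_format="thd" (total tokens, heads, head dim).
cu_seqlens = torch.cumsum(torch.tensor([0] + sample_lengths), dim=0)
print(cu_seqlens)  # tensor([   0, 4000, 6500, 7700])

# The packed sequence already plays the role of a batch, so micro_batch_size = 1.
```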
> For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details you can look at [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documenation.
Fixed typo: documenation → documentation
> For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details you can look at [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documentation.
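For intuition on those two knobs, a deliberately simplified, hypothetical sketch of the kind of selection step a packing task encoder performs (the real `select_samples_to_pack` in DiffusionTaskEncoderWithSequencePacking differs): a buffer of `packing_buffer_size` samples is grouped greedily so that each pack stays under `task_encoder_seq_length`.

```python
from typing import List

def select_samples_to_pack(sample_lengths: List[int], task_encoder_seq_length: int) -> List[List[int]]:
    """Greedy first-fit grouping: each returned pack's total length fits the budget."""
    packs: List[List[int]] = []
    totals: List[int] = []
    for idx, length in enumerate(sample_lengths):
        for i, total in enumerate(totals):
            if total + length <= task_encoder_seq_length:
                packs[i].append(idx)   # sample fits into an existing pack
                totals[i] += length
                break
        else:
            packs.append([idx])        # open a new pack for this sample
            totals.append(length)
    return packs

# A buffer of packing_buffer_size=5 samples with varying latent sequence lengths.
print(select_samples_to_pack([4000, 9000, 2500, 6000, 1200], task_encoder_seq_length=10240))
# -> [[0, 2, 4], [1], [3]]
```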
> ### ⚡ Parallelism Support
> The table below shows current parallelism support for different model sizes:
Maybe add a note that "Currently, only DiT-XL (700M) has full parallelism support." Verify after the TBDs are addressed.
> - ✅ **Distributed**: FSDP2 + Tensor Parallelism
> - ✅ **Mixed Precision**: BF16 by default
> - ✅ **WandB**: Automatic logging
> - ✅ **Checkpointing**: consolidated, and sharded formats
Remove extra comma
> - ✅ **Checkpointing**: consolidated and sharded formats
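On the checkpointing bullet, a generic PyTorch sketch (not this repository's actual save path) of what the two formats typically mean: a consolidated single-file state dict versus a sharded directory written with `torch.distributed.checkpoint`.

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch import nn

# Single-process setup so the sketch runs end to end.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(16, 16)

# Consolidated format: one file containing the full (unsharded) state dict.
torch.save(model.state_dict(), "model_consolidated.pt")

# Sharded format: each rank writes its shard into a checkpoint directory.
dcp.save({"model": model.state_dict()}, checkpoint_id="model_sharded_ckpt")

dist.destroy_process_group()
```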
README structure