Conversation
Pull request overview
This PR improves documentation across multiple README files in the examples directory, transforming minimal stubs into comprehensive, well-structured guides that use consistent markdown tables and hierarchical sections.
Key Changes:
- Added comprehensive structure to all README files with tables showing available recipes, tasks, and configurations
- Standardized formatting with markdown tables, proper headers, and bullet points
- Enhanced descriptions across all example categories (Automodel, Megatron, and top-level examples)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Summary per file:

| File | Description |
|---|---|
| examples/README.md | Transformed from a minimal 2-line description into a comprehensive guide with a Quick Start section and a categorized examples table |
| examples/automodel/README.md | Added a new introductory section with a task-based table and removed the checkmark from the "Distributed" feature item |
| examples/megatron/README.md | Expanded from a basic description into a detailed guide with a recipes table covering key scripts and the directory structure |
| examples/megatron/recipes/README.md | Enhanced from a simple description into a structured overview with a recipes comparison table |
| examples/megatron/recipes/wan/README.md | Created as a new comprehensive README with a file listing and a performance testing reference |
| examples/megatron/override_configs/README.md | Expanded from a minimal description into a detailed guide with a usage section and a files table |
* First commit; workable code; workable thd
* Clean up: remove all CP for sbhd; CP is now only for thd
* Run outside of Mbridge
* Update example scripts and add a new data module for multimodal datasets:
  - Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py
  - Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm
  - Created SequentialMegatronSampler for efficient sequential sampling in large datasets
  - Added new files for DiT attention and base data handling
* Refactor attention submodules and reorder file locations; reorganize files; refactor code
* Add README for perf test
* Use VAE, T5, and scheduler from Diffusers; update repo and remove Wan's GitHub modules
* Repeated Ruff, lint, and copyright fixes
* Merge main and address comments
* Remove example_commands.md (Google waits until mid Nov)
* Refactor inference_configs and mock data module; add dit_embeddings.py
* Add 'average_gradients_across_tp_domain' to torch.nn for when running sequence_parallelism
* Add English negative prompt (later updated)
* Update uv.lock for deps: diffusers==0.35.1, easydict, imageio
* Update dfm/src/megatron/data/dit
* Workable sequence packing; refactor with Sajad's PR, moving DiT data to the common dir
* Workable mock data module (no longer needs a path setting); update training algorithm and hyper-parameters to align with Linnan; test training with anime dataset fine-tuning
* Bring wan_task encoder features to common, shared with DiT
* Fix CP error (input of thd_split_inputs_cp should be cu_seqlens_q_padded instead of cu_seqlens_q)
* Update README_perf_test.md; update uv.lock and merge main
* Performance improvements and optimizations for Wan; remove CP disable, as packed sequences are not supported
* Minor fixes; revert video_latent comparison; fix missed check
* H100 mock pretraining perf config; rename config file
* Add GB200 and GB300 perf configs
* Refactor Energon data module to return wrapped dataloaders and add an EnergonDataloader class for cyclic iteration; introduce a WAN pretrain mock data configuration for testing
* Enhance DiffusionTaskEncoder to handle None attributes in stacking and concatenation methods
* Refactor data processing in dit_data_step to simplify batch retrieval and update the WAN pretrain configuration to include train_iters

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
* Add op fusions; update H100 config
* Fix lint; resolve conflict
* Fix for mock dataloader test; fix Dummyiter; fix test
* Make RoPE test GPU-only; RoPE CUDA fix

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: root <root@eos0025.eos.clusters.nvidia.com>
Co-authored-by: root <root@eos0558.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Add docs; update README for Wan; relocate README

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Add DiT README; update DiT README; minor wording update

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Initial commit, workable code; add example; fix lint
* Bring all Wan-related code to DFM; add tests; lint

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Initial README commit
* Update README and add performance summary documentation:
  - Corrected the README link for the performance summary to point to the correct file
  - Introduced a new `performance-summary.md` document detailing performance benchmarks for large language models using DFM, including nomenclature, performance metrics, and system configurations
* Add DiT Megatron links
* Performance docs updates
* Update README to enhance clarity and accuracy:
  - Removed redundant description of the framework
  - Clarified the relationship between Megatron Bridge and Megatron Core in the Dual-Path Architecture section
* Enhance README with detailed performance optimizations and parallelism descriptions:
  - Updated the Megatron Bridge Path section to include 6D parallelism details
  - Added state-of-the-art performance optimizations to the Dual Training Paths section
  - Clarified parallelism terminology in the comparison table
* Update README with fine-tuning command: removed a TODO comment and added a command for fine-tuning a video diffusion model
* Apply suggestions from @akoumpa; fix typo; fix automodel section
* Update DFM-specific README
* Update performance-summary.md (thanks @linnanwang for the bench numbers)
* Update README for Wan: updated command syntax and improved clarity
* Further README.md updates (co-authored with Wenwen Gao)
* Refactor README.md and performance-summary.md for clarity and conciseness:
  - Simplified descriptions of the Megatron Bridge and AutoModel paths
  - Removed outdated comparison table to streamline content
  - Updated performance-summary.md to generalize model references and improve clarity
* Fix typo in README.md: changed "Built" to "Build" in the container section header for consistency
Signed-off-by: sajadn <snorouzi@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: linnan wang <wangnan318@gmail.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: sajadn <snorouzi@nvidia.com>
Co-authored-by: Parth Mannan <pmannan@nvidia.com>
Co-authored-by: linnan wang <wangnan318@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Huy Vu <86480512+huvunvidia@users.noreply.github.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Report for public version; fix image size
* Update report.md for the Wan 2.1 convergence comparison, correcting formatting and ensuring clarity in the experiment overview and in caveats regarding training loss fluctuations between the Diffusers and Megatron-Core implementations

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
- Introduced a new document detailing the comparison between Diffusers (Automodel path) and Megatron-Core (Megatron-Bridge path) for Wan 2.1.
- Included experiment overview, dataset specifications, training setup, and results with visual training curves.
- Added two binary images illustrating loss vs. steps for both the text-to-image and text-to-video stages.

This documentation aims to provide insights into the model's performance and training dynamics during the partial convergence test.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
* EDM and data preprocess tests; minor cleanups for DiT; add DiT unit test
* Add iter to the DiffusionDataModule; add missing copyright
* Use 'no caption' if a caption is not present
* Fix DiT inference bug; add wandb to the inference code
* Update the DiT configs to be aligned with the original paper
* Add wandb[video] and mediapy to uv
* Adjust pos_ids in mock_dataset to have a batch dimension, fuse adaLN layers, use DiTSelfAttention
* Fix the diffusion sample size bug; fix broken tests

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Force-pushed from d28e003 to 4ce8373
jgerh left a comment
Completed a tech pubs review and added a few edits.
> ### 📋 Overview
>
> An open-source implementation of [Diffusion Transformers (DiTs)](https://github.com/facebookresearch/DiT) for training text-to-image/video models with [EDMPipeline](https://arxiv.org/abs/2206.00364). The implementation is based on [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) to bring both scalability and efficiency. Various parallelization techniques such as tensor, sequence, and context parallelism are currently supported.
Perhaps add a link to the full tutorial when the /tutorials directory is available.
> **Full Tutorial**: For detailed configuration options and advanced topics, see [Training from Scratch](../../../tutorials/training-from-scratch.md).
> Once you have the dataset and container ready, you can start training the DiT model on your own dataset. This repository leverages [sequence packing](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/features/optimizations/sequence_packing.html) to maximize training efficiency. Sequence packing stacks multiple samples into a single sequence instead of padding individual samples to a fixed length; therefore, `micro_batch_size` must be set to 1. Additionally, `qkv_format` should be set to `thd` to signal to Transformer Engine that sequence packing is enabled.
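As an aside for readers of this thread, here is a minimal sketch (plain PyTorch, not this repository's data pipeline; the sample lengths are made up) of what the packed `thd` layout implies: sample boundaries travel as cumulative sequence lengths instead of padding, which is why `micro_batch_size` stays at 1 and `qkv_format` is set to `thd`.

```python
import torch

# Illustrative only: three variable-length samples packed into one sequence.
sample_lengths = [4000, 2500, 1200]

# Boundaries are carried as cumulative sequence lengths (cu_seqlens) rather than
# padding each sample to a fixed length; Transformer Engine consumes these when
# qkv_format="thd" (total tokens, heads, head dim).
cu_seqlens = torch.cumsum(torch.tensor([0] + sample_lengths), dim=0)
print(cu_seqlens)  # tensor([   0, 4000, 6500, 7700])

# The packed sequence already plays the role of a batch, so micro_batch_size = 1.
```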
> For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details you can look at [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documenation.
Fixed typo: documenation → documentation
> For data loading, Energon provides two key hyperparameters related to sequence packing: `task_encoder_seq_length` and `packing_buffer_size`. The `task_encoder_seq_length` parameter controls the maximum sequence length passed to the model, while `packing_buffer_size` determines the number of samples processed to create different buckets. You can look at `select_samples_to_pack` and `pack_selected_samples` methods of [DiffusionTaskEncoderWithSequencePacking](https://github.com/NVIDIA-NeMo/DFM/blob/main/dfm/src/megatron/data/common/diffusion_task_encoder_with_sp.py#L50) to get a better sense of these parameters. For further details you can look at [Energon packing](https://nvidia.github.io/Megatron-Energon/advanced/packing.html) documentation.
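For intuition on those two knobs, a deliberately simplified, hypothetical sketch of the kind of selection step a packing task encoder performs (the real `select_samples_to_pack` in DiffusionTaskEncoderWithSequencePacking differs): a buffer of `packing_buffer_size` samples is grouped greedily so that each pack stays under `task_encoder_seq_length`.

```python
from typing import List

def select_samples_to_pack(sample_lengths: List[int], task_encoder_seq_length: int) -> List[List[int]]:
    """Greedy first-fit grouping: each returned pack's total length fits the budget."""
    packs: List[List[int]] = []
    totals: List[int] = []
    for idx, length in enumerate(sample_lengths):
        for i, total in enumerate(totals):
            if total + length <= task_encoder_seq_length:
                packs[i].append(idx)   # sample fits into an existing pack
                totals[i] += length
                break
        else:
            packs.append([idx])        # open a new pack for this sample
            totals.append(length)
    return packs

# A buffer of packing_buffer_size=5 samples with varying latent sequence lengths.
print(select_samples_to_pack([4000, 9000, 2500, 6000, 1200], task_encoder_seq_length=10240))
# -> [[0, 2, 4], [1], [3]]
```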
> ### ⚡ Parallelism Support
> The table below shows current parallelism support for different model sizes:
Maybe add a note that "Currently, only DiT-XL (700M) has full parallelism support." Verify after the TBDs are addressed.
> - ✅ **Distributed**: FSDP2 + Tensor Parallelism
> - ✅ **Mixed Precision**: BF16 by default
> - ✅ **WandB**: Automatic logging
> - ✅ **Checkpointing**: consolidated, and sharded formats
Remove extra comma
> - ✅ **Checkpointing**: consolidated and sharded formats
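On the checkpointing bullet, a generic PyTorch sketch (not this repository's actual save path) of what the two formats typically mean: a consolidated single-file state dict versus a sharded directory written with `torch.distributed.checkpoint`.

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch import nn

# Single-process setup so the sketch runs end to end.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(16, 16)

# Consolidated format: one file containing the full (unsharded) state dict.
torch.save(model.state_dict(), "model_consolidated.pt")

# Sharded format: each rank writes its shard into a checkpoint directory.
dcp.save({"model": model.state_dict()}, checkpoint_id="model_sharded_ckpt")

dist.destroy_process_group()
```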
README structure