* fix cpu init during export Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * export env fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * delete_extra_state for TE related during checkpoint loading for export Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * paths fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add override_provider option for checkpoint loading Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add unit test for override_provider option Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * remove debug lines Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * lint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * unit test fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
* chore: Add issue template for model requests Signed-off-by: oliver könig <okoenig@nvidia.com> * copying over remaining templates Signed-off-by: oliver könig <okoenig@nvidia.com> --------- Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Skip if `docs-only` label is attached Signed-off-by: oliver könig <okoenig@nvidia.com> * test Signed-off-by: oliver könig <okoenig@nvidia.com> * test Signed-off-by: oliver könig <okoenig@nvidia.com> * test Signed-off-by: oliver könig <okoenig@nvidia.com> * update Signed-off-by: oliver könig <okoenig@nvidia.com> --------- Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup process group at end of performance script Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * Update scripts/performance/run_script.py Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com> * destroy pg for other scripts Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * update Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
* ci(fix): pre-flight Signed-off-by: oliver könig <okoenig@nvidia.com> * test Signed-off-by: oliver könig <okoenig@nvidia.com> * test Signed-off-by: oliver könig <okoenig@nvidia.com> * final Signed-off-by: oliver könig <okoenig@nvidia.com> --------- Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* initial gemma commit Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * gemma provider Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * patch tests Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * add gemma bridge + tests Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * fix conftest Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * reenable msc Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * fix gemma test fallback Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * try simpler tokenizer Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * upload assets Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * use pre-downloaded config for model provider test Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * lint Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * address feedback -s Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * rebase Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * rebase Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * use mcore activations Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * update test Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * fix mock Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * fix conversion script reference Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * subclass Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * update tests Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] packed sequences Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * [docs] packed sequences Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * address feedback Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* gemma2 provider and bridge Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * gemma2 model provider + bridge Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] placeholder page for performance summary Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * add sections for releases Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * improve description Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
… compatibility (NVIDIA-NeMo#829) * save latest_checkpointed_iteration for compatibility Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * fix megatron fsdp test assertion Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* exit profiler context Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> * disable vocab size logging in flops calculation Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> --------- Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* Clear disk space before install check Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Clear disk space before install check" This reverts commit 2c085f5. Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Run bare metal install on self-hosted runners Signed-off-by: Charlie Truong <chtruong@nvidia.com> --------- Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…A-NeMo#607) * update llama and qwen models to use auto bridge and update recipes test as well Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * temporary remove llama4 as it's not fully tested or verified. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Revert "temporary remove llama4 as it's not fully tested or verified." This reverts commit 5217084. * temp save Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * temp save Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Revert "temp save" This reverts commit 0c57e2b. * Revert "temp save" This reverts commit 0748d52. * update qwen's recipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update llama recipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * remove some old recipe files Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update recipe files to match old recipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update recipe file Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update qwen recipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update llama recipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update src/megatron/bridge/recipes/qwen/qwen3.py Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> * Update src/megatron/bridge/recipes/qwen/qwen3.py Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> * Update src/megatron/bridge/recipes/qwen/qwen3.py Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> * Update src/megatron/bridge/recipes/llama/llama2.py Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> * Update src/megatron/bridge/recipes/llama/llama2.py Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> * recipe naming update 
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update test Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * lint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add TypedDict for args Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * lint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update docstring Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * unit test fix and license fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * sync eval_interval and save_interval Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * set TRANSFORMERS_OFFLINE=1 in action.yml Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix llama3 8b hf model path Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * replay lr decay iters update on updated recipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update action.yml Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> * add comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add guard / mock for the places needs to download hf config in unit test Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * lint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add qwen functional test Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update recipe tests Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * lint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated the `DITForwardStep` class to use the `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to use `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added a `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2, and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules for consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
Pull request overview
This pull request introduces comprehensive support for VACE finetuning, including tools for dataset preparation, preprocessing pipelines, and training infrastructure for T2V (Text-to-Video), I2V (Image-to-Video), and V2V (Video-to-Video) tasks. The implementation extends the existing WAN model architecture with VACE-specific layers and flow-matching training pipelines.
Key Changes
- Added VACE model architecture with context and base layers for video editing tasks
- Implemented flow matching training pipeline with configurable timestep sampling strategies
- Created preprocessing utilities for video/image/mask data and segmentation mask generation
- Added Gemma model family support (Gemma 1.0 and Gemma 2.0) with proper embedding scaling
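The flow-matching bullet above can be made concrete with a small sketch. This is a hedged illustration under common rectified-flow assumptions, not the repository's implementation; the function and strategy names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_timesteps(batch, strategy="uniform"):
    """Configurable timestep sampling strategies (names are illustrative)."""
    if strategy == "uniform":
        return rng.uniform(0.0, 1.0, size=batch)
    if strategy == "logit_normal":
        # Sigmoid of a Gaussian: biases sampling toward mid-range timesteps.
        return 1.0 / (1.0 + np.exp(-rng.normal(0.0, 1.0, size=batch)))
    raise ValueError(f"unknown strategy: {strategy}")

def flow_matching_target(x0, x1, t):
    """Linear path x_t between data x0 and noise x1, plus its velocity target."""
    t = t.reshape(-1, 1)              # broadcast t over the feature dimension
    x_t = (1.0 - t) * x0 + t * x1     # point on the straight interpolation path
    v_target = x1 - x0                # constant velocity the model regresses
    return x_t, v_target

x0 = rng.normal(size=(4, 8))          # toy "data" latents
x1 = rng.normal(size=(4, 8))          # noise
t = sample_timesteps(4, "logit_normal")
x_t, v = flow_matching_target(x0, x1, t)
# Training would minimize mean((model(x_t, t) - v) ** 2).
```

The velocity parameterization shown here is the standard rectified-flow choice; the actual pipeline may use different sigma/shift schedules.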
Reviewed changes
Copilot reviewed 108 out of 229 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/megatron/bridge/models/wan/wan_layer_spec.py | Defines WAN transformer layer specifications including VACE-specific base and context layers with adaptive layer normalization |
| src/megatron/bridge/models/wan/wan_bridge.py | Implements parameter mapping bridges between HuggingFace and Megatron formats for WAN and VACE models |
| src/megatron/bridge/models/wan/utils/utils.py | Provides utility functions for grid size calculation, patching/unpatching, and context parallelism operations |
| src/megatron/bridge/models/wan/utils/preprocessor.py | Implements video and image preprocessing classes with resizing, cropping, and normalization capabilities |
| src/megatron/bridge/models/wan/rope_utils.py | Implements 3D RoPE (Rotary Position Embeddings) for spatial-temporal attention in video models |
| src/megatron/bridge/models/wan/modules/vae.py | Defines VAE encoder/decoder architecture with causal 3D convolutions for video latent encoding |
| src/megatron/bridge/models/wan/modules/tokenizers.py | Provides HuggingFace tokenizer wrapper with text cleaning utilities |
| src/megatron/bridge/models/wan/modules/t5.py | Implements T5 encoder/decoder models with custom layer normalization and attention mechanisms |
| src/megatron/bridge/models/wan/flow_matching/time_shift_utils.py | Implements timestep sampling strategies and sigma computation for flow matching training |
| src/megatron/bridge/models/wan/flow_matching/flow_pipeline.py | Defines training pipeline for flow matching with support for both WAN and VACE models |
| src/megatron/bridge/models/wan/flow_matching/flow_inference_pipeline.py | Implements inference pipeline with DPM/UniPC solvers and pipeline parallelism support |
| src/megatron/bridge/models/wan/inference/configs/*.py | Configuration files for different WAN model variants (T2V, I2V, VACE) with size-specific settings |
| src/megatron/bridge/models/gemma/*.py | Adds complete Gemma model family support with proper embedding scaling and configuration mappings |
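The utils row above mentions grid-size calculation and patching/unpatching without showing them. Here is a minimal numpy sketch of what such helpers typically do, assuming non-overlapping (t, h, w) patches; the actual names and signatures in `wan/utils/utils.py` may differ:

```python
import numpy as np

def get_grid_size(shape, patch):
    """Number of patches along (T, H, W), assuming exact divisibility."""
    return tuple(s // p for s, p in zip(shape, patch))

def patchify(x, patch):
    """(C, T, H, W) -> (num_patches, C * pt * ph * pw) token matrix."""
    c, t, h, w = x.shape
    pt, ph, pw = patch
    gt, gh, gw = get_grid_size((t, h, w), patch)
    x = x.reshape(c, gt, pt, gh, ph, gw, pw)
    x = x.transpose(1, 3, 5, 0, 2, 4, 6)        # (gt, gh, gw, c, pt, ph, pw)
    return x.reshape(gt * gh * gw, c * pt * ph * pw)

def unpatchify(tokens, shape, patch):
    """Exact inverse of patchify, restoring (C, T, H, W)."""
    c = shape[0]
    pt, ph, pw = patch
    gt, gh, gw = get_grid_size(shape[1:], patch)
    x = tokens.reshape(gt, gh, gw, c, pt, ph, pw)
    x = x.transpose(3, 0, 4, 1, 5, 2, 6)        # back to (c, gt, pt, gh, ph, gw, pw)
    return x.reshape(c, gt * pt, gh * ph, gw * pw)
```

A roundtrip `unpatchify(patchify(x, p), x.shape, p)` returns `x` unchanged, which is the property such utilities must preserve for the diffusion transformer.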
Flagged lines:

    query = query.contiguous()  # important becuase TE attention expects contiguous tensors
    key = key.contiguous()  # important becuase TE attention expects contiguous tensors

Copilot: Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change:

    query = query.contiguous()  # important because TE attention expects contiguous tensors
    key = key.contiguous()  # important because TE attention expects contiguous tensors
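The comments being fixed above exist because Transformer Engine attention kernels assume row-major (contiguous) memory, while transposes produce strided views. A numpy illustration of the underlying layout issue (torch's `.contiguous()` plays the role of `np.ascontiguousarray` here):

```python
import numpy as np

x = np.zeros((2, 3, 4), dtype=np.float32)
y = x.transpose(1, 0, 2)        # a view with permuted strides, no data copy

print(x.flags["C_CONTIGUOUS"])  # True
print(y.flags["C_CONTIGUOUS"])  # False: strides are no longer row-major

# Kernels that assume row-major memory need an explicit copy into
# contiguous layout before they can read the buffer correctly:
y_contig = np.ascontiguousarray(y)
print(y_contig.flags["C_CONTIGUOUS"])  # True
```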
Flagged lines:

    class CausalConv3d(nn.Conv3d):
        """
        Causal 3d convolusion.

Copilot: Corrected spelling of 'convolusion' to 'convolution' in docstring.

Suggested change:

        Causal 3d convolution.
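For context on the class being reviewed: a causal 3-D convolution pads only on the "past" side of the temporal axis, so frame t never depends on later frames. A minimal 1-D numpy analog of that padding rule (illustrative only; the real `CausalConv3d` subclasses torch's `nn.Conv3d`):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: output[t] depends only on x[:t + 1]."""
    k = len(kernel)
    # Pad (k - 1) zeros on the past side only, none on the future side.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[t:t + k], kernel) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_conv1d(x, np.array([0.5, 0.5]))  # average of previous + current frame
# -> [0.5, 1.5, 2.5, 3.5]
```

The key property is causality: modifying a future input sample leaves all earlier outputs unchanged, which is what makes autoregressive video latent encoding possible.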
Flagged lines:

    Input_frames (`list[Tensor]`):
        Input frames for content generation
    Input_masks (`list[Tensor]`):
        Input masks for content generation
    Input_ref_images (`list[Tensor]`):

Copilot: Parameter names should follow snake_case convention. These should be input_frames, input_masks, and input_ref_images instead of the capitalized versions.

Suggested change:

    input_frames (`list[Tensor]`):
        Input frames for content generation
    input_masks (`list[Tensor]`):
        Input masks for content generation
    input_ref_images (`list[Tensor]`):
Flagged lines:

    """Configuration for a 2B parameter Code Gemma model.

    Extends GemmaModelProvider with specific settings for code generation.
    Thism model has an identical configuration to GemmaModelProvider2B.

Copilot: Corrected spelling of 'Thism' to 'This' in docstring.

Suggested change:

    This model has an identical configuration to GemmaModelProvider2B.
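The Gemma entries in this review repeatedly mention "proper embedding scaling": Gemma models multiply token embeddings by sqrt(hidden_size) before the first transformer block. A toy numpy sketch of that detail (sizes and names here are illustrative, not the provider's actual fields):

```python
import numpy as np

hidden_size = 16
vocab_size = 32
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, hidden_size)).astype(np.float32)

def embed(token_ids):
    """Gemma-style lookup: scale embeddings by sqrt(hidden_size)."""
    return embedding_table[token_ids] * np.sqrt(hidden_size)

h = embed(np.array([3, 7, 7]))  # shape (3, hidden_size), scaled by 4.0 here
```

Omitting this scale factor is a classic source of silent divergence when converting Gemma checkpoints between HuggingFace and Megatron formats, which is presumably why the bridge handles it explicitly.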
Code for:
Refer to annotators/Inpainting/run_batch_process.sh for segmentation.
Refer to example_commands.sh for commands to process datasets and launch training.