
Vace finetuning #3

Open
Tatiana21 wants to merge 54 commits into huvunvidia:main from NeverMore960114:vace_ft

Conversation

@Tatiana21

Code for:

  1. Creating segmentation masks for inpainting tasks on a video dataset, based on the open-sora-plan dataset format
  2. Preprocessing and creating an energon dataset for finetuning with T2V, I2V, and V2V tasks
  3. Finetuning VACE for T2V, I2V, and V2V tasks

Refer to annotators/Inpainting/run_batch_process.sh for segmentation.
Refer to example_commands.sh for commands to process datasets and launch training.
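To make the inpainting setup concrete, here is a minimal, self-contained sketch of what a segmentation mask does in this pipeline: a binary mask marks the region to regenerate, and masked pixels are dropped from the conditioning input. This is illustrative only (the function name and array layout are assumptions, not the repository's actual preprocessing code).

```python
import numpy as np

def apply_inpainting_mask(frames: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out masked regions of a video clip.

    frames: (T, H, W, C) uint8 video frames
    mask:   (T, H, W) binary mask, 1 = region to inpaint
    """
    # Masked pixels are removed from the input; the model learns to fill them in.
    return frames * (1 - mask[..., None])

frames = np.full((4, 8, 8, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 8, 8), dtype=np.uint8)
mask[:, 2:6, 2:6] = 1  # square region to inpaint in every frame

masked = apply_inpainting_mask(frames, mask)
```

The actual mask generation for batches of videos is driven by annotators/Inpainting/run_batch_process.sh as noted above.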

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* export env fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* paths fixes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove debug lines

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* unit test fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
* chore: Add issue template for model requests

Signed-off-by: oliver könig <okoenig@nvidia.com>

* copying over remaining templates

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
* ci(fix): pre-flight

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* gemma provider

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* patch tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix conftest

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* reenable msc

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* upload assets

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address feedback -s

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use mcore activations

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix mock

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* subclass

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address feedback

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* add sections for releases

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* improve description

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* exit profiler context

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* Clear disk space before install check

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* temp save

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update llama recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove some old recipe files

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe file

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update qwen recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update llama recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* recipe naming update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add TypedDict for args

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update action.yml

Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* add comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add guard / mock for the places needs to download hf config in unit test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add qwen functional test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe tests

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
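The commit above switches `DITForwardStep` to a `__call__`-based forward step. A minimal sketch of that pattern is shown below, with a hypothetical class and loss function (the real Megatron-Bridge signature may differ): a callable class lets the training loop invoke the step like a plain function while the object still carries configuration and state.

```python
class ForwardStep:
    """Hypothetical sketch of a callable forward-step object."""

    def __init__(self, loss_fn):
        self.loss_fn = loss_fn  # configuration carried across steps

    def __call__(self, model, batch):
        # Invoked like a function by the training loop: step(model, batch)
        output = model(batch["input"])
        return self.loss_fn(output, batch["target"])

step = ForwardStep(loss_fn=lambda out, tgt: sum((o - t) ** 2 for o, t in zip(out, tgt)))
model = lambda xs: [v * 2 for v in xs]  # stand-in model for illustration
loss = step(model, {"input": [1.0, 2.0], "target": [2.0, 4.0]})
```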
abhinavg4 and others added 24 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
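The `DiffusionDataModuleConfig` mentioned above centralizes dataset settings in one typed object. The sketch below shows the general shape of such a config as a dataclass; the field names here are illustrative assumptions, not the actual API.

```python
from dataclasses import dataclass

@dataclass
class DiffusionDataConfig:
    """Hypothetical dataset config in the spirit of DiffusionDataModuleConfig."""

    dataset_path: str            # location of the prepared energon dataset
    seq_length: int = 1024       # token/latent sequence length per sample
    micro_batch_size: int = 1
    num_workers: int = 2

# Overrides are explicit and type-checked at construction time.
cfg = DiffusionDataConfig(dataset_path="/data/energon", seq_length=2048)
```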
Copilot AI review requested due to automatic review settings December 10, 2025 23:56
Copilot AI left a comment

Pull request overview

This pull request introduces comprehensive support for Vace (Video Auto-Context Encoding) finetuning capabilities, including tools for dataset preparation, preprocessing pipelines, and training infrastructure for T2V (Text-to-Video), I2V (Image-to-Video), and V2V (Video-to-Video) tasks. The implementation extends the existing WAN (Wide Attention Network) model architecture with VACE-specific layers and flow-matching training pipelines.

Key Changes

  • Added VACE model architecture with context and base layers for video editing tasks
  • Implemented flow matching training pipeline with configurable timestep sampling strategies
  • Created preprocessing utilities for video/image/mask data and segmentation mask generation
  • Added Gemma model family support (Gemma 1.0 and Gemma 2.0) with proper embedding scaling
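The flow-matching training pipeline mentioned above is commonly formulated in rectified-flow style: interpolate between clean data and noise at a sampled timestep, and train the model to predict the velocity. The sketch below illustrates that formulation under stated assumptions; the PR's exact pipeline may use a different parameterization.

```python
import numpy as np

def flow_matching_target(x0: np.ndarray, noise: np.ndarray, t: float):
    """Rectified-flow style interpolation (a common choice, assumed here)."""
    x_t = (1.0 - t) * x0 + t * noise   # noisy sample at timestep t
    v = noise - x0                     # velocity target, constant in t
    return x_t, v

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4))
noise = rng.standard_normal((2, 4))
x_t, v = flow_matching_target(x0, noise, 0.5)
```

The model's loss is then a regression of its output against `v` at `x_t`, with the configurable timestep-sampling strategies deciding how `t` is drawn.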

Reviewed changes

Copilot reviewed 108 out of 229 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/megatron/bridge/models/wan/wan_layer_spec.py — Defines WAN transformer layer specifications, including VACE-specific base and context layers with adaptive layer normalization
  • src/megatron/bridge/models/wan/wan_bridge.py — Implements parameter-mapping bridges between HuggingFace and Megatron formats for WAN and VACE models
  • src/megatron/bridge/models/wan/utils/utils.py — Provides utility functions for grid-size calculation, patching/unpatching, and context-parallelism operations
  • src/megatron/bridge/models/wan/utils/preprocessor.py — Implements video and image preprocessing classes with resizing, cropping, and normalization capabilities
  • src/megatron/bridge/models/wan/rope_utils.py — Implements 3D RoPE (Rotary Position Embeddings) for spatial-temporal attention in video models
  • src/megatron/bridge/models/wan/modules/vae.py — Defines a VAE encoder/decoder architecture with causal 3D convolutions for video latent encoding
  • src/megatron/bridge/models/wan/modules/tokenizers.py — Provides a HuggingFace tokenizer wrapper with text-cleaning utilities
  • src/megatron/bridge/models/wan/modules/t5.py — Implements T5 encoder/decoder models with custom layer normalization and attention mechanisms
  • src/megatron/bridge/models/wan/flow_matching/time_shift_utils.py — Implements timestep-sampling strategies and sigma computation for flow-matching training
  • src/megatron/bridge/models/wan/flow_matching/flow_pipeline.py — Defines the training pipeline for flow matching, supporting both WAN and VACE models
  • src/megatron/bridge/models/wan/flow_matching/flow_inference_pipeline.py — Implements the inference pipeline with DPM/UniPC solvers and pipeline-parallelism support
  • src/megatron/bridge/models/wan/inference/configs/*.py — Configuration files for the WAN model variants (T2V, I2V, VACE) with size-specific settings
  • src/megatron/bridge/models/gemma/*.py — Adds complete Gemma model family support with proper embedding scaling and configuration mappings

Comment on lines +232 to +233
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
Copilot AI Dec 10, 2025

Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors

Comment on lines +359 to +360
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
Copilot AI Dec 10, 2025

Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors


class CausalConv3d(nn.Conv3d):
"""
Causal 3d convolusion.
Copilot AI Dec 10, 2025

Corrected spelling of 'convolusion' to 'convolution' in docstring.

Suggested change
Causal 3d convolusion.
Causal 3d convolution.

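For context on the `CausalConv3d` being discussed: a 3D convolution becomes temporally causal by padding only on the past side of the time axis, so frame t never attends to frame t+1. The helper below sketches that padding rule; it is illustrative, and the repository's `CausalConv3d` may implement this differently.

```python
def causal_time_padding(kernel_t: int, dilation: int = 1) -> tuple:
    """Padding (front, back) on the time axis for a causal 3D convolution."""
    front = dilation * (kernel_t - 1)  # all padding goes before the clip
    return (front, 0)                  # no padding after: no future leakage

pad = causal_time_padding(kernel_t=3)
```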
Comment on lines +1014 to +1018
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):
Copilot AI Dec 10, 2025

Parameter names should follow snake_case convention. These should be input_frames, input_masks, and input_ref_images instead of capitalized versions.

Suggested change
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):
input_frames (`list[Tensor]`):
Input frames for content generation
input_masks (`list[Tensor]`):
Input masks for content generation
input_ref_images (`list[Tensor]`):

"""Configuration for a 2B parameter Code Gemma model.

Extends GemmaModelProvider with specific settings for code generation.
Thism model has an identical configuration to GemmaModelProvider2B.
Copilot AI Dec 10, 2025

Corrected spelling of 'Thism' to 'This' in docstring.

Suggested change
Thism model has an identical configuration to GemmaModelProvider2B.
This model has an identical configuration to GemmaModelProvider2B.

7 participants