
Vace finetuning #3

Open
Tatiana21 wants to merge 54 commits into huvunvidia:main from NeverMore960114:vace_ft

Conversation

@Tatiana21

Code for:

  1. Creating segmentation masks for inpainting tasks on a video dataset, based on the open-sora-plan dataset format
  2. Preprocessing and creating an energon dataset for finetuning with T2V, I2V, and V2V tasks
  3. Finetuning VACE for T2V, I2V, and V2V tasks

Refer to annotators/Inpainting/run_batch_process.sh for segmentation.
Refer to example_commands.sh for commands to process datasets and launch training.
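To make the inpainting setup concrete, here is a minimal, self-contained sketch of what a segmentation mask does in this pipeline: a binary mask marks the region to regenerate, and masked pixels are dropped from the conditioning input. This is illustrative only (the function name and array layout are assumptions, not the repository's actual preprocessing code).

```python
import numpy as np

def apply_inpainting_mask(frames: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out masked regions of a video clip.

    frames: (T, H, W, C) uint8 video frames
    mask:   (T, H, W) binary mask, 1 = region to inpaint
    """
    # Masked pixels are removed from the input; the model learns to fill them in.
    return frames * (1 - mask[..., None])

frames = np.full((4, 8, 8, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 8, 8), dtype=np.uint8)
mask[:, 2:6, 2:6] = 1  # square region to inpaint in every frame

masked = apply_inpainting_mask(frames, mask)
```

The actual mask generation for batches of videos is driven by annotators/Inpainting/run_batch_process.sh as noted above.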

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* export env fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* paths fixes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove debug lines

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* unit test fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
* chore: Add issue template for model requests

Signed-off-by: oliver könig <okoenig@nvidia.com>

* copying over remaining templates

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
* ci(fix): pre-flight

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* gemma provider

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* patch tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix conftest

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* reenable msc

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* upload assets

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address feedback -s

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use mcore activations

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix mock

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* subclass

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address feedback

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* add sections for releases

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* improve description

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* exit profiler context

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* Clear disk space before install check

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* temp save

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update llama recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove some old recipe files

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe file

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update qwen recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update llama recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* recipe naming update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add TypedDict for args

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update action.yml

Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* add comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add guard / mock for the places needs to download hf config in unit test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add qwen functional test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe tests

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
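The commit above switches `DITForwardStep` to a `__call__`-based forward step. A minimal sketch of that pattern is shown below, with a hypothetical class and loss function (the real Megatron-Bridge signature may differ): a callable class lets the training loop invoke the step like a plain function while the object still carries configuration and state.

```python
class ForwardStep:
    """Hypothetical sketch of a callable forward-step object."""

    def __init__(self, loss_fn):
        self.loss_fn = loss_fn  # configuration carried across steps

    def __call__(self, model, batch):
        # Invoked like a function by the training loop: step(model, batch)
        output = model(batch["input"])
        return self.loss_fn(output, batch["target"])

step = ForwardStep(loss_fn=lambda out, tgt: sum((o - t) ** 2 for o, t in zip(out, tgt)))
model = lambda xs: [v * 2 for v in xs]  # stand-in model for illustration
loss = step(model, {"input": [1.0, 2.0], "target": [2.0, 4.0]})
```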
abhinavg4 and others added 24 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
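The `DiffusionDataModuleConfig` mentioned above centralizes dataset settings in one typed object. The sketch below shows the general shape of such a config as a dataclass; the field names here are illustrative assumptions, not the actual API.

```python
from dataclasses import dataclass

@dataclass
class DiffusionDataConfig:
    """Hypothetical dataset config in the spirit of DiffusionDataModuleConfig."""

    dataset_path: str            # location of the prepared energon dataset
    seq_length: int = 1024       # token/latent sequence length per sample
    micro_batch_size: int = 1
    num_workers: int = 2

# Overrides are explicit and type-checked at construction time.
cfg = DiffusionDataConfig(dataset_path="/data/energon", seq_length=2048)
```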
Copilot AI review requested due to automatic review settings December 10, 2025 23:56
Copilot AI left a comment

Pull request overview

This pull request introduces comprehensive support for Vace (Video Auto-Context Encoding) finetuning capabilities, including tools for dataset preparation, preprocessing pipelines, and training infrastructure for T2V (Text-to-Video), I2V (Image-to-Video), and V2V (Video-to-Video) tasks. The implementation extends the existing WAN (Wide Attention Network) model architecture with VACE-specific layers and flow-matching training pipelines.

Key Changes

  • Added VACE model architecture with context and base layers for video editing tasks
  • Implemented flow matching training pipeline with configurable timestep sampling strategies
  • Created preprocessing utilities for video/image/mask data and segmentation mask generation
  • Added Gemma model family support (Gemma 1.0 and Gemma 2.0) with proper embedding scaling
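The flow-matching training pipeline mentioned above is commonly formulated in rectified-flow style: interpolate between clean data and noise at a sampled timestep, and train the model to predict the velocity. The sketch below illustrates that formulation under stated assumptions; the PR's exact pipeline may use a different parameterization.

```python
import numpy as np

def flow_matching_target(x0: np.ndarray, noise: np.ndarray, t: float):
    """Rectified-flow style interpolation (a common choice, assumed here)."""
    x_t = (1.0 - t) * x0 + t * noise   # noisy sample at timestep t
    v = noise - x0                     # velocity target, constant in t
    return x_t, v

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4))
noise = rng.standard_normal((2, 4))
x_t, v = flow_matching_target(x0, noise, 0.5)
```

The model's loss is then a regression of its output against `v` at `x_t`, with the configurable timestep-sampling strategies deciding how `t` is drawn.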

Reviewed changes

Copilot reviewed 108 out of 229 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/megatron/bridge/models/wan/wan_layer_spec.py — Defines WAN transformer layer specifications, including VACE-specific base and context layers with adaptive layer normalization
  • src/megatron/bridge/models/wan/wan_bridge.py — Implements parameter-mapping bridges between HuggingFace and Megatron formats for WAN and VACE models
  • src/megatron/bridge/models/wan/utils/utils.py — Provides utility functions for grid-size calculation, patching/unpatching, and context-parallelism operations
  • src/megatron/bridge/models/wan/utils/preprocessor.py — Implements video and image preprocessing classes with resizing, cropping, and normalization capabilities
  • src/megatron/bridge/models/wan/rope_utils.py — Implements 3D RoPE (Rotary Position Embeddings) for spatial-temporal attention in video models
  • src/megatron/bridge/models/wan/modules/vae.py — Defines a VAE encoder/decoder architecture with causal 3D convolutions for video latent encoding
  • src/megatron/bridge/models/wan/modules/tokenizers.py — Provides a HuggingFace tokenizer wrapper with text-cleaning utilities
  • src/megatron/bridge/models/wan/modules/t5.py — Implements T5 encoder/decoder models with custom layer normalization and attention mechanisms
  • src/megatron/bridge/models/wan/flow_matching/time_shift_utils.py — Implements timestep-sampling strategies and sigma computation for flow-matching training
  • src/megatron/bridge/models/wan/flow_matching/flow_pipeline.py — Defines the training pipeline for flow matching, supporting both WAN and VACE models
  • src/megatron/bridge/models/wan/flow_matching/flow_inference_pipeline.py — Implements the inference pipeline with DPM/UniPC solvers and pipeline-parallelism support
  • src/megatron/bridge/models/wan/inference/configs/*.py — Configuration files for the WAN model variants (T2V, I2V, VACE) with size-specific settings
  • src/megatron/bridge/models/gemma/*.py — Adds complete Gemma model family support with proper embedding scaling and configuration mappings

Comment on lines +232 to +233
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
Copilot AI Dec 10, 2025

Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors

Comment on lines +359 to +360
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
Copilot AI Dec 10, 2025

Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors


class CausalConv3d(nn.Conv3d):
"""
Causal 3d convolusion.
Copilot AI Dec 10, 2025

Corrected spelling of 'convolusion' to 'convolution' in docstring.

Suggested change
Causal 3d convolusion.
Causal 3d convolution.

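For context on the `CausalConv3d` being discussed: a 3D convolution becomes temporally causal by padding only on the past side of the time axis, so frame t never attends to frame t+1. The helper below sketches that padding rule; it is illustrative, and the repository's `CausalConv3d` may implement this differently.

```python
def causal_time_padding(kernel_t: int, dilation: int = 1) -> tuple:
    """Padding (front, back) on the time axis for a causal 3D convolution."""
    front = dilation * (kernel_t - 1)  # all padding goes before the clip
    return (front, 0)                  # no padding after: no future leakage

pad = causal_time_padding(kernel_t=3)
```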
Comment on lines +1014 to +1018
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):
Copilot AI Dec 10, 2025

Parameter names should follow snake_case convention. These should be input_frames, input_masks, and input_ref_images instead of capitalized versions.

Suggested change
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):
input_frames (`list[Tensor]`):
Input frames for content generation
input_masks (`list[Tensor]`):
Input masks for content generation
input_ref_images (`list[Tensor]`):

"""Configuration for a 2B parameter Code Gemma model.

Extends GemmaModelProvider with specific settings for code generation.
Thism model has an identical configuration to GemmaModelProvider2B.
Copilot AI Dec 10, 2025

Corrected spelling of 'Thism' to 'This' in docstring.

Suggested change
Thism model has an identical configuration to GemmaModelProvider2B.
This model has an identical configuration to GemmaModelProvider2B.

7 participants