@magicwang1111

This PR introduces three key improvements:

  1. Dependency Standardization

    • Added a requirements.txt file based on the original uv.lock to ensure reproducible installations across environments.
    • Locks versions for diffusers, accelerate, transformers, CUDA toolkits, and other core libraries.
  2. Support for Custom Master Port

    • Added a new --main_process_port option in scripts/train_distributed.py to allow explicit control over the master port used by Accelerate’s distributed launcher.
    • Prevents port conflicts when launching multiple distributed training jobs on the same host (a sketch of how the option can be forwarded to Accelerate follows this list).

    python scripts/train_distributed.py configs/your_config.yaml \
        --num_processes 2 \
        --main_process_port 29600

  3. Fix for HF DDP Compatibility

    • In src/ltxv_trainer/trainer.py, unwrap the base model from DistributedDataParallel before calling the gradient-checkpointing API (see the unwrap sketch after this list).
    • Prevents the runtime AttributeError: 'DistributedDataParallel' object has no attribute 'enable_gradient_checkpointing'.
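For item 2, the sketch below shows one way a --main_process_port option can be wired through to Accelerate's launcher. It is an illustrative sketch, not the actual contents of scripts/train_distributed.py: the inner entry point scripts/train.py and the surrounding argument handling are assumptions; only the --num_processes and --main_process_port flags of accelerate launch are real.

    import argparse
    import subprocess
    import sys


    def main() -> None:
        # Hypothetical wrapper: parse the config path plus launcher options and
        # forward them to `accelerate launch`, which accepts --main_process_port.
        parser = argparse.ArgumentParser()
        parser.add_argument("config", help="Path to the training config YAML")
        parser.add_argument("--num_processes", type=int, default=1)
        parser.add_argument(
            "--main_process_port",
            type=int,
            default=29500,  # Accelerate's default master port
            help="Master port for the distributed launcher; use a unique port per job",
        )
        args = parser.parse_args()

        cmd = [
            "accelerate", "launch",
            "--num_processes", str(args.num_processes),
            "--main_process_port", str(args.main_process_port),
            "scripts/train.py",  # hypothetical single-process training entry point
            args.config,
        ]
        sys.exit(subprocess.call(cmd))


    if __name__ == "__main__":
        main()

Launching two jobs with different --main_process_port values keeps them from both binding to Accelerate's default port 29500.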
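For item 3, here is a minimal sketch of the unwrap pattern, assuming a diffusers-style model that exposes enable_gradient_checkpointing(); the helper names are illustrative and not taken from trainer.py.

    import torch
    from torch.nn.parallel import DistributedDataParallel


    def unwrap_model(model: torch.nn.Module) -> torch.nn.Module:
        # DDP wraps the original module and does not forward model-specific
        # methods, so reach through to .module when the wrapper is present.
        return model.module if isinstance(model, DistributedDataParallel) else model


    def enable_checkpointing(model: torch.nn.Module) -> None:
        # Call the gradient-checkpointing API on the unwrapped base model;
        # calling it on the DDP wrapper raises the AttributeError quoted above.
        unwrap_model(model).enable_gradient_checkpointing()

When the model has been prepared with Accelerate, accelerator.unwrap_model(model) serves the same purpose.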

@magicwang1111 requested a review from matanby as a code owner May 23, 2025 10:10
