@magicwang1111

This PR introduces three key improvements:

  1. Dependency Standardization

    • Added a requirements.txt file based on the original uv.lock to ensure reproducible installations across environments.
    • Locks versions for diffusers, accelerate, transformers, CUDA toolkits, and other core libraries.
  2. Support for Custom Master Port

    • Added a new --main_process_port option in scripts/train_distributed.py to allow explicit control over the master port used by Accelerate’s distributed launcher.
    • Prevents port conflicts when launching multiple distributed training jobs on the same host (a sketch of how the option can be forwarded to Accelerate follows this list).

    python scripts/train_distributed.py configs/your_config.yaml \
        --num_processes 2 \
        --main_process_port 29600

  3. Fix for HF DDP Compatibility

    • In src/ltxv_trainer/trainer.py, unwrap the base model from DistributedDataParallel before calling the gradient-checkpointing API (see the unwrap sketch after this list).
    • Prevents the runtime AttributeError: 'DistributedDataParallel' object has no attribute 'enable_gradient_checkpointing'.
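For item 2, the sketch below shows one way a --main_process_port option can be wired through to Accelerate's launcher. It is an illustrative sketch, not the actual contents of scripts/train_distributed.py: the inner entry point scripts/train.py and the surrounding argument handling are assumptions; only the --num_processes and --main_process_port flags of accelerate launch are real.

    import argparse
    import subprocess
    import sys


    def main() -> None:
        # Hypothetical wrapper: parse the config path plus launcher options and
        # forward them to `accelerate launch`, which accepts --main_process_port.
        parser = argparse.ArgumentParser()
        parser.add_argument("config", help="Path to the training config YAML")
        parser.add_argument("--num_processes", type=int, default=1)
        parser.add_argument(
            "--main_process_port",
            type=int,
            default=29500,  # Accelerate's default master port
            help="Master port for the distributed launcher; use a unique port per job",
        )
        args = parser.parse_args()

        cmd = [
            "accelerate", "launch",
            "--num_processes", str(args.num_processes),
            "--main_process_port", str(args.main_process_port),
            "scripts/train.py",  # hypothetical single-process training entry point
            args.config,
        ]
        sys.exit(subprocess.call(cmd))


    if __name__ == "__main__":
        main()

Launching two jobs with different --main_process_port values keeps them from both binding to Accelerate's default port 29500.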
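For item 3, here is a minimal sketch of the unwrap pattern, assuming a diffusers-style model that exposes enable_gradient_checkpointing(); the helper names are illustrative and not taken from trainer.py.

    import torch
    from torch.nn.parallel import DistributedDataParallel


    def unwrap_model(model: torch.nn.Module) -> torch.nn.Module:
        # DDP wraps the original module and does not forward model-specific
        # methods, so reach through to .module when the wrapper is present.
        return model.module if isinstance(model, DistributedDataParallel) else model


    def enable_checkpointing(model: torch.nn.Module) -> None:
        # Call the gradient-checkpointing API on the unwrapped base model;
        # calling it on the DDP wrapper raises the AttributeError quoted above.
        unwrap_model(model).enable_gradient_checkpointing()

When the model has been prepared with Accelerate, accelerator.unwrap_model(model) serves the same purpose.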

@magicwang1111 requested a review from matanby as a code owner May 23, 2025 10:10
