A high-performance distributed training framework for large language models, built on top of PyTorch and heavily inspired by torchtitan.
- μP (muP) Implementation: Principled parameter scaling that allows training hyperparameters to transfer between different model sizes
- Distributed Training: Efficient multi-GPU and multi-node training capabilities
- Custom Job Management: Advanced job scheduling and management system
- Experimental Dataloader: Modified version of IBM's dataloader contribution with on-the-fly tokenization
- SLURM Integration: Optimized for supercomputer environments (LUMI)
```bash
# Clone the repository
git clone https://github.com/rlrs/maester.git
cd maester

# Install dependencies
uv sync
```
- Python >= 3.10
- PyTorch >= 2.5, ideally nightly
- The framework is set up for training on the LUMI supercomputer, which uses SLURM and AMD GPUs. Most features are not specific to this setup, though.
The framework uses a Pydantic-based configuration system that allows for:
- Configuration via JSON files, environment variables, and command-line arguments
- Strict type checking and validation
- Nested configurations for model, training, and infrastructure settings
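For illustration, a minimal sketch of what such a layered Pydantic configuration can look like is shown below; the actual fields and defaults live in `maester/config.py` and will differ.

```python
# Hypothetical sketch of a nested Pydantic config; the real schema is in
# maester/config.py and uses different fields and defaults.
import json
from pydantic import BaseModel

class ModelConfig(BaseModel):       # assumed nested "model" section
    dim: int = 4096
    n_layers: int = 32

class TrainingConfig(BaseModel):    # assumed nested "training" section
    lr: float = 3e-4
    batch_size: int = 8

class Config(BaseModel):
    job_name: str = "example"
    model: ModelConfig = ModelConfig()
    training: TrainingConfig = TrainingConfig()

# Load a file, apply an explicit override, and let Pydantic validate the result.
with open("config.json") as f:
    cfg = Config.model_validate({**json.load(f), "job_name": "my-run"})
print(cfg.model_dump_json(indent=2))  # the kind of snapshot stored per job
```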
The framework uses a structured approach to job management, centered around self-contained job directories and Pydantic-based configuration:
- Base configuration defined in `maester/config.py` using Pydantic
- Configurations can be provided via YAML files, environment variables, or command-line arguments
- All job-specific configurations are automatically serialized to JSON in the job directory
Each job gets its own directory under `jobs/`, containing:
- `config.json`: Complete configuration snapshot for reproducibility
- `slurm.sh`: Generated SLURM script from the template
- `logs/`: Directory for SLURM output and error logs
- `checkpoints/`: Training checkpoints and model states
The job submission script (`submit.py`):
- Takes a Pydantic configuration from `maester/config.py`
- Creates a job directory with all necessary files
- Generates a SLURM script from `templates/slurm.sh`
- Jobs can be resubmitted directly with `sbatch jobs/<name>/slurm.sh`
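In rough terms, and purely as an illustration rather than the actual `submit.py` logic, the flow looks something like this:

```python
# Hypothetical sketch of job creation; the real implementation is submit.py.
from pathlib import Path

def create_job(cfg) -> Path:
    job_dir = Path("jobs") / cfg.job_name          # cfg.job_name is assumed
    (job_dir / "logs").mkdir(parents=True, exist_ok=True)
    (job_dir / "checkpoints").mkdir(exist_ok=True)

    # Snapshot the full configuration for reproducibility.
    (job_dir / "config.json").write_text(cfg.model_dump_json(indent=2))

    # A slurm.sh is rendered from templates/slurm.sh into job_dir (see the
    # SLURM integration sketch below), so the job can later be resubmitted
    # with `sbatch jobs/<name>/slurm.sh`.
    return job_dir
```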
The parameter sweep system (`sweep.py`):
- Extends the base configuration system from `maester/config.py`
- Allows parameter modifications through `sweep_config.py`
- Creates separate job directories for each parameter combination
- Provides tools for:
- Sweep submission and monitoring
- Result analysis and visualization
- Job management (cancel, retry, etc.)
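The exact sweep format is whatever `sweep_config.py` defines; as a generic illustration, expanding a parameter grid into one job per combination can be as simple as:

```python
# Generic grid expansion, not the actual sweep_config.py format.
from itertools import product

sweep = {"training.lr": [1e-4, 3e-4, 1e-3], "model.dim": [512, 1024]}  # assumed keys

keys, values = zip(*sweep.items())
for combo in product(*values):
    overrides = dict(zip(keys, combo))
    # Each combination would get its own job directory built from the base
    # config plus these overrides, e.g. jobs/sweep_lr0.0003_dim1024/.
    print(overrides)
```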
SLURM integration provides:
- Template-based SLURM script generation (`templates/slurm.sh`)
- Container-specific setup in `scripts/slurm/`
- Automatic handling of:
- Resource allocation
- Environment setup
- Log management
- Container binding and configuration
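As a sketch of what template-based generation can look like (the placeholder names here are assumptions, not necessarily those used by the real `templates/slurm.sh`), plain string substitution is enough:

```python
# Minimal sketch of rendering a SLURM script from a template; placeholder
# names (job_name, nodes, job_dir) are assumptions.
from pathlib import Path
from string import Template

template = Template(Path("templates/slurm.sh").read_text())
script = template.substitute(job_name="my_job", nodes=4, job_dir="jobs/my_job")
Path("jobs/my_job/slurm.sh").write_text(script)
```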
- Submit a single job:

  ```bash
  # Submit with config file
  python submit.py config.yaml

  # Resubmit existing job
  sbatch jobs/my_job/slurm.sh
  ```

- Run a parameter sweep:

  ```bash
  # Define sweep parameters in sweep_config.py
  python sweep.py submit sweep_config.py

  # Monitor sweep status
  python sweep.py status sweeps/my_sweep
  ```
Use the job submission system:
```bash
# Submit a single training job
python submit.py config.yaml

# Or run a parameter sweep
python sweep.py submit sweep_config.py
```
For development or non-SLURM environments, you can run the training script directly:
```bash
torchrun --nproc_per_node=8 train.py
```
Convert and upload checkpoints to Hugging Face:
```bash
python scripts/convert_dcp_to_hf.py \
    jobs/mistral-7b/checkpoints/ \
    ../output-dir/hf/ \
    --upload org-name/model-name \
    --name step-400 \
    --base base-model-name
```
The framework implements μP (muP) parametrization for principled hyperparameter transfer between models of different scales. Here are two validation experiments:
A basic validation of the μP implementation showing expected behavior across different model scales.
Demonstration of successful learning rate transfer between models of different sizes, a key benefit of μP parametrization.
The plots themselves can be reproduced using the scripts in the `plots/` directory.
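For intuition, the core of the μP recipe with Adam is that learning rates and multipliers are rescaled with model width, so a rate tuned on a small model stays near-optimal on a large one. The sketch below captures the spirit of Microsoft's muP scaling rules, not this repository's exact implementation.

```python
# Rough illustration of muP-style scaling with Adam (spirit only, not this
# repo's exact code): matrix-like (hidden) weights get lr proportional to
# 1/width, vector-like params (biases, embeddings) keep the base lr, and the
# output logits are scaled down as width grows.
base_width, width = 256, 4096
base_lr = 3e-4

hidden_lr = base_lr * base_width / width   # shrinks as the model widens
vector_lr = base_lr                        # transfers unchanged
output_multiplier = base_width / width     # applied to the logit projection

print(hidden_lr, vector_lr, output_multiplier)
```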
- `maester/`: Core library code
  - `datasets/`: Dataset implementations, including the experimental dataloader
  - `parallelisms/`: Distributed training implementations
- `scripts/`: Utility scripts for training, conversion, etc.
- `jobs/`: Job management and configuration
- `tests/`: Test suite (WIP)
This project builds upon several open-source projects:
- pytorch/torchtitan: Many core features are based on torchtitan.
- IBM's experimental dataloader: A distributed dataloader contribution. This framework uses a modified, on-the-fly tokenization pipeline that reads raw texts from Parquet files (see the sketch below).
- μP (muP): Implementation inspired by Microsoft's muP framework for principled hyperparameter transfer.
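For context on the on-the-fly tokenization mentioned above, a heavily simplified sketch of streaming raw text from Parquet and tokenizing it lazily might look like the following; the actual dataloader in `maester/datasets/` is considerably more involved (sharding, distribution across ranks, and so on), and the tokenizer here is just a placeholder.

```python
# Simplified illustration of on-the-fly tokenization from Parquet; not the
# framework's actual dataloader.
import pyarrow.parquet as pq
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def iter_token_ids(path: str, text_column: str = "text"):
    # Stream batches so the raw text never has to fit in memory all at once.
    for batch in pq.ParquetFile(path).iter_batches(columns=[text_column]):
        for text in batch.column(0).to_pylist():
            yield tokenizer.encode(text)
```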
See the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.