A high-performance distributed training framework for large language models, built on top of PyTorch and heavily inspired by torchtitan.
- μP (muP) Implementation: Principled parameter scaling that allows training hyperparameters to transfer between different model sizes
- Distributed Training: Efficient multi-GPU and multi-node training capabilities
- Custom Job Management: Advanced job scheduling and management system
- Experimental Dataloader: Modified version of IBM's dataloader contribution with on-the-fly tokenization
- SLURM Integration: Optimized for supercomputer environments (LUMI)
```bash
# Clone the repository
git clone https://github.com/rlrs/maester.git
cd maester

# Install dependencies
uv sync
```
- Python >= 3.10
- PyTorch >= 2.5, ideally nightly
- The framework is set up for training on the LUMI supercomputer, which uses SLURM and AMD GPUs. Most features are not specific to this setup, though.
The framework uses a Pydantic-based configuration system that allows for:
- Configuration via JSON files, environment variables, and command-line arguments
- Strict type checking and validation
- Nested configurations for model, training, and infrastructure settings
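For illustration, a minimal sketch of what such a layered Pydantic configuration can look like is shown below; the actual fields and defaults live in `maester/config.py` and will differ.

```python
# Hypothetical sketch of a nested Pydantic config; the real schema is in
# maester/config.py and uses different fields and defaults.
import json
from pydantic import BaseModel

class ModelConfig(BaseModel):       # assumed nested "model" section
    dim: int = 4096
    n_layers: int = 32

class TrainingConfig(BaseModel):    # assumed nested "training" section
    lr: float = 3e-4
    batch_size: int = 8

class Config(BaseModel):
    job_name: str = "example"
    model: ModelConfig = ModelConfig()
    training: TrainingConfig = TrainingConfig()

# Load a file, apply an explicit override, and let Pydantic validate the result.
with open("config.json") as f:
    cfg = Config.model_validate({**json.load(f), "job_name": "my-run"})
print(cfg.model_dump_json(indent=2))  # the kind of snapshot stored per job
```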
The framework uses a structured approach to job management, centered around self-contained job directories and Pydantic-based configuration:
- Base configuration defined in `maester/config.py` using Pydantic
- Configurations can be provided via YAML files, environment variables, or command-line arguments
- All job-specific configurations are automatically serialized to JSON in the job directory
Each job gets its own directory under `jobs/`, containing:
- `config.json`: Complete configuration snapshot for reproducibility
- `slurm.sh`: Generated SLURM script from the template
- `logs/`: Directory for SLURM output and error logs
- `checkpoints/`: Training checkpoints and model states
The job submission script (`submit.py`):
- Takes a Pydantic configuration from `maester/config.py`
- Creates a job directory with all necessary files
- Generates a SLURM script from `templates/slurm.sh`
- Jobs can be resubmitted directly with `sbatch jobs/<name>/slurm.sh`
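In rough terms, and purely as an illustration rather than the actual `submit.py` logic, the flow looks something like this:

```python
# Hypothetical sketch of job creation; the real implementation is submit.py.
from pathlib import Path

def create_job(cfg) -> Path:
    job_dir = Path("jobs") / cfg.job_name          # cfg.job_name is assumed
    (job_dir / "logs").mkdir(parents=True, exist_ok=True)
    (job_dir / "checkpoints").mkdir(exist_ok=True)

    # Snapshot the full configuration for reproducibility.
    (job_dir / "config.json").write_text(cfg.model_dump_json(indent=2))

    # A slurm.sh is rendered from templates/slurm.sh into job_dir (see the
    # SLURM integration sketch below), so the job can later be resubmitted
    # with `sbatch jobs/<name>/slurm.sh`.
    return job_dir
```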
The parameter sweep system (`sweep.py`):
- Extends the base configuration system from `maester/config.py`
- Allows parameter modifications through `sweep_config.py`
- Creates separate job directories for each parameter combination
- Provides tools for:
- Sweep submission and monitoring
- Result analysis and visualization
- Job management (cancel, retry, etc.)
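The exact sweep format is whatever `sweep_config.py` defines; as a generic illustration, expanding a parameter grid into one job per combination can be as simple as:

```python
# Generic grid expansion, not the actual sweep_config.py format.
from itertools import product

sweep = {"training.lr": [1e-4, 3e-4, 1e-3], "model.dim": [512, 1024]}  # assumed keys

keys, values = zip(*sweep.items())
for combo in product(*values):
    overrides = dict(zip(keys, combo))
    # Each combination would get its own job directory built from the base
    # config plus these overrides, e.g. jobs/sweep_lr0.0003_dim1024/.
    print(overrides)
```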
SLURM integration provides:
- Template-based SLURM script generation (`templates/slurm.sh`)
- Container-specific setup in `scripts/slurm/`
- Automatic handling of:
- Resource allocation
- Environment setup
- Log management
- Container binding and configuration
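As a sketch of what template-based generation can look like (the placeholder names here are assumptions, not necessarily those used by the real `templates/slurm.sh`), plain string substitution is enough:

```python
# Minimal sketch of rendering a SLURM script from a template; placeholder
# names (job_name, nodes, job_dir) are assumptions.
from pathlib import Path
from string import Template

template = Template(Path("templates/slurm.sh").read_text())
script = template.substitute(job_name="my_job", nodes=4, job_dir="jobs/my_job")
Path("jobs/my_job/slurm.sh").write_text(script)
```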
- Submit a single job:

  ```bash
  # Submit with config file
  python submit.py config.yaml

  # Resubmit existing job
  sbatch jobs/my_job/slurm.sh
  ```

- Run a parameter sweep:

  ```bash
  # Define sweep parameters in sweep_config.py
  python sweep.py submit sweep_config.py

  # Monitor sweep status
  python sweep.py status sweeps/my_sweep
  ```
Use the job submission system:
```bash
# Submit a single training job
python submit.py config.yaml

# Or run a parameter sweep
python sweep.py submit sweep_config.py
```
For development or non-SLURM environments, you can run the training script directly:
```bash
torchrun --nproc_per_node=8 train.py
```
Convert and upload checkpoints to Hugging Face:
```bash
python scripts/convert_dcp_to_hf.py \
    jobs/mistral-7b/checkpoints/ \
    ../output-dir/hf/ \
    --upload org-name/model-name \
    --name step-400 \
    --base base-model-name
```
The framework implements μP (muP) parametrization for principled hyperparameter transfer between models of different scales. Here are two validation experiments:
A basic validation of the μP implementation showing expected behavior across different model scales.
Demonstration of successful learning rate transfer between models of different sizes, a key benefit of μP parametrization.
The plots themselves can be reproduced using the scripts in the `plots/` directory.
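For intuition, the core of the μP recipe with Adam is that learning rates and multipliers are rescaled with model width, so a rate tuned on a small model stays near-optimal on a large one. The sketch below captures the spirit of Microsoft's muP scaling rules, not this repository's exact implementation.

```python
# Rough illustration of muP-style scaling with Adam (spirit only, not this
# repo's exact code): matrix-like (hidden) weights get lr proportional to
# 1/width, vector-like params (biases, embeddings) keep the base lr, and the
# output logits are scaled down as width grows.
base_width, width = 256, 4096
base_lr = 3e-4

hidden_lr = base_lr * base_width / width   # shrinks as the model widens
vector_lr = base_lr                        # transfers unchanged
output_multiplier = base_width / width     # applied to the logit projection

print(hidden_lr, vector_lr, output_multiplier)
```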
- `maester/`: Core library code
  - `datasets/`: Dataset implementations, including the experimental dataloader
  - `parallelisms/`: Distributed training implementations
- `scripts/`: Utility scripts for training, conversion, etc.
- `jobs/`: Job management and configuration
- `tests/`: Test suite (WIP)
This project builds upon several open-source projects:
- pytorch/torchtitan: Many core features are based on torchtitan.
- IBM's experimental dataloader: A distributed dataloader contribution. This framework uses a modified, on-the-fly tokenization pipeline that reads raw texts from Parquet files (see the sketch below).
- μP (muP): Implementation inspired by Microsoft's muP framework for principled hyperparameter transfer.
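For context on the on-the-fly tokenization mentioned above, a heavily simplified sketch of streaming raw text from Parquet and tokenizing it lazily might look like the following; the actual dataloader in `maester/datasets/` is considerably more involved (sharding, distribution across ranks, and so on), and the tokenizer here is just a placeholder.

```python
# Simplified illustration of on-the-fly tokenization from Parquet; not the
# framework's actual dataloader.
import pyarrow.parquet as pq
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def iter_token_ids(path: str, text_column: str = "text"):
    # Stream batches so the raw text never has to fit in memory all at once.
    for batch in pq.ParquetFile(path).iter_batches(columns=[text_column]):
        for text in batch.column(0).to_pylist():
            yield tokenizer.encode(text)
```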
See the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.