Step Law

Predictable Scale: Part I

Home Page   |   Wandb   |   Paper

News

🗓 Coming Soon

  • Paper
  • Smooth loss heatmaps
  • Thousands of training logs
  • Fitting code
  • Checkpoints

Introduction

We present Step Law, the first unified optimal hyperparameter scaling law, which generalizes across diverse model shapes, architectures, and data distributions.

Our findings demonstrate remarkable accuracy: on test sets, the hyperparameters estimated by Step Law yield performance that deviates by only 0.09% from the globally optimal LLM performance found through exhaustive search.

This research entailed a significant computational investment: nearly one million NVIDIA H800 GPU hours were used to train 3,700 LLMs of varying sizes and hyperparameters from scratch, consuming approximately 100 trillion tokens in total. To support reproducibility and advance research on LLM pre-training, we will progressively release all loss measurements and model checkpoints through this repository. We also provide the community with a universal, plug-and-play tool for predicting optimal hyperparameters.

Usage

This repository provides tools and data for predicting the optimal learning rate and batch size for LLM pretraining:

Data Files

The Data folder contains:

  • Smooth loss results for both dense and MoE models (two CSV files)
  • Structure and training configurations for each model
  • data/1004_fitted_lr_bs_scaling_model_parameters.csv: Contains fitted model parameters from 1000 bootstrap models for robust prediction of the optimal learning rate and batch size. The fitted model follows the form below (see the Python sketch after this list):
    • lr = exp(intercept) * N^coefN * D^coefD
    • bs = exp(intercept) * D^coefD
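
A minimal Python sketch of how these fitted coefficients translate into concrete predictions is shown below. The function names and the coefficient values are hypothetical placeholders; the real intercepts and exponents come from the bootstrap CSV above, and one natural use of its 1000 rows is to aggregate (e.g., take the median of) the per-row predictions for a more robust estimate.

import math

def predict_lr(intercept, coef_n, coef_d, n_params, n_tokens):
    # Optimal learning rate: lr = exp(intercept) * N^coefN * D^coefD
    return math.exp(intercept) * (n_params ** coef_n) * (n_tokens ** coef_d)

def predict_bs(intercept, coef_d, n_tokens):
    # Optimal batch size: bs = exp(intercept) * D^coefD
    return math.exp(intercept) * (n_tokens ** coef_d)

# Hypothetical coefficients -- replace with the fitted values from the CSV.
print(predict_lr(intercept=-10.0, coef_n=-0.2, coef_d=0.1, n_params=7e9, n_tokens=1.4e12))
print(predict_bs(intercept=-5.0, coef_d=0.4, n_tokens=1.4e12))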

Prediction Tool

We provide a simple command-line tool that predicts the optimal learning rate and batch size from your model size, data budget, and sequence length:

python code/fit_tool.py pred-opt-lr-bs [model_params] [data_in_token] [seq_len]

Parameters:

  • model_params: Number of model parameters
  • data_in_token: Training data size in tokens
  • seq_len: Sequence length

Example:

python code/fit_tool.py pred-opt-lr-bs 7e9 1.4e12 2048
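
If you would rather call the tool from another Python script than from the shell, the short sketch below wraps the same command with the standard subprocess module; it assumes it is run from the repository root and simply captures whatever the script prints.

import subprocess

# Same example as above: 7B parameters, 1.4T training tokens, sequence length 2048.
result = subprocess.run(
    ["python", "code/fit_tool.py", "pred-opt-lr-bs", "7e9", "1.4e12", "2048"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)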

Log Analysis Tool

We also provide a tool for analyzing training logs and extracting smooth loss measurements:

python code/log_analysis.py quick-check [base_dir] [dir_pattern] [--target-iter] [--max-cnt] [--pretty]

Parameters:

  • base_dir: Base directory containing training logs
  • dir_pattern: Regex pattern to match experiment directories
  • --target-iter: Target iteration to analyze (optional)
  • --max-cnt: Maximum number of log entries to process (default: 32768)
  • --pretty: Print results in a pretty table format (optional)

Example:

python code/log_analysis.py quick-check ./logs "exp_.*" --pretty
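
To check several iterations in one pass, the same command can be looped over from Python; the iteration numbers below are arbitrary examples, not values from our runs.

import subprocess

# Hypothetical target iterations -- adjust to your own training schedule.
for target_iter in (1000, 5000, 10000):
    subprocess.run(
        ["python", "code/log_analysis.py", "quick-check", "./logs", "exp_.*",
         "--target-iter", str(target_iter), "--pretty"],
        check=True,
    )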

For more training details and experimental results, please refer to our Wandb page.

Citation

If you find our work helpful, please consider citing it :-)

@misc{li2025predictablescalei,
      title={Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining}, 
      author={Houyi Li and Wenzheng Zheng and Jingcheng Hu and Qiufeng Wang and Hanshan Zhang and Zili Wang and Yangshijie Xu and Shuigeng Zhou and Xiangyu Zhang and Daxin Jiang},
      year={2025},
      eprint={2503.04715},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.04715}, 
}
