NanoFormer is a lightweight transformer model implementation designed for efficient training and inference. It features grouped query attention (GQA) and various architectural optimizations.
- Configurable transformer architecture with GQA support (see the attention sketch after this list)
- Dynamic batch size handling with efficient padding
- Mixed precision training (bfloat16)
- Gradient checkpointing for memory efficiency
- Gradient accumulation support (a training-step sketch combining bfloat16 and accumulation follows the list)
- Wandb integration for experiment tracking
- Automatic model checkpointing
- Custom training loop with validation
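The sketch below illustrates the core idea behind grouped query attention: a small number of key/value heads is shared across a larger number of query heads, which shrinks the KV cache and the attention parameter count. It is a minimal, self-contained illustration, not NanoFormer's actual implementation; the head counts, weight shapes, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, w_q, w_k, w_v, w_o, num_q_heads=8, num_kv_heads=2):
    """Toy GQA: num_q_heads query heads share num_kv_heads key/value heads."""
    B, T, D = x.shape
    head_dim = D // num_q_heads

    q = (x @ w_q).view(B, T, num_q_heads, head_dim).transpose(1, 2)   # (B, Hq,  T, hd)
    k = (x @ w_k).view(B, T, num_kv_heads, head_dim).transpose(1, 2)  # (B, Hkv, T, hd)
    v = (x @ w_v).view(B, T, num_kv_heads, head_dim).transpose(1, 2)

    # Each group of query heads reuses one shared K/V head: repeat K/V to match Hq.
    group_size = num_q_heads // num_kv_heads
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)     # (B, Hq, T, hd)
    return out.transpose(1, 2).reshape(B, T, D) @ w_o

# Usage with placeholder weights:
B, T, D = 2, 16, 256
head_dim, num_kv_heads = D // 8, 2
x = torch.randn(B, T, D)
w_q, w_o = torch.randn(D, D), torch.randn(D, D)
w_k = torch.randn(D, num_kv_heads * head_dim)
w_v = torch.randn(D, num_kv_heads * head_dim)
y = grouped_query_attention(x, w_q, w_k, w_v, w_o)   # -> (2, 16, 256)
```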
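The next sketch shows how bfloat16 mixed precision and gradient accumulation typically fit together in a training step: forward and backward passes run under autocast, each micro-batch loss is divided by the accumulation count, and the optimizer steps only every `accum_steps` micro-batches. This is an illustrative pattern rather than the repo's actual training loop; `model` is assumed to return logits, and `loader`, `optimizer`, and the accumulation count of 16 are placeholders.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, loader, accum_steps=16, device="cuda"):
    """One epoch with bfloat16 autocast and gradient accumulation (illustrative only)."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (input_ids, labels) in enumerate(loader):
        input_ids, labels = input_ids.to(device), labels.to(device)
        # bfloat16 autocast; unlike fp16, no GradScaler is needed.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(input_ids)        # assumed to return (B, T, vocab) logits
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        # Scale so the accumulated gradient matches one large-batch step.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()                 # update only every accum_steps micro-batches
            optimizer.zero_grad(set_to_none=True)
```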
```bash
git clone https://github.com/yourusername/nanoformer.git
cd nanoformer
```
To train the model with default parameters:
```bash
python train.py \
    --dataset "imdatta0/wikipedia_en_sample" \
    --batch_size 8 \
    --gradient_accumulation_steps 16 \
    --num_epochs 1 \
    --lr 5e-4 \
    --hidden_dim 256 \
    --num_hidden_layers 8
```
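With these flags the effective batch size is `batch_size × gradient_accumulation_steps = 8 × 16 = 128` sequences per optimizer step.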
To estimate the number of tokens in a dataset and the model's parameter count for a given config, add the `--estimate` flag (note: this currently instantiates the model just for the estimate and will need refactoring; a config-only alternative is sketched after the command):
```bash
python train.py \
    --dataset "imdatta0/wikipedia_en_sample" \
    --batch_size 8 \
    --gradient_accumulation_steps 16 \
    --num_epochs 1 \
    --lr 5e-4 \
    --hidden_dim 256 \
    --num_hidden_layers 8 \
    --estimate
```
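One way to avoid building the model just for the estimate is to compute the parameter count directly from the config. The sketch below assumes a standard pre-norm transformer with GQA attention, a 4x MLP, two norms per layer, and tied embeddings; the field names, head counts, vocab size, and exact formula are assumptions and may not match NanoFormer's layer layout.

```python
def estimate_params(vocab_size, hidden_dim, num_layers, num_q_heads, num_kv_heads,
                    mlp_ratio=4, tie_embeddings=True):
    """Rough parameter count for a GQA transformer, computed from the config alone."""
    head_dim = hidden_dim // num_q_heads
    kv_dim = num_kv_heads * head_dim

    attn = hidden_dim * hidden_dim        # Q projection
    attn += 2 * hidden_dim * kv_dim       # grouped K and V projections
    attn += hidden_dim * hidden_dim       # output projection

    mlp = 2 * hidden_dim * (mlp_ratio * hidden_dim)  # up + down projections
    norms = 2 * hidden_dim                           # two norm weight vectors per layer

    per_layer = attn + mlp + norms
    embeddings = vocab_size * hidden_dim
    lm_head = 0 if tie_embeddings else vocab_size * hidden_dim
    final_norm = hidden_dim
    return embeddings + num_layers * per_layer + lm_head + final_norm

# Example with the config above (vocab size and head counts are placeholders):
print(estimate_params(vocab_size=32_000, hidden_dim=256, num_layers=8,
                      num_q_heads=8, num_kv_heads=2))
```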
- Implement Differential Transformer
- Implement nGPT
- Implement custom optimisers such as Shampoo and SOAP
- Add support for Sliding Window Attention
- Modify configs to be closer to Chinchilla-optimal ratios
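(A common rule of thumb from Hoffmann et al., 2022 is roughly 20 training tokens per parameter, so, for example, a 50M-parameter config would target on the order of 1B tokens.)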