An extremely simple and understandable GPT-2 implementation with minor tweaks.
- You can train even the subword tokenizer yourself, which is useful for non-English languages (see the tokenizer sketch below).
- Fast, optimized code; a single RTX 2080 Ti card is enough.
- Easy to understand, solid code
- Easy to extend for new experiments
- LAMB optimizer (a minimal implementation sketch follows this list).
- Mixed-precision training, with the numerically important layers kept in fp32 (see the training-step sketch below).
- Sinusoidal (sin/cos) positional encoding (sketch below).
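
For the tokenizer bullet: a minimal sketch of training a BPE subword tokenizer with the `sentencepiece` library. This repo's actual tokenizer code may differ; the file names and `vocab_size` here are illustrative.

```python
import sentencepiece as spm

# Train a BPE subword tokenizer on a raw-text corpus (one sentence per
# line). Works for any language, not just English.
spm.SentencePieceTrainer.train(
    input='corpus.txt',         # your (possibly non-English) training text
    model_prefix='tokenizer',   # writes tokenizer.model / tokenizer.vocab
    vocab_size=32000,
    model_type='bpe',
)

# Load the trained model and round-trip some text.
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
ids = sp.encode('Hello world', out_type=int)
print(sp.decode(ids))
```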
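For the LAMB bullet: a minimal PyTorch sketch of the LAMB optimizer (You et al., 2019), not necessarily the exact implementation used here. LAMB is essentially Adam plus a per-tensor "trust ratio" that rescales each update by `||w|| / ||update||`, which stabilizes training at large batch sizes.

```python
import torch
from torch.optim import Optimizer

class Lamb(Optimizer):
    """Minimal LAMB sketch: Adam moments + layer-wise trust ratio."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-6, weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group['betas']
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p)
                    state['exp_avg_sq'] = torch.zeros_like(p)
                state['step'] += 1
                m, v = state['exp_avg'], state['exp_avg_sq']
                # Adam-style first and second moment estimates.
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                # Bias-corrected update, plus decoupled weight decay.
                m_hat = m / (1 - beta1 ** state['step'])
                v_hat = v / (1 - beta2 ** state['step'])
                update = m_hat / (v_hat.sqrt() + group['eps'])
                if group['weight_decay'] != 0:
                    update.add_(p, alpha=group['weight_decay'])
                # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
                w_norm = p.norm().item()
                u_norm = update.norm().item()
                trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
                p.add_(update, alpha=-group['lr'] * trust)
        return loss
```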
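For the mixed-precision bullet: a sketch of one training step using PyTorch's `torch.cuda.amp`, under which matmul-heavy layers run in fp16 while numerically sensitive ops (softmax, layer norm, the loss) are kept in fp32. The repo's own scheme may differ; `model`, `tokens`, `targets`, `optimizer`, and `loader` are placeholder names.

```python
import torch
import torch.nn.functional as F

def train_step(model, tokens, targets, optimizer, scaler):
    """One mixed-precision language-model training step."""
    optimizer.zero_grad()
    # Matmul-heavy layers run in fp16 under autocast; numerically
    # sensitive ops (softmax, layer norm, the loss) stay in fp32.
    with torch.cuda.amp.autocast():
        logits = model(tokens)                       # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1))
    # Loss scaling keeps small fp16 gradients from underflowing to zero.
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients, then optimizer.step()
    scaler.update()
    return loss.item()

# Usage (placeholder names for the repo's model and data loader):
# scaler = torch.cuda.amp.GradScaler()
# for tokens, targets in loader:
#     train_step(model, tokens.cuda(), targets.cuda(), optimizer, scaler)
```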
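For the positional-encoding bullet: a sketch of the fixed sin/cos encoding from "Attention Is All You Need", used here in place of GPT-2's learned position embeddings.

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos positional encodings.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes an even d_model.
    """
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (seq_len, d_model); added to the token embeddings
```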