Building Triton and CUDA kernels side-by-side to create a cuBLAS-performant GEMM kernel.
Lately I've been learning Triton, its strengths, and its weaknesses. Inspired by SiBohem's blog, I thought I would show how we can attempt to build a Triton kernel that matches a CUDA kernel with near-cuBLAS performance. In this endeavor I hope to highlight a few things about Triton:
- what are the limitations of Triton's block-level programming paradigm?
- as a kernel engineer, how much control do we retain in Triton to squeeze more performance out?
- where does the Triton compiler take over and attempt to fill in? How successful is it at this task? Where is work still needed at the compiler level?
- when should you actually use Triton vs. CUDA?
I've divided this project into two branches:
- `main`: template kernel files
- `solutions`: solution kernel files
I've included Dockerfiles in the `/triton` and `/cuda` directories to make environment setup quick and easy. Open those directories and you'll find a `README.md` in each explaining how to get going.
I'm actively working on a blog post on the subject, which I'll publish on my personal website: alexkranias.com
In the meantime, you can clone this repo to work through the kernels on your own while following SiBohem's blog.