triton_vs_cuda

Building Triton and CUDA kernels side-by-side to create a cuBLAS-performant GEMM kernel.

Lately I've been learning Triton, along with its strengths and weaknesses. Inspired by SiBohem's blog, I thought I would show how we can attempt to build a Triton kernel that performs as well as a hand-written CUDA kernel with near-cuBLAS performance. In this endeavor I hope to highlight a few things about Triton:

What are the limitations of Triton's block-level programming paradigm?
As kernel engineers, how much control do we retain in Triton to squeeze out more performance?
Where does the Triton compiler take over and attempt to fill in the gaps? How successful is it at this task? Where is work still needed at the compiler level?
When should you actually use Triton vs. CUDA?
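To make "block-level programming" concrete, here is a minimal pure-Python sketch of how a blocked GEMM is divided into program instances. It is an illustration only and is not taken from the kernels in this repo: the names `program`, `blocked_matmul`, `BLOCK_M`, `BLOCK_N`, and `BLOCK_K` are hypothetical, and the loops run sequentially on the CPU. In a real Triton kernel each program instance would get its tile index from `tl.program_id`, load tiles with masked `tl.load`, and accumulate with `tl.dot`. In CUDA you would write the per-thread work inside each tile yourself.

```python
# Pure-Python emulation of block-level GEMM (no GPU or Triton required).
# Each call to `program` plays the role of one program instance: it owns one
# BLOCK_M x BLOCK_N tile of C and walks the K dimension in BLOCK_K steps.
# All names here are hypothetical and chosen for illustration.

def program(pid_m, pid_n, A, B, C, M, N, K, BLOCK_M, BLOCK_N, BLOCK_K):
    # Rows and columns of C owned by this program instance.
    # The min(...) bounds checks stand in for the masks a GPU kernel would use.
    rows = range(pid_m * BLOCK_M, min((pid_m + 1) * BLOCK_M, M))
    cols = range(pid_n * BLOCK_N, min((pid_n + 1) * BLOCK_N, N))

    # Accumulator tile, analogous to a block of values held in registers.
    acc = {(i, j): 0.0 for i in rows for j in cols}

    # Walk the shared K dimension one block at a time.
    for k0 in range(0, K, BLOCK_K):
        ks = range(k0, min(k0 + BLOCK_K, K))
        for i in rows:
            for j in cols:
                acc[(i, j)] += sum(A[i][k] * B[k][j] for k in ks)

    # Write the finished tile back to C.
    for (i, j), value in acc.items():
        C[i][j] = value


def blocked_matmul(A, B, BLOCK_M=2, BLOCK_N=2, BLOCK_K=2):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]

    # Launch grid: one program instance per output tile (ceiling division).
    grid_m = (M + BLOCK_M - 1) // BLOCK_M
    grid_n = (N + BLOCK_N - 1) // BLOCK_N

    # On a GPU these instances would run in parallel; here they run in order.
    for pid_m in range(grid_m):
        for pid_n in range(grid_n):
            program(pid_m, pid_n, A, B, C, M, N, K, BLOCK_M, BLOCK_N, BLOCK_K)
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [1, 1]]
result = blocked_matmul(A, B)
print(result)
```

The point of the sketch is where the programmer's control stops: you pick the tile sizes and the order in which tiles are visited, while everything inside a tile (thread mapping, shared-memory staging, vectorized loads) is left to the compiler in Triton.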

Getting Started

I've divided this project into two branches:

main: template kernel files
solutions: solution kernel files

I've included Dockerfiles in both the /triton and /cuda directories to make environment setup quick and easy. Open those directories and you'll find a README.md in each explaining how to get going.

In Progress

I'll have a blog post on the subject up at some point on my personal website: alexkranias.com

I'm actively working on that piece.

In the meantime, you can clone this repo and work through the kernels on your own while following SiBohem's blog.
