
Commit d3a5492

v4.3.3 update. (#2868)
1 parent 49bd6bf commit d3a5492

24 files changed, +788 -210 lines changed


CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -2,6 +2,17 @@
 
 # CUTLASS 4.x
 
+## [4.3.3](https://github.com/NVIDIA/cutlass/releases/tag/v4.3.3) (2025-12-12)
+
+### CuTe DSL
+* New features
+  - Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
+  - Supported variadic tuples for JIT function arguments in tvm-ffi
+
+* Bug fixing and improvements
+  - Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
+  - Clearer error message for the cudaErrorInsufficientDriver runtime error
+
 ## [4.3.2](https://github.com/NVIDIA/cutlass/releases/tag/v4.3.2) (2025-12-05)
 
 ### CuTe DSL

README.md

Lines changed: 6 additions & 2 deletions
@@ -1,9 +1,9 @@
 ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
 # Overview
 
-# CUTLASS 4.3.2
+# CUTLASS 4.3.3
 
-_CUTLASS 4.3.2 - Dec 2025_
+_CUTLASS 4.3.3 - Dec 2025_
 
 CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
 and related computations at all levels and scales within CUDA. It incorporates strategies for
@@ -54,6 +54,8 @@ To get started quickly - please refer :
   - Added Blackwell SM103 support.
   - Multiple dependent DSOs in the wheel have been merged into one single DSO.
   - New env var `CUTE_DSL_CACHE_DIR` to specify the path for dumping caches.
+  - Supported namedtuple and kwargs for JIT function arguments in tvm-ffi.
+  - Supported variadic tuples for JIT function arguments in tvm-ffi.
 * Debuggability improvements:
   - Supported source location tracking for DSL APIs (allows profiling tools like ``nsight`` to correlate perf metrics with Python source code)
   - Supported dumping PTX and CUBIN code: [Hello World Example](https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/notebooks/hello_world.ipynb)
@@ -102,6 +104,8 @@ To get started quickly - please refer :
   - Fixed tvm-ffi export of compiled functions
   - Fixed an issue in CUDA JitExecutor when unloading kernels
   - Fixed an issue of allocating max smem when there is statically allocated smem
+  - Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
+  - Clearer error message for the cudaErrorInsufficientDriver runtime error
 
 ## CUTLASS C++
 * Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).

include/cutlass/version.h

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@
 
 #define CUTLASS_MAJOR 4
 #define CUTLASS_MINOR 3
-#define CUTLASS_PATCH 2
+#define CUTLASS_PATCH 3
 
 #ifdef CUTLASS_VERSIONS_GENERATED
 #include "cutlass/version_extended.h"

media/docs/pythonDSL/cute_dsl_general/compile_with_tvm_ffi.rst

Lines changed: 219 additions & 0 deletions
@@ -288,6 +288,131 @@ composed of the types that are supported by TVM FFI. The example below shows how
     example_add_one_with_tuple()
 
 
+Working with Variadic Tuples
+----------------------------
+
+Sometimes it is helpful to annotate a tuple with no explicit element types.
+This can be useful to build up a generic template for a function that accepts
+a variable number of elements. The compiled function's signature will be
+determined by the tuple argument passed to the ``cute.compile`` function.
+The following example shows how to use a variadic tuple to build such a
+generic template.
+
+.. code-block:: python
+
+    import cutlass
+    import torch
+    from cutlass import cute
+
+    @cute.kernel
+    def device_add_one(a: cute.Tensor, b: cute.Tensor, extra_value: tuple):
+        threads_per_block = 128
+        cta_x_, _, _ = cute.arch.block_idx()
+        tid_x, _, _ = cute.arch.thread_idx()
+        tid = cta_x_ * threads_per_block + tid_x
+        if tid < a.shape[0]:
+            if cutlass.const_expr(len(extra_value) != 0):
+                b[tid] = a[tid] + 1 + extra_value[0]
+            else:
+                b[tid] = a[tid] + 1
+
+    @cute.jit
+    def add_one_with_extra_value(a: cute.Tensor, b: cute.Tensor, extra_value: tuple):
+        n = a.shape[0]
+        threads_per_block = 128
+        blocks = (n + threads_per_block - 1) // threads_per_block
+        device_add_one(a, b, extra_value).launch(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))
+
+    def example_add_one_with_variadic_tuple():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        compiled_add_one_no_extra = cute.compile(
+            add_one_with_extra_value, a_cute, b_cute, (),
+            options="--enable-tvm-ffi"
+        )
+        compiled_add_one_with_extra = cute.compile(
+            add_one_with_extra_value, a_cute, b_cute, (cute.Float32(4),),
+            options="--enable-tvm-ffi"
+        )
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+        compiled_add_one_no_extra(a_torch, b_torch, ())
+        print("result of b_torch after compiled_add_one_no_extra(a_torch, b_torch, ())")
+        print(b_torch)
+        compiled_add_one_with_extra(a_torch, b_torch, (4,))
+        print("result of b_torch after compiled_add_one_with_extra(a_torch, b_torch, (4,))")
+        print(b_torch)
+
+    example_add_one_with_variadic_tuple()
+
+
+Working with Named Tuples
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Named tuples are also supported and help logically group related arguments together.
+The example below shows how to use named tuples as arguments. Under the hood, named tuples
+are passed as unnamed tuples at the ABI level. When errors occur, the function signature in
+error messages will display unnamed tuple arguments.
+Ensure that the compile-time CuTe named tuple type definition has the same fields
+as the runtime PyTorch named tuple.
+Currently, users need to explicitly unpack the named tuple outside of conditionals and then
+use the unpacked variables inside the conditionals.
+
+.. code-block:: python
+
+    from typing import NamedTuple
+    from cutlass import cute
+    import torch
+
+    class CuteNamedTuple(NamedTuple):
+        a: cute.Tensor
+        b: cute.Tensor
+        c: cute.Float32 = cute.Float32(1)
+
+        def __new_from_mlir_values__(self, values):
+            return CuteNamedTuple(*values)
+
+    class TorchNamedTuple(NamedTuple):
+        a: torch.Tensor
+        b: torch.Tensor
+        c: float = 1
+
+    @cute.kernel
+    def device_add_one_named_tuple(value: CuteNamedTuple):
+        tid = cute.arch.block_idx()[0] * 128 + cute.arch.thread_idx()[0]
+        # need to unpack the namedtuple outside conditionals
+        a = value.a
+        b = value.b
+        c = value.c
+        if tid < a.shape[0]:
+            b[tid] = a[tid] + c
+
+    @cute.jit
+    def add_one_with_named_tuple(value: CuteNamedTuple):
+        n = value.a.shape[0]
+        threads_per_block = 128
+        blocks = (n + threads_per_block - 1) // threads_per_block
+        device_add_one_named_tuple(value).launch(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))
+
+    def example_add_one_with_named_tuple():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+
+        compiled_add_one = cute.compile(
+            add_one_with_named_tuple, CuteNamedTuple(a=a_cute, b=b_cute),
+            options="--enable-tvm-ffi"
+        )
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+        compiled_add_one(TorchNamedTuple(a=a_torch, b=b_torch))
+        print("result of b_torch")
+        print(b_torch)
+
+    example_add_one_with_named_tuple()
+
+
 Supported types
 ---------------
 
@@ -464,3 +589,97 @@ When you build your own libraries, make sure you link against the necessary runt
 You can use ``cute.runtime.find_runtime_libraries(enable_tvm_ffi=True)`` to get the path to these libraries.
 ``cute.runtime.load_module`` will load these libraries automatically before loading
 an exported module. You can also manually load these libraries in advanced use cases.
+
+
+Keyword Arguments and Defaults
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The function returned by ``cute.compile`` supports keyword arguments and defaults.
+The example below shows how to use keyword arguments and defaults:
+
+.. code-block:: python
+
+    import cutlass
+    import torch
+    from cutlass import cute
+
+    @cute.kernel
+    def device_add_scalar(a: cute.Tensor, b: cute.Tensor, offset: cutlass.Float32):
+        threads_per_block = 128
+        cta_x_, _, _ = cute.arch.block_idx()
+        tid_x, _, _ = cute.arch.thread_idx()
+        tid = cta_x_ * threads_per_block + tid_x
+        if tid < a.shape[0]:
+            b[tid] = a[tid] + offset
+
+    @cute.jit
+    def add_constant(a: cute.Tensor, b: cute.Tensor, offset: cutlass.Float32 = cutlass.Float32(1)):
+        n = a.shape[0]
+        threads_per_block = 128
+        blocks = (n + threads_per_block - 1) // threads_per_block
+        device_add_scalar(a, b, offset).launch(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))
+
+    def example_kwargs_and_defaults():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        compiled_add_constant = cute.compile(add_constant, a_cute, b_cute, options="--enable-tvm-ffi")
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+        compiled_add_constant(a_torch, b_torch)
+        print("result of b_torch after compiled_add_constant(a_torch, b_torch)")
+        print(b_torch)
+        compiled_add_constant(a_torch, b_torch, offset=4)
+        print("result of b_torch after compiled_add_constant(a_torch, b_torch, offset=4)")
+        print(b_torch)
+
+For efficiency and portability reasons, the TVM FFI ABI supports functions with positional-only arguments.
+If you export the compiled module to an object file and then load it back, the function
+will only accept positional arguments in the order of the arguments in the function signature.
+You can rewrap the function or use the TVM FFI wrapper generator to generate a kwargs wrapper.
+The code block below shows how to do this:
+
+.. code-block:: python
+
+    import inspect
+    import subprocess
+
+    def example_kwargs_and_defaults():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        compiled_add_constant = cute.compile(add_constant, a_cute, b_cute, options="--enable-tvm-ffi")
+        # export the compiled module to an object file
+        compiled_add_constant.export_to_c("./add_constant.o", function_name="add_constant")
+        # obtain the runtime libraries required to load the shared library
+        runtime_libs = cute.runtime.find_runtime_libraries(enable_tvm_ffi=True)
+        # compile the object file into a shared library
+        cmd = ["gcc", "-shared", "-o", "./add_constant.so", "./add_constant.o", *runtime_libs]
+        subprocess.run(cmd, check=True)
+
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+
+        mod = cute.runtime.load_module("./add_constant.so")
+        try:
+            mod.add_constant(a_torch, b_torch)
+        except Exception as e:
+            # Raises a missing-arguments error because kwargs and default information are lost
+            print(e)
+        # Rewrap the function to regain kwargs and default-argument support.
+        # Alternatively, use the TVM FFI wrapper generator to generate a kwargs wrapper function.
+        from tvm_ffi.utils import kwargs_wrapper
+        # arg_defaults are aligned to the end of the argument list
+        wrapped_func = kwargs_wrapper.make_kwargs_wrapper(
+            mod.add_constant, arg_names=["a", "b", "offset"], arg_defaults=(1,)
+        )
+        wrapped_func(a_torch, b_torch)
+        print("result of b_torch after wrapped_func(a_torch, b_torch)")
+        print(b_torch)
+        # You can also use the signature of the original function
+        # to generate a kwargs wrapper function. Make sure to exclude
+        # arguments that are not part of the runtime signature,
+        # such as 'self', constexpr, and env stream arguments.
+        wrapped_func = kwargs_wrapper.make_kwargs_wrapper_from_signature(
+            mod.add_constant, signature=inspect.signature(add_constant),
+            exclude_arg_names=["self"]
+        )
+        wrapped_func(a_torch, b_torch, offset=4)
+        print("result of b_torch after wrapped_func(a_torch, b_torch, offset=4)")
+        print(b_torch)

media/docs/pythonDSL/cute_dsl_general/dsl_jit_caching.rst

Lines changed: 1 addition & 4 deletions
@@ -128,7 +128,7 @@ Here is an example demonstrating automatic caching of the ``add`` kernel:
 The cache can be serialized to files for subsequent runs.
 After serialization, compiled MLIR bytecode is stored in files.
 The cache directory is ``/tmp/{current_user}/cutlass_python_cache``.
-The cache loads from files into memory during |DSL| initialization and saves back to files when the process exits.
+During compilation, the cache loads the corresponding kernel from file (if it exists) into memory as needed, and after compilation, it saves any newly compiled executables back to file.
 
 Note that for efficiency, the default cache directory is located in a temporary folder. However, this location is not persistent; it may be cleared by the system (for example, during a reboot or disk space cleanup).
 If you wish to preserve the cache across sessions, set the ``CUTE_DSL_CACHE_DIR`` environment variable to point to a persistent directory.
@@ -140,9 +140,6 @@ The following environment variables control file caching:
 # Disable file caching while keeping in-memory cache available, defaults to False.
 export CUTE_DSL_DISABLE_FILE_CACHING=True
 
-# Maximum number of cache files allowed, defaults to 1000.
-export CUTE_DSL_FILE_CACHING_CAPACITY=1000
-
 # Cache directory, defaults to /tmp/{current_user}/cutlass_python_cache.
 export CUTE_DSL_CACHE_DIR=/home/user/local_cutlass_python_cache/dense_gemm_cache/
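
As an editorial illustration (not part of this commit) of the persistent-cache workflow described in the hunk above, here is a minimal Python sketch. It assumes the environment variables are read when the DSL initializes, so they are set before ``cutlass`` is imported:

    import os

    # Point the file cache at a persistent directory (assumption: read at DSL init).
    os.environ["CUTE_DSL_CACHE_DIR"] = "/home/user/local_cutlass_python_cache/dense_gemm_cache/"
    # Uncomment to keep only the in-memory cache and skip file caching entirely.
    # os.environ["CUTE_DSL_DISABLE_FILE_CACHING"] = "True"

    import cutlass
    import cutlass.cute as cute  # later cute.compile calls can reuse kernels cached on disk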

media/docs/pythonDSL/cute_dsl_general/framework_integration.rst

Lines changed: 5 additions & 0 deletions
@@ -192,6 +192,11 @@ For example:
 - For a tensor with layout ``(2,2):(8,2)``, since no dimension has stride 1,
   all dimensions are marked as dynamic: ``(?,?):(?,?)``.
 
+The leading dimension also accepts a negative index, which counts from the last dimension. For example,
+
+- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, if ``leading_dim`` is specified as -1,
+  the layout will be marked as ``(?,?,?,?):(?,?,?,1)``.
+
 Code Example
 ~~~~~~~~~~~~
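
As an editorial illustration (not part of the commit) of the negative ``leading_dim`` behavior described in the hunk above, here is a minimal sketch. It assumes the ``from_dlpack(...).mark_layout_dynamic(leading_dim=...)`` API used elsewhere in the CuTe DSL framework-integration guide; the exact printed layout is indicative rather than verified output:

    import torch
    from cutlass.cute.runtime import from_dlpack

    # A contiguous 4-D tensor: shape (2, 2, 3, 4), strides (24, 12, 4, 1).
    x = torch.randn(2, 2, 3, 4, device="cuda")

    # leading_dim=-1 refers to the last mode; per the doc above, its stride stays
    # static at 1 while the rest of the layout is marked dynamic.
    x_cute = from_dlpack(x).mark_layout_dynamic(leading_dim=-1)
    print(x_cute.layout)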