
Commit d3a5492

v4.3.3 update. (#2868)
1 parent 49bd6bf commit d3a5492

24 files changed, +788 -210 lines changed


CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -2,6 +2,17 @@
 
 # CUTLASS 4.x
 
+## [4.3.3](https://github.com/NVIDIA/cutlass/releases/tag/v4.3.3) (2025-12-12)
+
+### CuTe DSL
+* New features
+  - Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
+  - Supported variadic tuples for JIT function arguments in tvm-ffi
+
+* Bug fixing and improvements
+  - Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
+  - Clearer error message for the cudaErrorInsufficientDriver runtime error
+
 ## [4.3.2](https://github.com/NVIDIA/cutlass/releases/tag/v4.3.2) (2025-12-05)
 
 ### CuTe DSL

README.md

Lines changed: 6 additions & 2 deletions
@@ -1,9 +1,9 @@
 ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
 # Overview
 
-# CUTLASS 4.3.2
+# CUTLASS 4.3.3
 
-_CUTLASS 4.3.2 - Dec 2025_
+_CUTLASS 4.3.3 - Dec 2025_
 
 CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
 and related computations at all levels and scales within CUDA. It incorporates strategies for
@@ -54,6 +54,8 @@ To get started quickly - please refer :
   - Added Blackwell SM103 support.
   - Multiple dependent DSOs in the wheel have been merged into one single DSO.
   - New env var `CUTE_DSL_CACHE_DIR` to specify the path for dumping caches.
+  - Supported namedtuple and kwargs for JIT function arguments in tvm-ffi.
+  - Supported variadic tuples for JIT function arguments in tvm-ffi.
 * Debuggability improvements:
   - Supported source location tracking for DSL APIs (allows profiling tools like ``nsight`` to correlate perf metrics with Python source code)
   - Supported dumping PTX and CUBIN code: [Hello World Example](https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/notebooks/hello_world.ipynb)
@@ -102,6 +104,8 @@ To get started quickly - please refer :
   - Fixed tvm-ffi export of compiled functions
   - Fixed an issue in CUDA JitExecutor when unloading kernels
   - Fixed an issue of allocating max smem when there is statically allocated smem
+  - Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
+  - Clearer error message for the cudaErrorInsufficientDriver runtime error
 
 ## CUTLASS C++
 * Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).

include/cutlass/version.h

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@
 
 #define CUTLASS_MAJOR 4
 #define CUTLASS_MINOR 3
-#define CUTLASS_PATCH 2
+#define CUTLASS_PATCH 3
 
 #ifdef CUTLASS_VERSIONS_GENERATED
 #include "cutlass/version_extended.h"

media/docs/pythonDSL/cute_dsl_general/compile_with_tvm_ffi.rst

Lines changed: 219 additions & 0 deletions
@@ -288,6 +288,131 @@ composed of the types that are supported by TVM FFI. The example below shows how
     example_add_one_with_tuple()
 
 
+Working with Variadic Tuples
+----------------------------
+
+Sometimes it is helpful to annotate a tuple with no explicit element types.
+This can be useful to build up a generic template for a function that accepts
+a variable number of elements. The compiled function's signature will be
+determined by the tuple argument passed to the ``cute.compile`` function.
+The following example shows how to use a variadic tuple to build such a
+generic template.
+
+.. code-block:: python
+
+    import cutlass
+    import torch
+    from cutlass import cute
+
+    @cute.kernel
+    def device_add_one(a: cute.Tensor, b: cute.Tensor, extra_value: tuple):
+        threads_per_block = 128
+        cta_x_, _, _ = cute.arch.block_idx()
+        tid_x, _, _ = cute.arch.thread_idx()
+        tid = cta_x_ * threads_per_block + tid_x
+        if tid < a.shape[0]:
+            if cutlass.const_expr(len(extra_value) != 0):
+                b[tid] = a[tid] + 1 + extra_value[0]
+            else:
+                b[tid] = a[tid] + 1
+
+    @cute.jit
+    def add_one_with_extra_value(a: cute.Tensor, b: cute.Tensor, extra_value: tuple):
+        n = a.shape[0]
+        threads_per_block = 128
+        blocks = (n + threads_per_block - 1) // threads_per_block
+        device_add_one(a, b, extra_value).launch(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))
+
+    def example_add_one_with_variadic_tuple():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        compiled_add_one_no_extra = cute.compile(
+            add_one_with_extra_value, a_cute, b_cute, (),
+            options="--enable-tvm-ffi"
+        )
+        compiled_add_one_with_extra = cute.compile(
+            add_one_with_extra_value, a_cute, b_cute, (cute.Float32(4),),
+            options="--enable-tvm-ffi"
+        )
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+        compiled_add_one_no_extra(a_torch, b_torch, ())
+        print("result of b_torch after compiled_add_one_no_extra(a_torch, b_torch, ())")
+        print(b_torch)
+        compiled_add_one_with_extra(a_torch, b_torch, (4,))
+        print("result of b_torch after compiled_add_one_with_extra(a_torch, b_torch, (4,))")
+        print(b_torch)
+
+    example_add_one_with_variadic_tuple()
+
+
+Working with Named Tuples
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Named tuples are also supported and help logically group related arguments together.
+The example below shows how to use named tuples as arguments. Under the hood, named tuples
+are passed as unnamed tuples at the ABI level. When errors occur, the function signature in
+error messages will display unnamed tuple arguments.
+Ensure that the compile-time CuTe named tuple type definition has the same fields
+as the runtime PyTorch named tuple.
+Currently, users need to explicitly unpack the named tuple outside of conditionals and then
+use the unpacked variables inside the conditionals.
+
+.. code-block:: python
+
+    from typing import NamedTuple
+    from cutlass import cute
+    import torch
+
+    class CuteNamedTuple(NamedTuple):
+        a: cute.Tensor
+        b: cute.Tensor
+        c: cute.Float32 = cute.Float32(1)
+
+        def __new_from_mlir_values__(self, values):
+            return CuteNamedTuple(*values)
+
+    class TorchNamedTuple(NamedTuple):
+        a: torch.Tensor
+        b: torch.Tensor
+        c: float = 1
+
+    @cute.kernel
+    def device_add_one_named_tuple(value: CuteNamedTuple):
+        tid = cute.arch.block_idx()[0] * 128 + cute.arch.thread_idx()[0]
+        # need to unpack the namedtuple outside conditionals
+        a = value.a
+        b = value.b
+        c = value.c
+        if tid < a.shape[0]:
+            b[tid] = a[tid] + c
+
+    @cute.jit
+    def add_one_with_named_tuple(value: CuteNamedTuple):
+        n = value.a.shape[0]
+        threads_per_block = 128
+        blocks = (n + threads_per_block - 1) // threads_per_block
+        device_add_one_named_tuple(value).launch(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))
+
+    def example_add_one_with_named_tuple():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+
+        compiled_add_one = cute.compile(
+            add_one_with_named_tuple, CuteNamedTuple(a=a_cute, b=b_cute),
+            options="--enable-tvm-ffi"
+        )
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+        compiled_add_one(TorchNamedTuple(a=a_torch, b=b_torch))
+        print("result of b_torch")
+        print(b_torch)
+
+    example_add_one_with_named_tuple()
+
+
 Supported types
 ---------------
 
@@ -464,3 +589,97 @@ When you build your own libraries, make sure you link against the necessary runt
 You can use ``cute.runtime.find_runtime_libraries(enable_tvm_ffi=True)`` to get the path to these libraries.
 ``cute.runtime.load_module`` will load these libraries automatically before loading
 an exported module. You can also manually load these libraries in advanced use cases.
+
+
+Keyword Arguments and Defaults
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The function returned by ``cute.compile`` supports keyword arguments and defaults.
+The example below shows how to use keyword arguments and defaults:
+
+.. code-block:: python
+
+    import cutlass
+    import torch
+    from cutlass import cute
+
+    @cute.kernel
+    def device_add_scalar(a: cute.Tensor, b: cute.Tensor, offset: cutlass.Float32):
+        threads_per_block = 128
+        cta_x_, _, _ = cute.arch.block_idx()
+        tid_x, _, _ = cute.arch.thread_idx()
+        tid = cta_x_ * threads_per_block + tid_x
+        if tid < a.shape[0]:
+            b[tid] = a[tid] + offset
+
+    @cute.jit
+    def add_constant(a: cute.Tensor, b: cute.Tensor, offset: cutlass.Float32 = cutlass.Float32(1)):
+        n = a.shape[0]
+        threads_per_block = 128
+        blocks = (n + threads_per_block - 1) // threads_per_block
+        device_add_scalar(a, b, offset).launch(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))
+
+    def example_kwargs_and_defaults():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        compiled_add_constant = cute.compile(add_constant, a_cute, b_cute, options="--enable-tvm-ffi")
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+        compiled_add_constant(a_torch, b_torch)
+        print("result of b_torch after compiled_add_constant(a_torch, b_torch)")
+        print(b_torch)
+        compiled_add_constant(a_torch, b_torch, offset=4)
+        print("result of b_torch after compiled_add_constant(a_torch, b_torch, offset=4)")
+        print(b_torch)
+
+For efficiency and portability reasons, the TVM FFI ABI supports functions with positional-only arguments.
+If you export the compiled module to an object file and then load it back, the function
+will only accept positional arguments in the order of the arguments in the function signature.
+You can rewrap the function or use the TVM FFI wrapper generator to generate a kwargs wrapper.
+The code block below shows how to do this:
+
+.. code-block:: python
+
+    import inspect
+    import subprocess
+
+    def example_kwargs_and_defaults():
+        n = cute.sym_int()
+        a_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        b_cute = cute.runtime.make_fake_compact_tensor(cute.Float32, (n,))
+        compiled_add_constant = cute.compile(add_constant, a_cute, b_cute, options="--enable-tvm-ffi")
+        # export the compiled module to an object file
+        compiled_add_constant.export_to_c("./add_constant.o", function_name="add_constant")
+        # obtain the runtime libraries required to load the shared library
+        runtime_libs = cute.runtime.find_runtime_libraries(enable_tvm_ffi=True)
+        # compile the object file into a shared library
+        cmd = ["gcc", "-shared", "-o", "./add_constant.so", "./add_constant.o", *runtime_libs]
+        subprocess.run(cmd, check=True)
+
+        a_torch = torch.arange(10, dtype=torch.float32, device="cuda")
+        b_torch = torch.empty(10, dtype=torch.float32, device="cuda")
+
+        mod = cute.runtime.load_module("./add_constant.so")
+        try:
+            mod.add_constant(a_torch, b_torch)
+        except Exception as e:
+            # Raises a missing-arguments error because kwargs and default information are lost
+            print(e)
+        # Rewrap the function to regain kwargs and default-argument support.
+        # Alternatively, use the TVM FFI wrapper generator to generate a kwargs wrapper function.
+        from tvm_ffi.utils import kwargs_wrapper
+        # arg_defaults are aligned to the end of the argument list
+        wrapped_func = kwargs_wrapper.make_kwargs_wrapper(
+            mod.add_constant, arg_names=["a", "b", "offset"], arg_defaults=(1,)
+        )
+        wrapped_func(a_torch, b_torch)
+        print("result of b_torch after wrapped_func(a_torch, b_torch)")
+        print(b_torch)
+        # You can also use the signature of the original function
+        # to generate a kwargs wrapper function. Make sure to exclude
+        # arguments that are not part of the runtime signature,
+        # such as 'self', constexpr, and env stream arguments.
+        wrapped_func = kwargs_wrapper.make_kwargs_wrapper_from_signature(
+            mod.add_constant, signature=inspect.signature(add_constant),
+            exclude_arg_names=["self"]
+        )
+        wrapped_func(a_torch, b_torch, offset=4)
+        print("result of b_torch after wrapped_func(a_torch, b_torch, offset=4)")
+        print(b_torch)

media/docs/pythonDSL/cute_dsl_general/dsl_jit_caching.rst

Lines changed: 1 addition & 4 deletions
@@ -128,7 +128,7 @@ Here is an example demonstrating automatic caching of the ``add`` kernel:
 The cache can be serialized to files for subsequent runs.
 After serialization, compiled MLIR bytecode is stored in files.
 The cache directory is ``/tmp/{current_user}/cutlass_python_cache``.
-The cache loads from files into memory during |DSL| initialization and saves back to files when the process exits.
+During compilation, the cache loads the corresponding kernel from file (if it exists) into memory as needed, and after compilation, it saves any newly compiled executables back to file.
 
 Note that for efficiency, the default cache directory is located in a temporary folder. However, this location is not persistent; it may be cleared by the system (for example, during a reboot or disk space cleanup).
 If you wish to preserve the cache across sessions, set the ``CUTE_DSL_CACHE_DIR`` environment variable to point to a persistent directory.
@@ -140,9 +140,6 @@ The following environment variables control file caching:
 # Disable file caching while keeping in-memory cache available, defaults to False.
 export CUTE_DSL_DISABLE_FILE_CACHING=True
 
-# Maximum number of cache files allowed, defaults to 1000.
-export CUTE_DSL_FILE_CACHING_CAPACITY=1000
-
 # Cache directory, defaults to /tmp/{current_user}/cutlass_python_cache.
 export CUTE_DSL_CACHE_DIR=/home/user/local_cutlass_python_cache/dense_gemm_cache/
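
As an editorial illustration (not part of this commit) of the persistent-cache workflow described in the hunk above, here is a minimal Python sketch. It assumes the environment variables are read when the DSL initializes, so they are set before ``cutlass`` is imported:

    import os

    # Point the file cache at a persistent directory (assumption: read at DSL init).
    os.environ["CUTE_DSL_CACHE_DIR"] = "/home/user/local_cutlass_python_cache/dense_gemm_cache/"
    # Uncomment to keep only the in-memory cache and skip file caching entirely.
    # os.environ["CUTE_DSL_DISABLE_FILE_CACHING"] = "True"

    import cutlass
    import cutlass.cute as cute  # later cute.compile calls can reuse kernels cached on disk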

media/docs/pythonDSL/cute_dsl_general/framework_integration.rst

Lines changed: 5 additions & 0 deletions
@@ -192,6 +192,11 @@ For example:
 - For a tensor with layout ``(2,2):(8,2)``, since no dimension has stride 1,
   all dimensions are marked as dynamic: ``(?,?):(?,?)``.
 
+The leading dimension also accepts a negative index, which counts from the last dimension. For example,
+
+- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, if ``leading_dim`` is specified as -1,
+  the layout will be marked as ``(?,?,?,?):(?,?,?,1)``.
+
 Code Example
 ~~~~~~~~~~~~
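
As an editorial illustration (not part of the commit) of the negative ``leading_dim`` behavior described in the hunk above, here is a minimal sketch. It assumes the ``from_dlpack(...).mark_layout_dynamic(leading_dim=...)`` API used elsewhere in the CuTe DSL framework-integration guide; the exact printed layout is indicative rather than verified output:

    import torch
    from cutlass.cute.runtime import from_dlpack

    # A contiguous 4-D tensor: shape (2, 2, 3, 4), strides (24, 12, 4, 1).
    x = torch.randn(2, 2, 3, 4, device="cuda")

    # leading_dim=-1 refers to the last mode; per the doc above, its stride stays
    # static at 1 while the rest of the layout is marked dynamic.
    x_cute = from_dlpack(x).mark_layout_dynamic(leading_dim=-1)
    print(x_cute.layout)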