README.md (+6 −2):

@@ -1,9 +1,9 @@
 
 # Overview
 
-# CUTLASS 4.3.2
+# CUTLASS 4.3.3
 
-_CUTLASS 4.3.2 - Dec 2025_
+_CUTLASS 4.3.3 - Dec 2025_
 
 CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
 and related computations at all levels and scales within CUDA. It incorporates strategies for
@@ -54,6 +54,8 @@ To get started quickly - please refer :
   - Added Blackwell SM103 support.
   - Multiple dependent DSOs in the wheel have been merged into one single DSO.
   - New env var `CUTE_DSL_CACHE_DIR` to specify the path for dumping caches.
+  - Supported namedtuple and kwargs for JIT function arguments in tvm-ffi.
+  - Supported variadic tuples for JIT function arguments in tvm-ffi.
 * Debuggability improvements:
   - Supported source location tracking for DSL APIs (Allow tools like ``nsight`` profiling to correlate perf metrics with Python source code)
   - Supported dumping PTX and CUBIN code: [Hello World Example](https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/notebooks/hello_world.ipynb)
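The namedtuple/kwargs bullets above concern how a JIT entry point binds and caches its arguments. As a library-agnostic sketch in plain Python (a toy decorator, not the CuTe DSL or tvm-ffi API), the key idea is normalizing positional, keyword, and namedtuple spellings of the same call to one canonical cache key:

```python
import inspect
from collections import namedtuple

def jit(fn):
    """Toy JIT wrapper: 'compiles' once per distinct argument binding.
    Illustrative only -- not the CuTe DSL implementation."""
    sig = inspect.signature(fn)
    cache = {}

    def wrapper(*args, **kwargs):
        # Normalize positional/keyword spellings to one canonical key,
        # so f(p) and f(p=p) hit the same cache entry.
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        key = tuple(sorted(bound.arguments.items()))
        if key not in cache:
            wrapper.compilations += 1
            cache[key] = fn  # stand-in for a compiled artifact
        return cache[key](*bound.args, **bound.kwargs)

    wrapper.compilations = 0
    return wrapper

Point = namedtuple("Point", ["x", "y"])

@jit
def add(p):
    return p.x + p.y

assert add(Point(1, 2)) == 3
assert add(p=Point(1, 2)) == 3   # kwargs spelling reuses the cached entry
assert add.compilations == 1
```

This toy keys the cache on argument values; a real JIT cache would more plausibly key on argument types and layouts, but the binding/normalization step it illustrates is the same.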
@@ -102,6 +104,8 @@ To get started quickly - please refer :
   - Fixed tvm-ffi export compiled function
   - Fixed an issue of CUDA JitExecutor when unloading kernels
   - Fixed an issue of allocating max smem when there's statically allocated smem
+  - Fixed an issue with JIT function arguments that have union type annotations in tvm-ffi
+  - Clearer error message for the runtime error cudaErrorInsufficientDriver
 
 ## CUTLASS C++
 * Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
media/docs/pythonDSL/cute_dsl_general/dsl_jit_caching.rst (+1 −4):

@@ -128,7 +128,7 @@ Here is an example demonstrating automatic caching of the ``add`` kernel:
 The cache can be serialized to files for subsequent runs.
 After serialization, compiled MLIR bytecode is stored in file.
 The cache directory is ``/tmp/{current_user}/cutlass_python_cache``.
-The cache loads from files into memory during |DSL| initialization and saves back to files when the process exits.
+During compilation, the cache loads the corresponding kernel from file (if it exists) into memory as needed, and after compilation, it saves any newly compiled executables back to file.
 
 Note that for efficiency, the default cache directory is located in a temporary folder. However, this location is not persistent, it may be cleared by the system (for example, during a reboot or disk space cleanup).
 If you wish to preserve the cache across sessions, set the ``CUTE_DSL_CACHE_DIR`` environment variable to point to a persistent directory.
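The replacement sentence in the hunk above describes lazy, per-kernel file caching: look up the kernel's cache file when compiling, and write any newly built artifact back afterwards. A minimal sketch of that load-on-miss/save-after-compile pattern, using hypothetical names rather than the DSL's internals:

```python
import hashlib
import os
import tempfile

def cached_compile(source, compile_fn, cache_dir):
    """Return (artifact, was_cache_hit). Illustrative sketch only --
    a hypothetical helper, not the CuTe DSL's actual cache code."""
    os.makedirs(cache_dir, exist_ok=True)
    # One cache file per kernel, keyed by a hash of its source.
    key = hashlib.sha256(source.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):           # hit: load the compiled artifact
        with open(path, "rb") as f:
            return f.read(), True
    artifact = compile_fn(source)      # miss: compile now...
    with open(path, "wb") as f:        # ...and save back for later runs
        f.write(artifact)
    return artifact, False

# First call compiles and writes the file; the rerun loads it instead.
cache = tempfile.mkdtemp()
build = lambda src: src.upper().encode()  # stand-in for real compilation
art1, hit1 = cached_compile("add_kernel", build, cache)
art2, hit2 = cached_compile("add_kernel", build, cache)
assert (hit1, hit2) == (False, True) and art1 == art2 == b"ADD_KERNEL"
```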
@@ -140,9 +140,6 @@ The following environment variables control file caching:
    # Disable file caching while keeping in-memory cache available, defaults to False.
    export CUTE_DSL_DISABLE_FILE_CACHING=True
 
-   # Maximum number of cache files allowed, defaults to 1000.
-   export CUTE_DSL_FILE_CACHING_CAPACITY=1000
-
    # Cache directory, defaults to /tmp/{current_user}/cutlass_python_cache.
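Per the documentation change above, the cache can be kept across sessions by pointing ``CUTE_DSL_CACHE_DIR`` at a persistent directory instead of the default ``/tmp`` location. A small shell sketch (the chosen path is an example, not a CUTLASS default):

```shell
# Persist the CuTe DSL compile cache across reboots by redirecting it
# to a durable location; the path below is an arbitrary example.
export CUTE_DSL_CACHE_DIR="$HOME/.cache/cutlass_python_cache"
mkdir -p "$CUTE_DSL_CACHE_DIR"

# To keep only the in-memory cache for this session instead:
# export CUTE_DSL_DISABLE_FILE_CACHING=True
echo "cache dir: $CUTE_DSL_CACHE_DIR"
```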