Distributed GEMM is an experimental pipelined Tensor Parallelism implementation that builds on existing CUTLASS kernels and CUDA runtime features, and can hide most of the communication behind computation.
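The overlap relies on standard CUDA stream and event semantics: the transfer for the next stage is enqueued on a separate stream while the current stage computes. Below is a minimal, self-contained sketch of that pipelining pattern, not the Distributed GEMM API itself; all names (`pipelined_tp_gemm`, `gemm_stage`, `kNumStages`) are illustrative, and the stand-in kernel is not a real GEMM.

```cpp
#include <cuda_runtime.h>

constexpr int kNumStages = 4;

// Stand-in for a per-stage GEMM (in practice, a CUTLASS device GEMM
// run on one shard of the operands).
__global__ void gemm_stage(float* c, const float* a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] += a[i];
}

// Pipelined tensor-parallel step: the peer-to-peer copy for stage s+1
// runs on the comm stream while stage s computes, so communication
// latency hides behind computation.
void pipelined_tp_gemm(float* c, float* recv[kNumStages],
                       float* peer[kNumStages], int n, int peer_dev) {
  cudaStream_t compute, comm;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&comm);

  for (int s = 0; s < kNumStages; ++s) {
    cudaEvent_t ready;
    cudaEventCreate(&ready);
    // Pull this stage's operand shard from the peer GPU
    // (local device assumed to be device 0 here).
    cudaMemcpyPeerAsync(recv[s], 0, peer[s], peer_dev,
                        n * sizeof(float), comm);
    cudaEventRecord(ready, comm);
    // Compute waits only for its own shard, not for later copies.
    cudaStreamWaitEvent(compute, ready, 0);
    gemm_stage<<<(n + 255) / 256, 256, 0, compute>>>(c, recv[s], n);
    cudaEventDestroy(ready);
  }
  cudaStreamSynchronize(compute);
  cudaStreamDestroy(compute);
  cudaStreamDestroy(comm);
}
```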
Improved persistent grid launch for Hopper kernels with large cluster sizes (size 4 or greater) using the new make_kernel_hardware_info API, as shown in example 48.
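A hedged sketch of how the new API might be used follows; the signature is assumed from the release note and the usual `KernelHardwareInfo` conventions, so consult example 48 for the authoritative usage.

```cpp
#include "cutlass/kernel_hardware_info.h"

// Schematic only: Gemm stands for a fully specified CUTLASS 3.x
// Hopper GEMM type. make_kernel_hardware_info is assumed to query,
// for the given kernel, the SM count and the maximum number of
// active clusters, so the persistent scheduler can size its grid
// correctly even when the cluster size is 4 or larger.
template <class Gemm>
cutlass::KernelHardwareInfo make_hw_info(int device_id) {
  return cutlass::KernelHardwareInfo::make_kernel_hardware_info<
      typename Gemm::GemmKernel>(device_id);
}

// The resulting hw_info is then passed through Gemm::Arguments as in
// earlier CUTLASS 3.x examples.
```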
Enabled high-precision accumulation for Hopper FP8 Sparse GEMM.
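In CUTLASS's dense FP8 kernels, high-precision accumulation means promoting partial products into an fp32 accumulator rather than using the FP8 fast-accumulation schedule. The compile-time sketch below applies that convention to the sparse path; the sparse op class and schedule name are assumptions based on the dense FP8 builders, and the tile/cluster shapes are illustrative.

```cpp
#include "cute/tensor.hpp"
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

// Illustrative shapes; the shipped kernels may use others.
using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>;

using ElementA = cutlass::float_e4m3_t;  // FP8 operands
using ElementB = cutlass::float_e4m3_t;
using ElementAccumulator = float;        // promote accumulation to fp32

using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp,
    ElementA, cutlass::layout::RowMajor,    16,  // 128-bit alignment
    ElementB, cutlass::layout::ColumnMajor, 16,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    // Plain warp-specialized schedule (not the FP8FastAccum variant),
    // so partial sums accumulate in fp32.
    cutlass::gemm::KernelTmaWarpSpecialized
  >::CollectiveOp;
```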
This discussion was created from the release CUTLASS 3.7.0.