CuTe and CUTLASS 3.x based Ampere convolution forward propagation kernel capable of operating on both affine and gather/scatter tensors.
Example executions:
./59_ampere_gather_scatter_conv
./59_ampere_gather_scatter_conv --n=108
./59_ampere_gather_scatter_conv --n=4096 --i=1
./59_ampere_gather_scatter_conv --n=1080 --i=1000
./59_ampere_gather_scatter_conv --n=131072 --i=1000 --no-check
This example demonstrates a few super cool features of CUTLASS and CuTe. It shows off
- A dense conv 3D fprop kernel written as a single file ...
- ... that leverages off-the-shelf CUTLASS collectives to show how custom kernels can use collectives ...
- ... and uses the exact same templated kernel to also stamp out a gather/scatter 3D fprop conv ...
- ... while getting near peak performance of the Ampere class tensor core on Ampere and Ada GPUs ...
- ... by using static cute shapes and strides in case problem shapes are known at compile time.
The most common strategy for implementing high performance convolution kernels on the GPU is to transform the activation tensor in such a way that we can perform the computation as a GEMM. This is called the image to column (im2col) transformation. CUTLASS 2.x implementation of im2col based convolutions is documented separately, and here we consider a fresh approach for CuTe.
A 3D convolution has the following input tensors:
- Activation tensor (Act):
((N,(D,H,W)), (C,(1,1,1)))
- Filter tensor (Flt):
( K, (C,(T,R,S)))
- Output tensor (Out):
((N,(Z,P,Q)), K )
Where
- N := number of images
- DHW := spatial dimensions of the activation tensor
- C := channel dimension of the activation tensor
- K := channel dimension of the filter and output tensor
- TRS := spoke dimensions of the filter tensor
- ZPQ := spatial dimensions of the output tensor
As is evident in the tensor shapes, these cannot be issued to a GEMM just yet, since there is no logical M, N, and K modes we can group the tensor modes into.
Notice that every spoke of the filter tensor (TRS) will be applied to some (offset) view of the activation tensor, thus expanding the logical size of the activation tensor. Additionally, a similar logical transform of the spatial dimensions can be encoded as a function of the padding, dilations, traversal strides, and filter spokes. This gets us to our im2col transform:
im2col transform affects the component shapes/strides of the activation tensor in the following way:
- ZPQ Shape : changes DHW domain with formula
(1 + (DHW + pad - (((TRS-1) * dilation) + 1)) / traversal_stride)
- TRS Shape : TRS domain instead of
(1,1,1)
- ZPQ Strides : Original DHW strides get
elem_scale()
-ed by traversal strides DHW - TRS Strides : Original DHW strides get
elem_scale()
-ed by dilation DHW
With this transform applied, we end up with a set of input and output tensors that are logically consistent in their MNK dimensions, thus allowing us to dispatch to a GEMM. im2col activation layout: ((N,(Z,P,Q)), (C,(T,R,S))) // logical (M,K) filter layout : ( K, (C,(T,R,S))) // logical (N,K) output layout : ((N,(Z,P,Q)), K ) // logical (M,N)
CuTe's layout representation and algebra make these folded tensors easy to represent and manipulate. This is most evident in the reference check code used in this example:
for (size_t logical_m = 0; logical_m < size<0>(mOutputRef); ++logical_m) {
for (size_t logical_n = 0; logical_n < size<1>(mOutputRef); ++logical_n) {
auto accumulator = float(0);
for (size_t logical_k = 0; logical_k < size<1>(mStencil); ++logical_k) {
accumulator += mStencil(logical_m, logical_k) * mActivation(logical_n, logical_k);
}
mOutputRef(logical_m, logical_n) = accumulator;
}
}
Which succinctly demonstrates how im2col transform allows us to implement convolutions as GEMMs with special layout transformations on the input tensor.
Note: in the example kernel's implementation we treat activations as the B tensor and filter as the A tensor, thus making their logical dimensions NK and MK respectively.
Now that we have transformed our problem in such a way that allows us to dispatch to a GEMM, we can reuse much of the machinery CUTLASS offers to implement this forward pass convolution operator. CUTLASS decomposes these "moving parts" of GPU linear algebra into reusable, modular software components abstracted by C++ template classes. This example demonstrates how some of the lower layers of the hierarchy can be re-used for custom kernels by writing a custom kernel for convolution that re-uses the Ampere/Ada GEMM collectives from CUTLASS 3.
A kernel author is free to compose their custom components with any of the existing templates in the CUTLASS hierarchy to leverage existing high performance implementations from the CUTLASS team. In this example, we write a custom kernel layer and compose with an existing collective. However, any of the CUTLASS kernels can be composed with bespoke collectives if the desired customization is a mainloop or epilogue fusion without changes to the grid planning, tile scheduling, load balancing, or thread marshalling.
Functionality and correctness of the implemented kernel, as a virtue of using CuTe and off the shelf CUTLASS collectives, only relies on the logical consistency of the layouts of input and output tensors. This means that we can freely change how the logical coordinates of the tensors map into the index space, and even how these dereferences happen. CUTLASS example 52 demonstrates this by implementing a custom stride that supports indexed indirection for tensor data accesses. This allows for example 52 to implement a GEMM where inputs are gathered and output is scattered based on an index buffer.
We re-use the same custom stride utilities in this example to implement a convolution kernel that gathers along the NDHW dimensions of the activation tensor and scatters the output along the NZPQ dimensions of the output tensor, treating the channel dimensions as the dense vectors.
Our dense affine im2col transformed activation tensor:
// im2col transformed activation layout: ((nzpq), (ctrs)) => idx
auto xformed_act_layout = make_layout(
make_shape (make_shape ( N, Z, P, Q), make_shape ( C, T, R, S)),
make_stride(make_stride(D*H*W*C, H*W*C, W*C, C), make_stride(_1{}, H*W*C, W*C, C)));
now becomes a composed layout that uses IndexedGather
:
// Inner layout of the composition:
// ((nzpq), (csrt)) => (idx_buffer_idx, dense_offset)
auto EG = E<0>{}; // Gather basis (1,0) (idx_buffer_idx)
auto EC = E<1>{}; // Contiguous basis (0,1) (dense_offset)
auto xformed_act_logical_inner = make_layout(
make_shape (make_shape ( N, Z, P, Q), make_shape ( C, T, R, S)),
make_stride(make_stride(D*H*W*EG, H*W*EG, W*EG, EG), make_stride(EC, H*W*EG, W*EG, EG)));
// Outer layout of the composition:
// (idx_buffer_idx, dense_offset) => idx
// IndexedGather obtains idx by applying (gmem_base_ptr + gather_idx_buf[idx_buffer_idx] + dense_offset)
auto xformed_act_gather_outer = make_layout(
make_shape(_1{},_1{}),
make_stride(CustomStride{IndexedGather{gather_idx_buf}, C}, _1{}));
// Compose the inner and outer layouts
// ((nzpq), (ctrs)) => idx
auto xformed_act_composed_layout = composition(
xformed_act_gather_outer,
make_arithmetic_tuple(_0{}, _0{}),
xformed_act_logical_inner);
Here, we create a composed layout whose inner layout has the same logical MK shape as earlier,
but with an outer layout that uses the custom strides with an index buffer to access memory with
indirections. A custom stride requires two inputs to compute the index that a certain coordinate maps to:
the index buffer offset and the dense offset into the vector. This entails that our inner layout
(the one with the logical MK shape) has a rank-2 codomain (idx_buffer_idx, dense_offset)
.
We can set up such a layout with scaled basis strides, which allow us to map a domain onto a
codomain with multiple orthogonal bases. The two codomain basis are the
index buffer offsets (rank 0 basis), and the dense vector offsets (rank 1 basis).
A similar composed layout is set up for the output scatter tensor.
This tensor still has a logical MK shape and is backed by a CuTe layout, which means we can still tile, partition, and otherwise manipulate it with CuTe's layout algebra in the same way we would any other tensor. Substituting the activation tensor's affine layout for this gather layout requires no changes to the implementation of the kernel whatsoever. Everything composes. This example stamps out a dense 3D convolution as well as gather/scatter 3D convolution using the same kernel template, with the only difference between them being the layouts of the input and output tensors.
Convolutions are just a special case of tensor contractions, and as example 51 demonstrates, the exact same collective used in this example can also be used to implement arbitrary GETTs. Of course, this also means that the same kernel can implement gather/scatter GETTs as well!
This demonstrates the composition power of not just CuTe, but also CUTLASS 3's two level micro kernel abstraction. A single highly tuned temporal micro-kernel (collective) can be implemented once and applied to compute dense GETTs, gather/scatter GETTs, dense convolutions, and gather/scatter convolutions.
Often, when implementing custom kernels, a user has more knowledge of the problem domain that can be exploited to deliver higher performance than otherwise could be through general kernels. In this example we presume that the shape of each of the images (DHWC dimensions) as well as the filter (TRS) are available a-priori and that the tile shape evenly divides the problem. Number of images (N) is still left as a runtime parameter.
Knowing the extents of our tensors at compile time allows us to encode them as static cute shapes rather than a dynamic problem shape, resulting in the elimination of most of the index computation instructions such as expensive div/mods. Knowing that the problem shape is divisible by the tile shape allows us to use the Ampere collective that does not perform predication on global memory loads, further reducing overheads and allowing us to achieve near peak performance on RTX Ampere and Ada GPUs.
Running this example on an RTX 3080Ti prints the following performance numbers (some output culled for brevity):
$> ./examples/59_ampere_gather_scatter_conv/59_ampere_gather_scatter_conv --n=131072 --i=128 --no-check
Ampere convolution forward propogation kernel supporting both affine and gather/scatter tensors.
Allocating tensors ... done.
Initializing data ... done.
Initializing gather/scatter index buffers ... done.
Running dense fprop kernel
Conv TFLOP count = 0.927713
Conv dense perf: 31.027376ms | TFLOP/s = 29.899819
Running gather/scatter fprop kernel
Conv TFLOP count = 0.927713
Conv gather/scatter perf: 28.973721ms | TFLOP/s = 32.019117
With this in mind, this example kernel has the following limitations:
- This example kernel only supports dynamic image count, all other conv problem shape must be defined as
cute::Constant<>
s - Problem shapes (including dynamic image count
N
) must be evenly divisible by the tile shape - It does not perform fp32->tf32 numeric conversion, gmem inputs must be rounded to tf32 already