Replies: 5 comments
-
Swizzled layouts don't affect downstream tiling at all; they simply compose. Swizzling applies only to the codomain of layouts.
-
I'm still not clear on how instructions such as … are handled. For example, in slide 40 of this presentation, global to shared memory loading is swizzled in 4 phases: … If one were to then issue …, since each thread is loading …, whereas with no swizzling the values loaded by thread 2 would be … Apologies for the density on my part.
-
I see what you are asking, but in some sense this question misses the point of these instructions. Let's put swizzles aside for a second. LDMatrix, like any copy instruction, prescribes a TV layout that maps (thread, value) pairs to logical coordinates. If you care about the logical coordinates of the source and destination tensors being consistent, you cannot simply issue this instruction on smem layouts that are not swizzled -- it is impossible: the layouts require the data to be swizzled in smem for the coordinates to remain consistent. If you try to partition a tensor that does not have a compatible swizzled layout in CuTe with LDMatrix, we will not let you compile the code and static assert instead.
-
Is there a "canonical" set of swizzle patterns that efficiently lay out the data in …? When threads partition a swizzled …?
-
The images that you've shown are not displaying the logical coordinate to address mapping, which I believe is causing most of the confusion. Those images are showing the address-to-address mappings, which happen to be swizzled, and which are much less intuitive to view. CuTe always shows layouts/tensors in logical coordinates to offsets/values. Partitioning is also always performed on the logical coordinates of tensors. Thus, if you have two tensors with consistent coordinates (but not necessarily the same layouts), and partition them consistently via the coordinate domains, then the result will always remain logically consistent no matter the physical layout of data.
This is inaccurate. "Logical consistency" is a relation between two tensors/layouts stating that the coordinates of one tensor make sense as coordinates of the other; it says nothing about how each of those layouts maps coordinates to offsets/values. Given logically consistent tensors, we can partition them both in a consistent way. The LDMatrix instruction applies a strange partitioning pattern to each; that's all. This works on normal layouts and swizzled layouts alike. The LDMatrix instruction does check some conditions, namely that the vectorization of smem must be large enough, but that can be satisfied by a wide variety of layouts. Smem bank access patterns can also affect performance, but will not affect correctness -- this is the "layout engineering" portion of optimization and where swizzled layouts can help.
If you are partitioning for a particular instruction like MMA, then the MMA also knows the partitioning pattern it requires for A, B, and C. This is precisely the pattern we use to build the copy partitioner:

```cpp
auto ldsm_copy_A = make_tiled_copy_A(copy_atom_ldsm, my_tiled_mma);
print_latex(ldsm_copy_A);
```

and then use to partition smem and retile rmem (no matter their layouts).
-
How do swizzled layouts affect downstream tiling / MMA ops? E.g., … prints the following: … How is thread / warp loading of shared mem to registers handled for downstream tiling and MMA ops? I.e., are values mapped to registers such that either SIMT or tensor-core mma instructions calculate the right result?