Replies: 5 comments
-
Swizzled layouts don't affect downstream tiling at all; they simply compose. Swizzling applies only to the codomain of layouts.
-
I'm still not clear on how instructions such as … are handled. For example, in slide 40 of this presentation, global to shared memory loading is swizzled in 4 phases: … If one were to then issue …, since each thread is loading …, whereas with no swizzling the values loaded by thread 2 would be … Apologies for the density on my part.
-
I see what you are asking, but in some sense this question misses the point of these instructions. Let's put swizzles aside for a second. LDMatrix, like any copy instruction, prescribes a TV layout that maps (thread, value) pairs to logical coordinates. If you care about the logical coordinates of the source and destination tensors being consistent, you cannot simply issue this instruction on smem layouts that are not swizzled -- it is impossible: the layouts require the data to be swizzled in smem for the coordinates to remain consistent. If you try to partition a tensor that does not have a compatible swizzled layout in CuTe with LDMatrix, we will not let you compile the code and static assert instead.
-
Is there a "canonical" set of swizzle patterns that efficiently lay out the data in …? When threads partition a swizzled …?
-
The images that you've shown are not displaying the logical coordinate to address mapping, which I believe is causing most of the confusion. Those images are showing the address-to-address mappings, which happen to be swizzled, and which are much less intuitive to view. CuTe always shows layouts/tensors in logical coordinates to offsets/values. Partitioning is also always performed on the logical coordinates of tensors. Thus, if you have two tensors with consistent coordinates (but not necessarily the same layouts), and partition them consistently via the coordinate domains, then the result will always remain logically consistent no matter the physical layout of data.
This is inaccurate. "Logical consistency" is a relation between two tensors/layouts stating that the coordinates of one tensor make sense as coordinates of the other; it says nothing about how each of those layouts maps coordinates to offsets/values. Given logically consistent tensors, we can partition them both in a consistent way. The LDMatrix instruction applies a strange partitioning pattern to each; that's all. This works on normal layouts and swizzled layouts alike. The LDMatrix instruction does check some conditions, namely that the vectorization of smem must be large enough, but that can be satisfied by a wide variety of layouts. Smem bank access patterns can also affect performance, but will not affect correctness -- this is the "layout engineering" portion of optimization and where swizzled layouts can help.
If you are partitioning for a particular instruction like MMA, then the MMA also knows the partitioning pattern it requires for A, B, and C. This is precisely the pattern we use to build the copy partitioner:

```cpp
auto ldsm_copy_A = make_tiled_copy_A(copy_atom_ldsm, my_tiled_mma);
print_latex(ldsm_copy_A);
```

and then use to partition smem and retile rmem (no matter their layouts).
-
How do swizzled layouts affect downstream tiling / MMA ops? E.g., … prints the following: … How is thread / warp loading of shared mem to registers handled for downstream tiling and MMA ops? I.e., are values mapped to registers such that either SIMT or tensor-core mma instructions calculate the right result?