Adds a CTA swizzle that changes the order in which the tiles of the output matrix are processed. The swizzle increases reuse of data from A and B when iterating over gridDim.x. In practice, CTAs are launched by iterating over gridDim.x first (the order is unspecified, though; it just happens to behave this way). As a result, the current wave contains CTAs that compute square sub-matrices of C, which increases the L2 hit rate.

The best factor seems to be 4. This will be part of the heuristics. Setting the factor to 1 disables the swizzle.

On an 8192x8192x8192 matmul with the default config, the speedup is about 20%. An extreme example is the following case: `MNK = 6144 6144 6144, layout=NT stages=0, cta_tile = 32 32 128, warp_tile = 16 16 128, instruction_tile = 16 16 16`, where the runtime drops from 12.4 ms to 7.28 ms!

Thank you @zasdfgbnm for the help. Values measured on an NVIDIA A100 SXM4 80 GB.

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
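To make the data-reuse argument concrete, here is a minimal host-side sketch of one possible index remapping. It is not the scheduler code from this PR: the names `Tile`, `swizzledTile`, and `distinctPanels` are hypothetical, the launched grid is assumed to be `(tiles_x * factor, tiles_y / factor)`, and `tiles_y` is assumed to be divisible by `factor`. It also assumes the x-first launch order, which the description notes is not guaranteed.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

// Hypothetical output-tile coordinate. `x` selects a panel of one operand,
// `y` selects a panel of the other operand.
struct Tile {
  int x;
  int y;
};

// Maps a launched CTA index (bx, by) to an output tile. With factor == 1
// this is the identity mapping; with factor > 1, `factor` consecutive CTAs
// along x walk along y of the output before x advances.
Tile swizzledTile(int bx, int by, int factor) {
  factor = std::max(1, factor);  // must be >= 1
  return Tile{bx / factor, by * factor + bx % factor};
}

// Counts the distinct operand panels touched by the first `wave` CTAs,
// assuming CTAs are issued by iterating over the grid's x dimension first.
// Fewer distinct panels means more reuse of data already in cache.
int distinctPanels(int tiles_x, int tiles_y, int factor, int wave) {
  factor = std::max(1, factor);
  assert(tiles_y % factor == 0);  // assumption of this sketch
  const int grid_x = tiles_x * factor;
  const int grid_y = tiles_y / factor;
  assert(wave <= grid_x * grid_y);

  std::set<int> x_panels;
  std::set<int> y_panels;
  for (int i = 0; i < wave; ++i) {
    const Tile t = swizzledTile(i % grid_x, i / grid_x, factor);
    x_panels.insert(t.x);
    y_panels.insert(t.y);
  }
  return static_cast<int>(x_panels.size() + y_panels.size());
}
```

For a 64x64 tile grid and a wave of 16 CTAs, `distinctPanels(64, 64, 1, 16)` counts 17 panels (a 16x1 strip of C), while `distinctPanels(64, 64, 4, 16)` counts 8 (a 4x4 block of C). The real wave size depends on the GPU and on occupancy.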
// Applies swizzle factor on C
if (params.grid_swizzle_factor != 1) {
  int factor = std::max(1, params.grid_swizzle_factor); // must be >=1
We don't consider this type of tiling to be a swizzle. We consider swizzles to be non-affine transformations. Can we just use a different word for this optimization? Tile shuffle or something similar might be a good name.
We are already using the word swizzle for both affine and non-affine transformations. See note [WarpMmaSwizzler].
I went with CUTLASS's naming. What about grid scaling? Or I'm fine with tile shuffle. So should I update it, @zasdfgbnm @csarofeen?
No, it's fine then. We should probably have a non-affine vs. affine swizzle rename, but it doesn't seem to make a difference right now.
Let's merge this PR as is so it is consistent with tracking-matmul.
csarofeen left a comment
This seems good to me, we should just be able to merge it in? CC @zasdfgbnm
@mmigdal-nv thanks for heads up, when this is promoted to |
z-shape swizzle was not used in tracking-matmul, and now we have a better swizzle implemented in #90
Cherry-pick of the changes made in #87 into main.