
Hack to ensure cp.async is waited before smem reuse #2001

Merged
jacobhinkle merged 1 commit into main from hack_around_async_smem_reuse on Mar 26, 2024
Conversation

@jacobhinkle (Collaborator) commented Mar 26, 2024

This is a work-around for #2000.

It seems to address the issue in the only current use case for smem reuse: matmul with params.use_smem_epilogue == true. It is not ideal: for example, it will insert a cp.async.wait_all instruction even when circular buffering is not used in the kernel.

Fixes #1996 but since this is a hack, I will not mark #2000 as fixed yet.
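To make the hazard concrete, here is a hedged, illustrative CUDA sketch (not nvFuser's generated code; the kernel, buffer size, and copy layout are invented for illustration) of why a cp.async.wait_all must land before the shared-memory buffer is reused: cp.async copies are asynchronous, so a later write to the same smem region can race with in-flight loads.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: copy data into smem asynchronously, compute, then
// reuse the same smem buffer (as a smem epilogue would).
__global__ void smem_reuse_sketch(const float* gmem_in, float* gmem_out) {
    __shared__ float smem[512];  // buffer later reused by the "epilogue"

    // Issue an asynchronous global->shared copy of 16 bytes (sm_80+ PTX).
    unsigned smem_addr =
        (unsigned)__cvta_generic_to_shared(&smem[threadIdx.x * 4]);
    asm volatile("cp.async.ca.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_addr), "l"(gmem_in + threadIdx.x * 4));

    // This is the wait the PR inserts: without it, the writes below could
    // race with the still-in-flight async copies into smem.
    asm volatile("cp.async.wait_all;\n");
    __syncthreads();

    // ... main computation reads smem here ...

    // Only after the wait (plus a barrier) is it safe to overwrite smem,
    // e.g. to stage the epilogue output.
    smem[threadIdx.x] = smem[threadIdx.x * 4] * 2.0f;
    __syncthreads();
    gmem_out[threadIdx.x] = smem[threadIdx.x];
}
```

Per the PTX ISA, cp.async.wait_all waits for completion of all prior cp.async operations by the thread, which is why inserting it unconditionally is safe but, as noted above, heavier than necessary when no circular buffering is in play.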

@jacobhinkle (Collaborator, Author) commented:

!build --diff-bench

@jacobhinkle jacobhinkle marked this pull request as ready for review March 26, 2024 16:44
@zasdfgbnm (Collaborator) left a comment


As a temporary hack to make CI green, this PR is good. But we still need to look into a better solution for this problem.

@jacobhinkle jacobhinkle merged commit 2f80cee into main Mar 26, 2024
@jacobhinkle jacobhinkle deleted the hack_around_async_smem_reuse branch March 26, 2024 22:10
jacobhinkle added a commit that referenced this pull request Mar 28, 2024
This just places a `cp.async.wait_group 0` instruction immediately after
any circular buffer main loop, which is the approach taken by CUTLASS for
pipelining GEMMs (see
[mma_multistage.h#L664-L665](https://github.com/NVIDIA/cutlass/blob/c4e3e122e266644c61b4af33d0cc09f4c391a64b/include/cutlass/gemm/threadblock/mma_multistage.h#L664-L665)).
The previous fix for #2000, #2001, is reverted.

This is an alternative to #2005.

Fixes #2000
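The placement described in the commit message can be sketched as follows. This is a hedged outline of a generic multistage-pipelined GEMM (the loop structure, stage counts, and variable names are assumptions for illustration, not nvFuser's or CUTLASS's actual generated code); the key point is the drain wait directly after the main loop, before the epilogue touches the circular-buffer smem.

```cuda
// Hypothetical multistage pipeline skeleton (num_stages, k_tiles assumed).

// Prologue: fill the pipeline, committing one cp.async group per stage.
for (int stage = 0; stage < num_stages - 1; ++stage) {
    // ... issue cp.async loads for this stage ...
    asm volatile("cp.async.commit_group;\n");
}

// Main loop over the circular buffer of shared-memory stages.
for (int k = 0; k < k_tiles; ++k) {
    // Wait until at most (num_stages - 2) groups remain in flight, so the
    // oldest stage is ready to consume.
    asm volatile("cp.async.wait_group %0;\n" :: "n"(2));
    __syncthreads();
    // ... mma on the ready stage; issue + commit loads for the next tile ...
}

// The fix: drain ALL in-flight copies immediately after the main loop,
// mirroring CUTLASS's mma_multistage, so the epilogue may reuse smem.
asm volatile("cp.async.wait_group 0;\n");
__syncthreads();

// Epilogue: safe to overwrite the circular-buffer smem here.
```

Compared with the unconditional cp.async.wait_all of #2001, this scopes the wait to kernels that actually have a circular-buffered main loop, which is why it supersedes the earlier hack.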


Development

Successfully merging this pull request may close these issues.

Some matmul nvfuser_splitk benchmark fails
