[WIP] Peel off epilogue loop for circular buffering #2005
Closed
I attempted to make `keepStages()` a `Val*`. The issue is that this must be a compile-time constant argument for the inline asm instruction. I think it still might work if we are able to unroll the epilogue loop, but that might not always be preferable/acceptable. So instead, we could also have a runtime function/kir node that calls a runtime helper function that wraps `cp.async.wait_group N` for variable `N` and handles values up to say 5 or 6 inside a switch statement.
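A minimal sketch of what such a switch-based helper could look like. The names `cpAsyncWaitGroupConst`, `cpAsyncWaitGroupDynamic`, and `SKETCH_HOST_DEVICE` are made up for illustration and are not existing nvFuser runtime functions, and the upper limit of 5 is arbitrary. The wrapper returns `N` only so the dispatch can be checked on the host; the inline asm is compiled only for device code on sm_80 or newer.

```cpp
#include <cassert>

// Lets the sketch compile both with nvcc and with a plain host compiler.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical wrapper: N is a template parameter, so it can be passed to the
// inline asm through the "n" (immediate integer) constraint.
// It returns N only so that the dispatch below can be checked on the host.
template <int N>
SKETCH_HOST_DEVICE inline int cpAsyncWaitGroupConst() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  asm volatile("cp.async.wait_group %0;\n" ::"n"(N));
#endif
  return N;
}

// Hypothetical runtime helper: maps a runtime count onto the compile-time
// wrappers. The upper limit (5) is an arbitrary choice for this sketch.
SKETCH_HOST_DEVICE inline int cpAsyncWaitGroupDynamic(int n) {
  assert(n >= 0);
  switch (n) {
    case 0: return cpAsyncWaitGroupConst<0>();
    case 1: return cpAsyncWaitGroupConst<1>();
    case 2: return cpAsyncWaitGroupConst<2>();
    case 3: return cpAsyncWaitGroupConst<3>();
    case 4: return cpAsyncWaitGroupConst<4>();
    case 5: return cpAsyncWaitGroupConst<5>();
    default:
      // Above the limit: wait for all pending groups. This is stricter than
      // requested, so it stays correct but may stall longer than necessary.
      return cpAsyncWaitGroupConst<0>();
  }
}
```

Falling back to `wait_group 0` for out-of-range values is a design choice of the sketch: it trades some overlap for safety rather than failing.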
UPDATE: the switch statement helper approach seems to work and doesn't require unrolling the epilogue (we can't get away from this switch statement even with unrolling the epilogue), but it means we need to set an upper limit on the number of unsynced stages. We could set that to something high like 10. We only need it to be `num_stages`, i.e. we don't need a switch statement with 10 cases if we have only 3 circular buffering stages. However, the requirement to have a constant `N` in `cp.async.wait_group`, along with the requirement for inline asm to have string literal inputs, has stumped me. I tried all kinds of combinations of templates and macros but got nowhere.
Another alternative is to, instead of making the epilogue
we change it to
so that the epilogue is naturally unrolled.
And at the same time, `Fuser/csrc/kernel_ir.cpp` (lines 321 to 323 in b108bca) needs to change to generate an `n` for PTX constraints. Also, you will need to update `Fuser/csrc/index_compute.cpp` (lines 2354 to 2400 in b108bca) to change the loop index.
If I'm understanding, you mean to use the unrolled loop index as the argument to `wait_group`. That fails: even if the loop variable is the actual argument, we get an error. That is, PTX sees this as a non-constant argument. I have tried interpolating it into that string, but for inline assembly the command must be a string literal...
@mmigdal-nv helped come up with this solution:
We can replace `((12800 - 2) - i13)` with the `Val*` we have currently. The compiler will evaluate the recursive template and prune the dead branches. I think the only downside to this is that we need to unroll the new loop, which will probably hurt compilation time.
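As a rough illustration of the recursive-template idea, and not the code from the comment above, a dispatch of this kind could be sketched as follows. `WaitGroupDispatch`, `waitGroupImmediate`, and `SKETCH_HOST_DEVICE` are hypothetical names. Each level compares the runtime value against one compile-time `N` and otherwise recurses to `N - 1`, so every `wait_group` call still sees an immediate operand.

```cpp
#include <cassert>

// Lets the sketch compile both with nvcc and with a plain host compiler.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical wrapper around the instruction; returns N so the selected
// branch can be observed on the host.
template <int N>
SKETCH_HOST_DEVICE inline int waitGroupImmediate() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  asm volatile("cp.async.wait_group %0;\n" ::"n"(N));
#endif
  return N;
}

// Recursive case: try N, otherwise hand the value down to N - 1.
template <int N>
struct WaitGroupDispatch {
  SKETCH_HOST_DEVICE static int run(int n) {
    assert(n >= 0);
    if (n == N) {
      return waitGroupImmediate<N>();
    }
    return WaitGroupDispatch<N - 1>::run(n);
  }
};

// Base case: also reached when n exceeds the template's upper limit, in which
// case waiting for all groups is the conservative choice.
template <>
struct WaitGroupDispatch<0> {
  SKETCH_HOST_DEVICE static int run(int) {
    return waitGroupImmediate<0>();
  }
};
```

With an unrolled loop, the argument to `WaitGroupDispatch<MaxStages>::run(...)` becomes a known constant at each call site, which is what would let the compiler fold the comparisons and keep a single branch.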
Fun C++...
BTW I looked to see what CUTLASS does. It seems they do not peel off an epilogue loop. Instead they just `wait_group 0` after the main loop. https://github.com/NVIDIA/cutlass/blob/c4e3e122e266644c61b4af33d0cc09f4c391a64b/include/cutlass/gemm/threadblock/mma_multistage.h
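To make the trade-off concrete, here is a small host-side model of the group bookkeeping. It is not CUTLASS or nvFuser code. `simulatePipeline` is a hypothetical function that only tracks which committed groups are still pending. It assumes that a fixed wait count of `num_stages - 2` in the main loop stays valid only if a (possibly empty) group is committed on every iteration; that assumption belongs to this sketch and is not stated in the comment above.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Host-side model only. Each committed group records the tile it loads, or -1
// for an empty group. Returns true if every tile had landed before it was
// consumed and nothing is pending at the end.
inline bool simulatePipeline(int num_tiles, int num_stages,
                             bool commit_every_iteration) {
  assert(num_stages >= 2 && num_tiles >= 0);
  std::deque<int> pending;                     // oldest group at the front
  std::vector<bool> landed(num_tiles, false);  // tile data available?
  int next_tile = 0;
  bool all_ready = true;

  // Models cp.async.commit_group: commit the next tile's load if there is
  // one, otherwise an empty group when allowed.
  auto commit = [&](bool allow_empty) {
    if (next_tile < num_tiles) {
      pending.push_back(next_tile++);
    } else if (allow_empty) {
      pending.push_back(-1);
    }
  };
  // Models cp.async.wait_group n: block until at most n groups are pending.
  auto wait_group = [&](int n) {
    while (static_cast<int>(pending.size()) > n) {
      int tile = pending.front();
      pending.pop_front();
      if (tile >= 0) {
        landed[tile] = true;
      }
    }
  };

  // Prologue: fill num_stages - 1 stages.
  for (int s = 0; s < num_stages - 1; ++s) {
    commit(true);
  }
  // Main loop with a fixed wait count and no peeled epilogue.
  for (int tile = 0; tile < num_tiles; ++tile) {
    wait_group(num_stages - 2);
    all_ready = all_ready && landed[tile];  // "compute" needs this tile
    commit(commit_every_iteration);
  }
  // Drain everything after the main loop, like wait_group 0.
  wait_group(0);
  return all_ready && pending.empty();
}
```

In this model, `simulatePipeline(3, 3, true)` succeeds, while `simulatePipeline(3, 3, false)` consumes the last tile before its group has been waited on. That is the situation a peeled epilogue with a decreasing wait count is meant to handle.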