[Overlap] Move initial stream sync into separate for loop #4217

samnordmann merged 3 commits into main from …parate_initial_for_loop
Conversation
!test
Review updated until commit d63e992

Description
To better explain the paradox shown in this PR: we observe that, depending on the order in which we emit device instructions, the behavior changes even though the programs are semantically equivalent. The generated host program will consist of the following instructions:
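The code block was stripped from this page. As a hedged sketch only (the instruction names and loop bounds are illustrative, not the exact host IR), the generated host program described here plausibly looks like:

```
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Synchronize Stream 0     // wait for the initial work issued on Stream 0
    Communication(i)         // collective for slice i
    Compute(i)               // local compute for slice i
```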
Regarding the instructions for Stream i, we can consider the following two orders in which to emit the instructions:
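The original snippets are missing from this page. A hedged sketch of the two candidate emission orders, consistent with the surrounding description (names illustrative), could read:

```
// Variant A: "fused" -- sync emitted inside the same loop as comms/compute
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Synchronize Stream 0
    Communication(i)
    Compute(i)

// Variant B: "developed" -- all syncs emitted in a separate preceding loop
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Synchronize Stream 0
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Communication(i)
    Compute(i)
```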
While these two variants should be equivalent, we see in the PR that only the "developed" way exhibits the desired behavior. @kevinstephano @naoyam @wujingyue
port #4217 to stream lowering
Context
Previous bug fix in comms/compute overlap
A recent PR, #3913, fixed a bug in the comms/compute overlap algorithm by adding a stream synchronization when entering the host for-loop. Currently, the generated host program reads:
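The program listing was stripped from this page. A hedged sketch matching the description below (sync at the top of the for-loop, then comms and compute; instruction names are illustrative):

```
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Synchronize Stream 0     // the synchronization added by PR #3913
    Communication(i)
    Compute(i)
```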
Note the line "Synchronize Stream 0" at the beginning of the for-loop.

Degradation of performance
However, although this was not verified before merging, PR #3913 degraded the performance of the overlap algorithm. Since PR #3913, we no longer observe overlapping, even when using UCC/TL/NCCL:

Performance fix in the present PR
What
The current PR fixes this performance degradation. The idea was found incidentally and reveals a potentially interesting finding.
What needs to be done is to execute all the stream synchronizations in a separate host for-loop, placed before the host for-loop responsible for the comms/compute. The new generated program reads:
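The listing was stripped from this page. A hedged sketch consistent with the description above (instruction names illustrative, not the exact host IR):

```
// First loop: only the initial stream synchronizations
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Synchronize Stream 0

// Second loop: comms/compute, with no synchronization inside
FOR i in 0..num_streams:
    SetCurrentStream(Stream i)
    Communication(i)
    Compute(i)
```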
Performance fix
The obtained Nsight profile shows that we achieve perfect overlap:

Further todo:
The current PR modifies the function lowerToCollectiveBasedPipelinedGemmComm, which will be removed and replaced in #4147. We need to port the current patch there.