Move MarkAliasAnalysisPreparePass before propagateShardingsPass #4274
!test
Review updated until commit afacb99
@jjsjann123 do you want to check any perf benchmarks? I remember allocation-related passes have been sensitive to ordering.
We don't have a specific benchmark that just tests allocation domain inference on an end-to-end example, so it's tricky to figure out the perf impact of this change. Meanwhile, since we are only moving the pass relative to the sharding passes, this shouldn't introduce any codegen diff for the single-GPU tests here. So maybe we can do a `!test --diff`.
!test --diff

!test --diff
wujingyue left a comment
Results for 2 GPUs on main:
Can you specify the benchmark environment when you show any benchmark results?
Also, I'd try viking-prod-pjnl, which has H100x8.
None of the single GPU tests are affected. Update: The codegen changes are due to missed aliasing opportunities for the final permute operation inserted in `reorderShardedAxisPass`.
Updated the comment; the existing results are from GH200.
Using GH200 nodes:

Overlap allgather benchmarks:
On main:
On this branch:

CPP transformer benchmarks:
On this branch:

I do not see any major performance dips for the other benchmarks. However, the overlapping benchmarks are less stable, with very high standard deviations.
On 8xA100 40GB (luna_prod):

CPP benchmarks, this branch:

Python transformer benchmarks, this branch:
On 8xA100 40GB (luna_prod), this branch:
@wujingyue viking-prod-pjnl may be down for some time. I ran on A100x8 and the results are in the comments above. We do not see a performance penalty due to the missed aliasing opportunity at this time. Let me know if you would like to see any additional results; I will merge the PR after this.
As you said, the codegen diff shows a potential performance hit, but it doesn't seem to affect any real benchmarks, and you have an idea of how to fix this in the near future (e.g., move markAliasesPrepare between reorder and set-allocation). So LGTM!
This PR extends the `propagateSharding` presegmentation pass for DID loop splits.

Key changes:
1. We use TransformPropagator for all expressions except `ViewOp`, which is handled manually since TransformPropagator does not support it without first propagating the reshape to the producer.
2. `makeReshardingContiguous` sets the allocation domain for tvs with a device mesh. Ideally, we would set it only for global tensors, but this is not known before segmentation, and the allocation domain must be set before segmentation.
3. ~The following tests are modified:~ See [discussion](#3838 (comment)). PR #4274 resolved this.

Follow-up PRs:
- `ViewOp` will be handled in a follow-up PR.
- Currently, we only backpropagate sharding for a tv that does not already have a device dimension. This can be extended to propagate all parallel types not present on the tv; this will be done in a follow-up. Backpropagating shardings can incorrectly change DIDx to serial or move DIDx to another location. `shardAllLike` can be modified to specify which parallel type to propagate. Since `insertResharding` and `propagateSharding` require different behavior, I will handle it in a separate PR.
- Use `TransformReplay::CasP` in lieu of TransformPropagator.
- Propagate DID transforms within `castOp`: [privatizeUpcast](https://github.com/NVIDIA/Fuser/blob/ed687366cf717837c8ea3e40f56542fec48e1616/csrc/fusion_segmenter.cpp#L4235-L4238) clones cast operations, which fails segmentation since the transforms are not replicated.

Findings from experiments: #3838 (comment)

---

Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
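The back-propagation rule described above (shardings are propagated backward only to a tv that does not already carry a device dimension, so an existing DIDx placement is never overwritten) can be sketched as follows. This is a hypothetical Python illustration of the rule, not nvFuser's actual C++ API; `TensorView` and `backpropagate_sharding` are stand-ins invented for this sketch.

```python
# Hypothetical sketch of the back-propagation rule: copy the consumer's
# DIDx sharding to the producer ONLY if the producer has no device
# dimension yet. Names here are illustrative, not nvFuser's API.
from dataclasses import dataclass, field


@dataclass
class TensorView:
    name: str
    # parallel types on this tv, e.g. {"DIDx": 0} maps parallel type -> axis
    parallel: dict = field(default_factory=dict)


def backpropagate_sharding(consumer: TensorView, producer: TensorView) -> bool:
    """Return True if the producer's sharding was updated."""
    if "DIDx" in producer.parallel:
        return False  # never change an existing device dimension
    if "DIDx" in consumer.parallel:
        producer.parallel["DIDx"] = consumer.parallel["DIDx"]
        return True
    return False


a = TensorView("a")                  # unsharded producer: gets the sharding
b = TensorView("b", {"DIDx": 0})     # sharded consumer
assert backpropagate_sharding(b, a)
assert a.parallel == {"DIDx": 0}

c = TensorView("c", {"DIDx": 1})     # already-sharded producer: untouched
assert not backpropagate_sharding(b, c)
assert c.parallel == {"DIDx": 1}
```

The guard on the producer is what prevents the failure mode mentioned above, where unrestricted back-propagation could change DIDx to serial or move it to another axis.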
This makes #3838 performance neutral. PR #3838 sets the allocation domain for multidevice tensorviews in the `makeReshardingContiguous` pass. Aliasing is not done if the allocation domain has already been set for a tensorview. This PR moves the multidevice preseg passes after `markAliasAnalysisPreparePass` to avoid the performance regression.

Update: The codegen changes are due to missed aliasing opportunities for the final permute operation inserted in `reorderShardedAxisPass` (here and here); however, this does not have a significant performance impact (see benchmark results). Since only the input/output of the communication need to have the allocation domain specified, `new_output` can have the same allocation as `output`.
Once we fix `markAliasPreparePass` to propagate DID transforms and shardings for copied tensorviews, the presegmentation passes will be ordered as [propagateShardingsPass, insertResharding, reorderShardedAxis] -> [markAliasPreparePass, AllocationDomainPass] -> [makeReshardingContiguous]. This avoids missed aliasing opportunities for operators added in the `insertResharding` and `reorderShardedAxis` passes.

Benchmarking results on GH200 nodes:
On main:
This branch:
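The ordering constraint discussed in this PR (alias analysis skips any tensorview whose allocation domain is already set, so a pass that sets allocation domains must run after `markAliasAnalysisPreparePass`) can be sketched as below. This is a minimal hypothetical model in Python, not nvFuser's implementation; the two functions and the dict-based tensor representation are invented for illustration.

```python
# Hypothetical model of the pass-ordering issue. Tensors are plain dicts;
# the real passes operate on nvFuser TensorViews.

def mark_alias_prepare(tensors):
    """Collect tensors still eligible for aliasing: those whose
    allocation domain has NOT been set by an earlier pass."""
    return [t["name"] for t in tensors if t.get("allocation_domain") is None]


def make_resharding_contiguous(tensors):
    """Model of the pass that sets the allocation domain for every
    tensor with a device mesh."""
    for t in tensors:
        if t.get("mesh") is not None:
            t["allocation_domain"] = "contiguous"


tensors = [{"name": "out", "mesh": (0, 1), "allocation_domain": None}]

# Order in this PR: alias analysis runs first, so "out" can be aliased.
assert mark_alias_prepare(tensors) == ["out"]

# Reversed order: setting allocation domains first blocks aliasing,
# which is the regression this PR avoids.
make_resharding_contiguous(tensors)
assert mark_alias_prepare(tensors) == []
```

Running the allocation-setting pass first leaves no aliasing candidates, which mirrors the missed-aliasing codegen diff observed for the single-GPU tests before the reordering.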