
Use selfReplay in fusion segmentor #5177

Merged
Priya2698 merged 19 commits into main from pm/segmentor_replay on Sep 25, 2025
Conversation


@Priya2698 commented Sep 17, 2025

For #4381.
This allows us to replay both the loop and allocation domains instead of assuming the loop domain is the same as either the allocation or logical domain.

@Priya2698
Collaborator Author

!test --diff

@github-actions

github-actions bot commented Sep 17, 2025

Review updated until commit c71bc71

Description

  • Replace manual domain replay with selfReplay for consistency

  • Handle scatter outputs with disjoint loop and logical domains

  • Fix contiguity replay when no allocation domain exists

  • Improve error messages for broadcast and reduction domains


Changes walkthrough 📝

Relevant files

Enhancement
fusion_segmenter.cpp — Simplify domain replay using selfReplay

csrc/fusion_segmenter.cpp

  • Replace manual allocation and loop domain replay with
    TransformReplay::selfReplay
  • Add special-case handling for scatter outputs with disjoint
    loop/logical domains
  • Remove reduction domains before replay using the kNoReductions filter
  • Propagate contiguity only for non-reduction domains when needed
  • +58/-78

Bug fix
transform_replay.cpp — Improve contiguity error messaging

csrc/transform_replay.cpp

  • Update the error message to include reduction domains in the contiguity check
  • Allow contiguity to be set for non-broadcast, non-reduction domains
  • +3/-2

Tests
test_multidevice_sharding.cpp — Add test for allocation-loop permutation

tests/cpp/test_multidevice_sharding.cpp

  • Add a test for the allocation domain being a permutation of the loop domain
  • Include finalize_multidevice_domains.h for new pass usage
  • Validate correct handling of DIDx parallelization in loop vs. allocation domains
  • +38/-1

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The NVF_ERROR check at line 1803 only verifies that some ID in the logical domain is a gather-scatter ID; it may need to specifically confirm that the tensor view is a scatter output, not merely that it contains a gather-scatter ID.

    std::any_of(
        tv->getLogicalDomain().begin(),
        tv->getLogicalDomain().end(),
        [](IterDomain* id) { return id->isGatherScatter(); }),
    "Disjoint loop and logical are only permitted for scatter outputs, ",
    tv->domain()->toString(0, false));

Logic Change

The condition in NVF_ERROR_EQ has been updated to include reduction domains, changing the original broadcast-only check. This may affect correctness when contiguity is expected to be nullopt for reductions, which should be validated.

    (it->second->isBroadcast() || it->second->isReduction()),
    !contiguity.has_value(),
    "Contiguity should be nullopt iff broadcast or reduction, true/false "
    "otherwise.");
Performance Impact

The use of selfReplay for both loop and allocation domains may introduce overhead in cases where the domains are already aligned; the performance impact of this change should be quantitatively evaluated.

    TransformReplay::selfReplay(
        tv->domain(), new_td, /*ignore_reductions=*/true);

global and others added 2 commits September 19, 2025 14:50
Comment on lines 1773 to 1775

    for (const auto& id : logical) {
      if (id->isRFactorProduct()) {
        // Create new symbolic extents for logical iterDomains
Collaborator Author

Question: What is the reasoning behind this code snippet? Why do we create a symbolic ID for rfactored iterDomains if not all IDs are concrete?

Collaborator

    I think this is because we don't need to keep the history of transformations. I vaguely remember @jacobhinkle did something around here.

    If it's "concrete", we know the concrete size, so that will be used instead anyway.

Collaborator

    That's correct. This was needed because we sometimes had a reshaped extent in a segmentation edge. Those extents depended on scalars from other segments such as the original input shape, which led to errors evaluating those segments.

    https://github.com/NVIDIA/Fuser/pull/630/files#diff-e2f2ad44a6dc03e4ad8e5f0f047be25eb1c142add431d48c1e046c968a577f3bR1440-R1444

@Priya2698 marked this pull request as ready for review September 22, 2025 20:22
@Priya2698 (Collaborator Author)

    @jjsjann123 I am putting this PR up for review. We can discuss if/how it may interfere with your work on padding, and identify any changes that should go before this PR.
    Just want to get the ball rolling on the linked issue.

@Priya2698 (Collaborator Author)

    !test --diff

@wujingyue (Collaborator)

    Thanks! LGTM overall

    const std::vector<std::optional<bool>>& self_contiguity =
        self->contiguity();
    NVF_ERROR_EQ(self_allocation.size(), self_contiguity.size());
    // Replay maybeAllocation and contiguity.
Collaborator Author

    The change here is that I am replaying maybeAllocation unconditionally.

Collaborator

    Why? It seems reasonable to skip below if nothing is set for the allocation domain. What am I missing?

@Priya2698 (Collaborator Author) Sep 23, 2025

It was because of: #5177 (comment)

Even if there is no allocation domain specified, there may be a contiguity set. By default, TensorDomain is created with false contiguity. This replay makes sure that if the original domain has a contiguity of true (or any non-default value), it is correctly replayed.

Collaborator Author

FWIW, I am moving this change to another PR since it is breaking some tests. It would be clearer to do it separately.

Collaborator

> Even if there is no allocation domain specified, there may be a contiguity set. By default, TensorDomain is created with false contiguity. This replay makes sure that if the original domain has a contiguity of true (or any non-default value), it is correctly replayed.

Not sure why replay is necessary. Can't we just copy the bool vector?

Collaborator Author

> That's why I suggested we should keep the API as simple as possible and make each function do minimal work so that it could be easily composable. The original concept of selfReplay is quite simple, and I'm worried that by adding more stuff into the function, it could be getting less flexible to use.

I think we are on the same page here. I was responding to your earlier comment about what happens if a non-reduction ID is mapped to a reduction ID. The base replay class would not do the correct thing here, generating IDs with the iter type of the original ID instead. Hence my comment on not trying to support such a case, and perhaps adding a check to ensure a hard failure.

> I disagree since there's no absolute definition of "mostly identical". Instead of defining what's identical, this interface is meant to let each caller decide which IDs are considered mapped, and that should not be dictated by selfReplay itself.

Do you disagree on the part about rejecting mappings between reduction and non-reduction IDs, and letting the caller ensure that instead?

> ...by adding more stuff into the function, it could be getting less flexible to use.

This PR reverted the changes around contiguity replay; it now lives in the calling code (the segmenter) instead.
We do replay contiguity anyway if an allocation domain is present. However, contiguity can exist even without an allocation domain, so what is incorrect about making allocation and contiguity replay independent (given that selfReplay does not handle ambiguous cases such as a reduction ID mapped to a non-reduction ID)?

Collaborator

> The base replay class would not do the correct thing here, generating IDs with the iter type of the original ID instead.

Ah, that's a bit of dirty behavior we have been living with: reduction rfactor creates non-reduction IDs out of a reduction ID. I believe that's the only case where split changes the input iter type. It's more of an exception, so I wouldn't let it determine the overall API design.

> Do you disagree on the part about rejecting reduction and non-reduction IDs being mapped, and letting the caller ensure that instead?

My point is that the core implementation should be free from that decision. We could have multiple wrappers around it tailored for various particular use cases, but we should keep the core component as simple as possible.

> so what is incorrect with making allocation and contiguity replay independent (given that selfReplay is not handling ambiguous cases such as where reduction is mapped to non-reduction ID)?

Sorry, I'm not sure what your question is. Can you rephrase it, please?

@Priya2698 (Collaborator Author) Sep 24, 2025

> Ah, that's a bit of dirty behavior we have been living with: reduction rfactor creates non-reduction IDs out of a reduction ID. I believe that's the only case where split changes the input iter type. It's more of an exception, so I wouldn't let it determine the overall API design.

Interesting. Do we have a case where we need this? My understanding was that we do that because we expect them to match anyway.

> My point is that the core implementation should be free from that decision. We could have multiple wrappers around it tailored for various particular use cases, but we should keep the core component as simple as possible.

When we replay allocation, we currently do not handle (or expect?) this case. If we have mapped reduction and non-reduction IDs, we could conservatively set the contiguity to false; that is, if the replay of a reduction ID is a non-reduction ID during allocation domain replay, the corresponding contiguity flag is false.
However, if we do not have a case where we do this, it seems simpler to be unambiguous and disallow it.

> so what is incorrect with making allocation and contiguity replay independent (given that selfReplay is not handling ambiguous cases such as where reduction is mapped to non-reduction ID)?

The motivation for replaying contiguity in selfReplay (not in this PR) was: the TensorDomain created using IrBuilder::create has a default contiguity of false. It is possible that the original TensorDomain had some other contiguity even if we do not have an allocation domain. So we should replay that contiguity onto the new TensorDomain. Right now, I am doing this in the caller code, that is, in the segmenter, so that in the process of cloning TensorDomains I do not lose that information.

Collaborator

> The motivation for replaying contiguity in selfReplay (not in this PR) was: the TensorDomain created using IrBuilder::create has a default contiguity of false.

That's right. Since maybeAllocation and contiguity go hand in hand, I think it's a bug to replay allocation but not contiguity. So this change itself is fine.

To the larger question about the contract of selfReplay, I hope #5221 will simplify the contract and make it stricter. I'm still testing it out...

Collaborator

> Interesting. Do we have a case where we need this?

Yes, it's how rfactor works, IIRC.

> When we replay allocation, we currently do not handle (or expect?) this case. If we have mapped reduction and non-reduction IDs, we could conservatively set the contiguity to false; that is, if the replay of a reduction ID is a non-reduction ID during allocation domain replay, the corresponding contiguity flag is false.
> However, if we do not have a case where we do this, it seems simpler to be unambiguous and disallow it.

That's exactly why I'm suggesting keeping selfReplay as simple as possible and moving the responsibility for what to propagate and how to update contiguity elsewhere. I was chatting with @jjsjann123 about #5184, and it seems we would like to use selfReplay from a fusion output but only with its logical domain, since the allocation domain of the output doesn't make sense to propagate, whereas in the fusion segmenter use case we do want the allocation domain propagated.

These different use cases have been pretty common, and that's why we have options like bool propagate_allocation. That's not ideal, but I disagree with removing it just because that would be fine for one case, since it would make the API difficult to use in other cases.

@Priya2698 (Collaborator Author)

    !test --diff

@Priya2698 requested a review from wujingyue September 23, 2025 00:45
@Priya2698 (Collaborator Author)

    !test --diff

@Priya2698 requested a review from naoyam September 23, 2025 02:37
@Priya2698 (Collaborator Author)

    !test --diff

@Priya2698 (Collaborator Author)

    !test --diff

@wujingyue mentioned this pull request Sep 24, 2025
@Priya2698 requested review from naoyam and removed the request for naoyam September 24, 2025 22:30
Comment on lines +1789 to +1797

    auto compare_result = ir_utils::compareDomains(
        tv->getLogicalDomain(),
        tv->getLoopDomain(),
        /*additional_ids=*/{},
        /*ignore_broadcast=*/false);
    bool has_disjoint_loop_logical = compare_result.dom0_has_unreachable_ids ||
        compare_result.dom1_has_unreachable_ids;

    new_td = IrBuilder::create<TensorDomain>(
        /*root_domain=*/std::vector<IterDomain*>(),
        new_logical_domain,
        new_alloc,
        new_loop,
        tv->domain()->contiguity());
    if (has_disjoint_loop_logical) {
Collaborator Author

    @wujingyue Had to make this new change after your review, if you would like to take another look.

Collaborator

Fine with me. However, I have no idea what's going on. I can see the superficial reason that Scatter and Gather have disjoint logical and loop domains, but I've yet to understand the reason behind that. It's my first time trying to understand how Scatter and Gather are codegen'ed. cc @naoyam and @jjsjann123, who might know the answer or have pointers.

Collaborator

I don't mind the assert here on a gather-scatter ID. Conceptually, we only expect gather/scatter to create such a scenario.

Other things like padding that could affect the mapping from logical to allocation will only show up between logical and allocation; they shouldn't affect the loop domain.

@naoyam (Collaborator) left a comment

    LGTM

@Priya2698 merged commit 28c578d into main on Sep 25, 2025 — 60 of 62 checks passed.
@Priya2698 deleted the pm/segmentor_replay branch September 25, 2025 16:19.