
Use selfReplay in fusion segmentor #5177

Merged
Priya2698 merged 19 commits into main from pm/segmentor_replay on Sep 25, 2025
Conversation


@Priya2698 commented Sep 17, 2025

For #4381.
This allows us to replay both the loop and allocation domains instead of assuming the loop domain is the same as either the allocation or logical domain.

@Priya2698
Collaborator Author

!test --diff

@github-actions

github-actions bot commented Sep 17, 2025

Review updated until commit c71bc71

Description

  • Replace manual domain replay with selfReplay for consistency

  • Handle scatter outputs with disjoint loop and logical domains

  • Fix contiguity replay when no allocation domain exists

  • Improve error messages for broadcast and reduction domains


Changes walkthrough 📝

Relevant files

Enhancement
fusion_segmenter.cpp — Simplify domain replay using selfReplay

csrc/fusion_segmenter.cpp

  • Replace manual allocation and loop domain replay with
    TransformReplay::selfReplay
  • Add special-case handling for scatter outputs with disjoint
    loop/logical domains
  • Remove reduction domains before replay using the kNoReductions filter
  • Propagate contiguity only for non-reduction domains when needed
  • +58/-78

Bug fix
transform_replay.cpp — Improve contiguity error messaging

csrc/transform_replay.cpp

  • Update the error message to include reduction domains in the contiguity check
  • Allow contiguity to be set for non-broadcast, non-reduction domains
  • +3/-2

Tests
test_multidevice_sharding.cpp — Add test for allocation-loop permutation

tests/cpp/test_multidevice_sharding.cpp

  • Add a test for the allocation domain being a permutation of the loop domain
  • Include finalize_multidevice_domains.h for new pass usage
  • Validate correct handling of DIDx parallelization in loop vs. allocation domains
  • +38/-1

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The NVF_ERROR check at line 1803 only verifies that some ID in the logical domain is a gather-scatter ID; it may need to specifically confirm that the tensor view is a scatter output, not merely that it contains a gather-scatter ID.

    std::any_of(
        tv->getLogicalDomain().begin(),
        tv->getLogicalDomain().end(),
        [](IterDomain* id) { return id->isGatherScatter(); }),
    "Disjoint loop and logical are only permitted for scatter outputs, ",
    tv->domain()->toString(0, false));

Logic Change

The condition in NVF_ERROR_EQ has been updated to include reduction domains, changing the original broadcast-only check. This may affect correctness when contiguity is expected to be nullopt for reductions, which should be validated.

    (it->second->isBroadcast() || it->second->isReduction()),
    !contiguity.has_value(),
    "Contiguity should be nullopt iff broadcast or reduction, true/false "
    "otherwise.");
Performance Impact

The use of selfReplay for both loop and allocation domains may introduce overhead in cases where the domains are already aligned; the performance impact of this change should be quantitatively evaluated.

    TransformReplay::selfReplay(
        tv->domain(), new_td, /*ignore_reductions=*/true);

global and others added 2 commits September 19, 2025 14:50
Comment on lines 1773 to 1775

    for (const auto& id : logical) {
      if (id->isRFactorProduct()) {
        // Create new symbolic extents for logical iterDomains
Collaborator Author

Question: What is the reasoning behind this code snippet? Why do we create a symbolic ID for rfactored iterDomains if not all IDs are concrete?

Collaborator

    I think this is because we don't need to keep the history of transformations. I vaguely remember @jacobhinkle did something around here.

    If it's "concrete", we know the concrete size, so that will be used instead anyway.

Collaborator

    That's correct. This was needed because we sometimes had a reshaped extent in a segmentation edge. Those extents depended on scalars from other segments such as the original input shape, which led to errors evaluating those segments.

    https://github.com/NVIDIA/Fuser/pull/630/files#diff-e2f2ad44a6dc03e4ad8e5f0f047be25eb1c142add431d48c1e046c968a577f3bR1440-R1444

@Priya2698 marked this pull request as ready for review September 22, 2025 20:22
@Priya2698 (Collaborator Author)

    @jjsjann123 I am putting this PR up for review. We can discuss if/how it may interfere with your work on padding, and identify any changes that should go before this PR.
    Just want to get the ball rolling on the linked issue.

@Priya2698 (Collaborator Author)

    !test --diff

@wujingyue (Collaborator)

    Thanks! LGTM overall

    const std::vector<std::optional<bool>>& self_contiguity =
        self->contiguity();
    NVF_ERROR_EQ(self_allocation.size(), self_contiguity.size());
    // Replay maybeAllocation and contiguity.
Collaborator Author

    The change here is that I am replaying maybeAllocation unconditionally.

Collaborator

    Why? It seems reasonable to skip below if nothing is set for the allocation domain. What am I missing?

@Priya2698 (Collaborator Author) Sep 23, 2025

It was because of: #5177 (comment)

Even if there is no allocation domain specified, there may be a contiguity set. By default, TensorDomain is created with false contiguity. This replay makes sure that if the original domain has a contiguity of true (or any non-default value), it is correctly replayed.

Collaborator Author

FWIW, I am moving this change to another PR since it is breaking some tests. It would be clearer to do it separately.

Collaborator

> Even if there is no allocation domain specified, there may be a contiguity set. By default, TensorDomain is created with false contiguity. This replay makes sure that if the original domain has a contiguity of true (or any non-default value), it is correctly replayed.

Not sure why replay is necessary. Can't we just copy the bool vector?

Collaborator Author

> That's why I suggested we should keep the API as simple as possible and make each function do minimal work so that it could be easily composable. The original concept of selfReplay is quite simple, and I'm worried that by adding more stuff into the function, it could be getting less flexible to use.

I think we are on the same page here. I was responding to your earlier comment about what happens if a non-reduction ID is mapped to a reduction ID. The base replay class would not do the correct thing here, generating IDs with the iter type of the original ID instead. Hence my comment on not trying to support such a case, and perhaps adding a check to ensure a hard failure.

> I disagree since there's no absolute definition of "mostly identical". Instead of defining what's identical, this interface is meant to let each caller decide which IDs are considered mapped, and that should not be dictated by selfReplay itself.

Do you disagree on the part about rejecting mappings between reduction and non-reduction IDs, and letting the caller ensure that instead?

> ...by adding more stuff into the function, it could be getting less flexible to use.

This PR reverted the changes around contiguity replay; it now lives in the calling code (the segmenter) instead.
We do replay contiguity anyway if an allocation domain is present. However, contiguity can exist even without an allocation domain, so what is incorrect about making allocation and contiguity replay independent (given that selfReplay does not handle ambiguous cases such as a reduction ID mapped to a non-reduction ID)?

Collaborator

> The base replay class would not do the correct thing here, generating IDs with the iter type of the original ID instead.

Ah, that's a bit of dirty behavior we have been living with: reduction rfactor creates non-reduction IDs out of a reduction ID. I believe that's the only case where split changes the input iter type. It's more of an exception, so I wouldn't let it determine the overall API design.

> Do you disagree on the part about rejecting reduction and non-reduction IDs being mapped, and letting the caller ensure that instead?

My point is that the core implementation should be free from that decision. We could have multiple wrappers around it tailored for various particular use cases, but we should keep the core component as simple as possible.

> so what is incorrect with making allocation and contiguity replay independent (given that selfReplay is not handling ambiguous cases such as where reduction is mapped to non-reduction ID)?

Sorry, I'm not sure what your question is. Can you rephrase it, please?

@Priya2698 (Collaborator Author) Sep 24, 2025

> Ah, that's a bit of dirty behavior we have been living with: reduction rfactor creates non-reduction IDs out of a reduction ID. I believe that's the only case where split changes the input iter type. It's more of an exception, so I wouldn't let it determine the overall API design.

Interesting. Do we have a case where we need this? My understanding was that we do that because we expect them to match anyway.

> My point is that the core implementation should be free from that decision. We could have multiple wrappers around it tailored for various particular use cases, but we should keep the core component as simple as possible.

When we replay allocation, we currently do not handle (or expect?) this case. If we have mapped reduction and non-reduction IDs, we could conservatively set the contiguity to false; that is, if the replay of a reduction ID is a non-reduction ID during allocation domain replay, the corresponding contiguity flag is false.
However, if we do not have a case where we do this, it seems simpler to be unambiguous and disallow it.

> so what is incorrect with making allocation and contiguity replay independent (given that selfReplay is not handling ambiguous cases such as where reduction is mapped to non-reduction ID)?

The motivation for replaying contiguity in selfReplay (not in this PR) was: the TensorDomain created using IrBuilder::create has a default contiguity of false. It is possible that the original TensorDomain had some other contiguity even if we do not have an allocation domain. So we should replay that contiguity onto the new TensorDomain. Right now, I am doing this in the caller code, that is, in the segmenter, so that in the process of cloning TensorDomains I do not lose that information.

Collaborator

> The motivation for replaying contiguity in selfReplay (not in this PR) was: the TensorDomain created using IrBuilder::create has a default contiguity of false.

That's right. Since maybeAllocation and contiguity go hand in hand, I think it's a bug to replay allocation but not contiguity. So this change itself is fine.

To the larger question about the contract of selfReplay, I hope #5221 will simplify the contract and make it stricter. I'm still testing it out...

Collaborator

> Interesting. Do we have a case where we need this?

Yes, it's how rfactor works, IIRC.

> When we replay allocation, we currently do not handle (or expect?) this case. If we have mapped reduction and non-reduction IDs, we could conservatively set the contiguity to false; that is, if the replay of a reduction ID is a non-reduction ID during allocation domain replay, the corresponding contiguity flag is false.
> However, if we do not have a case where we do this, it seems simpler to be unambiguous and disallow it.

That's exactly why I'm suggesting keeping selfReplay as simple as possible and moving the responsibility for what to propagate and how to update contiguity elsewhere. I was chatting with @jjsjann123 about #5184, and it seems we would like to use selfReplay from a fusion output but only with its logical domain, since the allocation domain of the output doesn't make sense to propagate, whereas in the fusion segmenter use case we do want the allocation domain propagated.

These different use cases have been pretty common, and that's why we have options like bool propagate_allocation. That's not ideal, but I disagree with removing it just because that would be fine for one case, since it would make the API difficult to use in other cases.

@Priya2698 (Collaborator Author)

    !test --diff

@Priya2698 requested a review from wujingyue September 23, 2025 00:45
@Priya2698 (Collaborator Author)

    !test --diff

@Priya2698 requested a review from naoyam September 23, 2025 02:37
@Priya2698 (Collaborator Author)

    !test --diff

@Priya2698 (Collaborator Author)

    !test --diff

@wujingyue mentioned this pull request Sep 24, 2025
@Priya2698 requested review from naoyam and removed the request for naoyam September 24, 2025 22:30
Comment on lines +1789 to +1797

    auto compare_result = ir_utils::compareDomains(
        tv->getLogicalDomain(),
        tv->getLoopDomain(),
        /*additional_ids=*/{},
        /*ignore_broadcast=*/false);
    bool has_disjoint_loop_logical = compare_result.dom0_has_unreachable_ids ||
        compare_result.dom1_has_unreachable_ids;

    new_td = IrBuilder::create<TensorDomain>(
        /*root_domain=*/std::vector<IterDomain*>(),
        new_logical_domain,
        new_alloc,
        new_loop,
        tv->domain()->contiguity());
    if (has_disjoint_loop_logical) {
Collaborator Author

    @wujingyue Had to make this new change after your review, if you would like to take another look.

Collaborator

Fine with me. However, I have no idea what's going on. I can see the superficial reason that Scatter and Gather have disjoint logical and loop domains, but I've yet to understand the reason behind that. It's my first time trying to understand how Scatter and Gather are codegen'ed. cc @naoyam and @jjsjann123, who might know the answer or have pointers.

Collaborator

I don't mind the assert here on a gather-scatter ID. Conceptually, we only expect gather/scatter to create such a scenario.

Other things like padding that could affect the mapping from logical to allocation will only show up between logical and allocation; they shouldn't affect the loop domain.

@naoyam (Collaborator) left a comment

    LGTM

@Priya2698 merged commit 28c578d into main on Sep 25, 2025 — 60 of 62 checks passed.
@Priya2698 deleted the pm/segmentor_replay branch September 25, 2025 16:19.