
Propagate Stream parallel type in allocation #5353

Merged
Priya2698 merged 21 commits into main from pm/alloc_stream
Oct 16, 2025

Conversation

@Priya2698
Collaborator

@Priya2698 Priya2698 commented Oct 8, 2025

Issue #5309
Unlike device parallelization, a Stream-parallel tensorview (in its loop domain) may or may not have a Stream-parallel allocation domain.

We propagate based on the following:

  1. If it is a device parallel type -> always propagate
  2. If the tensorview is a fusion input or output -> the ID is not stream-parallelized in allocation
  3. If the stream ID in a tensorview is not mapped to a stream ID in all of its consumers -> the ID is not stream-parallelized in allocation
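The three rules above can be sketched as a small stand-alone decision function. This is a minimal model: `IdInfo`, `TvInfo`, and `propagatesToAllocation` are illustrative stand-ins invented here, not the pass's actual API, which operates on nvFuser's `IterDomain` and `TensorView` IR types in `finalize_multidevice_domains.cpp`.

```cpp
#include <algorithm>
#include <vector>

// Illustrative stand-in for an IterDomain's parallelization state.
struct IdInfo {
  bool is_stream = false;  // parallelized on ParallelType::Stream
  bool is_device = false;  // device parallel type (e.g. DIDx)
};

// Illustrative stand-in for the per-tensorview facts the rules consult.
struct TvInfo {
  bool is_fusion_io = false;               // fusion input or output
  std::vector<bool> consumer_maps_stream;  // per consumer: stream ID mapped?
};

// Rule 1: device parallel types always propagate to allocation.
// Rule 2: fusion inputs/outputs never get a Stream-parallel allocation.
// Rule 3: a Stream ID propagates only if every consumer maps the stream ID.
bool propagatesToAllocation(const IdInfo& id, const TvInfo& tv) {
  if (id.is_device) {
    return true;  // rule 1
  }
  if (!id.is_stream) {
    return false;  // nothing stream-related to decide
  }
  if (tv.is_fusion_io) {
    return false;  // rule 2
  }
  return std::all_of(  // rule 3
      tv.consumer_maps_stream.begin(),
      tv.consumer_maps_stream.end(),
      [](bool mapped) { return mapped; });
}
```

For example, a Stream-parallel ID on an intermediate whose consumers all map it would shard the allocation, while the same ID on a fusion output would be replicated.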

For cases like:
https://github.com/NVIDIA/Fuser/blob/f8e84e52296cdecd318dd2ce904139616d7bd434/tests/cpp/test_overlap.cpp#L155,
we want to start with replicating the Stream-parallel ID, that is, the allocation is not parallelized. However, this ID will appear in the logical domain due to rfactor and, with the current contract, be allocated fully regardless of parallelization. So I am not making this a condition in the pass yet.

This can be changed in future when we need.

Depends on #5363

@Priya2698
Collaborator Author

!test

@github-actions

github-actions bot commented Oct 8, 2025

Review updated until commit 46253fb

Description

  • Propagate Stream parallelization to allocation domain conditionally

  • Prevent allocation sharding for non-device, non-Stream tensors

  • Add tests for Stream-parallel allocation behavior

  • Print debug transforms in finalize pass for diagnostics


Changes walkthrough 📝

Relevant files

Enhancement
  finalize_multidevice_domains.cpp — Implement conditional Stream allocation sharding
  csrc/preseg_passes/finalize_multidevice_domains.cpp
  • Introduced shardAllocation to handle device and Stream parallelization
  • Added shouldParallelizeAllocationOnStream to check Stream consumer consistency
  • Added isLoopStreamParallelized to detect Stream in loop domain
  • Skip sharding if no device mesh and not Stream-loop parallelized
  • Print debug transform logging when enabled
  • +57/-30

Bug fix
  test_multidevice_lower_communication.cpp — Fix allgather test device mesh setup
  tests/cpp/test_multidevice_lower_communication.cpp
  • Move setDeviceMesh call before split for correctness
  • Remove manual split and allocation on output tensor
  • Use setDeviceMesh on output to enable proper propagation
  • +2/-3

Tests
  test_stream.cpp — Add Stream allocation propagation tests
  tests/cpp/test_stream.cpp
  • Add #include
  • Add ShardedAllocation test for Stream allocation in loops
  • Add ReplicatedAllocation test when Stream not in consumers
  • Verify allocation domain matches logical or loop domain
  • +58/-0

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Loop Stream Check

The function isLoopStreamParallelized checks if any loop domain ID is stream-parallel, but it may not account for nested or conditional loop structures where stream parallelization is context-dependent. This could lead to incorrect propagation decisions in complex loop scenarios.

bool isLoopStreamParallelized(const TensorView* tv) {
  return std::any_of(
      tv->getLoopDomain().begin(),
      tv->getLoopDomain().end(),
      [](IterDomain* id) { return id->isStream(); });
}

Allocation Sharding Logic

The shardAllocation function skips splitting for stream-parallel outer dimensions when shouldParallelizeAllocationOnStream returns false, but it does not handle cases where partial stream parallelization exists across consumer tensorviews, potentially leading to inconsistent memory layouts.

if (split->outer()->isStream() &&
    !shouldParallelizeAllocationOnStream(tv)) {
  continue;
}
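As a minimal model of what this skip means for memory (an assumed, illustrative function, not the pass's code): taking the Stream split shards the allocation per stream, while skipping it leaves the full logical extent allocated, i.e. replicated.

```cpp
// Illustrative model, not nvFuser code: a dimension of logical extent `n` is
// outer-split by `stream_factor` and Stream-parallelized. If the Stream ID
// propagates to allocation, each stream allocates only its shard
// (n / stream_factor); otherwise the tensor stays fully allocated.
long long allocatedExtent(
    long long n, long long stream_factor, bool stream_propagates) {
  return stream_propagates ? n / stream_factor : n;
}
```

Under this model, an extent-8 dimension split by 2 streams allocates 4 elements per stream when sharded and 8 when replicated.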
Test Coverage

The new tests ShardedAllocation and ReplicatedAllocation verify basic behavior but do not test edge cases such as tensorviews with mixed parallel types or multiple stream-parallel dimensions, which could expose flaws in the propagation logic.

TEST_F(StreamTest, ShardedAllocation) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  const int64_t s = 2;

  TensorView* tv0 = makeContigTensor(3);
  TensorView* tv1 = add(tv0, IrBuilder::create<Val>(1.0));
  TensorView* tv2 = sum(tv1, {2});
  TensorView* tv3 = div(tv1, IrBuilder::create<Val>(2.0));
  fusion->addInput(tv0);
  fusion->addOutput(tv2);
  fusion->addOutput(tv3);

  tv0->outer_split(0, s);
  tv0->axis(0)->parallelize(ParallelType::Stream);

  preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass(
      fusion.get());

  for (auto* tv : {tv0, tv1, tv2, tv3}) {
    EXPECT_TRUE(tv->axis(0)->isStream()) << tv;
    if (tv->isFusionOutput() || tv->isFusionInput()) {
      EXPECT_EQ(tv->getAllocationDomain(), tv->getLogicalDomain());
    } else {
      EXPECT_EQ(tv->getAllocationDomain(), tv->getLoopDomain());
    }
  }
}

TEST_F(StreamTest, ReplicatedAllocation) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  const int64_t s = 2;

  TensorView* tv0 = makeContigTensor(3);
  TensorView* tv1 = add(tv0, IrBuilder::create<Val>(1.0));
  TensorView* tv2 = sum(tv1, {2});
  TensorView* tv3 = div(tv1, IrBuilder::create<Val>(2.0));
  fusion->addInput(tv0);
  fusion->addOutput(tv2);
  fusion->addOutput(tv3);

  tv0->outer_split(0, s);
  tv0->axis(0)->parallelize(ParallelType::Stream);
  tv2->outer_split(1, s);
  tv2->axis(1)->parallelize(ParallelType::Stream);

  preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass(
      fusion.get());
  for (auto* tv : {tv0, tv1, tv2, tv3}) {
    EXPECT_TRUE(tv->axis(0)->isStream()) << tv;
    EXPECT_EQ(tv->getAllocationDomain(), tv->getLogicalDomain());
  }
}

--global and others added 2 commits October 8, 2025 14:44
@Priya2698
Collaborator Author

!test

@Priya2698
Collaborator Author

!test

Base automatically changed from pm/index_compute to main October 9, 2025 00:57

@Priya2698
Collaborator Author

!test

@Priya2698 Priya2698 marked this pull request as ready for review October 9, 2025 01:31
@Priya2698 Priya2698 requested review from wujingyue and removed request for wujingyue October 9, 2025 01:32
@Priya2698 Priya2698 marked this pull request as draft October 9, 2025 05:42
Collaborator

@wujingyue wujingyue left a comment

LGTM otherwise

@Priya2698
Collaborator Author

!test

@Priya2698 Priya2698 requested a review from wujingyue October 14, 2025 16:05
@Priya2698 Priya2698 marked this pull request as ready for review October 14, 2025 16:05
Collaborator

@wujingyue wujingyue left a comment

Looks great!

Priya2698 and others added 5 commits October 14, 2025 12:20
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
@Priya2698
Collaborator Author

!test

@Priya2698
Collaborator Author

!test

@Priya2698
Collaborator Author

!test

@Priya2698 Priya2698 merged commit 851a0e6 into main Oct 16, 2025
64 of 67 checks passed
@Priya2698 Priya2698 deleted the pm/alloc_stream branch October 16, 2025 16:06
    split != nullptr,
    "Expected all transform exprs to be a split between allocation and "
    "loop domain during sharding propagation.");
if (split->outer()->isStream() &&

Collaborator

Nit: I believe you can move this filter to loop_stream_device_view as well. This way, we put all the filters in one location.

Collaborator Author

This PR was merged, but I'll do it in a follow-up!

tbqh pushed a commit that referenced this pull request Nov 12, 2025