DID loop split for allgather for non-outermost sharded axis.#4170

Merged
Priya2698 merged 19 commits into main from pm/reorder on Apr 11, 2025
Conversation

@Priya2698
Collaborator

@Priya2698 Priya2698 commented Apr 2, 2025

Adds support for allgather if the sharded axis is not outermost.
ProcessGroupNCCL and UCC require the sharded axis to be outermost in the allocation. We do not change the logical shape; instead, within postAllgather, we permute the tensors to meet the requirements of NCCL and UCC.

This will be added to the reorderShardedAxis preseg pass to correctly set the loop and allocation domains for the Allgather communication. Additionally, a set operation is needed to change the allocation of the input if its sharded axis is not the outermost allocated axis.
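As a standalone illustration of the layout trick (a hypothetical helper, not nvFuser code): with the gathered axis of extent d allocated outermost, a buffer allocated as [d, m, n] holds each rank's m*n chunk contiguously, while the unchanged logical shape [n, d*m] is just a permuted read of the same storage. The index arithmetic looks like:

```cpp
#include <cassert>

// Hypothetical helper (illustration only, not nvFuser code): offset into a
// gather buffer allocated as [d, m, n] (gathered axis outermost) for logical
// element (i, j) of the logical [n, d*m] tensor, where the second logical
// axis was outer-split into [d, m].
int allocOffset(int m, int n, int i, int j) {
  int r = j / m;    // which rank's chunk holds logical column j
  int col = j % m;  // column inside that chunk
  return (r * m + col) * n + i;  // allocation order [d, m, n], i innermost
}
```

Because rank r's m*n chunk sits contiguously at offset r*m*n, NCCL/UCC can gather directly into the buffer; the logical view only permutes how the same storage is read.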

@Priya2698
Collaborator Author

!test

@github-actions

github-actions bot commented Apr 2, 2025

Review updated until commit 1c8204c

Description

  • Added validation of tensor sizes and strides against tensorviews.

  • Ensured input and output tensors are contiguous for Allgather operations.

  • Updated tests to include noncontiguous tensors and multiple backends.


Changes walkthrough 📝

Relevant files

Enhancement

  executor.cpp: Add tensor validation in executor (csrc/host_ir/executor.cpp, +27/-0)
  • Added validateTensors function to validate tensor sizes and strides.
  • Called validateTensors in HostIrExecutor::run and HostIrEvaluator::handle.

  communication.cpp: Ensure tensor contiguity in Allgather (csrc/multidevice/communication.cpp, +30/-4)
  • Added isTvContiguous function to check TensorView contiguity.
  • Flattened input and output tensors in postAllgather to ensure contiguity.

Tests

  test_multidevice_communications.cpp: Include additional headers in tests (tests/cpp/test_multidevice_communications.cpp, +2/-0)
  • Included ops/all_ops.h and validator.h for additional operations and validation.

  test_multidevice_host_ir.cpp: Set tensor contiguity in tests (tests/cpp/test_multidevice_host_ir.cpp, +9/-0)
  • Set contiguity for communication input and output tensors in tests.

  test_multidevice_lower_communication.cpp: Update and add tests for Allgather (tests/cpp/test_multidevice_lower_communication.cpp, +81/-67)
  • Refactored LowerCollectiveTest to parameterize by backend and enable HostIrLowering.
  • Added AllgatherLoopSplit_Noncontig test for noncontiguous tensors.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Performance Impact

    The addition of validateTensors calls in multiple places may introduce performance overhead. Ensure that this validation is necessary and does not degrade performance.

    namespace {
    // Validates the sizes and strides of the input and output tensors
    // against the tensorviews
    void validateTensors(
        const std::vector<at::Tensor>& tensors,
        const std::vector<TensorView*>& tvs,
        const ExpressionEvaluator& expr_eval) {
      NVF_ERROR(tensors.size() == tvs.size());
      for (const auto& [tensor, tv] : zip(tensors, tvs)) {
        if (tensor.defined()) {
          inferAndValidateAllocationSizesAndStrides(tensor, tv, expr_eval);
        }
      }
    }
    Contiguity Check

    The isTvContiguous function checks if all axes are contiguous, which might be too strict for some use cases. Consider if a more flexible contiguity check is needed.

    bool isTvContiguous(const TensorView* tv) {
      // Reduction and broadcast axis do not have a contiguity value.
      return std::all_of(
          tv->getContiguity().begin(),
          tv->getContiguity().end(),
          [](std::optional<bool> c) { return c.value_or(true); });
    }
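The predicate can be exercised in isolation; this is a self-contained sketch of the same logic (allContiguous is an illustrative name, not the nvFuser helper itself):

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Standalone sketch of the same predicate: axes without a contiguity value
// (reduction/broadcast) do not constrain the result, via value_or(true).
bool allContiguous(const std::vector<std::optional<bool>>& contiguity) {
  return std::all_of(
      contiguity.begin(), contiguity.end(), [](std::optional<bool> c) {
        return c.value_or(true);
      });
}
```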
    Test Coverage

    The new test AllgatherLoopSplit_Noncontig is a good addition, but ensure that it covers all edge cases and does not introduce false positives.

    TEST_P(LowerCollectiveTest, AllgatherLoopSplit_Noncontig) {
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      // ProcessGroupNCCL requires the gathered axis to be outermost.
      // We change the allocation of tensorviews to reflect this.
      // We do not modify the logical shape of the tensorview.
      // This would still require one copy on each device if the input tensor is in
      // a different layout.
      const auto d = communicator_->size();
      auto mesh = DeviceMesh::createForNumDevices(d);
    
      TensorView* tv0 = makeConcreteTensor({5, d * 3});
      tv0->outer_split(1, d);
      tv0->axis(1)->parallelize(ParallelType::DIDx);
      tv0->reorder({{1, 0}, {2, 1}, {0, 2}});
      // tv0: Logical = [5, d*3], Loop/Allocation = [DIDx(d), 3, 5]
    
      TensorView* tv1 = set(tv0);
      tv1->outer_split(1, d);
      tv1->axis(1)->parallelize(ParallelType::Serial);
      tv1->reorder({{1, 0}, {2, 1}, {0, 2}});
      // tv1: Logical = [5, d*3], Loop/Allocation = [Serial(d), 3, 5]
    
      for (auto tv : {tv0, tv1}) {
        tv->setDeviceMesh(mesh);
        tv->setAllocationDomain(tv->getLoopDomain(), true);
      }
    
      fusion->addInput(tv0);
      fusion->addOutput(tv1);
    
      at::Tensor unsharded_in_tensor = at::randn({d * 3, 5}, tensor_options);
      at::Tensor in_tensor =
          shardTensor(unsharded_in_tensor, 0, mesh).transpose(0, 1);
    
      FusionExecutorCache executor_cache(std::move(fusion));
      at::Tensor out_tensor =
          executor_cache.runFusionWithInputs({in_tensor})[0].as<at::Tensor>();
    
      testValidate(
          executor_cache.fusion(),
          {out_tensor},
          {in_tensor},
          {unsharded_in_tensor.transpose(0, 1)},
          __LINE__,
          __FILE__);
    }
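To sanity-check the layouts this test expects, the contiguous row-major strides of an allocation can be computed by hand; a small standalone sketch (contiguousStrides is an illustrative name, not nvFuser code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row-major (innermost-last) strides for a contiguous allocation; useful to
// check layouts such as allocation [d, 3, 5] with d = 2 against what the
// executor validates.
std::vector<int64_t> contiguousStrides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}
```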

    @Priya2698
    Collaborator Author

    !test

    @Priya2698 Priya2698 changed the title from "allgather loop split, contig + noncontig" to "DID loop split for allgather for non-outermost sharded axis." Apr 4, 2025
    @Priya2698
    Collaborator Author

    !test

    @Priya2698 Priya2698 marked this pull request as ready for review April 4, 2025 19:09
    @Priya2698 Priya2698 requested review from cowanmeg and wujingyue April 4, 2025 19:09
    @wujingyue
    Collaborator

    LGTM otherwise. Thanks for the change!

    @Priya2698
    Collaborator Author

    !test

    @Priya2698 Priya2698 requested a review from wujingyue April 8, 2025 02:04
    @wujingyue wujingyue left a comment (Collaborator)

    LGTM with comments

    // Presegmentation pass `makeReshardingContiguous` ensures that the tvs are contiguous
    // and HostIrExecutor validates the tensor against the tv allocation domain.

    auto flattened_output_tensor = output_tensor.as_strided({output_tensor.numel()}, {1});
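The as_strided({numel}, {1}) flattening above is only sound when the tensor is dense and row-major; a standalone sketch of that precondition (isRowMajorContiguous is an illustrative name, not nvFuser code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A flat {numel}, {1} view is only valid when sizes/strides describe a dense
// row-major layout; this standalone check mirrors that precondition.
bool isRowMajorContiguous(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  int64_t expected = 1;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    if (sizes[i] != 1 && strides[i] != expected) {
      return false;
    }
    expected *= sizes[i];
  }
  return true;
}
```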
    Collaborator

    Also check contiguity of communication->in() and out()?

    Collaborator Author

    It is already enforced by makeReshardingContiguous pass so I am not duplicating it here.

    Collaborator

    Nit: makeReshardingContiguous is a bit too far and many changes could happen in between. For example, makeReshardingContiguous runs before segmentation and postSingleCommunication is at runtime. makeReshardingContiguous works on fusion IR e.g. set and reduce, and postSingleCommunication works on host IR e.g. Communication.

    Collaborator Author

    I am running into some test failures with a contiguity check here for the manual tests in test_multidevice_host_ir.cpp. Since these tests do not set an allocation domain, we have the contiguity set to false. How cumbersome is it to require manual tests also to have the allocation domain set correctly?
    CC: @samnordmann

    Collaborator

    How cumbersome is it to require manual tests also to have the allocation domain set correctly?

    IIUC, it should be set correctly, so let's set it correctly. The change can't be too large because test_multidevice_host_ir.cpp is <500 lines and that file has <20 calls to make*Tensor, many of which are Contig already.

    Collaborator

    hey Priya, I am not sure how cumbersome that is -- but if needed, feel free to do it, and please let me know how it looks.
    Let me know also if you need help.

    @Priya2698
    Collaborator Author

    !test

    auto out_tensor = output_args[out_idx].as<at::Tensor>();

    c10::intrusive_ptr<c10d::Work> work = postSingleCommunication(
    c10::intrusive_ptr<c10d::Work> work = validateAndPostSingleCommunication(
    Collaborator

    I'll skip validation for HostIrExecutor. It's done elsewhere already.

    Input:

    inferAndValidateAllocationSizesAndStrides(input, tv, ee);

    Output:

    inferAndValidateAllocationSizesAndStrides(tensor, tv, expr_eval);

    Collaborator Author

    Outputs are validated when they are not provided, i.e., when output_args is empty. If outputs are provided, there is no validation. Inputs can be skipped, like you said.

    I am looking at the callgraph to see what is the case where output_args comes pre-allocated.

    Collaborator

    what is the case where output_args comes pre-allocated

    I looked for this before. I found only unit tests that explicitly call KernelExecutor, not via FusionExecutorCache.

    Collaborator Author

    Right. FusionExecutorCache provides an empty output_args.
    I'll leave the validation in for just the output tensor and skip it for inputs.

    @cowanmeg cowanmeg left a comment (Collaborator)

    LGTM! Thanks!

    Comment on lines +100 to +102
    for (const auto i : c10::irange(tensors.size())) {
    const auto& tensor = tensors.at(i);
    const auto& tv = tvs.at(i);
    Collaborator

    FWIW, you can zip instead:

    for (auto&& [id, new_id] : zip(self_logical, new_self_logical)) {

    Welcome to C++20!
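For readers unfamiliar with the pattern, a minimal standalone sketch of such a zip helper (zipped is an illustrative name; nvFuser ships its own zip utility, and std::views::zip only arrives in C++23):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal zip-style pairing for two vectors, truncating to the shorter one.
// Enables structured-binding iteration: for (auto&& [a, b] : zipped(x, y)).
template <typename A, typename B>
std::vector<std::pair<A, B>> zipped(
    const std::vector<A>& a,
    const std::vector<B>& b) {
  std::vector<std::pair<A, B>> out;
  const size_t n = a.size() < b.size() ? a.size() : b.size();
  out.reserve(n);
  for (size_t i = 0; i < n; ++i) {
    out.emplace_back(a[i], b[i]);
  }
  return out;
}
```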

    }
    // getBackendForTeam throws an error if the requested backend type isn't
    // available. Therefore, we call it after the isBackendAvailable check.
    communicator_->setDefaultBackend(backend_type);
    Collaborator Author

    I discovered that while this allows me to set the backend when using FusionExecutorCache, communication->backend() is still different from the backend parameter passed to postSingleCommunication.
    getBackendForTeam, when not given a backend, returns the default backend set for the communicator.
    For executions running through HostIrEvaluator, this approach will not work.

    We should change this so that the communicator and the Communication object use the same backend value, for easier verification. I'll attempt this in a future PR.

    @Priya2698
    Collaborator Author

    !test

    @Priya2698 Priya2698 merged commit 35f4aed into main Apr 11, 2025
    53 checks passed
    @Priya2698 Priya2698 deleted the pm/reorder branch April 11, 2025 17:32
    Priya2698 added a commit that referenced this pull request May 18, 2025
    …utor (#4470)
    
    Prep PR for Issue #3900.
    
    I am modifying the `reorderShardedAxisPass` to set allocation domain
    consistent with the memory layout requirements of ProcessGroup NCCL and
    UCC, without changing the logical shape (see PR #4170 for example).
    
    MultiDeviceExecutor does not respect allocation domain, hence, removing
    these tests. Issue #4453.
