
Fix allocation logic: unconnected alloc/logical#5185

Merged
jjsjann123 merged 5 commits into main from jj/allocation_for_layout_op_PR_1 on Oct 7, 2025

Conversation

Collaborator

@jjsjann123 jjsjann123 commented Sep 18, 2025

Stacked PRs

A follow-up PR enabling the Python API and updating test_moe.py is still being cleaned up.
#5174 allow layout op in automatic scheduler
#5185 Fix allocation logic: unconnected alloc/logical <- this one

This PR

Fixes allocation logic to ensure that the output tensor has:

  1. shape matching its logical domain;
  2. buffer size matching the allocation domain.

Without this PR, the output tensor from PreprocessGroupedMatmulInputSf will have a shape that mismatches its logical domain, causing validation failures in downstream consumers.

Context

PreprocessGroupedMatmulInputSf op has:

  1. unconnected logical and allocation domains;
  2. a larger allocation size, because extra padding is represented via arithmetic operations directly on the extent.

The existing allocation logic allocates a buffer matching the logical sizes/strides. This is not the right behavior, because the allocation domain could have a larger extent. Nor can we simply use the allocation sizes/strides, because consumers of the tensor expect a tensor matching the logical size.

We updated the logic to use the allocation domain for buffer allocation, then slice into the buffer using the logical domain to produce a correctly sized output.
For PreprocessGroupedMatmulInputSf, because there is no correct way to slice into the buffer for indexing, we give up on producing correct strides and use naive strides instead. This is safe because we do not apply indexing logic to the output.

Code change

  1. Refactor buffer allocation to use the allocation domain instead of the logical domain.
  2. Fix the special path in the allocation-to-logical projection for when projection is not possible: we now compute the correct extent instead of returning the allocation buffer as-is. This lets the layout op return a tensor with the correct logical size while still allocating a buffer large enough to accommodate the padding requirement.

@github-actions

github-actions bot commented Sep 18, 2025

Review updated until commit fd8826c

Description

  • Fix buffer allocation to use allocation domain sizes/strides

  • Restrict output tensor to logical domain for correct consumer expectations

  • Handle unprojectable allocation-to-logical cases with correct logical sizes

  • Add shape validation in test for layout op output consistency


Changes walkthrough 📝

Relevant files
Bug fix
allocations.cpp
Use allocation domain for buffer, logical for output         

csrc/runtime/allocations.cpp

  • Allocate buffer using allocation sizes/strides to prevent
    out-of-bounds access
  • Restride the buffer to logical sizes/strides before returning it to
    the consumer
  • Handle case when allocation-to-logical projection is not possible
  • Compute correct logical sizes with naive strides as fallback
  • +65/-12 
    Tests
    test_layout_op.cpp
    Validate logical shape in layout op test                                 

    tests/cpp/test_layout_op.cpp

  • Add validation for output tensor logical shape
  • Compute padded dimensions for strided view
  • Ensure output shape matches reference tensor
  • +6/-0     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The logic for computing strides in the fallback path when frontier_set != logical_set uses a naive cumulative product, but the loop processes domains in reverse order. This may result in incorrect stride values if the logical domain ordering is not preserved correctly.

    int64_t cur_stride = 1;
    for (const auto&& [i, id] : enumerate(logical) | std::views::reverse) {
      int64_t cur_size = ee.evaluate(id->extent()).as<int64_t>();
      logical_sizes[i] = cur_size;
      logical_strides[i] = cur_stride;
      cur_stride *= cur_size;
    }
    Performance Concern

    Allocating based on allocation domain and then restriding may lead to unnecessary memory usage or copy operations. Consider whether this two-step allocation and restriding pattern could be optimized for common cases where logical and allocation domains are the same.

    if (!out_info.shape_info.allocation_sizes.empty()) {
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.allocation_sizes,
          out_info.shape_info.allocation_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
      alloc_tensor = alloc_tensor.as_strided_(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides);
    } else {
      // A special case where allocation sizes & strides are NOT available;
      // logical sizes & strides are used in place of allocation sizes &
      // strides, hence no restride is necessary.
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
    }

    @jjsjann123 jjsjann123 changed the title from "Fix output buffer size for PreprocessGroupedMatmulInputSf" to "Fix allocation logic: unconnected alloc/logical" on Sep 18, 2025
    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from c64d299 to 33d0ce3 on September 18, 2025 21:40
    @jjsjann123 jjsjann123 marked this pull request as ready for review September 18, 2025 21:41
    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 33d0ce3 to 17df15a on September 19, 2025 22:55
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 17df15a to f9acfc3 on September 22, 2025 21:50
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from f9acfc3 to 87afb60 on September 23, 2025 17:39
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 marked this pull request as draft September 25, 2025 00:03
    fixing logical size of allocated buffer for layout op
    @jjsjann123 jjsjann123 changed the base branch from jj/allocation_PR_0 to main October 1, 2025 20:10
    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch 2 times, most recently from c64d299 to ea3cd68 on October 1, 2025 20:17
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 marked this pull request as ready for review October 2, 2025 19:05
    @jjsjann123
    Collaborator Author

    !test

    for (int i = static_cast<int>(tensor_new_shape.size()) - 1; i >= 0; --i) {
      tensor_new_strides[i] = prod;
      prod *= tensor_new_shape[i];
    }
    Collaborator Author commented:
    A random bug fix, not related to this PR.

    std::set<IterDomain*> logical_set(logical.begin(), logical.end());
    if (frontier_set != logical_set) {
      // previously: return tensor; (returned the allocation buffer as-is)
      std::vector<int64_t> logical_sizes(logical.size(), 0);
    Collaborator commented:
    IIUC, logical_sizes is the correct shape of the output, but logical_strides isn't. Can you leave a comment a little more about the context, e.g., why it should be done this way and why the incorrect strides don't matter.

    c10::nullopt,
    device,
    c10::nullopt);
    at::Tensor alloc_tensor;
    Collaborator commented:
    Can you explain why it's changed this way?

    @jjsjann123 jjsjann123 requested a review from naoyam October 7, 2025 18:11
    @jjsjann123
    Collaborator Author

    !test

    @naoyam naoyam left a comment:

    LGTM

    @jjsjann123
    Collaborator Author

    !build

    @jjsjann123 jjsjann123 merged commit 33337e9 into main Oct 7, 2025
    17 checks passed
    @jjsjann123 jjsjann123 deleted the jj/allocation_for_layout_op_PR_1 branch October 7, 2025 23:41
    tbqh pushed a commit that referenced this pull request Nov 12, 2025
