
Fix allocation logic: unconnected alloc/logical#5185

Merged
jjsjann123 merged 5 commits into main from jj/allocation_for_layout_op_PR_1 on Oct 7, 2025

Conversation

Collaborator

@jjsjann123 jjsjann123 commented Sep 18, 2025

Stacked PRs

A follow-up PR enabling the Python API and updating test_moe.py is still being cleaned up.
#5174 allow layout op in automatic scheduler
#5185 Fix allocation logic: unconnected alloc/logical <- this one

This PR

Fixes allocation logic to ensure that the output tensor has:

  1. shape matching its logical domain;
  2. buffer size matching the allocation domain.

Without this PR, the output tensor from PreprocessGroupedMatmulInputSf will have a shape that mismatches its logical domain, causing validation failures in downstream consumers.

Context

PreprocessGroupedMatmulInputSf op has:

  1. unconnected logical and allocation domains;
  2. a larger allocation size, because extra padding is represented via arithmetic operations directly on the extent.

The existing allocation logic allocates a buffer matching the logical sizes/strides. This is not the right behavior, because the allocation domain could have a larger extent. Nor can we simply use the allocation sizes/strides, because consumers of the tensor expect a tensor matching the logical size.

We updated the logic to use the allocation domain for buffer allocation, then slice into the buffer using the logical domain to produce a correctly sized output.
For PreprocessGroupedMatmulInputSf, because there is no correct way to slice into the buffer for indexing, we give up on producing correct strides and use naive strides instead. This is safe because we do not apply indexing logic to the output.

Code change

  1. Refactor buffer allocation to use the allocation domain instead of the logical domain.
  2. Fix the special path in the allocation-to-logical projection for when projection is not possible: we now compute the correct extent instead of returning the allocation buffer as-is. This lets the layout op return a tensor with the correct logical size while still allocating a buffer large enough to accommodate the padding requirement.

@github-actions

github-actions bot commented Sep 18, 2025

Review updated until commit fd8826c

Description

  • Fix buffer allocation to use allocation domain sizes/strides

  • Restrict output tensor to logical domain for correct consumer expectations

  • Handle unprojectable allocation-to-logical cases with correct logical sizes

  • Add shape validation in test for layout op output consistency


Changes walkthrough 📝

Relevant files
Bug fix
allocations.cpp
Use allocation domain for buffer, logical for output         

csrc/runtime/allocations.cpp

  • Allocate buffer using allocation sizes/strides to prevent
    out-of-bounds access
  • Restride the buffer to logical sizes/strides before returning it to
    the consumer
  • Handle case when allocation-to-logical projection is not possible
  • Compute correct logical sizes with naive strides as fallback
  • +65/-12 
    Tests
    test_layout_op.cpp
    Validate logical shape in layout op test                                 

    tests/cpp/test_layout_op.cpp

  • Add validation for output tensor logical shape
  • Compute padded dimensions for strided view
  • Ensure output shape matches reference tensor
  • +6/-0     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The logic for computing strides in the fallback path when frontier_set != logical_set uses a naive cumulative product, but the loop processes domains in reverse order. This may result in incorrect stride values if the logical domain ordering is not preserved correctly.

    int64_t cur_stride = 1;
    for (const auto&& [i, id] : enumerate(logical) | std::views::reverse) {
      int64_t cur_size = ee.evaluate(id->extent()).as<int64_t>();
      logical_sizes[i] = cur_size;
      logical_strides[i] = cur_stride;
      cur_stride *= cur_size;
    }
    Performance Concern

    Allocating based on allocation domain and then restriding may lead to unnecessary memory usage or copy operations. Consider whether this two-step allocation and restriding pattern could be optimized for common cases where logical and allocation domains are the same.

    if (!out_info.shape_info.allocation_sizes.empty()) {
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.allocation_sizes,
          out_info.shape_info.allocation_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
      alloc_tensor = alloc_tensor.as_strided_(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides);
    } else {
      // A special case where allocation sizes & strides are NOT available;
      // logical sizes & strides are used in place of allocation sizes &
      // strides, hence no restride is necessary.
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
    }

    @jjsjann123 jjsjann123 changed the title from "Fix output buffer size for PreprocessGroupedMatmulInputSf" to "Fix allocation logic: unconnected alloc/logical" on Sep 18, 2025
    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from c64d299 to 33d0ce3 on September 18, 2025 21:40
    @jjsjann123 jjsjann123 marked this pull request as ready for review September 18, 2025 21:41
    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 33d0ce3 to 17df15a on September 19, 2025 22:55
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 17df15a to f9acfc3 on September 22, 2025 21:50
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from f9acfc3 to 87afb60 on September 23, 2025 17:39
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 marked this pull request as draft September 25, 2025 00:03
    fixing logical size of allocated buffer for layout op
    @jjsjann123 jjsjann123 changed the base branch from jj/allocation_PR_0 to main October 1, 2025 20:10
    @jjsjann123 jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch 2 times, most recently from c64d299 to ea3cd68 on October 1, 2025 20:17
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 marked this pull request as ready for review October 2, 2025 19:05
    @jjsjann123
    Collaborator Author

    !test

    for (int i = static_cast<int>(tensor_new_shape.size()) - 1; i >= 0; --i) {
      tensor_new_strides[i] = prod;
      prod *= tensor_new_shape[i];
    }
    Collaborator Author commented:
    A random bug fix, not related to this PR.

    std::set<IterDomain*> logical_set(logical.begin(), logical.end());
    if (frontier_set != logical_set) {
      // previously: return tensor; (returned the allocation buffer as-is)
      std::vector<int64_t> logical_sizes(logical.size(), 0);
    Collaborator commented:
    IIUC, logical_sizes is the correct shape of the output, but logical_strides isn't. Can you leave a comment a little more about the context, e.g., why it should be done this way and why the incorrect strides don't matter.

    c10::nullopt,
    device,
    c10::nullopt);
    at::Tensor alloc_tensor;
    Collaborator commented:
    Can you explain why it's changed this way?

    @jjsjann123 jjsjann123 requested a review from naoyam October 7, 2025 18:11
    @jjsjann123
    Collaborator Author

    !test

    @naoyam naoyam left a comment:

    LGTM

    @jjsjann123
    Collaborator Author

    !build

    @jjsjann123 jjsjann123 merged commit 33337e9 into main Oct 7, 2025
    17 checks passed
    @jjsjann123 jjsjann123 deleted the jj/allocation_for_layout_op_PR_1 branch October 7, 2025 23:41
    tbqh pushed a commit that referenced this pull request Nov 12, 2025
