
don't propagate allocation domain to cached inputs in normalization scheduler #4723

Closed

liqiangxl wants to merge 16 commits into main from llu/_ws_tma_alloc_domain
Conversation

@liqiangxl
Collaborator

@liqiangxl liqiangxl commented Jul 3, 2025

The fusion IR of the newly added cpp test is a simple fusion that uses the inner-outer persistent scheduler.

Fusion before PreSegmenter:
Inputs:
  T0_g_float[bS0{1}, iS1{4096}, iS2{4096}]
Outputs:
  T4_g_float[bS10{1}, iS11{4096}, iS12{4096}]
  T6_g_float[rS15{4096}, iS16{4096}]

%kernel_math {
T1_l_float[iS3{4096}, iS4{4096}]
   = squeeze( T0_g_float[bS0{1}, iS1{4096}, iS2{4096}], flags = {true, false, false} )
T2_l_float[iS5{4096}, rS6{4096}]
   = reduction( T1_l_float[iS3{4096}, iS4{4096}], op = add, initial value = float(0), allreduce = false )
T3_l_float[bS7{1}, iS8{4096}, bS9{1}]
   = broadcast( T2_l_float[iS5{4096}, rS6{4096}], flags = {true, false, true} )
T4_g_float[bS10{1}, iS11{4096}, iS12{4096}]
   = T0_g_float[bS0{1}, iS1{4096}, iS2{4096}]
   + T3_l_float[bS7{1}, iS8{4096}, bS9{1}];
T5_l_float[iS13{4096}, iS14{4096}]
   = squeeze( T4_g_float[bS10{1}, iS11{4096}, iS12{4096}], flags = {true, false, false} )
T6_g_float[rS15{4096}, iS16{4096}]
   = reduction( T5_l_float[iS13{4096}, iS14{4096}], op = add, initial value = float(0), allreduce = false )
} // %kernel_math 

Its input has an allocation domain.

T0_g_float[bS0{1}, iS1{4096}, iS2{4096}]
 logical domain : (bS0{1}, iS1{4096}, iS2{4096})
 allocation domain : (iS1{4096}, bS0{1}, iS2{4096})
 contiguity: t n t
 loop domain : (bS0{1}, iS1{4096}, iS2{4096})

This allocation domain is passed to its consumer T7_s_float:

T7_s_float[iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}] ca_pos( 2 )
 logical domain : (bS20{1}, iS21{4096}, iB22{4096})
 allocation domain : (iS21{4096}, bS20{1}, iB22{4096})
 contiguity: t n t
  Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
  Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
  Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
 loop domain : (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})

The allocation of T7 in shared memory should be derived from its compute-at position and loop domain, since iS21{4096} has been transformed and the compute-at position falls between the transformed domains.
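As a rough illustration of that derivation (a hedged Python sketch, not nvfuser code; `smem_allocation_extents` is a hypothetical helper): domains to the left of the compute-at position are inlined into the consumer's loop nest, so only the domains at or to the right of ca_pos need to be allocated in shared memory.

```python
# Hedged sketch (not nvfuser code): derive the shared-memory allocation of a
# tensor from its loop domain and compute-at position. Domains to the left of
# ca_pos are inlined into the consumer loop nest; the rest must be allocated.
from math import prod

def smem_allocation_extents(loop_extents, ca_pos):
    """Hypothetical helper: the extents allocated in shared memory are the
    loop-domain extents at or to the right of the compute-at position."""
    return loop_extents[ca_pos:]

# T7's loop domain extents:
# (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})
loop = [132, 8, 2, 2, 1, 4096]
alloc = smem_allocation_extents(loop, ca_pos=2)
print(alloc)        # [2, 2, 1, 4096] -> (iS120, iS116, bS20, iB22)
print(prod(alloc))  # 16384 elements per CTA
```

This matches the allocation named later in the discussion: iS120{2}, iS116{2}, bS20{1}, iB22{4096}.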

One remaining issue:
This specific case is not affected, but the reduction and normalization schedulers may fail to derive the correct reduction type when an allocation domain exists, e.g. #2202

@github-actions

github-actions bot commented Jul 3, 2025

Review updated until commit 4250b71

Description

  • Added propagate_allocation_domain parameter to cacheInputs.

  • Updated TensorView::cacheAfter to conditionally propagate allocation domains.

  • Added test case for allocation domain with broadcast domains.


Changes walkthrough 📝

Relevant files

Enhancement

csrc/scheduler/normalization_utils.cpp: Update cacheInputs call with new parameter
  • Added propagate_allocation_domain parameter to cacheInputs call.
  • Added TODO comment for related issue.
  (+4/-2)

csrc/scheduler/utils.cpp: Add propagate_allocation_domain parameter to cacheInputs
  • Added propagate_allocation_domain parameter to cacheInputs function.
  • Updated cacheInputs to use the propagate_allocation_domain parameter.
  (+5/-2)

csrc/tensor_view.cpp: Conditionally propagate allocation domains in cacheAfter
  • Updated cacheAfter to conditionally propagate allocation domains based on propagate_allocation_domain.
  (+6/-6)

csrc/transform_replay.cpp: Parallelize new IDs in replayCasP
  • Updated replayCasP to parallelize new IDs based on the original IDs.
  (+4/-1)

csrc/scheduler/utils.h: Update cacheInputs function signature
  • Updated the cacheInputs function signature to include the propagate_allocation_domain parameter.
  (+4/-1)

Tests

tests/cpp/test_combined_inner_outer_reduction.cpp: Add test case for allocation domain with broadcast domains
  • Added test case AllocationDomainBroadcast to verify allocation domain handling with broadcast domains.
  (+32/-0)
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Code Comment

The comment "Normalization scheudler is not aware of allocation domains." has a typo. It should be "Normalization scheduler is not aware of allocation domains."

// Normalization scheudler is not aware of allocation domains.

Default Parameter

The parameter propagate_allocation_domain defaults to true in the function signature in utils.h, but the default is not restated in the implementation in utils.cpp. Ensure the default is consistent between the declaration and the definition.

std::vector<TensorView*> cacheInputs(
    Fusion* fusion,
    bool unroll,
    bool propagate_allocation_domain) {

Redundant Fusion Creation

In the test AllocationDomainBroadcast, two Fusion objects are created: fusion_ptr and fusion. Only fusion is used. Remove the unused fusion_ptr to clean up the code.

@liqiangxl liqiangxl force-pushed the llu/ws_tma_vect_check branch from ab17e68 to 0a5b731 July 11, 2025 21:01
@liqiangxl liqiangxl marked this pull request as ready for review July 14, 2025 18:26
@liqiangxl liqiangxl requested review from jjsjann123 and naoyam July 15, 2025 14:09
tv->getLoopDomain().end(),
tv->getAllocationDomain().begin(),
tv->getAllocationDomain().end())) {
if (!ir_utils::isCpAsyncBulk1D(tv->definition()) &&
Collaborator

So we are just excluding the use of the pre-set allocation domain on shared memory TVs. Nitpick: add this to the comment.

Collaborator

I'm not quite following the logic flow here. To establish a baseline:

Here we are dealing with the cached-input problem. T7_s_float is the cached input from loading T0_g.

I think this PR is saying that the allocation domain set on T7_s isn't relevant for CpAsyncBulk1D. I'm wondering where we are inserting CpAsyncBulk1D and shouldn't that be the place where we back up the allocation domain on its output?

Collaborator Author

We are saying: don't use the pre-set allocation domain if the shared memory tensor is loaded with CpAsyncBulk1D. It is not safe.
For example, in this case, T7_s_float is a shared memory tensor and it is loaded with CpAsyncBulk1D. Its loop domain is (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}) and ca_pos = 2, so its allocation domain in shared memory should be iS120{2}, iS116{2}, bS20{1}, iB22{4096}, which can't be covered by selecting domains from the pre-set allocation domain because iS120{2} and iS116{2} do not exist in the pre-set allocation domain.

T7_s_float[iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}] ca_pos( 2 )
 logical domain : (bS20{1}, iS21{4096}, iB22{4096})
 allocation domain : (iS21{4096}, bS20{1}, iB22{4096})
 contiguity: t n t
  Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
  Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
  Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
 loop domain : (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})
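The split sequence in the dump above can be replayed numerically to see where the problematic IDs come from (a hedged Python sketch, not nvfuser code; `split` and `ceil_div` are illustrative stand-ins): each split of iS21{4096} manufactures fresh IDs, and two of them (iS116, iS120) have no counterpart in the pre-set allocation domain (iS21, bS20, iB22).

```python
# Hedged sketch (not nvfuser code): replay T7's split transformations on
# extents. nvfuser splits produce an outer extent via ceilDiv and an inner
# extent equal to the split factor.

def ceil_div(a, b):
    return -(-a // b)

def split(extent, factor):
    """Illustrative stand-in for an inner split: returns (outer, inner)."""
    return ceil_div(extent, factor), factor

# Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
iS115, iS116 = split(4096, 2)
# Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
iS117, iBIDy118 = split(iS115, 132)   # ceil_div(2048, 132) == 16
# Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
iS119, iS120 = split(iS117, 2)

# iS116 and iS120 are newly created by the splits; only iS21 (now consumed),
# bS20, and iB22 appear in the pre-set allocation domain, so that domain
# cannot describe the post-scheduling shared-memory layout.
print([iS115, iS116, iS117, iBIDy118, iS119, iS120])
# [2048, 2, 16, 132, 8, 2]
```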
    

Collaborator

Thanks for that explanation.

IIUC, somewhere in the scheduler the allocation domain is added. So I was wondering if it makes more sense to fix that in the scheduler, rather than trying to patch it afterwards in the lowering logic.

Collaborator Author

Looks like there is a long history and many discussions about allocation domains.
(1) The allocation domain is passed from input to cached input; this feature was added in #2309 for matmul and also applied to all schedulers.
(2) #3621 changed to propagate the allocation domain only for matmul schedulers, to fix #3479.
(3) #3621 was then reverted due to misaligned memory accesses from a transpose kernel, #3701.

Looks like the transpose and matmul schedulers need the propagated allocation domain. For reduction & normalization we can skip it for now.

Collaborator

Generally speaking, when a tensor has an allocation domain, that should truly represent the allocation of the tensor. If it doesn't have an allocation domain, we need to make some inference, but if it does, it shouldn't have an invalid domain. There are several exceptions to this general rule due to some historical reasons, but we should not add more exceptions whenever possible.

In the case of input caches, propagated allocation domains usually don't matter because they are usually Local tensors, for which allocation domains are almost always ignored as one of the existing exceptions. In this case, however, it exposed the problem because the cache tensor is Shared.

I think what we should do is to schedule the allocation domain explicitly rather than relying on propagation or inference. We schedule the tensor to be allocated on the shared memory and be loaded with TMA, and in addition to them, we should also schedule its allocation domain so that there should be no more additional inference rule.

Collaborator Author

Thanks for the suggestion; I will take a look at this approach.

Collaborator Author

@naoyam and @jjsjann123, I drafted a PR to schedule the allocation domain, can you take a quick look? Just want to make sure I am on the right track.

Collaborator

Updating the allocation domain on the shared TV in the reduction scheduler looks right to me.
Wondering why we are doing the refactor, though.

Base automatically changed from llu/ws_tma_vect_check to main July 16, 2025 00:34

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl changed the title from "Don't use pre-set alloc domain for 1D TMA loaded tensors" to "don't propagate allocation domain to cached inputs in normalization scheduler" Jul 16, 2025
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

We decided to transform the allocation domain following the loop domain transformations; see #4795.

@liqiangxl liqiangxl closed this Aug 6, 2025
3 participants