
don't propagate allocation domain to cached inputs in normalization scheduler #4723

Closed

liqiangxl wants to merge 16 commits into main from llu/_ws_tma_alloc_domain
Conversation

@liqiangxl
Collaborator

@liqiangxl liqiangxl commented Jul 3, 2025

The fusion IR of the newly added cpp test is a simple fusion that uses the inner-outer persistent scheduler.

Fusion before PreSegmenter:
Inputs:
  T0_g_float[bS0{1}, iS1{4096}, iS2{4096}]
Outputs:
  T4_g_float[bS10{1}, iS11{4096}, iS12{4096}]
  T6_g_float[rS15{4096}, iS16{4096}]

%kernel_math {
T1_l_float[iS3{4096}, iS4{4096}]
   = squeeze( T0_g_float[bS0{1}, iS1{4096}, iS2{4096}], flags = {true, false, false} )
T2_l_float[iS5{4096}, rS6{4096}]
   = reduction( T1_l_float[iS3{4096}, iS4{4096}], op = add, initial value = float(0), allreduce = false )
T3_l_float[bS7{1}, iS8{4096}, bS9{1}]
   = broadcast( T2_l_float[iS5{4096}, rS6{4096}], flags = {true, false, true} )
T4_g_float[bS10{1}, iS11{4096}, iS12{4096}]
   = T0_g_float[bS0{1}, iS1{4096}, iS2{4096}]
   + T3_l_float[bS7{1}, iS8{4096}, bS9{1}];
T5_l_float[iS13{4096}, iS14{4096}]
   = squeeze( T4_g_float[bS10{1}, iS11{4096}, iS12{4096}], flags = {true, false, false} )
T6_g_float[rS15{4096}, iS16{4096}]
   = reduction( T5_l_float[iS13{4096}, iS14{4096}], op = add, initial value = float(0), allreduce = false )
} // %kernel_math 

Its input has an allocation domain.

T0_g_float[bS0{1}, iS1{4096}, iS2{4096}]
 logical domain : (bS0{1}, iS1{4096}, iS2{4096})
 allocation domain : (iS1{4096}, bS0{1}, iS2{4096})
 contiguity: t n t
 loop domain : (bS0{1}, iS1{4096}, iS2{4096})

This allocation domain is passed to its consumer T7_s_float:

T7_s_float[iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}] ca_pos( 2 )
 logical domain : (bS20{1}, iS21{4096}, iB22{4096})
 allocation domain : (iS21{4096}, bS20{1}, iB22{4096})
 contiguity: t n t
  Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
  Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
  Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
 loop domain : (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})

The allocation of T7 in shared memory should be derived from its compute-at position and loop domain, since iS21{4096} has been transformed and the compute-at position falls between the transformed domains.
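As a rough illustration of that derivation (a hedged Python sketch, not nvfuser code; `smem_allocation_extents` is a hypothetical helper): domains to the left of the compute-at position are inlined into the consumer's loop nest, so only the domains at or to the right of ca_pos need to be allocated in shared memory.

```python
# Hedged sketch (not nvfuser code): derive the shared-memory allocation of a
# tensor from its loop domain and compute-at position. Domains to the left of
# ca_pos are inlined into the consumer loop nest; the rest must be allocated.
from math import prod

def smem_allocation_extents(loop_extents, ca_pos):
    """Hypothetical helper: the extents allocated in shared memory are the
    loop-domain extents at or to the right of the compute-at position."""
    return loop_extents[ca_pos:]

# T7's loop domain extents:
# (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})
loop = [132, 8, 2, 2, 1, 4096]
alloc = smem_allocation_extents(loop, ca_pos=2)
print(alloc)        # [2, 2, 1, 4096] -> (iS120, iS116, bS20, iB22)
print(prod(alloc))  # 16384 elements per CTA
```

This matches the allocation named later in the discussion: iS120{2}, iS116{2}, bS20{1}, iB22{4096}.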

One remaining issue:
This specific case is not affected, but the reduction and normalization schedulers may fail to derive the correct reduction type when an allocation domain exists, e.g. #2202

@github-actions

github-actions bot commented Jul 3, 2025

Review updated until commit 4250b71

Description

  • Added propagate_allocation_domain parameter to cacheInputs.

  • Updated TensorView::cacheAfter to conditionally propagate allocation domains.

  • Added test case for allocation domain with broadcast domains.


Changes walkthrough 📝

Relevant files

Enhancement

csrc/scheduler/normalization_utils.cpp: Update cacheInputs call with new parameter
  • Added propagate_allocation_domain parameter to cacheInputs call.
  • Added TODO comment for related issue.
  (+4/-2)

csrc/scheduler/utils.cpp: Add propagate_allocation_domain parameter to cacheInputs
  • Added propagate_allocation_domain parameter to cacheInputs function.
  • Updated cacheInputs to use the propagate_allocation_domain parameter.
  (+5/-2)

csrc/tensor_view.cpp: Conditionally propagate allocation domains in cacheAfter
  • Updated cacheAfter to conditionally propagate allocation domains based on propagate_allocation_domain.
  (+6/-6)

csrc/transform_replay.cpp: Parallelize new IDs in replayCasP
  • Updated replayCasP to parallelize new IDs based on the original IDs.
  (+4/-1)

csrc/scheduler/utils.h: Update cacheInputs function signature
  • Updated the cacheInputs function signature to include the propagate_allocation_domain parameter.
  (+4/-1)

Tests

tests/cpp/test_combined_inner_outer_reduction.cpp: Add test case for allocation domain with broadcast domains
  • Added test case AllocationDomainBroadcast to verify allocation domain handling with broadcast domains.
  (+32/-0)
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Code Comment

The comment "Normalization scheudler is not aware of allocation domains." has a typo. It should be "Normalization scheduler is not aware of allocation domains."

// Normalization scheudler is not aware of allocation domains.

Default Parameter

The parameter propagate_allocation_domain defaults to true in the function signature in utils.h, but the default is not restated in the implementation in utils.cpp. Ensure the default is consistent between the declaration and the definition.

std::vector<TensorView*> cacheInputs(
    Fusion* fusion,
    bool unroll,
    bool propagate_allocation_domain) {

Redundant Fusion Creation

In the test AllocationDomainBroadcast, two Fusion objects are created: fusion_ptr and fusion. Only fusion is used. Remove the unused fusion_ptr to clean up the code.

@liqiangxl liqiangxl force-pushed the llu/ws_tma_vect_check branch from ab17e68 to 0a5b731 July 11, 2025 21:01
@liqiangxl liqiangxl marked this pull request as ready for review July 14, 2025 18:26
@liqiangxl liqiangxl requested review from jjsjann123 and naoyam July 15, 2025 14:09
tv->getLoopDomain().end(),
tv->getAllocationDomain().begin(),
tv->getAllocationDomain().end())) {
if (!ir_utils::isCpAsyncBulk1D(tv->definition()) &&
Collaborator

So we are just excluding the use of the pre-set allocation domain on shared memory TVs. Nitpick: add this to the comment.

Collaborator

I'm not quite following the logic flow here. To establish a baseline:

Here we are dealing with the cached-input problem. T7_s_float is the cached input from loading T0_g.

I think this PR is saying that the allocation domain set on T7_s isn't relevant for CpAsyncBulk1D. I'm wondering where we are inserting CpAsyncBulk1D and shouldn't that be the place where we back up the allocation domain on its output?

Collaborator Author

We are saying: don't use the pre-set allocation domain if the shared memory tensor is loaded with CpAsyncBulk1D. It is not safe.
For example, in this case, T7_s_float is a shared memory tensor and it is loaded with CpAsyncBulk1D. Its loop domain is (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}) and ca_pos = 2, so its allocation domain in shared memory should be iS120{2}, iS116{2}, bS20{1}, iB22{4096}, which can't be covered by selecting domains from the pre-set allocation domain because iS120{2} and iS116{2} do not exist in the pre-set allocation domain.

T7_s_float[iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}] ca_pos( 2 )
 logical domain : (bS20{1}, iS21{4096}, iB22{4096})
 allocation domain : (iS21{4096}, bS20{1}, iB22{4096})
 contiguity: t n t
  Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
  Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
  Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
 loop domain : (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})
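The split sequence in the dump above can be replayed numerically to see where the problematic IDs come from (a hedged Python sketch, not nvfuser code; `split` and `ceil_div` are illustrative stand-ins): each split of iS21{4096} manufactures fresh IDs, and two of them (iS116, iS120) have no counterpart in the pre-set allocation domain (iS21, bS20, iB22).

```python
# Hedged sketch (not nvfuser code): replay T7's split transformations on
# extents. nvfuser splits produce an outer extent via ceilDiv and an inner
# extent equal to the split factor.

def ceil_div(a, b):
    return -(-a // b)

def split(extent, factor):
    """Illustrative stand-in for an inner split: returns (outer, inner)."""
    return ceil_div(extent, factor), factor

# Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
iS115, iS116 = split(4096, 2)
# Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
iS117, iBIDy118 = split(iS115, 132)   # ceil_div(2048, 132) == 16
# Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
iS119, iS120 = split(iS117, 2)

# iS116 and iS120 are newly created by the splits; only iS21 (now consumed),
# bS20, and iB22 appear in the pre-set allocation domain, so that domain
# cannot describe the post-scheduling shared-memory layout.
print([iS115, iS116, iS117, iBIDy118, iS119, iS120])
# [2048, 2, 16, 132, 8, 2]
```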
    

Collaborator

Thanks for that explanation.

IIUC, somewhere in the scheduler the allocation domain is added. So I was wondering if it makes more sense to fix that in the scheduler, rather than trying to patch it afterwards in the lowering logic.

Collaborator Author

Looks like there is a long history and many discussions about allocation domains.
(1) The allocation domain is passed from input to cached input; this feature was added in #2309 for matmul and also applied to all schedulers.
(2) #3621 changed to propagate the allocation domain only for matmul schedulers, to fix #3479.
(3) #3621 was then reverted due to misaligned memory accesses from a transpose kernel, #3701.

Looks like the transpose and matmul schedulers need the propagated allocation domain. For reduction & normalization we can skip it for now.

Collaborator

Generally speaking, when a tensor has an allocation domain, that should truly represent the allocation of the tensor. If it doesn't have an allocation domain, we need to make some inference, but if it does, it shouldn't have an invalid domain. There are several exceptions to this general rule due to some historical reasons, but we should not add more exceptions whenever possible.

In the case of input caches, propagated allocation domains usually don't matter because they are usually Local tensors, for which allocation domains are almost always ignored as one of the existing exceptions. In this case, however, it exposed the problem because the cache tensor is Shared.

I think what we should do is to schedule the allocation domain explicitly rather than relying on propagation or inference. We schedule the tensor to be allocated on the shared memory and be loaded with TMA, and in addition to them, we should also schedule its allocation domain so that there should be no more additional inference rule.

Collaborator Author

Thanks for the suggestion; I will take a look at this approach.

Collaborator Author

@naoyam and @jjsjann123, I drafted a PR to schedule the allocation domain, can you take a quick look? Just want to make sure I am on the right track.

Collaborator

Updating the allocation domain on the shared TV in the reduction scheduler looks right to me.
Wondering why we are doing the refactor, though.

Base automatically changed from llu/ws_tma_vect_check to main July 16, 2025 00:34

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl changed the title from "Don't use pre-set alloc domain for 1D TMA loaded tensors" to "don't propagate allocation domain to cached inputs in normalization scheduler" Jul 16, 2025
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

We decided to transform the allocation domain following the loop domain transformations; see #4795.

@liqiangxl liqiangxl closed this Aug 6, 2025
3 participants