Support Split between logical domain to allocation domain to represent padding #5184

jjsjann123 wants to merge 89 commits into jj/skip_vectorization_allocation_validation
Review updated until commit 4d240a4

Description

Relevant files:
- Enhancement
- Bug_fix
- Tests
PR Reviewer Guide

Here are some key observations to aid the review process:

- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Memory Management

Test failures

- (Medium, 1) Tensor numerical mismatches in nvFuser matmul tests (H100 runner): HopperMatmulTest.HSH_NT_UseScheduler_MultipleInstructionsPerWarpTile ❌
tests/cpp/test_layout_op.cpp

```cpp
out->split(1, 16);
out->setAllocationDomain(out->getLoopDomain(), true);
// restore loop domain
out->merge(1);
```
This doesn't restore. Is this necessary?
Touché. It unsplits the loop domain so that it has the same size as the logical domain.
You are right that the extent is no longer the same, so it's not a restoration.
Schedulers expect an un-scheduled fusion. Without this merge, I'm hitting the assert here:
Fuser/csrc/scheduler/pointwise.cpp
Line 357 in db9721d
Hmm, not sure that's a good enough WAR, though this is just a test.
I thought the schedulers can work with some scheduled loop domains (for DID parallelization), no?
Fuser/csrc/scheduler/pointwise.cpp
Lines 231 to 233 in 12121b9
DID-related IDs are just ignored by the scheduler, so that's too specific to multi-device.
I'm not a fan of this either. Let me see if I can avoid messing with the loop domain and apply the transformation on the allocation domain directly.
I suppose you can just modify the allocation domain with AbstractTensor. I remember there are some tests.
I can also directly use IterDomain::split for that.
Anyway, it looks like if the transformation is not on the logical-to-loop path, our replay won't pick it up. Feels similar to the allocation domain replay that rfactor was missing. fyi @Priya2698
```
#0 nvfuser::nvfCheckFail (func=0xaaaaac218080 "validateDomainEquivalence",
   file=0xaaaaac216938 "/opt/pytorch/nvfuser/csrc/ir/utils.cpp", line=1162,
   msg=" INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. \nExpected !compare_result.dom0_has_u"...) at /opt/pytorch/nvfuser/csrc/exceptions.cpp:267
#1 0x0000aaaaab1bbe68 in nvfuser::nvfErrorFail (func=0xaaaaac218080 "validateDomainEquivalence",
   file=0xaaaaac216938 "/opt/pytorch/nvfuser/csrc/ir/utils.cpp", line=1162,
   condMsg=0xaaaaac217fd8 " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ",
   userMsg="Expected !compare_result.dom0_has_unreachable_ids . dom0 has unreachable IDs. dom0: iS10{i0}, iS11{i2}. dom1: iS10{i0}") at /opt/pytorch/nvfuser/csrc/exceptions.cpp:277
#2 0x0000aaaaab60a3e8 in nvfuser::ir_utils::validateDomainEquivalence (
   dom0=std::vector of length 2, capacity 2 = {...}, dom1=std::vector of length 1, capacity 3 = {...},
   additional_ids=std::vector of length 0, capacity 0) at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162
#3 0x0000aaaaab4aac30 in nvfuser::TensorDomain::setAllocationDomain (this=0xaaaab20918b0,
   new_allocation_domain=std::vector of length 1, capacity 3 = {...},
   new_contiguity=std::vector of length 1, capacity 3 = {...})
   at /opt/pytorch/nvfuser/csrc/ir/nodes.cpp:4055
#4 0x0000aaaaabc7b368 in nvfuser::TransformReplay::replayCasP (consumer=0xaaaab2088c00,
   producer=0xaaaab2091200, producer_pos=2, logical_map=..., opt=...)
   at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:917
#5 0x0000aaaaabc7b7fc in nvfuser::TransformReplay::replayCasP (consumer=0xaaaab2088c00,
   producer=0xaaaab2091200, compute_at_axis=-1, opt=...)
   at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:945
#6 0x0000aaaaabc44ccc in nvfuser::TensorView::cacheBefore (this=0xaaaab2088c00,
   op_type=nvfuser::LoadStoreOpType::Set) at /opt/pytorch/nvfuser/csrc/tensor_view.cpp:1160
#7 0x0000aaaaabbdb250 in nvfuser::scheduler_utils::cacheAndForkOutputs (fusion=0xaaaab2084910,
   unroll=true) at /opt/pytorch/nvfuser/csrc/scheduler/utils.cpp:1357
#8 0x0000aaaaabb067dc in nvfuser::schedulePointwise (fusion=0xaaaab2084910, pparams=0xaaaab207f880)
   at /opt/pytorch/nvfuser/csrc/scheduler/pointwise.cpp:822
#9 0x0000aaaaabb0898c in nvfuser::PointWiseScheduler::schedule (this=0xaaaab2083460,
   fusion=0xaaaab2084910, params=0xaaaab207f880)
   at /opt/pytorch/nvfuser/csrc/scheduler/pointwise.cpp:1304
```
So, what did you decide to do? Nothing seems to have changed?

> I can also directly use IterDomain::split for that.

Of course, but you'd need to maintain the proper ordering of the ID vector yourself.
> I can also directly use IterDomain::split for that. Anyway, it looks like if the transformation is not on the logical-to-loop path, our replay won't pick it up. Feels similar to the allocation domain replay that rfactor was missing. fyi @Priya2698

Yes, rfactor replay for allocation will also complain similarly if allocation transforms are disjoint from root-to-loop.
replayPasC also uses the loop domain as the target, so if you intend to use IterDomain::split, we will have to update that, among other things.
Yep, switched to selfReplay instead of replayCasP for TensorView::cacheBefore.
```cpp
  }
};

TEST_F(LayoutOpTest, LogicalAndAllocationSizes) {
```
What is being tested here?
Without the relaxation in vectorization analysis, this test would trigger an assert.
So the test just verifies that we now allow a split on the allocation domain.
In the follow-up PR, we add more validation to this test to check that the produced tensor matches the logical sizes.
!test

!test

Sorry, I don't have anything on that yet. I'll try to write one up when I have the end-to-end example working, at least in a prototype. Mostly trying to wing it at the moment.

!test
csrc/transform_replay.cpp

```cpp
// Replay loop.
if (self_loop != self->logical()) {
  ReplaySelf replay(self_loop, axis_map);
```
Just FYI: #4585 reversed this. I expect some tests to break.
Thanks a ton. Let me sweep through the failing tests and see if there's anything easy to patch. 🧑💼
!test

!test
```cpp
fusion.addOutput(out);
// padding output to multiple of 16 on allocation domain
auto&& [io, ii] = IterDomain::split(
    out->axis(1), IrBuilder::create<Val>(16L, DataType::Index), true);
```
Tagging @naoyam: changed the test to only apply the split on logical -> allocation.
```cpp
domain()->logical() | std::views::transform([](IterDomain* id) {
  id->setDefinition(nullptr);
  return id->resetRFactorProduct();
}),
```
logic: mutating IterDomain objects by clearing their definitions with setDefinition(nullptr) affects the original objects, which may still be referenced elsewhere in the codebase; this can cause issues if other code paths expect these definitions to be intact.
```cpp
// FIXME: this doesn't feel right, we have to mark contiguity on axis(0) as
// `false` to avoid accidental indexing collapsing, this should be figured out
// by indexing from the ceilDiv.
out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
```
logic: the FIXME indicates the contiguity workaround (false for axis(0)) is hiding an indexing bug - the indexing logic should automatically handle ceilDiv from splits without manual contiguity adjustments
```cpp
// FIXME: output shape inference is not correct yet.
// output should remain the correct logical size
// EXPECT_EQ(
//     cg_outputs[0].as<at::Tensor>().sizes(), std::vector<int64_t>({512, 9}));
// padding on the inner dimension is represented as stride on the outer
// dimension
EXPECT_EQ(
    cg_outputs[0].as<at::Tensor>().strides(), std::vector<int64_t>({16, 1}));
// We need to slice because output buffer shape is not right
EXPECT_TRUE(t0.equal(cg_outputs[0].as<at::Tensor>().slice(1, 0, k)));
```
logic: shape inference is broken for non-divisible splits - the output tensor has wrong size requiring manual slicing to verify correctness, which means production code using this feature will produce incorrectly-shaped outputs
```cpp
domain()->logical() | std::views::transform([](IterDomain* id) {
  id->setDefinition(nullptr);
  return id->resetRFactorProduct();
```
logic: in-place mutation of IterDomain objects via setDefinition(nullptr) modifies objects that may still be referenced elsewhere (e.g., in producer_map, graph structures, or other tensor views). The comment mentions this is needed for BestEffortReplay, but this mutation happens BEFORE the consumer domain is created and could affect code that expects these IDs to retain their definitions
!test

!test

Looks like it's hitting a vectorization validation error now. I'm not sure if it's coming from the changes in #5594. Taking a look at that now.

!test

!test
```cpp
domain()->logical() | std::views::transform([](IterDomain* id) {
  id->setDefinition(nullptr);
  return id->resetRFactorProduct();
```
logic: calling setDefinition(nullptr) mutates the original IterDomain objects from domain()->logical(). These IDs are shared and may be referenced by:

- old_domain (stored at line 1237)
- producer_map entries (populated at lines 1140, 1180)
- Other tensors in the fusion graph
- Downstream transform replay operations

The mutation happens before the consumer domain is created, so any code expecting these IDs to retain their definitions will break. The comment mentions this is needed for BestEffortReplay, but the timing and scope of this mutation need validation.
```cpp
// FIXME: this doesn't feel right, we have to mark contiguity on axis(0) as
// `false` to avoid accidental indexing collapsing, this should be figured out
// by indexing from the ceilDiv.
out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
```
logic: FIXME indicates a workaround for broken indexing logic. Setting axis(0) contiguity to false prevents indexing from collapsing dimensions, but this should be automatically inferred from the ceilDiv in the split. This manual workaround masks an underlying indexing bug that could affect other use cases.
Suggested change:

```diff
-// FIXME: this doesn't feel right, we have to mark contiguity on axis(0) as
-// `false` to avoid accidntal indexing collapsing, this should be figured out
-// by indexing from the ceilDiv.
-out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
+// TODO: Remove this workaround once indexing properly handles ceilDiv splits
+out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
```
```cpp
// FIXME: output shape inference is not correct yet.
// output should remain the correct logical size
// EXPECT_EQ(
//     cg_outputs[0].as<at::Tensor>().sizes(), std::vector<int64_t>({512, 9}));
// padding on the inner dimension is represented as stride on the outer
// dimension
EXPECT_EQ(
    cg_outputs[0].as<at::Tensor>().strides(), std::vector<int64_t>({16, 1}));
// We need to slice because output buffer shape is not right
EXPECT_TRUE(t0.equal(cg_outputs[0].as<at::Tensor>().slice(1, 0, k)));
```
logic: shape inference is broken - the output tensor has the wrong size and requires manual slicing to verify correctness. This means production code using non-divisible splits will produce incorrectly-shaped outputs, breaking downstream operations that depend on tensor shapes.
The commented-out assertion at lines 119-121 shows the expected behavior isn't working. This is a critical correctness issue that needs resolution before merge.
Stacked PR

- PR0: #5622 skip aggressive validation check on allocation domain for vectorization
- PR1: #5184 Support Split between logical domain to allocation domain to represent padding <-- this one
This PR

Allows split of an ID on the path logical -> allocation to represent padding logic on allocation. Notably, we no longer require the allocation domain to be on the path between logical -> loop.

Motivation

Split on the allocation domain allows a clean representation for padding, i.e. specifying that dimension i1 would be padded to a multiple of 16.
Main Code Change

In order to support this, we have to update TensorView::cacheBefore. cacheBefore changes the graph from `this` to `producer -> set -> consumer`; previously it kept `this->domain()` on `producer` and replayed from logical to loop onto `consumer`.

Hence this PR changes the cacheBefore logic such that:
- the replay onto `producer` now uses selfReplay (instead of replayCasP);
- `this->domain()` is now preserved on `consumer` after reduction IDs were removed.

Technical Challenges

- Previously the allocation domain was not preserved on the cache at all. One exception where the allocation domain is preserved on the cache is when the cache is sharded. This is because our shape inference, done via ExpressionEvaluator, relies on the allocation domain. Without a proper allocation domain, the reshape call would be applied to the global tensor instead of the local tensor.
- Output shape inference is not yet correct in LogicalAndAllocationSizes. Since this PR is growing in size, I'll fix it in follow-up PRs.