refactor number of groups in layout op #5198
Conversation
Review updated until commit a85c363
We shouldn't have sliced the swizzled & padded data when copying it back to the original buffer. I noticed the issue while trying to validate the layout op in #5198. Unfortunately, this didn't affect the threshold for result validation.
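As a hedged illustration (plain-Python stand-in; the real code operates on device buffers, and the sizes here are made up), the fix amounts to copying back the full padded extent rather than a slice of it:

```python
# Hypothetical illustration of the copy-back fix: the swizzled & padded
# buffer must be copied back in full, not sliced to the logical size.
n, align = 9, 4                      # logical size, padding alignment
padded = -(-n // align) * align      # ceil(9 / 4) * 4 == 12

src = list(range(padded))            # swizzled & padded data (stand-in values)
dst = [0] * padded                   # original (padded) buffer

# Buggy version sliced to the logical extent:  dst[:n] = src[:n]
# Fixed version copies the whole padded extent:
dst[:] = src
```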
csrc/ops/indexing.cpp (Outdated)

```cpp
// set it as IterType::Symbolic to avoid domain validation errors
return IterDomainBuilder(id)
    .extent(padded_ext)
    .iter_type(IterType::Symbolic)
```
What validation are you referring to?
This is coming from `nvfuser::ir_utils::validateDomainEquivalence`.
The error occurs during transformation replay, specifically, during transform propagation.
```
(gdb) f 5
#5  0x0000aaaaabd018fc in nvfuser::TransformReplay::replayCasP (consumer=0xfffc8c00d720, producer=0xfffc8c005fc0,
    producer_pos=2, logical_map=..., opt=...) at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:870
870     TensorDomain* replayed = IrBuilder::createInContainer<TensorDomain>(
(gdb) p consumer->printTransform()
Couldn't find method nvfuser::TensorView::printTransform
(gdb) p consumer->printTransforms()
logical domain : (iS54{i0}, iS55{9})
allocation domain : (iS56{( i0 + ( i3 * 127 ) )}, iS57{( ( 12 / 4 ) * 4 )})
contiguity: t t
loop domain : (iS54{i0}, iS55{9})
$3 = void
(gdb) p producer->printTransforms()
logical domain : (iS39{i0}, iS40{9}, rS11{16})
contiguity: t t n
loop domain : (iS40{9}, rS11{16}, iS39{i0})
$4 = void
(gdb) p new_loop[0]->toString(0)
$5 = "iS55{9}"
(gdb) p new_loop[1]->toString(0)
$6 = "iS54{i0}"
```
The program tried to update the consumer's domain as:

```cpp
870   TensorDomain* replayed = IrBuilder::createInContainer<TensorDomain>(
871       consumer->container(),
872       consumer->getRootDomain(),
873       consumer->getLogicalDomain(),
874       consumer->getAllocationDomain(),
875       new_loop,
876       consumer->domain()->contiguity());
```
It won't work because the allocation domain doesn't match the logical domain. However, by marking the padded allocation domain as symbolic, `ir_utils::compareDomains` wouldn't try to map it, which silences the error.
Here's the full stack:

```
#2  0x0000aaaaab64e5bc in nvfuser::ir_utils::validateDomainEquivalence (
    dom0=std::vector of length 2, capacity 2 = {...}, dom1=std::vector of length 2, capacity 2 = {...},
    additional_ids=std::vector of length 0, capacity 0) at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1163
#3  0x0000aaaaab4ea6e0 in nvfuser::TensorDomain::TensorDomain (this=0xfffc8c037c70, passkey=...,
    root_domain=std::vector of length 0, capacity 0, logical_domain=std::vector of length 0, capacity 0,
    allocation_domain=std::vector of length 0, capacity 0, loop_domain=std::vector of length 0, capacity 0,
    contiguity=std::vector of length 0, capacity 0, additional_ids=std::vector of length 0, capacity 0)
    at /opt/pytorch/nvfuser/csrc/ir/nodes.cpp:3417
#4  0x0000aaaaabd0a36c in nvfuser::IrBuilder::createInContainer<nvfuser::TensorDomain, std::vector<nvfuser::IterDomain*, std::allocator<nvfuser::IterDomain*> > const&, std::vector<nvfuser::IterDomain*, std::allocator<nvfuser::IterDomain*> > const&, std::vector<nvfuser::IterDomain*, std::allocator<nvfuser::IterDomain*> > const&, std::vector<nvfuser::IterDomain*, std::allocator<nvfuser::IterDomain*> >&, std::vector<std::optional<bool>, std::allocator<std::optional<bool> > > const&> (container=0xfffc8c000e30) at /opt/pytorch/nvfuser/csrc/ir/builder.h:46
#5  0x0000aaaaabd018fc in nvfuser::TransformReplay::replayCasP (consumer=0xfffc8c00d720, producer=0xfffc8c005fc0,
    producer_pos=2, logical_map=..., opt=...) at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:870
#6  0x0000aaaaabd0216c in nvfuser::TransformReplay::replayCasP (consumer=0xfffc8c00d720, producer=0xfffc8c005fc0,
    compute_at_axis=2, opt=...) at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:948
#7  0x0000aaaaabd035dc in nvfuser::TransformPropagator::propagateP2C (this=0xfffcd5db9d50, from=0xfffc8c005fc0,
    to=0xfffc8c00d720) at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:1206
#8  0x0000aaaaabc22428 in nvfuser::MaxInfoSpanningTree::traverse (this=0xfffcd5db9d90, propagator=0xfffcd5db9d50)
    at /opt/pytorch/nvfuser/csrc/scheduler/tools/maxinfo_propagator.cpp:158
#9  0x0000aaaaabc612e0 in nvfuser::scheduler_utils::transformPropagateToAllFrom (from_tv=0xfffc8c00f710, pos=2)
    at /opt/pytorch/nvfuser/csrc/scheduler/utils.cpp:1935
#10 0x0000aaaaabc648c8 in nvfuser::scheduler_utils::propagateReshapeTransforms (fusion=0xfffc8c000e30, ca_map=...)
    at /opt/pytorch/nvfuser/csrc/scheduler/utils.cpp:2460
#11 0x0000aaaaabb62fb4 in nvfuser::normalization_scheduler_utils::scheduleReductionGeneral (fusion=0xfffc8c000e30,
    rparams=0xaaaab29a03f0, reduction_tvs=std::vector of length 1, capacity 1 = {...},
    scheduler_type=nvfuser::SchedulerType::InnerPersistent)
    at /opt/pytorch/nvfuser/csrc/scheduler/normalization_utils.cpp:1527
#12 0x0000aaaaabb633c8 in nvfuser::normalization_scheduler_utils::schedulePersistentKernel (fusion=0xfffc8c000e30,
    rparams=0xaaaab29a03f0, scheduler_type=nvfuser::SchedulerType::InnerPersistent)
    at /opt/pytorch/nvfuser/csrc/scheduler/normalization_utils.cpp:1586
#13 0x0000aaaaabaef90c in nvfuser::InnerPersistentKernelScheduler::schedule (this=0xfffc8c01ed20,
    fusion=0xfffc8c000e30, params=0xaaaab29a03f0) at /opt/pytorch/nvfuser/csrc/scheduler/normalization_inner.cpp:1226
```
OK, scatter has a similar problem with the loop domain, so I added the skip_loop_validation flag. I don't quite like the idea of using IterType::Symbolic, as it's meant to indicate that the iteration type is still unknown, which isn't the case here.
Have you considered any other workarounds?
Initially I tried to follow the same route and add skip_validation for setAllocationDomain. The reason I used this hack was a coincidence (i.e. this issue didn't show up until I fixed dynamic_transform in #5345, and while comparing the fusion IR side by side, I realized the symbolic iter type would let me get away with it 😛).
> Have you considered any other workarounds?
IMHO, we shouldn't be validating domain equivalence, since the new direction we are moving towards is trying to genuine support this as a feature. Or rather, we should instead opt-in to run the validation.
sorry about that typo. don't know where that genuine was coming from.
There are certain properties that must hold, and making sure those properties are satisfied seems like a good idea to me.
I agree that validation on properties that must hold is a good idea.
Given that we have now started using the allocation domain to represent padding/swizzle, and similarly the loop domain to represent sparse updates in scatter, I think that's saying that domain equivalence from logical to allocation, as well as from logical to loop, no longer holds, and we shouldn't require the equivalence during transformations.
We can certainly still apply those checks, but as an opt-in for cases where we know such properties hold.
```cpp
auto replayed = TensorDomainBuilder(consumer).loop(new_loop).build();
```
Sounds like a reasonable WAR to try for this.
I think the scatter case where the loop domain on the consumer doesn't converge to the logical domain would still be unhappy with it. Though we haven't got any cases where we replay across a scatter op in the scheduler.
> Given that we now started using allocation domain to represent padding/swizzle, similarly we use loop domain to represent sparse update in scatter. I think that's saying that domain equivalence between logical to allocation, as well as logical to loop, no longer hold and we shouldn't require the equivalency during transformations.
Please think about what we could do to make the overall system more robust rather than just throwing away safety checks. In the case of allocation domains, it seems to me that we should at least check that each iter domain is reachable and symbolically validate that the size of an allocation domain is equal to or larger than that of the logical domain.
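As a numeric sketch of the proposed size check (nvfuser would validate this symbolically over extent expressions; the helper below is hypothetical), padding an extent up to an alignment boundary always yields an allocation extent at least as large as the logical one:

```python
# Hypothetical padded-extent helper: ceil-division padding, e.g. a logical
# extent of 9 pads to 12 with align=4 (cf. the ((12 / 4) * 4) extent above).
def padded_extent(n, align=4):
    return -(-n // align) * align

# The proposed invariant: allocation extent >= logical extent.
for n in range(1, 1000):
    assert padded_extent(n) >= n     # allocation always covers the logical domain
```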
> Sounds like a reasonable WAR to try for this.

To be clear, I didn't consider it just a WAR; I proposed it as a valid design. If something is already valid, why should we need to validate it again?
Got'ya.
> In the case of allocation domains, it seems to me that we should at least check that each iter domain is reachable and symbolically validate that the size of an allocation domain is equal to or larger than that of the logical domain.
Does that mean we would need an ID op connecting the logical domain to the allocation domain (like a resize), rather than relying on an expression directly on the extent, as in our current implementation?
Lines 353 to 379 in 6219364
!test
## Stacked PRs
#5230 moe layer with nvfp4 grouped_mm
#5345 exposing layout op at direct python binding
#5198 refactor number of groups in layout op
#5174 allow layout op in automatic scheduler <-- this PR

## This PR
Allow the scheduler to take `PreprocessGroupedMatmulInputSf` as a pointwise operation using the runtime function. The main code change addresses the assumptions of the runtime function:
- [x] add segmentation for offsets to ensure they are in global memory.
  * The existing assumption is that the two offsets inputs and the output of the layout op are in global memory, so the runtime function can read the entirety of both offsets and write the output via data-dependent indexing. This allows the operation to be treated as a trivial pointwise op.
  * avoid caching layout op outputs or offsets inputs.
  * avoid putting the layout op output into persistent buffers (since we require a write to global memory).
- [x] detect unsafe consumption of `PreprocessGroupedMatmulInputSf` output in `fusion_segmenter.cpp`
- [x] relax asserts on the assumption that there's always a legit path between loop->allocation and logical->allocation in some scheduler utils.

TODO for a future PR:
* end-to-end python test with direct binding.
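A minimal Python sketch of the global-memory assumption described above (the function name, shapes, and padding layout are illustrative, not nvfuser's actual API): with both offsets arrays fully readable, each group's rows land at a data-dependent destination offset, and the op is otherwise a trivial copy.

```python
# Hypothetical stand-in for the layout op's runtime function: both offsets
# arrays are read in their entirety (they live in global memory), and each
# group is written at a data-dependent position -- otherwise a plain copy.
def layout_op(src, group_sizes, in_offsets, out_offsets, out):
    for g, size in enumerate(group_sizes):
        s, d = in_offsets[g], out_offsets[g]
        out[d:d + size] = src[s:s + size]
    return out

# tokens per expert [2, 3, 5]; output groups padded to a multiple of 4,
# so the (made-up) output offsets are [0, 4, 8] with a total size of 16.
src = list(range(10))
out = layout_op(src, [2, 3, 5], [0, 2, 5], [0, 4, 8], [0] * 16)
# out == [0, 1, 0, 0, 2, 3, 4, 0, 5, 6, 7, 8, 9, 0, 0, 0]
```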
@jjsjann123 Can you fix the conflicts to clean up the diffs?
cleaned up now. I wish GitHub would at least try a rebase when the target branch is merged. 🤷
!test

!test
naoyam left a comment:

LGTM. Thanks for the update.
!test
## Stacked PRs
#5230 moe layer with nvfp4 grouped_mm
#5345 exposing layout op at direct python binding <-- this PR
#5198 refactor number of groups in layout op
#5174 allow layout op in automatic scheduler

## This PR
Expose the layout op via the direct python binding. Added an nvfp4 grouped gemm python test.

Minor fixes:
1. ~Added support of allocation domain for the output of the layout op in the concretization pass, to maintain the dependency of the padded allocation domain on its logical domain.~ No longer needed; handled in #5384
2. Skipped validation for `setAllocationDomain`
3. Updated the reference implementation to match the math order in nvfuser's decomposed nvfp4 quantization.

TODO: python tests require the IdModel indexer in order to work. See issue #5200, as well as the suggested WAR in #5200 (comment)
## Stacked PRs
#5230 moe layer with nvfp4 grouped_mm
#5345 exposing layout op at direct python binding
#5198 refactor number of groups in layout op <-- this PR
#5174 allow layout op in automatic scheduler

## This PR
This is a tiny refactor to only expect the two `offsets` to have a size equal to num_groups. The reason is that our cutlass kernel expected that in the first place and I didn't match it correctly the first time.

E.g. with total sequence length 10 and tokens per expert [2, 3, 5]: previously the offsets would be [0, 2, 5, 10]; after the refactor, the offsets are [0, 2, 5].
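The two offsets conventions can be sketched in Python (helper names are illustrative, not from the codebase):

```python
def offsets_inclusive(sizes):
    # previous convention: num_groups + 1 entries, ending with the total
    out, acc = [0], 0
    for s in sizes:
        acc += s
        out.append(acc)
    return out

def offsets(sizes):
    # refactored convention: exactly num_groups entries, the start of each group
    return offsets_inclusive(sizes)[:-1]

offsets_inclusive([2, 3, 5])  # [0, 2, 5, 10]
offsets([2, 3, 5])            # [0, 2, 5]
```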