Conversation
!test
Review updated until commit fab8823
csrc/scheduler/registry.cpp (outdated)
```cpp
}

// check PreprocessGroupedMatmulInputSf's output is in global memory by
// forcing it as a fusion segment output
```
@naoyam I think I should have done a memory promotion instead of forcing a segment. But since the consumer is in a separate kernel anyway, I don't think the two alternatives have any meaningful difference.
```cpp
scheduler_debug_utils::canScheduleRejectReason(
    scheduler_type, "Fusion has consumer of non indexable ops.");
return false;
}
```
Note for myself: I think I need to do the same for offsets, promoting their memory type to shared/global so each thread can access the entire offsets array in case they are computed within the kernel.
I'm not sure if sync analysis during lowering would have done that for me automatically.
## Stacked PRs

Follow-up PR on enabling the python API and updating test_moe.py is still being cleaned up.

#5174 allow layout op in automatic scheduler
#5185 Fix allocation logic: unconnected alloc/logical <- this one

## This PR

Fixes allocation logic to ensure that the output tensor has:
1. shape matching its logical domain;
2. buffer size matching the allocation domain.

Without this PR, the output tensor from `PreprocessGroupedMatmulInputSf` will have a shape mismatching its logical domain, causing validation failure in downstream consumers.

### Context

The `PreprocessGroupedMatmulInputSf` op has:
1. unconnected logical and allocation domains;
2. a larger allocation size, because extra padding is represented via arithmetic operations on the extent directly.

The existing allocation logic allocates a buffer matching the logical sizes/strides. This is not the right behavior, because the allocation domain could have a larger extent. We cannot use the allocation sizes/strides either, because consumers of the tensor expect a tensor matching the logical size.

We updated the logic to use the allocation domain for buffer allocation, then slice into the buffer using the logical domain to produce the correct-sized output. For `PreprocessGroupedMatmulInputSf`, because there's no correct way to slice into the buffer for indexing, we give up on producing correct strides and just use a naive stride instead. This is safe, since we are not using indexing logic on the output.

### Code change

1. refactor buffer allocation to use the allocation domain instead of the logical domain;
2. fix the projection from allocation to logical on the special path when projection is not possible: we now compute the correct extent instead of returning the allocation buffer as-is. This allows the layout op to return a tensor with the correct logical size, while still allocating a buffer large enough to accommodate the padding requirement.
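The allocate-by-allocation-domain, expose-by-logical-domain idea can be sketched in plain Python. This is only an illustration with made-up shapes, not nvFuser code:

```python
# Illustrative sketch (hypothetical shapes, not nvFuser code): allocate the
# buffer using the padded allocation extents, but expose only the logical
# extents to consumers.
logical_shape = (10, 16)  # what downstream consumers expect
alloc_shape = (12, 16)    # padded allocation domain (larger extent)

# 1. Allocate by the allocation domain, so the padding fits.
buf = bytearray(alloc_shape[0] * alloc_shape[1])

# 2. "Naive" contiguous strides over the logical extents: element (r, c)
#    lives at offset r * logical_shape[1] + c. Safe even without a correct
#    allocation->logical projection, because the allocated buffer is larger
#    than the logical tensor needs.
def naive_offset(r, c):
    return r * logical_shape[1] + c

last = naive_offset(logical_shape[0] - 1, logical_shape[1] - 1)
assert last < len(buf)  # every logical element stays inside the allocation
```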
!test

!test
csrc/fusion_segmenter.cpp (outdated)
```cpp
for (auto inp : getAllInputs(sg)) {
  auto clone_tv = complete_to_segment_map.clone(inp);
  fusion_segment->addInput(clone_tv);
  if (inp->isDefinitionType<ReshapeOp>()) {
```
Do we still need this branch?
I don't see why we need it, and it doesn't have an error message. So my gut feeling is to just remove it, since the assert is not adding more debugging insight.
I'm guessing there was something here for reshape. It's deleted but not completely.
```cpp
NVF_ERROR(clone_tv != nullptr && clone_tv->isA<TensorView>());
view_tvs.push_back(clone_tv->as<TensorView>());
} else if (inp->isDefinitionType<PreprocessGroupedMatmulInputSf>()) {
  // There's no point of replaying allocation domain if we cannot index into
```
Oh, yes, this is an input.
Is there any reason to have the assertion at line 1896 there, not here?
Also, where do you have a problem with TransformReplay? If the segment just contains the CUTLASS grouped MMAOp, where does TransformReplay bite us?
csrc/scheduler/registry_utils.h (outdated)
```cpp
static bool hasResizeAndIndexOps(Fusion* fusion);

// Checks if fusion contains illegal non-indexable ops. E.g. for
```
I see. The name sounds a little weird to me. We already have `rejectScheduleFusionInputRequirement`. Why not just add something similar for fusion outputs and use that?
!test
csrc/scheduler/registry_utils.cpp (outdated)
```cpp
bool rejectScheduleFusionOutputRequirement(
    Expr* expr,
    SchedulerType scheduler_type) {
  TensorView* out = ir_utils::getTvOutput(expr);
```
Don't all outputs need to be checked?
Since we are only using it for the layout op, which has only a single output TV, I didn't bother doing that.

Conceptually, similar to inputs, we might only want to check that certain outputs stay in global memory. I would argue that we should have:

```cpp
bool rejectScheduleFusionOutputRequirement(
    Expr* expr,
    Val* val,
    SchedulerType scheduler_type);
```
Yeah, I think the consistent interface would look better, so +1 for `rejectScheduleFusionOutputRequirement(Expr* expr, Val* val, SchedulerType scheduler_type)`.
We fuse it to its producers (i.e. the quantize math that produces the blocked scaling factor).
!test |
|
!test |
|
Failures are not related. merging as-is. |
## Stacked PRs

#5230 moe layer with nvfp4 grouped_mm
#5345 exposing layout op at direct python binding
#5198 refactor number of groups in layout op <-- this PR
#5174 allow layout op in automatic scheduler

## This PR

This is a tiny refactor to expect the two `offsets` to have size equal to num_groups. The reason is that our cutlass kernel was expecting that in the first place and I didn't match it correctly the first time.

E.g. with a total sequence length of 10 and tokens per expert [2, 3, 5], previously the offsets would be [0, 2, 5, 10]; after the refactor, the offsets are [0, 2, 5].
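The size change can be illustrated with a quick plain-Python sketch, using the example numbers from the description above (not the actual kernel code):

```python
# Exclusive prefix sums of tokens-per-expert: one entry per group after the
# refactor, versus num_groups + 1 entries (with the total appended) before.
tokens_per_expert = [2, 3, 5]  # total sequence length 10

def exclusive_offsets(counts):
    out, acc = [], 0
    for c in counts:
        out.append(acc)
        acc += c
    return out

new_offsets = exclusive_offsets(tokens_per_expert)
old_offsets = new_offsets + [sum(tokens_per_expert)]

assert old_offsets == [0, 2, 5, 10]  # previous convention
assert new_offsets == [0, 2, 5]      # what the cutlass kernel expects
assert len(new_offsets) == len(tokens_per_expert)  # == num_groups
```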
## Stacked PRs

#5230 moe layer with nvfp4 grouped_mm
#5345 exposing layout op at direct python binding <-- this PR
#5198 refactor number of groups in layout op
#5174 allow layout op in automatic scheduler

## This PR

Expose the layout op in the python direct bindings. Added nvfp4 grouped gemm in a python test.

Minor fixes:
1. ~Added support of allocation domain for output of layout op in concretization pass to maintain the dependency on padded allocation domain to its logical domain.~ No longer needed, handled in #5384
2. Skipped validation for `setAllocationDomain`
3. Updated the reference implementation to match the math order in the nvfuser decomposed nvfp4 quantization.

TODO: python tests require the IdModel indexer in order to work. See issue #5200, as well as the suggested WAR in #5200 (comment)
## Stacked PRs

#5230 moe layer with nvfp4 grouped_mm
#5345 exposing layout op at direct python binding
#5198 refactor number of groups in layout op
#5174 allow layout op in automatic scheduler <-- this PR

## This PR

Allow the scheduler to take `PreprocessGroupedMatmulInputSf` as a pointwise operation using the runtime function. The main code change addresses the assumptions of the runtime function:

- [x] add segmentation for offsets to ensure they are in global memory.
  * The existing assumption is that the two offsets inputs and the output of the layout op are in global memory, where the runtime function can read the entirety of both offsets and write the output via data-dependent indexing. This allows the operation to be treated as a trivial pointwise op.
  * avoids caching layout op outputs or offsets inputs.
  * avoids putting the layout op output into persistent buffers (since we require writes to global memory).
- [x] detect unsafe consumption of the `PreprocessGroupedMatmulInputSf` output in `fusion_segmenter.cpp`
- [x] relax asserts on the assumption that there's always a legit path between loop->allocation and logical->allocation in some scheduler utils.

TODOs for future PR:
* end-2-end python test with direct binding.
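As a rough illustration of why the offsets must be readable in full (and hence live in global memory): the group owning a given row is found by a data-dependent search over the whole offsets array. A plain-Python sketch with example values, not the actual runtime function:

```python
import bisect

offsets = [0, 2, 5]  # per-group start rows (example values)
total_rows = 10

def group_of(row):
    # Data-dependent lookup: index of the rightmost offset <= row. Every
    # "thread" doing this needs access to the entire offsets array, which
    # is why the offsets are forced to global memory rather than cached
    # per-thread.
    return bisect.bisect_right(offsets, row) - 1

groups = [group_of(r) for r in range(total_rows)]
assert groups == [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
```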