Exposing layout op at direct python binding #5345
Conversation
Review updated until commit ce77b62
Force-pushed from e749d3f to 17dd08a
Force-pushed from 4c2e30a to c342568
!test
csrc/dynamic_transform.cpp (Outdated)

```cpp
    logical_dom, layout_op->g(), layout_op->layout());
// skip validation because allocation domain doesn't converge to logical
// domain.
out_tv->domain()->setAllocationDomain(alloc_dom, true, true);
```
tagging @jacobhinkle since I'm touching concretization.
## Stacked PRs
- #5230 moe layer with nvfp4 grouped_mm
- #5345 exposing layout op at direct python binding
- #5198 refactor number of groups in layout op
- #5174 allow layout op in automatic scheduler <-- this PR

## This PR
Allow the scheduler to take `PreprocessGroupedMatmulInputSf` as a pointwise operation using the runtime function. The main code change addresses the assumptions of the runtime function:
- [x] Add segmentation for offsets to ensure they are in global memory.
  * The existing assumption is that the two offsets inputs and the output of the layout op are in global memory, so the runtime function can read both offsets in their entirety and write the output via data-dependent indexing. This allows the operation to be treated as a trivial pointwise op.
  * Avoids caching layout op outputs or offsets inputs.
  * Avoids putting the layout op output into persistent buffers (since we require a write to global memory).
- [x] Detect unsafe consumption of the `PreprocessGroupedMatmulInputSf` output in `fusion_segmenter.cpp`.
- [x] Relax asserts that assume there is always a legitimate path between loop->allocation and logical->allocation in some scheduler utils.

TODOs for a future PR:
* End-to-end python test with direct binding.
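The global-memory assumption above can be illustrated with a pure-Python reference sketch (the function name, signature, and offsets layout here are assumptions for illustration, not the actual runtime function): both offsets arrays are read in their entirety, and each output row's destination is computed from runtime offset values, i.e. data-dependent indexing.

```python
def layout_op_reference(sf, in_offsets, out_offsets, out_rows):
    """Hypothetical reference sketch: scatter each group's scale-factor
    rows into a padded output buffer using data-dependent indexing
    driven by the two offsets arrays (both assumed resident in global
    memory and read in full)."""
    num_groups = len(in_offsets)
    cols = len(sf[0])
    out = [[0] * cols for _ in range(out_rows)]
    # each group's input extent ends where the next group begins
    in_ends = list(in_offsets[1:]) + [len(sf)]
    for g in range(num_groups):
        n = in_ends[g] - in_offsets[g]
        for i in range(n):
            # destination row depends on runtime offset values,
            # i.e. data-dependent indexing into the output
            out[out_offsets[g] + i] = sf[in_offsets[g] + i]
    return out
```

Because every element's write location is a pure function of the (fully readable) offsets, each output element can be produced independently, which is what lets the scheduler treat the op as pointwise.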
Force-pushed from 17dd08a to f669051
Force-pushed from bc84003 to 210bbe5
```diff
   std::vector<IterDomain*> new_allocation_domain,
-  std::vector<std::optional<bool>> new_contiguity) {
+  std::vector<std::optional<bool>> new_contiguity,
+  bool skip_validation) {
```
I wonder if we still need this, since we already marked allocation domain as symbolic now.
!test
Force-pushed from de4645d to b4a3a2f
!test
```python
# FIXME: force indexing to use IdModel indexer to avoid indexing error.
# see issue: https://github.com/NVIDIA/Fuser/issues/5200
with set_env(NVFUSER_ENABLE="id_model(all)"):
```
this doesn't seem to work when we run multiple tests together, since we cache the env variable. I may need to change this to the scope of this file... or just skip this test.
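For context, a minimal sketch of what a `set_env` helper like the one above might look like (an assumed implementation, not nvFuser's actual utility). It only rewrites `os.environ` for the duration of the `with` block; if the consumer reads the variable once and caches it, restoring the environment afterwards has no effect, which is the caching problem described above.

```python
import os
from contextlib import contextmanager

@contextmanager
def set_env(**kwargs):
    """Hypothetical sketch: temporarily set environment variables,
    restoring the previous values (or absence) on exit."""
    old = {k: os.environ.get(k) for k in kwargs}
    os.environ.update(kwargs)
    try:
        yield
    finally:
        for k, v in old.items():
            if v is None:
                os.environ.pop(k, None)  # was unset before: remove it
            else:
                os.environ[k] = v  # was set before: restore old value
```

Note the restoration only affects future `os.environ` reads; any value already cached by the process (e.g. at module import) is untouched.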
This is split off from #5345, so I don't have a specific repro for any incorrect behavior that this fixes. Previously, we only traversed all producers of the loop domain in IterVisitor::traverseBetween. That is a problem in cases where we schedule like a producer of a reshape, or in exotic cases like #5345 where the domains are disconnected. This PR ensures that we traverse every ID in the TensorDomain regardless of the relations between the domains contained within. Note that it calls TensorDomain::allIDs when getting the "next" statements, which will do a redundant topological sort.

Co-authored-by: jjsjann123 <jiej@nvidia.com>
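The disconnected-domain problem above can be shown with a toy backward traversal (not nvFuser's IterVisitor; node names and edges are made up). Seeding only from the loop domain misses IDs in a separate component; seeding from every ID in the domain does not.

```python
def reachable_producers(edges, start_ids):
    """Toy sketch: collect every node reachable backward from
    start_ids via producer edges (edges: id -> list of producer ids)."""
    seen, stack = set(), list(start_ids)
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(edges.get(n, []))
    return seen

# Two disconnected components: loop0 -> logical0, and alloc0 -> other0.
edges = {"loop0": ["logical0"], "alloc0": ["other0"]}
all_ids = ["loop0", "logical0", "alloc0", "other0"]

only_from_loop = reachable_producers(edges, ["loop0"])   # misses other0
from_all_ids = reachable_producers(edges, all_ids)       # covers everything
```

Starting from all IDs trades a redundant topological sort for the guarantee that no component of the TensorDomain is skipped.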
## Stacked PRs
- #5230 moe layer with nvfp4 grouped_mm
- #5345 exposing layout op at direct python binding
- #5198 refactor number of groups in layout op <-- this PR
- #5174 allow layout op in automatic scheduler

## This PR
This is a tiny refactor to expect the two `offsets` to have size equal to num_groups. The reason is that our cutlass kernel expected that in the first place and I didn't match it correctly the first time.

E.g. with total sequence length 10 and tokens per expert [2, 3, 5]: previously the offsets would be [0, 2, 5, 10]; after the refactor, the offsets are [0, 2, 5].
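The offsets change above amounts to an exclusive prefix sum truncated to `num_groups` entries (the trailing total is dropped). A plain-Python illustration, with a hypothetical helper name:

```python
from itertools import accumulate

def group_offsets(tokens_per_expert):
    """Hypothetical illustration: exclusive prefix sum over group
    sizes, keeping only num_groups entries (no trailing total)."""
    return list(accumulate([0] + tokens_per_expert[:-1]))

# tokens per expert [2, 3, 5], total sequence length 10:
# previously [0, 2, 5, 10]; after the refactor, [0, 2, 5]
offsets = group_offsets([2, 3, 5])
```

The dropped last entry is recoverable as the total sequence length, which the kernel already knows.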
This reverts commit eba5fe1.
Force-pushed from b4a3a2f to fe8da62
Had to rebase & force push to avoid resolving all conflicts from the base change.

!test

Everything looks good except the env var thing, which isn't working properly. I might change that to a standalone python test, just so it doesn't mess with the other tests.

!test

!test
## Stacked PRs
- #5230 moe layer with nvfp4 grouped_mm
- #5345 exposing layout op at direct python binding <-- this PR
- #5198 refactor number of groups in layout op
- #5174 allow layout op in automatic scheduler

## This PR
Expose the layout op in the python direct binding. Added an nvfp4 grouped gemm python test.

Minor fixes:
1. ~Added support of allocation domain for the output of the layout op in the concretization pass, to maintain the dependency of the padded allocation domain on its logical domain.~ No longer needed; handled in #5384.
2. Skipped validation for `setAllocationDomain`.
3. Updated the reference implementation to match the math order in nvfuser's decomposed nvfp4 quantization.

TODO: python tests require the IdModel indexer in order to work. See issue #5200, as well as the suggested WAR in #5200 (comment).