Review comment on `tests/python/test_moe.py` (outdated):
```python
#from thunder.torch.custom_op import _register_nvfuser_translator
#_register_nvfuser_translator(_sym_of_nvfp4_scaled_grouped_mm, gmm_nvfuser)
```
Nit-picking: this function just calls one function twice --

```python
from thunder.executors.nvfuserex_impl import register_supported
from thunder.executors.torchex import _always_executable

register_supported(symbol, translator_for_nvfuser, checker or _always_executable)
register_supported(symbol.id, translator_for_nvfuser, checker or _always_executable)
```

I'm taking a bit of time cleaning up the Lightning-AI/lightning-thunder#2481 tests. To reduce the number of cherry-picks, just calling `register_supported` here directly might make things easier.
(force-pushed: 39fa2b3 → b3fbe15, c4b9629 → 529149d, 57a0da9 → 80888d9, 64d1651 → 79e1f8c, 79e1f8c → a110e89)

!test

(force-pushed: a110e89 → a9821a0)

!test

(force-pushed: 4c2e30a → c342568, c63a9f6 → 5ff4e24)
I'm undecided on how I want to proceed with the tests, but the other, smaller pieces should be good for review.
Fixing tests, and also fixing GatherScatter iter type propagation. At least it's finally running, even though with a wrong result, so I can run gdb. I.e.: revert me to repro the loop graph issue I encountered earlier.
(force-pushed: bc84003 → 210bbe5, 17a3e07 → 6fd0038)
I'll change the target and refactor this to only handle nvfp4 via direct binding, since @Priya2698 is cherry-picking the propagation in #5365.
(force-pushed: de4645d → b4a3a2f)
## Stacked PRs

- #5230 moe layer with nvfp4 grouped_mm
- #5345 exposing layout op at direct python binding
- #5198 refactor number of groups in layout op <-- this PR
- #5174 allow layout op in automatic scheduler

## This PR

This is a tiny refactor so that the two `offsets` are expected to have size equal to num_groups. Our cutlass kernel expected that in the first place, and I didn't match it correctly the first time.

E.g. with total sequence length 10 and tokens per expert [2, 3, 5]: previously the offsets would be [0, 2, 5, 10]; after the refactor, they are [0, 2, 5].
(force-pushed: b4a3a2f → fe8da62)
Cherry-picked from #5230.

The packed fp4 dtype needs to be supported by the python API in order to support framework integration. FusionDefinition does not expect a packed dtype, but since that is the only fp4 dtype supported by the framework, our integration still needs to support it. This PR adds a quick translation in `FusionDefinition.define_tensor` that converts the packed dtype into the unpacked dtype, keeping the WAR transparent to the integration/user.
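The shape of that WAR can be sketched as a dtype-translation step at tensor definition time. This is an assumption-laden illustration, not nvFuser's actual code: the dtype names (modeled on PyTorch's packed `float4_e2m1fn_x2`) and the mapping table are stand-ins for whatever enum values the real `FusionDefinition.define_tensor` handles.

```python
# Hypothetical mapping from the framework-facing packed fp4 dtype to the
# unpacked dtype the fusion IR works with. Names are illustrative.
PACKED_TO_UNPACKED = {"float4_e2m1fn_x2": "float4_e2m1fn"}

def translate_dtype(dtype):
    # Packed dtypes are rewritten on the way in; everything else passes
    # through unchanged, so the WAR is invisible to callers.
    return PACKED_TO_UNPACKED.get(dtype, dtype)
```

The point of doing the translation inside `define_tensor` is that integrations keep handing in the framework's packed dtype and never see the unpacked representation.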
## Stacked PRs

- #5230 moe layer with nvfp4 grouped_mm
- #5345 exposing layout op at direct python binding <-- this PR
- #5198 refactor number of groups in layout op
- #5174 allow layout op in automatic scheduler

## This PR

Exposes the layout op in the python direct bindings and adds an nvfp4 grouped gemm python test.

Minor fixes:
1. ~Added support for an allocation domain on the layout op's output in the concretization pass, to maintain the dependency of the padded allocation domain on its logical domain.~ No longer needed; handled in #5384.
2. Skipped validation for `setAllocationDomain`.
3. Updated the reference implementation to match the math order in nvfuser's decomposed nvfp4 quantization.

TODO: the python tests require the IdModel indexer in order to work. See issue #5200, as well as the WAR suggested in #5200 (comment).
All code changes have been merged in separate PRs; only test_moe.py is being updated in this PR. I'll clean it up and request a sanity check afterwards.
## Stacked PRs

- #5230 moe layer with nvfp4 grouped_mm
- #5345 exposing layout op at direct python binding
- #5198 refactor number of groups in layout op
- #5174 allow layout op in automatic scheduler <-- this PR

## This PR

Allows the scheduler to take `PreprocessGroupedMatmulInputSf` as a pointwise operation using the runtime function. The main code change addresses the assumptions of the runtime function:

- [x] Add segmentation for offsets to ensure they are in global memory.
  - The existing assumption is that the two offsets inputs and the output of the layout op are in global memory, where the runtime function can read both offsets in their entirety and write the output via data-dependent indexing. This allows the operation to be treated as a trivial pointwise op.
  - Avoids caching the layout op's outputs or offsets inputs.
  - Avoids putting the layout op's output into persistent buffers (since we require a write to global memory).
- [x] Detect unsafe consumption of the PreprocessGroupedMatmulInputSf output in `fusion_segmenter.cpp`.
- [x] Relax asserts in some scheduler utils that assume there is always a legit path between loop->allocation and logical->allocation.

TODOs for a future PR:
- End-to-end python test with direct bindings.
## Stacked PRs

- #5230 moe layer with nvfp4 grouped_mm <-- this PR
- #5345 exposing layout op at direct python binding
- #5198 refactor number of groups in layout op
- #5174 allow layout op in automatic scheduler

## This PR

Fixes that I want reviewed:

The changes to test_moe.py are just a proof-of-concept.

Note: this has the indexing issue in #5200. This branch also requires some in-flight thunder PRs.