Conversation
Local test failures are encountered here: I suspect most to be test-related failures; certain tests are likely expecting the output allocation to be laid out in a certain way that's now violated by this change. I'll try to clean them up a bit.
EXPECT_THAT(getAllocationDomainPermutation(tv3), ElementsAre(3, 1, 0, 2));
}
// TODO: open an issue. seems to hit an assert in IdModel(&fusion)
// {
Note to self: verify this with ToT main. I'm guessing it's just some IdModel config that I wasn't using properly.
SGTM
Doesn't transpose mean we cannot have a total order on
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
std::vector<IterDomain*> mapped_id_vec;
std::unordered_set<IterDomain*> mapped_id_set;

// logic to preserve reduction iter domain in target to WAR issue #2202
Is this going to be removed once #2202 is addressed?
Yes, that's the plan.
!build --diff

!build --diff

!build --diff
jit_thunder_tests on A100 seems to be reporting a kernel with the wrong arch. I don't think that one is coming from this PR, and I don't see a CI nightly with a Thunder failure. cc'ing @xwang233
Somehow that test job got a T4 GPU instead of an A100. Need to investigate that. Please restart the CI jobs if needed.
!build
I addressed all the issues in the comments. I'll merge the PR after CI clears again, since I already got a stamp earlier. Please do block the merge if you have further concerns. Regarding the comments on the propagation rule for reshape, let's move the discussion to #2235.
Thunder failure isn't related. I'm merging this.
Fixes a subtle [bug](https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/92948751), exposed by #2168
This reverts commit 8c18701.
Refactored the allocation order inference pass: it tries to keep the allocation order of `dsts` closer to the `srcs`, to simplify scheduling as well as facilitate vectorization. It works roughly as:
- for each `dst`, among all its producers in `srcs`, we find the one with the most loop iter domains in its allocation domain as the reference `ref`;
- we map `dst`'s rfactor domain to `ref`'s allocation domain and push the mapped iter domains as the inner dimensions in `dst`'s new allocation domain, while pushing unmapped iter domains as outer dimensions.
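The mapping step described above can be sketched on plain integer ids. This is a hypothetical simplification: the real pass works on `IterDomain*` via IdModel mapping, and `inferAllocationOrder` here is an illustrative name, not the pass's actual API.

```cpp
#include <unordered_set>
#include <vector>

// Illustrative sketch (not the real nvFuser implementation):
// `dst_domain` holds dst's rfactor iter-domain ids in logical order;
// `ref_alloc` holds ref's allocation-ordered ids (inner-most last).
// Ids present in both are "mapped". Mapped ids are pushed to the inner
// (right-most) positions of dst's new allocation order, preserving ref's
// relative order; unmapped ids stay outer, preserving dst's own order.
std::vector<int> inferAllocationOrder(
    const std::vector<int>& dst_domain,
    const std::vector<int>& ref_alloc) {
  std::unordered_set<int> dst_ids(dst_domain.begin(), dst_domain.end());

  // Collect mapped ids in ref's allocation order; these become the inner dims.
  std::vector<int> inner;
  for (int id : ref_alloc) {
    if (dst_ids.count(id)) {
      inner.push_back(id);
    }
  }
  std::unordered_set<int> mapped(inner.begin(), inner.end());

  // Unmapped ids go first (outer), in dst's original order.
  std::vector<int> result;
  for (int id : dst_domain) {
    if (!mapped.count(id)) {
      result.push_back(id);
    }
  }
  // Then the mapped ids (inner).
  result.insert(result.end(), inner.begin(), inner.end());
  return result;
}
```

For example, with `dst_domain = {0, 1, 2, 3}` and a reference whose allocation order is `{2, 0}`, the inferred order is `{1, 3, 2, 0}`: dims 1 and 3 (unmapped) stay outer, while 2 and 0 are pushed inner in the reference's order.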