Remove LdMatrixTranspose type from LoadStoreOp types#2315

Merged
protonu merged 12 commits into main from pbasu_remove_ldmatrixtranspose on Jun 1, 2024
Conversation


@protonu protonu commented May 29, 2024

This PR modifies how we determine whether to use an LdMatrix or an LdMatrixTranspose.

We look at the (maybe) allocation domains of the producer and consumer of the load-store op to determine whether a transpose is required: we take the innermost dimension of the consumer's allocation domain and check whether it maps to the innermost dimension of the producer's allocation domain. If it does not map, we transpose.
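The decision rule above can be sketched as follows. This is a minimal, hypothetical model in Python, not nvFuser's actual C++ API: the domain lists and the `id_map` (mapping consumer iteration-domain IDs to producer IDs) are illustrative stand-ins for nvFuser's IterDomain graph.

```python
# Hedged sketch: choosing LdMatrix vs LdMatrixTranspose from allocation domains.
# Domains are ordered lists of dimension IDs, innermost last. The id_map maps
# consumer dimension IDs to producer dimension IDs; both are hypothetical
# stand-ins for nvFuser's internal ID mapping, used here only for illustration.

def needs_ldmatrix_transpose(producer_alloc, consumer_alloc, id_map):
    """Return True if the consumer's innermost allocation dimension does NOT
    map to the producer's innermost allocation dimension."""
    consumer_inner = consumer_alloc[-1]
    producer_inner = producer_alloc[-1]
    return id_map.get(consumer_inner) != producer_inner

# Consumer innermost (cK) maps to producer innermost (pK): plain LdMatrix.
id_map = {"cM": "pM", "cK": "pK"}
print(needs_ldmatrix_transpose(["pM", "pK"], ["cM", "cK"], id_map))  # False

# Consumer innermost (cM) maps to the producer's outer dim: transpose needed.
print(needs_ldmatrix_transpose(["pM", "pK"], ["cK", "cM"], id_map))  # True
```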

@protonu protonu requested a review from zasdfgbnm May 29, 2024 21:12

protonu commented May 29, 2024

!build


protonu commented May 30, 2024

!build

@protonu protonu force-pushed the pbasu_remove_ldmatrixtranspose branch from 1ce518a to ddc169f Compare May 31, 2024 00:05

protonu commented May 31, 2024

Terribly sorry for the force push - I had gotten myself into a mess.


protonu commented May 31, 2024

!build

@protonu protonu requested a review from kevinstephano May 31, 2024 00:06

@jacobhinkle jacobhinkle left a comment


LGTM but I'll leave it to @zasdfgbnm to give the final approval.


@zasdfgbnm zasdfgbnm left a comment


A few comments on the clarity of the code comments. This code should be working.


protonu commented May 31, 2024

!build

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>

protonu commented May 31, 2024

!build

Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>

protonu commented May 31, 2024

!build

@protonu protonu merged commit fe1ea2c into main Jun 1, 2024
@protonu protonu deleted the pbasu_remove_ldmatrixtranspose branch June 1, 2024 15:20
protonu added a commit that referenced this pull request Jun 4, 2024
…eduler (#2309)

In this PR we extend the matmul scheduler to support inputs with allocation domains.

To the fusion (with inputs tv_a and tv_b), we add two LoadStoreOps to both inputs. The first op loads to shared memory, where we propagate the allocation domain. The second op reads into registers, where we do not propagate the allocation domain, since the scheduler takes charge of setting the allocation domain in registers. Based on the difference in the (maybe) allocation domains of the producer and consumer of the second LoadStoreOp, we may do a transposed load when reading into registers.


![image](https://github.com/NVIDIA/Fuser/assets/10635897/89395990-9b85-4ce1-8e7d-006e43a86b85)


See also #2315.
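The two-stage load in #2309 can be modeled with a small sketch. This is a hypothetical Python illustration, not nvFuser's actual C++ API: the `TensorView` class, `cache_to_smem`, and `cache_to_regs` names are invented for this example, standing in for the two LoadStoreOps described above.

```python
# Hedged sketch of the two-stage load: global -> shared memory (allocation
# domain propagated), then shared memory -> registers (scheduler picks the
# register allocation domain). All names here are illustrative, not nvFuser API.

class TensorView:
    def __init__(self, name, alloc_domain):
        self.name = name
        self.alloc_domain = alloc_domain  # ordered dims, innermost last

def cache_to_smem(tv):
    # First LoadStoreOp: the producer's allocation domain is propagated.
    return TensorView(tv.name + "_smem", list(tv.alloc_domain))

def cache_to_regs(tv, scheduler_domain):
    # Second LoadStoreOp: the scheduler sets the register allocation domain
    # instead of propagating the producer's.
    return TensorView(tv.name + "_regs", list(scheduler_domain))

tv_a = TensorView("tv_a", ["K", "M"])        # input with explicit allocation domain
a_smem = cache_to_smem(tv_a)                  # inherits ["K", "M"]
a_regs = cache_to_regs(a_smem, ["M", "K"])    # scheduler wants ["M", "K"]

# Innermost dims of producer (smem) and consumer (regs) differ, so this read
# would use a transposed load (the LdMatrixTranspose case from #2315).
transposed = a_smem.alloc_domain[-1] != a_regs.alloc_domain[-1]
print(transposed)  # True
```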
zasdfgbnm added a commit that referenced this pull request Jun 5, 2024
This PR modifies how we determine whether to use an LdMatrix or an LdMatrixTranspose.

We look at the (maybe) allocation domains of the producer and consumer of the load-store op to determine whether a transpose is required: we take the innermost dimension of the consumer's allocation domain and check whether it maps to the innermost dimension of the producer's allocation domain. If it does not map, we transpose.

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
zasdfgbnm pushed a commit that referenced this pull request Jun 5, 2024
…eduler (#2309)