Remove LdMatrixTranspose type from LoadStoreOp types#2315

Merged
protonu merged 12 commits into main from pbasu_remove_ldmatrixtranspose on Jun 1, 2024
Conversation


@protonu protonu commented May 29, 2024

This PR modifies how we determine whether to use an LdMatrix or an LdMatrixTranspose.

We look at the (maybe) allocation domains of the producer and consumer of the load-store op to determine whether a transpose is required: we take the innermost dimension of the consumer's allocation domain and check whether it maps to the innermost dimension of the producer's allocation domain. If it does not map, we transpose.
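The decision rule above can be sketched as follows. This is a minimal, hypothetical model in Python, not nvFuser's actual C++ API: the domain lists and the `id_map` (mapping consumer iteration-domain IDs to producer IDs) are illustrative stand-ins for nvFuser's IterDomain graph.

```python
# Hedged sketch: choosing LdMatrix vs LdMatrixTranspose from allocation domains.
# Domains are ordered lists of dimension IDs, innermost last. The id_map maps
# consumer dimension IDs to producer dimension IDs; both are hypothetical
# stand-ins for nvFuser's internal ID mapping, used here only for illustration.

def needs_ldmatrix_transpose(producer_alloc, consumer_alloc, id_map):
    """Return True if the consumer's innermost allocation dimension does NOT
    map to the producer's innermost allocation dimension."""
    consumer_inner = consumer_alloc[-1]
    producer_inner = producer_alloc[-1]
    return id_map.get(consumer_inner) != producer_inner

# Consumer innermost (cK) maps to producer innermost (pK): plain LdMatrix.
id_map = {"cM": "pM", "cK": "pK"}
print(needs_ldmatrix_transpose(["pM", "pK"], ["cM", "cK"], id_map))  # False

# Consumer innermost (cM) maps to the producer's outer dim: transpose needed.
print(needs_ldmatrix_transpose(["pM", "pK"], ["cK", "cM"], id_map))  # True
```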

@protonu protonu requested a review from zasdfgbnm May 29, 2024 21:12

protonu commented May 29, 2024

!build


protonu commented May 30, 2024

!build

@protonu protonu force-pushed the pbasu_remove_ldmatrixtranspose branch from 1ce518a to ddc169f Compare May 31, 2024 00:05

protonu commented May 31, 2024

Terribly sorry for the force push - I had gotten myself into a mess.


protonu commented May 31, 2024

!build

@protonu protonu requested a review from kevinstephano May 31, 2024 00:06

@jacobhinkle jacobhinkle left a comment


LGTM but I'll leave it to @zasdfgbnm to give the final approval.


@zasdfgbnm zasdfgbnm left a comment


A few comments on the clarity of the code comments. This code should be working.


protonu commented May 31, 2024

!build

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>

protonu commented May 31, 2024

!build

Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>

protonu commented May 31, 2024

!build

@protonu protonu merged commit fe1ea2c into main Jun 1, 2024
@protonu protonu deleted the pbasu_remove_ldmatrixtranspose branch June 1, 2024 15:20
protonu added a commit that referenced this pull request Jun 4, 2024
…eduler (#2309)

In this PR we extend the matmul scheduler to support inputs with allocation domains.

To the fusion (with inputs tv_a and tv_b), we add two LoadStoreOps to both inputs. The first op loads to shared memory, where we propagate the allocation domain. The second op reads into registers, where we do not propagate the allocation domain, since the scheduler takes charge of setting the allocation domain in registers. Based on the difference in the (maybe) allocation domains of the producer and consumer of the second LoadStoreOp, we may do a transposed load when reading into registers.


![image](https://github.com/NVIDIA/Fuser/assets/10635897/89395990-9b85-4ce1-8e7d-006e43a86b85)


See also #2315.
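The two-stage load in #2309 can be modeled with a small sketch. This is a hypothetical Python illustration, not nvFuser's actual C++ API: the `TensorView` class, `cache_to_smem`, and `cache_to_regs` names are invented for this example, standing in for the two LoadStoreOps described above.

```python
# Hedged sketch of the two-stage load: global -> shared memory (allocation
# domain propagated), then shared memory -> registers (scheduler picks the
# register allocation domain). All names here are illustrative, not nvFuser API.

class TensorView:
    def __init__(self, name, alloc_domain):
        self.name = name
        self.alloc_domain = alloc_domain  # ordered dims, innermost last

def cache_to_smem(tv):
    # First LoadStoreOp: the producer's allocation domain is propagated.
    return TensorView(tv.name + "_smem", list(tv.alloc_domain))

def cache_to_regs(tv, scheduler_domain):
    # Second LoadStoreOp: the scheduler sets the register allocation domain
    # instead of propagating the producer's.
    return TensorView(tv.name + "_regs", list(scheduler_domain))

tv_a = TensorView("tv_a", ["K", "M"])        # input with explicit allocation domain
a_smem = cache_to_smem(tv_a)                  # inherits ["K", "M"]
a_regs = cache_to_regs(a_smem, ["M", "K"])    # scheduler wants ["M", "K"]

# Innermost dims of producer (smem) and consumer (regs) differ, so this read
# would use a transposed load (the LdMatrixTranspose case from #2315).
transposed = a_smem.alloc_domain[-1] != a_regs.alloc_domain[-1]
print(transposed)  # True
```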
zasdfgbnm added a commit that referenced this pull request Jun 5, 2024
This PR modifies how we determine whether to use an LdMatrix or an LdMatrixTranspose.

We look at the (maybe) allocation domains of the producer and consumer of the load-store op to determine whether a transpose is required: we take the innermost dimension of the consumer's allocation domain and check whether it maps to the innermost dimension of the producer's allocation domain. If it does not map, we transpose.

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
zasdfgbnm pushed a commit that referenced this pull request Jun 5, 2024
…eduler (#2309)