Compute matmul dim roles with no-devices leaf domain#2300

Merged
jacobhinkle merged 1 commit into main from matmul_dim_roles_nodevices
May 27, 2024

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented May 24, 2024

This fixes a bug introduced by #2272 in `test_multidevice` where we reject a matmul segment shaped like `[iDIDxMo, iMi, bN, iK]` for having too many M dimensions. Locally this segment still has a single M dimension, so it is valid. This PR ignores device dims for the purposes of computing tensor roles and problem shape.
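The fix can be illustrated with a minimal Python sketch. This is not nvFuser's actual API; the function and data layout here are hypothetical, standing in for the role-counting check that the matmul scheduler performs:

```python
# Hypothetical sketch of the role-counting fix: when validating a matmul
# segment, skip dimensions that are sharded across devices so that only
# the *local* problem shape is counted.

def count_role(dim_roles, role, device_dims):
    """Count dims with the given role, ignoring device-sharded dims."""
    return sum(
        1
        for name, dim_role in dim_roles
        if dim_role == role and name not in device_dims
    )

# The failing segment from the PR: [iDIDxMo, iMi, bN, iK].
roles = [("iDIDxMo", "M"), ("iMi", "M"), ("bN", "N"), ("iK", "K")]
device_dims = {"iDIDxMo"}  # sharded on DIDx

# Before the fix, both iDIDxMo and iMi counted as M dims (2 M dims ->
# segment rejected). Ignoring the device dim leaves a single local M
# dim, so the segment is accepted.
m_without_fix = count_role(roles, "M", set())        # 2
m_with_fix = count_role(roles, "M", device_dims)     # 1
```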

Further issues we should look into:

  1. As mentioned in Generalize CombineMulSum as MatmulPatterns #2272, we should proceed to handle multiple M, N, K, and Batch dimensions, although in this case the restriction was useful for surfacing this bug.
  2. Even if the matmul scheduler is completely broken or disabled, the reduction scheduler should have been able to schedule this fusion. However, it identified the reduction tensor as `isResharding` and removed it from the `reduction_tvs` list, causing a failure in `scheduleReduction`. We should clean up that check so that this type of fusion can be scheduled as a reduction.
  3. Inside the matmul scheduler we call `canonicalizeMmaTvOrdering`, which I believe still uses the rfactor domain to determine domain ordering. It should instead use the dim roles that are already computed from the `MatmulPattern`.
  4. The rfactor domain is often used by scheduling utilities to inspect the logical size of tensors. However, because multidevice scheduling modifies the leaf domain before segmentation, we should probably audit our schedulers to ensure they use the leaf domain and ignore device dims where necessary.
  5. I should also not forget to rerun `!build` before merging PRs 😅

@jacobhinkle
Collaborator Author

!build --diff

Collaborator

@cowanmeg cowanmeg left a comment


Thank you!

@jacobhinkle jacobhinkle merged commit 8baa550 into main May 27, 2024
@jacobhinkle jacobhinkle deleted the matmul_dim_roles_nodevices branch May 27, 2024 00:05