Conversation
wujingyue
left a comment
Is it worth adding a check to make sure ATen kicked in? E.g.
cowanmeg
left a comment
Thanks! I'm glad the ATen pathway worked out of the box!
I assume the extra size-1 axes in the ATen tensors are being treated as batch dimensions, which is why it works out OK, right?
std::vector<int64_t> orig_size = {K, M, N};
std::vector<int64_t> new_size = {K, Mo, Mi, N};
This is actually an error from me... K should be Ko.
I'm surprised it compiled and was correct before.
Yes, that is treated as a BatchMatmul, case 5 in https://pytorch.org/docs/stable/generated/torch.matmul.html
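The batch-dimension behavior described above can be sanity-checked outside nvFuser. A minimal sketch using NumPy's `matmul`, which follows the same broadcasting rules as `torch.matmul` case 5 (the shapes here are illustrative, not taken from the PR's tests):

```python
import numpy as np

# A leading size-1 axis is treated as a batch dimension (torch.matmul case 5):
# the 2-D operand is broadcast across the batch, so the extra axis is harmless.
a = np.ones((1, 4, 8))  # extra size-1 batch axis in front of a [4, 8] matrix
b = np.ones((8, 5))     # plain 2-D operand, broadcast across the batch
out = np.matmul(a, b)
assert out.shape == (1, 4, 5)  # batch axis is carried through unchanged
```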
How do I access the executors from …?
!build |
Sorry for adding this late - can we add back the original
MultiDeviceExecutor doesn't have accessor functions for executors, so we would have to add that first. @wujingyue Some background: see here for the internals: …
Good question. We could plumb it through, but I don't think it's worth the benefit at this moment.
Good question! AFAIK, nvFuser offers three ways (arguably too many, 🤷) to generate a matmul.
The current tests in this file exercise only (2). I agree it's useful to exercise (1): MatmulScheduler is already DID-capable, and eventually we want nvFuser to generate matmuls using the scheduler, so it's good to have tests that maintain that feature. However, I'm not aware that nvFuser can reliably turn a decomposed broadcast+mul+sum into an MmaOp/MatmulOp; correct me if I'm wrong. So we may need to add new tests that use fusedMultiplySum instead. I was told by @Priya2698 that (3) is a corner case that will go away, so I'll leave the decision to her whether to cover that as well.
Yes, (3) will be removed. For now, from the Python frontend API, the only way is to call …. To exercise the matmul scheduler:
Thanks for clarifying. I missed `Fuser/csrc/scheduler/mma_utils.cpp` line 1462 (in 0f66dc2). So we have three ways to exercise the matmul scheduler:
Wdyt? (3) seems to me the best option: we'll get …
Yes, we can use the
We are still seeing some issues such as #2354, so it is not enabled by default yet. I plan to run all the Thunder benchmarks and tests with this enabled to preemptively identify existing issues first.
We can write an accessor function to return the FusionExecutorCache(s) used by the MultiDeviceRuntime. It will be a vector because each segment has its own FusionExecutorCache; in this case, since there's only one compute segment, it will only have one entry. Related: for the MatmulOp-to-MmaOp translation for the matmul scheduler, we eventually need to update …
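The shape of the proposed accessor can be sketched with stand-in types. `FusionExecutorCache` and `MultiDeviceExecutor` below are placeholder classes for illustration only, not nvFuser's real ones; the point is the one-cache-per-segment invariant and the vector-returning accessor:

```python
# Stand-in types for illustration only; not nvFuser's real classes.
class FusionExecutorCache:
    def __init__(self, segment_name):
        self.segment_name = segment_name

class MultiDeviceExecutor:
    def __init__(self, segment_names):
        # Invariant: one FusionExecutorCache per compute segment.
        self._caches = [FusionExecutorCache(n) for n in segment_names]

    def get_fusion_executor_caches(self):
        # Accessor returns a list with one entry per segment.
        return list(self._caches)

executor = MultiDeviceExecutor(["compute"])
caches = executor.get_fusion_executor_caches()
assert len(caches) == 1  # single compute segment -> exactly one entry
```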
So currently, what you suggest is retaining one of the original test cases?
Let's keep one of the original test cases.
!build |
Add `MultiDeviceExecutor::getFusionExecutorCaches`, which returns a vector of pointers to the multidevice executor's fusion executor caches. Also renames the field `workspace` to `workspace_` to match naming conventions. cc @Priya2698 for #2386
bb469f1 to 49ebf3b
!build |
Issue #2372.
Modifying the tests to use `matmul` in place of `mul`-`sum`. Note: the `matmul` API requires the logical shapes `[M,K] x [K,N]`, and the output has the same dtype as the input.
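The shape and dtype contract stated above can be illustrated with NumPy as a stand-in (a sketch only; nvFuser's `matmul` is assumed to behave analogously for these properties):

```python
import numpy as np

M, K, N = 4, 8, 5
a = np.ones((M, K), dtype=np.float16)  # logical shape [M, K]
b = np.ones((K, N), dtype=np.float16)  # logical shape [K, N]
out = np.matmul(a, b)
assert out.shape == (M, N)       # [M, K] x [K, N] -> [M, N]
assert out.dtype == a.dtype      # output keeps the input dtype
```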