Conversation
// Skip scheduling if Matmul will be expression evaluated.
if (isOptionEnabled(EnableOption::MatmulExprEval)) {
Does this mean any matmul pattern including mma + epilogue will be handled by the expression evaluator? Shouldn't it only take care of the mma part?
There are two problems with taking care of just the mma:
- at::matmul doesn't do HH->S. We could plug in another backend that supports HH->S for EE though.
- MMA is never alone in these GPT models (e.g. LLaMA). It's always part of a linear layer or an SDPA. nvFuser doesn't do SDPA well and we will have to offload it to another executor for quite some time, so scratch that. A linear layer however comes with this MMA->BiasAdd pattern. In order for its performance to be on par with framework-not-giving-nvFuser-the-matmul, we have to execute MMA+epilogue in one call.
Wdyt, @naoyam? Extending EE to do MMA+epilogue is the most obvious way to me to solve the above problems. But I could definitely be wrong.
> Extending EE to do MMA+epilogue

How would it do that? Does ATen support matmul with some epilogue op?
Oh yeah, it's called `at::addmm`. I realized I was wrong about ReLU: `torch.nn.Linear` doesn't do ReLU, so the pattern would be matmul+biasadd, which is what `at::addmm` does.
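For reference, `at::addmm(bias, mat1, mat2)` computes `beta * bias + alpha * (mat1 @ mat2)`. A minimal NumPy sketch (not nvFuser code) of why this covers the linear layer's matmul+biasadd pattern in a single call:

```python
import numpy as np

# Sketch of at::addmm semantics: beta * bias + alpha * (mat1 @ mat2).
def addmm(bias, mat1, mat2, beta=1.0, alpha=1.0):
    return beta * bias + alpha * (mat1 @ mat2)

# A linear layer y = x @ W.T + b expressed as one addmm call.
rng = np.random.default_rng(0)
x = rng.random((4, 8))    # batch of 4 inputs, 8 features
w = rng.random((16, 8))   # 16 output features
b = rng.random(16)

fused = addmm(b, x, w.T)  # one call: matmul + bias epilogue
unfused = x @ w.T + b     # the same computation as two separate ops
assert np.allclose(fused, unfused)
```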
So, how would you handle matmul+epilogue patterns that are accepted by the nvFuser matmul scheduler but have no corresponding ATen version? Would that end up doing the mma and the epilogue op separately?
That makes sense. Or maybe we want a separate scheduler for EE? We try the native matmul scheduler first, then the EE matmul scheduler, and then the other schedulers. No particular preference, just my two cents.
Yes, that's certainly a great option to consider. It makes things more composable at the risk of being harder to share logic with MatmulScheduler. I hope the preference will be more obvious when we know what the heuristics look like!
Adding on to @wujingyue's comments, the plan for the next PRs is:
- Support Mma + Cast: avoid roundtrip casting (half -> float -> half) by checking `mma_out->uses()`; if it is either a CastOp or pointwise ops with inputs of the same type (half), then skip casting back the output of `at::matmul`. This will not execute `matmul + bias` in a single call.
- Handle common epilogue fusions: we will need to pattern match and evaluate within the MmaOp accordingly. `test_matmul_scheduler.cpp` currently has a few cases that I will start with: `mma + bias`, `mma + bias + relu/gelu`, `mma + relu`.
- Epilogue fusions that are not supported: they can still be computed through EE but should ideally not be plumbed down to the matmul scheduler.
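To illustrate the roundtrip-cast concern, here is a hypothetical NumPy sketch (not nvFuser code). The MmaOp produces a float output that the fusion then casts back to half, so naively evaluating only the MmaOp inserts a half -> float -> half roundtrip that the look-ahead on `mma_out->uses()` would skip:

```python
import numpy as np

# Naive evaluation of only the MmaOp: upcast the half matmul result to
# float to mimic MmaOp's float output, then let the fusion's CastOp
# bring it back to half -- a wasted half -> float -> half roundtrip.
def evaluate_naive(a16, b16):
    mma_out = (a16 @ b16).astype(np.float32)
    return mma_out.astype(np.float16)

# Look ahead at mma_out's uses: its only consumer is a cast back to
# half, so the roundtrip can be skipped entirely.
def evaluate_fused(a16, b16):
    return a16 @ b16

a = np.random.default_rng(1).random((4, 8)).astype(np.float16)
b = np.random.default_rng(2).random((8, 4)).astype(np.float16)

# half -> float -> half is an exact roundtrip, so skipping it is a no-op
# numerically and saves the two casts.
assert evaluate_naive(a, b).dtype == np.float16
assert np.array_equal(evaluate_naive(a, b), evaluate_fused(a, b))
```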
Do 2 only when it's needed. You should double-check this, but I think Llama, for example, uses linear with bias off, and none of the linear layers in our benchmarks do relu or gelu.
Llama2 has bias=False in the linear layers but some GPT configs have bias=True.
wujingyue
left a comment
Nice work! The PR currently shows as a draft; I'll hold off on my review until it's ready.
wujingyue
left a comment
LGTM with some comments to be resolved.
jacobhinkle
left a comment
LGTM. Just some comments on the tests.
Thanks for preparing this change; it looks good to me after catching up on the latest discussions around the matmul scheduler.
Nit: Will it make sense to make this C++ test file a part of the
Thanks for the suggestion! Moved the test file.
FYI, this PR seems to break CI on V100.
Thanks for pointing this out. Let's revert this until I identify the patch.
This reverts commit a0cb47a.
FYI, these are the failing tests from CI:
Yes, these are the tests I added to check functionality.
Thank you! I can't say enough great things about revert-and-debug-later! |
Reverts #1743. This is breaking on V100.
Are we looking to support V100 through the default path? CC: @kevinstephano

Fuser/csrc/scheduler/matmul_utils.cpp, line 44 in 77caa57

If we wish to support V100, we can have appropriate checks in the heuristic verification to allow other architectures when we are using the expression evaluator.
I'd do it. This is actually something that can be supported with much less effort in fallback mode than in codegen. Sounds like low-hanging fruit to me.
FYI, I suspect it's not just V100. https://github.com/NVIDIA/Fuser/actions/runs/7910657123/job/21593586846 seems to be the same error but for H100.
One option could be to use something like this (https://github.com/NVIDIA/Fuser/blob/88727dc828684f5a62d7f1837a610b7589f629d1/test/test_combine_mul_sum.cpp#L40C1-L57C3) to reduce the set of machines these tests run on, in case you don't plan on adding support for V100/H100.
Thanks for the suggestion; I am moving forward with supporting any architecture since it's simple enough.
Initial PR for Issue #1669.

- Adds `EnableOption::MatmulExprEval` to turn on expression evaluation for matmul while the API is in progress.
- Currently only the `MmaOp` is evaluated. The next PRs will amend this to look ahead and evaluate Mma + Cast, which is what we should see in the fusion definitions. See the discussion here. In the absence of this we may have casts such as `bfloat -> float -> bfloat`.