Run with and without half-precision matmul reduction in benchmark#3203
jacobhinkle merged 10 commits into main
Conversation
This disables reduction in fp16 or bf16, which is enabled by default in PyTorch. There are two reasons to disable this for our benchmarks:

1. nvFuser does not support split-K in reduced precision (see #1719). Since half-precision reduction is much faster than single precision, eager mode will by default be faster but less precise than nvFuser. For a fair comparison, both can use single precision.
2. The accuracy of matmuls is degraded by default in PyTorch for split-K problems (small M and N, large K). This can lead to validation errors where nvFuser actually performs an accurate computation but our baseline is inaccurate.
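A quick NumPy sketch (illustrative, not from the PR) of point 2: accumulating thousands of fp16 products in an fp16 accumulator drifts far more than accumulating the same fp16 inputs in an fp32 accumulator, which is what disabling reduced-precision reduction buys for large-K dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4096  # large K, the shape regime where split-K matters
a = rng.standard_normal(k).astype(np.float16)
b = rng.standard_normal(k).astype(np.float16)

# Reference: accumulate the exact fp16 inputs in float64.
ref = np.dot(a.astype(np.float64), b.astype(np.float64))

# Reduced-precision reduction: fp16 products, fp16 accumulator.
acc_fp16 = np.float16(0.0)
for x, y in zip(a, b):
    acc_fp16 = np.float16(acc_fp16 + x * y)

# Single-precision reduction: same fp16 inputs, fp32 accumulator.
acc_fp32 = np.float32(0.0)
for x, y in zip(a, b):
    acc_fp32 += np.float32(x) * np.float32(y)

err16 = abs(float(acc_fp16) - ref)
err32 = abs(float(acc_fp32) - ref)
print(err16, err32)  # fp16 accumulation error dwarfs fp32 accumulation error
```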
!build --matmul-bench

Do you think we should separate out the torch baselines from the nvFuser benchmark, to also have a marker for default PyTorch performance?
Yeah, we could do that in order to track the effect of reduced-precision reduction on perf in our baseline. In the separate torch baselines we would use
My $0.02 is that users already have a knob to control this (the ones I'm turning in this PR), so we should merge something like #1719 and check for that value in ATen to decide the reduction dtype in our heuristic.
Yes -- running
This looks more robust, and we would cover all comparison metrics. Can we also add this point in a comment around the changes? These are not blockers for this PR. Once we have the matmul benchmarks in our CI, we should revisit this.
benchmarks/python/test_matmul.py
Outdated
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
    torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
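Since these flags are process-global, a benchmark that flips them should restore them afterwards. A hypothetical sketch of such a guard (demonstrated with a stand-in namespace rather than the real `torch.backends.cuda.matmul`, so it runs without a GPU):

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def reduced_precision_reduction(matmul_backend, allow: bool):
    """Temporarily set both reduced-precision-reduction flags, then restore.

    `matmul_backend` stands in for `torch.backends.cuda.matmul`; this helper
    is an illustration, not code from the PR.
    """
    saved = (
        matmul_backend.allow_fp16_reduced_precision_reduction,
        matmul_backend.allow_bf16_reduced_precision_reduction,
    )
    matmul_backend.allow_fp16_reduced_precision_reduction = allow
    matmul_backend.allow_bf16_reduced_precision_reduction = allow
    try:
        yield
    finally:
        (
            matmul_backend.allow_fp16_reduced_precision_reduction,
            matmul_backend.allow_bf16_reduced_precision_reduction,
        ) = saved


# Demonstration with a stand-in namespace (PyTorch defaults both flags to True).
backend = SimpleNamespace(
    allow_fp16_reduced_precision_reduction=True,
    allow_bf16_reduced_precision_reduction=True,
)
with reduced_precision_reduction(backend, False):
    inside = backend.allow_fp16_reduced_precision_reduction  # False inside
after = backend.allow_fp16_reduced_precision_reduction       # restored to True
```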
Can we add a comment on the other option of using PR #1719 for determining the reduction dtype, for later reference?
Looks like there are CI errors for matmul/linear translation tests.
Yes, but the CI errors are not blocking. Those were there before this PR. What happens is the

Is there a tracking issue?
In the latest version I am computing baselines in the same test (when |
@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?
It is ready. I think the two PRs are pretty much independent. |
benchmarks/python/test_matmul.py
Outdated
    @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
    def test_matmul_nvf_benchmark(
        benchmark,
        eager: bool,
If we separate the eager benchmark into its own benchmark, it will not run by default; see:
Fuser/benchmarks/python/conftest.py
Lines 107 to 116 in 5a6c90b
We can separate out the common code into a utility function and call it from test_matmul_nvf_benchmark and test_matmul_baseline_benchmark.
When I split the test into two tests there will be no difference between eager and compile, but my understanding is that we need to have both of those to trigger this code path.
I guess I can just put that compile option in, since eventually we will probably extend this to cover some epilogue cases and multi-matmul cases.
The second function will have the compile parameter with just the `[False]` value to exercise the eager benchmark, and it will not be run by default.
> there will be no difference between eager and compile, but my understanding is that we need to have both of those to trigger this code path.
I am not sure what you mean here -- trigger which code path?
The compile parameter is what is required to skip eager by default.
Ah right. Gotcha. I will make that change tonight/AM eastern
I just pushed the change splitting the baseline into a separate test that can be enabled with --benchmark-eager.
If we are still seeing 0 measurements, that seems like a real error.
!build |
Priya2698 left a comment
LGTM, apart from the disable-benchmarking flag in the eager benchmark, which needs to be removed.
Thanks for the changes!
benchmarks/python/test_matmul.py
Outdated
    b = b.as_strided(size=[k, n], stride=[1, k])

    # NOTE: we never need to validate eager, as it is our baseline
    if not disable_benchmarking:
This conditional is not needed; this flag is for nvFuser benchmarks. I'll make a note to rename it. If `--benchmark-eager` is used, we always run this benchmark.
!build |

This changes the Python matmul benchmark to run four times as many tests:

- `eager`: if this is true, we directly compute `torch.matmul` without involving nvFuser; otherwise we use nvFuser. This lets us compute baselines in the same run as the nvFuser results, instead of needing to re-run the benchmark with different environment variables as we previously had to.
- nvFuser does not support split-K in reduced precision (see #1719), so we skip those cases for now.
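The "four times as many tests" can be sketched as a parameter grid (a hypothetical reconstruction; it assumes the two new binary axes are `eager` and the reduced-precision-reduction setting):

```python
from itertools import product

# Each existing (dtype, shape) benchmark case is multiplied by two new
# binary axes, giving 4x the number of tests overall.
eager_values = [False, True]               # torch.matmul baseline vs nvFuser
reduced_precision_values = [False, True]   # half-precision reduction on/off
grid = list(product(eager_values, reduced_precision_values))
print(len(grid))  # 4 combinations per existing case
```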