
Run with and without half-precision matmul reduction in benchmark#3203

Merged
jacobhinkle merged 10 commits into main from matmul_benchmark_disable_half_reduction
Oct 25, 2024
Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Oct 17, 2024

This changes the Python matmul benchmark to run four times as many tests:

  • We parametrize by reduction dtype: reduce in float32, or in fp16/bf16 (the PyTorch default).
  • We parametrize by eager. If true, we compute torch.matmul directly without involving nvFuser; otherwise we use nvFuser. This lets us compute baselines in the same run as the nvFuser results, instead of re-running the benchmark with different environment variables as we previously had to.

nvFuser does not support split-K in reduced precision (see #1719), so we skip these cases for now.

This disables reduction in fp16 or bf16, which is enabled by default in
PyTorch. There are two reasons to disable this for our benchmarks:
1. nvFuser does not support split-K in reduced precision (see #1719).
   Since half precision reduction is much faster than single precision,
   this means eager mode will be faster but less precise than
   nvFuser by default. For a fair comparison, both can use single
   precision.
2. The accuracy of matmuls is degraded for split-K problems (small M&N,
   large K) by default in PyTorch. This can lead to validation errors
   where nvFuser actually performs an accurate computation but our
   baseline is inaccurate.
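The resulting test grid described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual parametrization code; the function name and parameters are made up:

```python
from itertools import product

# Illustrative sketch of the 2x2 grid: each problem is run with
# eager in {False, True} and half-precision reduction in {False, True}.
def build_cases(split_k: bool):
    cases = []
    for eager, half_reduction in product([False, True], [False, True]):
        # nvFuser does not support split-K in reduced precision (see #1719),
        # so skip that combination for now.
        if not eager and half_reduction and split_k:
            continue
        cases.append((eager, half_reduction))
    return cases

print(len(build_cases(split_k=False)))  # 4: full 2x2 grid
print(len(build_cases(split_k=True)))   # 3: nvFuser half-reduction case skipped
```

This is why the PR runs "four times as many tests" in the common case, with the unsupported split-K combination dropped.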
@jacobhinkle jacobhinkle requested a review from Priya2698 October 17, 2024 00:38
@jacobhinkle
Collaborator Author

!build --matmul-bench

@Priya2698
Collaborator

Priya2698 commented Oct 17, 2024

Do you think we should separate out the torch baselines from the nvFuser benchmark, so we also have a marker for default PyTorch performance?
Like you said, PyTorch with half-precision reduction will be faster but less accurate, so I am not sure whether we should compare against default PyTorch behavior (which is what users will compare nvFuser against), or use single precision so both run with the same settings.

@jacobhinkle
Collaborator Author

Do you think we should separate out the torch baselines from the nvFuser benchmark, so we also have a marker for default PyTorch performance?

Yeah, we could do that to track the effect of reduced-precision reduction on perf in our baseline. In the separate torch baselines we would use torch.matmul directly instead of running through nvFuser, right? That could work, but ideally I think we will start adding benchmarks involving fusions as well, in which case the point of comparison is to create an epilogue kernel but compute the matmul using ATen, i.e. our default nvFuser path. Currently I run the "baseline" without env vars, then run the nvFuser version using NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval and a heuristic plugin.

Like you said, PyTorch with half-precision reduction will be faster but less accurate, so I am not sure whether we should compare against default PyTorch behavior (which is what users will compare nvFuser against), or use single precision so both run with the same settings.

My $0.02 is that users already have a knob to control this (the ones I'm toggling in this PR), so we should merge something like #1719 and check that value in ATen to decide the reduction dtype in our heuristic.
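A rough sketch of that idea, with hypothetical names: let the heuristic pick its reduction dtype from the same knob users already control. The knob is modeled as a plain boolean argument here; in practice it would be read from torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction (or the bf16 equivalent).

```python
# Hypothetical heuristic sketch, not the actual nvFuser implementation.
def choose_reduction_dtype(input_dtype: str, allow_reduced: bool, split_k: bool) -> str:
    # Split-K in reduced precision is unsupported (see #1719), so fall back
    # to float32 reduction there regardless of the knob.
    if allow_reduced and not split_k and input_dtype in ("float16", "bfloat16"):
        return input_dtype
    return "float32"

print(choose_reduction_dtype("float16", allow_reduced=True, split_k=False))  # float16
print(choose_reduction_dtype("float16", allow_reduced=True, split_k=True))   # float32
```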

@Priya2698
Collaborator

Priya2698 commented Oct 17, 2024

Do you think we should separate out the torch baselines from the nvFuser benchmark, so we also have a marker for default PyTorch performance?

Yeah we could do that in order to track the effect of reduced precision reduction on perf in our baseline. In the separate torch baselines we would use torch.matmul directly instead of running through nvFuser right?

Yes -- running torch.matmul directly.

My $0.02 is that users already have a knob to control this (the ones I'm toggling in this PR), so we should merge something like #1719 and check that value in ATen to decide the reduction dtype in our heuristic.

This looks more robust and we would cover all comparison metrics. Can we also add this point in a comment around the changes?

These are not blockers for this PR. Once we have the matmul benchmarks in our CI, we should revisit this.

Priya2698 previously approved these changes Oct 17, 2024
Collaborator

@Priya2698 Priya2698 left a comment


LGTM.

Comment on lines +45 to +47
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

Collaborator

@Priya2698 Priya2698 Oct 17, 2024


Can we add a comment about the other option of using PR #1719 to determine the reduction dtype, for later reference?

@Priya2698 Priya2698 dismissed their stale review October 17, 2024 22:02

Looks like there are CI errors for matmul/linear translation tests.

@jacobhinkle
Collaborator Author

Looks like there are CI errors for matmul/linear translation tests.

Yes, but the CI errors are not blocking; they were there before this PR. The --matmul-bench option runs test_matmul with our internal heuristic plugin enabled, and some unrelated bugs in that plugin currently provide invalid configs.

Collaborator

@Priya2698 Priya2698 left a comment


LGTM

@Priya2698
Collaborator

Looks like there are CI errors for matmul/linear translation tests.

Yes, but the CI errors are not blocking; they were there before this PR. The --matmul-bench option runs test_matmul with our internal heuristic plugin enabled, and some unrelated bugs in that plugin currently provide invalid configs.

Is there a tracking issue?

@jacobhinkle
Collaborator Author

Is there a tracking issue?

#3213 and #2979

@jacobhinkle
Collaborator Author

In the latest version I am computing baselines in the same test (when eager == True), and I also parametrize by whether we reduce in half or full precision. This lets us lay everything out at once:
[image: table of benchmark results across all parametrizations]
Thanks for the suggestion @Priya2698

@jacobhinkle jacobhinkle changed the title Disable cuBLAS half-precision matmul reduction in benchmark Run with and without half-precision matmul reduction in benchmark Oct 18, 2024
@jacobhinkle jacobhinkle requested a review from Priya2698 October 18, 2024 15:14
@Priya2698
Collaborator

@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?

@jacobhinkle
Collaborator Author

@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?

It is ready. I think the two PRs are pretty much independent.

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_matmul_nvf_benchmark(
    benchmark,
    eager: bool,
Collaborator


If we separate the eager benchmark into its own benchmark, it will not run by default (

if not run_eager:
    skip_eager = pytest.mark.skip(reason="need --benchmark-eager option to run")
    for item in items:
        # If the benchmark has compile=False parameter (eager mode), skip it.
        if (
            hasattr(item, "callspec")
            and "compile" in item.callspec.params
            and not item.callspec.params["compile"]
        ):
            item.add_marker(skip_eager)

). Right now, we will run both nvFuser (which is ATen at this time) and eager in every run.

We can separate out the common code into a utility function and call it from both test_matmul_nvf_benchmark and test_matmul_baseline_benchmark.
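The skip-by-default behavior this relies on can be modeled with a small self-contained sketch. Items are plain dicts here purely for illustration; the real hook operates on pytest items and adds a skip marker:

```python
# Model of the conftest logic: benchmarks parametrized with compile=False
# (eager mode) are skipped unless the --benchmark-eager option was passed.
def mark_skips(items, run_eager: bool):
    skipped = []
    for item in items:
        params = item.get("params", {})
        if not run_eager and params.get("compile") is False:
            item["skip_reason"] = "need --benchmark-eager option to run"
            skipped.append(item["name"])
    return skipped

items = [
    {"name": "test_matmul[compile-True]", "params": {"compile": True}},
    {"name": "test_matmul_baseline[compile-False]", "params": {"compile": False}},
]
print(mark_skips(items, run_eager=False))  # ['test_matmul_baseline[compile-False]']
print(mark_skips(items, run_eager=True))   # []
```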

Collaborator Author


When I split the test into two tests, there will be no difference between eager and compile, but my understanding is that we need to have both of those to trigger this code path.

Collaborator Author


I guess I can just put that compile option in, since eventually we will probably extend this to cover some epilogue cases and multi-matmul cases.

Collaborator


The second function will have the compile parameter with just the value [False] to exercise the eager benchmark, so it will not run by default.

there will be no difference between eager and compile, but my understanding is that we need to have both of those to trigger this code path.

I am not sure what you mean here -- trigger which code path?

Collaborator


The compile parameter is what is required to skip eager by default.

Collaborator Author


Ah right. Gotcha. I will make that change tonight/AM eastern

Collaborator Author


I just pushed the change splitting the baseline into a separate test that can be enabled with --benchmark-eager.

@Priya2698
Collaborator

Priya2698 commented Oct 23, 2024

@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?

It is ready. I think the two PRs are pretty much independent.

If we are still seeing 0 measurements, that seems like a real error.
The benchmark itself has not changed, only the parametrization over the reduction type. Do you have a list of cases where we see stats.min = 0? For nvFuser we are still running the same settings (since reduction in half precision is not supported), so only the eager benchmarks with half precision are new additions.

@jacobhinkle
Collaborator Author

!build

Collaborator

@Priya2698 Priya2698 left a comment


LGTM, apart from the disable-benchmarking flag in the eager benchmark, which needs to be removed.

Thanks for the changes!

b = b.as_strided(size=[k, n], stride=[1, k])

# NOTE: we never need to validate eager, as it is our baseline
if not disable_benchmarking:
Collaborator


This conditional is not needed; this flag is for nvFuser benchmarks (I'll make a note to rename it). If --benchmark-eager is used, we always run this benchmark.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle jacobhinkle merged commit 9fc7bb3 into main Oct 25, 2024
@jacobhinkle jacobhinkle deleted the matmul_benchmark_disable_half_reduction branch October 25, 2024 21:33