
Run with and without half-precision matmul reduction in benchmark#3203

Merged
jacobhinkle merged 10 commits into main from matmul_benchmark_disable_half_reduction
Oct 25, 2024
Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Oct 17, 2024

This changes the Python matmul benchmark to run four times as many tests:

  • We parametrize by reduction dtype: reduce in float32, or in fp16/bf16 (the PyTorch default).
  • We parametrize by eager. If true, we compute torch.matmul directly without involving nvFuser; otherwise we use nvFuser. This lets us compute baselines in the same run as the nvFuser results, instead of re-running the benchmark with different environment variables as we previously had to.

nvFuser does not support split-K in reduced precision (see #1719), so we skip these cases for now.

This disables reduction in fp16 or bf16, which is enabled by default in
PyTorch. There are two reasons to disable this for our benchmarks:
1. nvFuser does not support split-K in reduced precision (see #1719).
   Since half precision reduction is much faster than single precision,
   this means eager mode will be faster but less precise than
   nvFuser by default. For a fair comparison, both can use single
   precision.
2. The accuracy of matmuls is degraded for split-K problems (small M&N,
   large K) by default in PyTorch. This can lead to validation errors
   where nvFuser actually performs an accurate computation but our
   baseline is inaccurate.
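The resulting test grid described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual parametrization code; the function name and parameters are made up:

```python
from itertools import product

# Illustrative sketch of the 2x2 grid: each problem is run with
# eager in {False, True} and half-precision reduction in {False, True}.
def build_cases(split_k: bool):
    cases = []
    for eager, half_reduction in product([False, True], [False, True]):
        # nvFuser does not support split-K in reduced precision (see #1719),
        # so skip that combination for now.
        if not eager and half_reduction and split_k:
            continue
        cases.append((eager, half_reduction))
    return cases

print(len(build_cases(split_k=False)))  # 4: full 2x2 grid
print(len(build_cases(split_k=True)))   # 3: nvFuser half-reduction case skipped
```

This is why the PR runs "four times as many tests" in the common case, with the unsupported split-K combination dropped.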
@jacobhinkle jacobhinkle requested a review from Priya2698 October 17, 2024 00:38
@jacobhinkle
Collaborator Author

!build --matmul-bench

@Priya2698
Collaborator

Priya2698 commented Oct 17, 2024

Do you think we should separate out the torch baselines from the nvFuser benchmark, so we also have a marker for default PyTorch performance?
Like you said, PyTorch with half-precision reduction will be faster but less accurate, so I am not sure whether we should compare against default PyTorch behavior (which is what users will compare nvFuser against), or use single precision so both run with the same settings.

@jacobhinkle
Collaborator Author

Do you think we should separate out the torch baselines from the nvFuser benchmark, so we also have a marker for default PyTorch performance?

Yeah, we could do that to track the effect of reduced-precision reduction on perf in our baseline. In the separate torch baselines we would use torch.matmul directly instead of running through nvFuser, right? That could work, but ideally I think we will start adding benchmarks involving fusions as well, in which case the point of comparison is to create an epilogue kernel but compute the matmul using ATen, i.e. our default nvFuser path. Currently I run the "baseline" without env vars, then run the nvFuser version using NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval and a heuristic plugin.

Like you said, PyTorch with half-precision reduction will be faster but less accurate, so I am not sure whether we should compare against default PyTorch behavior (which is what users will compare nvFuser against), or use single precision so both run with the same settings.

My $0.02 is that users already have a knob to control this (the ones I'm toggling in this PR), so we should merge something like #1719 and check that value in ATen to decide the reduction dtype in our heuristic.
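A rough sketch of that idea, with hypothetical names: let the heuristic pick its reduction dtype from the same knob users already control. The knob is modeled as a plain boolean argument here; in practice it would be read from torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction (or the bf16 equivalent).

```python
# Hypothetical heuristic sketch, not the actual nvFuser implementation.
def choose_reduction_dtype(input_dtype: str, allow_reduced: bool, split_k: bool) -> str:
    # Split-K in reduced precision is unsupported (see #1719), so fall back
    # to float32 reduction there regardless of the knob.
    if allow_reduced and not split_k and input_dtype in ("float16", "bfloat16"):
        return input_dtype
    return "float32"

print(choose_reduction_dtype("float16", allow_reduced=True, split_k=False))  # float16
print(choose_reduction_dtype("float16", allow_reduced=True, split_k=True))   # float32
```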

@Priya2698
Collaborator

Priya2698 commented Oct 17, 2024

Do you think we should separate out the torch baselines from the nvFuser benchmark, so we also have a marker for default PyTorch performance?

Yeah we could do that in order to track the effect of reduced precision reduction on perf in our baseline. In the separate torch baselines we would use torch.matmul directly instead of running through nvFuser right?

Yes -- running torch.matmul directly.

My $0.02 is that users already have a knob to control this (the ones I'm toggling in this PR), so we should merge something like #1719 and check that value in ATen to decide the reduction dtype in our heuristic.

This looks more robust and we would cover all comparison metrics. Can we also add this point in a comment around the changes?

These are not blockers for this PR. Once we have the matmul benchmarks in our CI, we should revisit this.

Priya2698 previously approved these changes Oct 17, 2024
Collaborator

@Priya2698 Priya2698 left a comment


LGTM.

Comment on lines +45 to +47
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

Collaborator

@Priya2698 Priya2698 Oct 17, 2024


Can we add a comment about the other option of using PR #1719 to determine the reduction dtype, for later reference?

@Priya2698 Priya2698 dismissed their stale review October 17, 2024 22:02

Looks like there are CI errors for matmul/linear translation tests.

@jacobhinkle
Collaborator Author

Looks like there are CI errors for matmul/linear translation tests.

Yes, but the CI errors are not blocking; they were there before this PR. The --matmul-bench option runs test_matmul with our internal heuristic plugin enabled, and some unrelated bugs in that plugin currently provide invalid configs.

Collaborator

@Priya2698 Priya2698 left a comment


LGTM

@Priya2698
Collaborator

Looks like there are CI errors for matmul/linear translation tests.

Yes, but the CI errors are not blocking; they were there before this PR. The --matmul-bench option runs test_matmul with our internal heuristic plugin enabled, and some unrelated bugs in that plugin currently provide invalid configs.

Is there a tracking issue?

@jacobhinkle
Collaborator Author

Is there a tracking issue?

#3213 and #2979

@jacobhinkle
Collaborator Author

In the latest version I am computing baselines in the same test (when eager == True), and I also parametrize by whether we reduce in half or full precision. This lets us lay everything out at once:
[image: table of benchmark results across all parametrizations]
Thanks for the suggestion @Priya2698

@jacobhinkle jacobhinkle changed the title Disable cuBLAS half-precision matmul reduction in benchmark Run with and without half-precision matmul reduction in benchmark Oct 18, 2024
@jacobhinkle jacobhinkle requested a review from Priya2698 October 18, 2024 15:14
@Priya2698
Collaborator

@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?

@jacobhinkle
Collaborator Author

@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?

It is ready. I think the two PRs are pretty much independent.

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_matmul_nvf_benchmark(
    benchmark,
    eager: bool,
Collaborator


If we separate the eager benchmark into its own benchmark, it will not run by default (

if not run_eager:
    skip_eager = pytest.mark.skip(reason="need --benchmark-eager option to run")
    for item in items:
        # If the benchmark has compile=False parameter (eager mode), skip it.
        if (
            hasattr(item, "callspec")
            and "compile" in item.callspec.params
            and not item.callspec.params["compile"]
        ):
            item.add_marker(skip_eager)

). Right now, we will run both nvFuser (which is ATen at this time) and eager in every run.

We can separate out the common code into a utility function and call it from both test_matmul_nvf_benchmark and test_matmul_baseline_benchmark.
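The skip-by-default behavior this relies on can be modeled with a small self-contained sketch. Items are plain dicts here purely for illustration; the real hook operates on pytest items and adds a skip marker:

```python
# Model of the conftest logic: benchmarks parametrized with compile=False
# (eager mode) are skipped unless the --benchmark-eager option was passed.
def mark_skips(items, run_eager: bool):
    skipped = []
    for item in items:
        params = item.get("params", {})
        if not run_eager and params.get("compile") is False:
            item["skip_reason"] = "need --benchmark-eager option to run"
            skipped.append(item["name"])
    return skipped

items = [
    {"name": "test_matmul[compile-True]", "params": {"compile": True}},
    {"name": "test_matmul_baseline[compile-False]", "params": {"compile": False}},
]
print(mark_skips(items, run_eager=False))  # ['test_matmul_baseline[compile-False]']
print(mark_skips(items, run_eager=True))   # []
```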

Collaborator Author


When I split the test into two tests, there will be no difference between eager and compile, but my understanding is that we need to have both of those to trigger this code path.

Collaborator Author


I guess I can just put that compile option in, since eventually we will probably extend this to cover some epilogue cases and multi-matmul cases.

Collaborator


The second function will have the compile parameter with just the value [False] to exercise the eager benchmark, so it will not run by default.

there will be no difference between eager and compile, but my understanding is that we need to have both of those to trigger this code path.

I am not sure what you mean here -- trigger which code path?

Collaborator


The compile parameter is what is required to skip eager by default.

Collaborator Author


Ah right. Gotcha. I will make that change tonight/AM eastern

Collaborator Author


I just pushed the change splitting the baseline into a separate test that can be enabled with --benchmark-eager.

@Priya2698
Collaborator

Priya2698 commented Oct 23, 2024

@jacobhinkle is this PR ready for review, or do you first want to merge PR #3252?

It is ready. I think the two PRs are pretty much independent.

If we are still seeing 0 measurements, that seems like a real error.
The benchmark itself has not changed, only the parametrization over the reduction type. Do you have a list of cases where we see stats.min = 0? For nvFuser we are still running the same settings (since reduction in half precision is not supported), so only the eager benchmarks with half precision are new additions.

@jacobhinkle
Collaborator Author

!build

Collaborator

@Priya2698 Priya2698 left a comment


LGTM, apart from the disable-benchmarking flag in the eager benchmark, which needs to be removed.

Thanks for the changes!

b = b.as_strided(size=[k, n], stride=[1, k])

# NOTE: we never need to validate eager, as it is our baseline
if not disable_benchmarking:
Collaborator


This conditional is not needed; this flag is for nvFuser benchmarks (I'll make a note to rename it). If --benchmark-eager is used, we always run this benchmark.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle jacobhinkle merged commit 9fc7bb3 into main Oct 25, 2024
@jacobhinkle jacobhinkle deleted the matmul_benchmark_disable_half_reduction branch October 25, 2024 21:33