
Introduce @retry_on_oom_or_skip_test#3252

Merged
jacobhinkle merged 12 commits into main from retry_python_tests_on_oom on Oct 29, 2024

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Oct 22, 2024

Inspired by #3174

This is an alternative to #3238.

Previously we manually reset the CUDA cache whenever usage was above 80%. This is not ideal: at 79% usage, a test that requires another 25% would still fail because the cache is never cleared, while at 81% usage we might clear the cache unnecessarily even though the remaining tests only need a few percent.

This PR cleans this up by introducing a new test decorator @retry_on_oom_or_skip_test. This decorator must be placed innermost, underneath the other decorators. It will execute the test inside a try block. If the test fails due to torch.OutOfMemoryError, we clear the cuda cache and retry the test. If it fails again due to torch.OutOfMemoryError, then we skip the test.

I updated the python benchmarks to apply this decorator automatically, and to remove the manual clear_cuda_cache() calls.
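The retry-then-skip behavior described above can be sketched as a plain decorator. This is an illustrative sketch, not the PR's exact code: a stand-in `FakeOOM` exception and a no-op `clear_cache()` hook replace `torch.OutOfMemoryError` and the gc/`torch.cuda.empty_cache()` step so the sketch runs without CUDA.

```python
import functools
import unittest


class FakeOOM(RuntimeError):
    """Stand-in for torch.OutOfMemoryError so this sketch runs without CUDA."""


def clear_cache():
    """Stand-in for gc.collect() + torch.cuda.empty_cache(); a no-op here."""


def retry_on_oom_or_skip_test(func, *, oom_exc=FakeOOM, clear=clear_cache):
    """Run the test; on OOM, clear the cache and retry once; on a second
    OOM, skip the test instead of failing it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except oom_exc:
            clear()  # free cached memory, then retry the test once
        try:
            return func(*args, **kwargs)
        except oom_exc:
            raise unittest.SkipTest("OOM even after clearing the CUDA cache")
    return wrapper
```

Because the wrapper must catch the raw exception from the test body, the decorator has to sit innermost, directly above the test function, so that the other decorators (e.g. parametrization) wrap it rather than the other way around.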

@jacobhinkle jacobhinkle requested a review from Priya2698 October 22, 2024 12:36
@jacobhinkle
Collaborator Author

!build

@jacobhinkle jacobhinkle marked this pull request as ready for review October 22, 2024 15:01
@Priya2698
Collaborator

Priya2698 commented Oct 22, 2024

This is a great idea.
We may be able to move this decorator into a pytest fixture and turn on autouse: https://docs.pytest.org/en/stable/how-to/fixtures.html#autouse-fixtures-fixtures-you-don-t-have-to-request.
That way we would no longer need to individually mark all tests, and we would not have to think about the order of the decorators in the tests.
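For illustration only (assuming pytest and the hypothetical names below), an autouse fixture applies to every test in the suite without per-test decoration. A minimal sketch of the cleanup half of the idea:

```python
import pytest


@pytest.fixture(autouse=True)
def cuda_cache_guard():
    """Runs around every test automatically; no decorator ordering to manage."""
    yield  # the test body executes here
    # Assumption: after each test, release cached allocator memory.
    # In the real suite this would call gc.collect() and
    # torch.cuda.empty_cache(); omitted so the sketch runs without CUDA.
```

Note this handles the cache-clearing step, but re-running the test body from inside a fixture is not straightforward, which is the sticking point discussed in this thread.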

@jacobhinkle
Collaborator Author

This is great idea. We may be able to move this decorator or wrap it into a pytest fixture and turn on autouse: https://docs.pytest.org/en/stable/how-to/fixtures.html#autouse-fixtures-fixtures-you-don-t-have-to-request. This would no longer require us to individually mark all tests and do not have to think about the order of the decorators in the tests.

Interesting idea. I can't see yet how to make it actually retry the calling function, but it might be possible. Another option might be https://github.com/str0zzapreti/pytest-retry, but I couldn't see how to get that to run the gc/cache clear step before the retry.

@jacobhinkle
Collaborator Author

!build

m, n, k, layout = config

clear_cuda_cache()
a = torch.randn(m, k, device="cuda", dtype=dtype)
Collaborator

Will the changes in this file be superseded by the other PR?

Collaborator Author

I will merge it manually

Collaborator Author

Merged

@jacobhinkle
Collaborator Author

!build

@Priya2698 Priya2698 left a comment

LGTM -- this is a great improvement!
test_matmul still seems to have some changes, though, which are likely not part of this PR and may need to be cleaned up manually.

@jacobhinkle
Collaborator Author

test_matmul still seems to have some changes though which are likely not part of this PR that may need to be cleaned manually.

It does? I just removed the try block. Is there something else you noticed?

@Priya2698
Collaborator

test_matmul still seems to have some changes though which are likely not part of this PR that may need to be cleaned manually.

It does? I just removed the try block. Is there something else you noticed?

Oh you're right. I misunderstood.
The PR can be merged as-is.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

The jit_python_bc_advisory_17_A100 failure is real. I will fix it before merging

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

The jit_python_bc_advisory_17_A100 failure is real. I will fix it before merging

The error seems strange, like CI was using an older test_ops.py which was trying to import clear_cuda_cache. I'm trying it again. I also fixed test_matmul.py.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

The error seems strange, like CI was using an older test_ops.py which was trying to import clear_cuda_cache. I'm trying it again. I also fixed test_matmul.py.

This is actually intended behavior. That job runs the tests from merge-base to detect changes in Python API. In this case it detects that we have changed nvfuser/python_utils.py to remove clear_cuda_cache. Since that is a user-facing library, I bumped version.txt in the latest push.

@jacobhinkle jacobhinkle merged commit 5db18de into main Oct 29, 2024
@jacobhinkle jacobhinkle deleted the retry_python_tests_on_oom branch October 29, 2024 13:26
Priya2698 added a commit that referenced this pull request Oct 30, 2024
This benchmark was added recently and did not have the changes added by
PR #3252.
The benchmark will fail on CI due to a missing import.
