
Introduce @retry_on_oom_or_skip_test#3252

Merged
jacobhinkle merged 12 commits into main from retry_python_tests_on_oom on Oct 29, 2024

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Oct 22, 2024

Inspired by #3174

This is an alternative to #3238.

Previously we manually reset the CUDA cache whenever usage was above 80%. This is not ideal: at 79% usage, a test that requires another 25% would still fail because the cache is never cleared, while at 81% usage we might clear the cache unnecessarily even though the remaining tests only need a few percent.

This PR cleans this up by introducing a new test decorator @retry_on_oom_or_skip_test. This decorator must be placed innermost, underneath the other decorators. It will execute the test inside a try block. If the test fails due to torch.OutOfMemoryError, we clear the cuda cache and retry the test. If it fails again due to torch.OutOfMemoryError, then we skip the test.

I updated the python benchmarks to apply this decorator automatically, and to remove the manual clear_cuda_cache() calls.
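The retry-then-skip behavior described above can be sketched as a plain decorator. This is an illustrative sketch, not the PR's exact code: a stand-in `FakeOOM` exception and a no-op `clear_cache()` hook replace `torch.OutOfMemoryError` and the gc/`torch.cuda.empty_cache()` step so the sketch runs without CUDA.

```python
import functools
import unittest


class FakeOOM(RuntimeError):
    """Stand-in for torch.OutOfMemoryError so this sketch runs without CUDA."""


def clear_cache():
    """Stand-in for gc.collect() + torch.cuda.empty_cache(); a no-op here."""


def retry_on_oom_or_skip_test(func, *, oom_exc=FakeOOM, clear=clear_cache):
    """Run the test; on OOM, clear the cache and retry once; on a second
    OOM, skip the test instead of failing it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except oom_exc:
            clear()  # free cached memory, then retry the test once
        try:
            return func(*args, **kwargs)
        except oom_exc:
            raise unittest.SkipTest("OOM even after clearing the CUDA cache")
    return wrapper
```

Because the wrapper must catch the raw exception from the test body, the decorator has to sit innermost, directly above the test function, so that the other decorators (e.g. parametrization) wrap it rather than the other way around.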

@jacobhinkle jacobhinkle requested a review from Priya2698 October 22, 2024 12:36
@jacobhinkle
Collaborator Author

!build

@jacobhinkle jacobhinkle marked this pull request as ready for review October 22, 2024 15:01
@Priya2698
Collaborator

Priya2698 commented Oct 22, 2024

This is a great idea.
We may be able to move this decorator into a pytest fixture and turn on autouse: https://docs.pytest.org/en/stable/how-to/fixtures.html#autouse-fixtures-fixtures-you-don-t-have-to-request.
That way we would no longer need to individually mark all tests, and we would not have to think about the order of the decorators in the tests.
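For illustration only (assuming pytest and the hypothetical names below), an autouse fixture applies to every test in the suite without per-test decoration. A minimal sketch of the cleanup half of the idea:

```python
import pytest


@pytest.fixture(autouse=True)
def cuda_cache_guard():
    """Runs around every test automatically; no decorator ordering to manage."""
    yield  # the test body executes here
    # Assumption: after each test, release cached allocator memory.
    # In the real suite this would call gc.collect() and
    # torch.cuda.empty_cache(); omitted so the sketch runs without CUDA.
```

Note this handles the cache-clearing step, but re-running the test body from inside a fixture is not straightforward, which is the sticking point discussed in this thread.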

@jacobhinkle
Collaborator Author

This is great idea. We may be able to move this decorator or wrap it into a pytest fixture and turn on autouse: https://docs.pytest.org/en/stable/how-to/fixtures.html#autouse-fixtures-fixtures-you-don-t-have-to-request. This would no longer require us to individually mark all tests and do not have to think about the order of the decorators in the tests.

Interesting idea. I can't see yet how to make it actually retry the calling function, but it might be possible. Another option might be https://github.com/str0zzapreti/pytest-retry, but I couldn't see how to get that to run the gc/cache clear step before the retry.

@jacobhinkle
Collaborator Author

!build

m, n, k, layout = config

clear_cuda_cache()
a = torch.randn(m, k, device="cuda", dtype=dtype)
Collaborator

Will the changes in this file be superseded by the other PR?

Collaborator Author

I will merge it manually

Collaborator Author

Merged

@jacobhinkle
Collaborator Author

!build

@Priya2698 Priya2698 left a comment

LGTM -- this is a great improvement!
test_matmul still seems to have some changes, though, which are likely not part of this PR and may need to be cleaned up manually.

@jacobhinkle
Collaborator Author

test_matmul still seems to have some changes though which are likely not part of this PR that may need to be cleaned manually.

It does? I just removed the try block. Is there something else you noticed?

@Priya2698
Collaborator

test_matmul still seems to have some changes though which are likely not part of this PR that may need to be cleaned manually.

It does? I just removed the try block. Is there something else you noticed?

Oh you're right. I misunderstood.
The PR can be merged as-is.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

The jit_python_bc_advisory_17_A100 failure is real. I will fix it before merging

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

The jit_python_bc_advisory_17_A100 failure is real. I will fix it before merging

The error seems strange, like CI was using an older test_ops.py which was trying to import clear_cuda_cache. I'm trying it again. I also fixed test_matmul.py.

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

The error seems strange, like CI was using an older test_ops.py which was trying to import clear_cuda_cache. I'm trying it again. I also fixed test_matmul.py.

This is actually intended behavior. That job runs the tests from merge-base to detect changes in Python API. In this case it detects that we have changed nvfuser/python_utils.py to remove clear_cuda_cache. Since that is a user-facing library, I bumped version.txt in the latest push.

@jacobhinkle jacobhinkle merged commit 5db18de into main Oct 29, 2024
@jacobhinkle jacobhinkle deleted the retry_python_tests_on_oom branch October 29, 2024 13:26
Priya2698 added a commit that referenced this pull request Oct 30, 2024
This benchmark was added recently and did not have the changes added by
PR #3252.
The benchmark will fail on CI due to a missing import.
