Introduce @retry_on_oom_or_skip_test #3252
Conversation
Inspired by #3174. Previously we were manually resetting the CUDA cache whenever usage was above 80%. This is not ideal: we could have 79% usage and a test that requires 25%, and that test would fail. We also might clear the cache unnecessarily, e.g. when we are at 81% but only need a few percent for the remainder of the tests. This PR cleans this up by introducing a new test decorator, @retry_on_oom_or_skip_test. The decorator must be placed innermost, underneath the other decorators. It executes the test inside a try block; if the test fails with torch.OutOfMemoryError, we clear the CUDA cache and retry the test. If it fails again with torch.OutOfMemoryError, we skip the test.
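The mechanism described above can be sketched as a plain Python decorator. This is a minimal, hypothetical sketch, not the PR's actual code: `torch.OutOfMemoryError` and `torch.cuda.empty_cache()` are replaced with local stand-ins so the sketch runs without a GPU.

```python
import functools
import unittest

# Stand-ins so this sketch runs without a GPU. The real decorator would
# catch torch.OutOfMemoryError and call torch.cuda.empty_cache().
class OutOfMemoryError(RuntimeError):
    pass

def clear_cuda_cache():
    pass  # stand-in for torch.cuda.empty_cache()

def retry_on_oom_or_skip_test(fn):
    """Run the test; on an OOM error, clear the cache and retry once;
    on a second OOM error, skip the test. Must be the innermost
    decorator so it wraps the test body directly."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except OutOfMemoryError:
            clear_cuda_cache()
        try:
            # Second and final attempt, after clearing the cache.
            return fn(*args, **kwargs)
        except OutOfMemoryError:
            raise unittest.SkipTest("skipped after two CUDA OOM failures")
    return wrapper
```

Raising `unittest.SkipTest` from inside the test body is how both unittest and pytest register a skip at runtime, which is why the decorator can skip without any framework-specific hooks.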
!build
This is a great idea.
Interesting idea. I can't yet see how to make it actually retry the calling function, but it might be possible. Another option might be https://github.com/str0zzapreti/pytest-retry, but I couldn't see how to get it to run the gc/cache-clear step before the retry.
This partially reverts commit a226a34.
!build
    m, n, k, layout = config
    clear_cuda_cache()
    a = torch.randn(m, k, device="cuda", dtype=dtype)
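The `clear_cuda_cache()` call in this hunk is part of the manual pattern the PR removes. Its threshold heuristic, as described in the PR text, can be sketched as follows. This is a hypothetical, torch-free sketch (the real check would compare torch.cuda memory stats); only the 80% figure comes from the PR description.

```python
def should_clear_cache(reserved_bytes: int, total_bytes: int,
                       threshold: float = 0.8) -> bool:
    """Hypothetical sketch of the old heuristic: clear the CUDA cache
    only when reserved memory exceeds `threshold` of device memory."""
    return reserved_bytes / total_bytes > threshold
```

The flaw the PR calls out: at 79% usage this returns False, so a test that needs another 25% of device memory still OOMs, while at 81% the cache may be cleared even when the remaining tests only need a few percent.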
Will the changes in this file be superseded by the other PR?
I will merge it manually.
!build
Priya2698 left a comment:
LGTM -- this is a great improvement!
test_matmul still seems to have some changes, though, which are likely not part of this PR and may need to be cleaned up manually.
It does? I just removed the try block. Is there something else you noticed?
Oh, you're right. I misunderstood.
!build
!build
The error seems strange, like CI was using an older test_ops.py which was trying to import |
!build
This is actually intended behavior. That job runs the tests from merge-base to detect changes in Python API. In this case it detects that we have changed |
This benchmark was added recently and did not have the changes from PR #3252, so it will fail in CI due to the missing import.
This is an alternative to #3238.
I updated the python benchmarks to apply this decorator automatically, and to remove the manual clear_cuda_cache() calls.