Skip python matmul benchmarks that fail due to OOM #3174

Merged
jacobhinkle merged 5 commits into main from skip_oom_python_matmul_benchmarks
Oct 16, 2024
Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Oct 12, 2024

I'm adding this skip so that we can soon expect to run the python matmul benchmarks without any failures. Some of these tests use extreme sizes that need over 200 GB of memory. Instead of removing those tests, I run all of them and skip the ones that fail with a torch OOM error.

@jacobhinkle jacobhinkle requested a review from Priya2698 October 12, 2024 18:38
@jacobhinkle jacobhinkle changed the title from "Skip python matmul benchmarks that use at least 90% of gmem" to "Skip python matmul benchmarks that fail due to OOM" on Oct 15, 2024
@jacobhinkle
Collaborator Author

!build --matmul-bench

@jacobhinkle jacobhinkle merged commit fc67b4e into main Oct 16, 2024
@jacobhinkle jacobhinkle deleted the skip_oom_python_matmul_benchmarks branch October 16, 2024 11:59
jacobhinkle added a commit that referenced this pull request Oct 21, 2024
Inspired by #3174

Previously we manually reset the CUDA cache whenever usage was above 80%. This is not ideal: we could be at 79% usage with a test that requires 25%, and that test would fail. We might also clear the cache unnecessarily, e.g. when we are at 81% usage but the remaining tests only need a few percent.

This PR cleans this up by introducing a new test decorator,
`@retry_on_oom_or_skip_test`. This decorator must be placed innermost,
underneath the other decorators. It executes the test inside a try
block. If the test fails due to `torch.OutOfMemoryError`, we clear the
CUDA cache and retry the test. If it fails again due to
`torch.OutOfMemoryError`, then we skip the test.
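The old heuristic being replaced might be sketched roughly as follows. This is an illustrative reconstruction, not code from the PR; `maybe_clear_cuda_cache` and the exact memory query are assumptions.

```python
import torch


def maybe_clear_cuda_cache(threshold: float = 0.8) -> None:
    """Illustrative sketch (hypothetical name) of the old heuristic:
    clear the CUDA cache when reserved memory exceeds ~80% of device
    memory, regardless of what the remaining tests actually need."""
    if not torch.cuda.is_available():
        return
    device = torch.cuda.current_device()
    total = torch.cuda.get_device_properties(device).total_memory
    if torch.cuda.memory_reserved(device) / total > threshold:
        torch.cuda.empty_cache()
```

As the commit message notes, a fixed threshold like this can both miss real OOMs (79% used, 25% needed) and clear the cache when nothing more is needed.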
jacobhinkle added a commit that referenced this pull request Oct 29, 2024
Inspired by #3174

This is an alternative to #3238.

Previously we manually reset the CUDA cache whenever usage was above 80%. This is not ideal: we could be at 79% usage with a test that requires 25%, and that test would fail. We might also clear the cache unnecessarily, e.g. when we are at 81% usage but the remaining tests only need a few percent.

This PR cleans this up by introducing a new test decorator,
`@retry_on_oom_or_skip_test`. This decorator must be placed innermost,
underneath the other decorators. It executes the test inside a try
block. If the test fails due to `torch.OutOfMemoryError`, we clear the
CUDA cache and retry the test. If it fails again due to
`torch.OutOfMemoryError`, then we skip the test.

I updated the python benchmarks to apply this decorator automatically,
and to remove the manual `clear_cuda_cache()` calls.
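A decorator along the lines described above could be sketched like this. This is a minimal reconstruction from the commit message, not the actual nvFuser implementation; it catches `torch.cuda.OutOfMemoryError` (exposed as `torch.OutOfMemoryError` in newer torch releases) and uses `unittest.SkipTest` to signal the skip, which pytest also honors.

```python
import functools
import unittest

import torch


def retry_on_oom_or_skip_test(func):
    """Sketch: run the test; on OOM, clear the CUDA cache and retry
    once; if the retry also OOMs, skip the test instead of failing."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except torch.cuda.OutOfMemoryError:
            # First failure: free cached blocks and try again.
            torch.cuda.empty_cache()
        try:
            return func(*args, **kwargs)
        except torch.cuda.OutOfMemoryError as e:
            # Second failure: the test genuinely does not fit; skip it.
            raise unittest.SkipTest(f"Skipped due to OOM: {e}") from e

    return wrapper
```

Because `functools.wraps` preserves the test's name and signature, the decorator stays transparent to test collection, which is why it must sit innermost, directly on the test function, beneath any parametrization decorators.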