Fix torch allocator clearing cache on every benchmark#3238

Closed
jacobhinkle wants to merge 1 commit into main from fix_clear_cuda_cache
Conversation

@jacobhinkle
Collaborator

This speeds up running the benchmarks by quite a bit since malloc is so slow.

@jacobhinkle jacobhinkle requested a review from Priya2698 October 20, 2024 00:24
@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build --matmul-bench

Review comment on the changed hunk:

```python
    Utility function to clear CUDA cache before running a test.
    """
    if (
        torch.cuda.memory_allocated()
```
Collaborator
Can we safely remove this?
Shouldn't we always clear the L2 cache if there is memory allocated at the beginning, so that each round benchmarks from the same memory state?

I was under the impression that this just queries the device statistics; how does this use malloc? Can you point me to any reference on how it works?

Collaborator Author

As mentioned here, "allocated" refers only to the memory occupied by live tensors; memory from deleted tensors may still remain held by the PyTorch cache. That unused-but-still-allocated memory is counted, together with the tensor memory, in the "reserved" amount. I don't think we care what is allocated, and in particular having this check here means we will clear the cache on every call whenever even a single tensor is still alive somewhere, i.e. not yet garbage collected.
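The allocated/reserved distinction described above can be illustrated with a toy model (this is not PyTorch's actual allocator, just a sketch of the bookkeeping): freeing a tensor lowers `allocated` but leaves `reserved` unchanged, which is why gating on `memory_allocated()` fires on every call as long as any tensor is alive.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (not PyTorch's implementation),
    showing why `allocated` and `reserved` differ."""

    def __init__(self):
        self.allocated = 0  # bytes held by live tensors
        self.reserved = 0   # bytes held by the pool (live + cached)

    def malloc(self, n):
        # Reserve new device memory only when the cache can't serve us.
        if self.reserved - self.allocated < n:
            self.reserved += n - (self.reserved - self.allocated)
        self.allocated += n

    def free(self, n):
        # Freed tensor memory returns to the cache, not to the device:
        # `allocated` drops, `reserved` stays put.
        self.allocated -= n

    def empty_cache(self):
        # Analogous to torch.cuda.empty_cache().
        self.reserved = self.allocated

alloc = ToyCachingAllocator()
alloc.malloc(100)
alloc.free(100)
# The tensor is gone: nothing is allocated, but the pool still holds it.
print(alloc.allocated, alloc.reserved)  # 0 100
```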

@jacobhinkle
Copy link
Collaborator Author

Closing in favor of #3252

jacobhinkle added a commit that referenced this pull request Oct 29, 2024
Inspired by #3174

This is an alternative to #3238.

Previously we were manually resetting the CUDA cache whenever usage was above 80%. This is not ideal: we could be at 79% usage with a test that requires 25%, and that test would fail. We also might clear the cache unnecessarily, e.g. when we are at 81% but only need a few percent more for the remaining tests.

This PR cleans this up by introducing a new test decorator
`@retry_on_oom_or_skip_test`. This decorator must be placed innermost,
underneath the other decorators. It will execute the test inside a try
block. If the test fails due to `torch.OutOfMemoryError`, we clear the
cuda cache and retry the test. If it fails again due to
`torch.OutOfMemoryError`, then we skip the test.

I updated the python benchmarks to apply this decorator automatically,
and to remove the manual `clear_cuda_cache()` calls.