Fix torch allocator clearing cache on every benchmark #3238
jacobhinkle wants to merge 1 commit into main
Conversation
This speeds up running the benchmarks considerably, since re-allocating all that memory (via cudaMalloc) after each cache clear is slow.
!build

!build --matmul-bench
```python
    Utility function to clear CUDA cache before running a test.
    """
    if (
        torch.cuda.memory_allocated()
```
Can we safely remove this?
Shouldn't we always clear the L2 cache if there is memory allocated at the beginning to benchmark each round from the same memory state?
I was under the impression that this just queries the device statistics; how does this use malloc? Can you point me to a reference on how it works?
As mentioned here, "allocated" refers only to the memory occupied by live tensors; deleted tensors' memory may still remain allocated to the PyTorch cache. That unused-but-still-allocated memory is counted, along with the allocated memory, in the "reserved" amount. I don't think we care what is allocated, and in particular having this check here means we will clear the cache every time this is called if there is even a single tensor still in memory somewhere, i.e. not yet garbage collected.
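The allocated/reserved distinction this reply relies on can be modeled with a toy caching allocator. This is an illustrative sketch only (the class and its fields are invented for this example, not PyTorch internals): freed tensors reduce the "allocated" count, but their memory stays in the cache and is still "reserved" until `empty_cache()` is called.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator: freed blocks stay reserved in a cache."""

    def __init__(self):
        self.allocated = 0  # bytes held by live tensors (cf. torch.cuda.memory_allocated)
        self.reserved = 0   # bytes held from the driver, cache included (cf. memory_reserved)

    def malloc(self, nbytes):
        # A new tensor is created; grow the reservation only if the cache
        # cannot satisfy the request.
        self.allocated += nbytes
        self.reserved = max(self.reserved, self.allocated)

    def free(self, nbytes):
        # The tensor is gone, but its memory stays in the cache (still reserved).
        self.allocated -= nbytes

    def empty_cache(self):
        # The expensive step the PR wants to avoid doing on every benchmark:
        # return the cached memory to the driver.
        self.reserved = self.allocated


alloc = ToyCachingAllocator()
alloc.malloc(100)
alloc.free(100)
# allocated is back to 0, but reserved stays at 100: checking
# memory_allocated() > 0 would miss this cached-but-unused memory,
# while a single live tensor would trigger a cache clear every call.
```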
Closing in favor of #3252
Inspired by #3174. This is an alternative to #3238.

Previously we manually reset the cuda cache whenever usage was above 80%. This is not ideal: we could have 79% usage and a test that requires 25%, and that test would fail. We might also clear the cache unnecessarily, e.g. when we are using 81% but only need a few percent for the remainder of the tests.

This PR cleans this up by introducing a new test decorator `@retry_on_oom_or_skip_test`. This decorator must be placed innermost, underneath the other decorators. It executes the test inside a try block. If the test fails due to `torch.OutOfMemoryError`, we clear the cuda cache and retry the test. If it fails again due to `torch.OutOfMemoryError`, we skip the test.

I updated the python benchmarks to apply this decorator automatically and to remove the manual `clear_cuda_cache()` calls.
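The retry-then-skip logic described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: `OutOfMemoryError`, `empty_cache`, and `SkipTest` are local stand-ins for `torch.OutOfMemoryError`, `torch.cuda.empty_cache()`, and `pytest.skip` so the sketch runs without a GPU.

```python
import functools


class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.OutOfMemoryError."""


class SkipTest(Exception):
    """Stand-in for the exception pytest.skip raises."""


def empty_cache():
    """Stand-in for torch.cuda.empty_cache()."""


def retry_on_oom_or_skip_test(func):
    """Run the test; on OOM, clear the cache and retry once; on a second OOM, skip."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OutOfMemoryError:
            # First failure: free the cached memory and give the test one retry.
            empty_cache()
            try:
                return func(*args, **kwargs)
            except OutOfMemoryError:
                # Still not enough memory even with an empty cache: skip.
                raise SkipTest("insufficient memory to run test")

    return wrapper


# Usage: a test that OOMs on the first attempt but passes on the retry.
attempts = []


@retry_on_oom_or_skip_test
def flaky_test():
    attempts.append(1)
    if len(attempts) == 1:
        raise OutOfMemoryError("simulated OOM")
    return "ok"
```

Because the decorator wraps the raw test function in a try block, it has to sit innermost in the decorator stack; an outer parametrize or fixture decorator would otherwise intercept the call before the retry logic sees the exception.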