Fix torch allocator clearing cache on every benchmark #3238
jacobhinkle wants to merge 1 commit into main
Conversation
This speeds up running the benchmarks considerably, since re-allocating all that memory (via cudaMalloc) after each cache clear is slow.
!build

!build --matmul-bench
```python
    Utility function to clear CUDA cache before running a test.
    """
    if (
        torch.cuda.memory_allocated()
```
Can we safely remove this?
Shouldn't we always clear the L2 cache if there is memory allocated at the beginning to benchmark each round from the same memory state?
I was under the impression that this just queries the device statistics; how does this use malloc? Can you point me to a reference on how it works?
As mentioned here, "allocated" refers only to the memory occupied by live tensors; deleted tensors' memory may still remain allocated to the PyTorch cache. That unused-but-still-allocated memory is counted, along with the allocated memory, in the "reserved" amount. I don't think we care what is allocated, and in particular having this check here means we will clear the cache every time this is called if there is even a single tensor still in memory somewhere, i.e. not yet garbage collected.
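The allocated/reserved distinction this reply relies on can be modeled with a toy caching allocator. This is an illustrative sketch only (the class and its fields are invented for this example, not PyTorch internals): freed tensors reduce the "allocated" count, but their memory stays in the cache and is still "reserved" until `empty_cache()` is called.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator: freed blocks stay reserved in a cache."""

    def __init__(self):
        self.allocated = 0  # bytes held by live tensors (cf. torch.cuda.memory_allocated)
        self.reserved = 0   # bytes held from the driver, cache included (cf. memory_reserved)

    def malloc(self, nbytes):
        # A new tensor is created; grow the reservation only if the cache
        # cannot satisfy the request.
        self.allocated += nbytes
        self.reserved = max(self.reserved, self.allocated)

    def free(self, nbytes):
        # The tensor is gone, but its memory stays in the cache (still reserved).
        self.allocated -= nbytes

    def empty_cache(self):
        # The expensive step the PR wants to avoid doing on every benchmark:
        # return the cached memory to the driver.
        self.reserved = self.allocated


alloc = ToyCachingAllocator()
alloc.malloc(100)
alloc.free(100)
# allocated is back to 0, but reserved stays at 100: checking
# memory_allocated() > 0 would miss this cached-but-unused memory,
# while a single live tensor would trigger a cache clear every call.
```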
Closing in favor of #3252
Inspired by #3174. This is an alternative to #3238.

Previously we manually reset the cuda cache whenever usage was above 80%. This is not ideal: we could have 79% usage and a test that requires 25%, and that test would fail. We might also clear the cache unnecessarily, e.g. when we are using 81% but only need a few percent for the remainder of the tests.

This PR cleans this up by introducing a new test decorator `@retry_on_oom_or_skip_test`. This decorator must be placed innermost, underneath the other decorators. It executes the test inside a try block. If the test fails due to `torch.OutOfMemoryError`, we clear the cuda cache and retry the test. If it fails again due to `torch.OutOfMemoryError`, we skip the test.

I updated the python benchmarks to apply this decorator automatically and to remove the manual `clear_cuda_cache()` calls.
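The retry-then-skip logic described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: `OutOfMemoryError`, `empty_cache`, and `SkipTest` are local stand-ins for `torch.OutOfMemoryError`, `torch.cuda.empty_cache()`, and `pytest.skip` so the sketch runs without a GPU.

```python
import functools


class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.OutOfMemoryError."""


class SkipTest(Exception):
    """Stand-in for the exception pytest.skip raises."""


def empty_cache():
    """Stand-in for torch.cuda.empty_cache()."""


def retry_on_oom_or_skip_test(func):
    """Run the test; on OOM, clear the cache and retry once; on a second OOM, skip."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OutOfMemoryError:
            # First failure: free the cached memory and give the test one retry.
            empty_cache()
            try:
                return func(*args, **kwargs)
            except OutOfMemoryError:
                # Still not enough memory even with an empty cache: skip.
                raise SkipTest("insufficient memory to run test")

    return wrapper


# Usage: a test that OOMs on the first attempt but passes on the retry.
attempts = []


@retry_on_oom_or_skip_test
def flaky_test():
    attempts.append(1)
    if len(attempts) == 1:
        raise OutOfMemoryError("simulated OOM")
    return "ok"
```

Because the decorator wraps the raw test function in a try block, it has to sit innermost in the decorator stack; an outer parametrize or fixture decorator would otherwise intercept the call before the retry logic sees the exception.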