[Feature] Add automatic GPU memory cleanup fixture for ROCm tests#181

Open
eppaneamd wants to merge 4 commits into ROCm:amd-integration from eppaneamd:fix/test-cache-cleanup
Conversation

@eppaneamd
📌 Description

Introduces an automatic pytest fixture that monitors and cleans GPU memory after each test in the ROCm test suite.

  • Add _maybe_clear_gpu_memory() helper function to conditionally clear GPU cache based on memory threshold
  • Add clear_gpu_memory fixture with autouse=True to run automatically after each test
  • Configurable threshold via FLASHINFER_TEST_MEMORY_THRESHOLD environment variable (default: 0.75)

Copilot AI review requested due to automatic review settings February 23, 2026 22:04

Copilot AI left a comment

Pull request overview

This PR introduces automatic GPU memory cleanup for ROCm tests by adding a pytest fixture that monitors and conditionally clears GPU cache after each test based on a configurable memory threshold.

Changes:

  • Added _maybe_clear_gpu_memory() helper function to check GPU memory usage and clear cache when above threshold
  • Added clear_gpu_memory autouse pytest fixture to automatically run cleanup after each test
  • Introduced FLASHINFER_TEST_MEMORY_THRESHOLD environment variable for configurable threshold (default: 0.75)

Comment thread tests/rocm_tests/conftest.py Outdated

def _maybe_clear_gpu_memory(device: torch.device) -> None:
    total_memory = torch.cuda.get_device_properties(device).total_memory
    reserved_memory = torch.cuda.memory_reserved()
Copilot AI Feb 23, 2026

The torch.cuda.memory_reserved() call is missing the device parameter. Without specifying the device, it defaults to the current device, which may not match the device passed to this function. This should be torch.cuda.memory_reserved(device) to ensure consistency with the device used for getting total memory on line 65.

Suggested change:

-    reserved_memory = torch.cuda.memory_reserved()
+    reserved_memory = torch.cuda.memory_reserved(device)

Comment on lines +64 to +73
def _maybe_clear_gpu_memory(device: torch.device) -> None:
    total_memory = torch.cuda.get_device_properties(device).total_memory
    reserved_memory = torch.cuda.memory_reserved()

    # FLASHINFER_TEST_MEMORY_THRESHOLD: threshold for PyTorch reserved memory usage (default: 0.75)
    threshold = float(os.environ.get("FLASHINFER_TEST_MEMORY_THRESHOLD", "0.75"))

    if reserved_memory > threshold * total_memory:
        gc.collect()
        torch.cuda.empty_cache()
Copilot AI Feb 23, 2026

This function duplicates the logic from clear_cuda_cache in tests/test_helpers/test_helpers.py with two differences: (1) the default threshold is 0.75 here vs 0.9 there, and (2) the existing function also has the same bug of missing device parameter in memory_reserved(). Consider either reusing the existing function or ensuring consistency in implementation. The lower threshold (0.75) means more aggressive cleanup, which may impact test performance differently than the existing tests.
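
One way to address this comment would be a single shared helper that passes `device` to both memory queries and takes the default threshold as a parameter, so both call sites can keep their current defaults. The sketch below is an assumption about how such a dedup could look, not the PR's actual resolution; the name `clear_cuda_cache_if_needed` is hypothetical, and it degrades to a no-op when torch or a GPU is unavailable:

```python
import gc
import os

def clear_cuda_cache_if_needed(device: int = 0, default_threshold: str = "0.9") -> bool:
    # Hypothetical shared helper merging _maybe_clear_gpu_memory() and the
    # existing clear_cuda_cache(): one implementation, device passed
    # consistently to memory_reserved(), caller-chosen default threshold.
    try:
        import torch
    except ImportError:
        return False  # torch not installed; nothing to clean
    if not torch.cuda.is_available():
        return False  # no CUDA/ROCm device present
    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)  # device passed, per the review
    threshold = float(os.environ.get("FLASHINFER_TEST_MEMORY_THRESHOLD", default_threshold))
    if reserved > threshold * total:
        gc.collect()
        torch.cuda.empty_cache()
        return True
    return False
```

The ROCm conftest could then call `clear_cuda_cache_if_needed(device, default_threshold="0.75")` while the existing test helpers keep `"0.9"`, preserving the more aggressive cleanup only where this PR wants it.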
