[Feature] Add automatic GPU memory cleanup fixture for ROCm tests #181

eppaneamd wants to merge 4 commits into ROCm:amd-integration from
Conversation
Pull request overview
This PR introduces automatic GPU memory cleanup for ROCm tests by adding a pytest fixture that monitors and conditionally clears GPU cache after each test based on a configurable memory threshold.
Changes:
- Added `_maybe_clear_gpu_memory()` helper function to check GPU memory usage and clear the cache when usage exceeds the threshold
- Added `clear_gpu_memory` autouse pytest fixture to automatically run cleanup after each test
- Introduced `FLASHINFER_TEST_MEMORY_THRESHOLD` environment variable for a configurable threshold (default: 0.75)
```python
def _maybe_clear_gpu_memory(device: torch.device) -> None:
    total_memory = torch.cuda.get_device_properties(device).total_memory
    reserved_memory = torch.cuda.memory_reserved()
```
The `torch.cuda.memory_reserved()` call is missing the device parameter. Without it, the call defaults to the current device, which may not match the device passed to this function. This should be `torch.cuda.memory_reserved(device)` to ensure consistency with the device used for getting total memory on line 65.
```diff
-    reserved_memory = torch.cuda.memory_reserved()
+    reserved_memory = torch.cuda.memory_reserved(device)
```
```python
def _maybe_clear_gpu_memory(device: torch.device) -> None:
    total_memory = torch.cuda.get_device_properties(device).total_memory
    reserved_memory = torch.cuda.memory_reserved()

    # FLASHINFER_TEST_MEMORY_THRESHOLD: threshold for PyTorch reserved memory usage (default: 0.75)
    threshold = float(os.environ.get("FLASHINFER_TEST_MEMORY_THRESHOLD", "0.75"))

    if reserved_memory > threshold * total_memory:
        gc.collect()
        torch.cuda.empty_cache()
```
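The threshold logic itself can be exercised without a GPU. The sketch below isolates the comparison in a hypothetical `should_clear` helper (not part of the PR) that takes memory figures as plain integers in place of the `torch.cuda` queries:

```python
import os


def should_clear(reserved_memory: int, total_memory: int) -> bool:
    # Hypothetical helper mirroring the PR's check: clear the cache when
    # reserved memory exceeds threshold * total memory.
    threshold = float(os.environ.get("FLASHINFER_TEST_MEMORY_THRESHOLD", "0.75"))
    return reserved_memory > threshold * total_memory


# With the default 0.75 threshold (assuming the env var is unset):
print(should_clear(6 * 1024**3, 8 * 1024**3))  # 6/8 = 0.75, not strictly above
print(should_clear(7 * 1024**3, 8 * 1024**3))  # 7/8 = 0.875, above the threshold
```

Note that the comparison is strict (`>`), so a test sitting exactly at the threshold does not trigger a cache clear.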
This function duplicates the logic of `clear_cuda_cache` in `tests/test_helpers/test_helpers.py`, with two differences: (1) the default threshold is 0.75 here vs 0.9 there, and (2) the existing function has the same bug of a missing device parameter in `memory_reserved()`. Consider either reusing the existing function or keeping the two implementations consistent. The lower threshold (0.75) means more aggressive cleanup, which may affect test performance differently than in the existing tests.
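One way to address the duplication, sketched below as a suggestion rather than code from the PR: factor the check into a single shared helper and inject the memory queries and the default threshold, so both call sites (0.9 for the existing `clear_cuda_cache`, 0.75 for the ROCm fixture) share one implementation. All names here are hypothetical:

```python
import gc
import os
from typing import Callable


def clear_cache_if_needed(
    get_reserved: Callable[[], int],
    get_total: Callable[[], int],
    empty_cache: Callable[[], None],
    default_threshold: float = 0.9,
) -> bool:
    # Shared helper: each call site passes its own default threshold,
    # while the env var still overrides both. Returns True if it cleared.
    threshold = float(
        os.environ.get("FLASHINFER_TEST_MEMORY_THRESHOLD", str(default_threshold))
    )
    if get_reserved() > threshold * get_total():
        gc.collect()
        empty_cache()
        return True
    return False


# Fake memory readings stand in for the torch.cuda queries, so the
# helper's behavior can be checked without a GPU.
calls = []
cleared = clear_cache_if_needed(
    get_reserved=lambda: 95,
    get_total=lambda: 100,
    empty_cache=lambda: calls.append("empty_cache"),
)
```

In real use, the callables would be `lambda: torch.cuda.memory_reserved(device)`, `lambda: torch.cuda.get_device_properties(device).total_memory`, and `torch.cuda.empty_cache`, which also forces the device parameter to be passed consistently at both call sites.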
📌 Description

Introduces an automatic pytest fixture that monitors and cleans GPU memory after each test in the ROCm test suite.

- `_maybe_clear_gpu_memory()` helper function to conditionally clear the GPU cache based on a memory threshold
- `clear_gpu_memory` fixture with `autouse=True` to run automatically after each test
- `FLASHINFER_TEST_MEMORY_THRESHOLD` environment variable (default: 0.75)
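Since the fixture is autouse, no test changes are needed; the threshold can be tuned per run through the environment variable. The invocation below is illustrative (the test path is not taken from the PR):

```shell
# Relax cleanup: only clear the cache when more than 90% of device
# memory is reserved, instead of the 0.75 default.
FLASHINFER_TEST_MEMORY_THRESHOLD=0.9 pytest tests/
```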