cuda: add partial eviction on pool OOM #22193
leonardHONG wants to merge 3 commits into ggml-org:master
Conversation
I'm not convinced that this will actually be beneficial on average.
It might help the slowdown reported in #22094 (comment); that should be tested. But yeah, this PR needs evidence.
I mean, even without evidence I would expect clearing only the largest buffers to be prone to memory fragmentation. |
One variant of this PR that I would be willing to approve and merge would be something like this:
Per JohannesGaessler's feedback on ggml-org#22193:
- Add LRU timestamps to pool buffers
- On OOM, evict least recently used first (up to 3x requested)
- Fall back to full flush if LRU eviction isn't enough
- Add alloc_oversize() with configurable factor
- FA temp buffers use 2x overalloc on HIP to reduce thrashing

3 files, pool-internal fix, no separate allocator.
Thank you for the crystal clear roadmap, @JohannesGaessler!
Signed-off-by: 梁厚宏 <2695316095@qq.com>
Force-pushed from ef40865 to d3e3a69
1. Legacy pool / LRU reclaim
2. FA conversion / overallocate
Signed-off-by: 梁厚宏 <2695316095@qq.com>
Force-pushed from 6fc78c6 to 00afd1b
Thanks @IMbackK for the root-cause work in ROCm/rocm-systems#4817, and cc @JohannesGaessler since this follows the roadmap. For context, #22193 follows #22155 and was opened before #22207, implementing the same direction: LRU reclaim, 3x bound, and FA overallocate. I have now pushed a HIP multi-GPU guard as a follow-up to preserve master-equivalent behavior. Happy to coordinate as needed and leave the merge decision to the maintainers.
It sounds like the reason we are at this spot is that flash attention buffers grow exponentially, and unless we over-allocate them to double + 1 MB (what I measured) the cache is useless. So it sounds like a better fix would be to give the FA buffers their own pool with a smaller depth.
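As a rough illustration of that "separate, shallower pool" idea (purely a sketch, not code from this PR; every name below is made up), a dedicated cache with a tiny depth would keep the growing K/V conversion buffers from churning the main pool:

```cpp
#include <array>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical buffer handle returned by the dedicated FA pool.
struct fa_buf {
    void * ptr  = nullptr;
    size_t size = 0;
};

// A very shallow pool reserved for FA temp buffers (depth 2 is just an example).
struct fa_scratch_pool {
    static constexpr int DEPTH = 2;
    std::array<fa_buf, DEPTH> cache{};

    fa_buf alloc(size_t n) {
        // Reuse any cached buffer that is already large enough.
        for (fa_buf & b : cache) {
            if (b.ptr && b.size >= n) {
                fa_buf out = b;
                b = {};
                return out;
            }
        }
        // Otherwise allocate fresh; a growth policy (e.g. 2x + 1 MiB) would go here.
        fa_buf out;
        out.size = n;
        cudaMalloc(&out.ptr, out.size);
        return out;
    }

    void release(fa_buf buf) {
        // Cache the buffer in a free slot, otherwise hand the memory back immediately.
        for (fa_buf & b : cache) {
            if (!b.ptr) {
                b = buf;
                return;
            }
        }
        cudaFree(buf.ptr);
    }
};
```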
This is a small follow-up to #22155.
In #22155, legacy-pool OOM recovery was handled by flushing the whole cache and retrying once. That fixed the immediate issue, but it is also a fairly blunt fallback.
This PR changes two things.
First, in the legacy pool OOM path, it adds an LRU-style bounded reclaim step before the existing full flush. On `cudaErrorMemoryAllocation`, it evicts the oldest cached buffers first, up to a bound of 3x the current request size, and then retries the allocation. If that still fails, it falls back to the existing `clear_pool()` + retry path, so the safety net from #22155 is preserved.
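To make the order of operations concrete, this is roughly the shape of that reclaim path (an illustrative sketch only, not the actual pool code; `pool_entry`, `alloc_with_bounded_reclaim`, and the direct `cudaMalloc`/`cudaFree` calls stand in for the real pool internals):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical view of one cached buffer in the legacy pool.
struct pool_entry {
    void *  ptr      = nullptr;
    size_t  size     = 0;
    int64_t last_use = 0; // monotonically increasing "time" of last reuse
};

// On an allocation failure, evict least-recently-used buffers first, but only
// up to ~3x the requested size, then retry. If that is still not enough, fall
// back to flushing the whole pool (the #22155 safety net) and retry once.
static void * alloc_with_bounded_reclaim(std::vector<pool_entry> & pool, size_t request) {
    void * ptr = nullptr;
    if (cudaMalloc(&ptr, request) == cudaSuccess) {
        return ptr;
    }

    // Oldest buffers (smallest last_use) are evicted first.
    std::sort(pool.begin(), pool.end(),
              [](const pool_entry & a, const pool_entry & b) { return a.last_use < b.last_use; });

    size_t freed = 0;
    while (!pool.empty() && freed < 3 * request) {
        freed += pool.front().size;
        cudaFree(pool.front().ptr);
        pool.erase(pool.begin());
    }
    if (cudaMalloc(&ptr, request) == cudaSuccess) {
        return ptr;
    }

    // Bounded reclaim was not enough: flush everything and retry once.
    for (pool_entry & e : pool) {
        cudaFree(e.ptr);
    }
    pool.clear();
    cudaMalloc(&ptr, request);
    return ptr; // nullptr here means a genuine OOM
}
```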
Second, it adds a `bool overallocate` parameter to `ggml_cuda_pool::alloc()`. When `false`, the existing 1.05x look-ahead behavior stays unchanged. When `true`, the legacy pool allocates 2x instead. This is enabled for the FA `K_f16`/`V_f16` conversion buffers, since those buffers tend to grow as the context length grows, and this helps avoid repeated reallocations.

I read @JohannesGaessler's design comment and followed the requested behavior, while making two small implementation choices differently. I used a `bool overallocate` flag instead of a float lookahead parameter because the current requirements only need two modes: the default 1.05x behavior and an opt-in 2x overallocate path. If more tuning points are needed later, expanding this to a float parameter should still be straightforward. I also kept the LRU bookkeeping in a parallel `buffer_pool_ts[]` array instead of adding a `last_use` field to `ggml_cuda_buffer`, to keep it separate from the buffer struct. Happy to refactor either of these if a float lookahead API or an in-struct `last_use` is preferred.
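For reference, the sizing rule behind the flag boils down to something like this (again only a sketch; `padded_size` is a hypothetical helper, not the real `alloc()` interface):

```cpp
#include <cstddef>

// The default path keeps the existing ~1.05x look-ahead; the opt-in path
// doubles the request so growing FA K_f16/V_f16 buffers are reallocated far
// less often as the context length increases.
static size_t padded_size(size_t request, bool overallocate) {
    if (overallocate) {
        return request * 2;            // opt-in 2x path for FA conversion buffers
    }
    return request + request / 20;     // existing 1.05x look-ahead
}
```

A float lookahead factor would subsume both branches, which is why widening the API later should stay mechanical.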
I tested two cases: (a) legacy-pool LRU reclaim, triggered via one-shot fault injection, and (b) FA `K_f16`/`V_f16` overallocate, triggered under a quantized KV cache (`-ctk q8_0 -ctv q8_0`). Detailed logs are in the comment below.