ggml-cuda: flush legacy pool on OOM and retry #22155
Conversation
Signed-off-by: 梁厚宏 <2695316095@qq.com>
This is not safe to do. If the memory is still in use, it must not be freed.
Thanks for the catch! I completely overlooked the async lifecycle issue here. I'll drop this unsafe logic and rework it tonight.
I... actually don't see it (how clear_pool can prune in-use chunks).
The legacy CUDA buffer pool is essentially just a list of previously used temporary buffers. "Allocating" and "freeing" a buffer just means retrieving a buffer for use within a kernel; it does not mean that there no longer are any kernels queued that will attempt to use the buffer. Since the buffer pool is per […]

Did you consider replacing the legacy buffer pool with […]?
Yeah exactly, I was really confused there, thinking that this is perhaps not true for CUDA and only for HIP.
I did try this, it is slow. Hopefully they will fix the damn virtual memory support soon.
I'd suggest, if you go with this fix, don't rely on this behavior of […]. The CUDA documentation for […]
void clear_pool() {
    for (int i = 0; i < MAX_BUFFERS; ++i) {
        ggml_cuda_buffer & b = buffer_pool[i];
        if (b.ptr != nullptr) {
            CUDA_CHECK(cudaFree(b.ptr));
            pool_size -= b.size;
            b.ptr = nullptr;
            b.size = 0;
        }
    }
}
To ensure consistency, please call clear_pool in the destructor.
size_t look_ahead_size = (size_t) (1.05 * size);
look_ahead_size = 256 * ((look_ahead_size + 255)/256);
ggml_cuda_set_device(device);
#if defined(GGML_USE_MUSA)
If at all possible, please just add the missing defines to the MUSA header rather than having per-vendor logic.
On HIP there is no "may", so you would be safe there; doing an explicit synchronization would not hurt, of course.
Thanks everyone for the amazing clarifications! I really appreciate the guidance. I will apply all the suggested fixes (adding explicit sync, updating the destructor, and cleaning up the MUSA logic) and push an update tonight.
Thank you for correcting me; it seems I suffered from confirmation bias and should have been more careful. The way I had remembered it, the synchronization was guaranteed: that is what Gemini said when I asked "Does cudaFree imply a device synchronization?", and that is what I read when I quickly checked the CUDA documentation it linked. I agree that there should be an explicit synchronization to be safe.
AFAIK, there is always a sync with […]
Address review comments: add explicit sync, update destructor, clean up MUSA macros Signed-off-by: 梁厚宏 <2695316095@qq.com>
Thanks for the feedback; I pushed an update. This revision adds an explicit synchronization, updates the destructor, and cleans up the MUSA macros.
I also reran […]
On HIP without VMM, the legacy pool retains these at peak size, causing quantized KV to OOM before f16. ggml_cuda_direct_alloc<T> uses raw hipMalloc/hipFree instead. HIP-only; complements ggml-org#22155. Fixes ggml-org#22107 without performance degradation. Tested: gfx1100, gfx1200, gfx1201.
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
This adds a conservative fallback for the legacy CUDA/HIP pool allocator.
On non-VMM setups, the legacy pool can end up holding cached free buffers that are individually too small for a new request, but still occupy enough VRAM to make the next allocation fail. In that case, this patch flushes the cached legacy-pool buffers and retries the allocation once before aborting.
The normal hit path is unchanged. This is intended as a narrow mitigation for legacy-pool OOMs, not a broader allocator redesign. I validated the retry path locally with a synthetic OOM injection on a legacy-pool build.
This is intended to mitigate the legacy-pool OOM behavior reported in #22075 and #22107.