cuda: LRU eviction + overalloc for legacy pool #22207
TheTom wants to merge 2 commits into ggml-org:master
Conversation
Hi @TheTom, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
copied from previous PR for clarity:
    size_t evict_lru(size_t target) {
        size_t freed = 0;
        ggml_cuda_set_device(device);
        while (freed < target) {
            int oldest = -1;
            uint64_t oldest_ts = UINT64_MAX;
            for (int i = 0; i < MAX_BUFFERS; ++i) {
                if (buffer_pool[i].ptr != nullptr && buffer_pool[i].last_used < oldest_ts) {
                    oldest_ts = buffer_pool[i].last_used;
                    oldest = i;
                }
            }
            if (oldest < 0) {
                break;
            }
            ggml_cuda_buffer & b = buffer_pool[oldest];
            CUDA_CHECK(cudaFree(b.ptr));
            freed += b.size;
            pool_size -= b.size;
            b.ptr = nullptr;
            b.size = 0;
        }
        return freed;
    }
Inline this function.
    virtual void * alloc(size_t size, size_t * actual_size) = 0;
    virtual void * alloc_oversize(size_t size, size_t * actual_size, double factor) {
        GGML_UNUSED(factor);
        return alloc(size, actual_size);
    }
Instead of adding a new method, add a new argument float lookahead = 1.05f to alloc.
        uint64_t last_used = 0;
    };

    ggml_cuda_buffer buffer_pool[MAX_BUFFERS] = {};
    size_t pool_size = 0;
    uint64_t timestamp = 0;
Suggested change:
-        uint64_t last_used = 0;
+        uint64_t last_use = 0;
     };
     ggml_cuda_buffer buffer_pool[MAX_BUFFERS] = {};
     size_t pool_size = 0;
-    uint64_t timestamp = 0;
+    uint64_t usage_counter = 0;
What you implemented is not a timestamp, but it works either way. You should still change the variable names to avoid confusion.
Force-pushed 6ac5e04 to 4e68d9e
comments addressed, PTAL

Sounds good, let me know what you find. Happy to adjust if needed.

Hey @IMbackK, were you able to repro or get a backtrace for me to investigate?
I have already traced the problem, and it looks like this makes ROCm/rocm-systems#4817, i.e. failures in hipMemcpyAsync with valid source and destination parameters in multi-GPU scenarios, much more common. There is not a whole lot we can do about this since our code is correct. At the same time, pretty much breaking multi-GPU HIP for non-fp16 fattn isn't really an option either.
Thanks for tracking this down. It looks like the root cause of ROCm/rocm-systems#4817 is a pre-existing hipMemcpyAsync host-mapping race on multi-GPU. The LRU eviction changes free/realloc timing, which exposes it more often, but the underlying bug is in the ROCm runtime, as you surmise. Thoughts on how to unblock: I can gate the LRU path behind a single-GPU check on HIP and fall back to clear_pool() for multi-GPU. That way multi-GPU HIP gets the same behavior it had before (no regression), and single-GPU HIP plus all CUDA users get the fix. Want me to push that, or do you have a different approach in mind? I also have some AMD contacts I can ask if needed.
ROCm/rocm-systems#4817: LRU free/realloc cycles amplify a hipMemcpyAsync host-mapping race on multi-GPU setups. Gate the LRU path behind a single-GPU check on HIP and fall back to clear_pool() for multi-GPU. Single-GPU HIP + all CUDA users still get LRU eviction.
Force-pushed f68e40f to a2e25b0
Added an ifdef for multigpu. Open to changes. Let me know.
I don't know if there is a point to doing this. As far as I know the CUDA devices will all use the VMM allocator anyhow; I'm not sure under what circumstances they wouldn't. If this is not an encountered case, then I would recommend just leaving this PR open until I or AMD figure out where the race in rocr/clr is exactly. I get that it's also useful for the single-GPU HIP case, but I'm not sure this justifies having the hack in the code. There is also something to be said for not doing this sort of workaround and instead pushing AMD to fix their shit.
@JohannesGaessler @IMbackK Looking for guidance here. Single-GPU HIP users hit both OOM and prefill slowdown on q8_0 KV at long ctx, confirmed across gfx1100/1200/1201/906 by multiple community testers. I don't have multi-GPU AMD hardware to verify or fix 4817. The current PR is gated to single-GPU HIP only; the multi-GPU path is unchanged. I'm happy to ping my AMD contacts on 4817 in parallel to push their side, but the user-facing fix shouldn't have to wait on ROCm's queue. If the ifdef is the blocker, I'm happy to swap to a runtime device-count check. What shape would you accept here?
Fixes #22107. Per #22193 (comment).
On OOM, evict LRU buffers first. FA temps use 2x overalloc.
Tested on gfx1201, q8_0 @ d40000: 369 t/s (was OOM).