
cuda: LRU eviction + overalloc for legacy pool #22207

Open
TheTom wants to merge 2 commits into ggml-org:master from TheTom:experiment/pool-threshold-free

Conversation

@TheTom

@TheTom TheTom commented Apr 21, 2026

Fixes #22107. Per #22193 (comment).

On OOM, evict LRU buffers first. FA temps use 2x overalloc.
Tested on gfx1201, q8_0 @ d40000: 369 t/s (was OOM).
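
For reviewers skimming the description, a minimal sketch of the intended control flow (the helper name try_alloc and the exact call sites are illustrative, not the literal diff; evict_lru is the function added in ggml-cuda.cu below):

// sketch only: allocate from the legacy pool, evicting LRU buffers and retrying
// once if the device is out of memory; FA temp buffers request factor = 2.0
void * pool_alloc_with_eviction(size_t size, size_t * actual_size, double factor) {
    const size_t want = (size_t)(size * factor);
    void * ptr = try_alloc(want, actual_size);   // hypothetical wrapper around cudaMalloc
    if (ptr == nullptr) {
        evict_lru(want);                         // free least-recently-used pool buffers
        ptr = try_alloc(want, actual_size);      // retry after eviction
    }
    return ptr;
}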


@TheTom TheTom requested a review from a team as a code owner April 21, 2026 09:26
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 21, 2026

Hi @TheTom, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@TheTom
Author

TheTom commented Apr 21, 2026

Hi @TheTom, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Copied from the previous PR for clarity:
One is a draft (#21119), one has been waiting on review for 2 weeks (#21452). This is a bug fix for an OOM affecting all HIP users with quantized KV at long context. Happy to prioritize however maintainers prefer.

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
Comment on lines +390 to +413
size_t evict_lru(size_t target) {
    size_t freed = 0;
    ggml_cuda_set_device(device);
    while (freed < target) {
        // find the least-recently-used buffer still held by the pool
        int oldest = -1;
        uint64_t oldest_ts = UINT64_MAX;
        for (int i = 0; i < MAX_BUFFERS; ++i) {
            if (buffer_pool[i].ptr != nullptr && buffer_pool[i].last_used < oldest_ts) {
                oldest_ts = buffer_pool[i].last_used;
                oldest = i;
            }
        }
        if (oldest < 0) {
            break; // pool is empty, nothing left to evict
        }
        // free the buffer and drop it from the pool bookkeeping
        ggml_cuda_buffer & b = buffer_pool[oldest];
        CUDA_CHECK(cudaFree(b.ptr));
        freed += b.size;
        pool_size -= b.size;
        b.ptr = nullptr;
        b.size = 0;
    }
    return freed;
}
Contributor


Inline this function.

Comment thread ggml/src/ggml-cuda/common.cuh Outdated
Comment on lines +1108 to +1112
virtual void * alloc(size_t size, size_t * actual_size) = 0;
// default implementation ignores the over-allocation factor and falls back to a plain alloc
virtual void * alloc_oversize(size_t size, size_t * actual_size, double factor) {
    GGML_UNUSED(factor);
    return alloc(size, actual_size);
}
Contributor


Instead of adding a new method, add new argument float lookahead = 1.05f to alloc.
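
For illustration, the signature being suggested would look roughly like this (a sketch of the proposal, not code from the PR):

// reviewer's proposal: fold the over-allocation factor into alloc() itself,
// with a small default lookahead instead of a separate alloc_oversize() method
virtual void * alloc(size_t size, size_t * actual_size, float lookahead = 1.05f) = 0;

// callers that need extra headroom (e.g. FA temp buffers) would pass a larger factor:
//     void * buf = pool.alloc(nbytes, &actual, 2.0f);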

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
Comment on lines +361 to +366
    uint64_t last_used = 0;
};

ggml_cuda_buffer buffer_pool[MAX_BUFFERS] = {};
size_t pool_size = 0;
uint64_t timestamp = 0;
Contributor


Suggested change
-    uint64_t last_used = 0;
-};
-ggml_cuda_buffer buffer_pool[MAX_BUFFERS] = {};
-size_t pool_size = 0;
-uint64_t timestamp = 0;
+    uint64_t last_use = 0;
+};
+ggml_cuda_buffer buffer_pool[MAX_BUFFERS] = {};
+size_t pool_size = 0;
+uint64_t usage_counter = 0;

What you implemented is not a timestamp, but it would also work. You should change the variable names though to avoid confusion.
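
In other words, the field being renamed records recency by bumping a monotonic counter on every pool hit, roughly like this (a sketch using the suggested names, with the counter passed explicitly for illustration):

// sketch: mark a pool buffer as most recently used
static void touch_buffer(ggml_cuda_buffer & b, uint64_t & usage_counter) {
    b.last_use = ++usage_counter;   // monotonic counter, not a wall-clock timestamp
}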

@TheTom TheTom force-pushed the experiment/pool-threshold-free branch from 6ac5e04 to 4e68d9e Compare April 21, 2026 12:22
@github-actions github-actions Bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 21, 2026
@TheTom
Author

TheTom commented Apr 21, 2026

Comments addressed, PTAL.

Collaborator

@IMbackK IMbackK left a comment


In testing I have found memory access faults caused by this PR. Investigating...

@TheTom
Author

TheTom commented Apr 22, 2026

Sounds good, let me know what you find. Happy to adjust if needed.

@TheTom
Author

TheTom commented Apr 23, 2026

Hey @IMbackK, were you able to repro or get a backtrace for me to investigate?

@IMbackK
Collaborator

IMbackK commented Apr 23, 2026

I have already traced the problem and it looks like this makes ROCm/rocm-systems#4817, i.e. failures in hipMemcpyAsync with valid source and destination parameters in multi-GPU scenarios, much more common.

There is not a whole lot we can do about this since our code is correct. At the same time, pretty much breaking multi-GPU HIP for non-fp16 fattn isn't really an option either.

@TheTom
Author

TheTom commented Apr 23, 2026

I have already traced the problem and it looks like this makes ROCm/rocm-systems#4817, i.e. failures in hipMemcpyAsync with valid source and destination parameters in multi-GPU scenarios, much more common.

There is not a whole lot we can do about this since our code is correct. At the same time, pretty much breaking multi-GPU HIP for non-fp16 fattn isn't really an option either.

Thanks for tracking this down. It looks like the root cause of ROCm/rocm-systems#4817 is a pre-existing hipMemcpyAsync host-mapping race on multi-GPU; the LRU eviction changes free/realloc timing, which exposes it more often, but the underlying bug is in the ROCm runtime, as you surmise.

Thoughts on how to unblock: I can gate the LRU path behind a single-GPU check on HIP and fall back to clear_pool() for multi-GPU. That way multi-GPU HIP keeps the same behavior it had before (no regression) and single-GPU HIP plus all CUDA users get the fix.

Want me to push that, or do you have a different approach in mind? I also have some AMD contacts I can ask if needed.

ROCm/rocm-systems#4817: LRU free/realloc cycles amplify a
hipMemcpyAsync host-mapping race on multi-GPU setups. Gate the
LRU path behind a single-GPU check on HIP and fall back to
clear_pool() for multi-GPU. Single-GPU HIP + all CUDA users
still get LRU eviction.
@TheTom TheTom force-pushed the experiment/pool-threshold-free branch from f68e40f to a2e25b0 Compare April 23, 2026 23:22
@TheTom TheTom requested a review from IMbackK April 23, 2026 23:23
@TheTom
Author

TheTom commented Apr 23, 2026

Added an ifdef for multi-GPU. Open to changes. Let me know.

@IMbackK
Collaborator

IMbackK commented Apr 26, 2026

I don't know if there is a point to doing this. As far as I know the CUDA devices will all use the VMM allocator anyhow; I'm not sure under what circumstances CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED is false on CUDA. Maybe @JohannesGaessler can comment on whether this is in any way a common case.

If this is not an encountered case then I would recommend just leaving this PR open until I or AMD figure out where exactly the race in ROCr/CLR is.

I get that it's also useful for the single-GPU HIP case, but I'm not sure this justifies having the hack in the code. There is also something to be said for not doing this sort of workaround and instead pushing AMD to fix their shit.

@TheTom
Author

TheTom commented Apr 29, 2026

I don't know if there is a point to doing this. As far as I know the CUDA devices will all use the VMM allocator anyhow; I'm not sure under what circumstances CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED is false on CUDA. Maybe @JohannesGaessler can comment on whether this is in any way a common case.

If this is not an encountered case then I would recommend just leaving this PR open until I or AMD figure out where exactly the race in ROCr/CLR is.

I get that it's also useful for the single-GPU HIP case, but I'm not sure this justifies having the hack in the code. There is also something to be said for not doing this sort of workaround and instead pushing AMD to fix their shit.

@JohannesGaessler @IMbackK looking for guidance here.

Single-GPU HIP users hit both OOM and a prefill slowdown with q8_0 KV at long context, confirmed across gfx1100/1200/1201/906 by multiple community testers.

I don't have multi-GPU AMD hardware to verify or fix 4817. The current PR is gated to single-GPU HIP only; the multi-GPU path is unchanged. Happy to ping my AMD contacts on 4817 in parallel to push their side, but the user-facing fix shouldn't have to wait on ROCm's queue.

If the ifdef is the blocker, I'm happy to swap to a runtime device-count check. What shape would you accept here?
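
For concreteness, the runtime variant could be as small as the following (a sketch only; GGML_USE_HIP and CUDA_CHECK come from the existing ggml-cuda code, the helper name is made up):

// hypothetical runtime gate: allow LRU eviction everywhere except multi-GPU HIP,
// where it makes the hipMemcpyAsync race (ROCm/rocm-systems#4817) far more likely
static bool ggml_cuda_pool_lru_enabled() {
#ifdef GGML_USE_HIP
    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));  // hipGetDeviceCount under HIP
    return device_count <= 1;
#else
    return true;  // CUDA builds are unaffected
#endif
}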


Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: [CUDA/ROCm] VRAM leak/fragmentation in ggml_cuda_pool_leg when using Flash Attention

3 participants