ggml : add fallback to CPU for unsupported ops in scheduler #19884

Closed

angt wants to merge 1 commit into ggml-org:master from angt:ggml-add-fallback-to-cpu-for-unsupported-ops-in-scheduler

Conversation

@angt (Member) commented Feb 25, 2026

I found this while fixing:

llama-server -v -hf unsloth/LFM2.5-VL-1.6B-GGUF:Q8_0

with this PR: #19867

I guess this is something of a fallback for the fallback mechanism, and it might not address the real root cause.

Anyway, with this change, llama-server runs:

...
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
sched_reserve: reserving ...
sched_reserve: max_nodes = 1192
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 4, n_outputs = 4
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  4, n_outputs =    4
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 4, n_seqs = 4, n_outputs = 4
sched_reserve: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  4, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    4, n_seqs =  4, n_outputs =    4
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  4, n_outputs =  512
sched_reserve:        CPU compute buffer size =   395.01 MiB
sched_reserve: graph nodes  = 549
sched_reserve: graph splits = 1
sched_reserve: reserve took 1.47 ms, sched copies = 1
...

and at the end:

que    start_loop: terminate
srv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 3064 =  1169 +    1500 +     395                |
llama_memory_breakdown_print: |   - AMX                |                 1325 =  1325 +       0 +       0                |
~llama_context:        CPU compute buffer size is 395.0137 MiB, matches expectation of 395.0137 MiB

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 25, 2026
@ggerganov (Member)

Not sure this is the proper fix.

> graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 4, n_outputs = 512
> sched_reserve: CPU compute buffer size = 395.01 MiB
> sched_reserve: graph nodes = 549
> ...
> llama_memory_breakdown_print: | - Host | 3064 = 1169 + 1500 + 395 |
> llama_memory_breakdown_print: | - AMX | 1325 = 1325 + 0 + 0 |

These logs indicate that the llama.cpp logic incorrectly decides to create AMX buffers for some of the weights, even though the computation is not supported (hence the 0-size AMX compute buffers). Rather, something in llama-model.cpp / weight_buft_supported() likely needs to be adjusted so that the AMX buffers are not created in the first place.

@angt (Member, Author) commented Feb 25, 2026

Looking into weight_buft_supported(), I think I'll need to pass more context info (like n_parallel or n_seqs) to correctly accept or refuse a backend like AMX (so, in the llama_model struct).

Using --parallel 1 works without any patch.

@ggerganov (Member)

> Looking into weight_buft_supported(), I think I'll need to pass more context info (like n_parallel or n_seqs) to correctly accept or refuse a backend like AMX (so, in the llama_model struct).
>
> Using --parallel 1 works without any patch.

It should be possible to infer these from the tensor and src shapes, I think?

@ggerganov (Member)

> These logs indicate that the llama.cpp logic incorrectly decides to create AMX buffers for some of the weights, even though the computation is not supported (hence the 0-size AMX compute buffers).

So I think this earlier statement was not correct. It seems the compute buffer can remain 0-sized while the backend is still used. At the moment, I am not sure what the proper fix is.

@angt (Member, Author) commented Feb 25, 2026

When it fails (--parallel > 1) we get the following (custom debug logs):

weight_buft_supported: tensor=blk.0.shortconv.in_proj.weight, op=MUL_MAT, buft=AMX
weight_buft_supported: tensor=blk.0.shortconv.in_proj.weight, ne = [2048,6144,1,1]
operator(): tensor blk.0.shortconv.in_proj.weight, ne = [2048,6144,1,1] OK
operator(): tensor , ne = [2048,512,1,1] OK

So we check AMX support with this code, where w->ne[2] == 1 and w->ne[3] == 1:

```cpp
ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
op_tensor = ggml_mul_mat(ctx, w, b);
```

which passes. But when doing the real work later with --parallel > 1:

operator(): tensor blk.0.shortconv.in_proj.weight, ne = [2048,6144,1,1] OK
operator(): tensor model.layers.{}.operator_norm-0 (reshaped), ne = [2048,1,4,1] NOT OK
/home/angt/hf/llama.cpp/ggml/src/ggml-backend.cpp:1163: GGML_ASSERT(*cur_backend_id != -1) failed
operator(): tensor blk.0.shortconv.out_proj.weight, ne = [2048,2048,1,1] OK

we now have w->ne[2] == 4, so the check would also need to cover the reshaped tensors, taking n_parallel into account 🤔
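
For illustration, a probe that matched the batched shape actually seen at runtime (src1 with ne = [2048,1,4,1], i.e. ne[2] == n_seqs) would look roughly like this; a sketch only, not the current llama-model.cpp code:

```cpp
// hypothetical probe using the batched activation shape from the log above:
// src1 has ne[2] == n_seqs (4 here) while the weight has w->ne[2] == 1, so the
// backend must support broadcasting the weight over dim 2 for this mul_mat
ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 1, 4, w->ne[3]);
op_tensor = ggml_mul_mat(ctx, w, b);
```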

@ggerganov (Member)

Got it. It's much simpler to extend the AMX ops to not impose a limit on dims 2 and 3.
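
For context, the restriction being discussed is a shape check in the backend's supports_op(); a simplified sketch of what lifting it means (illustrative, the exact AMX code differs):

```cpp
// illustrative shape check in a backend's supports_op() (not the exact AMX code):
// requiring dims 2 and 3 to be 1 rejects batched mul_mat, which is exactly what
// breaks with n_parallel > 1; extending the op to handle these dims removes the limit
case GGML_OP_MUL_MAT:
    if (op->src[1]->ne[2] != 1 || op->src[1]->ne[3] != 1) {
        return false; // the restriction to be lifted
    }
    return true;
```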

@angt (Member, Author) commented Feb 25, 2026

I was worried, but it turns out all the code was already in place to allow n_parallel to be used for backend selection.

@angt (Member, Author) commented Feb 25, 2026

> Got it. It's much simpler to extend the AMX ops to not impose a limit on dims 2 and 3.

Yes, this is the best solution. I had assumed it would be harder, and I wanted to get this issue fixed.

But the n_parallel PR doesn't seem that hacky and makes sense to me.

@ggerganov (Member)

n_parallel is a context-specific parameter. It's not only simpler, it is also consistent with the logic of libllama. In principle, for a single model we can have multiple contexts, each with a different n_parallel. So it's not OK to impose a restriction based on n_parallel at model-creation time.

@angt (Member, Author) commented Feb 25, 2026

That was my initial thought; I was thinking of maybe using a max value of n_parallel.

OK, so we should not have a backend that uses w->ne in supports_op().

@ggerganov (Member)

> OK, so we should not have a backend that uses w->ne in supports_op().

For example, these usages of ne[0] are completely fine:

```cpp
case GGML_OP_FLASH_ATTN_EXT:
    // for new head sizes, add checks here
    if (op->src[0]->ne[0] != 32 &&
        op->src[0]->ne[0] != 40 &&
        op->src[0]->ne[0] != 48 &&
        op->src[0]->ne[0] != 64 &&
        op->src[0]->ne[0] != 72 &&
        op->src[0]->ne[0] != 80 &&
        op->src[0]->ne[0] != 96 &&
        op->src[0]->ne[0] != 112 &&
        op->src[0]->ne[0] != 128 &&
        op->src[0]->ne[0] != 192 &&
        op->src[0]->ne[0] != 256 &&
        op->src[0]->ne[0] != 576) {
        return false;
    }
    if (op->src[1]->type != op->src[2]->type) {
        return false;
    }
    return has_simdgroup_mm; // TODO: over-restricted for vec-kernels
```

Conditioning on ne[2] and ne[3] for GGML_OP_MUL_MAT is problematic as these dimensions can be different based on context parameters.

Technically, the weight_buft_supported() logic should exercise all potential usages of the weight tensor. For example, we see that this is not enough:

```cpp
case GGML_OP_MUL_MAT:
    {
        ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
        op_tensor = ggml_mul_mat(ctx, w, b);
    } break;
```

It would also have to check support when b->ne[2] and b->ne[3] are multiples of w->ne[2] and w->ne[3]. We can improve this logic separately.
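
For illustration, an extended probe along those lines might look like this (a sketch; the factor 4 stands in for any n_seqs > 1, and this is not real llama-model.cpp code):

```cpp
case GGML_OP_MUL_MAT:
    {
        // existing probe: src1 with the same dims 2 and 3 as the weight
        ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
        op_tensor = ggml_mul_mat(ctx, w, b);

        // hypothetical extra probe: a batched src1 whose dims 2 and 3 are
        // multiples of the weight's dims, as happens when n_seqs > 1;
        // the buft would have to support this op as well
        ggml_tensor * b_batched = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, 4*w->ne[2], w->ne[3]);
        ggml_tensor * op_batched = ggml_mul_mat(ctx, w, b_batched);
    } break;
```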

@angt (Member, Author) commented Feb 25, 2026

True, but I still don't know how we could support a backend like AMX that restricts ne[2] to 1 in the current design.

So far my options are:

  • change ggml-backend to correctly report the issue or fall back to CPU (maybe we need to memcpy the tensor); see the sketch below;
  • use n_parallel or n_seqs in weight_buft_supported().
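
For reference, the first option is roughly what this PR does. A minimal sketch of the idea around the failing assert in ggml-backend.cpp (illustrative; the actual patch and surrounding code differ):

```cpp
// sketch of the CPU-fallback idea (illustrative, not the exact patch):
// instead of asserting when no backend supports the op, assign it to the
// CPU backend, which by convention is the last backend in the scheduler
if (*cur_backend_id == -1) {
    *cur_backend_id = sched->n_backends - 1; // fall back to CPU
}
// previously: GGML_ASSERT(*cur_backend_id != -1);
```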

@angt (Member, Author) commented Feb 25, 2026

I have a batched version for AMX; I'll push it tomorrow after more testing, so let's close this one.

@angt closed this Feb 25, 2026