ggml : add fallback to CPU for unsupported ops in scheduler #19884

Closed

angt wants to merge 1 commit into ggml-org:master from angt:ggml-add-fallback-to-cpu-for-unsupported-ops-in-scheduler

Conversation

@angt (Member) commented Feb 25, 2026

I found this while fixing:

llama-server -v -hf unsloth/LFM2.5-VL-1.6B-GGUF:Q8_0

with this PR: #19867

I guess this is something of a fallback for the fallback mechanism, and it might not address the real root cause.

Anyway, with this change, llama-server runs:

...
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
sched_reserve: reserving ...
sched_reserve: max_nodes = 1192
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 4, n_outputs = 4
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  4, n_outputs =    4
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 4, n_seqs = 4, n_outputs = 4
sched_reserve: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  4, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    4, n_seqs =  4, n_outputs =    4
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  4, n_outputs =  512
sched_reserve:        CPU compute buffer size =   395.01 MiB
sched_reserve: graph nodes  = 549
sched_reserve: graph splits = 1
sched_reserve: reserve took 1.47 ms, sched copies = 1
...

and at the end:

que    start_loop: terminate
srv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 3064 =  1169 +    1500 +     395                |
llama_memory_breakdown_print: |   - AMX                |                 1325 =  1325 +       0 +       0                |
~llama_context:        CPU compute buffer size is 395.0137 MiB, matches expectation of 395.0137 MiB

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 25, 2026
@ggerganov (Member)

Not sure this is the proper fix.

> graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 4, n_outputs = 512
> sched_reserve: CPU compute buffer size = 395.01 MiB
> sched_reserve: graph nodes = 549
> ...
> llama_memory_breakdown_print: | - Host | 3064 = 1169 + 1500 + 395 |
> llama_memory_breakdown_print: | - AMX | 1325 = 1325 + 0 + 0 |

These logs indicate that the llama.cpp logic incorrectly decides to create AMX buffers for some of the weights, even though the computation is not supported (hence the 0-size AMX compute buffers). Rather, something in llama-model.cpp / weight_buft_supported() likely needs to be adjusted so that the AMX buffers are not created in the first place.

@angt (Member, Author) commented Feb 25, 2026

Looking into weight_buft_supported(), I think I'll need to pass more context info (like n_parallel or n_seqs) to correctly accept or refuse a backend like AMX (so, in the llama_model struct).

Using --parallel 1 works without any patch.

@ggerganov (Member)

> Looking into weight_buft_supported(), I think I'll need to pass more context info (like n_parallel or n_seqs) to correctly accept or refuse a backend like AMX (so, in the llama_model struct).
>
> Using --parallel 1 works without any patch.

It should be possible to infer these from the tensor and src shapes, I think?

@ggerganov (Member)

> These logs indicate that the llama.cpp logic incorrectly decides to create AMX buffers for some of the weights, even though the computation is not supported (hence the 0-size AMX compute buffers).

So I think this earlier statement was not correct. It seems the compute buffer can remain 0-sized while the backend is still used. At the moment, I am not sure what the proper fix is.

@angt (Member, Author) commented Feb 25, 2026

When it fails (--parallel > 1) we get the following (custom debug logs):

weight_buft_supported: tensor=blk.0.shortconv.in_proj.weight, op=MUL_MAT, buft=AMX
weight_buft_supported: tensor=blk.0.shortconv.in_proj.weight, ne = [2048,6144,1,1]
operator(): tensor blk.0.shortconv.in_proj.weight, ne = [2048,6144,1,1] OK
operator(): tensor , ne = [2048,512,1,1] OK

So we check AMX support with this code, where w->ne[2] == 1 and w->ne[3] == 1:

```cpp
ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
op_tensor = ggml_mul_mat(ctx, w, b);
```

which passes. But when doing the real work later with --parallel > 1:

operator(): tensor blk.0.shortconv.in_proj.weight, ne = [2048,6144,1,1] OK
operator(): tensor model.layers.{}.operator_norm-0 (reshaped), ne = [2048,1,4,1] NOT OK
/home/angt/hf/llama.cpp/ggml/src/ggml-backend.cpp:1163: GGML_ASSERT(*cur_backend_id != -1) failed
operator(): tensor blk.0.shortconv.out_proj.weight, ne = [2048,2048,1,1] OK

we now have w->ne[2] == 4, so the check would also need to cover the reshaped tensors, taking n_parallel into account 🤔
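
For illustration, a probe that matched the batched shape actually seen at runtime (src1 with ne = [2048,1,4,1], i.e. ne[2] == n_seqs) would look roughly like this; a sketch only, not the current llama-model.cpp code:

```cpp
// hypothetical probe using the batched activation shape from the log above:
// src1 has ne[2] == n_seqs (4 here) while the weight has w->ne[2] == 1, so the
// backend must support broadcasting the weight over dim 2 for this mul_mat
ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 1, 4, w->ne[3]);
op_tensor = ggml_mul_mat(ctx, w, b);
```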

@ggerganov (Member)

Got it. It's much simpler to extend the AMX ops to not impose a limit on dims 2 and 3.
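
For context, the restriction being discussed is a shape check in the backend's supports_op(); a simplified sketch of what lifting it means (illustrative, the exact AMX code differs):

```cpp
// illustrative shape check in a backend's supports_op() (not the exact AMX code):
// requiring dims 2 and 3 to be 1 rejects batched mul_mat, which is exactly what
// breaks with n_parallel > 1; extending the op to handle these dims removes the limit
case GGML_OP_MUL_MAT:
    if (op->src[1]->ne[2] != 1 || op->src[1]->ne[3] != 1) {
        return false; // the restriction to be lifted
    }
    return true;
```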

@angt (Member, Author) commented Feb 25, 2026

I was worried, but it turns out all the code was already in place to allow n_parallel to be used for backend selection.

@angt (Member, Author) commented Feb 25, 2026

> Got it. It's much simpler to extend the AMX ops to not impose a limit on dims 2 and 3.

Yes, this is the best solution. I had assumed it would be harder, and I wanted to get this issue fixed.

But the n_parallel PR doesn't seem that hacky and makes sense to me.

@ggerganov (Member)

n_parallel is a context-specific parameter. It's not only simpler, it is also consistent with the logic of libllama. In principle, for a single model we can have multiple contexts, each with a different n_parallel. So it's not OK to impose a restriction based on n_parallel at model-creation time.

@angt (Member, Author) commented Feb 25, 2026

That was my initial thought; I was thinking of maybe using a max value of n_parallel.

OK, so we should not have a backend that uses w->ne in supports_op().

@ggerganov (Member)

> OK, so we should not have a backend that uses w->ne in supports_op().

For example, these usages of ne[0] are completely fine:

```cpp
case GGML_OP_FLASH_ATTN_EXT:
    // for new head sizes, add checks here
    if (op->src[0]->ne[0] != 32 &&
        op->src[0]->ne[0] != 40 &&
        op->src[0]->ne[0] != 48 &&
        op->src[0]->ne[0] != 64 &&
        op->src[0]->ne[0] != 72 &&
        op->src[0]->ne[0] != 80 &&
        op->src[0]->ne[0] != 96 &&
        op->src[0]->ne[0] != 112 &&
        op->src[0]->ne[0] != 128 &&
        op->src[0]->ne[0] != 192 &&
        op->src[0]->ne[0] != 256 &&
        op->src[0]->ne[0] != 576) {
        return false;
    }
    if (op->src[1]->type != op->src[2]->type) {
        return false;
    }
    return has_simdgroup_mm; // TODO: over-restricted for vec-kernels
```

Conditioning on ne[2] and ne[3] for GGML_OP_MUL_MAT is problematic as these dimensions can be different based on context parameters.

Technically, the weight_buft_supported() logic should exercise all potential usages of the weight tensor. For example, we see that this is not enough:

```cpp
case GGML_OP_MUL_MAT:
    {
        ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
        op_tensor = ggml_mul_mat(ctx, w, b);
    } break;
```

It would also have to check support when b->ne[2] and b->ne[3] are multiples of w->ne[2] and w->ne[3]. We can improve this logic separately.
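
For illustration, an extended probe along those lines might look like this (a sketch; the factor 4 stands in for any n_seqs > 1, and this is not real llama-model.cpp code):

```cpp
case GGML_OP_MUL_MAT:
    {
        // existing probe: src1 with the same dims 2 and 3 as the weight
        ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
        op_tensor = ggml_mul_mat(ctx, w, b);

        // hypothetical extra probe: a batched src1 whose dims 2 and 3 are
        // multiples of the weight's dims, as happens when n_seqs > 1;
        // the buft would have to support this op as well
        ggml_tensor * b_batched = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, 4*w->ne[2], w->ne[3]);
        ggml_tensor * op_batched = ggml_mul_mat(ctx, w, b_batched);
    } break;
```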

@angt (Member, Author) commented Feb 25, 2026

True, but I still don't know how we could support a backend like AMX that restricts ne[2] to 1 in the current design.

So far my options are:

  • change ggml-backend to correctly report the issue or fall back to CPU (maybe we need to memcpy the tensor); see the sketch below;
  • use n_parallel or n_seqs in weight_buft_supported().
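
For reference, the first option is roughly what this PR does. A minimal sketch of the idea around the failing assert in ggml-backend.cpp (illustrative; the actual patch and surrounding code differ):

```cpp
// sketch of the CPU-fallback idea (illustrative, not the exact patch):
// instead of asserting when no backend supports the op, assign it to the
// CPU backend, which by convention is the last backend in the scheduler
if (*cur_backend_id == -1) {
    *cur_backend_id = sched->n_backends - 1; // fall back to CPU
}
// previously: GGML_ASSERT(*cur_backend_id != -1);
```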

@angt (Member, Author) commented Feb 25, 2026

I have a batched version for AMX; I'll push it tomorrow after more testing, so let's close this one.

@angt closed this Feb 25, 2026