
test-backend-ops: allow loading tests from file and parsing model operators into file#19896

Merged
0cc4m merged 18 commits into master from 0cc4m/test-backend-ops-model-load on Mar 12, 2026

Conversation

0cc4m (Contributor) commented Feb 25, 2026

When working on backends, I often run into the problem that a specific operator inside a model fails. I have to track that down, add a test, and then figure out a fix. This change is meant to make that easier by letting you extract operators from a model file in a form that test-backend-ops can run directly.

I first wanted to allow test-backend-ops to parse the model directly, but since test-backend-ops is GGML-specific, I didn't find a good way to do that. Instead, I added the tool llama-export-graph-ops, which loads a model, parses the pp/tg cgraphs, and puts the operators into a JSON file that test-backend-ops can load. Let me know if there's a better way to do this that I missed; I had to add llama_graph_reserve to the public API to avoid using internal API headers.

A "generic operator" test also didn't fit as neatly into test-backend-ops as I had hoped; I tried to fold the special error thresholds and initialization functions into it in the least intrusive way. Let me know if there's a better way to handle this and the graph extraction that I didn't see.

I plan to expand llama-export-graph-ops to allow other sources for tests, for example HF metadata (#19796) could be useful to avoid downloading a model if a backend issue has been reported with it.

Model-specific test-backend-ops runs may also be useful for identifying operators that perform worse than expected inside models, or for comparing operator performance across backends.

Example for Qwen3 4B Q8_0:
> build/test-backend-ops --test-json qwen3_4b_q8_0.json
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(tm) Graphics (MTL)
  Device memory: 47790 MB (41065 MB free)

  ADD(name=ffn_inp-0,type=f32,ne=[2560,1,1,1],op_params=[],sources=f32[2560,1,1,1],f32[2560,1,1,1]): OK
  ADD(name=ffn_inp-0,type=f32,ne=[2560,512,1,1],op_params=[],sources=f32[2560,512,1,1],f32[2560,512,1,1]): OK
  MUL(name=Kcur_normed-0,type=f32,ne=[128,8,1,1],op_params=[],sources=f32[128,8,1,1],f32[128,1,1,1]): OK
  MUL(name=Kcur_normed-0,type=f32,ne=[128,8,512,1],op_params=[],sources=f32[128,8,512,1],f32[128,1,1,1]): OK
  MUL(name=Qcur_normed-0,type=f32,ne=[128,32,1,1],op_params=[],sources=f32[128,32,1,1],f32[128,1,1,1]): OK
  MUL(name=Qcur_normed-0,type=f32,ne=[128,32,512,1],op_params=[],sources=f32[128,32,512,1],f32[128,1,1,1]): OK
  MUL(name=attn_norm-0,type=f32,ne=[2560,1,1,1],op_params=[],sources=f32[2560,1,1,1],f32[2560,1,1,1]): OK
  MUL(name=attn_norm-0,type=f32,ne=[2560,512,1,1],op_params=[],sources=f32[2560,512,1,1],f32[2560,1,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,8,1,1],op_params=[0:897988541],sources=f32[128,8,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,8,512,1],op_params=[0:897988541],sources=f32[128,8,512,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,32,1,1],op_params=[0:897988541],sources=f32[128,32,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,32,512,1],op_params=[0:897988541],sources=f32[128,32,512,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[2560,1,1,1],op_params=[0:897988541],sources=f32[2560,1,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[2560,512,1,1],op_params=[0:897988541],sources=f32[2560,512,1,1]): OK
  MUL_MAT(name=Vcur-0,type=f32,ne=[1024,1,1,1],op_params=[],sources=q8_0[2560,1024,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=Vcur-0,type=f32,ne=[1024,512,1,1],op_params=[],sources=q8_0[2560,1024,1,1],f32[2560,512,1,1]): OK
  MUL_MAT(name=node_28,type=f32,ne=[2560,1,1,1],op_params=[],sources=q8_0[4096,2560,1,1],f32[4096,1,1,1]): OK
  MUL_MAT(name=ffn_out-0,type=f32,ne=[2560,1,1,1],op_params=[],sources=q8_0[9728,2560,1,1],f32[9728,1,1,1]): OK
  MUL_MAT(name=node_28,type=f32,ne=[2560,512,1,1],op_params=[],sources=q8_0[4096,2560,1,1],f32[4096,512,1,1]): OK
  MUL_MAT(name=ffn_out-0,type=f32,ne=[2560,512,1,1],op_params=[],sources=q8_0[9728,2560,1,1],f32[9728,512,1,1]): OK
  MUL_MAT(name=Qcur-0,type=f32,ne=[4096,1,1,1],op_params=[],sources=q8_0[2560,4096,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=Qcur-0,type=f32,ne=[4096,512,1,1],op_params=[],sources=q8_0[2560,4096,1,1],f32[2560,512,1,1]): OK
  MUL_MAT(name=ffn_gate-0,type=f32,ne=[9728,1,1,1],op_params=[],sources=q8_0[2560,9728,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=ffn_gate-0,type=f32,ne=[9728,512,1,1],op_params=[],sources=q8_0[2560,9728,1,1],f32[2560,512,1,1]): OK
  MUL_MAT(name=result_output,type=f32,ne=[151936,1,1,1],op_params=[],sources=q8_0[2560,151936,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=result_output,type=f32,ne=[151936,512,1,1],op_params=[],sources=q8_0[2560,151936,1,1],f32[2560,512,1,1]): OK
  CPY(name= (copy),type=f16,ne=[4096,1,1,1],op_params=[],sources=f32[4096,1,1,1],f16[4096,1,1,1]): OK
  CPY(name= (copy),type=f16,ne=[4096,512,1,1],op_params=[],sources=f32[4096,512,1,1],f16[4096,512,1,1]): OK
  GET_ROWS(name=node_1254,type=f32,ne=[2560,1,1,1],op_params=[],sources=f32[2560,1,1,1],i32[1,1,1,1]): OK
  GET_ROWS(name=embd,type=f32,ne=[2560,1,1,1],op_params=[],sources=q8_0[2560,151936,1,1],i32[1,1,1,1]): OK
  GET_ROWS(name=node_1254,type=f32,ne=[2560,512,1,1],op_params=[],sources=f32[2560,512,1,1],i32[512,1,1,1]): OK
  GET_ROWS(name=embd,type=f32,ne=[2560,512,1,1],op_params=[],sources=q8_0[2560,151936,1,1],i32[512,1,1,1]): OK
  SET_ROWS(name=cache_k_l0 (view),type=f16,ne=[1024,4096,1,1],op_params=[],sources=f32[1024,1,1,1],i64[1,1,1,1],f16[1024,4096,1,1]): OK
  SET_ROWS(name=cache_k_l0 (view),type=f16,ne=[1024,4096,1,1],op_params=[],sources=f32[1024,512,1,1],i64[512,1,1,1],f16[1024,4096,1,1]): OK
  ROPE(name=Kcur-0,type=f32,ne=[128,8,1,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,8,1,1],i32[1,1,1,1]): OK
  ROPE(name=Kcur-0,type=f32,ne=[128,8,512,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,8,512,1],i32[512,1,1,1]): OK
  ROPE(name=Qcur-0,type=f32,ne=[128,32,1,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,32,1,1],i32[1,1,1,1]): OK
  ROPE(name=Qcur-0,type=f32,ne=[128,32,512,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,32,512,1],i32[512,1,1,1]): OK
  FLASH_ATTN_EXT(name=__fattn__-0,type=f32,ne=[128,32,1,1],op_params=[0:1035273459,3:10],sources=f32[128,1,32,1]nb[4,16384,512,16384],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[4096,1,1,1]): OK
  FLASH_ATTN_EXT(name=__fattn__-0,type=f32,ne=[128,32,512,1],op_params=[0:1035273459,3:10],sources=f32[128,512,32,1]nb[4,16384,512,8388608],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[4096,512,1,1]): OK
  SWIGLU(name=ffn_swiglu-0,type=f32,ne=[9728,1,1,1],op_params=[0:2],sources=f32[9728,1,1,1],f32[9728,1,1,1]): OK
  SWIGLU(name=ffn_swiglu-0,type=f32,ne=[9728,512,1,1],op_params=[0:2],sources=f32[9728,512,1,1],f32[9728,512,1,1]): OK
  42/42 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Claude Code was used to assist, but I wrote and tested the code.

@0cc4m 0cc4m requested a review from ggerganov as a code owner February 25, 2026 15:13
github-actions bot added labels testing (Everything test related) and examples on Feb 25, 2026
0cc4m (Contributor, Author) commented Mar 2, 2026

@ggerganov @CISC Could one of you take a look at this?

ggerganov (Member):

I think using JSON here is not really necessary - would prefer to avoid it. Simple ad-hoc data read/write should be OK.

0cc4m (Contributor, Author) commented Mar 2, 2026

You mean a simple binary format? Sure, I can do that.

ggerganov (Member):

Yes, either simple binary, or even text is fine.

0cc4m (Contributor, Author) commented Mar 2, 2026

Alright, switched to a simple text file format.

0cc4m (Contributor, Author) commented Mar 4, 2026

@ggerganov Is it okay like this?

Comment thread include/llama.h Outdated
0cc4m force-pushed the 0cc4m/test-backend-ops-model-load branch from 86c0299 to 201b8e4 on March 9, 2026
ggerganov (Member) left a comment


Yes, this is better.

I'm still wondering if we should mark the new llama_graph_reserve as experimental/unstable in some way? With the upcoming llama.cpp packages (#20042) we should be mindful how we change the public API and minimize the changes. This function seems quite

Sorry, some leftover partial comment from earlier - ignore.

Comment thread src/llama-ext.h Outdated
0cc4m (Contributor, Author) commented Mar 11, 2026

@ggerganov It's failing to build on Windows because llama-ext.h does not export the function with __declspec(dllexport). What's the right fix for this: add LLAMA_API to the function in the ext header, or would that get us back to the issue that the function shouldn't be part of the public API?

ggerganov (Member):

Yes, add LLAMA_API. It's not a big issue, because a third party using libllama would only see the declarations from llama.h and not from llama-ext.h (since it is a private header).

0cc4m force-pushed the 0cc4m/test-backend-ops-model-load branch from 5e469b6 to 062e7b1 on March 11, 2026
0cc4m changed the title from "test-backend-ops: allow loading tests from JSON and parsing model operators into JSON" to "test-backend-ops: allow loading tests from file and parsing model operators into file" on Mar 12, 2026
@0cc4m 0cc4m merged commit 128142f into master Mar 12, 2026
72 of 82 checks passed
@0cc4m 0cc4m deleted the 0cc4m/test-backend-ops-model-load branch March 12, 2026 12:26
vt-alt commented Mar 16, 2026

Is it supposed to be installed as export-graph-ops rather than llama-export-graph-ops, which is how tests/test-backend-ops.cpp refers to it? And if this is a testing helper, why install it at all?

0cc4m (Contributor, Author) commented Mar 16, 2026

I suppose it could be called test-export-graph-ops, but it's not a full test. It's treated like all the other tests; you don't have to install them.

vt-alt commented Mar 16, 2026

Thanks for clarifying. I don't install it myself; it gets installed.

vt-alt commented Mar 16, 2026

Perhaps it's installed because it uses llama_build, unlike the other tests, which use llama_build_and_test.

vt-alt commented Mar 16, 2026

Ah yeah, I've now looked deeper: we (for ALT Linux) delete test binaries matching the test-* pattern, and export-graph-ops is not deleted. So llama_build is unrelated (sorry); only the missing test- prefix matters in our case.

Excuse me, the llama.cpp build process is too complicated to comprehend everything at first glance.

0cc4m (Contributor, Author) commented Mar 16, 2026

Oh, I wasn't aware of that. @ggerganov Should I rename the binary?

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
…rators into file (ggml-org#19896)

* tests: allow loading test-backend-ops tests from json

* add error threshold based on op

* add error when file cannot be read

* add graph operator json extraction tool

* add nb parameter for non-contiguous input tensors

* fix view check

* only use view if non-contiguous/permuted, use C++ random instead of rand()

* replace internal API calls with public llama_graph_reserve call

* reduce test description length

* fix nb[0] not getting set for view

* add name to tests

* fix inplace error

* use text file instead of json

* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/

* fix missing declaration

* use pragma once

* fix indent

* fix Windows build
@0cc4m 0cc4m mentioned this pull request Mar 29, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…rators into file (ggml-org#19896)


Labels

examples, testing (Everything test related)
