
test-backend-ops: allow loading tests from file and parsing model operators into file#19896

Merged
0cc4m merged 18 commits into master from 0cc4m/test-backend-ops-model-load on Mar 12, 2026

Conversation

0cc4m (Contributor) commented Feb 25, 2026

When working on backends, I often run into the problem that a specific operator inside a model fails. I have to track that down, add a test, and then figure out a fix. This change is meant to make that easier by letting you extract operators from a model file in a form that test-backend-ops can run directly.

I first wanted to allow test-backend-ops to parse the model directly, but since test-backend-ops is GGML-specific, I didn't find a good way to do that. Instead, I added the tool llama-export-graph-ops, which loads a model, parses the pp/tg cgraphs, and puts the operators into a JSON file that test-backend-ops can load. Let me know if there's a better way to do this that I missed; I had to add llama_graph_reserve to the public API to avoid using internal API headers.

A "generic operator" test also didn't fit as neatly into test-backend-ops as I had hoped; I tried to fold the special error thresholds and initialization functions into it in the least intrusive way. Let me know if there's a better way to handle this and the graph extraction that I didn't see.

I plan to expand llama-export-graph-ops to allow other sources for tests, for example HF metadata (#19796) could be useful to avoid downloading a model if a backend issue has been reported with it.

Model-specific test-backend-ops runs may also be useful for identifying operators that perform worse than expected inside models, or for comparing operator performance across backends.

Example for Qwen3 4B Q8_0:
> build/test-backend-ops --test-json qwen3_4b_q8_0.json
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(tm) Graphics (MTL)
  Device memory: 47790 MB (41065 MB free)

  ADD(name=ffn_inp-0,type=f32,ne=[2560,1,1,1],op_params=[],sources=f32[2560,1,1,1],f32[2560,1,1,1]): OK
  ADD(name=ffn_inp-0,type=f32,ne=[2560,512,1,1],op_params=[],sources=f32[2560,512,1,1],f32[2560,512,1,1]): OK
  MUL(name=Kcur_normed-0,type=f32,ne=[128,8,1,1],op_params=[],sources=f32[128,8,1,1],f32[128,1,1,1]): OK
  MUL(name=Kcur_normed-0,type=f32,ne=[128,8,512,1],op_params=[],sources=f32[128,8,512,1],f32[128,1,1,1]): OK
  MUL(name=Qcur_normed-0,type=f32,ne=[128,32,1,1],op_params=[],sources=f32[128,32,1,1],f32[128,1,1,1]): OK
  MUL(name=Qcur_normed-0,type=f32,ne=[128,32,512,1],op_params=[],sources=f32[128,32,512,1],f32[128,1,1,1]): OK
  MUL(name=attn_norm-0,type=f32,ne=[2560,1,1,1],op_params=[],sources=f32[2560,1,1,1],f32[2560,1,1,1]): OK
  MUL(name=attn_norm-0,type=f32,ne=[2560,512,1,1],op_params=[],sources=f32[2560,512,1,1],f32[2560,1,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,8,1,1],op_params=[0:897988541],sources=f32[128,8,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,8,512,1],op_params=[0:897988541],sources=f32[128,8,512,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,32,1,1],op_params=[0:897988541],sources=f32[128,32,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[128,32,512,1],op_params=[0:897988541],sources=f32[128,32,512,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[2560,1,1,1],op_params=[0:897988541],sources=f32[2560,1,1,1]): OK
  RMS_NORM(name=norm-0,type=f32,ne=[2560,512,1,1],op_params=[0:897988541],sources=f32[2560,512,1,1]): OK
  MUL_MAT(name=Vcur-0,type=f32,ne=[1024,1,1,1],op_params=[],sources=q8_0[2560,1024,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=Vcur-0,type=f32,ne=[1024,512,1,1],op_params=[],sources=q8_0[2560,1024,1,1],f32[2560,512,1,1]): OK
  MUL_MAT(name=node_28,type=f32,ne=[2560,1,1,1],op_params=[],sources=q8_0[4096,2560,1,1],f32[4096,1,1,1]): OK
  MUL_MAT(name=ffn_out-0,type=f32,ne=[2560,1,1,1],op_params=[],sources=q8_0[9728,2560,1,1],f32[9728,1,1,1]): OK
  MUL_MAT(name=node_28,type=f32,ne=[2560,512,1,1],op_params=[],sources=q8_0[4096,2560,1,1],f32[4096,512,1,1]): OK
  MUL_MAT(name=ffn_out-0,type=f32,ne=[2560,512,1,1],op_params=[],sources=q8_0[9728,2560,1,1],f32[9728,512,1,1]): OK
  MUL_MAT(name=Qcur-0,type=f32,ne=[4096,1,1,1],op_params=[],sources=q8_0[2560,4096,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=Qcur-0,type=f32,ne=[4096,512,1,1],op_params=[],sources=q8_0[2560,4096,1,1],f32[2560,512,1,1]): OK
  MUL_MAT(name=ffn_gate-0,type=f32,ne=[9728,1,1,1],op_params=[],sources=q8_0[2560,9728,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=ffn_gate-0,type=f32,ne=[9728,512,1,1],op_params=[],sources=q8_0[2560,9728,1,1],f32[2560,512,1,1]): OK
  MUL_MAT(name=result_output,type=f32,ne=[151936,1,1,1],op_params=[],sources=q8_0[2560,151936,1,1],f32[2560,1,1,1]): OK
  MUL_MAT(name=result_output,type=f32,ne=[151936,512,1,1],op_params=[],sources=q8_0[2560,151936,1,1],f32[2560,512,1,1]): OK
  CPY(name= (copy),type=f16,ne=[4096,1,1,1],op_params=[],sources=f32[4096,1,1,1],f16[4096,1,1,1]): OK
  CPY(name= (copy),type=f16,ne=[4096,512,1,1],op_params=[],sources=f32[4096,512,1,1],f16[4096,512,1,1]): OK
  GET_ROWS(name=node_1254,type=f32,ne=[2560,1,1,1],op_params=[],sources=f32[2560,1,1,1],i32[1,1,1,1]): OK
  GET_ROWS(name=embd,type=f32,ne=[2560,1,1,1],op_params=[],sources=q8_0[2560,151936,1,1],i32[1,1,1,1]): OK
  GET_ROWS(name=node_1254,type=f32,ne=[2560,512,1,1],op_params=[],sources=f32[2560,512,1,1],i32[512,1,1,1]): OK
  GET_ROWS(name=embd,type=f32,ne=[2560,512,1,1],op_params=[],sources=q8_0[2560,151936,1,1],i32[512,1,1,1]): OK
  SET_ROWS(name=cache_k_l0 (view),type=f16,ne=[1024,4096,1,1],op_params=[],sources=f32[1024,1,1,1],i64[1,1,1,1],f16[1024,4096,1,1]): OK
  SET_ROWS(name=cache_k_l0 (view),type=f16,ne=[1024,4096,1,1],op_params=[],sources=f32[1024,512,1,1],i64[512,1,1,1],f16[1024,4096,1,1]): OK
  ROPE(name=Kcur-0,type=f32,ne=[128,8,1,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,8,1,1],i32[1,1,1,1]): OK
  ROPE(name=Kcur-0,type=f32,ne=[128,8,512,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,8,512,1],i32[512,1,1,1]): OK
  ROPE(name=Qcur-0,type=f32,ne=[128,32,1,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,32,1,1],i32[1,1,1,1]): OK
  ROPE(name=Qcur-0,type=f32,ne=[128,32,512,1],op_params=[1:128,2:2,4:262144,5:1251513984,6:1065353216,8:1065353216,9:1107296256,10:1065353216],sources=f32[128,32,512,1],i32[512,1,1,1]): OK
  FLASH_ATTN_EXT(name=__fattn__-0,type=f32,ne=[128,32,1,1],op_params=[0:1035273459,3:10],sources=f32[128,1,32,1]nb[4,16384,512,16384],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[4096,1,1,1]): OK
  FLASH_ATTN_EXT(name=__fattn__-0,type=f32,ne=[128,32,512,1],op_params=[0:1035273459,3:10],sources=f32[128,512,32,1]nb[4,16384,512,8388608],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[128,4096,8,1]nb[2,2048,256,8388608],f16[4096,512,1,1]): OK
  SWIGLU(name=ffn_swiglu-0,type=f32,ne=[9728,1,1,1],op_params=[0:2],sources=f32[9728,1,1,1],f32[9728,1,1,1]): OK
  SWIGLU(name=ffn_swiglu-0,type=f32,ne=[9728,512,1,1],op_params=[0:2],sources=f32[9728,512,1,1],f32[9728,512,1,1]): OK
  42/42 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Claude Code was used to assist, but I wrote and tested the code.

@0cc4m 0cc4m requested a review from ggerganov as a code owner February 25, 2026 15:13
github-actions bot added labels testing (Everything test related) and examples on Feb 25, 2026
0cc4m (Contributor, Author) commented Mar 2, 2026

@ggerganov @CISC Could one of you take a look at this?

ggerganov (Member):

I think using JSON here is not really necessary - would prefer to avoid it. Simple ad-hoc data read/write should be OK.

0cc4m (Contributor, Author) commented Mar 2, 2026

You mean a simple binary format? Sure, I can do that.

ggerganov (Member):

Yes, either simple binary, or even text is fine.

0cc4m (Contributor, Author) commented Mar 2, 2026

Alright, switched to a simple text file format.

0cc4m (Contributor, Author) commented Mar 4, 2026

@ggerganov Is it okay like this?

Comment thread include/llama.h Outdated
0cc4m force-pushed the 0cc4m/test-backend-ops-model-load branch from 86c0299 to 201b8e4 on March 9, 2026
ggerganov (Member) left a comment


Yes, this is better.

I'm still wondering if we should mark the new llama_graph_reserve as experimental/unstable in some way? With the upcoming llama.cpp packages (#20042) we should be mindful how we change the public API and minimize the changes. This function seems quite

Sorry, some leftover partial comment from earlier - ignore.

Comment thread src/llama-ext.h Outdated
0cc4m (Contributor, Author) commented Mar 11, 2026

@ggerganov It's failing to build on Windows because llama-ext.h does not export the function with __declspec(dllexport). What's the right fix for this: add LLAMA_API to the function in the ext header, or would that get us back to the issue that the function shouldn't be part of the public API?

ggerganov (Member):

Yes, add LLAMA_API. It's not a big issue, because a third party using libllama would only see the declarations from llama.h and not from llama-ext.h (since it is a private header).

0cc4m force-pushed the 0cc4m/test-backend-ops-model-load branch from 5e469b6 to 062e7b1 on March 11, 2026
0cc4m changed the title from "test-backend-ops: allow loading tests from JSON and parsing model operators into JSON" to "test-backend-ops: allow loading tests from file and parsing model operators into file" on Mar 12, 2026
@0cc4m 0cc4m merged commit 128142f into master Mar 12, 2026
72 of 82 checks passed
@0cc4m 0cc4m deleted the 0cc4m/test-backend-ops-model-load branch March 12, 2026 12:26
vt-alt commented Mar 16, 2026

Is it supposed to be installed as export-graph-ops rather than llama-export-graph-ops, which is how tests/test-backend-ops.cpp refers to it? And if this is a testing helper, why install it at all?

0cc4m (Contributor, Author) commented Mar 16, 2026

I suppose it could be called test-export-graph-ops, but it's not a full test. It's treated like all the other tests; you don't have to install them.

vt-alt commented Mar 16, 2026

Thanks for clarifying. I don't install it myself; it gets installed.

vt-alt commented Mar 16, 2026

Perhaps it's installed because it uses llama_build, unlike the other tests, which use llama_build_and_test.

vt-alt commented Mar 16, 2026

Ah yeah, I've now looked deeper: we (for ALT Linux) delete test binaries matching the test-* pattern, and export-graph-ops is not deleted. So llama_build is unrelated (sorry); only the missing test- prefix matters in our case.

Excuse me, the llama.cpp build process is too complicated to comprehend everything at first glance.

0cc4m (Contributor, Author) commented Mar 16, 2026

Oh, I wasn't aware of that. @ggerganov Should I rename the binary?

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
…rators into file (ggml-org#19896)

* tests: allow loading test-backend-ops tests from json

* add error threshold based on op

* add error when file cannot be read

* add graph operator json extraction tool

* add nb parameter for non-contiguous input tensors

* fix view check

* only use view if non-contiguous/permuted, use C++ random instead of rand()

* replace internal API calls with public llama_graph_reserve call

* reduce test description length

* fix nb[0] not getting set for view

* add name to tests

* fix inplace error

* use text file instead of json

* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/

* fix missing declaration

* use pragma once

* fix indent

* fix Windows build
@0cc4m 0cc4m mentioned this pull request Mar 29, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…rators into file (ggml-org#19896)


Labels

examples, testing (Everything test related)
