Ug/port sweep bench by Thireus · Pull Request #17 · Thireus/llama.cpp

Thireus · 2025-09-27T06:34:52Z

Make sure to read the contributing guidelines before submitting a PR

…d incorrect zTensor free (ggml-org#15839)

ggml-ci

* metal : run graphs ops concurrently ggml-ci * cont : add flags for debugging and disabling concurrency ggml-ci * cont : refactor and handle fusing ggml-ci * cont : simplify - no need to use GPU address ggml-ci * cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci * cont : avoid redundant keywords in cpp [no ci] * metal : reorder graph for better concurrency ggml-ci * metal : fix race on mem pool buffers ggml-ci * cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci * cont : refactor, optimize, add comments ggml-ci * cont : refactor ggml-metal.m ggml-ci * minor : update logs [no ci]

* metal : refactor bin kernels loading ggml-ci * metal : refactor rms kernel loading ggml-ci * ci : try to add memory leaks check ggml-ci * ci : try to enable memory leak detection for Mac * cont : seems to be working

* llama : allow using iGPUs with --device * mtmd : allow iGPU * rpc-server : allow iGPU

…ers (ggml-org#15705) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.

* vulkan: fix failing dequant shaders * add missing const

* ggml-zdnn: rm user mapped buffers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm dead code Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt to fix missing extra data buffer free Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* doc : update documentation for --tensor-split * Update tools/main/README.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update tools/main/README.md Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Diego Devesa <slarengh@gmail.com>

* releases : update ROCM, add gfx1200, gfx1201, gfx1151 * releases : set target to 13.3 for macos-x64 * add hipblaslt.dll to release * add hipblaslt/library to release

Fix regression introduced with commit 50f4281

* metal : fix kernel requirements ggml-ci * cont : fix supports_op * cont : fix supports_op for ARGMAX

) * build: fix the cache keys for Windows HIP release job Update the cache keys to include the HIP SDK version, preventing the use of outdated ROCm installation caches. * build: sync changes from release.yml to build.yml - Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2 - Update the cache keys to reflect the new versions * build: remove Windows HIP release for gfx1151 since the current stable rocWMMA does not support gfx1151.

* vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings

…adeon RX 9000 series (ggml-org#15994) * rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os states the Radeon RX 9000 series is supported support from Ubuntu 24.04.2, and the dockerfile is using 24.04 which is ROCm 6.4. This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.

* metal : remove mem pool usage ggml-ci * metal : remove mem pool implementation ggml-ci * metal : take into account the actual allocated memory of the tensor ggml-ci * cont : use ggml_backend_buft_get_alloc_size ggml-ci * cont : improve, comments ggml-ci * cont : add functions for the extra tensor sizes * metal : add comments ggml-ci * metal : implement .get_alloc_size for the rest of the buffer types ggml-ci * metal : remove ggml_metal_heap ggml-ci

* add grok-2 support * type fix * type fix * type fix * "fix" vocab for invalid sequences * fix expert tensor mapping and spaces in vocab * add chat template * fix norm tensor mapping * rename layer_out_norm to ffn_post_norm * ensure ffn_post_norm is mapped * fix experts merging * remove erroneous FFN_GATE entry * concatenate split tensors and add more metadata * process all expert layers and try cat instead of hstack * add support for community BPE vocab * fix expert feed forward length and ffn_down concat * commit this too * add ffn_up/gate/down, unsure if sequence is right * add ffn_gate/down/up to tensor names * correct residual moe (still not working) * mess-- * fix embedding scale being applied twice * add built in chat template * change beta fast for grok if default value * remove spm vocab in favor of community bpe vocab * change attention temp length metadata type to integer * update attention temp length metadata * remove comment * replace M_SQRT2 with std::sqrt(2) * add yarn metadata, move defaults to hparams

In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).

)

* llama-run: Fix model download on Windows * fix SSL error (SSL peer certificate or SSH remote key was not OK) * fix program crash on std::filesystem::rename * llama-run: create a separate method to utilize RAII * llama-run: handle rename exception

* SYCL: Add COUNT_EQUAL operator support (rebased on master) * SYCL: remove duplicate op_count_equal definition * tests: remove test_count_equal_typed and use test_count_equal for all cases * tests: keep only I32 case for COUNT_EQUAL as suggested * tests: keep only I32 case for COUNT_EQUAL as requested

) * releases : switch to rocWMMA develop branch, add gfx1151 * remove unused variable ROCM_VERSION

…g#15956) * fix im2col_3d to respect non-contiguous inputs (views) The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides. This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged. * use ggml_element_size() for src strides Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ci : update macos-latest* jobs to use macos-latest This commit updates the jobs that are named macos-latest* to use the macos-latest label instead explicit versions. The motivation for this is that there is currently a mixuture of versions in this workflow and there are jobs that are failing because they require a newer version. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759 * ci : add xcodebuild -downloadPlatform iOS command

clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into unreadable line breaks inside template declarations, such as: template<int D, int ncols, int nwarps, int VKQ_stride, typename KQ_acc_t, bool use_logit_softcap> __launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1) This change adjusts formatting rules so that CUDA macros remain consistent and aligned with the surrounding template syntax.

…6010) This commit updates the github workflows build.yml file to include steps for uploading and downloading the xcframework artifact. The macos-latest-swift job now depends on the ios-xcode-build job and downloads the xcframework artifact produced by it. The motivation for this changes is that it takes a long time to build the xcframework and we are currently doing this twice in the workflow. With this change, we only build it once and reuse the artifact.

The JSON parser is temporarily kept only for backward compatibility. It reads the etag from old .json files to prevent unnecessary re-downloads for existing users. This legacy code can be removed in a future version. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* metal : dynamic simdgroups for MV kernels * cont : minor

…-org#16328) * Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs * fix to ensure test-backend-ops check passes

`test-arg-parser.cpp` has been updated to work consistently, regardless of whether CURL or SSL support is available, and now always points to `ggml.ai`. The previous timeout test has been removed, but it can be added back by providing a dedicated URL under `ggml.ai`. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…org#16187) * Work on rope * Simplify inplace operation generation and combine mul/add generation * Work on rope variants * implement neox rope * rope complete * Add sub,div,glu operators * implement scale op * Update cpy shader to handle cont/more types * formatting * Update test vars printing for rope,rms_norm * Avoid ROPE hardcoded constants * Add TODO to change ROPE constants to enum Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix TODO comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output

* common : disable progress bar without a tty Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add missing headers Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* fix ccache key for ubuntu-cpu-cmake * set it for release as well [no ci]

…gml-org#16359) * Make a few GLM tensors not required layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6 resulting in the model not loading after conversion/quantization, this marks those tensors as not required which makes it work * Update llama-model.cpp layer.nextn.shared_head_norm also not required in case of future models

…ml-org#16363)

…gml-org#16345) * make ggml_vk_default_dispatcher support older vulkan headers * simpilfy with using

…16337) * feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output

* feat: Improve code block theming * chore: update webui build output * chore: Update webui static build

To make it easier to compare performance across forks

This is based on saood06's PR ikawrakow/ik_llama.cpp#225

This patch is not my own work but taken from this gzip: ikawrakow/ik_llama.cpp#354 Thanks ikawrakow and saood06 for this!

From ikawrakow/ik_llama.cpp#375 Hardcoded to true to always run to avoid adding more arguments.

Due to changes with `14030`

Behavior of mainline llama.cpp `-fa` changed and now *requires* an argument of `on` or `1` it seems to enable flash attenion explicitly. This diverges from ik_llama.cpp behavior which omitting it is disabled, however on mainline that means `auto` which means "probably enabled" I believe. Details here: ggml-org#15434 This patch just changes all `s/flash_attn/flash_attn_type/g`.

* Basic JIT compilation for mul_mat, get_rows, and scale (#17) * scale jit working * preliminary working jit for getrows and mulmat, needs refining * simplified mul_mat preprocessing switch statement * get_rows fixes, mul_mat refinement * formatted + last edits * removed some extraneous prints * fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish * small fix * some changes, working * get_rows and mul_mat jit fixed and working * Update formatting * formatting * Add header --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on all-encompassing shader library * refactor argmax, set_rows * Refactor all but flashattention, mat mul * flashattention and matrix multiplication moved to new format * clean up preprocessing * Formatting * remove duplicate constants * Split large shaders into multiple static strings --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

…better shader parameter handling (ggml-org#20173) * K quant speedup (#20) * Basic JIT compilation for mul_mat, get_rows, and scale (#17) * scale jit working * preliminary working jit for getrows and mulmat, needs refining * simplified mul_mat preprocessing switch statement * get_rows fixes, mul_mat refinement * formatted + last edits * removed some extraneous prints * fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish * small fix * some changes, working * get_rows and mul_mat jit fixed and working * Update formatting * formatting * Add header --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on all-encompassing shader library * refactor argmax, set_rows * Refactor all but flashattention, mat mul * no gibberish, all k quants added, merged * vec memory fix * q6_k matching metal on my machine, tests passing * Set tile size for q6_k separately * Separate out fast shaders --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> * Move towards writeBuffer for params * Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups * Remove extra file * Formatting --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. It is better in both perf and stability (#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (#17) * meta : formatting, naming, indentation (#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

taronaeo and others added 30 commits September 13, 2025 02:39

ggml-zdnn: fix ggml-org#15414, activate FP16 and BF16 acceleration an…

40be511

…d incorrect zTensor free (ggml-org#15839)

metal : fix memory leaks (ggml-org#15962)

84d7b2f

ggml-ci

metal : refactor kernel loading (ggml-org#15964)

55758b0

* metal : refactor bin kernels loading ggml-ci * metal : refactor rms kernel loading ggml-ci * ci : try to add memory leaks check ggml-ci * ci : try to enable memory leak detection for Mac * cont : seems to be working

llama : allow using iGPUs with --device (ggml-org#15951)

50f4281

* llama : allow using iGPUs with --device * mtmd : allow iGPU * rpc-server : allow iGPU

vulkan: initialize vulkan-hpp to allow using extension function point…

b9c9c9f

…ers (ggml-org#15705) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.

vulkan: fix failing dequant shaders (ggml-org#15862)

aa0c461

* vulkan: fix failing dequant shaders * add missing const

releases : update ROCM, add gfx1200, gfx1201, gfx1151 (ggml-org#15972)

9ecb884

* releases : update ROCM, add gfx1200, gfx1201, gfx1151 * releases : set target to 13.3 for macos-x64 * add hipblaslt.dll to release * add hipblaslt/library to release

rpc : fix regression when --device is used (ggml-org#15981)

918b26f

Fix regression introduced with commit 50f4281

metal : fix kernel requirements (ggml-org#15983)

a14bd35

* metal : fix kernel requirements ggml-ci * cont : fix supports_op * cont : fix supports_op for ARGMAX

Vulkan: Clean up mul_mm shader (ggml-org#15987)

261e6a2

* vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings

server : only attempt to enable thinking if using jinja (ggml-org#15967)

6c019cb

fix KLD percentile output (ggml-org#15999)

a68f31e

In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).

CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (ggml-org#15926

1062205

)

releases : switch to rocWMMA develop branch, add gfx1151 (ggml-org#15992

10d1974

) * releases : switch to rocWMMA develop branch, add gfx1151 * remove unused variable ROCM_VERSION

docker : enable rocWMMA in ROCm images, add gfx1151 (ggml-org#15997)

dc381aa

Add LLaDA-7b-MoE diffusion model (ggml-org#16003)

6d75883

cmake : Do not install tools on iOS targets (ggml-org#15903)

07808eb

chaxu01 and others added 25 commits September 30, 2025 10:07

kleidiai : fix work size and threads sync for fp16 (ggml-org#16246)

f1eb1cb

metal : dynamic simdgroups for MV kernels (ggml-org#16340)

35fb824

* metal : dynamic simdgroups for MV kernels * cont : minor

cuda : Enable CUDA Graph usage for Nemotron Nano v2 (NemotronH) (ggml…

a014310

…-org#16328) * Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs * fix to ensure test-backend-ops check passes

ggml : bump version to 0.9.4 (ggml/1363)

075c015

ci : disable ccache for android (ggml-org#16348)

2df5bcf

opencl: support ne3 in get_rows (ggml-org#15866)

d1c84a6

Chatapi ignore empty sampling (ggml-org#16330)

16b0ca0

* fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output

opencl: support pad_ext (ggml-org#15888)

7c156df

ci : fix ccache key for ubuntu-cpu-cmake (ggml-org#16355)

b2ba81d

* fix ccache key for ubuntu-cpu-cmake * set it for release as well [no ci]

webui: Remove running llama-server within WebUI dev.sh script (gg…

aa9538a

…ml-org#16363)

vulkan: make ggml_vk_default_dispatcher support older vulkan headers (g…

132d673

…gml-org#16345) * make ggml_vk_default_dispatcher support older vulkan headers * simpilfy with using

Add optional setting for showing "Model used:" information (ggml-org#…

4f15759

…16337) * feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output

ci : use registry cache for docker builds (ggml-org#16366)

1104ca1

Improve code block color theming (ggml-org#16325)

2a9b633

* feat: Improve code block theming * chore: update webui build output * chore: Update webui static build

Porting saood06's ik_llama.cpp utility

0e90a21

To make it easier to compare performance across forks

Refactor ik_llama.cpp back to llama.cpp

7fbdeaf

This is based on saood06's PR ikawrakow/ik_llama.cpp#225

Use ikawrakow's adaptation of sweep-bench

e51a326

This patch is not my own work but taken from this gzip: ikawrakow/ik_llama.cpp#354 Thanks ikawrakow and saood06 for this!

Adapt batch warmup to sweep-bench

8ce128b

From ikawrakow/ik_llama.cpp#375 Hardcoded to true to always run to avoid adding more arguments.

Refactor llama_get_memory(ctx) and llama_memory_ API.

e4c71c6

Due to changes with `14030`

ubergarm force-pushed the ug/port-sweep-bench branch from e42d203 to 0ae5c26 Compare October 1, 2025 14:11

Thireus merged commit 710b18d into Thireus:ug/port-sweep-bench Oct 2, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ug/port sweep bench#17

Ug/port sweep bench#17
Thireus merged 207 commits intoThireus:ug/port-sweep-benchfrom
ubergarm:ug/port-sweep-bench

Thireus commented Sep 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

Thireus commented Sep 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants