Skip to content

Ug/port sweep bench#17

Merged
Thireus merged 207 commits intoThireus:ug/port-sweep-benchfrom
ubergarm:ug/port-sweep-bench
Oct 2, 2025
Merged

Ug/port sweep bench#17
Thireus merged 207 commits intoThireus:ug/port-sweep-benchfrom
ubergarm:ug/port-sweep-bench

Conversation

@Thireus
Copy link
Copy Markdown
Owner

@Thireus Thireus commented Sep 27, 2025

Make sure to read the contributing guidelines before submitting a PR

taronaeo and others added 30 commits September 13, 2025 02:39
* metal : run graphs ops concurrently

ggml-ci

* cont : add flags for debugging and disabling concurrency

ggml-ci

* cont : refactor and handle fusing

ggml-ci

* cont : simplify - no need to use GPU address

ggml-ci

* cont : prepare mem ranges for reuse + add ggml-metal-common.cpp

ggml-ci

* cont : avoid redundant keywords in cpp [no ci]

* metal : reorder graph for better concurrency

ggml-ci

* metal : fix race on mem pool buffers

ggml-ci

* cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE

ggml-ci

* cont : refactor, optimize, add comments

ggml-ci

* cont : refactor ggml-metal.m

ggml-ci

* minor : update logs [no ci]
* metal : refactor bin kernels loading

ggml-ci

* metal : refactor rms kernel loading

ggml-ci

* ci : try to add memory leaks check

ggml-ci

* ci : try to enable memory leak detection for Mac

* cont : seems to be working
* llama : allow using iGPUs with --device

* mtmd : allow iGPU

* rpc-server : allow iGPU
…ers (ggml-org#15705)

Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
* vulkan: fix failing dequant shaders

* add missing const
* ggml-zdnn: rm user mapped buffers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm dead code

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to fix missing extra data buffer free

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* doc : update documentation for --tensor-split

* Update tools/main/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update tools/main/README.md

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* releases : update ROCM, add gfx1200, gfx1201, gfx1151

* releases : set target to 13.3 for macos-x64

* add hipblaslt.dll to release

* add hipblaslt/library to release
* metal : fix kernel requirements

ggml-ci

* cont : fix supports_op

* cont : fix supports_op for ARGMAX
)

* build: fix the cache keys for Windows HIP release job

Update the cache keys to include the HIP SDK version, preventing the
use of outdated ROCm installation caches.

* build: sync changes from release.yml to build.yml

- Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2
- Update the cache keys to reflect the new versions

* build: remove Windows HIP release for gfx1151
since the current stable rocWMMA does not support gfx1151.
* vulkan: move mul_mm dequantization steps into a separate file and functions

* improve mul_mm vector load code

* fix debug mode issues and warnings
…adeon RX 9000 series (ggml-org#15994)

* rocm.Dockerfile: added gfx1200,gfx1201 architectures to support  AMD Radeon RX 9000 series

https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os
states the Radeon RX 9000 series is supported support from Ubuntu 24.04.2, and the dockerfile is using 24.04 which is ROCm 6.4.

This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.
* metal : remove mem pool usage

ggml-ci

* metal : remove mem pool implementation

ggml-ci

* metal : take into account the actual allocated memory of the tensor

ggml-ci

* cont : use ggml_backend_buft_get_alloc_size

ggml-ci

* cont : improve, comments

ggml-ci

* cont : add functions for the extra tensor sizes

* metal : add comments

ggml-ci

* metal : implement .get_alloc_size for the rest of the buffer types

ggml-ci

* metal : remove ggml_metal_heap

ggml-ci
* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
* llama-run: Fix model download on Windows
 * fix SSL error (SSL peer certificate or SSH remote key was not OK)
 * fix program crash on std::filesystem::rename

* llama-run: create a separate method to utilize RAII

* llama-run: handle rename exception
* SYCL: Add COUNT_EQUAL operator support (rebased on master)

* SYCL: remove duplicate op_count_equal definition

* tests: remove test_count_equal_typed and use test_count_equal for all cases

* tests: keep only I32 case for COUNT_EQUAL as suggested

* tests: keep only I32 case for COUNT_EQUAL as requested
)

* releases : switch to rocWMMA develop branch, add gfx1151

* remove unused variable ROCM_VERSION
…g#15956)

* fix im2col_3d to respect non-contiguous inputs (views)

The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides. 

This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.

* use ggml_element_size() for src strides

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ci : update macos-latest* jobs to use macos-latest

This commit updates the jobs that are named macos-latest* to use the
macos-latest label instead explicit versions.

The motivation for this is that there is currently a mixuture of
versions in this workflow and there are jobs that are failing because
they require a newer version.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759

* ci : add xcodebuild -downloadPlatform iOS command
clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into
unreadable line breaks inside template declarations, such as:

  template<int D, int ncols, int nwarps, int VKQ_stride,
           typename KQ_acc_t, bool use_logit_softcap>
      __launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1)

This change adjusts formatting rules so that CUDA macros remain consistent
and aligned with the surrounding template syntax.
…6010)

This commit updates the github workflows build.yml file to include steps
for uploading and downloading the xcframework artifact. The
macos-latest-swift job now depends on the ios-xcode-build job and
downloads the xcframework artifact produced by it.

The motivation for this changes is that it takes a long time to build
the xcframework and we are currently doing this twice in the workflow.
With this change, we only build it once and reuse the artifact.
chaxu01 and others added 25 commits September 30, 2025 10:07
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* metal : dynamic simdgroups for MV kernels

* cont : minor
…-org#16328)

* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…org#16187)

* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
* common : disable progress bar without a tty

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing headers

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* fix ccache key for ubuntu-cpu-cmake

* set it for release as well [no ci]
…gml-org#16359)

* Make a few GLM tensors not required

layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6 resulting in the model not loading after conversion/quantization, this marks those tensors as not required which makes it work

* Update llama-model.cpp

layer.nextn.shared_head_norm also not required in case of future models
…gml-org#16345)

* make ggml_vk_default_dispatcher support older vulkan headers

* simpilfy with using
…16337)

* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
To make it easier to compare performance across forks
This patch is not my own work but taken from this gzip:
ikawrakow/ik_llama.cpp#354

Thanks ikawrakow and saood06 for this!
From ikawrakow/ik_llama.cpp#375

Hardcoded to true to always run to avoid adding more arguments.
Behavior of mainline llama.cpp `-fa` changed and now *requires* an
argument of `on` or `1` it seems to enable flash attenion explicitly.
This diverges from ik_llama.cpp behavior which omitting it is disabled,
however on mainline that means `auto` which means "probably enabled" I
believe.

Details here: ggml-org#15434

This patch just changes all `s/flash_attn/flash_attn_type/g`.
@ubergarm ubergarm force-pushed the ug/port-sweep-bench branch from e42d203 to 0ae5c26 Compare October 1, 2025 14:11
@Thireus Thireus merged commit 710b18d into Thireus:ug/port-sweep-bench Oct 2, 2025
1 check passed
Thireus pushed a commit that referenced this pull request Feb 18, 2026
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* flashattention and matrix multiplication moved to new format

* clean up preprocessing

* Formatting

* remove duplicate constants

* Split large shaders into multiple static strings

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Thireus pushed a commit that referenced this pull request Mar 10, 2026
…better shader parameter handling (ggml-org#20173)

* K quant speedup (#20)

* Basic JIT compilation for mul_mat, get_rows, and scale (#17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* no gibberish, all k quants added, merged

* vec memory fix

* q6_k matching metal on my machine, tests passing

* Set tile size for q6_k separately

* Separate out fast shaders

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

* Move towards writeBuffer for params

* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups

* Remove extra file

* Formatting

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Thireus pushed a commit that referenced this pull request Apr 9, 2026
)

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (#12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (#17)

* meta : formatting, naming, indentation (#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.