Latest updates by geoffmunn · Pull Request #8 · geoffmunn/llama.cpp

geoffmunn · 2025-12-14T09:19:36Z


…only) (ggml-org#17494)

* Enabled q4_K_4x8 path
* Fixed generic Q4_K 8x4 implementation
* wip: dotprod gemm
* Working arm q4_K dotprod gemm
* Undo acc rename
* Q4_K arm dotprod gemm
* Fix: q4_qs reinterpret from uint to int
* Removed comments
* Fixed macro guards
* Fixed unused vars in generic implementation
* Fixed unused vars in 8x4 repack
* Fixed unused vars in generic implementation, unneeded comment
* Missing arch fallback for x86
* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


* enable mmf for rdna4
* move some mmvf to mmf
* revert lds128 for wmma loading
* Revert "revert lds128 for wmma loading" This reverts commit db9ae8b.
* Revert "enable mmf for rdna4" This reverts commit 698c9f2.
* Revert "move some mmvf to mmf" This reverts commit 99b92bd.
* enable mul_mat for rdna4

---------

Co-authored-by: zhang hui <you@example.com>

Store the last computed graph and reuse it when possible. Also, do not return a response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves a network round trip to the server.
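The idea in this commit message can be illustrated with a toy model. The sketch below is not the ggml RPC protocol: the class names, the `graph_recompute` command, and the one-byte cost of that command are all hypothetical, and it assumes the cached graph lives on the server side. It only shows how reusing the last graph and not waiting for a reply cuts down on traffic.

```python
class ToyServer:
    """Hypothetical stand-in for an RPC server (not the real ggml RPC server)."""

    def __init__(self) -> None:
        self.cached_graph = None   # last graph that was sent in full
        self.bytes_received = 0    # rough measure of network traffic
        self.runs = 0              # number of graph evaluations

    def graph_compute(self, graph: bytes) -> None:
        # Full graph upload. Nothing is returned: the client assumes success.
        self.bytes_received += len(graph)
        self.cached_graph = graph
        self.runs += 1

    def graph_recompute(self) -> None:
        # Hypothetical short command: re-run the cached graph.
        self.bytes_received += 1
        if self.cached_graph is None:
            # On failure the server drops the connection instead of replying.
            raise ConnectionError("no cached graph; closing connection")
        self.runs += 1


class ToyClient:
    """Sends the full graph only when it differs from the previous one."""

    def __init__(self, server: ToyServer) -> None:
        self.server = server
        self.last_graph = None

    def compute(self, graph: bytes) -> None:
        if graph == self.last_graph:
            self.server.graph_recompute()
        else:
            self.server.graph_compute(graph)
            self.last_graph = graph


server = ToyServer()
client = ToyClient(server)
graph = b"x" * 1000            # placeholder for a serialized graph
for _ in range(3):
    client.compute(graph)
print(server.runs, server.bytes_received)
```

In this toy model the graph is uploaded once and the two later evaluations each cost a single byte. Because neither call returns a value, the client never blocks on a reply.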


* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
* Usual suspects
* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : add Anthropic Messages API support
* remove -@pytest.mark.slow from tool calling/jinja tests
* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py
* server : removed redundant n field logic in anthropic_params_from_json
* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
* server : refactor Anthropic API to use OAI conversion
* make sure basic test always go first
* clean up
* clean up api key check, add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>


* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1
* Update ggml/src/ggml-cuda/fattn-vec.cuh
* Update ggml/src/ggml-cuda/fattn-vec.cuh
* Update ggml/src/ggml-cuda/fattn-tile.cuh
* Address review comments

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

…gml_backend_sched (ggml-org#17276)

* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched. Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


…etadata) (ggml-org#17553)

gguf_new_metadata.py reads data through the reader, which does not byteswap tensors to native endianness. The writer, however, expects tensors in native endianness so that it can convert them to the requested endianness. There are two ways to fix this: make the reader convert to native endianness and have the writer convert back, or skip the endianness conversion in the writer for this particular use case. gguf_editor_gui.py doesn't allow editing or viewing tensor data, so this change skips the redundant byteswapping. If the ability to view or edit tensor data is added later, tensor data should instead be byteswapped when it is read.
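The mismatch described above can be reproduced with a few bytes. This is a minimal sketch using only the standard library, not the gguf-py reader or writer: `byteswap32` and the variable names are hypothetical, and the host is modelled as little-endian while the file is big-endian.

```python
import struct


def byteswap32(raw: bytes) -> bytes:
    """Reverse the byte order of every 4-byte element."""
    if len(raw) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))


values = (1.0, 2.0)

# Tensor data as stored in a big-endian file.
file_bytes = struct.pack(">2f", *values)

# The reader hands the bytes back exactly as stored (no conversion to native).
reader_output = file_bytes

# A writer that assumes native (little-endian) input and swaps it to reach
# big-endian corrupts data that was already big-endian.
naive_output = byteswap32(reader_output)

# Skipping the swap in this use case keeps the data intact.
fixed_output = reader_output

print(struct.unpack(">2f", naive_output))
print(struct.unpack(">2f", fixed_output))
```

The alternative fix mentioned in the message would instead swap once in the reader and once more in the writer, which yields the same bytes at the cost of two passes over the data.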


* vulkan: split mul_mmq_funcs for mul_mat_vecq use
* add mxfp4 mmvq
* add q2_k mmvq
* add q3_k mmvq
* add q4_k and q5_k mmvq
* add q6_k mmvq
* handle 4x4 quants per mmvq thread
* enable MUL_MAT_ID mmvq support
* enable subgroup optimizations for mul_mat_vec_id shaders
* device tuning
* request prealloc_y sync after quantization
* fix indentation
* fix llvmpipe test failures
* fix mul_mat_id mmvq condition
* fix unused variable warning


As explained in [1], __func__ inside a lambda expands to "operator()", so the actual debug message looks like:

"res operator(): operator() : queue result stop"

Setting the name explicitly makes the message easier to use for debugging:

"res operator(): recv : queue result stop"

The remaining "operator()" on the left is generated by 'RES_DBG() ... __func__'.

[1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html

Signed-off-by: Haiyue Wang <haiyuewa@163.com>

* git mv
* add server-context.h
* add server-context.h
* clean up headers
* cont : cleanup
* also expose server_response_reader (to be used by CLI)
* fix windows build
* decouple server_routes and server_http

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


… (ggml/1394)

Some backends, such as the Metal backend, depend on CMAKE_RUNTIME_OUTPUT_DIRECTORY to create temporary files. If CMAKE_RUNTIME_OUTPUT_DIRECTORY is missing, CMake can fail with errors such as "permission denied" (it tries to copy a file to the root directory). This PR sets a default path for CMAKE_RUNTIME_OUTPUT_DIRECTORY when it is not defined.


* FlashAttention (#13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, haven't actually ran a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (including xielu) working
* removed unnecessary checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (#8) Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (#10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in some cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness

AlekseiNikiforovIBM and others added 30 commits November 27, 2025 11:35

gguf-py : skip endian-conversion of MXFP4 data (ggml-org#17523)

4fcd87c

* gguf_convert_endian.py: skip MXFP4 data
* Use gguf.constants.GGML_QUANT_SIZES to determine block sizes
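A rough sketch of the approach named in the two bullets above, written against a hypothetical stand-in table rather than the real gguf package. `QUANT_SIZES` mimics the `(block_size, type_size)` shape of `gguf.constants.GGML_QUANT_SIZES`; the three entries, the helper names, and the assumption that an MXFP4 block holds only single-byte fields (so there is nothing to swap) are illustrative and not taken from the script itself.

```python
# Hypothetical table: type name -> (elements per block, bytes per block).
QUANT_SIZES = {
    "F32": (1, 4),      # one 32-bit float
    "Q8_0": (32, 34),   # 2-byte scale followed by 32 one-byte values
    "MXFP4": (32, 17),  # assumed: 1-byte scale followed by 16 packed bytes
}


def swap_field(block: bytes, offset: int, width: int) -> bytes:
    """Reverse the byte order of one multi-byte field inside a block."""
    field = block[offset:offset + width][::-1]
    return block[:offset] + field + block[offset + width:]


def convert_endian(type_name: str, raw: bytes) -> bytes:
    """Byteswap tensor data block by block, using the table for block sizes."""
    _, type_size = QUANT_SIZES[type_name]
    if len(raw) % type_size != 0:
        raise ValueError("data length is not a whole number of blocks")
    if type_name == "MXFP4":
        return raw  # only single-byte fields: skip conversion
    blocks = [raw[i:i + type_size] for i in range(0, len(raw), type_size)]
    if type_name == "F32":
        return b"".join(swap_field(b, 0, 4) for b in blocks)
    if type_name == "Q8_0":
        return b"".join(swap_field(b, 0, 2) for b in blocks)  # scale only
    raise NotImplementedError(type_name)
```

Looking up the block size in one table avoids hard-coding a separate constant for each quantization type.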

devops: Add build-essential to Ubuntu 26.04 image (ggml-org#17531)

d21a76a

This is no longer passing the build, needs more packages. Signed-off-by: Eric Curtin <eric.curtin@docker.com>

cuda : fix UMA detection on discrete GPUs. (ggml-org#17537)

909072a

models : fix LFM2 tensors (ggml-org#17548)

6783b11

arch : add description about LLM_TENSOR_INFOS (ggml-org#17550)

c386114

vulkan: Implement SOLVE_TRI (ggml-org#17486)

4abef75

* vulkan: Implement SOLVE_TRI
* load B matrix through shared memory
* use FLOAT_TYPE

refactor pad_reflect_1d to make the UT case pass (ggml-org#17204)

efaaccd

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>

SOLVE_TRI CUDA kernel for small matrices (ggml-org#17457)

cd0e3a7

vulkan: Implement GGML_OP_TRI (ggml-org#17503)

35cf888

* vulkan: Implement GGML_OP_TRI
* check types match

CUDA: no FP16 arithmetic for vector FA kernel (ggml-org#17558)

73955f7

ggml-cuda: add stricter checking for fusion (ggml-org#17568)

2e7ef98

* ggml-cuda: make conditions for fusion more explicit
* ggml-cuda: remove size check as std::equal already does it

server: fix: /metrics endpoint returning JSON-escaped Prometheus format (ggml-org#17386)

3ce7a65

* fix: /metrics endpoint returning JSON-escaped Prometheus format
* mod: remove string overload from ok() method
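To show what "JSON-escaped Prometheus format" means in practice, here is a small sketch using only the standard library. The metric name and values are made up, and this is not the server's code: it just contrasts a text body that has been passed through a JSON encoder with the same body sent as plain text.

```python
import json

# Made-up metrics in the Prometheus text exposition format.
metrics = (
    "# HELP requests_total Total number of requests\n"
    "# TYPE requests_total counter\n"
    "requests_total 3\n"
)

# Broken variant: the text is serialized as a JSON string, so the body is
# wrapped in quotes and every newline becomes the two characters '\' and 'n'.
escaped_body = json.dumps(metrics)

# Intended variant: the text is sent as-is, one metric line per line.
plain_body = metrics

print(escaped_body)
print(plain_body)
```

A scraper that expects one sample per line cannot parse the escaped body, since it arrives as a single quoted line.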

common : move all common_chat_parse_* to chat-parser.cpp. (ggml-org#17481)

03914c7

vulkan: improve topk perf for large k, fix overflow in unit tests (ggml-org#17582)

59d8d4e

ggml: replace hwcap with riscv_hwprobe for RVV detection (ggml-org#17567)

f698a79

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

sycl : support to malloc memory on device more than 4GB, update the doc and script (ggml-org#17566)

7d2add5

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

common : fix json schema with '\' in literals (ggml-org#17307)

0874693

* Fix json schema with '\' in literals
* Add "literal string with escapes" test
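The kind of bug named in this commit can be sketched as follows. This is an illustration only, not the converter's actual code: `format_literal` and the `ESCAPES` table are hypothetical, and the sketch assumes the grammar syntax uses double-quoted string literals with backslash escapes.

```python
# Characters that need escaping inside a double-quoted grammar literal.
# The backslash entry is the one a naive implementation tends to miss.
ESCAPES = {
    "\\": "\\\\",
    "\r": "\\r",
    "\n": "\\n",
    '"': '\\"',
}


def format_literal(literal: str) -> str:
    """Quote a literal for a grammar rule, escaping special characters."""
    return '"' + "".join(ESCAPES.get(ch, ch) for ch in literal) + '"'


# A literal containing one backslash, as might come from a schema "const".
print(format_literal("a\\b"))
print(format_literal('say "hi"\n'))
```

Without the backslash entry, a literal such as `a\b` would be emitted unchanged and then read back by the grammar parser as an escape sequence instead of a backslash followed by `b`.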

vulkan : fix FA mask load with bounds check (coopmat2) (ggml-org#17606)

385c3da

cuda : add error checking for cudaMemcpyAsync in argsort (ggml-org#17599)

00425e2

* cuda : add error checking for cudaMemcpyAsync in argsort (ggml-org#12836)
* fix indentation

HerrCai0907 and others added 12 commits December 14, 2025 08:33

ggml : arm repack fix build (whisper/0)

71fdcf0

sync : ggml

0e59224

ggml : arm repack fix build

a63cbaf

models : fix YaRN regression + consolidate logic (ggml-org#18006)

609a2d0

* models : fix YaRN regression + consolidate logic
* cont : fix the fix
* cont : remove header
* cont : add header

model-conversion : cast logits to float32 (ggml-org#18009)

77ad854

vulkan: faster q6_k matmul (ggml-org#17813)

d15d177

* q6_k faster mul mat
* 8 values
* fix comment
* switch to two at a time
* start ci for .glsl files

vulkan: improve mul_mat_vec_iq1_s speed (ggml-org#17874)

4722671

vulkan: Fix data race/hang in scalar/cm1 flash attention (ggml-org#17887)

3238b14

common : refactor common_sampler + grammar logic changes (ggml-org#17937)

254098a

* common : refactor common_sampler + grammar logic changes
* tests : increase max_tokens to get needed response
* batched : fix uninitialized samplers

Merge pull request #7 from ggml-org/master

bc3c5cf

Latest updates

Merge branch 'Q3_HIFI' into master

0e6f3aa

geoffmunn merged commit 9971857 into Q3_HIFI Dec 14, 2025
49 of 167 checks passed

github-actions bot added the documentation, Apple Metal, SYCL, Nvidia GPU, Vulkan, testing, examples, python, ggml, build, devops, script, server, model, Ascend NPU, and OpenCL labels on Dec 14, 2025

geoffmunn merged 229 commits into Q3_HIFI from master

geoffmunn commented Dec 14, 2025


20 participants
