Remove unprintable characters from vocab list #25
Closed
beiller wants to merge 2 commits into ggml-org:master from
Conversation
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this pull request on Jun 27, 2023

* Added ggml_tensor_printf() to debug tensors. Not sure if all cases work, it was only tested a bit. Example for the ggml_repeat2 dst tensor after it was computed:

  | ggml_compute_forward_repeat2_f32:9497 | node_1233 |
  | Dimensions: 3 | Quantization: f32 | Layer id: 31 | Backend: CPU |
  | Elements: 64 x 2 x 71 | Src0: 64 x 2 x 1 | Src1: 64 x 2 x 71 | Operation: REPEAT2 |
  | Src0 name: node_1232 | Src1 name: leaf_17 |

  Content of src0 "node_1232" (3 dim):
    Layer 0: | -0.019758  0.772589  0.000000 | |  0.772589  0.000000  0.000000 | |  0.000000  0.000000  0.000000 |
    Layer 1: |  0.001423 -1.063233  0.000000 | | -1.063233  0.000000  0.000000 | |  0.000000  0.000000  0.000000 |
    Layer 2: | -0.042461 -0.936166  0.000000 | | -0.936166  0.000000  0.000000 | |  0.000000  0.000000  0.000000 |

  Content of src1 "leaf_17" (3 dim):
    Layers 0-2: all 0.000000

  Content of dst "node_1233" (3 dim):
    Layer 0: | -0.019758 -0.019758 -0.019758 | |  0.772589  0.772589  0.772589 | | -0.019758 -0.019758 -0.019758 |
    Layer 1: |  0.001423  0.001423  0.001423 | | -1.063233 -1.063233 -1.063233 | |  0.001423  0.001423  0.001423 |
    Layer 2: | -0.042461 -0.042461 -0.042461 | | -0.936166 -0.936166 -0.936166 | | -0.042461 -0.042461 -0.042461 |

* typo stride>n_elem - sample print is probably still bugged

* added strides and boolean info flags

---------

Co-authored-by: John <nolife+git@gmail.com>
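For orientation, a minimal sketch of what a tensor-dump helper along these lines could look like against the public ggml API is shown below. Only the name ggml_tensor_printf and the example layout above come from the commit; the function below, its name, and its output format are assumptions for illustration, not the commit's actual code.

```cpp
// Hypothetical sketch, not the commit's implementation: dump basic metadata
// and the first few values of an F32 tensor using only public ggml accessors.
#include "ggml.h"
#include <stdio.h>

static void debug_print_tensor(const struct ggml_tensor * t, const char * label) {
    fprintf(stderr, "=== %s: %s ===\n", label, t->name);
    fprintf(stderr, "type=%s op=%s ne=[%lld, %lld, %lld, %lld]\n",
            ggml_type_name(t->type), ggml_op_name(t->op),
            (long long) t->ne[0], (long long) t->ne[1],
            (long long) t->ne[2], (long long) t->ne[3]);
    if (t->src[0]) fprintf(stderr, "src0=%s\n", t->src[0]->name);
    if (t->src[1]) fprintf(stderr, "src1=%s\n", t->src[1]->name);

    // Only print raw contents for plain F32 tensors; quantized types would
    // need dequantization first.
    if (t->type == GGML_TYPE_F32 && t->data != NULL) {
        const int64_t n = ggml_nelements(t);
        for (int64_t i = 0; i < n && i < 9; i++) {
            fprintf(stderr, "% .6f%s", ggml_get_f32_1d(t, (int) i),
                    (i % 3 == 2) ? "\n" : "  ");
        }
    }
}
```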
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request on Aug 2, 2023
Add information on compiler flags
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request on Sep 30, 2025
QVAC-6093: Stream shards. Fixup for gradle.
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request on Mar 16, 2026
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp on Mar 27, 2026

- Experiment TheTom#25: QJL helps turbo4 by +0.3 PPL. Without QJL, turbo4 ≈ turbo3.
- Experiment #25b: Sign+magnitude encoding is neutral (decode is memory-bound).
- Experiment #25c: Long-context PPL validates turbo3 competitive with q8_0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InfernalDread referenced this pull request in InfernalDread/llama.cpp on Apr 4, 2026

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
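To make the guard change concrete, here is a rough sketch of the kind of check a kernel picker like get_best_fattn_kernel might apply. The enum, the helper names, and the structure are illustrative assumptions, not the fork's actual code.

```cpp
// Hypothetical sketch of a mixed K/V-type guard in a flash-attention kernel
// picker: any pairing drawn from the allowed set gets the fast VEC path,
// everything else falls back to a generic (slower) path.
enum class kv_type { f16, q8_0, turbo2, turbo3, turbo4 };

static bool in_allowed_set(kv_type t) {
    return t == kv_type::q8_0 || t == kv_type::turbo2 || t == kv_type::turbo3;
}

static bool fattn_vec_pair_supported(kv_type type_k, kv_type type_v) {
    // Before the fix, only matching K/V types were accepted, so e.g.
    // turbo3-K with turbo2-V silently fell back to the CPU FA path.
    return in_allowed_set(type_k) && in_allowed_set(type_v);
}
```

A permissive guard like this only works because the commit also instantiates the VEC kernel templates for each allowed (K, V) combination at head sizes 64/128/256; without matching instances, the dispatch would still fail.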
InfernalDread referenced this pull request in InfernalDread/llama.cpp on Apr 4, 2026

…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
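As a concrete illustration of the "16 centroids, nibble-packed" format described above, the sketch below quantizes a block of floats to 4-bit centroid indices and packs two indices per byte. The codebook values, block handling, and function names are made up for illustration; the fork's actual CUDA kernels, and the WHT rotation and per-block scaling they apply first, are not reproduced here.

```cpp
// Hypothetical sketch: map each float to the nearest of 16 centroids,
// pack two 4-bit indices per byte, and dequantize by table lookup.
#include <cstdint>
#include <cstddef>
#include <cmath>

static const float kCentroids[16] = { // illustrative codebook only
    -1.00f, -0.75f, -0.55f, -0.40f, -0.28f, -0.18f, -0.10f, -0.03f,
     0.03f,  0.10f,  0.18f,  0.28f,  0.40f,  0.55f,  0.75f,  1.00f,
};

static uint8_t nearest_centroid(float x) {
    uint8_t best = 0;
    float best_d = std::fabs(x - kCentroids[0]);
    for (uint8_t i = 1; i < 16; i++) {
        const float d = std::fabs(x - kCentroids[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

// n must be even; out holds n/2 bytes (low nibble = even element, high = odd).
static void quantize_block_4bit(const float * in, uint8_t * out, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        const uint8_t lo = nearest_centroid(in[i]);
        const uint8_t hi = nearest_centroid(in[i + 1]);
        out[i / 2] = (uint8_t) (lo | (hi << 4));
    }
}

static void dequantize_block_4bit(const uint8_t * in, float * out, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        out[i]     = kCentroids[in[i / 2] & 0x0F];
        out[i + 1] = kCentroids[in[i / 2] >> 4];
    }
}
```

In the real path, a per-block scale and the WHT rotation mentioned in the commit sit in front of this; the sketch only covers index selection and nibble packing.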
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request on Apr 16, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request on Apr 16, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request on Apr 17, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request on Apr 17, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request on Apr 20, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request on Apr 20, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request on Apr 21, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request on Apr 21, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pestopoppa added a commit to pestopoppa/llama.cpp that referenced this pull request on Apr 27, 2026
…moe)
Enables proper graph-construction-time fusion of the o-projection MUL_MAT
with the inpSA residual ADD, bypassing the post-hoc op-mutation aliasing
bug from Phase 1.
Changes:
- build_attn(llm_graph_input_attn_kv*, ...) gains an optional `residual`
parameter. When safe (no LoRA, no wo_s, no wo_b, F32, shape match),
uses ggml_mul_mat_add_residual so the allocator sees src[2] from the
start and assigns a fresh output slot. Falls back to explicit ADD
otherwise so callers can treat the result as "already residual-added".
- qwen3moe.cpp passes inpSA as the residual (gated by
GGML_FUSE_ATTN_RES=1 while we validate). Skips fusion on the last
layer (inp_out_ids slicing) and when wo_s is set (outer scale would
rescale residual).
- repack.cpp: added apply_residual_chunk and per-chunk fusion inside
forward_mul_mat. This is the critical fix — repacked weights use
their own kernel and don't go through ggml_compute_forward_mul_mat,
so the original Phase 1 code path silently produced garbage when
the weight was in the repack buffer type. Per-chunk fusion needs no
extra barriers (each thread's chunk is disjoint).
- ggml-cpu.c: removed the mm_fused guard that was skipping extra_compute_forward
  for fused ops, since the repack path now handles them.
Correctness (GGML_NUMA_WEIGHTS=1, 48t, NPS4, PPL over 20 chunks of wikitext):
fusion OFF: 10.4006
fusion ON: 10.4006 (bit-exact)
Throughput (48t, NPS4, NUMA_WEIGHTS=1):
fusion OFF: pp128=296.26 tg128=39.41 t/s
fusion ON: pp128=293.92 tg128=39.43 t/s
Gain is within run-to-run noise on decode; fusion primarily saves one
barrier per layer + one allocated tensor. Ship as env-gated so future
work (RMS_NORM+MUL_MAT, attention-internal fusions) has a validated
infrastructure to build on.
Resolves task ggml-org#25. Task ggml-org#22 (CCD pool teardown hang) also resolved —
non-reproducible on 4x24t and 4x48t concurrent benches, apparently
fixed by the earlier cpuset commits (0ade7bd + 69b4c3f).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
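A minimal sketch of the fused matmul-plus-residual idea this commit describes is shown below. build_attn, ggml_mul_mat_add_residual, and the per-chunk fusion in repack.cpp are names taken from the commit message; the standalone function here is an illustrative stand-in using plain float arrays, not the fork's ggml implementation.

```cpp
// Hypothetical sketch: compute dst = W * x + residual in one pass over a
// chunk of output rows, so no separate ADD node (and no extra barrier)
// is needed after the matmul.
#include <cstddef>

// W is n_out x n_in (row-major), x has n_in elements,
// residual and dst have n_out elements.
static void mul_mat_add_residual_chunk(const float * W, const float * x,
                                       const float * residual, float * dst,
                                       size_t n_in,
                                       size_t row_begin, size_t row_end) {
    for (size_t r = row_begin; r < row_end; r++) {
        const float * w_row = W + r * n_in;
        float acc = 0.0f;
        for (size_t c = 0; c < n_in; c++) {
            acc += w_row[c] * x[c];
        }
        // The residual is folded in here instead of in a follow-up ADD op.
        dst[r] = acc + residual[r];
    }
}
```

Because each thread's chunk covers a disjoint row range, folding the residual into the same loop needs no extra synchronization, which is the property the commit relies on for the per-chunk fusion in the repack path.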
Fixes #11
This fixes a Japanese prompt I was attempting to run, e.g.:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'

Output before change:

人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]

So it is outputting some characters, but some come out as �.
Output after change:
人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう
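The diff itself is not part of this excerpt, so as a rough illustration of the idea in the title (dropping vocab entries that do not decode to printable text), here is a sketch of one way such a filter could look. The helper names and the exact rule (a structural UTF-8 check plus rejecting ASCII control characters) are assumptions for illustration, not necessarily what this PR's two commits actually do.

```cpp
// Hypothetical sketch: drop vocab entries that contain ASCII control
// characters or byte sequences with broken UTF-8 lead/continuation
// structure, so they never reach the output stream.
#include <string>
#include <vector>
#include <cstdint>

static bool is_structurally_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const uint8_t c = (uint8_t) s[i];
        size_t len;
        if      (c < 0x80)           len = 1;
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return false;                       // stray continuation/invalid lead
        if (i + len > s.size()) return false;    // truncated sequence
        for (size_t j = 1; j < len; j++) {
            if (((uint8_t) s[i + j] & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

static bool is_printable_token(const std::string & tok) {
    if (!is_structurally_valid_utf8(tok)) return false;
    for (const char ch : tok) {
        const uint8_t c = (uint8_t) ch;
        if (c < 0x20 || c == 0x7F) return false; // ASCII control characters
    }
    return true;
}

static std::vector<std::string> filter_vocab(const std::vector<std::string> & vocab) {
    std::vector<std::string> out;
    out.reserve(vocab.size());
    for (const auto & tok : vocab) {
        if (is_printable_token(tok)) out.push_back(tok);
    }
    return out;
}
```

Note the trade-off visible in the before/after outputs above: filtering at the vocab level removes the � replacement characters, but tokens that are legitimate fragments of multi-byte characters can be dropped too, which may be why some characters appear to be missing from the "after" output.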