Remove unprintable characters from vocab list #25
Closed
beiller wants to merge 2 commits into ggml-org:master from
Conversation
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this pull request on Jun 27, 2023

* Added ggml_tensor_printf() to debug tensors. Not sure if all cases work, it was only tested a bit. Example for the ggml_repeat2 dst tensor after it was computed:

  | ggml_compute_forward_repeat2_f32:9497 | node_1233 |
  | Dimensions: 3 | Quantization: f32 | Layer id: 31 | Backend: CPU |
  | Elements: 64 x 2 x 71 | Src0: 64 x 2 x 1 | Src1: 64 x 2 x 71 | Operation: REPEAT2 |
  | Src0 name: node_1232 | Src1 name: leaf_17 |

  Content of src0 "node_1232" (3 dim):
    Layer 0: | -0.019758  0.772589  0.000000 | |  0.772589  0.000000  0.000000 | |  0.000000  0.000000  0.000000 |
    Layer 1: |  0.001423 -1.063233  0.000000 | | -1.063233  0.000000  0.000000 | |  0.000000  0.000000  0.000000 |
    Layer 2: | -0.042461 -0.936166  0.000000 | | -0.936166  0.000000  0.000000 | |  0.000000  0.000000  0.000000 |

  Content of src1 "leaf_17" (3 dim):
    Layers 0-2: all 0.000000

  Content of dst "node_1233" (3 dim):
    Layer 0: | -0.019758 -0.019758 -0.019758 | |  0.772589  0.772589  0.772589 | | -0.019758 -0.019758 -0.019758 |
    Layer 1: |  0.001423  0.001423  0.001423 | | -1.063233 -1.063233 -1.063233 | |  0.001423  0.001423  0.001423 |
    Layer 2: | -0.042461 -0.042461 -0.042461 | | -0.936166 -0.936166 -0.936166 | | -0.042461 -0.042461 -0.042461 |

* typo stride>n_elem - sample print is probably still bugged

* added strides and boolean info flags

---------

Co-authored-by: John <nolife+git@gmail.com>
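For orientation, a minimal sketch of what a tensor-dump helper along these lines could look like against the public ggml API is shown below. Only the name ggml_tensor_printf and the example layout above come from the commit; the function below, its name, and its output format are assumptions for illustration, not the commit's actual code.

```cpp
// Hypothetical sketch, not the commit's implementation: dump basic metadata
// and the first few values of an F32 tensor using only public ggml accessors.
#include "ggml.h"
#include <stdio.h>

static void debug_print_tensor(const struct ggml_tensor * t, const char * label) {
    fprintf(stderr, "=== %s: %s ===\n", label, t->name);
    fprintf(stderr, "type=%s op=%s ne=[%lld, %lld, %lld, %lld]\n",
            ggml_type_name(t->type), ggml_op_name(t->op),
            (long long) t->ne[0], (long long) t->ne[1],
            (long long) t->ne[2], (long long) t->ne[3]);
    if (t->src[0]) fprintf(stderr, "src0=%s\n", t->src[0]->name);
    if (t->src[1]) fprintf(stderr, "src1=%s\n", t->src[1]->name);

    // Only print raw contents for plain F32 tensors; quantized types would
    // need dequantization first.
    if (t->type == GGML_TYPE_F32 && t->data != NULL) {
        const int64_t n = ggml_nelements(t);
        for (int64_t i = 0; i < n && i < 9; i++) {
            fprintf(stderr, "% .6f%s", ggml_get_f32_1d(t, (int) i),
                    (i % 3 == 2) ? "\n" : "  ");
        }
    }
}
```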
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request on Aug 2, 2023
Add information on compiler flags
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request on Sep 30, 2025
QVAC-6093: Stream shards. Fixup for gradle.
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request on Mar 16, 2026
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp on Mar 27, 2026

- Experiment TheTom#25: QJL helps turbo4 by +0.3 PPL. Without QJL, turbo4 ≈ turbo3.
- Experiment #25b: Sign+magnitude encoding is neutral (decode is memory-bound).
- Experiment #25c: Long-context PPL validates turbo3 competitive with q8_0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InfernalDread referenced this pull request in InfernalDread/llama.cpp on Apr 4, 2026

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
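To make the guard change concrete, here is a rough sketch of the kind of check a kernel picker like get_best_fattn_kernel might apply. The enum, the helper names, and the structure are illustrative assumptions, not the fork's actual code.

```cpp
// Hypothetical sketch of a mixed K/V-type guard in a flash-attention kernel
// picker: any pairing drawn from the allowed set gets the fast VEC path,
// everything else falls back to a generic (slower) path.
enum class kv_type { f16, q8_0, turbo2, turbo3, turbo4 };

static bool in_allowed_set(kv_type t) {
    return t == kv_type::q8_0 || t == kv_type::turbo2 || t == kv_type::turbo3;
}

static bool fattn_vec_pair_supported(kv_type type_k, kv_type type_v) {
    // Before the fix, only matching K/V types were accepted, so e.g.
    // turbo3-K with turbo2-V silently fell back to the CPU FA path.
    return in_allowed_set(type_k) && in_allowed_set(type_v);
}
```

A permissive guard like this only works because the commit also instantiates the VEC kernel templates for each allowed (K, V) combination at head sizes 64/128/256; without matching instances, the dispatch would still fail.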
InfernalDread referenced this pull request in InfernalDread/llama.cpp on Apr 4, 2026

…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
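As a concrete illustration of the "16 centroids, nibble-packed" format described above, the sketch below quantizes a block of floats to 4-bit centroid indices and packs two indices per byte. The codebook values, block handling, and function names are made up for illustration; the fork's actual CUDA kernels, and the WHT rotation and per-block scaling they apply first, are not reproduced here.

```cpp
// Hypothetical sketch: map each float to the nearest of 16 centroids,
// pack two 4-bit indices per byte, and dequantize by table lookup.
#include <cstdint>
#include <cstddef>
#include <cmath>

static const float kCentroids[16] = { // illustrative codebook only
    -1.00f, -0.75f, -0.55f, -0.40f, -0.28f, -0.18f, -0.10f, -0.03f,
     0.03f,  0.10f,  0.18f,  0.28f,  0.40f,  0.55f,  0.75f,  1.00f,
};

static uint8_t nearest_centroid(float x) {
    uint8_t best = 0;
    float best_d = std::fabs(x - kCentroids[0]);
    for (uint8_t i = 1; i < 16; i++) {
        const float d = std::fabs(x - kCentroids[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

// n must be even; out holds n/2 bytes (low nibble = even element, high = odd).
static void quantize_block_4bit(const float * in, uint8_t * out, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        const uint8_t lo = nearest_centroid(in[i]);
        const uint8_t hi = nearest_centroid(in[i + 1]);
        out[i / 2] = (uint8_t) (lo | (hi << 4));
    }
}

static void dequantize_block_4bit(const uint8_t * in, float * out, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        out[i]     = kCentroids[in[i / 2] & 0x0F];
        out[i + 1] = kCentroids[in[i / 2] >> 4];
    }
}
```

In the real path, a per-block scale and the WHT rotation mentioned in the commit sit in front of this; the sketch only covers index selection and nibble packing.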
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request on Apr 16, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request on Apr 16, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request on Apr 17, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request on Apr 17, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request on Apr 20, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request on Apr 20, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request on Apr 21, 2026

…bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request on Apr 21, 2026

…l-org#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pestopoppa added a commit to pestopoppa/llama.cpp that referenced this pull request on Apr 27, 2026
…moe)
Enables proper graph-construction-time fusion of the o-projection MUL_MAT
with the inpSA residual ADD, bypassing the post-hoc op-mutation aliasing
bug from Phase 1.
Changes:
- build_attn(llm_graph_input_attn_kv*, ...) gains an optional `residual`
parameter. When safe (no LoRA, no wo_s, no wo_b, F32, shape match),
uses ggml_mul_mat_add_residual so the allocator sees src[2] from the
start and assigns a fresh output slot. Falls back to explicit ADD
otherwise so callers can treat the result as "already residual-added".
- qwen3moe.cpp passes inpSA as the residual (gated by
GGML_FUSE_ATTN_RES=1 while we validate). Skips fusion on the last
layer (inp_out_ids slicing) and when wo_s is set (outer scale would
rescale residual).
- repack.cpp: added apply_residual_chunk and per-chunk fusion inside
forward_mul_mat. This is the critical fix — repacked weights use
their own kernel and don't go through ggml_compute_forward_mul_mat,
so the original Phase 1 code path silently produced garbage when
the weight was in the repack buffer type. Per-chunk fusion needs no
extra barriers (each thread's chunk is disjoint).
- ggml-cpu.c: removed the mm_fused guard that was skipping extra_compute_forward
  for fused ops, since the repack path now handles them.
Correctness (GGML_NUMA_WEIGHTS=1, 48t, NPS4, PPL over 20 chunks of wikitext):
fusion OFF: 10.4006
fusion ON: 10.4006 (bit-exact)
Throughput (48t, NPS4, NUMA_WEIGHTS=1):
fusion OFF: pp128=296.26 tg128=39.41 t/s
fusion ON: pp128=293.92 tg128=39.43 t/s
Gain is within run-to-run noise on decode; fusion primarily saves one
barrier per layer + one allocated tensor. Ship as env-gated so future
work (RMS_NORM+MUL_MAT, attention-internal fusions) has a validated
infrastructure to build on.
Resolves task ggml-org#25. Task ggml-org#22 (CCD pool teardown hang) also resolved —
non-reproducible on 4x24t and 4x48t concurrent benches, apparently
fixed by the earlier cpuset commits (0ade7bd + 69b4c3f).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
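A minimal sketch of the fused matmul-plus-residual idea this commit describes is shown below. build_attn, ggml_mul_mat_add_residual, and the per-chunk fusion in repack.cpp are names taken from the commit message; the standalone function here is an illustrative stand-in using plain float arrays, not the fork's ggml implementation.

```cpp
// Hypothetical sketch: compute dst = W * x + residual in one pass over a
// chunk of output rows, so no separate ADD node (and no extra barrier)
// is needed after the matmul.
#include <cstddef>

// W is n_out x n_in (row-major), x has n_in elements,
// residual and dst have n_out elements.
static void mul_mat_add_residual_chunk(const float * W, const float * x,
                                       const float * residual, float * dst,
                                       size_t n_in,
                                       size_t row_begin, size_t row_end) {
    for (size_t r = row_begin; r < row_end; r++) {
        const float * w_row = W + r * n_in;
        float acc = 0.0f;
        for (size_t c = 0; c < n_in; c++) {
            acc += w_row[c] * x[c];
        }
        // The residual is folded in here instead of in a follow-up ADD op.
        dst[r] = acc + residual[r];
    }
}
```

Because each thread's chunk covers a disjoint row range, folding the residual into the same loop needs no extra synchronization, which is the property the commit relies on for the per-chunk fusion in the repack path.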
Fixes #11
This fixes a Japanese prompt I was attempting to run, e.g.:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'

Output before change:

人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]

So it is outputting some characters, but some come out as �.
Output after change:
人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう
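The diff itself is not part of this excerpt, so as a rough illustration of the idea in the title (dropping vocab entries that do not decode to printable text), here is a sketch of one way such a filter could look. The helper names and the exact rule (a structural UTF-8 check plus rejecting ASCII control characters) are assumptions for illustration, not necessarily what this PR's two commits actually do.

```cpp
// Hypothetical sketch: drop vocab entries that contain ASCII control
// characters or byte sequences with broken UTF-8 lead/continuation
// structure, so they never reach the output stream.
#include <string>
#include <vector>
#include <cstdint>

static bool is_structurally_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const uint8_t c = (uint8_t) s[i];
        size_t len;
        if      (c < 0x80)           len = 1;
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return false;                       // stray continuation/invalid lead
        if (i + len > s.size()) return false;    // truncated sequence
        for (size_t j = 1; j < len; j++) {
            if (((uint8_t) s[i + j] & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

static bool is_printable_token(const std::string & tok) {
    if (!is_structurally_valid_utf8(tok)) return false;
    for (const char ch : tok) {
        const uint8_t c = (uint8_t) ch;
        if (c < 0x20 || c == 0x7F) return false; // ASCII control characters
    }
    return true;
}

static std::vector<std::string> filter_vocab(const std::vector<std::string> & vocab) {
    std::vector<std::string> out;
    out.reserve(vocab.size());
    for (const auto & tok : vocab) {
        if (is_printable_token(tok)) out.push_back(tok);
    }
    return out;
}
```

Note the trade-off visible in the before/after outputs above: filtering at the vocab level removes the � replacement characters, but tokens that are legitimate fragments of multi-byte characters can be dropped too, which may be why some characters appear to be missing from the "after" output.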