
Remove unprintable characters from vocab list #25

Closed
beiller wants to merge 2 commits into ggml-org:master from beiller:feature/remove_unprintable

Conversation

@beiller
Contributor

@beiller beiller commented Mar 11, 2023

Fixes #11

This fixes a Japanese prompt I was attempting to run, for example:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'

(The prompt means "The meaning of life is".)

Output before change:

人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]

So some characters are output correctly, but others come through as the replacement character (�).

Output after change:

人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう
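
The PR's diff is not shown on this page, but the idea is to drop vocab entries whose bytes are not cleanly printable when the model vocabulary is loaded. A minimal C++ sketch of that idea (not the actual patch; is_printable_token and the loading hook are assumptions):

// Hypothetical sketch, not the PR's actual diff: skip vocab entries that
// contain non-printable single-byte characters when loading the vocabulary.
// Bytes >= 0x80 are left alone, since they belong to multi-byte UTF-8 sequences.
#include <cctype>
#include <string>

static bool is_printable_token(const std::string & tok) {
    for (unsigned char c : tok) {
        if (c < 0x80 && !std::isprint(c) && !std::isspace(c)) {
            return false;
        }
    }
    return true;
}

// During vocab loading, a token that fails the check could simply be blanked:
//     if (!is_printable_token(word)) { word.clear(); }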

@beiller beiller closed this Mar 11, 2023
@beiller beiller deleted the feature/remove_unprintable branch March 11, 2023 22:09
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this pull request Jun 27, 2023
* Added ggml_tensor_printf() to debug tensors.
Not sure if all cases work; it has only been lightly tested.

Example for the ggml_repeat2 dst tensor after it was computed:
+======================+======================+======================+======================+
| ggml_compute_forward_repeat2_f32:9497
| node_1233
+----------------------+----------------------+----------------------+----------------------+
| Dimensions           | Quantization         | Layer id             | Backend              |
| 3                    | f32                  | 31                   | CPU                  |
+----------------------+----------------------+----------------------+----------------------+
| Elements             | Src0                 | Src1                 | Operation            |
| 64 x 2 x 71          | 64 x 2 x 1           | 64 x 2 x 71          | REPEAT2              |
+----------------------+----------------------+----------------------+----------------------+
| Src0 name:           | node_1232                                                          |
| Src1 name:           | leaf_17                                                            |
+----------------------+----------------------+----------------------+----------------------+

+-------------------------------------------------------------------------------------------+
| Content of src0 "node_1232" (3 dim)
Layer 0
| -0.019758            0.772589             0.000000             |
| 0.772589             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
+-------------------------------------------------------------------------------------------+

Layer 1
| 0.001423             -1.063233            0.000000             |
| -1.063233            0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
+-------------------------------------------------------------------------------------------+

Layer 2
| -0.042461            -0.936166            0.000000             |
| -0.936166            0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
+-------------------------------------------------------------------------------------------+

+-------------------------------------------------------------------------------------------+
| Content of src1 "leaf_17" (3 dim)
Layer 0
| 0.000000             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
+-------------------------------------------------------------------------------------------+

Layer 1
| 0.000000             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
+-------------------------------------------------------------------------------------------+

Layer 2
| 0.000000             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
| 0.000000             0.000000             0.000000             |
+-------------------------------------------------------------------------------------------+

+-------------------------------------------------------------------------------------------+
| Content of dst "node_1233" (3 dim)
Layer 0
| -0.019758            -0.019758            -0.019758            |
| 0.772589             0.772589             0.772589             |
| -0.019758            -0.019758            -0.019758            |
+-------------------------------------------------------------------------------------------+

Layer 1
| 0.001423             0.001423             0.001423             |
| -1.063233            -1.063233            -1.063233            |
| 0.001423             0.001423             0.001423             |
+-------------------------------------------------------------------------------------------+

Layer 2
| -0.042461            -0.042461            -0.042461            |
| -0.936166            -0.936166            -0.936166            |
| -0.042461            -0.042461            -0.042461            |
+-------------------------------------------------------------------------------------------+

+======================+======================+======================+======================+

* fixed typo: stride > n_elem; the sample print is probably still bugged

* added strides and boolean info flags

---------

Co-authored-by: John <nolife+git@gmail.com>
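
ggml_tensor_printf() itself exists only in that fork; as a rough, hedged illustration of the same debugging idea, a stand-alone dumper for a 2-D f32 ggml tensor could look like the sketch below (dump_tensor_2d is a made-up name, and the formatting is simplified compared to the table above):

// Illustrative only: minimal dumper in the spirit of the output above.
// Strides (nb[]) are in bytes, so indexing goes through the raw data pointer.
#include <cstdint>
#include <cstdio>
#include "ggml.h"

static void dump_tensor_2d(const struct ggml_tensor * t, const char * tag) {
    printf("| Content of %s \"%s\" (%lld x %lld)\n",
           tag, t->name, (long long) t->ne[0], (long long) t->ne[1]);
    for (int64_t i1 = 0; i1 < t->ne[1]; ++i1) {
        printf("| ");
        for (int64_t i0 = 0; i0 < t->ne[0]; ++i0) {
            const float v = *(const float *)((const char *) t->data + i1*t->nb[1] + i0*t->nb[0]);
            printf("%-20.6f ", v);
        }
        printf("|\n");
    }
}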
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023
Add information on compiler flags
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request Sep 30, 2025
QVAC-6093: Stream shards. Fixup for gradle.
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Mar 27, 2026
- Experiment TheTom#25: QJL helps turbo4 by +0.3 PPL. Without QJL, turbo4 ≈ turbo3.
- Experiment #25b: Sign+magnitude encoding is neutral (decode is memory-bound).
- Experiment #25c: Long-context PPL validates turbo3 competitive with q8_0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InfernalDread referenced this pull request in InfernalDread/llama.cpp Apr 4, 2026
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
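
The turbo2/turbo3 types and the exact guard in get_best_fattn_kernel are specific to that fork; a hedged sketch of the relaxed mixed-type check described above (the GGML_TYPE_TURBO* names are assumptions, not upstream ggml types):

// Fork-specific sketch: allow any K/V type combination drawn from
// {turbo2, turbo3, q8_0} instead of requiring K and V to use the same type.
// GGML_TYPE_TURBO2_0 / GGML_TYPE_TURBO3_0 are assumed fork-only enum values.
static bool fattn_mixed_kv_supported(ggml_type type_k, ggml_type type_v) {
    const auto ok = [](ggml_type t) {
        return t == GGML_TYPE_Q8_0     ||
               t == GGML_TYPE_TURBO2_0 ||
               t == GGML_TYPE_TURBO3_0;
    };
    // VEC kernel instances must exist for the cross-type pair at D=64/128/256.
    return ok(type_k) && ok(type_v);
}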
InfernalDread referenced this pull request in InfernalDread/llama.cpp Apr 4, 2026
…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
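
The TURBO4 block layout, centroid table, and WHT rotation are all fork-specific, but the core step the commit describes is generic: pick the nearest of 16 centroids per value and pack two 4-bit indices per byte. A hedged, self-contained sketch of just that step (function name and layout are assumptions):

// Illustrative sketch of 4-bit nearest-centroid quantization with nibble
// packing; the real TURBO4 block format, centroid table, and WHT rotation
// are fork-specific and not reproduced here.
#include <cmath>
#include <cstddef>
#include <cstdint>

// Quantize n values to 4-bit centroid indices, two indices packed per byte.
static void quantize_nearest_centroid_4bit(const float * x, uint8_t * out,
                                           size_t n, const float centroids[16]) {
    for (size_t i = 0; i < n; i += 2) {
        uint8_t packed = 0;
        for (size_t half = 0; half < 2 && i + half < n; ++half) {
            int   best   = 0;
            float best_d = std::fabs(x[i + half] - centroids[0]);
            for (int c = 1; c < 16; ++c) {
                const float d = std::fabs(x[i + half] - centroids[c]);
                if (d < best_d) { best_d = d; best = c; }
            }
            packed |= (uint8_t) (best << (4 * half));   // low nibble first
        }
        out[i / 2] = packed;
    }
}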
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026
…bug 2)
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026
…l-org#25 bug 1
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request Apr 17, 2026
…bug 2)
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request Apr 17, 2026
…l-org#25 bug 1
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026
…bug 2)
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026
…l-org#25 bug 1
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request Apr 21, 2026
…bug 2)
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request Apr 21, 2026
…l-org#25 bug 1
pestopoppa added a commit to pestopoppa/llama.cpp that referenced this pull request Apr 27, 2026
…moe)

Enables proper graph-construction-time fusion of the o-projection MUL_MAT
with the inpSA residual ADD, bypassing the post-hoc op-mutation aliasing
bug from Phase 1.

Changes:
  - build_attn(llm_graph_input_attn_kv*, ...) gains an optional `residual`
    parameter. When safe (no LoRA, no wo_s, no wo_b, F32, shape match),
    uses ggml_mul_mat_add_residual so the allocator sees src[2] from the
    start and assigns a fresh output slot. Falls back to explicit ADD
    otherwise so callers can treat the result as "already residual-added".
  - qwen3moe.cpp passes inpSA as the residual (gated by
    GGML_FUSE_ATTN_RES=1 while we validate). Skips fusion on the last
    layer (inp_out_ids slicing) and when wo_s is set (outer scale would
    rescale residual).
  - repack.cpp: added apply_residual_chunk and per-chunk fusion inside
    forward_mul_mat. This is the critical fix — repacked weights use
    their own kernel and don't go through ggml_compute_forward_mul_mat,
    so the original Phase 1 code path silently produced garbage when
    the weight was in the repack buffer type. Per-chunk fusion needs no
    extra barriers (each thread's chunk is disjoint).
  - ggml-cpu.c: removed the mm_fused guard that was skipping extra_compute
    _forward for fused ops, since the repack path now handles them.

Correctness (GGML_NUMA_WEIGHTS=1, 48t, NPS4, PPL over 20 chunks of wikitext):
  fusion OFF: 10.4006
  fusion ON:  10.4006   (bit-exact)

Throughput (48t, NPS4, NUMA_WEIGHTS=1):
  fusion OFF: pp128=296.26 tg128=39.41 t/s
  fusion ON:  pp128=293.92 tg128=39.43 t/s
  Gain is within run-to-run noise on decode; fusion primarily saves one
  barrier per layer + one allocated tensor. Ship as env-gated so future
  work (RMS_NORM+MUL_MAT, attention-internal fusions) has a validated
  infrastructure to build on.

Resolves task ggml-org#25. Task ggml-org#22 (CCD pool teardown hang) also resolved —
non-reproducible on 4x24t and 4x48t concurrent benches, apparently
fixed by the earlier cpuset commits (0ade7bd + 69b4c3f).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
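
ggml_mul_mat_add_residual and the build_attn changes are fork-specific; a hedged sketch of the gating logic the commit describes, with the parameter names (wo_s, wo_b, has_lora) taken as assumptions from the message above rather than from the fork's real signature:

// Hedged sketch: gate the fused o-projection + residual ADD on the "safe"
// conditions listed above, otherwise fall back to an explicit ADD so callers
// can always treat the result as already residual-added.
#include "ggml.h"

static ggml_tensor * build_o_proj_with_residual(
        ggml_context * ctx,
        ggml_tensor  * wo,        // o-projection weight
        ggml_tensor  * cur,       // attention output before the o-projection
        ggml_tensor  * residual,  // inpSA, or nullptr to disable fusion
        ggml_tensor  * wo_s,      // optional outer scale on the projection
        ggml_tensor  * wo_b,      // optional bias
        bool           has_lora) {
    const bool can_fuse =
        residual && !has_lora && !wo_s && !wo_b &&
        residual->type == GGML_TYPE_F32 &&
        residual->ne[0] == wo->ne[1];   // residual matches the mul_mat output rows

    if (can_fuse) {
        // Fused path: the allocator sees the residual as src[2] from the start
        // and assigns the output a fresh slot (no post-hoc op mutation).
        return ggml_mul_mat_add_residual(ctx, wo, cur, residual);
    }

    // Fallback: explicit ops, applied in the same order the fused path implies.
    ggml_tensor * out = ggml_mul_mat(ctx, wo, cur);
    if (wo_s)     out = ggml_mul(ctx, out, wo_s);
    if (wo_b)     out = ggml_add(ctx, out, wo_b);
    if (residual) out = ggml_add(ctx, out, residual);
    return out;
}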

Development

Successfully merging this pull request may close these issues.

Unicode support

1 participant