
ggml : fix AMX and improve alignment checks #19867

Closed
angt wants to merge 1 commit into ggml-org:master from angt:ggml-fix-amx-and-improve-alignment-checks

Conversation

@angt
Member

@angt angt commented Feb 24, 2026

Before this change, the selector was too restrictive and some models fell back to CPU_REPACK:

perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.82 seconds per pass - ETA 0.08 minutes
[1]17.3868,[2]22.2199,
Final estimate: PPL = 22.2199 +/- 1.59692

llama_perf_context_print:        load time =     511.56 ms
llama_perf_context_print: prompt eval time =    2529.70 ms /  4096 tokens (    0.62 ms per token,  1619.16 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    7630.80 ms /  4097 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                  845 =   318 +     224 +     302                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  288 =   288 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                   31 =    31 +       0 +       0                |

With the fix, they use AMX directly:

perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.39 seconds per pass - ETA 0.07 minutes
[1]17.2005,[2]21.8220,
Final estimate: PPL = 21.8220 +/- 1.56485

llama_perf_context_print:        load time =     376.37 ms
llama_perf_context_print: prompt eval time =    2159.65 ms /  4096 tokens (    0.53 ms per token,  1896.60 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    5189.06 ms /  4097 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                  845 =   318 +     224 +     302                |
llama_memory_breakdown_print: |   - AMX                |                  319 =   319 +       0 +       0                |

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
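
To make the description concrete, here is a minimal sketch of the idea behind the fix, with hypothetical names and an assumed alignment constant (this is not the actual ggml code): rather than gating tensors on a narrow allow-list, check the alignment the AMX kernels actually require, so fewer tensors fall back to CPU_REPACK.

#include <cstddef>
#include <cstdint>

// Illustrative only: alignment assumed to be required by the AMX tile kernels.
static constexpr size_t AMX_ALIGNMENT = 64;

static bool is_aligned(const void * p, size_t align) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}

// Hypothetical selector: accept any tensor whose data pointer and row
// stride meet the alignment the kernels need, instead of rejecting
// everything outside a narrow allow-list (which caused the CPU_REPACK
// fallback shown above).
static bool amx_can_use(const void * data, size_t row_stride_bytes) {
    return is_aligned(data, AMX_ALIGNMENT) && row_stride_bytes % AMX_ALIGNMENT == 0;
}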
@angt angt requested a review from ggerganov as a code owner February 24, 2026 22:07
@angt
Member Author

angt commented Feb 24, 2026

I use

llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF:Q4_0 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048 --chunks 2

from #16315

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Feb 24, 2026
@ggerganov
Member

Does this fix the issues with LFM2.5 reported in #19184?

@angt
Member Author

angt commented Feb 25, 2026

> Does this fix the issues with LFM2.5 reported in #19184?

It works together with PR #19884.

I'm just discovering the backend mechanism, so maybe my fix is completely wrong.

@rgerganov
Member

I think this PR fixes the performance regression in the AMX backend that I reported in #19039. Here are some benchmarks on a C4 instance with 16 vCPUs running on Google Cloud (using 8 threads):

| Model            | Test  | t/s da426cb | t/s ggml-fix-amx-and-improve-alignment-checks | Speedup |
| ---------------- | ----- | ----------- | --------------------------------------------- | ------- |
| gemma3 4B Q4_K_M | pp512 | 151.55      | 170.97                                        | 1.13    |
| gemma3 4B Q4_K_M | tg128 | 18.94       | 18.76                                         | 0.99    |
| gemma3 4B Q8_0   | pp512 | 123.92      | 233.03                                        | 1.88    |
| gemma3 4B Q8_0   | tg128 | 14.62       | 22.73                                         | 1.55    |
| qwen3 4B Q4_K_M  | pp512 | 109.63      | 142.13                                        | 1.30    |
| qwen3 4B Q4_K_M  | tg128 | 19.26       | 19.03                                         | 0.99    |
| qwen3 4B Q8_0    | pp512 | 90.49       | 191.75                                        | 2.12    |
| qwen3 4B Q8_0    | tg128 | 12.94       | 22.22                                         | 1.72    |
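
For reference, the Speedup column is the ratio of the two t/s columns (e.g. 233.03 / 123.92 ≈ 1.88 for the gemma3 Q8_0 pp512 row), and the pp512/tg128 tests match llama-bench's default prompt-processing and token-generation workloads. A command along these lines (model filename assumed) reproduces the two rows for one model:

llama-bench -m gemma-3-4b-it-Q8_0.gguf -t 8 -p 512 -n 128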

@rgerganov
Member

Hm, when running llama-completion, the memory breakdown says that no AMX memory is used for compute; not sure if this is correct:

llama_memory_breakdown_print: |   - Host               |                 10179 =  4051 +    5760 +     368                |
llama_memory_breakdown_print: |   - AMX                |                  4555 =  4555 +       0 +       0                |

Full log below.

$ bin/llama-completion -m ~/.cache/llama.cpp/Qwen3-4B-Q8_0.gguf -p 'who are you?' -n 128 -no-cnv
build: 8146 (44a65238) with GNU 12.2.0 for Linux x86_64
main: llama backend init
main: load the model and apply lora adapter, if any
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.23 seconds
llama_model_loader: loaded meta data with 34 key-value pairs and 398 tensors from /home/rgerganov/.cache/llama.cpp/Qwen3-4B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 4B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 4B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-4B/...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 4B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                          qwen3.block_count u32              = 36
llama_model_loader: - kv  13:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv  14:                     qwen3.embedding_length u32              = 2560
llama_model_loader: - kv  15:                  qwen3.feed_forward_length u32              = 9728
llama_model_loader: - kv  16:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  17:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  21:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q8_0:  253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 3.98 GiB (8.50 BPW) 
load: 0 unused tokens
load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load:   - 128247 ('</s>')
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 27
load: token to piece cache size = 0.9311 MB
print_info: arch                  = qwen3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 40960
print_info: n_embd                = 2560
print_info: n_embd_inp            = 2560
print_info: n_layer               = 36
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 9728
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 40960
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 4B
print_info: model params          = 4.02 B
print_info: general.name          = Qwen3 4B
print_info: vocab type            = BPE
print_info: n_vocab               = 151936
print_info: n_merges              = 151387
print_info: BOS token             = 151643 '<|endoftext|>'
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151643 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 128247 '</s>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors:          AMX model buffer size =  4555.18 MiB
load_tensors:   CPU_Mapped model buffer size =  4051.20 MiB
......................................................................................
common_init_result: added </s> logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 40960
llama_context: n_ctx_seq     = 40960
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =  5760.00 MiB
llama_kv_cache: size = 5760.00 MiB ( 40960 cells,  36 layers,  1/1 seqs), K (f16): 2880.00 MiB, V (f16): 2880.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:        CPU compute buffer size =   368.75 MiB
sched_reserve: graph nodes  = 1267
sched_reserve: graph splits = 1
sched_reserve: reserve took 2.32 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 2758206881
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
generate: n_ctx = 40960, n_batch = 2048, n_predict = 128, n_keep = 0

who are you? you are a ai assistant.

Yes, I am an AI assistant. I can help with various tasks such as answering questions, creating content, writing code, and more. How can I assist you today?

what is your name?

My name is Qwen, and I am a large language model developed by Alibaba Cloud. I can help with a wide range of tasks, such as answering questions, writing articles, creating stories, and more. How can I assist you today?

what are your capabilities?

I can perform a variety of tasks, including but not limited to:

1. Answering questions on a wide range of topics, including science

common_perf_print:    sampling time =      38.95 ms
common_perf_print:    samplers time =      11.16 ms /   132 tokens
common_perf_print:        load time =    2130.76 ms
common_perf_print: prompt eval time =      63.67 ms /     4 tokens (   15.92 ms per token,    62.83 tokens per second)
common_perf_print:        eval time =    5538.34 ms /   127 runs   (   43.61 ms per token,    22.93 tokens per second)
common_perf_print:       total time =    5643.45 ms /   131 tokens
common_perf_print: unaccounted time =       2.49 ms /   0.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        126
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 10179 =  4051 +    5760 +     368                |
llama_memory_breakdown_print: |   - AMX                |                  4555 =  4555 +       0 +       0                |
