add `--dry-run` option to `llama-quantize` by ddh0 · Pull Request #19526 · ggml-org/llama.cpp

ddh0 · 2026-02-11T20:20:52Z

This PR adds a new --dry-run option to llama-quantize. This option calculates the size of each tensor in the target type without actually performing quantization, and prints the final quantization size in the same way that llama-quantize does currently.

Example command:

llama-quantize --dry-run gemma-3-4b-it-q8_0.gguf Q4_K

Example output:

main: build = 8015 (07f882bbb)
main: built with AppleClang 17.0.0.17000603 for Darwin arm64
main: calculating quantization size for '/Users/dylan/Documents/AI/gguf/gemma-3-4b-it-q8_0.gguf' as Q4_K
llama_model_loader: loaded meta data with 41 key-value pairs and 444 tensors from /Users/dylan/Documents/AI/gguf/gemma-3-4b-it-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gemma-3-4b-it
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                          general.file_type u32              = 7
llama_model_loader: - kv  21:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  23:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  24:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  25:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  37:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  38:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  39:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  40:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q8_0:  239 tensors
[   1/ 444]                   output_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[   2/ 444]                    token_embd.weight - [  2560, 262208,      1,      1], type =   q8_0, size =   680.17 MiB ->   525.13 MiB (q6_K)
[   3/ 444]                  blk.0.attn_k.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     1.41 MiB (q4_K)
[   4/ 444]             blk.0.attn_k_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[   5/ 444]               blk.0.attn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[   6/ 444]             blk.0.attn_output.weight - [  2048,   2560,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[   7/ 444]                  blk.0.attn_q.weight - [  2560,   2048,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[   8/ 444]             blk.0.attn_q_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[   9/ 444]                  blk.0.attn_v.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     2.05 MiB (q6_K)
[  10/ 444]                blk.0.ffn_down.weight - [ 10240,   2560,      1,      1], type =   q8_0, size =    26.56 MiB ->    20.51 MiB (q6_K)
[  11/ 444]                blk.0.ffn_gate.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[  12/ 444]                blk.0.ffn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[  13/ 444]                  blk.0.ffn_up.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[  14/ 444]     blk.0.post_attention_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[  15/ 444]           blk.0.post_ffw_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
|
|  ... truncated for brevity ...
|
[ 432/ 444]                 blk.33.attn_k.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     1.41 MiB (q4_K)
[ 433/ 444]            blk.33.attn_k_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[ 434/ 444]              blk.33.attn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[ 435/ 444]            blk.33.attn_output.weight - [  2048,   2560,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[ 436/ 444]                 blk.33.attn_q.weight - [  2560,   2048,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[ 437/ 444]            blk.33.attn_q_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[ 438/ 444]                 blk.33.attn_v.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     2.05 MiB (q6_K)
[ 439/ 444]               blk.33.ffn_down.weight - [ 10240,   2560,      1,      1], type =   q8_0, size =    26.56 MiB ->    20.51 MiB (q6_K)
[ 440/ 444]               blk.33.ffn_gate.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[ 441/ 444]               blk.33.ffn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[ 442/ 444]                 blk.33.ffn_up.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[ 443/ 444]    blk.33.post_attention_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[ 444/ 444]          blk.33.post_ffw_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
llama_model_quantize_impl: model size  =  3932.82 MiB (8.50 BPW)
llama_model_quantize_impl: quant size  =  2368.31 MiB (5.12 BPW)

main: quantize time =   139.49 ms
main:    total time =   139.49 ms

Credit to @AesSedai for this idea - he has a preliminary version that can be seen here. His version supports calculating the size for all possible quantization types and creating a measurement file that can be re-used for any quantization. For now, this is just a simple calculation that runs on every tensor, with no fancy options.

ddh0 · 2026-02-11T20:23:26Z

AI disclosure: I used Claude to help with understanding which changes need to be made, but I did all the changes by hand.

ddh0 · 2026-02-11T21:43:57Z

I think this is ready for review. I've tested it with gemma-3-4b (dense), granite-4.0-micro (dense hybrid), and granite-4.0-tiny-preview (hybrid MoE) and the calculated sizes are exactly right for all three (comparing --dry-run to an actual quantization). Not sure if there is some edge case I'm missing.

ddh0 · 2026-02-11T21:48:00Z

There are two small unrelated changes that are included here:

in the output of llama_model_quantize_impl, at the end where it prints model size = ... and quant size = ..., I added BPW to each of these lines for convenience (no longer have to load the model to see the final BPW)
changed the number of characters used for tensor dimensions in llama_format_tensor_shape from 5 to 6 (noticed it was not enough when testing gemma)

ddh0 · 2026-02-12T00:07:43Z

Also tested with --tensor-type overrides and it works with those, too.

…atrix

ddh0 · 2026-02-12T03:30:59Z

The latest commits are a minor refactor of llama_model_quantize_impl. I moved this old check:

            if ((new_type == GGML_TYPE_IQ2_XXS ||
                 new_type == GGML_TYPE_IQ2_XS  ||
                 new_type == GGML_TYPE_IQ2_S   ||
                 new_type == GGML_TYPE_IQ1_S   ||
                (new_type == GGML_TYPE_IQ1_M && strcmp(tensor->name, "token_embd.weight") && strcmp(tensor->name, "output.weight"))  ||
                (new_type == GGML_TYPE_Q2_K && params->ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && strcmp(tensor->name, "token_embd.weight") != 0)) && !imatrix) { ... }

into a new function:

static bool tensor_type_requires_imatrix(const llama_model_quantize_params * params, const ggml_tensor * t, const ggml_type dst_type) {
    return (
        dst_type == GGML_TYPE_IQ2_XXS || dst_type == GGML_TYPE_IQ2_XS ||
        dst_type == GGML_TYPE_IQ3_XXS || dst_type == GGML_TYPE_IQ1_S  ||
        dst_type == GGML_TYPE_IQ2_S   || dst_type == GGML_TYPE_IQ1_M  ||
        (   // Q2_K is the worst k-quant type - only allow it without imatrix for token embeddings
            dst_type == GGML_TYPE_Q2_K && strcmp(t->name, "token_embd.weight") != 0
        )
    );
}

so I can re-use it for dry-run, and added new conditional warning, which gives a heads-up if performing this quantization will require an imatrix. This will prevent many headaches for me personally, and hopefully others :)

llama_model_quantize_impl: WARNING: dry run completed successfully, but actually completing this quantization will require an imatrix!

ddh0 · 2026-02-18T23:04:53Z

Gentle ping @ggerganov - any chance this can be merged?

CISC · 2026-02-19T21:29:26Z

Something fails with quantize CI test: https://github.com/ggml-org/llama.cpp/actions/runs/22198756896/job/64205914038?pr=19526

ddh0 · 2026-02-19T21:39:39Z

Hmm, it looks like the test that fails is using this command:

./bin/llama-quantize ../models-mnt/qwen3/0.6B/ggml-model-bf16.gguf ../models-mnt/qwen3/0.6B/ggml-model-q2_k.gguf q2_k 8

and fails with:

2026-02-19T20:40:31.1084399Z ============================================================
2026-02-19T20:40:31.1085296Z Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
2026-02-19T20:40:31.1086200Z The result will be garbage, so bailing out
2026-02-19T20:40:31.1086832Z ============================================================
2026-02-19T20:40:31.1087256Z 
2026-02-19T20:40:31.5579790Z llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
2026-02-19T20:40:31.5581416Z main: failed to quantize model from '../models-mnt/qwen3/0.6B/ggml-model-bf16.gguf'

This seems fine to me; it didn't provide an imatrix, so it's correct for this to fail.

However right after that:

2026-02-19T20:40:31.5657945Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-ppl.log: No such file or directory
2026-02-19T20:40:31.5683643Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-imatrix-sum.log: No such file or directory
2026-02-19T20:40:31.5707597Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-f16.log: No such file or directory
2026-02-19T20:40:31.5731845Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-bf16.log: No such file or directory
2026-02-19T20:40:31.5757933Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q8_0.log: No such file or directory
2026-02-19T20:40:31.5783377Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q4_0.log: No such file or directory
2026-02-19T20:40:31.5807355Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q4_1.log: No such file or directory
2026-02-19T20:40:31.5830700Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q5_0.log: No such file or directory
2026-02-19T20:40:31.5855480Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q5_1.log: No such file or directory
2026-02-19T20:40:31.5880203Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q2_k.log: No such file or directory
2026-02-19T20:40:31.5902849Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q3_k.log: No such file or directory
2026-02-19T20:40:31.5926651Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q4_k.log: No such file or directory
2026-02-19T20:40:31.5950618Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q5_k.log: No such file or directory
2026-02-19T20:40:31.5973404Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q6_k.log: No such file or directory
2026-02-19T20:40:31.5997086Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-save-load-state.log: No such file or directory

No idea what this is, or why all the results at the end are blank. It seems unrelated to me, but I'm not sure. Let me know how to proceed.

ddh0 · 2026-02-19T21:42:12Z

If it would be easier, I can take out the tensor_type_requires_imatrix and related changes in this PR, and move all the refactoring to #19616, leaving this as only --dry-run. But still, I'm confused as to why the tests are failing.

(Edit: Though, that would be kind of a mess now that I think about it. Hmm...)

CISC · 2026-02-19T21:52:03Z

Hmm, it looks like the test that fails is using this command:

./bin/llama-quantize ../models-mnt/qwen3/0.6B/ggml-model-bf16.gguf ../models-mnt/qwen3/0.6B/ggml-model-q2_k.gguf q2_k 8

This seems fine to me; it didn't provide an imatrix, so it's correct for this to fail.

Ah, that's the problem, you basically changed the check from new_type == GGML_TYPE_Q2_K && params->ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S to just new_type == GGML_TYPE_Q2_K, and now it fails.

ddh0 · 2026-02-19T22:00:35Z

OK, I'll change the check back to use the same conditions as before, since this change is not related to --dry-run anyway. Will leave those changes to the next PR.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ddh0 · 2026-02-19T22:15:37Z

For the record, about that Q2_K condition:

static bool tensor_type_requires_imatrix(const ggml_tensor * t, const ggml_type dst_type, const llama_ftype ftype) {
    return (
        dst_type == GGML_TYPE_IQ2_XXS || dst_type == GGML_TYPE_IQ2_XS ||
        dst_type == GGML_TYPE_IQ3_XXS || dst_type == GGML_TYPE_IQ1_S  ||
        dst_type == GGML_TYPE_IQ2_S   || dst_type == GGML_TYPE_IQ1_M  ||
        (   // Q2_K_S is the worst k-quant type - only allow it without imatrix for token embeddings
            dst_type == GGML_TYPE_Q2_K && ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && strcmp(t->name, "token_embd.weight") != 0
        )
    );
}

Making the per-tensor imatrix requirement conditional on the tensor itself (t->type and t->name) makes sense, but I think that making it conditional on the overall ftype (Q2_K_S) does not make sense, and I will attempt remove it cleanly in the refactor.

CISC · 2026-02-19T22:31:39Z

Making the per-tensor imatrix requirement conditional on the tensor itself (t->type and t->name) makes sense, but I think that making it conditional on the overall ftype (Q2_K_S) does not make sense, and I will attempt remove it cleanly in the refactor.

The reason it's there is because of this:

llama.cpp/src/llama-quant.cpp

Lines 237 to 244 in b1123f9

    
           } else if (name == "token_embd.weight" || name == "per_layer_token_embd.weight") { 
        
               if (qs.params->token_embedding_type < GGML_TYPE_COUNT) { 
        
                   new_type = qs.params->token_embedding_type; 
        
               } else { 
        
                   if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || 
        
                       ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) { 
        
                       new_type = GGML_TYPE_Q2_K; 
        
                   }

and these:

llama.cpp/src/llama-quant.cpp

Lines 255 to 270 in b1123f9

    
           } else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || 
        
                      ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M    || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) { 
        
               if (name.find("attn_v.weight") != std::string::npos) { 
        
                   if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K; 
        
                   else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K; 
        
                   ++qs.i_attention_wv; 
        
               } 
        
               else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) { 
        
                   new_type = GGML_TYPE_Q4_K; 
        
               } 
        
               else if (name.find("ffn_down") != std::string::npos) { 
        
                   if (qs.i_ffn_down < qs.n_ffn_down/8) { 
        
                       new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K; 
        
                   } 
        
                   ++qs.i_ffn_down; 
        
               }

ddh0 · 2026-02-19T22:44:13Z

I see, thank you. I'll probably just leave that logic how it is, then. Merge when green? 🤞

CISC · 2026-02-19T22:49:25Z

I see, thank you. I'll probably just leave that logic how it is, then. Merge when green? 🤞

Yep. :)

BTW, that per_layer_token_embd.weight was probably added some time later. Means that check should be amended, but another PR...

@compilade

* clean slate for branch * use 6 characters for tensor dims * add --dry-run to llama-quantize * use 6 characters for tensor dims (cont.) * no need to re-calculate ggml_nbytes for tensor * fix indent * show model and quant BPW when quant completes * add example to --help * new function `tensor_requires_imatrix`, add courtesy warning about imatrix * missing __func__, move imatrix flag set * logic error * fixup tensor_requires_imatrix * add missing `GGML_TYPE`s * simplify and rename `tensor_type_requires_imatrix` * simplify for style * add back Q2_K edge case for imatrix * guard ftype imatrix warning * comment ref ggml-org#12557 * remove per @compilade * remove unused `params` parameter * move `bool dry_run` per GG * move `bool dry_run` per GG * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

@compilade

* clean slate for branch * use 6 characters for tensor dims * add --dry-run to llama-quantize * use 6 characters for tensor dims (cont.) * no need to re-calculate ggml_nbytes for tensor * fix indent * show model and quant BPW when quant completes * add example to --help * new function `tensor_requires_imatrix`, add courtesy warning about imatrix * missing __func__, move imatrix flag set * logic error * fixup tensor_requires_imatrix * add missing `GGML_TYPE`s * simplify and rename `tensor_type_requires_imatrix` * simplify for style * add back Q2_K edge case for imatrix * guard ftype imatrix warning * comment ref ggml-org#12557 * remove per @compilade * remove unused `params` parameter * move `bool dry_run` per GG * move `bool dry_run` per GG * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

@compilade

* clean slate for branch * use 6 characters for tensor dims * add --dry-run to llama-quantize * use 6 characters for tensor dims (cont.) * no need to re-calculate ggml_nbytes for tensor * fix indent * show model and quant BPW when quant completes * add example to --help * new function `tensor_requires_imatrix`, add courtesy warning about imatrix * missing __func__, move imatrix flag set * logic error * fixup tensor_requires_imatrix * add missing `GGML_TYPE`s * simplify and rename `tensor_type_requires_imatrix` * simplify for style * add back Q2_K edge case for imatrix * guard ftype imatrix warning * comment ref ggml-org#12557 * remove per @compilade * remove unused `params` parameter * move `bool dry_run` per GG * move `bool dry_run` per GG * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

@compilade

* clean slate for branch * use 6 characters for tensor dims * add --dry-run to llama-quantize * use 6 characters for tensor dims (cont.) * no need to re-calculate ggml_nbytes for tensor * fix indent * show model and quant BPW when quant completes * add example to --help * new function `tensor_requires_imatrix`, add courtesy warning about imatrix * missing __func__, move imatrix flag set * logic error * fixup tensor_requires_imatrix * add missing `GGML_TYPE`s * simplify and rename `tensor_type_requires_imatrix` * simplify for style * add back Q2_K edge case for imatrix * guard ftype imatrix warning * comment ref ggml-org#12557 * remove per @compilade * remove unused `params` parameter * move `bool dry_run` per GG * move `bool dry_run` per GG * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

@compilade

* clean slate for branch * use 6 characters for tensor dims * add --dry-run to llama-quantize * use 6 characters for tensor dims (cont.) * no need to re-calculate ggml_nbytes for tensor * fix indent * show model and quant BPW when quant completes * add example to --help * new function `tensor_requires_imatrix`, add courtesy warning about imatrix * missing __func__, move imatrix flag set * logic error * fixup tensor_requires_imatrix * add missing `GGML_TYPE`s * simplify and rename `tensor_type_requires_imatrix` * simplify for style * add back Q2_K edge case for imatrix * guard ftype imatrix warning * comment ref ggml-org#12557 * remove per @compilade * remove unused `params` parameter * move `bool dry_run` per GG * move `bool dry_run` per GG * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ddh0 added 4 commits February 11, 2026 12:47

clean slate for branch

844ad3e

Merge branch 'ggml-org:master' into llama-quantize-dry-run

e6b790f

use 6 characters for tensor dims

0d22288

add --dry-run to llama-quantize

56c27b1

github-actions Bot added the examples label Feb 11, 2026

ddh0 added 5 commits February 11, 2026 14:29

use 6 characters for tensor dims (cont.)

c3f42de

no need to re-calculate ggml_nbytes for tensor

b9b32f0

fix indent

150e1db

show model and quant BPW when quant completes

966b21a

add example to --help

07f882b

ddh0 marked this pull request as ready for review February 11, 2026 21:41

ddh0 requested a review from ggerganov as a code owner February 11, 2026 21:41

ddh0 added 6 commits February 11, 2026 20:49

new function tensor_requires_imatrix, add courtesy warning about im…

2769f35

…atrix

missing __func__, move imatrix flag set

ea8da05

logic error

3211a84

fixup tensor_requires_imatrix

55dbee2

add missing GGML_TYPEs

22db764

simplify and rename tensor_type_requires_imatrix

ae786b8

ddh0 and others added 7 commits February 11, 2026 21:41

simplify for style

1ccd7a4

add back Q2_K edge case for imatrix

1658228

guard ftype imatrix warning

b15bb34

comment ref ggml-org#12557

4052824

remove per @compilade

44f9fee

remove unused params parameter

f58de63

Merge branch 'ggml-org:master' into llama-quantize-dry-run

75ab2b3

Merge branch 'ggml-org:master' into llama-quantize-dry-run

855010e

ggerganov approved these changes Feb 19, 2026

View reviewed changes

Comment thread include/llama.h Outdated

ddh0 added 3 commits February 19, 2026 14:19

move bool dry_run per GG

e56e9db

move bool dry_run per GG

5218991

Merge branch 'ggml-org:master' into llama-quantize-dry-run

42d59c2

CISC reviewed Feb 19, 2026

View reviewed changes

Comment thread src/llama-quant.cpp Outdated

Comment thread src/llama-quant.cpp Outdated

Comment thread src/llama-quant.cpp Outdated

ddh0 and others added 3 commits February 19, 2026 16:03

Update src/llama-quant.cpp

38ef33f

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Update src/llama-quant.cpp

4d38a0b

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Update src/llama-quant.cpp

b1123f9

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

CISC approved these changes Feb 19, 2026

View reviewed changes

CISC merged commit 492bc31 into ggml-org:master Feb 20, 2026
78 checks passed

ddh0 deleted the llama-quantize-dry-run branch February 20, 2026 18:15

ddh0 mentioned this pull request Feb 20, 2026

llama-quant : fail early on missing imatrix, refactor type selection, code cleanup #19770

Merged

ubergarm mentioned this pull request Feb 23, 2026

llama-quantize: --dry-run option ikawrakow/ik_llama.cpp#1309

Merged

deadprogram mentioned this pull request Feb 24, 2026

feature: add DryRun param to ModelQuantizeParams to match latest llama.cpp hybridgroup/yzma#222

Merged

Conversation

ddh0 commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ddh0 commented Feb 11, 2026

Uh oh!

ddh0 commented Feb 11, 2026

Uh oh!

ddh0 commented Feb 11, 2026

Uh oh!

ddh0 commented Feb 12, 2026

Uh oh!

ddh0 commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ddh0 commented Feb 18, 2026

Uh oh!

Uh oh!

CISC commented Feb 19, 2026

Uh oh!

ddh0 commented Feb 19, 2026

Uh oh!

ddh0 commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CISC commented Feb 19, 2026

Uh oh!

ddh0 commented Feb 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ddh0 commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CISC commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ddh0 commented Feb 19, 2026

Uh oh!

CISC commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ddh0 commented Feb 11, 2026 •

edited

Loading

ddh0 commented Feb 12, 2026 •

edited

Loading

ddh0 commented Feb 19, 2026 •

edited

Loading

ddh0 commented Feb 19, 2026 •

edited

Loading

CISC commented Feb 19, 2026 •

edited

Loading

CISC commented Feb 19, 2026 •

edited

Loading