
llama: introduce support for model-embedded sampling parameters #17120

Merged
taronaeo merged 17 commits into ggml-org:master from taronaeo:feat/auto-sampling-params on Nov 25, 2025

Conversation

taronaeo (Member) commented Nov 9, 2025

ref: #17088

This PR introduces support for setting sampler parameters from GGUF KV metadata, allowing model creators to embed recommended sampler settings that are applied unless explicitly overridden via CLI flags.

This is handy for users who do not want to tinker with the settings but still want the recommended ones applied.

Priority of Sampling Parameters

  1. User flags (e.g., setting --temp 0.6)
  2. Model-embedded recommendation (e.g., general.sampling.temp = 0.6)
  3. Default hardcoded values in common_params_sampling
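
For illustration, the resolution order could be sketched as follows (not the actual llama.cpp implementation; the helper name and the 0.80 fallback are illustrative):

#include <optional>

// Resolve the effective temperature:
//   1. --temp from the CLI, if the user passed it
//   2. general.sampling.temp from the GGUF KV metadata, if the model embeds it
//   3. the hardcoded default from common_params_sampling otherwise (value illustrative)
float resolve_temp(std::optional<float> cli_temp, std::optional<float> model_temp) {
    if (cli_temp.has_value())   return *cli_temp;
    if (model_temp.has_value()) return *model_temp;
    return 0.80f;
}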

Introduced Metadata

  • general.sampling.sequence
  • general.sampling.top_k
  • general.sampling.top_p
  • general.sampling.min_p
  • general.sampling.xtc_probability
  • general.sampling.xtc_threshold
  • general.sampling.temp
  • general.sampling.penalty_last_n
  • general.sampling.penalty_repeat
  • general.sampling.mirostat
  • general.sampling.mirostat_tau
  • general.sampling.mirostat_eta

Please let me know if we should introduce more sampling parameters.

Embedding From Safetensors into GGUF

By default, the conversion script attempts to find generation_config.json within the model directory and automatically adds the recommended sampler parameters to the GGUF metadata. If a sampling parameter is not available in that file, users can also specify it via --metadata metadata.json.

Note that --metadata metadata.json takes precedence over generation_config.json and will overwrite metadata if duplicate keys are found.

$ cat > metadata.json << EOF 
{
    "general.sampling.temp": 0.6
}
EOF

$ python3 convert_hf_to_gguf.py --outfile deepseek-r1-distill-qwen-1.5b.gguf --metadata metadata.json deepseek-r1-distill-qwen-1.5b/

$ ./build/bin/llama-cli -m deepseek-r1-distill-qwen-1.5b.gguf -p "Write me a dog walking business idea 1. " -no-cnv -n 1 -t 10 2>&1 | grep "temp"    
llama_model_loader: - kv   2:                       general.sampling.temp f32             = 0.600000
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
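
For reference, a typical generation_config.json that the conversion script would pick up automatically looks something like this (values illustrative; presumably only fields with a general.sampling.* counterpart are mapped):

{
    "do_sample": true,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40
}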

github-actions bot added the python (python script changes) label Nov 9, 2025
CISC (Member) commented Nov 9, 2025

$ cat > metadata.json << EOF 
{
    "general.sampler.temp": 0.6
}
EOF

So, you're suggesting that parameters should be added manually before conversion? How likely is that to happen?

AFAIK most models come with recommended (though, some are likely to just be copy-pasted from somewhere) settings in generation_config.json, so perhaps a better idea to get them from there?

Edit: or is that automatically added to metadata?

taronaeo (Member, Author) commented Nov 9, 2025

$ cat > metadata.json << EOF 
{
    "general.sampler.temp": 0.6
}
EOF

So, you're suggesting that parameters should be added manually before conversion? How likely is that to happen?

AFAIK most models come with recommended (though, some are likely to just be copy-pasted from somewhere) settings in generation_config.json, so perhaps a better idea to get them from there?

You're right, I didn't spot that. I guess I'll have to rework the code so that it pulls generation_config.json from the model directory and maps it to general.sampler.*, and then we can skip the --metadata flag.

Green-Sky (Collaborator) commented:

I think the sampling sequence is important too. Also, I personally only really tend to use min-p and xtc (not in your proposal).

taronaeo (Member, Author) commented Nov 9, 2025

@Green-Sky Will include general.sampler.xtc_probability and general.sampler.xtc_threshold first, then --samplers SEQUENCE.

@CISC RE generation_config.json vs. the custom --metadata file, I've realised that generation_config.json does not actually document (non-standard) support for parameters such as mirostat. In this case, we'll still need support for --metadata metadata.json to cover these parameters, unless there is a better way of handling this.

CISC (Member) commented Nov 9, 2025

@CISC RE generation_config.json vs. the custom --metadata file, I've realised that generation_config.json does not actually document (non-standard) support for parameters such as mirostat. In this case, we'll still need support for --metadata metadata.json to cover these parameters, unless there is a better way of handling this.

Does transformers even have this parameter?

taronaeo (Member, Author) commented Nov 9, 2025

@CISC RE generation_config.json vs. the custom --metadata file, I've realised that generation_config.json does not actually document (non-standard) support for parameters such as mirostat. In this case, we'll still need support for --metadata metadata.json to cover these parameters, unless there is a better way of handling this.

Does transformers even have this parameter?

Doesn't look like it. I followed some of Ollama's supported parameters: https://ollama.readthedocs.io/en/modelfile/#parameter

taronaeo (Member, Author) commented:

@CISC any update on this PR?

(Collapsed review comment threads on gguf-py/gguf/constants.py, common/common.h (outdated), and common/common.cpp.)
CISC (Member) commented Nov 14, 2025

@CISC any update on this PR?

Thanks for the reminder. :)

taronaeo requested a review from danbev as a code owner on November 16, 2025
taronaeo (Member, Author) commented:

@ggerganov soft ping. In case it was missed, the latest changes are ready for review :)

Review comment thread on include/llama.h, lines 536 to 537:
// Get sampling metadata key name. Returns nullptr if the key is invalid
LLAMA_API const char * llama_model_meta_key_str(enum llama_model_meta_key key);
A reviewer (Member) commented:

Although the comment is true right now, I guess the intended future use is broader than sampling?

taronaeo (Member, Author) replied:

Although the comment is true right now, I guess the intended future use is broader than sampling?

Yep, we just need to update the llama_model_meta_key enum to include other metadata keys :)
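
For illustration, the new accessor could be combined with the existing llama_model_meta_val_str roughly as sketched below (the enumerator LLAMA_MODEL_META_KEY_SAMPLING_TEMP is hypothetical; see llama.h for the actual names this PR introduces):

#include <stdio.h>
#include "llama.h"

// Print the model-embedded temperature, if present (sketch, not actual llama.cpp code).
static void print_embedded_temp(const struct llama_model * model) {
    // hypothetical enumerator; the real names are defined in llama.h
    const char * key = llama_model_meta_key_str(LLAMA_MODEL_META_KEY_SAMPLING_TEMP);
    if (key == NULL) {
        return; // invalid key
    }

    char buf[64];
    if (llama_model_meta_val_str(model, key, buf, sizeof(buf)) >= 0) {
        printf("%s = %s\n", key, buf); // e.g. general.sampling.temp = 0.600000
    }
}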

taronaeo merged commit 877566d into ggml-org:master on Nov 25, 2025
74 checks passed
ORippler (Collaborator) left a comment:

Thanks for this nice addition to llama.cpp! Some remarks from my side:

By default, transformers disables all samplers that are not specified in generation_config.json (with the exception of top-k, which they are aware of and intend to change). llama.cpp, however, has non-disabled defaults for top-k, top-p, min-p, and temperature when values are not specified by the user or generation_config.json.

I feel the "correct" path of execution would be to follow transformers (and also vllms) logic of disabling all samplers that are not explicitly enabled via active parametrization, though this would be a somewhat breaking/disruptive change. Also, it would be nice to follow the hard-coded sampling-chain-logic as applied in transformers if we follow the guidance given by model builders via generation_config.json. Maybe one could resolve to llama.cpp's current defaults if:

  1. no override is given by the user for any parameter, and
  2. the model does not guide towards a sampling chain (though under transformers/vLLM logic, all-absence should be treated as "no pre-filtering" being done, argh).

Thoughts?

Edit: Reading through the PR, I realized that not all LLM inference solutions support the same set of samplers, making any move towards "unification" even more difficult, if not outright futile, hmhm.
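
For illustration, the difference comes down to how an absent key is interpreted (a rough sketch, assuming generation_config.json has been parsed into optional fields; the 0.95 fallback mirrors the llama.cpp log output quoted in the PR description):

#include <optional>

// transformers/vLLM-style: a key absent from generation_config.json means the
// sampler is effectively off (top_p = 1.0 is a no-op).
float top_p_transformers_style(std::optional<float> cfg_top_p) {
    return cfg_top_p.value_or(1.0f);
}

// current llama.cpp behaviour: an absent key falls back to the hardcoded default.
float top_p_llamacpp_default(std::optional<float> cfg_top_p) {
    return cfg_top_p.value_or(0.95f);
}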

Review comment thread on gguf-py/gguf/metadata.py, lines 77 to 78:
if gen_config:
    metadata.sampling_sequence = gen_config.get("sequence", metadata.sampling_sequence)
A reviewer (Collaborator) commented:

AFAIK transformers.GenerationConfig has no way to mutate the sampling sequence. It's hard-coded here. So the only way to set the sampling sequence is by supplying it via metadata_override?
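
For illustration, such a metadata override might look like the sketch below (assuming the value uses the same semicolon-separated sampler names accepted by the --samplers CLI flag; model/ and model.gguf are placeholder paths):

$ cat > metadata.json << EOF
{
    "general.sampling.sequence": "top_k;top_p;min_p;temperature"
}
EOF

$ python3 convert_hf_to_gguf.py --outfile model.gguf --metadata metadata.json model/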

dagbdagb commented:

I looked at updating docs/gguf.md with information from this merge, but chickened out... @taronaeo, if gguf.md is the de facto documentation for the GGUF format (big if), shouldn't this feature be described there?

Also, out of curiosity, aren't "the best" sampling parameters for a model somewhat dependent on the specific task at hand? If so, would extending this feature to support sets of sampling parameters make sense?

AsbjornOlling (Contributor) commented:

I looked at updating docs/gguf.md with information from this merge, but chickened out... @taronaeo, if gguf.md is the de facto documentation for the GGUF format (big if), shouldn't this feature be described there?

Also, out of curiosity, aren't "the best" sampling parameters for a model somewhat dependent on the specific task at hand? If so, would extending this feature to support sets of sampling parameters make sense?

+1 to both points. Adding it to gguf.md would be great for discoverability.

Many models (most?) have a single recommended sampling configuration, but some (e.g. qwen3.5) have multiple sampler configs: with/without thinking, coding vs. general tasks.

It could be worthwhile to have the GGUF metadata support multiple sampler configs, just like it supports multiple chat templates, e.g. general.sampling.default.top_p and general.sampling.coding.top_p.


Labels: python (python script changes)

7 participants