# Top-H Decoding

Top-H is a training-free decoding method that balances creativity and coherence in open-ended text generation by constraining entropy at each step. It solves an entropy-constrained mass maximization problem with an efficient greedy procedure, yielding generations that stay coherent even at high temperature.
📄 This repository accompanies our paper:
**Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation**
- Seyedarmin Azizi — seyedarm@usc.edu
- Erfan Baghaei Potraghloo — baghaeip@usc.edu
- Souvik Kundu — mail2ksouvik@gmail.com
- Massoud Pedram — pedram@usc.edu
## Contents

- [Overview](#overview)
- [Key Features](#key-features)
- [Results Summary](#results-summary)
- [Installation](#installation)
- [Reproducing Paper Results](#reproducing-paper-results)
- [Inference Example](#inference-example)
- [Method Details](#method-details)
- [Tuning & Tips](#tuning--tips)
- [Citation](#citation)
- [Contact](#contact)
## Overview

Classic truncated sampling (temperature, top-k, top-p, min-p) trades diversity against coherence but often ignores the shape of the next-token distribution. Top-H makes this trade-off explicit by upper-bounding the entropy of the truncated distribution relative to the original model distribution: it explores more when the model is unsure and tightens when it is confident.
At a glance:
- Formulates Entropy-Constrained Minimum Divergence (ECMD) and proves its equivalence to Entropy-Constrained Mass Maximization (ECMM), which is NP-hard; a formal statement of ECMM follows this list.
- Introduces a greedy approximation (Top-H) with a simple termination guarantee controlled by an entropy scale α.
- Delivers strong empirical gains over min-p and top-p, especially at higher temperatures.
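For concreteness, ECMM can be stated as below. This is our paraphrase of the description above, not the paper's exact formulation: p is the model's next-token distribution over vocabulary V, S is the kept token set, and q_S is p restricted to S and renormalized.

```math
\max_{S \subseteq V} \sum_{i \in S} p_i
\quad \text{s.t.} \quad H(q_S) \le \alpha \, H(p),
\qquad q_S(i) = \frac{p_i}{\sum_{j \in S} p_j} \ \ \text{for } i \in S
```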
## Key Features

- 🛠 Training-free & model-agnostic — drop-in decoding; no fine-tuning.
- 🎛 Entropy-aware truncation — caps randomness via H(q) ≤ α·H(p), recalculated every step.
- 🧮 Theory-backed — ECMD ⇔ ECMM (NP-hard); practical greedy rule with early-stop criterion.
- 🔥 Robust at high temperature — maintains coherence where min-p/top-p degrade.
- 🧪 Wide evaluation — creative writing (e.g., Alpaca-Eval, MT-Bench) and QA (GPQA, GSM8K).
## Results Summary

- On creative-writing benchmarks, Top-H outperforms state-of-the-art alternatives by up to 25.63% while preserving consistency.
- On reasoning datasets (GSM8K, GPQA), Top-H remains robust at elevated temperatures.
| Benchmark / Model / T | min-p | top-p | Top-H |
|---|---|---|---|
| GSM8K — LLaMA-3.1-8B — T=2 | 13.72 | 2.65 | 39.35 |
| GPQA — Phi-3-Mini — T=2 | 23.44 | 18.53 | 30.80 |
See the paper for full tables, settings, and ablations.
## Installation

```bash
git clone https://github.com/your-org/Top-H-Decoding.git
cd Top-H-Decoding
pip install -r requirements.txt
```

## Reproducing Paper Results

### Creative writing (Alpaca-Eval)

From the repository root, run:

```bash
bash alpaca_evaluate.sh
```

### QA benchmarks (lm-evaluation-harness)

1. Clone the lm-evaluation-harness repository and install it, then return to the parent directory:

   ```bash
   git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
   cd lm-evaluation-harness
   pip install -e .
   cd ..
   ```

2. Copy the huggingface.py file from this repo into the lm-evaluation-harness repo, replacing the original:

   ```bash
   cp -f ./Top-H-Decoding/huggingface.py ./lm-evaluation-harness/lm_eval/models
   ```

3. Run the evaluation script:

   ```bash
   bash lm_evaluate.sh
   ```
## Inference Example

This section shows how to run inference with the Top-H decoding strategy on Hugging Face models. Create an instance of TopH_LogitsProcessor and pass it to model.generate():
```python
from logit_processor import TopH_LogitsProcessor
from transformers import AutoModelForCausalLM, AutoTokenizer

# Choose your model
model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create Top-H logits processor
lp = TopH_LogitsProcessor(temperature=0.3)

# Define prompt
prompt = """Write a creative story about time moving backwards."""
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(model.device)

# Generate output
output = model.generate(
    input_ids=input_ids,
    do_sample=True,         # sampling must be enabled for temperature to take effect
    temperature=0.3,
    logits_processor=[lp],  # inject custom processor
    top_p=None,             # disable nucleus sampling
    max_new_tokens=512,
)

print("output:\n")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Method Details

- **ECMD (conceptual):** minimize a divergence (e.g., JSD) between the model's original distribution p and a truncated distribution q, under an entropy cap H(q) ≤ α·H(p).
- **Equivalence:** ECMD ⇔ ECMM, i.e., maximizing retained probability mass subject to the same entropy bound.
- **Complexity:** ECMM is NP-hard; Top-H uses a greedy accumulation over tokens sorted by probability, with a stop rule when adding the next token would exceed the entropy budget (see the sketch after this list).
- **Adaptivity:** the threshold α·H(p) is recomputed at each time step, so Top-H loosens when the model is uncertain and tightens when it is confident.
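To make the greedy rule concrete, here is a minimal NumPy sketch of the procedure these bullets describe. It is an illustrative reimplementation based on our reading of the method, not the repository's TopH_LogitsProcessor; the function name, epsilon handling, and the always-keep-top-1 fallback are our own choices.

```python
# Illustrative sketch of the greedy Top-H rule; not the repository's implementation.
import numpy as np

def top_h_truncate(p: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Keep tokens in descending-probability order until adding the next one
    would push the entropy of the renormalized kept set past alpha * H(p)."""
    def entropy(d):
        return float(-(d * np.log(d + 1e-12)).sum())

    budget = alpha * entropy(p)         # per-step entropy budget alpha * H(p)
    order = np.argsort(p)[::-1]         # token indices by descending probability
    keep = [order[0]]                   # always keep the top-1 token (fallback rule)
    for idx in order[1:]:
        cand = p[np.array(keep + [idx])]
        if entropy(cand / cand.sum()) > budget:  # adding idx would exceed the cap
            break                                # greedy early stop
        keep.append(idx)
    q = np.zeros_like(p)
    q[keep] = p[keep] / p[keep].sum()   # renormalize the kept mass
    return q

# Toy distribution: a tighter alpha keeps fewer tokens.
p = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(top_h_truncate(p, alpha=0.4))  # keeps only the top token
print(top_h_truncate(p, alpha=0.6))  # keeps the top two tokens
```

In the real processor this logic runs on logits inside model.generate(); the sketch works on probabilities directly to keep the rule visible.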
## Tuning & Tips

- α (entropy scale) is the primary knob (a sample sweep follows this list):
  - Lower α → tighter truncation: more coherent output, shorter candidate sets.
  - Higher α → looser truncation: more diverse output, longer candidate sets.
  - We commonly use α ≈ 0.3–0.5.
- Temperature: Top-H plays well with higher T, since the entropy constraint maintains coherence.
- Fallback: in rare numerical edge cases where no token is selected, fall back to argmax (equivalently, always include the top-1 token).
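To explore the α trade-off empirically, a hypothetical sweep is sketched below, reusing model, tokenizer, and input_ids from the inference example above. Exposing the entropy scale as an alpha keyword on TopH_LogitsProcessor is our assumption; check logit_processor.py for the constructor's actual parameter name.

```python
from logit_processor import TopH_LogitsProcessor

# Sweep the entropy scale; the `alpha` keyword is assumed, not confirmed --
# see logit_processor.py for the processor's actual constructor signature.
for alpha in (0.3, 0.4, 0.5):
    lp = TopH_LogitsProcessor(temperature=0.7, alpha=alpha)
    output = model.generate(
        input_ids=input_ids,
        do_sample=True,
        temperature=0.7,
        logits_processor=[lp],
        top_p=None,
        max_new_tokens=256,
    )
    print(f"alpha={alpha}:")
    print(tokenizer.decode(output[0], skip_special_tokens=True), "\n")
```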
## Citation

If you use Top-H in your research, please cite:
```bibtex
@misc{potraghloo2025toph,
      title={Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation},
      author={Erfan Baghaei Potraghloo and Seyedarmin Azizi and Souvik Kundu and Massoud Pedram},
      year={2025},
      eprint={2509.02510},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contact

For questions or collaboration, reach out:
- Seyedarmin Azizi — seyedarm@usc.edu
- Erfan Baghaei Potraghloo — baghaeip@usc.edu