
llama-bench: Fix to reduce very high ± variability #21282

Draft

michaelw9999 wants to merge 1 commit into ggml-org:master from michaelw9999:bench-fix

Conversation

@michaelw9999 (Contributor)

Overview

After PR #19754, llama-bench started to show very high variability (±) for most bench runs. The variability can still be worked around by adding repetitions, e.g. -r 5 or -r 10. This change reduces llama-bench's high variability/noise by adding 4 warmup runs (easy to adjust), which seems to be the sweet spot: the extra runs add little delay while still substantially reducing variability.
There may still be some run-to-run variance caused by system load or other factors, but this new default guards against it and produces more consistent output.
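For illustration only, here is a minimal self-contained sketch of the warmup idea, not the actual diff: the N_WARMUP constant and the bench_run() stand-in are hypothetical names, and the real llama-bench measures full pp/tg passes rather than a toy loop.

```cpp
// Sketch of the warmup idea: run several discarded passes before timing,
// so one-time costs (e.g. CUDA graph capture) do not leak into the first
// measured repetition. Not llama-bench's actual code.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

static const int N_WARMUP = 4; // the "sweet spot" from the PR description

// stand-in for one full benchmark pass (pp512 / tg128 in the real tool)
static void bench_run() {
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) {
        x = x + std::sqrt((double) i);
    }
}

int main() {
    const int n_reps = 5; // corresponds to llama-bench's -r flag

    // discarded warmup passes: absorb one-time setup costs
    for (int i = 0; i < N_WARMUP; i++) {
        bench_run();
    }

    // only these repetitions are timed and enter the mean / stddev
    std::vector<double> ms;
    for (int i = 0; i < n_reps; i++) {
        auto t0 = std::chrono::steady_clock::now();
        bench_run();
        auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    double mean = 0.0;
    for (double v : ms) mean += v;
    mean /= ms.size();
    double var = 0.0;
    for (double v : ms) var += (v - mean) * (v - mean);
    const double sd = std::sqrt(var / ms.size()); // population stddev

    std::printf("%.2f ms ± %.2f\n", mean, sd);
    return 0;
}
```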

Additional information

Example llama-bench output before/after the change, on some models I've tested, run without extra flags:

| Model | Test | Before (t/s) | After (t/s) |
| --- | --- | --- | --- |
| Qwen3.5 0.8B Q4_K_M | pp512 | 25231.96 ± 17262.95 | 42934.98 ± 240.28 |
| Qwen3.5 0.8B Q4_K_M | tg128 | 496.66 ± 2.32 | 496.70 ± 4.56 |
| Qwen3.5 0.8B NVFP4 | pp512 | 27147.01 ± 1477.00 | 46399.96 ± 144.82 |
| Qwen3.5 0.8B NVFP4 | tg128 | 404.41 ± 1.87 | 394.96 ± 1.38 |
| Cascade 31B MXFP4 | pp512 | 8609.84 ± 4706.08 | 10613.82 ± 28.14 |
| Cascade 31B MXFP4 | tg128 | 201.45 ± 9.30 | 205.84 ± 4.34 |
| Cascade 31B NVFP4 | pp512 | 9052.70 ± 4221.99 | 10986.75 ± 22.50 |
| Cascade 31B NVFP4 | tg128 | 195.93 ± 3.98 | 199.45 ± 0.58 |

Requirements

  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes; AI helped locate the best position for the fix.

@michaelw9999 (Contributor, Author)

@JohannesGaessler

@am17an (Contributor) left a comment


Can you add --n-warmup-runs instead of this? On CPUs the variability can be even higher.
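For context, a --n-warmup-runs flag does not exist yet; a minimal sketch of how such an option could be parsed, using a plain argv loop rather than llama-bench's real argument parser, might look like this:

```cpp
// Illustrative sketch of the *proposed* --n-warmup-runs option.
// This is not llama-bench's actual CLI code.
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main(int argc, char ** argv) {
    int n_warmup = 1; // fallback when the flag is absent (placeholder value)
    for (int i = 1; i < argc; i++) {
        if (std::strcmp(argv[i], "--n-warmup-runs") == 0 && i + 1 < argc) {
            n_warmup = std::atoi(argv[++i]); // e.g. --n-warmup-runs 4
        }
    }
    std::printf("warmup runs: %d\n", n_warmup);
    return 0;
}
```

Making the count configurable would let CPU users, who may need more warmup passes, pick a value without recompiling.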

@JohannesGaessler (Contributor)

To clarify: what we are observing is not an issue with variance, it's an issue with bias. The CUDA graph is actually captured during the first benchmark run rather than during the warmup run, so the performance of the first benchmark run is consistently underestimated vs. real-life usage. The correct solution, as far as I'm concerned, is to do 2 warmup runs if and only if the number of tokens and the physical batch size are equal. That should fix the issue and only minimally increase the runtime.
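A minimal sketch of this conditional warmup, with illustrative names (n_tokens, n_ubatch) that are not necessarily llama-bench's exact identifiers:

```cpp
// Sketch of the conditional warmup proposed above; not the actual patch.
#include <cstdio>

static int n_warmup_runs(int n_tokens, int n_ubatch) {
    // Per the comment above: when the number of tokens equals the physical
    // batch size, the CUDA graph would otherwise be captured during the
    // first *measured* run, biasing it low, so do one extra warmup pass.
    return (n_tokens == n_ubatch) ? 2 : 1;
}

int main() {
    std::printf("512 tokens, ubatch 512 -> %d warmup run(s)\n", n_warmup_runs(512, 512));
    std::printf("512 tokens, ubatch 256 -> %d warmup run(s)\n", n_warmup_runs(512, 256));
    return 0;
}
```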

@michaelw9999 (Contributor, Author)

I'm going to switch this to draft, study this further, and see if I can come up with a better solution that also works for CPU and for tg.
I still see too much variation with warmup=2, and even at -r 50 there's sometimes a hiccup and one measurement is way off. Perhaps collect a few extra samples and just discard the outliers (see the sketch below).

Most of the time, even without n=4, tg variability remains low (e.g. 0.9% for tg128 on my first data point). The fix brought pp512's down to 0.55%.
I've tried -r 50 on some of the tiny models, and while it's better with n=4, maybe there is still something else going on.
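As a sketch of the outlier-discarding idea, assuming a simple trimmed mean that drops the lowest and highest samples before averaging (illustrative only, not code from the PR):

```cpp
// Trimmed mean: sort the samples, drop n_trim from each end, average the rest.
// One "hiccup" run then no longer skews the reported result.
#include <algorithm>
#include <cstdio>
#include <vector>

static double trimmed_mean(std::vector<double> samples, size_t n_trim) {
    std::sort(samples.begin(), samples.end());
    double sum = 0.0;
    size_t n = 0;
    for (size_t i = n_trim; i + n_trim < samples.size(); i++) {
        sum += samples[i];
        n++;
    }
    return n > 0 ? sum / n : 0.0;
}

int main() {
    // one hiccup (3000.0) among otherwise stable t/s measurements
    std::vector<double> tps = { 10610.0, 10620.0, 3000.0, 10605.0, 10615.0 };
    std::printf("trimmed mean: %.1f t/s\n", trimmed_mean(tps, 1)); // 10610.0
    return 0;
}
```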

michaelw9999 marked this pull request as draft on April 2, 2026, 19:21
@am17an (Contributor) commented Apr 3, 2026

BTW, that is kind of expected for an extremely small model like the one you're testing (Qwen 0.8B); you should try larger models, which better mirror real-world use cases.

