
common : do not pass prompt tokens to reasoning budget sampler #22488

Merged
aldehir merged 4 commits into ggml-org:master from aldehir:fix-reasoning-budget
Apr 29, 2026

Conversation

@aldehir
Contributor

@aldehir aldehir commented Apr 28, 2026

Overview

cont: #22323

Do not pass prompt tokens through the reasoning budget sampler, mirroring grammar behavior. Renamed accept_grammar to is_generated to better convey the purpose of this flag.

Also adjusted the prefill logic to pass the generation prompt through the reasoning budget sampler as well. I removed the prefill_tokens parameter, as it required the prefill to match the starting token sequence exactly. Instead, we simply feed each token individually so it gets processed by the state machine.
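The two ideas above — skipping prompt tokens and replaying the prefill token by token — can be sketched as follows. This is a minimal stand-in, not the actual llama.cpp implementation; the names (`ReasoningBudget`, `ThinkState`, `accept`, `replay_prefill`) are illustrative, as are the literal `<think>` markers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative state machine for a reasoning budget. Only generated tokens
// advance it; prompt tokens are skipped, mirroring grammar behavior.
enum class ThinkState { Inactive, Active, Closed };

struct ReasoningBudget {
    ThinkState state = ThinkState::Inactive;
    int tokens_in_think = 0;

    void accept(const std::string & tok, bool is_generated) {
        if (!is_generated) {
            return; // prompt tokens never touch the state machine
        }
        if (tok == "<think>") {
            state = ThinkState::Active;
        } else if (tok == "</think>") {
            state = ThinkState::Closed;
        } else if (state == ThinkState::Active) {
            tokens_in_think++;
        }
    }
};

// Prefill replay: feed each token individually so the state machine processes
// it, instead of requiring the prefill to match a fixed token sequence.
void replay_prefill(ReasoningBudget & rb, const std::vector<std::string> & prefill) {
    for (const auto & tok : prefill) {
        rb.accept(tok, /*is_generated=*/true);
    }
}
```

With this shape, a prompt token marked `is_generated = false` is a no-op, while a replayed generation prompt drives the same transitions as live sampling.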


@aldehir aldehir requested a review from a team as a code owner April 28, 2026 22:15
@aldehir aldehir changed the title Fix reasoning budget common : do not pass prompt tokens to reasoning budget sampler Apr 28, 2026
@aldehir aldehir marked this pull request as draft April 28, 2026 22:17
@aldehir aldehir marked this pull request as ready for review April 28, 2026 22:22
@aldehir
Contributor Author

aldehir commented Apr 28, 2026

@BruceJillis if you have an opportunity, can you check whether this addresses the core issue?

Member

@pwilkin pwilkin left a comment


Nice :)

@BruceJillis
Contributor

@aldehir I like the change! Now the same state machine drives both prefill and generation. I tested with Qwen3.6-27B using the test cases I made for #22323: the activated transition fires during common_sampler_init / prefill replay, and deactivated fires on a natural close. So yes, it addresses the issue, and the refactor looks clean to me.

As an aside: a user flagged that the reasoning budget logs are very noisy on #22323. Do you have a rough timeline for this PR? If it's a while out, I'd like to open a small follow-up that logs the unimportant transitions at DEBUG while keeping budget exhausted / forcing immediately at INFO.

Review thread on common/sampling.h (outdated)
@aldehir aldehir requested a review from ggerganov April 29, 2026 11:38
@aldehir
Contributor Author

aldehir commented Apr 29, 2026

I did another pass and realized that precomputing whether the grammar should accept a token needs to stay; otherwise the check runs against the already-updated reasoning budget state, which is incorrect.
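The ordering concern above can be illustrated with a minimal sketch. These names (`State`, `grammar_would_accept`, `budget_update`) and the toy acceptance rule are hypothetical, not the actual llama.cpp code; the point is only that the grammar decision must be taken against the state as it was before the reasoning budget update mutates it.

```cpp
#include <cassert>

// Illustrative sampler state touched by the reasoning budget update.
struct State {
    bool thinking = false;
};

// Illustrative rule: grammar constraints only apply outside a think block.
bool grammar_would_accept(const State & s) {
    return !s.thinking;
}

// Illustrative reasoning budget update that mutates the state.
void budget_update(State & s) {
    s.thinking = true;
}
```

If the acceptance check ran after `budget_update`, it would observe the new state and reach the opposite decision, which is the bug the precomputation avoids.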

@aldehir
Contributor Author

aldehir commented Apr 29, 2026

@BruceJillis this should land soon, and it will resolve the logging noise by only logging for generated think sequences.

@pwilkin
Member

pwilkin commented Apr 29, 2026

Waiting for CI and will merge.

@aldehir aldehir merged commit d775992 into ggml-org:master Apr 29, 2026
46 checks passed
@aldehir aldehir deleted the fix-reasoning-budget branch April 29, 2026 19:11
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  common : check for null getpwuid in hf-cache (ggml-org#22550)
  vulkan: add get/set tensor 2d functions (ggml-org#22514)
  spec: fix argument typo (ggml-org#22552)
  ci : bump ty to 0.0.33 (ggml-org#22535)
  vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  add fast matmul iquants (ggml-org#22504)
  spec : fix draft model checkpoints (ggml-org#22521)
  spec : fix vocab compat checks in spec example (ggml-org#22426)
  common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  spec : disacard last drafted token with low prob (ggml-org#22506)
  sync : ggml
  ggml : bump version to 0.10.1 (ggml/1469)
  webui: fix slow mic stop and WAV encode (ggml-org#22480)
  ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  ...

# Conflicts:
#	.gitignore
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026


4 participants