common : do not pass prompt tokens to reasoning budget sampler#22488
aldehir merged 4 commits into ggml-org:master from
Conversation
@BruceJillis if you have an opportunity, can you see if this addresses the core issue?
@aldehir I like the change! Now the same state machine drives prefill and generation. I tested with Qwen3.6-27B and the test cases I made for #22323. As an aside: a user flagged that the reasoning budget logs are very noisy on #22323. Do you have a rough timeline for this PR? If it's a while out, I'd like to open a small follow-up that logs the unimportant transitions at DEBUG while leaving budget exhausted / forcing immediately at INFO.
I did another pass and realized that precomputing whether the grammar should accept needs to stay; otherwise, it checks against the updated reasoning budget state, which is incorrect.
@BruceJillis this should land soon, so that will resolve the logging issue and only log for generated think sequences.
Waiting for CI and will merge. |
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  - ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  - Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  - common : check for null getpwuid in hf-cache (ggml-org#22550)
  - vulkan: add get/set tensor 2d functions (ggml-org#22514)
  - spec: fix argument typo (ggml-org#22552)
  - ci : bump ty to 0.0.33 (ggml-org#22535)
  - vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  - CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  - scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  - add fast matmul iquants (ggml-org#22504)
  - spec : fix draft model checkpoints (ggml-org#22521)
  - spec : fix vocab compat checks in spec example (ggml-org#22426)
  - common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  - hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  - CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  - spec : disacard last drafted token with low prob (ggml-org#22506)
  - sync : ggml
  - ggml : bump version to 0.10.1 (ggml/1469)
  - webui: fix slow mic stop and WAV encode (ggml-org#22480)
  - ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  - ...

# Conflicts:
#	.gitignore
Overview
cont: #22323
Do not pass prompt tokens through the reasoning budget sampler, mirroring grammar behavior. Renamed `accept_grammar` to `is_generated` to better convey the purpose of this flag.

Also adjusted the prefill logic to pass the generation prompt through the reasoning budget sampler as well. I removed the `prefill_tokens` parameter, as it required the prefill to match the starting token sequence exactly. Instead, we simply feed each token individually so it gets processed by the state machine.

Additional information
Requirements