
perf: add is_reasoning_end_streaming() override to GptOssReasoningParser #4

Open
fergusfinn wants to merge 111 commits into main from perf/gptoss-streaming-reasoning-end

Conversation


fergusfinn commented Mar 2, 2026

Summary

  • Override is_reasoning_end_streaming() in GptOssReasoningParser to window the backward scan to the last ~23 tokens instead of scanning the entire sequence
  • Reduces per-step cost from O(n) to O(1), eliminating the O(n²) total cost over a generation

Approach

The base class is_reasoning_end_streaming(input_ids, delta_ids) defaults to calling is_reasoning_end(input_ids), which scans backward through the full token sequence. Other parsers (Step3, BaseThinking) override this to only check delta_ids (O(1)), but GptOss can't do a simple delta check because its end pattern (<|channel|>final ... <|message|>) spans multiple tokens with a variable gap.

The override windows the search to the last prefix_len + max_gap + suffix_len tokens (~23 for gpt-oss). The reasoning end pattern is always at the tail of the sequence (we just generated those tokens), so looking further back is unnecessary. The eom_token_id early-exit in is_reasoning_end() still works within the window for multi-turn safety.
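The windowed override can be sketched as follows. This is a minimal, self-contained illustration: the class name, the prefix_len/max_gap/suffix_len attributes, and the toy end pattern (token 1 followed by token 2 within max_gap tokens) are assumptions for this sketch, not the actual vLLM code or the gpt-oss token ids.

```python
from collections.abc import Iterable, Sequence


class GptOssReasoningParserSketch:
    # Illustrative constants: the window is prefix_len + max_gap +
    # suffix_len (~23 for gpt-oss).
    prefix_len = 2   # stand-in for the "<|channel|>final" tokens
    max_gap = 20     # max tokens between the channel header and <|message|>
    suffix_len = 1   # stand-in for the "<|message|>" token

    def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
        # Stand-in for the base backward scan, O(len(input_ids)):
        # here the "end pattern" is token 1 followed by token 2
        # within max_gap tokens.
        ids = list(input_ids)
        return any(
            t == 1 and 2 in ids[i + 1 : i + 2 + self.max_gap]
            for i, t in enumerate(ids)
        )

    def is_reasoning_end_streaming(
        self, input_ids: Sequence[int], delta_ids: Iterable[int]
    ) -> bool:
        # delta_ids may be a lazy iterator; materialize before len().
        delta = tuple(delta_ids)
        # The end pattern can only sit at the tail of the sequence, so a
        # window of prefix_len + max_gap + suffix_len (+ len(delta) for
        # speculative decoding) tokens suffices: O(1) per step, not O(n).
        window = self.prefix_len + self.max_gap + self.suffix_len + len(delta)
        return self.is_reasoning_end(list(input_ids)[-window:])
```

The key design point is that the window grows with len(delta): without it, a speculative-decoding step that accepts more tokens than the fixed window could push the pattern out of view.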

Test plan

  • All existing TEST_CASES run through is_reasoning_end_streaming() to confirm parity with is_reasoning_end()
  • Same cases with 10k dummy tokens prepended to verify windowing correctness
  • Signature smoke test with empty inputs
  • Run: pytest tests/reasoning/test_gptoss_reasoning_parser.py -v
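The "10k dummy tokens prepended" check can be sketched in isolation with a toy pattern (the real tests run TEST_CASES through GptOssReasoningParser; full_scan/windowed_scan and the adjacent-pair pattern here are illustrative assumptions):

```python
SPAN = 23  # prefix_len + max_gap + suffix_len for gpt-oss (assumed)


def full_scan(ids):
    # Toy stand-in for is_reasoning_end(): adjacent pair (1, 2) anywhere.
    return any(a == 1 and b == 2 for a, b in zip(ids, ids[1:]))


def windowed_scan(ids, delta):
    # Streaming variant: same check over the last SPAN + len(delta) tokens.
    return full_scan(ids[-(SPAN + len(delta)):])


for tail, expected in [([9, 1, 2], True), ([9, 3, 4], False)]:
    padded = [0] * 10_000 + tail
    # Windowing must agree with the full scan when the pattern is at the tail.
    assert windowed_scan(padded, tail[-1:]) == full_scan(padded) == expected
```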


chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99ff69837d


Comment thread vllm/reasoning/gptoss_reasoning_parser.py Outdated
fergusfinn force-pushed the perf/gptoss-streaming-reasoning-end branch from 76ac447 to f374ce9 on March 2, 2026 at 13:08
fergusfinn force-pushed the perf/gptoss-streaming-reasoning-end branch from f374ce9 to 01b79fe on April 10, 2026 at 06:35
Override is_reasoning_end_streaming() in GptOssReasoningParser to window
the backward scan to the last ~23 + len(delta_ids) tokens instead of
scanning the entire sequence. This reduces per-step cost from O(n) to
O(1), eliminating the O(n²) total cost over a generation.

Including len(delta_ids) in the window ensures correctness under
speculative decoding where a single step can accept many tokens.

Signed-off-by: Fergus <fergus.barratt00@gmail.com>
fergusfinn force-pushed the perf/gptoss-streaming-reasoning-end branch 2 times, most recently from 06cacc8 to 3ae6795 on April 14, 2026 at 06:57
The base class broadened delta_ids from Sequence to Iterable in vllm-project#33593,
and the call site now passes itertools.islice. Materialize to tuple
before calling len().

Signed-off-by: fergus barratt <fergus.barratt00@gmail.com>
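The issue this commit fixes is easy to reproduce in isolation: itertools.islice returns a lazy iterator that has no __len__, so len() must come after materializing it (a minimal illustration, not the parser code itself):

```python
from itertools import islice

# islice is lazy: it has no __len__, so len() on it raises TypeError.
delta_iter = islice(range(100), 5)
try:
    len(delta_iter)
except TypeError:
    pass  # expected: object of type 'itertools.islice' has no len()

# Materializing to a tuple makes the length (and repeated iteration)
# available, which is what the override does before computing the window.
delta = tuple(islice(range(100), 5))
assert len(delta) == 5
```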
fergusfinn force-pushed the perf/gptoss-streaming-reasoning-end branch from 3ae6795 to 293e353 on April 15, 2026 at 07:00
fergusfinn and others added 21 commits April 15, 2026 08:01
gmagogsfm and others added 30 commits April 21, 2026 00:23