Skip to content

fix: handle trailing replacement character in Whisper word timestamp decoding#45226

Closed
akhilc08 wants to merge 1 commit intohuggingface:mainfrom
akhilc08:fix/whisper-word-timestamp-trailing-replacement-char
Closed

fix: handle trailing replacement character in Whisper word timestamp decoding#45226
akhilc08 wants to merge 1 commit intohuggingface:mainfrom
akhilc08:fix/whisper-word-timestamp-trailing-replacement-char

Conversation

@akhilc08
Copy link
Copy Markdown

@akhilc08 akhilc08 commented Apr 3, 2026

Summary

  • Fixes an IndexError: string index out of range crash in _split_tokens_on_unicode() when the decoded token stream ends with a dangling Unicode replacement character (U+FFFD)
  • Adds a bounds check so that when unicode_offset + decoded.index(replacement_char) >= len(decoded_full), the out-of-bounds access is avoided
  • The trailing replacement character token is still collected and flushed correctly

Closes #44869

Test plan

  • Verify that Whisper word-level timestamp decoding no longer crashes when the final token(s) decode to U+FFFD
  • Verify that normal (non-trailing-replacement-char) inputs produce identical results

🤖 Generated with Claude Code

…decoding

Add bounds check in `_split_tokens_on_unicode()` to prevent IndexError
when the decoded token stream ends with a dangling Unicode replacement
character (U+FFFD), causing the computed index to equal `len(decoded_full)`.

Closes #44869

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45226&sha=67de02

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream

1 participant