
Improved Tokenize & Concatenate #1273

Merged

jlarson4 merged 2 commits into dev from bug/tokenize-and-concatenate-eos-boundary on Apr 29, 2026

Conversation

@jlarson4
Collaborator

Description

PR #1201 made our chunking system whitespace-bound, which narrowed the set of cases in which #1133 could appear but did not completely eliminate the issue. This additional change further closes that gap in coverage.

Fixes #1133

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@jlarson4
Collaborator Author

After further review of @BorisTheBrave's comments on #1133, and of the historical design choices behind tokenize_and_concatenate, I have decided to implement per-doc tokenizing.

This is optimal for fast tokenizers and slightly slower (but still correct) for slow ones. If a slow-tokenizer user reports a real perf regression, we'll revisit. When tokenize_and_concatenate was originally written in 2022, fast tokenizers were in the minority, but that balance has shifted in the intervening years.

This change also has a small effect on SentencePiece tokenizers (T5, Mistral). Previously, a BOS token was emitted at every chunk boundary for LLaMA models, and an extra EOS token was emitted at every chunk boundary for T5 models. Any variance perceived by users of those models is due to the new per-doc approach eliminating those unnecessary tokens. In benchmark testing, this actually resulted in a 20-30% performance improvement for SentencePiece tokenizers, due to the removal of the spurious tokens.
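
For illustration, here is a minimal sketch of what per-doc tokenization looks like; this is not the actual tokenize_and_concatenate implementation, and the function name, `texts`, `seq_len`, and the GPT-2 tokenizer are placeholders. Each document is tokenized independently with no special tokens, a single EOS is appended as the document separator, and only then are the token streams concatenated and reshaped into fixed-length rows, so chunk boundaries can no longer introduce spurious BOS/EOS tokens.

```python
import numpy as np
from transformers import AutoTokenizer

def per_doc_tokenize_and_concatenate(texts, tokenizer, seq_len=1024):
    """Sketch of the per-doc approach; names and defaults are illustrative."""
    # Tokenize each document on its own, with no special tokens added,
    # so BOS/EOS can never appear at arbitrary chunk boundaries.
    token_lists = tokenizer(texts, add_special_tokens=False)["input_ids"]

    # Append exactly one EOS per document as the separator, then flatten.
    eos = tokenizer.eos_token_id
    all_tokens = np.concatenate([np.array(ids + [eos]) for ids in token_lists])

    # Drop the ragged tail and reshape into (num_sequences, seq_len) rows.
    num_sequences = len(all_tokens) // seq_len
    return all_tokens[: num_sequences * seq_len].reshape(num_sequences, seq_len)

# Illustrative usage: with a fast tokenizer the single batched call above
# handles the per-doc work efficiently.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
sequences = per_doc_tokenize_and_concatenate(
    ["First document.", "Second document."], tokenizer, seq_len=4
)
```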

jlarson4 merged commit ad8e123 into dev on Apr 29, 2026
43 of 44 checks passed