
Improved Tokenize & Concatenate #1273

Merged

jlarson4 merged 2 commits into dev from bug/tokenize-and-concatenate-eos-boundary on Apr 29, 2026

Conversation

@jlarson4
Collaborator

Description

PR #1201 made our chunking system whitespace-bound, which narrowed the set of cases in which #1133 could appear but did not completely eliminate the issue. This additional change further closes that gap in coverage.

Fixes #1133

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@jlarson4
Collaborator Author

After further review of @BorisTheBrave's comments on #1133, and of the historical design choices behind tokenize_and_concatenate, I have decided to implement per-doc tokenizing.

This is optimal for fast tokenizers and slightly slower (but still correct) for slow ones. If a slow-tokenizer user reports a real perf regression, we'll revisit. When tokenize_and_concatenate was originally written in 2022, fast tokenizers were in the minority, but that balance has shifted in the intervening years.

This change also has a small effect on SentencePiece tokenizers (T5, Mistral). Previously, a BOS token was emitted at every chunk boundary for LLaMA models, and an extra EOS token was emitted at every chunk boundary for T5 models. Any variance perceived by users of those models is due to the new per-doc approach eliminating those unnecessary tokens. In benchmark testing, this actually resulted in a 20-30% performance improvement for SentencePiece tokenizers, due to the removal of the spurious tokens.
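
For illustration, here is a minimal sketch of what per-doc tokenization looks like; this is not the actual tokenize_and_concatenate implementation, and the function name, `texts`, `seq_len`, and the GPT-2 tokenizer are placeholders. Each document is tokenized independently with no special tokens, a single EOS is appended as the document separator, and only then are the token streams concatenated and reshaped into fixed-length rows, so chunk boundaries can no longer introduce spurious BOS/EOS tokens.

```python
import numpy as np
from transformers import AutoTokenizer

def per_doc_tokenize_and_concatenate(texts, tokenizer, seq_len=1024):
    """Sketch of the per-doc approach; names and defaults are illustrative."""
    # Tokenize each document on its own, with no special tokens added,
    # so BOS/EOS can never appear at arbitrary chunk boundaries.
    token_lists = tokenizer(texts, add_special_tokens=False)["input_ids"]

    # Append exactly one EOS per document as the separator, then flatten.
    eos = tokenizer.eos_token_id
    all_tokens = np.concatenate([np.array(ids + [eos]) for ids in token_lists])

    # Drop the ragged tail and reshape into (num_sequences, seq_len) rows.
    num_sequences = len(all_tokens) // seq_len
    return all_tokens[: num_sequences * seq_len].reshape(num_sequences, seq_len)

# Illustrative usage: with a fast tokenizer the single batched call above
# handles the per-doc work efficiently.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
sequences = per_doc_tokenize_and_concatenate(
    ["First document.", "Second document."], tokenizer, seq_len=4
)
```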

jlarson4 merged commit ad8e123 into dev on Apr 29, 2026
43 of 44 checks passed