
🔴🔴🔴 fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast#44915

Merged
itazap merged 10 commits into huggingface:main from maxsloef-goodfire:fix/skip-cleanup-for-bpe
Apr 27, 2026

Conversation

@maxsloef-goodfire (Contributor) commented Mar 21, 2026

What does this PR do?

clean_up_tokenization applies English-specific string replacements (` .` → `.`, ` ?` → `?`, ` ,` → `,`, etc.) to decoded text. It was designed for BERT-era WordPiece tokenizers, whose decoding produced artifacts like "Hello , world .".

For BPE tokenizers (Llama 3, GPT-2, etc.), spaces are encoded as part of tokens and decoding does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from correctly decoded text. For example, "x != y" becomes "x!= y".

This PR adds a guard in PreTrainedTokenizerFast._decode() that unconditionally skips the cleanup when the backend model is BPE, and emits a logger.warning_once when clean_up_tokenization_spaces=True is set in the tokenizer config. Users who need the string replacements for other purposes can call tokenizer.clean_up_tokenization() directly.
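The guard's decision logic can be sketched as a small pure function (`should_clean_up` is a hypothetical helper name for illustration; the actual PR inlines the check in `_decode()` against `type(self.backend_tokenizer.model).__name__`):

```python
def should_clean_up(backend_model_name: str, clean_up_tokenization_spaces: bool) -> bool:
    """Sketch of the PR's guard: cleanup is never applied when the fast
    tokenizer's backend model is BPE, regardless of what the config requests."""
    if backend_model_name == "BPE":
        return False  # BPE decoding already round-trips; cleanup would corrupt it
    return clean_up_tokenization_spaces

# BPE backends skip cleanup even when the config requests it:
print(should_clean_up("BPE", True))        # False
# Non-BPE backends keep the configured behavior:
print(should_clean_up("WordPiece", True))  # True
```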

Why this matters

All 24 Llama 3.x models on the Hub have clean_up_tokenization_spaces=true baked into their tokenizer_config.json (inherited from a library default when Llama 3 switched tokenizer classes — see #35175, #31187, #32575). Fixing the config on every model repo (and every downstream fine-tune) is a game of whack-a-mole. This library-level guard ensures the cleanup is never applied to tokenizers where it's incorrect, even if the config says otherwise.

Minimal reproduction (before fix)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# 'x!= y and a.b == c'  ← space before != silently dropped
```
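For reference, the cleanup step itself is just a fixed list of string replacements. The sketch below mirrors the replacement list in the library's `clean_up_tokenization` (reproduced from memory, so treat the exact list as an approximation); applying it to already-correct BPE output reproduces the corruption:

```python
def clean_up_tokenization(out_string: str) -> str:
    # WordPiece-era replacements: collapse the space that decoding
    # inserts before punctuation and contractions.
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("Hello , world ."))  # intended use: 'Hello, world.'
print(clean_up_tokenization("x != y"))           # corruption:   'x!= y'
```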

Changes

  • src/transformers/tokenization_utils_tokenizers.py — in PreTrainedTokenizerFast._decode(), check type(self.backend_tokenizer.model).__name__ and skip clean_up_tokenization() for BPE models, emitting a warning via logger.warning_once.
  • tests/tokenization/test_tokenization_fast.py — added test_bpe_tokenizer_skips_clean_up_tokenization_spaces verifying BPE roundtrip preserves text.
  • tests/utils/test_tokenization_utils.py — updated test_clean_up_tokenization_spaces to use clean roundtrip text (GPT-2 is BPE, so cleanup is now correctly skipped).

Fixes #35175
Fixes #31187


Who can review?

@ArthurZucker @itazap — tokenizer maintainers. Arthur previously acknowledged this should be False in #35175.

@ArthurZucker (Collaborator) left a comment:
Hey! Ty for the PR!
This was actually quite a rollercoaster. (#42898)
We decided to deprecate the flag in #31938. Then we had to introduce it back in #43426.

At this point, and again because we can't break models already uploaded to the Hub, we have 2 choices.

  1. 🔴 Take this PR as a breaking change to enforce that decoding is 1-to-1. In its current state this would literally affect ALL BPE models and does not allow opting out. I think this would break gpt2, which is also BPE and has had this behavior since a long, long time ago.
  2. We just document this better, pin this issue, etc., but I don't think there's much to do here.

This has been around for a while; the main issue for me is that for Llama, the original repo indeed does not clean it up.

I don't mind trying to fix this for Llama 3 specifically! But in this state it's absolutely breaking.

Comment on lines +1021 to +1028
```python
if type(self.backend_tokenizer.model).__name__ == "BPE":
    logger.warning_once(
        "Ignoring clean_up_tokenization_spaces=True for BPE tokenizer"
        f" {self.__class__.__name__}. The clean_up_tokenization post-processing"
        " step is designed for WordPiece tokenizers and is destructive for BPE"
        " (it strips spaces before punctuation). Set"
        " clean_up_tokenization_spaces=False to suppress this warning."
    )
```
Collaborator:

I don't think this is something we can do... it's too breaking for anyone who relies on this behavior.
It's a very big breaking change.

@maxsloef-goodfire force-pushed the fix/skip-cleanup-for-bpe branch from d11ef36 to ddca57e on March 23, 2026 at 17:12
@maxsloef-goodfire (Contributor, Author) commented:

Hey @ArthurZucker, thanks for the review and the context on the history here. Two arguments for why this should go in:

Correctness: cleanup is definitionally wrong for BPE

BPE tokenizers encode whitespace as part of their tokens. That's the whole point — the Ġ prefix in GPT-2, the byte-level encoding in Llama. Decode produces a perfect roundtrip without any post-processing. There are no WordPiece-style artifacts to clean up, so clean_up_tokenization can only destroy information, never add correctness. This isn't a Llama-specific quirk, it's a property of how BPE works.

Compare with BERT, where WordPiece splits "it's" into ["it", "'", "s"] and decodes to "it ' s" — cleanup is genuinely needed there. BPE never produces those artifacts.
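The contrast can be shown without loading any real tokenizer (the token splits below are illustrative, not the exact BERT/GPT-2 merges):

```python
# WordPiece decoding joins subword tokens with spaces, so punctuation
# picks up spurious surrounding spaces that cleanup must then remove:
wordpiece_tokens = ["it", "'", "s"]
wordpiece_decoded = " ".join(wordpiece_tokens)
print(repr(wordpiece_decoded))  # "it ' s": artifact, cleanup needed

# ByteLevel BPE tokens carry their own leading space (GPT-2 renders
# U+0020 as the visible marker Ġ), so concatenation round-trips exactly:
bpe_tokens = ["x", "Ġ!=", "Ġy"]
bpe_decoded = "".join(t.replace("Ġ", " ") for t in bpe_tokens)
print(repr(bpe_decoded))  # 'x != y': no artifact, nothing to clean
```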

This is also why a Llama-3-specific fix isn't quite right — the problem isn't that Llama 3 has a bad config, it's that cleanup is fundamentally incompatible with BPE as a tokenizer class. A model-specific fix would need to be repeated for every new BPE model that ships with clean_up_tokenization_spaces: true, and there's nothing stopping that from happening again.

Impact: the asymmetry is stark

Llama 3 is one of the most popular model families on the Hub, and every Llama 3.x model — plus the massive ecosystem of fine-tunes and derivatives — ships with clean_up_tokenization_spaces: true in its config. Every user who doesn't know to manually override this gets silently corrupted output: "x != y" → "x!= y", "! ! !" → "!!!".

On the other side, the concern is breaking someone who feeds pre-tokenized text with artificial spacing through a BPE encode→decode→cleanup pipeline as a convenience post-processor. That's a rare and unusual pattern — using a feature designed for WordPiece artifacts as a general-purpose space collapser on a tokenizer that doesn't produce those artifacts.

Escape hatch added

We've added clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output as an override flag for anyone who truly needs the old behavior.

@maxsloef-goodfire (Contributor, Author) commented:

oops, will fix tests

@maxsloef-goodfire (Contributor, Author) commented:

Ah, turns out the test failures were unrelated to my changes. @ArthurZucker, ready for re-review!

@ArthurZucker (Collaborator) left a comment:

okay, I am fine with this, put a BIG red dot on the PR name please and want to get @itazap's approval as well!



itazap commented Apr 22, 2026

run-slow: llama, auto

@github-actions

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/auto", "models/llama"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
| ------- | -------- | ------------------------------ |
| RUN     | 3805a348 | workflow commit (merge commit) |
| PR      | 91c9946f | branch commit (from PR)        |
| main    | b00b7c08 | base commit (on main)          |

✅ No failing test specific to this PR 🎉 👏 !

@maxsloef-goodfire changed the title from "fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast" to "🔴🔴🔴 fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast" on Apr 22, 2026
@maxsloef-goodfire (Contributor, Author) commented:

> okay, I am fine with this, put a BIG red dot on the PR name please and want to get @itazap's approval as well!

great :) big red dots! 🔴🔴🔴


itazap commented Apr 23, 2026

Hey! Can we please add a llama test since this broke llama tokenization? to avoid your change being reverted in the future, thank you!

@maxsloef-goodfire (Contributor, Author) commented:

> Hey! Can we please add a llama test since this broke llama tokenization? to avoid your change being reverted in the future, thank you!

sure - added!

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama

@itazap added this pull request to the merge queue Apr 27, 2026

itazap commented Apr 27, 2026

Thanks for adding this! 🤗 Agreed, it makes sense to protect BPE from old config params.

Merged via the queue into huggingface:main with commit bbb51c8 Apr 27, 2026
28 checks passed
ArthurZucker pushed a commit that referenced this pull request Apr 28, 2026
…edTokenizerFast` (#44915)

* fix: skip clean_up_tokenization for BPE tokenizers

clean_up_tokenization applies BERT-era string replacements (` .` → `.`,
` !` → `!`, etc.) that are destructive for BPE tokenizers where spaces
are encoded as part of tokens. This adds a guard that skips the cleanup
when the backend model is BPE and emits a warning_once suggesting the
user set clean_up_tokenization_spaces=False.

Fixes #35175

* test: add test for BPE tokenizer skipping clean_up_tokenization

Verifies that BPE tokenizers preserve spaces before punctuation
even when clean_up_tokenization_spaces=True.

* fix: update tests to expect BPE cleanup skip

clean_up_tokenization is always skipped for BPE tokenizers, even when
explicitly requested, because the cleanup is fundamentally wrong for
BPE (it strips legitimate spaces that are part of the token encoding).
Users who need those string replacements can call
clean_up_tokenization() directly.

Updated test_tokenization_utils.py to expect preserved spacing for
GPT-2 (BPE). Added test in test_tokenization_fast.py verifying the
guard works with an explicit True parameter.

* fix: move BPE test to correct class, use clean roundtrip text

Move test_bpe_tokenizer_skips_clean_up_tokenization_spaces to
PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name).
Update test_clean_up_tokenization_spaces to use normal text without
artificial WordPiece artifacts — BPE roundtrip preserves originals.

* fix: add leading space to test string for ByteLevel BPE prefix

ByteLevel BPE tokenizers prepend a space during encoding. Use
" Hello world." so the roundtrip matches exactly.

* feat: add escape hatch for BPE cleanup override

Add clean_up_tokenization_spaces_even_though_its_wrong_for_bpe flag
so users who rely on the old behavior can opt back in. The warning
message now mentions this flag. Added test for the override path.

* test: add llama 3 regression test for BPE clean_up_tokenization_spaces skip

Loads the real Meta-Llama-3-8B tokenizer so the test exercises the actual
shipped config (`clean_up_tokenization_spaces=True` + BPE backend) — locks
in the fix against future reverts. Marked @slow so it only runs under
RUN_SLOW=1, consistent with other gated-repo tokenizer tests in this class.


Development

Successfully merging this pull request may close these issues.

  • Detokenization discrepancy with Llama3.1
  • Original Llama-3 tokenizer behaves differently from transformers version
