
🔴🔴🔴 fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast#44915

Merged
itazap merged 10 commits into huggingface:main from maxsloef-goodfire:fix/skip-cleanup-for-bpe
Apr 27, 2026

Conversation

@maxsloef-goodfire (Contributor) commented Mar 21, 2026

What does this PR do?

clean_up_tokenization applies English-specific string replacements (` .` → `.`, ` ?` → `?`, ` ,` → `,`, etc.) to decoded text. It was designed for BERT-era WordPiece tokenizers, whose decoding produced artifacts like "Hello , world .".

For BPE tokenizers (Llama 3, GPT-2, etc.), spaces are encoded as part of tokens and decoding does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from correctly decoded text. For example, "x != y" becomes "x!= y".

This PR adds a guard in PreTrainedTokenizerFast._decode() that unconditionally skips the cleanup when the backend model is BPE, and emits a logger.warning_once when clean_up_tokenization_spaces=True is set in the tokenizer config. Users who need the string replacements for other purposes can call tokenizer.clean_up_tokenization() directly.
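The guard's decision logic can be sketched as a small pure function (`should_clean_up` is a hypothetical helper name for illustration; the actual PR inlines the check in `_decode()` against `type(self.backend_tokenizer.model).__name__`):

```python
def should_clean_up(backend_model_name: str, clean_up_tokenization_spaces: bool) -> bool:
    """Sketch of the PR's guard: cleanup is never applied when the fast
    tokenizer's backend model is BPE, regardless of what the config requests."""
    if backend_model_name == "BPE":
        return False  # BPE decoding already round-trips; cleanup would corrupt it
    return clean_up_tokenization_spaces

# BPE backends skip cleanup even when the config requests it:
print(should_clean_up("BPE", True))        # False
# Non-BPE backends keep the configured behavior:
print(should_clean_up("WordPiece", True))  # True
```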

Why this matters

All 24 Llama 3.x models on the Hub have clean_up_tokenization_spaces=true baked into their tokenizer_config.json (inherited from a library default when Llama 3 switched tokenizer classes — see #35175, #31187, #32575). Fixing the config on every model repo (and every downstream fine-tune) is a game of whack-a-mole. This library-level guard ensures the cleanup is never applied to tokenizers where it's incorrect, even if the config says otherwise.

Minimal reproduction (before fix)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# 'x!= y and a.b == c'  ← space before != silently dropped
```
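For reference, the cleanup step itself is just a fixed list of string replacements. The sketch below mirrors the replacement list in the library's `clean_up_tokenization` (reproduced from memory, so treat the exact list as an approximation); applying it to already-correct BPE output reproduces the corruption:

```python
def clean_up_tokenization(out_string: str) -> str:
    # WordPiece-era replacements: collapse the space that decoding
    # inserts before punctuation and contractions.
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("Hello , world ."))  # intended use: 'Hello, world.'
print(clean_up_tokenization("x != y"))           # corruption:   'x!= y'
```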

Changes

  • src/transformers/tokenization_utils_tokenizers.py — in PreTrainedTokenizerFast._decode(), check type(self.backend_tokenizer.model).__name__ and skip clean_up_tokenization() for BPE models, emitting a warning via logger.warning_once.
  • tests/tokenization/test_tokenization_fast.py — added test_bpe_tokenizer_skips_clean_up_tokenization_spaces verifying BPE roundtrip preserves text.
  • tests/utils/test_tokenization_utils.py — updated test_clean_up_tokenization_spaces to use clean roundtrip text (GPT-2 is BPE, so cleanup is now correctly skipped).

Fixes #35175
Fixes #31187


Who can review?

@ArthurZucker @itazap — tokenizer maintainers. Arthur previously acknowledged this should be False in #35175.

@ArthurZucker (Collaborator) left a comment:
Hey! Ty for the PR!
This was actually quite a rollercoaster. (#42898)
We decided to deprecate the flag in #31938. Then we had to introduce it back in #43426.

At this point, and again because we can't break models already uploaded to the Hub, we have 2 choices.

  1. 🔴 Take this PR as a breaking change to enforce that decoding is 1-to-1. In its current state this would literally affect ALL BPE models and does not allow opting out. I think this would break gpt2, which is also BPE and has had this behavior since a long, long time ago.
  2. We just document this better, pin this issue, etc., but I don't think there's much to do here.

This has been around for a while; the main issue for me is that for Llama, the original repo indeed does not clean it up.

I don't mind trying to fix this for Llama 3 specifically! But in this state it's absolutely breaking.

Comment on lines +1021 to +1028
```python
if type(self.backend_tokenizer.model).__name__ == "BPE":
    logger.warning_once(
        "Ignoring clean_up_tokenization_spaces=True for BPE tokenizer"
        f" {self.__class__.__name__}. The clean_up_tokenization post-processing"
        " step is designed for WordPiece tokenizers and is destructive for BPE"
        " (it strips spaces before punctuation). Set"
        " clean_up_tokenization_spaces=False to suppress this warning."
    )
```
Collaborator:

I don't think this is something we can do... it's too breaking for anyone who relies on this behavior.
It's a very big breaking change.

@maxsloef-goodfire force-pushed the fix/skip-cleanup-for-bpe branch from d11ef36 to ddca57e on March 23, 2026 at 17:12
@maxsloef-goodfire (Contributor, Author) commented:

Hey @ArthurZucker, thanks for the review and the context on the history here. Two arguments for why this should go in:

Correctness: cleanup is definitionally wrong for BPE

BPE tokenizers encode whitespace as part of their tokens. That's the whole point — the Ġ prefix in GPT-2, the byte-level encoding in Llama. Decode produces a perfect roundtrip without any post-processing. There are no WordPiece-style artifacts to clean up, so clean_up_tokenization can only destroy information, never add correctness. This isn't a Llama-specific quirk, it's a property of how BPE works.

Compare with BERT, where WordPiece splits "it's" into ["it", "'", "s"] and decodes to "it ' s" — cleanup is genuinely needed there. BPE never produces those artifacts.
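The contrast can be shown without loading any real tokenizer (the token splits below are illustrative, not the exact BERT/GPT-2 merges):

```python
# WordPiece decoding joins subword tokens with spaces, so punctuation
# picks up spurious surrounding spaces that cleanup must then remove:
wordpiece_tokens = ["it", "'", "s"]
wordpiece_decoded = " ".join(wordpiece_tokens)
print(repr(wordpiece_decoded))  # "it ' s": artifact, cleanup needed

# ByteLevel BPE tokens carry their own leading space (GPT-2 renders
# U+0020 as the visible marker Ġ), so concatenation round-trips exactly:
bpe_tokens = ["x", "Ġ!=", "Ġy"]
bpe_decoded = "".join(t.replace("Ġ", " ") for t in bpe_tokens)
print(repr(bpe_decoded))  # 'x != y': no artifact, nothing to clean
```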

This is also why a Llama-3-specific fix isn't quite right — the problem isn't that Llama 3 has a bad config, it's that cleanup is fundamentally incompatible with BPE as a tokenizer class. A model-specific fix would need to be repeated for every new BPE model that ships with clean_up_tokenization_spaces: true, and there's nothing stopping that from happening again.

Impact: the asymmetry is stark

Llama 3 is one of the most popular model families on the Hub, and every Llama 3.x model — plus the massive ecosystem of fine-tunes and derivatives — ships with clean_up_tokenization_spaces: true in its config. Every user who doesn't know to manually override this gets silently corrupted output: "x != y" → "x!= y", "! ! !" → "!!!".

On the other side, the concern is breaking someone who feeds pre-tokenized text with artificial spacing through a BPE encode→decode→cleanup pipeline as a convenience post-processor. That's a rare and unusual pattern — using a feature designed for WordPiece artifacts as a general-purpose space collapser on a tokenizer that doesn't produce those artifacts.

Escape hatch added

We've added clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output as an override flag for anyone who truly needs the old behavior.

@maxsloef-goodfire (Contributor, Author) commented:

oops, will fix tests

@maxsloef-goodfire (Contributor, Author) commented:

Ah, turns out the test failures were unrelated to my changes. @ArthurZucker, ready for re-review!

@ArthurZucker (Collaborator) left a comment:

okay, I am fine with this, put a BIG red dot on the PR name please and want to get @itazap's approval as well!



itazap commented Apr 22, 2026

run-slow: llama, auto

@github-actions

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/auto", "models/llama"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
| ------- | -------- | ------------------------------ |
| RUN     | 3805a348 | workflow commit (merge commit) |
| PR      | 91c9946f | branch commit (from PR)        |
| main    | b00b7c08 | base commit (on main)          |

✅ No failing test specific to this PR 🎉 👏 !

@maxsloef-goodfire changed the title from "fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast" to "🔴🔴🔴 fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast" on Apr 22, 2026
@maxsloef-goodfire (Contributor, Author) commented:

> okay, I am fine with this, put a BIG red dot on the PR name please and want to get @itazap's approval as well!

great :) big red dots! 🔴🔴🔴


itazap commented Apr 23, 2026

Hey! Can we please add a llama test since this broke llama tokenization? to avoid your change being reverted in the future, thank you!

@maxsloef-goodfire (Contributor, Author) commented:

> Hey! Can we please add a llama test since this broke llama tokenization? to avoid your change being reverted in the future, thank you!

sure - added!

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama

@itazap added this pull request to the merge queue Apr 27, 2026

itazap commented Apr 27, 2026

Thanks for adding this! 🤗 Agreed, it makes sense to protect BPE from old config params.

Merged via the queue into huggingface:main with commit bbb51c8 Apr 27, 2026
28 checks passed
ArthurZucker pushed a commit that referenced this pull request Apr 28, 2026
…edTokenizerFast` (#44915)

* fix: skip clean_up_tokenization for BPE tokenizers

clean_up_tokenization applies BERT-era string replacements (` .` → `.`,
` !` → `!`, etc.) that are destructive for BPE tokenizers where spaces
are encoded as part of tokens. This adds a guard that skips the cleanup
when the backend model is BPE and emits a warning_once suggesting the
user set clean_up_tokenization_spaces=False.

Fixes #35175

* test: add test for BPE tokenizer skipping clean_up_tokenization

Verifies that BPE tokenizers preserve spaces before punctuation
even when clean_up_tokenization_spaces=True.

* fix: update tests to expect BPE cleanup skip

clean_up_tokenization is always skipped for BPE tokenizers, even when
explicitly requested, because the cleanup is fundamentally wrong for
BPE (it strips legitimate spaces that are part of the token encoding).
Users who need those string replacements can call
clean_up_tokenization() directly.

Updated test_tokenization_utils.py to expect preserved spacing for
GPT-2 (BPE). Added test in test_tokenization_fast.py verifying the
guard works with an explicit True parameter.

* fix: move BPE test to correct class, use clean roundtrip text

Move test_bpe_tokenizer_skips_clean_up_tokenization_spaces to
PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name).
Update test_clean_up_tokenization_spaces to use normal text without
artificial WordPiece artifacts — BPE roundtrip preserves originals.

* fix: add leading space to test string for ByteLevel BPE prefix

ByteLevel BPE tokenizers prepend a space during encoding. Use
" Hello world." so the roundtrip matches exactly.

* feat: add escape hatch for BPE cleanup override

Add clean_up_tokenization_spaces_even_though_its_wrong_for_bpe flag
so users who rely on the old behavior can opt back in. The warning
message now mentions this flag. Added test for the override path.

* test: add llama 3 regression test for BPE clean_up_tokenization_spaces skip

Loads the real Meta-Llama-3-8B tokenizer so the test exercises the actual
shipped config (`clean_up_tokenization_spaces=True` + BPE backend) — locks
in the fix against future reverts. Marked @slow so it only runs under
RUN_SLOW=1, consistent with other gated-repo tokenizer tests in this class.


Development

Successfully merging this pull request may close these issues.

  • Detokenization discrepancy with Llama3.1
  • Original Llama-3 tokenizer behaves differently from transformers version
