Fix: Apply clean_up_tokenization_spaces in TokenizersBackend._decode #42916
Aznix07 wants to merge 1 commit into huggingface:main from
Conversation
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42916&sha=99c932
ArthurZucker left a comment
Hey! Is there a motivation for this? We removed it because it's unintuitive, and the user can do it themselves outside the tokenizer. Do you have a specific use case in mind?
Hi @ArthurZucker! Thanks for the response! I realize I should have clarified this before submitting the PR; apologies for that. What I observed:
My question: If yes, I can close this PR. If it was unintended, I'm happy to adjust the implementation based on your guidance. Thanks!
@ArthurZucker the issue is that in v4
Removing the behavior and keeping the I created the issue about it earlier: #42898
Yes, it is intentional that we move away from using
so the extra space in the original example is expected!
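Since the maintainer's position is that this cleanup now belongs on the user's side, here is a minimal sketch of how to apply it manually after `decode`. The replacement list is modeled on the string substitutions the historical `clean_up_tokenization` helper in `transformers` performed; treat the exact list as an illustrative assumption, not the v5 source.

```python
def clean_up_tokenization(out_string: str) -> str:
    """Remove the space artifacts that word-level tokenizers leave
    before punctuation and English contractions."""
    # Replacement pairs modeled on the historical transformers helper;
    # an illustrative assumption, not the exact library code.
    for before, after in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(before, after)
    return out_string

# Applied by the user after tokenizer.decode(...):
print(clean_up_tokenization("I 'm happy , it 's great !"))  # → I'm happy, it's great!
```

Running this as a post-processing step reproduces the pre-v5 default output without the tokenizer needing to know about it.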
What does this PR do?
This PR fixes a regression in v5.0.0rc1 where the `_decode` method in `TokenizersBackend` was not respecting the `clean_up_tokenization_spaces` parameter, causing unwanted spaces to appear before punctuation in decoded output.

Reproduction:
Behavior:
Solution
Added the missing `clean_up_tokenization_spaces` logic to the `TokenizersBackend._decode()` method. When enabled (default behavior), it removes extra spaces before punctuation using regex pattern matching.

Fixes #42913
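The regex-based cleanup the solution describes might look roughly like the following. The PR diff is not shown in this thread, so the pattern and punctuation set below are assumptions for illustration, not the actual implementation.

```python
import re

# Hypothetical sketch of the regex cleanup: collapse whitespace that
# appears immediately before common punctuation. The punctuation set
# and pattern are assumptions, not the PR's actual code.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,!?;:])")

def clean_spaces(text: str) -> str:
    # \1 keeps the punctuation mark, dropping only the space(s) before it.
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)

print(clean_spaces("Hello , world !"))  # → Hello, world!
```

A single compiled pattern like this handles all punctuation marks in one pass, whereas a chain of `str.replace` calls needs one pass per substitution.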
Who can review?
@ArthurZucker @itazap