Fix: Set clean_up_tokenization_spaces #42900
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42900&sha=aeba93
I tried fixing this on main by changing the default clean_up_tokenization_spaces to True, but tokenization tests (e.g. Wav2Vec2, CLVP) expect the default to remain False in v5 and fail accordingly. From the issue description, the real problem seems to be in the new v5 tokenization backends (TokenizersBackend / PythonBackend), where decode(..., clean_up_tokenization_spaces=True) doesn’t call a default clean_up_tokenization implementation. Could you confirm:
I’d be happy to move my work to the v5 branch and implement this properly once I know which files/classes to target.
hey yes this is intended for v5! if models require
What does this PR do?
This PR fixes a regression where the `clean_up_tokenization_spaces` default was changed from `True` to `False` in v5.0.0rc1, breaking backward compatibility with v4.x.

Problem:
In transformers v5.0.0rc1, tokenizers no longer clean up spaces before punctuation by default:
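For context, the cleanup being restored is a simple string post-processing pass over the decoded text. A minimal sketch of it, mirroring the replacements in transformers' `clean_up_tokenization` helper (the exact v5 backend wiring may differ):

```python
def clean_up_tokenization(out_string: str) -> str:
    """Remove spaces before punctuation and contract common English clitics."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "' ")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("Hello , world !"))  # -> "Hello, world!"
```

With the default flipped to `False`, none of these replacements run, so decoded strings keep the detokenization spaces (`"Hello , world !"`).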
Solution:
Changed the default value in `tokenization_utils_base.py` (line 1420):

Safety:
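To illustrate the effect of the default flip, here is a hypothetical, heavily simplified `decode` (the real method in `tokenization_utils_base.py` has many more parameters and steps; names and structure here are illustrative only):

```python
def decode(tokens, clean_up_tokenization_spaces=True):
    # Hypothetical sketch: join tokens, then optionally strip the spaces
    # that detokenization leaves before punctuation. In v4 the keyword
    # defaulted to True; v5.0.0rc1 changed it to False.
    text = " ".join(tokens)
    if clean_up_tokenization_spaces:
        text = text.replace(" ,", ",").replace(" .", ".")
    return text

print(decode(["Hello", ",", "world", "."]))  # v4-style default -> "Hello, world."
print(decode(["Hello", ",", "world", "."], clean_up_tokenization_spaces=False))
# v5.0.0rc1 default -> "Hello , world ."
```

Callers who passed the argument explicitly are unaffected; only the implicit default changes.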
This change only affects post-processing of decoded text, NOT:
Fixes #42898
Before submitting
Who can review?
@ArthurZucker @itazap