Fix: Set clean_up_tokenization_spaces #42900
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42900&sha=aeba93
I tried fixing this on main by changing the default clean_up_tokenization_spaces to True, but tokenization tests (e.g. Wav2Vec2, CLVP) expect the default to remain False in v5 and fail accordingly. From the issue description, the real problem seems to be in the new v5 tokenization backends (TokenizersBackend / PythonBackend), where decode(..., clean_up_tokenization_spaces=True) doesn’t call a default clean_up_tokenization implementation. Could you confirm:
I’d be happy to move my work to the v5 branch and implement this properly once I know which files/classes to target.
hey yes this is intended for v5! if models require
What does this PR do?
This PR fixes a regression where the `clean_up_tokenization_spaces` default was changed from `True` to `False` in v5.0.0rc1, breaking backward compatibility with v4.x.

Problem:
In transformers v5.0.0rc1, tokenizers no longer clean up spaces before punctuation by default:
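For context, the cleanup being restored is a simple string post-processing pass over the decoded text. A minimal sketch of it, mirroring the replacements in transformers' `clean_up_tokenization` helper (the exact v5 backend wiring may differ):

```python
def clean_up_tokenization(out_string: str) -> str:
    """Remove spaces before punctuation and contract common English clitics."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "' ")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("Hello , world !"))  # -> "Hello, world!"
```

With the default flipped to `False`, none of these replacements run, so decoded strings keep the detokenization spaces (`"Hello , world !"`).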
Solution:
Changed the default value in `tokenization_utils_base.py` (line 1420):

Safety:
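To illustrate the effect of the default flip, here is a hypothetical, heavily simplified `decode` (the real method in `tokenization_utils_base.py` has many more parameters and steps; names and structure here are illustrative only):

```python
def decode(tokens, clean_up_tokenization_spaces=True):
    # Hypothetical sketch: join tokens, then optionally strip the spaces
    # that detokenization leaves before punctuation. In v4 the keyword
    # defaulted to True; v5.0.0rc1 changed it to False.
    text = " ".join(tokens)
    if clean_up_tokenization_spaces:
        text = text.replace(" ,", ",").replace(" .", ".")
    return text

print(decode(["Hello", ",", "world", "."]))  # v4-style default -> "Hello, world."
print(decode(["Hello", ",", "world", "."], clean_up_tokenization_spaces=False))
# v5.0.0rc1 default -> "Hello , world ."
```

Callers who passed the argument explicitly are unaffected; only the implicit default changes.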
This change only affects post-processing of decoded text, NOT:
Fixes #42898
Before submitting
Who can review?
@ArthurZucker @itazap