Fix: Apply clean_up_tokenization_spaces in TokenizersBackend._decode #42916
Aznix07 wants to merge 1 commit into huggingface:main from
Conversation
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42916&sha=99c932
ArthurZucker left a comment
Hey! Is there a motivation for this? We removed it because it's unintuitive, and the user can do it themselves outside the tokenizer. Do you have a specific use case in mind?
Hi @ArthurZucker! Thanks for the response! I realize I should have clarified this before submitting the PR; apologies for that. What I observed:
My question: If yes, I can close this PR. If it was unintended, I'm happy to adjust the implementation based on your guidance. Thanks!
@ArthurZucker the issue is that in v4
Removing the behavior and keeping the I created the issue about it earlier: #42898
Yes, it is intentional that we move away from using
so the extra space in the original example is expected!
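Since the maintainer's position is that this cleanup now belongs on the user's side, here is a minimal sketch of how to apply it manually after `decode`. The replacement list is modeled on the string substitutions the historical `clean_up_tokenization` helper in `transformers` performed; treat the exact list as an illustrative assumption, not the v5 source.

```python
def clean_up_tokenization(out_string: str) -> str:
    """Remove the space artifacts that word-level tokenizers leave
    before punctuation and English contractions."""
    # Replacement pairs modeled on the historical transformers helper;
    # an illustrative assumption, not the exact library code.
    for before, after in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(before, after)
    return out_string

# Applied by the user after tokenizer.decode(...):
print(clean_up_tokenization("I 'm happy , it 's great !"))  # → I'm happy, it's great!
```

Running this as a post-processing step reproduces the pre-v5 default output without the tokenizer needing to know about it.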
What does this PR do?
This PR fixes a regression in v5.0.0rc1 where the `_decode` method in `TokenizersBackend` was not respecting the `clean_up_tokenization_spaces` parameter, causing unwanted spaces to appear before punctuation in decoded output.

Reproduction:
Behavior:
Solution
Added the missing `clean_up_tokenization_spaces` logic to the `TokenizersBackend._decode()` method. When enabled (default behavior), it removes extra spaces before punctuation using regex pattern matching.

Fixes #42913
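The regex-based cleanup the solution describes might look roughly like the following. The PR diff is not shown in this thread, so the pattern and punctuation set below are assumptions for illustration, not the actual implementation.

```python
import re

# Hypothetical sketch of the regex cleanup: collapse whitespace that
# appears immediately before common punctuation. The punctuation set
# and pattern are assumptions, not the PR's actual code.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,!?;:])")

def clean_spaces(text: str) -> str:
    # \1 keeps the punctuation mark, dropping only the space(s) before it.
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)

print(clean_spaces("Hello , world !"))  # → Hello, world!
```

A single compiled pattern like this handles all punctuation marks in one pass, whereas a chain of `str.replace` calls needs one pass per substitution.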
Who can review?
@ArthurZucker @itazap