
clean_up_tokenization_spaces=False if unset #31938

Merged
ArthurZucker merged 10 commits into main from
clean_up_tokenization_spaces_false_default
Sep 26, 2024

Conversation

@itazap
Collaborator

@itazap itazap commented Jul 12, 2024

FUTURE DEPRECATION

fixes #31884

This is the start of deprecating clean_up_tokenization_spaces. Right now it defaults to True; this PR updates it to default to False.

  • BERT-based models need clean_up_tokenization_spaces=True, so set this in the class init
  • some models, like wav2vec2, needed test updates since they don't really expect clean_up_tokenization_spaces=True.
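For context, the cleanup this flag toggles removes the spaces that decoding inserts before punctuation and English contractions. A minimal standalone sketch, with the replacement rules paraphrased from `PreTrainedTokenizerBase.clean_up_tokenization` (the exact rule set in your installed version may differ):

```python
# Hypothetical standalone sketch of what clean_up_tokenization_spaces toggles:
# strip the space that decoding leaves before punctuation and contractions.

def clean_up_tokenization(out_string: str) -> str:
    """Remove spaces before punctuation and common English contractions."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("hello , world ! do n't stop ."))
# hello, world! don't stop.
```

With the flag set to False, decoded text keeps these spaces exactly as the tokens produced them, which is what makes round-tripping (encode then decode) lossless for tokenizers like GPT-2's.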

@ArthurZucker

@itazap itazap requested a review from ArthurZucker July 12, 2024 16:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread src/transformers/tokenization_utils_base.py
@itazap itazap force-pushed the clean_up_tokenization_spaces_false_default branch 4 times, most recently from e689ec3 to ec6f78a Compare July 25, 2024 08:23
Comment thread tests/test_tokenization_common.py Outdated

# TODO This is ran for all models but only tests bert...
def test_clean_up_tokenization_spaces(self):
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
Collaborator Author

@itazap itazap Jul 25, 2024

this only ever tested Bert, so I don't think it's valuable to keep, or to update it to run for each model: the cleanup behaves differently with special tokens and might need to be customized for some models. Since we will deprecate this anyway, I don't think it's useful to start maintaining the test for all models now!

@itazap itazap marked this pull request as ready for review July 26, 2024 10:29
@itazap itazap requested a review from ArthurZucker July 26, 2024 10:30
  def get_input_output_texts(self, tokenizer):
      input_text = "lower newer"
-     output_text = "lower newer"
+     output_text = "lower[SPACE]newer"
Collaborator Author

testing sets a small vocab here, so this should be the expected behaviour. See the unmodified test ClvpTokenizationTest --> test_full_tokenizer for an example where [SPACE] was expected

- self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
- self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
+ self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
+ self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
Collaborator Author

the tokenizer.word_delimiter_token is replaced with a space " ". See:

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts connectionist-temporal-classification (CTC) output tokens into a single string.
        """
      ...
        # replace delimiter token
        string = "".join([" " if token == self.word_delimiter_token else token for token in filtered_tokens]).strip()

        if self.do_lower_case:
            string = string.lower()

        return string

Collaborator

we should not have to do this! Maybe cleanup tokenization should be True for wav2vec2, no?

Collaborator Author

@itazap itazap Sep 26, 2024

I thought so too, but the sample_ids include tokenizer.word_delimiter_token_id, which is a space " ", so I think it would be expected in the output? wdyt @ArthurZucker

Collaborator

@ArthurZucker ArthurZucker left a comment

Looks good, let's make sure we keep the default to True for now!

Comment thread src/transformers/tokenization_utils_base.py
Comment thread src/transformers/tokenization_utils_base.py
@itazap itazap force-pushed the clean_up_tokenization_spaces_false_default branch from 45b6a0e to b7b2b09 Compare September 6, 2024 11:29
@rishi23root

I believe the issue is:

if "clean_up_tokenization_spaces" not in kwargs:
    warnings.warn(
        "`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This "
        "behavior will be depracted in transformers v4.45, and will be then set to `False` by default. "
        "For more details check this issue: https://github.com/huggingface/transformers/issues/31884",
        FutureWarning,
    )

# By default, cleaning tokenization spaces for both fast and slow tokenizers
self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)

by default it is being set, but I don't understand where to update it to remove the warning completely
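The quoted logic only warns when the keyword is absent, so passing the argument explicitly silences it. A standalone sketch of that kwargs handling (the function name `init_tokenizer` is hypothetical; the real check lives in the tokenizer `__init__`):

```python
# Hypothetical reduction of the quoted kwargs handling: the warning fires
# only when clean_up_tokenization_spaces is not passed explicitly.
import warnings

def init_tokenizer(**kwargs):
    if "clean_up_tokenization_spaces" not in kwargs:
        warnings.warn("`clean_up_tokenization_spaces` was not set.", FutureWarning)
    return kwargs.pop("clean_up_tokenization_spaces", True)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default = init_tokenizer()                                     # warns, returns True
    explicit = init_tokenizer(clean_up_tokenization_spaces=False)  # silent, returns False

print(default, explicit, len(caught))  # True False 1
```

In practice, passing `clean_up_tokenization_spaces=False` (or `True`) to `from_pretrained` should therefore suppress the warning for a given tokenizer.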

@itazap
Collaborator Author

itazap commented Sep 14, 2024

@rishi23root I'll update the message!

@itazap itazap requested a review from ArthurZucker September 20, 2024 09:21
Comment on lines +1611 to +1614
warnings.warn(
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
FutureWarning,
)
Collaborator
let's not warn, we won't remove it!

Suggested change
warnings.warn(
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
FutureWarning,
)

@ArthurZucker ArthurZucker merged commit 6730485 into main Sep 26, 2024
@ArthurZucker ArthurZucker deleted the clean_up_tokenization_spaces_false_default branch September 26, 2024 17:38
ArthurZucker pushed a commit that referenced this pull request Sep 26, 2024
* clean_up_tokenization_spaces=False if unset

* deprecate warning

* updating param for old models

* update models

* make fix-copies

* fix-copies and update bert models

* warning msg

* update prophet and clvp

* updating test since space before is arbitrarily removed

* remove warning for 4.45
@itazap itazap mentioned this pull request Sep 27, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
@itazap itazap mentioned this pull request Jun 20, 2025

Successfully merging this pull request may close these issues.

[BUG] GPT-2 tokenizer is NOT invertible

4 participants