
clean_up_tokenization_spaces=False if unset #31938

Merged
ArthurZucker merged 10 commits into main from
clean_up_tokenization_spaces_false_default
Sep 26, 2024

Conversation

@itazap
Collaborator

@itazap itazap commented Jul 12, 2024

FUTURE DEPRECATION

fixes #31884

This is the start of deprecating clean_up_tokenization_spaces. Right now it defaults to True; this PR updates it to default to False.

  • BERT-based models need clean_up_tokenization_spaces=True, so set this in the class init
  • some models, like wav2vec2, needed test updates since they don't really expect clean_up_tokenization_spaces=True.
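For context, the cleanup this flag toggles removes the spaces that decoding inserts before punctuation and English contractions. A minimal standalone sketch, with the replacement rules paraphrased from `PreTrainedTokenizerBase.clean_up_tokenization` (the exact rule set in your installed version may differ):

```python
# Hypothetical standalone sketch of what clean_up_tokenization_spaces toggles:
# strip the space that decoding leaves before punctuation and contractions.

def clean_up_tokenization(out_string: str) -> str:
    """Remove spaces before punctuation and common English contractions."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("hello , world ! do n't stop ."))
# hello, world! don't stop.
```

With the flag set to False, decoded text keeps these spaces exactly as the tokens produced them, which is what makes round-tripping (encode then decode) lossless for tokenizers like GPT-2's.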

@ArthurZucker

@itazap itazap requested a review from ArthurZucker July 12, 2024 16:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread src/transformers/tokenization_utils_base.py
@itazap itazap force-pushed the clean_up_tokenization_spaces_false_default branch 4 times, most recently from e689ec3 to ec6f78a Compare July 25, 2024 08:23
Comment thread tests/test_tokenization_common.py Outdated

# TODO This is ran for all models but only tests bert...
def test_clean_up_tokenization_spaces(self):
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
Collaborator Author

@itazap itazap Jul 25, 2024

this only ever tested Bert, so I don't think it's valuable to keep, or to update it to run for each model: the cleanup behaves differently with special tokens and might need to be customized for some models. Since we will deprecate this anyway, I don't think it's useful to start maintaining the test for all models now!

@itazap itazap marked this pull request as ready for review July 26, 2024 10:29
@itazap itazap requested a review from ArthurZucker July 26, 2024 10:30
  def get_input_output_texts(self, tokenizer):
      input_text = "lower newer"
-     output_text = "lower newer"
+     output_text = "lower[SPACE]newer"
Collaborator Author

testing sets a small vocab here, so this should be the expected behaviour. See the unmodified test ClvpTokenizationTest --> test_full_tokenizer for an example where [SPACE] was expected

- self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
- self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
+ self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
+ self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
Collaborator Author

the tokenizer.word_delimiter_token is replaced with a space " ". See:

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts connectionist-temporal-classification (CTC) output tokens into a single string.
        """
      ...
        # replace delimiter token
        string = "".join([" " if token == self.word_delimiter_token else token for token in filtered_tokens]).strip()

        if self.do_lower_case:
            string = string.lower()

        return string

Collaborator

we should not have to do this! Maybe cleanup tokenization should be True for wav2vec2, no?

Collaborator Author

@itazap itazap Sep 26, 2024

I thought so too, but the sample_ids include tokenizer.word_delimiter_token_id, which is a space " ", so I think it would be expected in the output? wdyt @ArthurZucker

Collaborator

@ArthurZucker ArthurZucker left a comment

Looks good, let's make sure we keep the default to True for now!

Comment thread src/transformers/tokenization_utils_base.py
Comment thread src/transformers/tokenization_utils_base.py
@itazap itazap force-pushed the clean_up_tokenization_spaces_false_default branch from 45b6a0e to b7b2b09 Compare September 6, 2024 11:29
@rishi23root

I believe the issue is:

if "clean_up_tokenization_spaces" not in kwargs:
    warnings.warn(
        "`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This "
        "behavior will be depracted in transformers v4.45, and will be then set to `False` by default. "
        "For more details check this issue: https://github.com/huggingface/transformers/issues/31884",
        FutureWarning,
    )

# By default, cleaning tokenization spaces for both fast and slow tokenizers
self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)

by default it is being set, but I don't understand where to update it to remove the warning completely
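The quoted logic only warns when the keyword is absent, so passing the argument explicitly silences it. A standalone sketch of that kwargs handling (the function name `init_tokenizer` is hypothetical; the real check lives in the tokenizer `__init__`):

```python
# Hypothetical reduction of the quoted kwargs handling: the warning fires
# only when clean_up_tokenization_spaces is not passed explicitly.
import warnings

def init_tokenizer(**kwargs):
    if "clean_up_tokenization_spaces" not in kwargs:
        warnings.warn("`clean_up_tokenization_spaces` was not set.", FutureWarning)
    return kwargs.pop("clean_up_tokenization_spaces", True)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default = init_tokenizer()                                     # warns, returns True
    explicit = init_tokenizer(clean_up_tokenization_spaces=False)  # silent, returns False

print(default, explicit, len(caught))  # True False 1
```

In practice, passing `clean_up_tokenization_spaces=False` (or `True`) to `from_pretrained` should therefore suppress the warning for a given tokenizer.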

@itazap
Collaborator Author

itazap commented Sep 14, 2024

@rishi23root I'll update the message!

@itazap itazap requested a review from ArthurZucker September 20, 2024 09:21
Comment on lines +1611 to +1614
warnings.warn(
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
FutureWarning,
)
Collaborator
let's not warn, we won't remove it!

Suggested change
warnings.warn(
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
FutureWarning,
)

@ArthurZucker ArthurZucker merged commit 6730485 into main Sep 26, 2024
@ArthurZucker ArthurZucker deleted the clean_up_tokenization_spaces_false_default branch September 26, 2024 17:38
ArthurZucker pushed a commit that referenced this pull request Sep 26, 2024
* clean_up_tokenization_spaces=False if unset

* deprecate warning

* updating param for old models

* update models

* make fix-copies

* fix-copies and update bert models

* warning msg

* update prophet and clvp

* updating test since space before is arbitrarily removed

* remove warning for 4.45
@itazap itazap mentioned this pull request Sep 27, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
@itazap itazap mentioned this pull request Jun 20, 2025

Successfully merging this pull request may close these issues.

[BUG] GPT-2 tokenizer is NOT invertible

4 participants