clean_up_tokenization_spaces=False if unset #31938
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
force-pushed from e689ec3 to ec6f78a
```python
# TODO This is ran for all models but only tests bert...
def test_clean_up_tokenization_spaces(self):
    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
```
this only ever tested Bert, so I don't think this is valuable to keep or to update to be tested for each model, because it behaves differently with special tokens and might have to be customized for some models. Since we will deprecate, I don't think it's useful to start maintaining this test now for all models!
```diff
  def get_input_output_texts(self, tokenizer):
      input_text = "lower newer"
-     output_text = "lower newer"
+     output_text = "lower[SPACE]newer"
```
testing sets a small vocab here, so this should be the expected behaviour. See the unmodified test ClvpTokenizationTest -> test_full_tokenizer for an example where [SPACE] was expected
```diff
- self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
- self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
+ self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
+ self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
```
the `tokenizer.word_delimiter_token` is replaced with a space `" "`. See:
```python
def convert_tokens_to_string(self, tokens: List[str]) -> str:
    """
    Converts a connectionist-temporal-classification (CTC) output tokens into a single string.
    """
    ...
    # replace delimiter token
    string = "".join([" " if token == self.word_delimiter_token else token for token in filtered_tokens]).strip()
    if self.do_lower_case:
        string = string.lower()
    return string
```
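As a self-contained sketch of just the delimiter-replacement step above (it omits the CTC filtering of repeated tokens that produces `filtered_tokens`; the function name is illustrative, not the real API):

```python
def ctc_tokens_to_string(tokens, word_delimiter_token="|", do_lower_case=False):
    # Mirror the delimiter replacement: the word-delimiter token becomes a space.
    # (The real Wav2Vec2 tokenizer first groups/filters repeated CTC tokens.)
    string = "".join(" " if t == word_delimiter_token else t for t in tokens).strip()
    return string.lower() if do_lower_case else string

print(ctc_tokens_to_string(["H", "E", "Y", "|", "Y", "O", "U"]))  # HEY YOU
```

This is why a space shows up in the decoded output even without cleanup: it comes from the delimiter token itself.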
we should not have to do this! maybe cleanup tokenization should be true for wav2vec2 no?
I thought so too but the sample_ids have tokenizer.word_delimiter_token_id which is a space " " so I think it would be expected in the output? wdyt @ArthurZucker
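For context, the cleanup this flag toggles is roughly the following (a sketch mirroring `clean_up_tokenization` in transformers; the exact replacement list is an assumption and may differ):

```python
def clean_up_tokenization(out_string: str) -> str:
    # Remove the space that decoding leaves before punctuation and contractions.
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for before, after in replacements:
        out_string = out_string.replace(before, after)
    return out_string

# With cleanup on, the spaced punctuation from the test above is collapsed:
print(clean_up_tokenization("HELO!? !?"))  # HELO!?!?
```

This shows why the test expectations change when the default flips: with cleanup off, the space before `!?` survives.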
ArthurZucker left a comment
Looks good, let's make sure we keep the default to True for now!
force-pushed from 45b6a0e to b7b2b09
i believe the issue is:

```python
if "clean_up_tokenization_spaces" not in kwargs:
    warnings.warn(
        "`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This "
        "behavior will be deprecated in transformers v4.45, and will then be set to `False` by default. "
        "For more details check this issue: https://github.com/huggingface/transformers/issues/31884",
        FutureWarning,
    )
# By default, cleaning tokenization spaces for both fast and slow tokenizers
self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)
```

By default it sets it, but I don't understand where to update it to remove the warning completely.
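If I read the snippet right, the warning only fires when the kwarg is absent, so passing `clean_up_tokenization_spaces` explicitly silences it. A minimal self-contained sketch of that default-handling pattern (the function name is illustrative, not the real API):

```python
import warnings

def pop_cleanup_flag(**kwargs):
    # Warn only when the caller did not set the flag, then fall back to the default.
    if "clean_up_tokenization_spaces" not in kwargs:
        warnings.warn(
            "`clean_up_tokenization_spaces` was not set; defaulting to True.",
            FutureWarning,
        )
    return kwargs.pop("clean_up_tokenization_spaces", True)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pop_cleanup_flag()                                    # warns
    pop_cleanup_flag(clean_up_tokenization_spaces=False)  # silent
print(len(caught))  # 1
```

So from the caller's side, setting the flag explicitly is the way to avoid the warning; removing it entirely requires changing the library code itself.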
@rishi23root I'll update the message!
```python
warnings.warn(
    "The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
    FutureWarning,
)
```

let's not warn, we won't remove it!

Suggested change: delete the `warnings.warn(...)` call above.
* clean_up_tokenization_spaces=False if unset
* deprecate warning
* updating param for old models
* update models
* make fix-copies
* fix-copies and update bert models
* warning msg
* update prophet and clvp
* updating test since space before is arbitrarily removed
* remove warning for 4.45
FUTURE DEPRECATION

fixes #31884

Start of deprecating `clean_up_tokenization_spaces`:
* Right now it defaults to `True`; update it to default to `False`.
* Some models still expect `clean_up_tokenization_spaces=True`, so set this in the class init for those models.

@ArthurZucker
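The "set this in class init" step could be sketched like this (all class names here are hypothetical stand-ins, not the real transformers classes):

```python
class BaseTokenizer:
    # Stand-in for the tokenizer base class: after the change, the flag
    # defaults to False when the caller does not pass it.
    def __init__(self, **kwargs):
        self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", False)

class OldModelTokenizer(BaseTokenizer):
    # Hypothetical older model whose decode output relied on the old default,
    # so the subclass pins clean_up_tokenization_spaces=True in its own init.
    def __init__(self, **kwargs):
        kwargs.setdefault("clean_up_tokenization_spaces", True)
        super().__init__(**kwargs)

print(OldModelTokenizer().clean_up_tokenization_spaces)  # True
print(OldModelTokenizer(clean_up_tokenization_spaces=False).clean_up_tokenization_spaces)  # False
```

Using `setdefault` keeps the per-model pin while still letting a caller override the flag explicitly.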