Tokenizers v3.0.0 #3185
Conversation
Force-pushed from 6b87216 to e65e5de
LysandreJik left a comment:
Overall, this is quite a different test suite than what we have in test_modeling_common and test_tokenization_common, but in a good way, imo.
The use of subTest will greatly help debugging, and splitting between test_xxx methods and assert_xxx methods makes the code cleaner and easier to read.
```python
TOKENIZERS_CLASSES = frozenset([
    Tokenizer("Bert", BertTokenizerFast, BertTokenizer, "vocab_file"),
    Tokenizer("DistilBert", DistilBertTokenizerFast, DistilBertTokenizer, "vocab_file"),
    Tokenizer("Roberta", RobertaTokenizerFast, RobertaTokenizer, "vocab_file"),
])
```
frozenset + named tuple: nice!
```python
for (name, rust_cls, python_cls, vocab_key) in self.TOKENIZERS_CLASSES:
    for pretrained_name in python_cls.pretrained_vocab_files_map[vocab_key].keys():
        with self.subTest("{} ({})".format(name, pretrained_name)):
```
This seems to me like the optimal organisation. Testing every checkpoint of every tokenizer, organized in subTests, is really thorough.
Does it take a while? Should we mark this as slow, or is it fast enough?
It takes around 1 min to run everything.
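For readers following along, here is a minimal, hypothetical sketch of the test_xxx / assert_xxx split and the subTest loop being praised above; the class and method names are illustrative, not the PR's actual code:

```python
import unittest
from collections import namedtuple

from transformers import BertTokenizer, BertTokenizerFast

Tokenizer = namedtuple("Tokenizer", ["name", "rust_cls", "python_cls", "vocab_key"])


class FastTokenizerMatchingTest(unittest.TestCase):
    TOKENIZERS_CLASSES = frozenset([
        Tokenizer("Bert", BertTokenizerFast, BertTokenizer, "vocab_file"),
    ])

    def test_tokenization(self):
        for (name, rust_cls, python_cls, vocab_key) in self.TOKENIZERS_CLASSES:
            for pretrained_name in python_cls.pretrained_vocab_files_map[vocab_key]:
                # One subTest per (tokenizer, checkpoint) pair: a failing checkpoint
                # is reported under its label instead of aborting the whole test.
                with self.subTest("{} ({})".format(name, pretrained_name)):
                    tokenizer_r = rust_cls.from_pretrained(pretrained_name)
                    tokenizer_p = python_cls.from_pretrained(pretrained_name)
                    self.assert_tokenization_equals(tokenizer_r, tokenizer_p)

    def assert_tokenization_equals(self, tokenizer_r, tokenizer_p):
        # assert_xxx helpers hold the actual checks, keeping test_xxx methods lean.
        sequence = "A sample sequence to tokenize"
        self.assertEqual(tokenizer_r.tokenize(sequence), tokenizer_p.tokenize(sequence))
```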
```python
padded_tokens_r = list(takewhile(lambda i: i == tokenizer_r.pad_token_id, reversed(input_r)))
padded_tokens_p = list(takewhile(lambda i: i == tokenizer_p.pad_token_id, reversed(input_p)))
```
We usually try to use lambdas as little as possible, as they can be a bit hard to read. cc @thomwolf
hmm here I think it's fine, no?
In tests I'm more OK with them.
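As an illustration of the lambda-free alternative being discussed, here is a sketch with a named predicate in place of the inline lambda; the helper names are hypothetical:

```python
from itertools import takewhile


def count_trailing_padding(input_ids, pad_token_id):
    # Named predicate instead of the inline lambda: reads as plain English.
    def is_pad(token_id):
        return token_id == pad_token_id

    # Walk the ids from the end, collecting tokens while they are padding.
    return list(takewhile(is_pad, reversed(input_ids)))


# e.g. three trailing pad tokens (id 0) after a BERT-style sequence
assert count_trailing_padding([101, 2023, 102, 0, 0, 0], pad_token_id=0) == [0, 0, 0]
```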
Force-pushed from 9941cc5 to 9c92af3
thomwolf left a comment:
This is looking great, just a few comments on things to check
```python
return_token_type_ids: bool = True,
return_attention_mask: bool = True,
```
You should rebase on master if you can because I think this is now Optional[bool] = None since @LysandreJik worked to adapt the output to the models.
I'm not sure I fully understand why bool should be Optional? A default value would be more understandable imho, and would remove the need for None checking, wdyt?
The spirit behind having those values as None by default is the following: if this value is None, then all those returns are set to the default tokenizer-specific values. This is different from tokenizer to tokenizer, e.g. DistilBERT should by default return input_ids and attention_mask, but not token_type_ids, as the model cannot handle it. This in turn allows the user to do the following:
```python
inputs = tokenizer.encode_plus(values, return_tensors="pt")
model(**inputs)
```

And this now works with every model.
These values can still be explicitly set to True or False by the user. See #3116 for more information/implementation details.
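To make the None-as-default behavior concrete, here is a minimal sketch, not the actual transformers implementation: the class is simplified and the hard-coded outputs are placeholders, though `model_input_names` mirrors a real tokenizer attribute.

```python
from typing import Optional


class SketchTokenizer:
    # Hypothetical: a DistilBERT-like tokenizer would omit "token_type_ids"
    # here, since the model cannot consume them.
    model_input_names = ["input_ids", "attention_mask"]

    def encode_plus(self, text: str, return_token_type_ids: Optional[bool] = None) -> dict:
        if return_token_type_ids is None:
            # None means "use this tokenizer's default", so that model(**inputs)
            # only ever receives arguments the model actually accepts.
            return_token_type_ids = "token_type_ids" in self.model_input_names
        outputs = {"input_ids": [101, 102], "attention_mask": [1, 1]}
        if return_token_type_ids:
            outputs["token_type_ids"] = [0, 0]
        return outputs


# Default (None) skips token_type_ids; an explicit True forces them back on.
assert "token_type_ids" not in SketchTokenizer().encode_plus("hello")
assert "token_type_ids" in SketchTokenizer().encode_plus("hello", return_token_type_ids=True)
```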
Commits (all signed off by Morgan Funtowicz <morgan@huggingface.co>; titles truncated in the original page):
- …Fast
- …ded to the output #3091
- …ncode_plus methods parameter.
- …enizerFast. Avoid stripping on None values.
- This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward.
- Backward compatibility.
- … in majority of cases.
- …structor parameter on Rust Tokenizers.
n1t0 left a comment:
A few details here and there, but otherwise looks good to me!
```python
) -> List[Encoding]:
    if sequences is None:
        raise ValueError(
            "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
```
I think these messages might be more specific for each method. This one should probably just say list/tuple of strings.
```python
def encode(self, sequence: str, pair: Optional[str] = None, add_special_tokens: bool = False) -> Encoding:
    if sequence is None:
        raise ValueError(
            "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
```
This one should probably just say string
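A sketch of what the method-specific messages could look like; the class name and exact wording are assumptions, not the PR's final code:

```python
from typing import List, Optional


class BaseTokenizerSketch:
    def encode(self, sequence: str, pair: Optional[str] = None):
        if sequence is None:
            # encode() only accepts strings, so the message says exactly that.
            raise ValueError("Input is not valid. Should be a string.")

    def encode_batch(self, sequences: List[str]):
        if sequences is None:
            raise ValueError("Input is not valid. Should be a list/tuple of strings.")
```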
| encoding_dict["special_tokens_mask"].append(e.special_tokens_mask) | ||
| if return_offsets_mapping: | ||
| encoding_dict["offset_mapping"].append([e.original_str.offsets(o) for o in e.offsets]) | ||
| encoding_dict["offset_mapping"] = [o for o in e.offsets] |
Should it be like above:

```diff
- encoding_dict["offset_mapping"] = [o for o in e.offsets]
+ encoding_dict["offset_mapping"].append(e.offsets)
```
```python
if batch_text_or_text_pairs is None:
    raise ValueError(
        "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
```
Same here, I think we can remove the string option
```python
self.assert_embeded_special_tokens(tokenizer_r, tokenizer_p)
self.assert_padding(tokenizer_r, tokenizer_p)
# TODO: enable for v3.0.0
# self.assert_empty_output_no_special_tokens(tokenizer_r, tokenizer_p)

# Check for dynamic encoding sequence handling in batch_encode_plus
self.assert_batch_encode_dynamic_overflowing(tokenizer_r)
# Rust correctly handles the space before the mask while python doesn't
```
I think Python is right here. There shouldn't be any space before the <mask> token. This means that Roberta on the fast path should probably have an AddedToken('<mask>', lstrip=True)
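A sketch of the suggested fix, using the tokenizers library's AddedToken; the surrounding setup is illustrative only, and the actual wiring into RobertaTokenizerFast is not shown:

```python
from tokenizers import AddedToken, Tokenizer
from tokenizers.models import BPE

# lstrip=True lets the special token absorb the whitespace immediately before
# "<mask>", so the fast tokenizer matches the slow Python tokenizer's output.
mask = AddedToken("<mask>", lstrip=True)

tokenizer = Tokenizer(BPE())  # placeholder model for the sketch
tokenizer.add_special_tokens([mask])
```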
```diff
  # Testing tokenization
  tokens = tokenizer.tokenize(sequence, add_prefix_space=True)
- rust_tokens = rust_tokenizer.tokenize(sequence)
+ rust_tokens = rust_tokenizer.tokenize(sequence, add_prefix_space=True)
```
add_prefix_space=True isn't required here I think
Further commits (all signed off by Morgan Funtowicz <morgan@huggingface.co>; titles truncated in the original page):
- …every iteration.
- … for Roberta.
- …utes.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3185      +/-   ##
==========================================
- Coverage   77.79%   77.55%   -0.25%
==========================================
  Files         100      100
  Lines       17025    17105      +80
==========================================
+ Hits        13245    13265      +20
- Misses       3780     3840      +60
```
Continue to review full report at Codecov.
LysandreJik left a comment:
This isn't easy to review, as the diff mixes subtractions from PreTrainedTokenizer and additions from BatchEncoding; other than that, cool! Thanks @mfuntowicz :)
```python
# Define type aliases
TextInput = str
TextPairInput = Tuple[str, str]
PreTokenizedInput = List[str]
PreTokenizedInputPair = Tuple[List[str], List[str]]
```
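As a sketch of why these aliases help, here is how they could type a batch encoding signature; the function and the Union shape are illustrative, not the actual library code:

```python
from typing import List, Tuple, Union

TextInput = str
TextPairInput = Tuple[str, str]
PreTokenizedInput = List[str]
PreTokenizedInputPair = Tuple[List[str], List[str]]


def batch_encode_plus(
    batch_text_or_text_pairs: Union[
        List[TextInput],
        List[TextPairInput],
        List[PreTokenizedInput],
        List[PreTokenizedInputPair],
    ],
) -> dict:
    # The Union reads as a catalogue of the accepted input shapes.
    ...
```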
```python
Find the Offsets of the token containing the character at the specified position
:param sentence: Index of the sentence relative to the batch provided to the tokenizer.
:param char: Char index to get the relative token offsets
:return: (token start, token end)
```
(Applicable to most other docstrings.) We use Google-style docstrings in the library; could we try to use them here as well?
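For reference, here is the quoted docstring rewritten in Google style as a sketch; the method name is hypothetical, since the snippet does not show it:

```python
from typing import Tuple


def char_to_token_offsets(self, sentence: int, char: int) -> Tuple[int, int]:
    """Find the offsets of the token containing the character at the given position.

    Args:
        sentence: Index of the sentence relative to the batch provided to the tokenizer.
        char: Char index to get the relative token offsets.

    Returns:
        Tuple[int, int]: The (token start, token end) offsets.
    """
```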
Really like the new typings!
```diff
  # Filter out features not available on specific models
- inputs = self.inputs_for_model(inputs)
+ # inputs = self.inputs_for_model(inputs)
```
Let's remember to remove this for good soon @mfuntowicz