Skipping outputs #3116
Conversation
Nice. One question: do we want to have a

I'm OK with both solutions. (By the way, in general terms, a lot of software accepts a combination of a whitelist and/or a blacklist; when both are present, the usual semantics are "include the whitelist, then remove the blacklist".) If we do `keep_output`, maybe we name the attribute `return_outputs: List[str]` for consistency with `encode_xxx()` params?
I agree with both of you. Furthermore, this approach (deleting from the dict

I'm implementing both your proposed changes, looking into fixing the above and into fast tokenizers. I'll then move on to the tests.
```python
return_attention_mask=True,
return_overflowing_tokens=False,
return_special_tokens_mask=False,
return_token_type_ids=None,
```
maybe explicitly define them as `Optional[bool] = None`?
Yes, I'll update the typings, doc & tests if @mfuntowicz and @thomwolf agree that this is the best way to do it.
Why do we move from `True`/`False` to `None`/`XXX`?
Putting `None` means that we can safely identify whether the user has passed a value different from the default. If a value is `None`, then we can rely on the `return_outputs` attribute to decide whether to return this output or not.
If it is not `None`, then its value is absolute (as it's an argument that was passed by the user).
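As an illustration of the pattern described above (this is a sketch, not the library's actual code; `TokenizerSketch`, `resolve`, and the attribute default are made up for the example): `None` acts as a sentinel meaning "the user passed nothing", so the class-level `return_outputs` default applies; any explicit boolean wins.

```python
from typing import List, Optional

class TokenizerSketch:
    # hypothetical class attribute listing the outputs returned by default
    return_outputs: List[str] = ["attention_mask"]

    def resolve(self, return_attention_mask: Optional[bool] = None) -> bool:
        if return_attention_mask is None:
            # no explicit value: defer to the class-level default list
            return "attention_mask" in self.return_outputs
        # an explicit value passed by the user is absolute
        return return_attention_mask

tok = TokenizerSketch()
print(tok.resolve())       # True  (taken from the default list)
print(tok.resolve(False))  # False (explicit user value wins)
```

With a `True`/`False` default instead of `None`, the second branch could never distinguish "user asked for the default" from "user explicitly passed the default value".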
Ok I see, then I guess a clear typing like @julien-c suggested would be good
Force-pushed from aefcb84 to 0cc3eb5.
I like the solution, 👍. One question: it requires the user to know / look at the names of the parameters handled by

```python
model = SomeModel(...)
tokenizer = AutoTokenizer.from_pretrained(..., return_outputs=model.input_names)
```

Indeed, such an attribute would be helpful! I'll add it and move on to the tests.
thomwolf left a comment:
ok did a review but I'm not sure it was finished enough to do one haha
anyway, here are some comments
```python
return_attention_mask=True,
return_overflowing_tokens=False,
return_special_tokens_mask=False,
return_token_type_ids=None,
```
Why do we move from `True`/`False` to `None`/`XXX`?
```python
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
return_outputs = ["attention_mask"]
```
rename this `model_inputs` or `model_input_names`?
I'll let @julien-c answer that as he offered that naming
> If we do `keep_output`, maybe we name the attribute `return_outputs: List[str]` for consistency with `encode_xxx()` params?
```python
if return_attention_mask is None:
    return_attention_mask = "attention_mask" in self.return_outputs
if return_overflowing_tokens is None:
    return_overflowing_tokens = "overflowing_tokens" in self.return_outputs
```
`"overflowing_tokens"` will never be in `self.return_outputs`, will it?
Not by default, but a user could pass it as an argument so that it is always returned. Not as important as the other three, though, I agree. Will remove it.
Regarding the suggestion of @mfuntowicz: in the end, this should be in a common configuration for models and tokenizers, I guess, so maybe we could actually have this attribute as an input to

That value is currently managed by the

It still needs to be a class attribute in my opinion, as it should be overridden by children of
Force-pushed from 0cc3eb5 to eaf0f97.
```diff
  return [
      {value: batch_encode_plus_sequences[value][i] for value in batch_encode_plus_sequences.keys()}
-     for i in range(len(batch_encode_plus_sequences))
+     for i in range(len(batch_encode_plus_sequences["input_ids"]))
  ]
```
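A small standalone sketch of why this change matters (the toy batch values are made up): `len()` on the batch dict counts its *keys*, not the number of encoded sequences, so the comprehension must range over the length of one of the per-example lists, e.g. `batch["input_ids"]`.

```python
batch_encode_plus_sequences = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102], [101, 999, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}

# buggy: len(batch_encode_plus_sequences) == 2 (the number of keys),
# silently dropping the third example
buggy = [
    {k: batch_encode_plus_sequences[k][i] for k in batch_encode_plus_sequences}
    for i in range(len(batch_encode_plus_sequences))
]

# fixed: range over the actual batch size
fixed = [
    {k: batch_encode_plus_sequences[k][i] for k in batch_encode_plus_sequences}
    for i in range(len(batch_encode_plus_sequences["input_ids"]))
]

print(len(buggy), len(fixed))  # 2 3
```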
Force-pushed from 63a664d to 84b4176.
Should be good for review. I reverted the
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3116    +/-   ##
==========================================
+ Coverage   77.98%   77.99%   +<.01%
==========================================
  Files          98       98
  Lines       16645    16660      +15
==========================================
+ Hits        12981    12994      +13
- Misses       3664     3666       +2
```
Continue to review full report at Codecov.
Force-pushed from dd336e1 to 96b2fa1.
* Minimal example
* Proposal 2
* Proposal 2 for fast tokenizers
* Typings
* Docs
* Revert "Docs" for easier review (this reverts commit eaf0f97)
* Remove unnecessary assignments
* Tests
* Fix faulty type
* Remove prints
* return_outputs -> model_input_names
* Revert "Revert "Docs" for easier review" (this reverts commit 6fdc694)
* code quality
Currently, `encode_plus` and `batch_encode_plus` return the same outputs for different models. This is sub-optimal, as we can't do the following for each model:

This will crash for DistilBERT, as the tokenizer would return `token_type_ids`, which can't be handled by the model.

In order to fix this, each tokenizer has to return model-specific arguments. Usually these are the same default arguments, but some models handle fewer of them (e.g. DistilBERT, RoBERTa).
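To make the crash concrete, here is a toy illustration (not real transformers code; `distilbert_like_forward` is a stand-in for a model's `forward`): splatting the tokenizer's full output dict into a function that does not accept `token_type_ids` raises a `TypeError`.

```python
def distilbert_like_forward(input_ids, attention_mask=None):
    # stand-in for a DistilBERT-style forward(): no token_type_ids parameter
    return input_ids

encoded = {
    "input_ids": [101, 4403, 102],
    "token_type_ids": [0, 0, 0],
    "attention_mask": [1, 1, 1],
}

try:
    distilbert_like_forward(**encoded)
except TypeError as err:
    # got an unexpected keyword argument 'token_type_ids'
    print(err)
```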
This is a mock PR offering a solution using a ~~`skip_outputs`~~ `return_outputs` argument to tokenizers. Returns a dictionary without the token type ids:
```
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

Specifying a custom ~~`skip_outputs`~~ `return_outputs` at initialisation works as expected:

```
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

or with a custom skipped output:

```
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
```

This also works with saving/reloading:

Returns the following:

```
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
```
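A minimal end-to-end sketch of the idea behind this PR, with illustrative names (`filter_encoded` and the keep-`input_ids` rule are assumptions, not the actual implementation): the tokenizer holds a list of the model's input names and drops every other key from the encoded dict.

```python
from typing import Dict, List

def filter_encoded(encoded: Dict[str, list], model_input_names: List[str]) -> Dict[str, list]:
    # input_ids are always kept; every other output must be whitelisted
    return {k: v for k, v in encoded.items() if k == "input_ids" or k in model_input_names}

encoded = {
    "input_ids": [101, 4403, 102],
    "token_type_ids": [0, 0, 0],
    "attention_mask": [1, 1, 1],
}

# a DistilBERT-like model that only accepts attention_mask
print(filter_encoded(encoded, ["attention_mask"]))
# {'input_ids': [101, 4403, 102], 'attention_mask': [1, 1, 1]}
```

The filtered dict can then be safely splatted into the model, which is exactly what `model.input_names` / `model_input_names` enables without the user having to memorise each model's signature.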