MobileBERT tokenizer tests#16896
Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
SaulLu
left a comment
There was a problem hiding this comment.
Thank you very much for working on adding tests for the MobileBert tokenizer! 🚀:
I've left 2 comments that suggest doing these tests a little differently. Don't hesitate to tell me what you think!
| # disable duplicate incorporation of tests from parent class in this module | ||
| BertTokenizationTest.__test__ = False | ||
|
|
||
|
|
||
| @require_tokenizers | ||
| class MobileBertTokenizationTest(BertTokenizationTest, unittest.TestCase): |
There was a problem hiding this comment.
To facilitate future maintenance of this test, I think it would be easier to just copy and paste the Bert's test into this file so that MobileBertTokenizationTest does not depend on BertTokenizationTest. In transformers - in general - we prefer to copy/paste code rather than create inheritances between models 🙂
There was a problem hiding this comment.
Ah, OK. In this case the solution is trivial of course. Might I ask what the reasoning is behind the preference to copy code instead of minimise duplication, just so I know in future?
There was a problem hiding this comment.
I think this part of our documentation should interest you!
I think for the particular case of MobileBert it is not obvious because I think that today we would not have created a MobileBertTokenizer class but only referenced the BertTokenizer in the config of the checkpoints if we really don't want to have any changes between the 2 tokenizers.
Now that the MobileBertTokenizer class exists I think the spirit is rather that MobileBertTokenizer's behaviour should not be constrained by the BERT model.
I would like to ping @LysandreJik - one of the main maintainers of transformers - on this subject if he has a different opinion than me on this issue, especially because it's a design discussion 🙂 .
There was a problem hiding this comment.
@patrickvonplaten has a recent blog post for this 😄
There was a problem hiding this comment.
Perfect, thanks! I especially like points 3+4.
Re: tests - I wonder what we do if a bug is found for the Bert tokenizer, in order to get the resulting test also implemented in eg. MobileBERT's tests. Where is that provenance annotated? Just in the ancestors? That said, I suppose new BERT tokenizer bugs are unlikely at this point :)
There was a problem hiding this comment.
There are things like (only for models/tokenizers, etc. for now, not for tests)
# Copied from <predecessor_model>.<function>
For example,
There was a problem hiding this comment.
For tests, I think we haven't really focus on this point (yet)
SaulLu
left a comment
There was a problem hiding this comment.
Thank you very much for the changes. I've left in a comment another change that I think is necessary to test MobileBert's classes.
Also, I think you would have to run the formatting (make style && make quality) and commit the changes for all the tests to pass 😄
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
|
Obviously - thanks! |
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
…sformers into mobilebert-tok-tests
Thanks for this, it makes sense. By the way, e.g. |
|
Could you check this comment and see if it works well? That's my first thought :-) |
OK this was fixed in huggingface/doc-builder#207 :)
Yes! All done :) |
SaulLu
left a comment
There was a problem hiding this comment.
LGTM! Thank you for your contribution 🤗
* unhardcode pretrained model path, make it a class var * add tests for mobilebert tokenizer * allow tempfiles for vocab & merge similarity test to autodelete * add explanatory comments * remove unused imports, let make style do its.. thing * remove inheritance and use BERT tok tests for MobileBERT * Update tests/mobilebert/test_tokenization_mobilebert.py Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * amend class names, remove unused import, add fix for mobilebert's hub pathname * unhardcode pretrained model path, make it a class var * add tests for mobilebert tokenizer * allow tempfiles for vocab & merge similarity test to autodelete * add explanatory comments * remove unused imports, let make style do its.. thing * remove inheritance and use BERT tok tests for MobileBERT * Update tests/mobilebert/test_tokenization_mobilebert.py Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * amend class names, remove unused import, add fix for mobilebert's hub pathname * amend paths for model tests being in models/ subdir of /tests * explicitly rm test from prev path Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* unhardcode pretrained model path, make it a class var * add tests for mobilebert tokenizer * allow tempfiles for vocab & merge similarity test to autodelete * add explanatory comments * remove unused imports, let make style do its.. thing * remove inheritance and use BERT tok tests for MobileBERT * Update tests/mobilebert/test_tokenization_mobilebert.py Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * amend class names, remove unused import, add fix for mobilebert's hub pathname * unhardcode pretrained model path, make it a class var * add tests for mobilebert tokenizer * allow tempfiles for vocab & merge similarity test to autodelete * add explanatory comments * remove unused imports, let make style do its.. thing * remove inheritance and use BERT tok tests for MobileBERT * Update tests/mobilebert/test_tokenization_mobilebert.py Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * amend class names, remove unused import, add fix for mobilebert's hub pathname * amend paths for model tests being in models/ subdir of /tests * explicitly rm test from prev path Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
What does this PR do?
This PR implements tests for MobileBERT. As MobileBERT uses a copy of the BERT Tokenizer, the test inherits from BertTokenizationTest, and also checks the merge & vocab files for these two models are identical.
Contributes fixes to issue #16627
Before submitting
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
cc. @LysandreJik @SaulLu