Add BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese #13788
sgugger merged 31 commits into huggingface:master from datquocnguyen:master
Conversation
patil-suraj
left a comment
Thanks a lot for adding this, great to have BARTpho integrated into Transformers!
I have left a few comments below, let me know if something is not clear, happy to help :)
Also AFAIK it's not possible to add a fast version of this type of tokenizer where sentencepiece is used only for tokenization and a different vocab file is used to id to token and token to id conversion. But @SaulLu , @n1t0 would know more :)
(cc @LysandreJik @sgugger)
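The two-step scheme described above can be sketched with a toy stand-in: a segmenter produces token strings (sentencepiece in the real tokenizer, a whitespace split here), and a separate monolingual vocab file handles token-to-id conversion. All names and the tiny vocab below are illustrative, not the actual Transformers API.

```python
def segment(text):
    # Stand-in for sentencepiece's EncodeAsPieces(); only used to
    # split text into token strings, never for id conversion.
    return text.split()

# Stand-in for the separate monolingual vocab file used for
# token <-> id conversion (toy entries, not real vocab values).
monolingual_vocab = {"<s>": 0, "</s>": 2, "<unk>": 3, "xin": 4, "chào": 5}

def tokens_to_ids(tokens):
    # Tokens absent from the monolingual vocab map to <unk>.
    return [monolingual_vocab.get(t, monolingual_vocab["<unk>"]) for t in tokens]

ids = tokens_to_ids(segment("xin chào thế_giới"))
```

Because the id mapping lives in a file separate from the sentencepiece model, the usual fast-tokenizer conversion (which assumes one model owns both segmentation and the vocab) does not apply directly, which is the difficulty raised above.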
sgugger
left a comment
I wonder how this tokenizer exactly differs from a classic sentencepiece tokenizer (like BART)? Could we also add the fast version?
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Thanks, @patil-suraj and @sgugger. The same comment I get from both of you is regarding the vocab_file. Here is a summary: I did not train a sentencepiece model for Vietnamese. Use case of BartphoTokenizer: other languages can thus simply reuse BartphoTokenizer with their own …
    if token_ids_1 is None:
        return len(cls + token_ids_0 + sep) * [0]
    return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
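For context, the snippet above always returns zeros, since BART-style models do not use token type ids. A minimal self-contained sketch of that logic, with illustrative special-token ids (not the real vocab values):

```python
cls = [0]  # <s>, illustrative id
sep = [2]  # </s>, illustrative id

def create_token_type_ids(token_ids_0, token_ids_1=None):
    # Mirrors the RoBERTa-style logic quoted above: the mask is all
    # zeros regardless of whether a second sequence is present.
    if token_ids_1 is None:
        return len(cls + token_ids_0 + sep) * [0]
    return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

single = create_token_type_ids([5, 6])       # <s> A </s>
pair = create_token_type_ids([5, 6], [7])    # <s> A </s></s> B </s>
```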
This looks like BERT or RoBERTa - I don't think BART makes use of the CLS token, no?
I think it would be better to write the function more like it is in T5:
With the following statement:
Bart is a bad example as it inherits from RoBERTa
The BartphoTokenizer also looks like BarthezTokenizer, as they both inherit from BartTokenizer (RobertaTokenizer). And BARThez also got approved.
transformers/src/transformers/models/barthez/tokenization_barthez.py
Lines 227 to 229 in 85d69a7
- When BartphoTokenizer is written following RobertaTokenizer, the transformers BARTpho produces the same feature outputs as its fairseq BARTpho counterpart. This is exactly what I expected, since I originally trained BARTpho using fairseq and then converted it into transformers.
- When I then wrote BartphoTokenizer following T5Tokenizer as you suggested, the transformers and fairseq BARTpho variants produced different feature outputs given the same input text.
I am not sure why it's a bad example. Any feedback on this @patrickvonplaten ? Thanks.
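The mismatch reported above follows from how the two styles wrap a sequence in special tokens: RoBERTa-style tokenizers (which BartphoTokenizer follows) produce `<s> A </s>`, while T5-style tokenizers append only a trailing `</s>`. A toy sketch with illustrative ids (not the real vocab values):

```python
bos, eos = 0, 2  # illustrative <s> and </s> ids

def roberta_style(ids):
    # RoBERTa/BART-style: <s> A </s>
    return [bos] + ids + [eos]

def t5_style(ids):
    # T5-style: A </s> (no leading BOS token)
    return ids + [eos]

# The same input ids yield different sequences under the two styles,
# so downstream feature outputs differ as well.
```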
SaulLu
left a comment
Thank you so much for this addition and your work @datquocnguyen !
Concerning the fast tokenizer, indeed I don't think our fast tokenizers were designed to support this kind of behavior. It might be possible to find a trick to make it work, but unfortunately I don't see it right now.
@sgugger, @LysandreJik, @patil-suraj, @SaulLu and @patrickvonplaten
LysandreJik
left a comment
This looks good to me @datquocnguyen, thank you for adding another tokenizer alongside BERTweet and PhoBERT!
Too bad that there can be no fast tokenizer, but this LGTM either way!
Thanks @LysandreJik. My pull request suddenly failed the checks. They are out of my control and not related to BartphoTokenizer.
Yes, this is a problem unrelated to this PR, so you can ignore those failures. They should be fixed tomorrow :-)
sgugger
left a comment
There was a problem hiding this comment.
One last small comment, and you should then run make fix-copies. Then we will be good to merge :-)
@sgugger I made a revision following your last comment. Thanks.
Thanks again for your contribution!
What does this PR do?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Please have a look: @patrickvonplaten, @patil-suraj. Thanks.