
Add BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese#13788

Merged
sgugger merged 31 commits into huggingface:master from datquocnguyen:master
Oct 18, 2021
Conversation

@datquocnguyen
Contributor

What does this PR do?

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Could you please have a look: @patrickvonplaten, @patil-suraj? Thanks.

Contributor

@patil-suraj patil-suraj left a comment


Thanks a lot for adding this, great to have BARTpho integrated into Transformers!

I have left a few comments below, let me know if something is not clear, happy to help :)

Also, AFAIK it's not possible to add a fast version of this type of tokenizer, where sentencepiece is used only for tokenization and a different vocab file is used for id-to-token and token-to-id conversion. But @SaulLu, @n1t0 would know more :)
(cc @LysandreJik @sgugger)

Collaborator

@sgugger sgugger left a comment


I wonder how this tokenizer exactly differs from a classic sentencepiece tokenizer (like BART)? Could we also add the fast version?

datquocnguyen and others added 7 commits October 1, 2021 11:14
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@datquocnguyen
Contributor Author

datquocnguyen commented Oct 1, 2021

Thanks, @patil-suraj and @sgugger
I fixed the pull request based on your comments.

The same comment I got from both of you concerns the vocab_file. Here is a summary:

  • I did not train a sentencepiece model for Vietnamese.
  • bartpho-syllable employs the existing pre-trained sentencepiece model from XLMRoBERTaTokenizer, and this pre-trained sentencepiece model is referred to as a vocab_file of 250K types.
  • reduced_vocab_file is a vocab containing 40K Vietnamese-specific types extracted from the XLMRoBERTaTokenizer vocab of 250K types.

Use case of BartphoTokenizer: other languages can thus simply reuse BartphoTokenizer with their own reduced_vocab_file. The goal here is to reduce the model sizes of existing pre-trained XLM-RoBERTa/mBART models when applying them to a smaller set of languages instead of the whole 50/100 languages.
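As a hedged sketch of this two-file setup (the dicts and toy pieces below are invented for illustration, not the actual BartphoTokenizer code): the full sentencepiece model handles tokenization, while the reduced vocab handles token-to-id conversion, with pieces outside the reduced vocab falling back to <unk>:

```python
# Toy stand-in for XLM-R's 250K-piece sentencepiece vocab.
full_vocab = {"<s>": 0, "</s>": 1, "<unk>": 2, "xin": 3, "chào": 4, "hello": 5, "bonjour": 6}

# Reduced vocab: only the pieces needed for the target language
# (the real reduced_vocab_file keeps ~40K Vietnamese-specific types).
reduced_pieces = ["<s>", "</s>", "<unk>", "xin", "chào"]
reduced_vocab = {piece: idx for idx, piece in enumerate(reduced_pieces)}

def token_to_id(token: str) -> int:
    # Pieces missing from the reduced vocab map to <unk>,
    # so the embedding matrix can shrink to the reduced size.
    return reduced_vocab.get(token, reduced_vocab["<unk>"])

print([token_to_id(t) for t in ["xin", "chào", "bonjour"]])  # [3, 4, 2]
```

Swapping in a different reduced_vocab_file is what would let another language reuse the same tokenizer class, as described above.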

datquocnguyen and others added 4 commits October 1, 2021 16:02
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>

if token_ids_1 is None:
    return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
Contributor


This looks like BERT or RoBERTa; I don't think BART makes use of the CLS token, does it?
I think it would be better to write the function more like it is in T5:

return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
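A toy side-by-side of the two layouts under discussion (my own sketch with placeholder token ids, not the PR's code) shows they differ only in which special tokens are counted; both return all zeros:

```python
cls, sep, eos = [0], [2], [2]  # placeholder special-token id lists

def roberta_style(token_ids_0, token_ids_1=None):
    # RoBERTa/BART layout: <s> A </s> </s> B </s>
    if token_ids_1 is None:
        return len(cls + token_ids_0 + sep) * [0]
    return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

def t5_style(token_ids_0, token_ids_1=None):
    # T5 layout: A </s> B </s>  (no CLS, single separator)
    if token_ids_1 is None:
        return len(token_ids_0 + eos) * [0]
    return len(token_ids_0 + eos + token_ids_1 + eos) * [0]

print(roberta_style([10, 11], [12]))  # 7 zeros
print(t5_style([10, 11], [12]))       # 5 zeros
```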

Contributor


With the following statement:

Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make

Contributor


Bart is a bad example as it inherits from RoBERTa

Contributor Author

@datquocnguyen datquocnguyen Oct 12, 2021


The BartphoTokenizer also looks like BarthezTokenizer, as both inherit from BartTokenizer (RobertaTokenizer). And BARThez also got approved.

if token_ids_1 is None:
    return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

  • When BartphoTokenizer is written following RobertaTokenizer, the transformers BARTpho produces the same feature outputs as its fairseq BARTpho counterpart. This is exactly what I expected, since I originally trained BARTpho using fairseq and then converted it into transformers.

  • When I wrote BartphoTokenizer following T5Tokenizer as you suggested, the transformers and fairseq BARTpho variants produced different feature outputs given the same input text.

I am not sure why it's a bad example. Any feedback on this, @patrickvonplaten? Thanks.
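The equivalence check described above can be sketched roughly as follows (the feature values and helper are placeholders I made up, not the PR's actual verification script):

```python
def features_match(a, b, tol=1e-5):
    # Element-wise comparison of two flat feature vectors within a tolerance,
    # standing in for comparing fairseq vs converted-transformers outputs.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

fairseq_features = [0.12, -0.98, 1.05]       # placeholder fairseq BARTpho output
transformers_features = [0.12, -0.98, 1.05]  # placeholder converted-checkpoint output
print(features_match(fairseq_features, transformers_features))  # True
```

With the RoBERTa-style tokenizer the two stacks agreed; with the T5-style one they did not, which is the crux of the disagreement above.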

Contributor

@SaulLu SaulLu left a comment


Thank you so much for this addition and your work @datquocnguyen !

Concerning fast tokenizers, indeed I don't think they were designed to support this kind of behavior. It might be possible to find a trick to make it work, but unfortunately I don't see one right now.

@LysandreJik LysandreJik self-requested a review October 12, 2021 12:20
@datquocnguyen
Contributor Author

@sgugger , @LysandreJik, @patil-suraj , @SaulLu and @patrickvonplaten
Could you please have a look and provide your feedback on my recent changes? Thanks.

Member

@LysandreJik LysandreJik left a comment


This looks good to me @datquocnguyen, thank you for adding another tokenizer alongside BERTweet and PhoBERT!

Too bad that there can be no fast tokenizer, but this LGTM either way!

@datquocnguyen
Contributor Author

Thanks @LysandreJik

My pull request suddenly failed the run_tests_torch_and_flax check:

FAILED tests/test_modeling_flax_clip.py::FlaxCLIPModelTest::test_equivalence_flax_to_pt
FAILED tests/test_modeling_flax_clip.py::FlaxCLIPModelTest::test_equivalence_pt_to_flax

They are out of my control and not related to BartphoTokenizer.

@sgugger
Collaborator

sgugger commented Oct 18, 2021

Yes, this is a problem unrelated to this PR, so you can ignore those failures. They should be fixed tomorrow :-)

Collaborator

@sgugger sgugger left a comment


One last small comment and you should then run make fix-copies. Then we will be good to merge :-)

@datquocnguyen
Contributor Author

@sgugger I made a revision following your last comment. Thanks.
FYI, the two failed tests are not related to BartphoTokenizer.

@sgugger sgugger merged commit 3d587c5 into huggingface:master Oct 18, 2021
@sgugger
Collaborator

sgugger commented Oct 18, 2021

Thanks again for your contribution!


6 participants