
Add BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese#13788

Merged
sgugger merged 31 commits into huggingface:master from datquocnguyen:master
Oct 18, 2021
Conversation

@datquocnguyen
Contributor

What does this PR do?

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Could you please have a look: @patrickvonplaten, @patil-suraj? Thanks.

Contributor

@patil-suraj patil-suraj left a comment


Thanks a lot for adding this, great to have BARTpho integrated into Transformers!

I have left a few comments below, let me know if something is not clear, happy to help :)

Also, AFAIK it's not possible to add a fast version of this type of tokenizer, where sentencepiece is used only for tokenization and a different vocab file is used for id-to-token and token-to-id conversion. But @SaulLu, @n1t0 would know more :)
(cc @LysandreJik @sgugger)

Collaborator

@sgugger sgugger left a comment


I wonder how this tokenizer exactly differs from a classic sentencepiece tokenizer (like BART)? Could we also add the fast version?

datquocnguyen and others added 7 commits October 1, 2021 11:14
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@datquocnguyen
Contributor Author

datquocnguyen commented Oct 1, 2021

Thanks, @patil-suraj and @sgugger
I fixed the pull request based on your comments.

The same comment I got from both of you concerns the vocab_file. Here is a summary:

  • I did not train a sentencepiece model for Vietnamese.
  • bartpho-syllable employs the existing pre-trained sentencepiece model from XLMRoBERTaTokenizer, and this pre-trained sentencepiece model is referred to as a vocab_file of 250K types.
  • reduced_vocab_file is a vocab containing 40K Vietnamese-specific types extracted from the XLMRoBERTaTokenizer vocab of 250K types.

Use case of BartphoTokenizer: other languages can thus simply reuse BartphoTokenizer with their own reduced_vocab_file. The goal here is to reduce the model sizes of existing pre-trained XLM-RoBERTa/mBART models when applying them to a smaller set of languages instead of the whole 50/100 languages.
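As a hedged sketch of this two-file setup (the dicts and toy pieces below are invented for illustration, not the actual BartphoTokenizer code): the full sentencepiece model handles tokenization, while the reduced vocab handles token-to-id conversion, with pieces outside the reduced vocab falling back to <unk>:

```python
# Toy stand-in for XLM-R's 250K-piece sentencepiece vocab.
full_vocab = {"<s>": 0, "</s>": 1, "<unk>": 2, "xin": 3, "chào": 4, "hello": 5, "bonjour": 6}

# Reduced vocab: only the pieces needed for the target language
# (the real reduced_vocab_file keeps ~40K Vietnamese-specific types).
reduced_pieces = ["<s>", "</s>", "<unk>", "xin", "chào"]
reduced_vocab = {piece: idx for idx, piece in enumerate(reduced_pieces)}

def token_to_id(token: str) -> int:
    # Pieces missing from the reduced vocab map to <unk>,
    # so the embedding matrix can shrink to the reduced size.
    return reduced_vocab.get(token, reduced_vocab["<unk>"])

print([token_to_id(t) for t in ["xin", "chào", "bonjour"]])  # [3, 4, 2]
```

Swapping in a different reduced_vocab_file is what would let another language reuse the same tokenizer class, as described above.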

datquocnguyen and others added 4 commits October 1, 2021 16:02
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>

if token_ids_1 is None:
    return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
Contributor


This looks like BERT or RoBERTa; I don't think BART makes use of the CLS token, does it?
I think it would be better to write the function more like it is in T5:

return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
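A toy side-by-side of the two layouts under discussion (my own sketch with placeholder token ids, not the PR's code) shows they differ only in which special tokens are counted; both return all zeros:

```python
cls, sep, eos = [0], [2], [2]  # placeholder special-token id lists

def roberta_style(token_ids_0, token_ids_1=None):
    # RoBERTa/BART layout: <s> A </s> </s> B </s>
    if token_ids_1 is None:
        return len(cls + token_ids_0 + sep) * [0]
    return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

def t5_style(token_ids_0, token_ids_1=None):
    # T5 layout: A </s> B </s>  (no CLS, single separator)
    if token_ids_1 is None:
        return len(token_ids_0 + eos) * [0]
    return len(token_ids_0 + eos + token_ids_1 + eos) * [0]

print(roberta_style([10, 11], [12]))  # 7 zeros
print(t5_style([10, 11], [12]))       # 5 zeros
```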

Contributor


With the following statement:

Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make

Contributor


Bart is a bad example as it inherits from RoBERTa

Contributor Author

@datquocnguyen datquocnguyen Oct 12, 2021


The BartphoTokenizer also looks like BarthezTokenizer, as both inherit from BartTokenizer (RobertaTokenizer). And BARThez also got approved.

if token_ids_1 is None:
    return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

  • When BartphoTokenizer is written following RobertaTokenizer, the transformers BARTpho produces the same feature outputs as its fairseq BARTpho counterpart. This is exactly what I expected, since I originally trained BARTpho using fairseq and then converted it into transformers.

  • When I wrote BartphoTokenizer following T5Tokenizer as you suggested, the transformers and fairseq BARTpho variants produced different feature outputs given the same input text.

I am not sure why it's a bad example. Any feedback on this, @patrickvonplaten? Thanks.
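The equivalence check described above can be sketched roughly as follows (the feature values and helper are placeholders I made up, not the PR's actual verification script):

```python
def features_match(a, b, tol=1e-5):
    # Element-wise comparison of two flat feature vectors within a tolerance,
    # standing in for comparing fairseq vs converted-transformers outputs.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

fairseq_features = [0.12, -0.98, 1.05]       # placeholder fairseq BARTpho output
transformers_features = [0.12, -0.98, 1.05]  # placeholder converted-checkpoint output
print(features_match(fairseq_features, transformers_features))  # True
```

With the RoBERTa-style tokenizer the two stacks agreed; with the T5-style one they did not, which is the crux of the disagreement above.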

Contributor

@SaulLu SaulLu left a comment


Thank you so much for this addition and your work @datquocnguyen !

Concerning fast tokenizers, indeed I don't think they were designed to support this kind of behavior. It might be possible to find a trick to make it work, but unfortunately I don't see one right now.

@LysandreJik LysandreJik self-requested a review October 12, 2021 12:20
@datquocnguyen
Contributor Author

@sgugger , @LysandreJik, @patil-suraj , @SaulLu and @patrickvonplaten
Could you please have a look and provide your feedback on my recent changes? Thanks.

Member

@LysandreJik LysandreJik left a comment


This looks good to me @datquocnguyen, thank you for adding another tokenizer alongside BERTweet and PhoBERT!

Too bad that there can be no fast tokenizer, but this LGTM either way!

@datquocnguyen
Contributor Author

Thanks @LysandreJik

My pull request suddenly failed the run_tests_torch_and_flax check:

FAILED tests/test_modeling_flax_clip.py::FlaxCLIPModelTest::test_equivalence_flax_to_pt
FAILED tests/test_modeling_flax_clip.py::FlaxCLIPModelTest::test_equivalence_pt_to_flax

They are out of my control and not related to BartphoTokenizer.

@sgugger
Collaborator

sgugger commented Oct 18, 2021

Yes, this is a problem unrelated to this PR, so you can ignore those failures. They should be fixed tomorrow :-)

Collaborator

@sgugger sgugger left a comment


One last small comment and you should then run make fix-copies. Then we will be good to merge :-)

@datquocnguyen
Contributor Author

@sgugger I made a revision following your last comment. Thanks.
FYI, the two failed tests are not related to BartphoTokenizer.

@sgugger sgugger merged commit 3d587c5 into huggingface:master Oct 18, 2021
@sgugger
Collaborator

sgugger commented Oct 18, 2021

Thanks again for your contribution!


6 participants