
[Tokenization] fix edge case for bert tokenization#3517

Merged
LysandreJik merged 3 commits into huggingface:master from patrickvonplaten:fix_edge_case_for_bert_tokenization
Apr 7, 2020
Conversation

@patrickvonplaten
Contributor

This PR fixes #3502.

The tests in #3502 fail because of an edge case.

If an input to tokenizer.batch_encode_plus() is a pre-tokenized sentence that happens to be a list of exactly two strings ([[16], [.]] in issue #3502), it is treated as a pair of input sequences (=> [CLS] input_sequence_1 [SEP] input_sequence_2 [SEP]). That behavior should only apply when the input list consists of two untokenized strings.
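
For concreteness, a minimal sketch of the edge case, reconstructed from the description above (the two tokens mirror the ones mentioned from issue #3502; the pre-fix output in the comments is inferred from this PR's description, not re-run):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A batch containing one pre-tokenized sentence that happens to have exactly two tokens.
batch = [["16", "."]]

encoded = tokenizer.batch_encode_plus(batch)

# Intended segmentation:      [CLS] 16 . [SEP]
# Behavior before this fix:   the two-element list was read as a
# (sequence_1, sequence_2) pair -> [CLS] 16 [SEP] . [SEP], i.e. an extra [SEP].
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```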

@patrickvonplaten
Contributor Author

@mfuntowicz @n1t0 @LysandreJik - could you check? :-)

@patrickvonplaten changed the title from "fix egde gase for bert tokenization" to "[Tokenization] fix egde gase for bert tokenization" on Mar 30, 2020
Review comment thread on src/transformers/tokenization_utils.py (outdated)
@n1t0
Contributor

n1t0 commented Mar 31, 2020

Does this mean that batch_encode_plus is supposed to handle "pre-tokenized" inputs? I thought this was something introduced by #3185 with a specific flag is_pretokenized (cc @mfuntowicz)

@sshleifer changed the title from "[Tokenization] fix egde gase for bert tokenization" to "[Tokenization] fix edge gase for bert tokenization" on Mar 31, 2020
@sshleifer changed the title from "[Tokenization] fix edge gase for bert tokenization" to "[Tokenization] fix edge case for bert tokenization" on Mar 31, 2020
@patrickvonplaten force-pushed the fix_edge_case_for_bert_tokenization branch from 5953279 to 85567cf on April 7, 2020 16:23
@codecov-io

codecov-io commented Apr 7, 2020

Codecov Report

Merging #3517 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #3517      +/-   ##
==========================================
+ Coverage   78.03%   78.05%   +0.01%     
==========================================
  Files         104      104              
  Lines       17708    17709       +1     
==========================================
+ Hits        13819    13822       +3     
+ Misses       3889     3887       -2     
Impacted Files Coverage Δ
src/transformers/tokenization_utils.py 85.78% <100.00%> (+0.01%) ⬆️
src/transformers/modeling_utils.py 92.23% <0.00%> (+0.12%) ⬆️
src/transformers/modeling_tf_utils.py 93.28% <0.00%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5aa8a27...3bde162.

@patrickvonplaten
Contributor Author

> Does this mean that batch_encode_plus is supposed to handle "pre-tokenized" inputs? I thought this was something introduced by #3185 with a specific flag is_pretokenized (cc @mfuntowicz)

@mfuntowicz showed me the is_pretokenized flag for tokenizers v3.0.0, so this makes everything much easier.
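
A rough sketch of how an explicit flag removes the ambiguity (the is_pretokenized keyword is the one referenced above; its exact availability and naming depend on the installed transformers/tokenizers version):

```python
# With an explicit flag there is no need to guess whether a two-element list
# is a pre-tokenized sentence or a (text, text_pair) pair.
encoded = tokenizer.batch_encode_plus(
    [["16", "."]],
    is_pretokenized=True,  # tokens of ONE sentence, not a sentence pair
)
# Expected: [CLS] 16 . [SEP], with no extra [SEP]
```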

Member

@LysandreJik left a comment


Woa that flag is useful!

@LysandreJik merged commit b0ad069 into huggingface:master on Apr 7, 2020


Development

Successfully merging this pull request may close these issues.

Bert Batch Encode Plus adding an extra [SEP]

5 participants