
[Tokenization] fix edge case for bert tokenization#3517

Merged
LysandreJik merged 3 commits into huggingface:master from patrickvonplaten:fix_edge_case_for_bert_tokenization
Apr 7, 2020
Conversation

@patrickvonplaten
Contributor

This PR fixes #3502.

The tests in #3502 fail because of an edge case.

If an input to tokenizer.batch_encode_plus() is a pre-tokenized sentence that happens to be a list of exactly two strings ([[16], [.]] in issue #3502), it is treated as a pair of input sequences (=> [CLS] input_sequence_1 [SEP] input_sequence_2 [SEP]). That behavior should only apply when the input list consists of two untokenized strings.
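
For concreteness, a minimal sketch of the edge case, reconstructed from the description above (the two tokens mirror the ones mentioned from issue #3502; the pre-fix output in the comments is inferred from this PR's description, not re-run):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A batch containing one pre-tokenized sentence that happens to have exactly two tokens.
batch = [["16", "."]]

encoded = tokenizer.batch_encode_plus(batch)

# Intended segmentation:      [CLS] 16 . [SEP]
# Behavior before this fix:   the two-element list was read as a
# (sequence_1, sequence_2) pair -> [CLS] 16 [SEP] . [SEP], i.e. an extra [SEP].
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```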

@patrickvonplaten
Contributor Author

@mfuntowicz @n1t0 @LysandreJik - could you check? :-)

@patrickvonplaten changed the title from "fix egde gase for bert tokenization" to "[Tokenization] fix egde gase for bert tokenization" on Mar 30, 2020
Review comment thread on src/transformers/tokenization_utils.py (outdated)
@n1t0
Contributor

n1t0 commented Mar 31, 2020

Does this mean that batch_encode_plus is supposed to handle "pre-tokenized" inputs? I thought this was something introduced by #3185 with a specific flag is_pretokenized (cc @mfuntowicz)

@sshleifer changed the title from "[Tokenization] fix egde gase for bert tokenization" to "[Tokenization] fix edge gase for bert tokenization" on Mar 31, 2020
@sshleifer changed the title from "[Tokenization] fix edge gase for bert tokenization" to "[Tokenization] fix edge case for bert tokenization" on Mar 31, 2020
@patrickvonplaten force-pushed the fix_edge_case_for_bert_tokenization branch from 5953279 to 85567cf on April 7, 2020 16:23
@codecov-io

codecov-io commented Apr 7, 2020

Codecov Report

Merging #3517 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #3517      +/-   ##
==========================================
+ Coverage   78.03%   78.05%   +0.01%     
==========================================
  Files         104      104              
  Lines       17708    17709       +1     
==========================================
+ Hits        13819    13822       +3     
+ Misses       3889     3887       -2     
Impacted Files Coverage Δ
src/transformers/tokenization_utils.py 85.78% <100.00%> (+0.01%) ⬆️
src/transformers/modeling_utils.py 92.23% <0.00%> (+0.12%) ⬆️
src/transformers/modeling_tf_utils.py 93.28% <0.00%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5aa8a27...3bde162.

@patrickvonplaten
Contributor Author

> Does this mean that batch_encode_plus is supposed to handle "pre-tokenized" inputs? I thought this was something introduced by #3185 with a specific flag is_pretokenized (cc @mfuntowicz)

@mfuntowicz showed me the is_pretokenized flag for tokenizers v3.0.0, so this makes everything much easier.
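
A rough sketch of how an explicit flag removes the ambiguity (the is_pretokenized keyword is the one referenced above; its exact availability and naming depend on the installed transformers/tokenizers version):

```python
# With an explicit flag there is no need to guess whether a two-element list
# is a pre-tokenized sentence or a (text, text_pair) pair.
encoded = tokenizer.batch_encode_plus(
    [["16", "."]],
    is_pretokenized=True,  # tokens of ONE sentence, not a sentence pair
)
# Expected: [CLS] 16 . [SEP], with no extra [SEP]
```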

Member

@LysandreJik left a comment


Woa that flag is useful!

@LysandreJik merged commit b0ad069 into huggingface:master on Apr 7, 2020


Development

Successfully merging this pull request may close these issues.

Bert Batch Encode Plus adding an extra [SEP]

5 participants