Fix + Test by LysandreJik · Pull Request #8049 · huggingface/transformers

LysandreJik · 2020-10-26T13:21:17Z

Fix an edge case of the blenderbot-90 tokenizer.

Context

If the blenderbot-90 tokenizer is used to tokenize the following sequence:

sequence = "Ok ."

It is first split into two tokens:

transformers/src/transformers/tokenization_blenderbot.py

Line 221 in 8bbe824

split_tokens.extend([t for t in self.bpe(token).split(" ")])

Those two tokens are ['Ok', '.'].

The issue is that, when passed the second token, the bpe method will convert it from '.' to ' .' here:

transformers/src/transformers/tokenization_blenderbot.py

Line 160 in 8bbe824

token = re.sub("([.,!?()])", r" \1", token)

This then gets split on spaces here:

transformers/src/transformers/tokenization_blenderbot.py

Line 166 in 8bbe824

tokens = token.split(" ")

This is where the issue lies, as it creates two strings: ["", "."], the first one being empty.
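The two steps quoted above can be reproduced in isolation. This is a minimal sketch that copies only the substitution and the split shown in the excerpts; it does not call the tokenizer itself, and the variable names `spaced` and `pieces` are illustrative.

```python
import re

token = "."  # second whitespace-delimited piece of "Ok ."

# Same substitution as the quoted line: insert a space before punctuation.
spaced = re.sub("([.,!?()])", r" \1", token)

# Splitting on a single space keeps the empty string that precedes
# the leading space.
pieces = spaced.split(" ")

print(repr(spaced))  # ' .'
print(pieces)        # ['', '.']
```

Note that `str.split(" ")` with an explicit separator preserves empty strings, unlike `str.split()` with no argument.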

It then crashes a few lines later, when we try to index the last character of the empty string:

transformers/src/transformers/tokenization_blenderbot.py

Line 171 in 8bbe824

word = tuple(list(word[:-1]) + [word[-1] + "</w>"])
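A short sketch of why that line fails on an empty piece. It assumes `word` holds the characters of the piece as a tuple, which is not shown in the excerpts above; the helper name `add_end_marker` is hypothetical.

```python
def add_end_marker(piece):
    # Assumption: the piece is turned into a tuple of characters first.
    word = tuple(piece)
    # Same expression as the quoted line: tag the last character with "</w>".
    return tuple(list(word[:-1]) + [word[-1] + "</w>"])

print(add_end_marker("."))  # ('.</w>',)

try:
    add_end_marker("")  # word is (), so word[-1] has nothing to index
    raised = False
except IndexError:
    raised = True

print(raised)  # True
```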

Proposal

Ensure that the token has a length > 0 before processing it; otherwise, skip that token.

Added a test.
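The guard can be illustrated with a simplified stand-in for the start of the bpe method. This is a sketch only: `mark_word_ends` is a hypothetical name, it performs no BPE merges or caching, and it omits any other preprocessing the real method may do. The assertions at the end show the kind of check a regression test could make; they are not the test added in this PR.

```python
import re

def mark_word_ends(token):
    """Hypothetical, simplified stand-in for the first steps of bpe()."""
    token = re.sub("([.,!?()])", r" \1", token)
    words = []
    for piece in token.split(" "):
        if not len(piece):
            continue  # skip the empty string left by a leading space
        chars = tuple(piece)
        words.append(tuple(list(chars[:-1]) + [chars[-1] + "</w>"]))
    return words

# A lone "." no longer raises; the empty piece is skipped.
assert mark_word_ends(".") == [(".</w>",)]
assert mark_word_ends("Ok") == [("O", "k</w>")]
```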

sshleifer

Great catch!

sshleifer · 2020-10-26T15:28:09Z

        tokens = token.split(" ")
        words = []
        for token in tokens:
+            if not len(token):


if not token also works

You're right!
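For strings, the two spellings of the guard agree, since an empty string is falsy and has length 0. A quick check, with illustrative sample values:

```python
samples = ["", ".", "Ok", " "]

# For each sample, compare the truthiness test with the explicit length test.
results = [(not s) == (not len(s)) for s in samples]

print(all(results))  # True
```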

Fix + Test

dfebed4

LysandreJik requested a review from sshleifer October 26, 2020 13:21

sshleifer approved these changes Oct 26, 2020


LysandreJik merged commit cbad90d into master Oct 26, 2020

LysandreJik deleted the fix-blenderbot-90-tokenizer branch October 26, 2020 16:32
