[Bug Report] tokenize_and_concatenate doesn't tokenize correctly. #1133
Open
Labels
bug, complexity-high, high-priority, minor, refactor, seen_by_maintainers
Describe the bug
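tokenize_and_concatenate slices the input text into 20 chunks by character count before tokenizing. A chunk boundary can fall in the middle of what would otherwise be a single token, so the two halves are tokenized separately. This produces token pairs that would never occur if the same text were tokenized whole.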
Code example
I don't have an example, I found this while debugging a larger project.
In a debugger, inside that method, after tokenizing and dropping padding tokens, I observed the following:
I think it's obvious from code inspection what the problem is.
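For illustration, here is a minimal sketch of the failure mode. It does not call tokenize_and_concatenate or any real tokenizer. The vocabulary, `toy_tokenize`, and `chunk_by_chars` are hypothetical stand-ins, and the chunk-length formula is an assumption rather than a copy of the library's code. It only shows that splitting by character before tokenizing can yield a different token sequence than tokenizing the whole string.

```python
# Toy illustration only: a hypothetical greedy longest-match tokenizer,
# not the library's tokenizer or its chunking code.
VOCAB = ["hello", "world", "hel", "lo", "wor", "ld", " "] + list("abcdefghijklmnopqrstuvwxyz")


def toy_tokenize(text):
    """Greedy longest-match tokenization over VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


def chunk_by_chars(text, num_chunks):
    """Split text into num_chunks pieces of roughly equal character length."""
    chunk_len = (len(text) - 1) // num_chunks + 1
    return [text[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]


text = "hello world"

# Tokenizing the whole string keeps "hello" and "world" intact.
whole = toy_tokenize(text)

# Chunking by character first cuts both words, then tokenizes the pieces.
chunked = [tok for chunk in chunk_by_chars(text, 4) for tok in toy_tokenize(chunk)]

print(whole)    # ['hello', ' ', 'world']
print(chunked)  # ['hel', 'lo', ' ', 'wor', 'ld']
```

The concatenated tokens still spell the original text, but pairs such as `hel` followed by `lo` only appear because of the character split. One possible direction, offered as a suggestion rather than the maintainers' plan, would be to split on boundaries that cannot fall inside a token, such as the separators between documents.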
Checklist