
[Bug Report] tokenize_and_concatenate doesn't tokenize correctly. #1133

@BorisTheBrave

Description

Describe the bug

tokenize_and_concatenate slices the full text into 20 chunks by character offset before tokenizing. A cut can land in the middle of a token, so the tokens on either side of a chunk boundary form pairs that would never occur when the text is tokenized as a whole.

Code example
I don't have a standalone example; I found this while debugging a larger project.
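As a stand-in, the failure mode can be reproduced with a toy greedy longest-match tokenizer (the vocabulary and helper below are illustrative only, not the real GPT-2 tokenizer or TransformerLens code):

```python
# Toy vocabulary, chosen to mirror the ' Mil' / ' M' + 'il' case below.
VOCAB = ["itary", " Mil", " the", " Ne", " on", " M", "il", "on", "t"]

def toy_tokenize(text):
    """Greedily match the longest vocabulary piece at each position."""
    tokens, i = [], 0
    pieces = sorted(VOCAB, key=len, reverse=True)
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

text = " on the Military Ne"
whole = toy_tokenize(text)

# Character-offset chunking (as in tokenize_and_concatenate) can cut ' Mil':
mid = 9  # boundary falls between 'M' and 'ilitary'
chunked = toy_tokenize(text[:mid]) + toy_tokenize(text[mid:])

print(whole)    # [' on', ' the', ' Mil', 'itary', ' Ne']
print(chunked)  # [' on', ' the', ' M', 'il', 'itary', ' Ne']
# The pair (' M', 'il') appears only because the chunk cut severed ' Mil'.
```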

In a debugger in that method after tokenizing and dropping padding tokens I observed the following:

> chunks[2][-10:], chunks[3][:10]
('t on the M', 'ilitary Ne')
> tokenizer.decode([4460])
' Mil'
> tokenizer.decode([337])
' M'
> tokenizer.decode([346])
'il'
> np.where((tokens[:-1] == 337) & (tokens[1:] == 346))[0]
array([79848])
> tokens[79848:79848+2]
array([337, 346]) #  SHOULD NEVER OCCUR

I think the problem is clear from inspecting the method: the string is sliced at fixed character offsets before tokenization, so a token can straddle a chunk boundary.
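One possible direction for a fix (a sketch only, not an agreed design) is to move each cut forward to the next space so that chunks break on word boundaries, which keeps word-initial tokens like ' Mil' intact. Multi-word tokens could in principle still be severed, so tokenizing before chunking would be the more robust option; this sketch only illustrates the boundary-respecting variant:

```python
def chunk_on_whitespace(text, num_chunks=20):
    """Split text into roughly equal chunks, advancing each cut to the
    next space so no word is severed mid-character. Illustrative sketch;
    not the tokenize_and_concatenate implementation."""
    approx = max(1, len(text) // num_chunks)
    chunks, start = [], 0
    while start < len(text):
        end = min(start + approx, len(text))
        # advance the cut until it lands on a space (or end of text)
        while end < len(text) and text[end] != " ":
            end += 1
        chunks.append(text[start:end])
        start = end
    return [c for c in chunks if c]

chunks = chunk_on_whitespace("hello world foo bar", num_chunks=4)
print(chunks)  # ['hello', ' world', ' foo', ' bar']
# Every chunk after the first starts at a space, so a GPT-2-style
# space-prefixed token is never split across chunks.
```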

Checklist

  • I have checked that there is no similar issue in the repo (required)

Metadata


    Labels

      • bug: Something isn't working
      • complexity-high: Very complicated changes for people to address who are quite familiar with the code
      • high-priority: Maintainers are interested in these issues being solved before others
      • minor: Release a minor version
      • refactor: Changing something with the code that will either affect external user, or contributors
      • seen_by_maintainers: Confirms that a maintainer is aware of this card.
