
[Bug Report] tokenize_and_concatenate doesn't tokenize correctly. #1133

@BorisTheBrave

Description

Describe the bug

tokenize_and_concatenate slices the full text into 20 chunks by character offset before tokenizing. A cut can land in the middle of a token, so the tokens on either side of a chunk boundary form pairs that would never occur when the text is tokenized as a whole.

Code example
I don't have a standalone example; I found this while debugging a larger project.
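As a stand-in, the failure mode can be reproduced with a toy greedy longest-match tokenizer (the vocabulary and helper below are illustrative only, not the real GPT-2 tokenizer or TransformerLens code):

```python
# Toy vocabulary, chosen to mirror the ' Mil' / ' M' + 'il' case below.
VOCAB = ["itary", " Mil", " the", " Ne", " on", " M", "il", "on", "t"]

def toy_tokenize(text):
    """Greedily match the longest vocabulary piece at each position."""
    tokens, i = [], 0
    pieces = sorted(VOCAB, key=len, reverse=True)
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

text = " on the Military Ne"
whole = toy_tokenize(text)

# Character-offset chunking (as in tokenize_and_concatenate) can cut ' Mil':
mid = 9  # boundary falls between 'M' and 'ilitary'
chunked = toy_tokenize(text[:mid]) + toy_tokenize(text[mid:])

print(whole)    # [' on', ' the', ' Mil', 'itary', ' Ne']
print(chunked)  # [' on', ' the', ' M', 'il', 'itary', ' Ne']
# The pair (' M', 'il') appears only because the chunk cut severed ' Mil'.
```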

In a debugger in that method after tokenizing and dropping padding tokens I observed the following:

> chunks[2][-10:], chunks[3][:10]
('t on the M', 'ilitary Ne')
> tokenizer.decode([4460])
' Mil'
> tokenizer.decode([337])
' M'
> tokenizer.decode([346])
'il'
> np.where((tokens[:-1] == 337) & (tokens[1:] == 346))[0]
array([79848])
> tokens[79848:79848+2]
array([337, 346]) #  SHOULD NEVER OCCUR

I think the problem is clear from inspecting the method: the string is sliced at fixed character offsets before tokenization, so a token can straddle a chunk boundary.
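One possible direction for a fix (a sketch only, not an agreed design) is to move each cut forward to the next space so that chunks break on word boundaries, which keeps word-initial tokens like ' Mil' intact. Multi-word tokens could in principle still be severed, so tokenizing before chunking would be the more robust option; this sketch only illustrates the boundary-respecting variant:

```python
def chunk_on_whitespace(text, num_chunks=20):
    """Split text into roughly equal chunks, advancing each cut to the
    next space so no word is severed mid-character. Illustrative sketch;
    not the tokenize_and_concatenate implementation."""
    approx = max(1, len(text) // num_chunks)
    chunks, start = [], 0
    while start < len(text):
        end = min(start + approx, len(text))
        # advance the cut until it lands on a space (or end of text)
        while end < len(text) and text[end] != " ":
            end += 1
        chunks.append(text[start:end])
        start = end
    return [c for c in chunks if c]

chunks = chunk_on_whitespace("hello world foo bar", num_chunks=4)
print(chunks)  # ['hello', ' world', ' foo', ' bar']
# Every chunk after the first starts at a space, so a GPT-2-style
# space-prefixed token is never split across chunks.
```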

Checklist

  • I have checked that there is no similar issue in the repo (required)

Metadata


    Labels

      • bug: Something isn't working
      • complexity-high: Very complicated changes for people to address who are quite familiar with the code
      • high-priority: Maintainers are interested in these issues being solved before others
      • minor: Release a minor version
      • refactor: Changing something with the code that will either affect external user, or contributors
      • seen_by_maintainers: Confirms that a maintainer is aware of this card.
