
don't break legacy behavior when enforced! #44626

Open

ArthurZucker wants to merge 1 commit into main from fix-tokenizer-legacy

Conversation

@ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Mar 12, 2026

What does this PR do?

Adds a missing branch.
I'm not sure this is worth it; I can't find a model online that enforces the flag to True.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama


@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

self._tokenizer.pre_tokenizer = None
self._tokenizer.normalizer = normalizers.Sequence(
[normalizers.Prepend(prepend="▁"), normalizers.Replace(pattern=" ", content="▁")]
)

Legacy normalizer ignores add_prefix_space setting

High Severity

The legacy branch unconditionally includes normalizers.Prepend(prepend="▁"), but the equivalent logic in LlamaConverter.normalizer() in convert_slow_tokenizer.py only adds Prepend when add_prefix_space is true. When legacy=True and add_prefix_space=False, this causes an extra "▁" to be prepended to every input, producing incorrect tokenization and a mismatch with the converter's behavior.
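The gating Bugbot describes can be sketched in pure Python. This is a minimal illustration of the intended normalizer semantics, not the actual `tokenizers` code: `legacy_normalize` is a hypothetical helper that mimics what `normalizers.Prepend(prepend="▁")` followed by `normalizers.Replace(pattern=" ", content="▁")` would produce, with `Prepend` applied only when `add_prefix_space` is true, as in `LlamaConverter.normalizer()`.

```python
def legacy_normalize(text: str, add_prefix_space: bool) -> str:
    """Sketch of the legacy SentencePiece-style normalizer sequence.

    Mirrors the converter's behavior: Prepend("▁") only when
    add_prefix_space is set, then Replace(" " -> "▁") unconditionally.
    Hypothetical helper; the real code builds a
    tokenizers.normalizers.Sequence instead.
    """
    if add_prefix_space:
        # Only prepend the word-boundary marker when requested;
        # doing it unconditionally is the bug being reported.
        text = "▁" + text
    return text.replace(" ", "▁")


print(legacy_normalize("hello world", add_prefix_space=True))   # ▁hello▁world
print(legacy_normalize("hello world", add_prefix_space=False))  # hello▁world
```

With `add_prefix_space=False`, the unconditional version would still emit a leading `"▁"`, which is the mismatch with the converter that the comment flags.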


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

