don't break legacy behavior when enforced! #44626
[For maintainers] Suggested jobs to run (before merge): run-slow: llama
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
self._tokenizer.pre_tokenizer = None
self._tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Prepend(prepend="▁"), normalizers.Replace(pattern=" ", content="▁")]
)
```
Legacy normalizer ignores add_prefix_space setting
High Severity
The legacy branch unconditionally includes normalizers.Prepend(prepend="▁"), but the equivalent logic in LlamaConverter.normalizer() in convert_slow_tokenizer.py only adds Prepend when add_prefix_space is true. When legacy=True and add_prefix_space=False, this causes an extra "▁" to be prepended to every input, producing incorrect tokenization and a mismatch with the converter's behavior.
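A minimal pure-Python sketch of what the flagged behavior means (this is an illustration of the normalizer sequence's effect, not the `tokenizers` library itself; the function name `legacy_normalize` is hypothetical). It mirrors the conditional in `LlamaConverter.normalizer()`: `Prepend("▁")` should only apply when `add_prefix_space` is true, while `Replace(" ", "▁")` always applies.

```python
def legacy_normalize(text: str, add_prefix_space: bool) -> str:
    # Illustration only: reproduces the effect of the normalizer sequence.
    # Prepend "▁" only when add_prefix_space is set, matching the
    # converter's behavior rather than the unconditional legacy branch.
    if add_prefix_space:
        text = "▁" + text
    # Replace(pattern=" ", content="▁") applies in both cases.
    return text.replace(" ", "▁")
```

With `add_prefix_space=False`, the unconditional legacy branch would still emit a leading `"▁"`, which is the mismatch the review describes.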
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


What does this PR do?
Adds a missing branch.
I don't really know if this is worth it; I can't find a model online that sets the flag to `True`.