
Add regression test for ByteLevel added-token Unicode decode corruption #45062

Closed
ErenAta16 wants to merge 9 commits into huggingface:main from ErenAta16:bugfix/45051-bytelevel-added-token-unicode

Conversation

@ErenAta16

This PR adds a regression test for Unicode corruption when decoding added_tokens with ByteLevel tokenizers (e.g. GPT-2 family).

In affected cases, characters such as č, ć, đ can decode into control characters (\r, \x07, \x11) after being added via add_tokens.

Fixes huggingface/tokenizers#1996
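The specific characters are no accident: ByteLevel tokenizers map every byte to a printable stand-in character, and 'č' (U+010D), 'ć' (U+0107), and 'đ' (U+0111) are exactly the stand-ins for bytes 0x0D, 0x07, and 0x11. A decoder that sees these characters raw in an added token reverse-maps them to control bytes. A minimal sketch of this collision (the table construction mirrors GPT-2's reference `bytes_to_unicode` helper; `bytelevel_decode` is a hypothetical simplification of the decoder's inverse mapping, not the tokenizers implementation):

```python
def bytelevel_alphabet():
    # Byte -> printable-unicode table, same construction as GPT-2's
    # bytes_to_unicode(): printable bytes map to themselves, all other
    # bytes are shifted into U+0100.. so every byte has a visible stand-in.
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

CHAR_TO_BYTE = {c: b for b, c in bytelevel_alphabet().items()}

def bytelevel_decode(token: str) -> str:
    # Hypothetical simplification of the ByteLevel decode step:
    # reverse-map each character to its byte, then re-decode as UTF-8.
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8", errors="replace")

# An added token stored as raw text gets reverse-mapped on decode:
for ch in "čćđ":
    print(repr(ch), "->", repr(bytelevel_decode(ch)))
# 'č' -> '\r', 'ć' -> '\x07', 'đ' -> '\x11'
```

This is why the corruption only hits added tokens: ordinary vocabulary entries are already stored in the mapped alphabet, so the inverse map recovers their original bytes.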

What this PR does

  • Adds a GPT-2 regression test in tests/models/gpt2/test_tokenization_gpt2.py:
    • test_added_tokens_unicode_roundtrip_with_bytelevel
  • The test validates roundtrip behavior for:
    • Začnimo
    • kuća
    • međa
  • Marks the test as xfail because the underlying issue appears to be in huggingface/tokenizers ByteLevel decode behavior, not in model-specific Transformers logic.
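Under that inverse mapping, each of the three roundtrip strings picks up exactly one control character. A pure-Python simulation of the failing roundtrip (the table is the same GPT-2-style byte-to-unicode construction as the reference `bytes_to_unicode`; `faulty_decode` is a hypothetical stand-in for the decoder behavior the xfail test locks in, not the tokenizers code path):

```python
def bytelevel_alphabet():
    # GPT-2-style byte -> printable-unicode table (as in bytes_to_unicode).
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

CHAR_TO_BYTE = {c: b for b, c in bytelevel_alphabet().items()}

def faulty_decode(token: str) -> str:
    # Simulate decoding an added token that was stored as raw text.
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8", errors="replace")

for word in ["Začnimo", "kuća", "međa"]:
    out = faulty_decode(word)
    print(f"{word!r} -> {out!r}  roundtrip ok: {out == word}")
# 'Začnimo' -> 'Za\rnimo'  roundtrip ok: False
# 'kuća'    -> 'ku\x07a'   roundtrip ok: False
# 'međa'    -> 'me\x11a'   roundtrip ok: False
```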

Why xfail?

The corruption reproduces consistently, but the root cause is in the tokenizers backend decode path.
This test documents the bug at the Transformers level and locks in the expected failure until the upstream tokenizers fix is available.

Test command

  • python -m pytest tests/models/gpt2/test_tokenization_gpt2.py -k "added_tokens_unicode_roundtrip_with_bytelevel" -q

Add a GPT-2 regression test that captures added-token Unicode decode corruption with ByteLevel tokenizers, and mark it xfail while the underlying tokenizers-layer fix is pending.

Made-with: Cursor
Collaborator

@ArthurZucker ArthurZucker left a comment


this does not fix anything 😅

Comment on lines +94 to +95
tokenizer_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer_slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
Collaborator


fast and slow don't exist anymore

ErenAta16 and others added 5 commits on March 27, 2026 at 18:32
Encode newly added tokens through the ByteLevel unicode alphabet when the backend uses a ByteLevel pre-tokenizer and decoder without a normalizer, preventing control-character corruption on decode. Add a GPT-2 regression test to validate unicode roundtrip for added tokens.

Made-with: Cursor
Apply ByteLevel encoding to newly added tokens whenever tokenizer decoding uses ByteLevel but normalization does not, covering setups like Qwen (NFC normalizer + ByteLevel pre-tokenizer/decoder) and preventing unicode-to-control-character corruption on decode.

Made-with: Cursor
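The fix idea described in the commits above, sketched in pure Python: pass a new added token's UTF-8 bytes through the same byte-to-unicode alphabet that ordinary vocabulary entries use, so the decoder's inverse map recovers the original text. The table construction mirrors the reference `bytes_to_unicode`; the encode/decode helpers are hypothetical simplifications, not the tokenizers implementation:

```python
def bytelevel_alphabet():
    # GPT-2-style byte -> printable-unicode table (as in bytes_to_unicode).
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_TO_CHAR = bytelevel_alphabet()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def bytelevel_encode(token: str) -> str:
    # Store the token in the mapped alphabet, the same transformation
    # ordinary vocabulary entries go through before decoding.
    return "".join(BYTE_TO_CHAR[b] for b in token.encode("utf-8"))

def bytelevel_decode(stored: str) -> str:
    # Inverse map: characters back to bytes, then UTF-8 decode.
    return bytes(CHAR_TO_BYTE[c] for c in stored).decode("utf-8")

for word in ["Začnimo", "kuća", "međa"]:
    stored = bytelevel_encode(word)
    assert bytelevel_decode(stored) == word  # roundtrip now succeeds
    print(f"{word!r} stored as {stored!r}")
```

With the forward encoding applied at `add_tokens` time, 'č' is stored as its two mapped UTF-8 bytes rather than as a raw character, so the decoder's inverse map no longer mistakes it for byte 0x0D.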
Remove a stale type ignore in generation utils and clean formatting/import ordering so check_code_quality passes on the PR branch.

Made-with: Cursor
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45062&sha=a83162

Restore dist.all_reduce line to match upstream main so check_code_quality stays aligned with the type checker configuration.

Made-with: Cursor
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt2

CI check_code_quality runs `ruff format --check`; collapse the multi-line if to match formatter output.

Made-with: Cursor
…ssion test

- Return early for special_tokens before optional ByteLevel vocabulary encoding.
- Load GPT-2 via from_pretrained_id and document huggingface#45051 in the test docstring.

Made-with: Cursor


Development

Successfully merging this pull request may close these issues.

[Bug] Fast Tokenizer (ByteLevel BPE) incorrectly decodes added_tokens containing specific Unicode characters (e.g., 'č' becomes '\r')

2 participants