
Add regression test for ByteLevel added-token Unicode decode corruption #45062

Closed
ErenAta16 wants to merge 9 commits into huggingface:main from ErenAta16:bugfix/45051-bytelevel-added-token-unicode

Conversation

@ErenAta16

This PR adds a regression test for Unicode corruption when decoding added_tokens with ByteLevel tokenizers (e.g. GPT-2 family).

In affected cases, characters such as č, ć, đ can decode into control characters (\r, \x07, \x11) after being added via add_tokens.

Fixes huggingface/tokenizers#1996
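The specific characters are no accident: ByteLevel tokenizers map every byte to a printable stand-in character, and 'č' (U+010D), 'ć' (U+0107), and 'đ' (U+0111) are exactly the stand-ins for bytes 0x0D, 0x07, and 0x11. A decoder that sees these characters raw in an added token reverse-maps them to control bytes. A minimal sketch of this collision (the table construction mirrors GPT-2's reference `bytes_to_unicode` helper; `bytelevel_decode` is a hypothetical simplification of the decoder's inverse mapping, not the tokenizers implementation):

```python
def bytelevel_alphabet():
    # Byte -> printable-unicode table, same construction as GPT-2's
    # bytes_to_unicode(): printable bytes map to themselves, all other
    # bytes are shifted into U+0100.. so every byte has a visible stand-in.
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

CHAR_TO_BYTE = {c: b for b, c in bytelevel_alphabet().items()}

def bytelevel_decode(token: str) -> str:
    # Hypothetical simplification of the ByteLevel decode step:
    # reverse-map each character to its byte, then re-decode as UTF-8.
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8", errors="replace")

# An added token stored as raw text gets reverse-mapped on decode:
for ch in "čćđ":
    print(repr(ch), "->", repr(bytelevel_decode(ch)))
# 'č' -> '\r', 'ć' -> '\x07', 'đ' -> '\x11'
```

This is why the corruption only hits added tokens: ordinary vocabulary entries are already stored in the mapped alphabet, so the inverse map recovers their original bytes.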

What this PR does

  • Adds a GPT-2 regression test in tests/models/gpt2/test_tokenization_gpt2.py:
    • test_added_tokens_unicode_roundtrip_with_bytelevel
  • The test validates roundtrip behavior for:
    • Začnimo
    • kuća
    • međa
  • Marks the test as xfail because the underlying issue appears to be in huggingface/tokenizers ByteLevel decode behavior, not in model-specific Transformers logic.
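Under that inverse mapping, each of the three roundtrip strings picks up exactly one control character. A pure-Python simulation of the failing roundtrip (the table is the same GPT-2-style byte-to-unicode construction as the reference `bytes_to_unicode`; `faulty_decode` is a hypothetical stand-in for the decoder behavior the xfail test locks in, not the tokenizers code path):

```python
def bytelevel_alphabet():
    # GPT-2-style byte -> printable-unicode table (as in bytes_to_unicode).
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

CHAR_TO_BYTE = {c: b for b, c in bytelevel_alphabet().items()}

def faulty_decode(token: str) -> str:
    # Simulate decoding an added token that was stored as raw text.
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8", errors="replace")

for word in ["Začnimo", "kuća", "međa"]:
    out = faulty_decode(word)
    print(f"{word!r} -> {out!r}  roundtrip ok: {out == word}")
# 'Začnimo' -> 'Za\rnimo'  roundtrip ok: False
# 'kuća'    -> 'ku\x07a'   roundtrip ok: False
# 'međa'    -> 'me\x11a'   roundtrip ok: False
```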

Why xfail?

The corruption reproduces consistently, but the root cause is in the tokenizers backend decode path.
This test documents the bug at the Transformers level and locks in the expected failure until the upstream tokenizers fix is available.

Test command

  • python -m pytest tests/models/gpt2/test_tokenization_gpt2.py -k "added_tokens_unicode_roundtrip_with_bytelevel" -q

Add a GPT-2 regression test that captures added-token Unicode decode corruption with ByteLevel tokenizers, and mark it xfail while the underlying tokenizers-layer fix is pending.

Made-with: Cursor
Collaborator

@ArthurZucker ArthurZucker left a comment


this does not fix anything 😅

Comment on lines +94 to +95
tokenizer_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer_slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
Collaborator


fast and slow don't exist anymore

ErenAta16 and others added 5 commits on March 27, 2026 at 18:32
Encode newly added tokens through the ByteLevel unicode alphabet when the backend uses a ByteLevel pre-tokenizer and decoder without a normalizer, preventing control-character corruption on decode. Add a GPT-2 regression test to validate unicode roundtrip for added tokens.

Made-with: Cursor
Apply ByteLevel encoding to newly added tokens whenever tokenizer decoding uses ByteLevel but normalization does not, covering setups like Qwen (NFC normalizer + ByteLevel pre-tokenizer/decoder) and preventing unicode-to-control-character corruption on decode.

Made-with: Cursor
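The fix idea described in the commits above, sketched in pure Python: pass a new added token's UTF-8 bytes through the same byte-to-unicode alphabet that ordinary vocabulary entries use, so the decoder's inverse map recovers the original text. The table construction mirrors the reference `bytes_to_unicode`; the encode/decode helpers are hypothetical simplifications, not the tokenizers implementation:

```python
def bytelevel_alphabet():
    # GPT-2-style byte -> printable-unicode table (as in bytes_to_unicode).
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_TO_CHAR = bytelevel_alphabet()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def bytelevel_encode(token: str) -> str:
    # Store the token in the mapped alphabet, the same transformation
    # ordinary vocabulary entries go through before decoding.
    return "".join(BYTE_TO_CHAR[b] for b in token.encode("utf-8"))

def bytelevel_decode(stored: str) -> str:
    # Inverse map: characters back to bytes, then UTF-8 decode.
    return bytes(CHAR_TO_BYTE[c] for c in stored).decode("utf-8")

for word in ["Začnimo", "kuća", "međa"]:
    stored = bytelevel_encode(word)
    assert bytelevel_decode(stored) == word  # roundtrip now succeeds
    print(f"{word!r} stored as {stored!r}")
```

With the forward encoding applied at `add_tokens` time, 'č' is stored as its two mapped UTF-8 bytes rather than as a raw character, so the decoder's inverse map no longer mistakes it for byte 0x0D.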
Remove a stale type ignore in generation utils and clean formatting/import ordering so check_code_quality passes on the PR branch.

Made-with: Cursor
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45062&sha=a83162

Restore dist.all_reduce line to match upstream main so check_code_quality stays aligned with the type checker configuration.

Made-with: Cursor
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt2

CI check_code_quality runs `ruff format --check`; collapse the multi-line if to match formatter output.

Made-with: Cursor
…ssion test

- Return early for special_tokens before optional ByteLevel vocabulary encoding.
- Load GPT-2 via from_pretrained_id and document huggingface#45051 in the test docstring.

Made-with: Cursor


Development

Successfully merging this pull request may close these issues.

[Bug] Fast Tokenizer (ByteLevel BPE) incorrectly decodes added_tokens containing specific Unicode characters (e.g., 'č' becomes '\r')

2 participants