Add regression test for ByteLevel added-token Unicode decode corruption#45062
Closed
ErenAta16 wants to merge 9 commits into huggingface:main from
Conversation
Add a GPT-2 regression test that captures added token Unicode decode corruption with ByteLevel tokenizers and mark it xfail while the underlying tokenizers-layer fix is pending. Made-with: Cursor
Collaborator ArthurZucker left a comment:
this does not fix anything 😅
Comment on lines +94 to +95:
tokenizer_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer_slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
Collaborator
fast and slow don't exist anymore
…decode" This reverts commit fdaa06f.
Encode newly added tokens through the ByteLevel unicode alphabet when the backend uses a ByteLevel pre-tokenizer and decoder without a normalizer, preventing control-character corruption on decode. Add a GPT-2 regression test to validate unicode roundtrip for added tokens. Made-with: Cursor
Apply ByteLevel encoding to newly added tokens whenever tokenizer decoding uses ByteLevel but normalization does not, covering setups like Qwen (NFC normalizer + ByteLevel pre-tokenizer/decoder) and preventing unicode-to-control-character corruption on decode. Made-with: Cursor
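The fix direction described above can be sketched in isolation. This is a hypothetical illustration, not the merged code: the helper names (`bytes_to_unicode`, `bytelevel_encode`, `bytelevel_decode`) are made up here, and it assumes the standard GPT-2 byte-to-unicode table is the one the ByteLevel components use.

```python
def bytes_to_unicode():
    # Standard GPT-2 byte -> printable-unicode table (assumption: the same
    # table the tokenizers ByteLevel pre-tokenizer/decoder is built on).
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def bytelevel_encode(token: str) -> str:
    # The proposed fix: store the added token in ByteLevel alphabet form,
    # so the ByteLevel decode step inverts it back to the original text.
    return "".join(byte_encoder[b] for b in token.encode("utf-8"))

def bytelevel_decode(token: str) -> str:
    return bytes(byte_decoder[c] for c in token).decode("utf-8")

# Stored in encoded form, "č" (UTF-8 bytes C4 8D) roundtrips cleanly.
encoded = bytelevel_encode("č")
assert bytelevel_decode(encoded) == "č"
```

Note the guard in the commit message: this re-encoding only applies when decoding goes through ByteLevel but no normalizer rewrites the added token first, which is why setups like Qwen (NFC normalizer + ByteLevel pre-tokenizer/decoder) are called out explicitly.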
Remove a stale type ignore in generation utils and clean formatting/import ordering so check_code_quality passes on the PR branch. Made-with: Cursor
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45062&sha=a83162
Restore dist.all_reduce line to match upstream main so check_code_quality stays aligned with the type checker configuration. Made-with: Cursor
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: gpt2
CI check_code_quality runs `ruff format --check`; collapse the multi-line if to match formatter output. Made-with: Cursor
…ssion test
- Return early for special_tokens before optional ByteLevel vocabulary encoding.
- Load GPT-2 via from_pretrained_id and document huggingface#45051 in the test docstring.
Made-with: Cursor
This PR adds a regression test for Unicode corruption when decoding added_tokens with ByteLevel tokenizers (e.g. the GPT-2 family). In affected cases, characters such as č, ć, đ can decode into control characters (\r, \x07, \x11) after being added via add_tokens.

Fixes huggingface/tokenizers#1996
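The corruption can be reproduced outside the tokenizer stack with a minimal sketch of the GPT-2 ByteLevel alphabet. This is illustrative code under one assumption: the standard GPT-2 byte-to-unicode table is what the ByteLevel decoder inverts; the helper names are made up for the demo.

```python
def bytes_to_unicode():
    # GPT-2 style mapping: every byte value -> a printable unicode char.
    # Printable bytes map to themselves; the rest map to chr(256 + n).
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_decoder = {v: k for k, v in bytes_to_unicode().items()}

def bytelevel_decode(token: str) -> str:
    # ByteLevel decode treats every char of the token as an encoded byte.
    return bytes(byte_decoder[c] for c in token).decode("utf-8", errors="replace")

# A raw added token was never passed through the ByteLevel alphabet, but the
# decoder still looks each char up in the inverse table. "č" is U+010D, i.e.
# chr(256 + 13), which the table assigns to byte 13 -- carriage return.
for ch in "čćđ":
    print(repr(ch), "->", repr(bytelevel_decode(ch)))
# 'č' -> '\r', 'ć' -> '\x07', 'đ' -> '\x11'
```

This is exactly the č/ć/đ to \r/\x07/\x11 mapping reported in the issue: the corrupted output is not random, it is the ByteLevel inverse table applied to text that was never ByteLevel-encoded.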
What this PR does
- Adds tests/models/gpt2/test_tokenization_gpt2.py::test_added_tokens_unicode_roundtrip_with_bytelevel, which roundtrips added tokens from "Začnimo kuća međa".
- Marks the test xfail, because the underlying issue appears to be in huggingface/tokenizers ByteLevel decode behavior, not in model-specific Transformers logic.
Why xfail?
Current behavior reproduces consistently, but the root cause is in the tokenizers backend decode path.
This test documents and locks the bug at the Transformers level until the upstream tokenizers fix is available.
Test command
python -m pytest tests/models/gpt2/test_tokenization_gpt2.py -k "added_tokens_unicode_roundtrip_with_bytelevel" -q