Fix "AttributeError: NewTokenizer has no attribute special_attribute_present" (Remove REGISTERED_FAST_ALIASES)#45293

Open
yonigozlan wants to merge 1 commit into huggingface:main from yonigozlan:fix-tokenization-auto

@yonigozlan
Member

Fix global state leak in AutoTokenizer.register causing test failures

Problem

test_from_pretrained_dynamic_processor was failing when run as part of the full test class with:

AttributeError: NewTokenizer has no attribute special_attribute_present

Root cause

AutoTokenizer.register was populating a global dict REGISTERED_TOKENIZER_CLASSES: dict[str, type]
that mapped class __name__ strings to registered class objects. Tests that called AutoTokenizer.register
(e.g. test_dynamic_processor_with_specific_dynamic_subcomponents) registered a local NewTokenizer class
there without cleaning it up. When test_from_pretrained_dynamic_processor later ran,
tokenizer_class_from_name("NewTokenizer") returned the stale local class (which lacked
special_attribute_present) instead of loading the correct one from the Hub.
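The leak can be reproduced in miniature without transformers. The sketch below is only an illustration: `REGISTRY`, `register`, `class_from_name`, and `hub_loader` are hypothetical stand-ins, not the library's code. It shows how a module-level dict keyed by `__name__` lets a class registered in one test shadow the class another test expects to load.

```python
# Toy stand-in for a module-level registry keyed by class __name__.
# Every name here is hypothetical; this is not the transformers implementation.
REGISTRY: dict[str, type] = {}


def register(cls: type) -> None:
    REGISTRY[cls.__name__] = cls


def class_from_name(name: str, load_from_hub) -> type:
    # A previously registered class wins over loading the "real" one.
    if name in REGISTRY:
        return REGISTRY[name]
    return load_from_hub(name)


def first_test() -> None:
    class NewTokenizer:  # local class without special_attribute_present
        pass

    register(NewTokenizer)  # registered, never cleaned up


def hub_loader(name: str) -> type:
    # Pretend this builds the class defined in the Hub repository.
    return type(name, (), {"special_attribute_present": True})


first_test()
stale = class_from_name("NewTokenizer", hub_loader)
leaked = not hasattr(stale, "special_attribute_present")
```

Run in isolation, the second lookup would reach `hub_loader`; run after `first_test`, it returns the stale local class, which mirrors the order-dependent failure described above.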

Fix

Removed REGISTERED_TOKENIZER_CLASSES entirely. It seems to be redundant (🚨 I might be wrong, please double check :) ): AutoTokenizer.register already stores
classes in TOKENIZER_MAPPING._extra_content, and tokenizer_class_from_name can walk
_extra_content.values() to find a class by __name__. The leaked global was the only path through which
the stale class was being returned.
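As a rough sketch of the lookup described here, assuming the registered values are plain classes (the real `_extra_content` values may be shaped differently), walking the mapping's values by `__name__` could look like the following. `find_registered_class`, `CustomConfig`, and `CustomTokenizer` are hypothetical names used only for illustration.

```python
def find_registered_class(extra_content: dict, class_name: str):
    """Return the first registered class whose __name__ matches, else None."""
    for candidate in extra_content.values():
        # getattr with a default tolerates values that are not classes.
        if getattr(candidate, "__name__", None) == class_name:
            return candidate
    return None


class CustomConfig:  # hypothetical config class used as the mapping key
    pass


class CustomTokenizer:  # hypothetical user-registered tokenizer
    pass


extra_content = {CustomConfig: CustomTokenizer}
found = find_registered_class(extra_content, "CustomTokenizer")
missing = find_registered_class(extra_content, "OtherTokenizer")
```

Because the lookup reads from the same mapping that registration writes to, there is no second global that can drift out of sync with it.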

Also moved the _extra_content lookup in tokenizer_class_from_name to run before TOKENIZER_MAPPING_NAMES,
so user-registered classes continue to shadow built-ins, consistent with how AutoConfig.register currently works.
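The effect of the lookup order can be sketched with toy data. Here `BUILTIN_NAMES`, `resolve`, and both tokenizer classes are hypothetical stand-ins; the point is only that checking user-registered classes first lets them shadow a built-in with the same name.

```python
class BuiltinTokenizer:
    origin = "builtin"


# Hypothetical stand-in for the built-in name table.
BUILTIN_NAMES = {"ExistingTokenizer": BuiltinTokenizer}

# A user class that deliberately reuses a built-in name.
UserTokenizer = type("ExistingTokenizer", (), {"origin": "user"})


def resolve(class_name: str, extra_content: dict):
    # 1. User-registered classes are checked first.
    for candidate in extra_content.values():
        if getattr(candidate, "__name__", None) == class_name:
            return candidate
    # 2. Built-ins are only a fallback.
    return BUILTIN_NAMES.get(class_name)


with_user = resolve("ExistingTokenizer", {object: UserTokenizer})
without_user = resolve("ExistingTokenizer", {})
```

With the two steps swapped, `with_user` would resolve to the built-in class and the registration would be silently ignored.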

If this looks breaking to you, I'm happy to just pop the registered "NewTokenizer" from REGISTERED_TOKENIZER_CLASSES to fix the tests, and leave tokenization_auto as is.
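For comparison, the narrower test-only alternative might look roughly like this. `REGISTERED` is a hypothetical stand-in for the global dict and `run_test_with_cleanup` is an illustrative name; the sketch only shows the try/finally cleanup pattern, not the actual test code.

```python
# Hypothetical stand-in for the module-level registry.
REGISTERED: dict[str, type] = {}


def run_test_with_cleanup() -> None:
    class NewTokenizer:
        pass

    REGISTERED["NewTokenizer"] = NewTokenizer
    try:
        # Body of the test: the class is visible while the test runs.
        assert REGISTERED["NewTokenizer"] is NewTokenizer
    finally:
        # Pop with a default so cleanup never raises, even on early failure.
        REGISTERED.pop("NewTokenizer", None)


run_test_with_cleanup()
```

This keeps the tests isolated but leaves the global in place, so any other caller that registers a class without cleaning up could still hit the same problem.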

Cc @itazap @ArthurZucker

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap
Collaborator

itazap commented Apr 9, 2026

It looks fine to me if tests are passing! Agreed that fixing the leaked global should be done anyway

if getattr(tokenizer, "__name__", None) == class_name:
    return tokenizer

# We did not find the class, but maybe it's because a dep is missing. In that case, the class will be in the main
Collaborator

Suggested change
# We did not find the class, but maybe it's because a dep is missing. In that case, the class will be in the main

unrelated to this PR but might as well rm this duplicate comment 😅
