Fix KeyError in convert_to_native_format for dict vocab #44452

Merged
itazap merged 1 commit into huggingface:main from weiguangli-io:codex/transformers-44451-scandibert-keyerror on Mar 19, 2026

@weiguangli-io
Contributor

Fix KeyError in convert_to_native_format for dict vocab

Fixes #44451

Problem

AutoTokenizer.from_pretrained("vesteinn/ScandiBERT") raises KeyError: 0 in convert_to_native_format.

ScandiBERT's tokenizer_config.json specifies tokenizer_class: "XLMRobertaTokenizer", which has model = Unigram. When convert_to_native_format reads the tokenizer.json, it enters the Unigram branch and executes:

if vocab and isinstance(vocab[0], (list, tuple)):

However, the vocab from tokenizer.json is a dict ({"token": score, ...}), not a list. Indexing a dict with [0] raises KeyError because there is no key 0.
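The failure can be reproduced outside the library with two toy vocabs. This is a minimal sketch: `unguarded_is_pair_list` is a hypothetical stand-in for the condition quoted above, and the tokens and scores are made up for illustration.

```python
# Toy stand-ins for the two vocab shapes (values are illustrative only).
list_vocab = [["<s>", 0.0], ["hello", -3.5]]   # list of [token, score] pairs
dict_vocab = {"<s>": 0.0, "hello": -3.5}       # {"token": score, ...}


def unguarded_is_pair_list(vocab):
    # Hypothetical wrapper around the original, unguarded condition.
    return bool(vocab and isinstance(vocab[0], (list, tuple)))


# A list vocab is indexed by position, so vocab[0] is the first pair.
list_result = unguarded_is_pair_list(list_vocab)

# A dict vocab is indexed by key; there is no key 0, so indexing raises.
try:
    unguarded_is_pair_list(dict_vocab)
    error = None
except KeyError as exc:
    error = repr(exc)
```

With these inputs, `list_result` is `True` and `error` holds the `KeyError` for the missing key `0`.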

Fix

Add an isinstance(vocab, list) check before attempting to index vocab[0]:

if isinstance(vocab, list) and vocab and isinstance(vocab[0], (list, tuple)):

This is consistent with the guards used in the other branches (e.g., BPE/WordPiece branch on line 163). When vocab is already a dict, no conversion is needed and it is passed through as-is.
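The guarded condition can be sketched in isolation as follows. `normalize_unigram_vocab` is a hypothetical helper, not the library's function, and the list-to-tuple conversion in its body is an assumption based on the commit message; only the `if` condition is taken from this PR.

```python
def normalize_unigram_vocab(vocab):
    """Hypothetical helper illustrating the guarded check from this PR."""
    # The isinstance(vocab, list) test runs first, so vocab[0] is never
    # evaluated for a dict (short-circuit evaluation of `and`).
    if isinstance(vocab, list) and vocab and isinstance(vocab[0], (list, tuple)):
        # Assumed conversion step: turn each [token, score] pair into a tuple.
        return [tuple(item) for item in vocab]
    # Dicts, empty lists, and anything else are passed through unchanged.
    return vocab


dict_vocab = {"<s>": 0.0, "hello": -3.5}
converted = normalize_unigram_vocab([["<s>", 0.0], ["hello", -3.5]])
passed_through = normalize_unigram_vocab(dict_vocab)
```

Because the type check comes before the index, a dict vocab no longer reaches `vocab[0]` and is returned as the same object.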

Regression

Introduced in v5 by #42894 (use TokenizersBackend).

@Rocketknight1
Member

cc @itazap @ArthurZucker since it seems like a tokenizer regression (see also the original issue at #44451)

Collaborator

@itazap itazap left a comment

good catch, thank you!

@AngledLuffa

Thank you, looking forward to it! This was the best model I'd found so far for annotation tasks on DA and IS. (Will keep exploring other options anyway)

@ArthurZucker ArthurZucker enabled auto-merge March 10, 2026 13:45

@AngledLuffa

Can this be merged? It would be great to use this transformer again with newer versions of the package

When loading tokenizers like vesteinn/ScandiBERT whose tokenizer_config
specifies XLMRobertaTokenizer (model=Unigram) but whose tokenizer.json
contains a dict-type vocab, the expression vocab[0] raises KeyError
because dict keys are strings, not integers. Add an isinstance(vocab,
list) guard so the list-to-tuple conversion is only attempted on list
vocabs.
@Rocketknight1 Rocketknight1 force-pushed the codex/transformers-44451-scandibert-keyerror branch from 1317523 to cd5cfd5 Compare March 18, 2026 16:07
@Rocketknight1
Member

Looks like it was ready to go and just had automerge issues, will try to merge it.

@ArthurZucker ArthurZucker added this pull request to the merge queue Mar 18, 2026
@AngledLuffa

That's great, thank you!

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Mar 18, 2026
@AngledLuffa

Is this fixable?

@itazap itazap added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@Rocketknight1 Rocketknight1 added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@itazap itazap added this pull request to the merge queue Mar 19, 2026
Merged via the queue into huggingface:main with commit 25a9105 Mar 19, 2026
28 checks passed

Development

Successfully merging this pull request may close these issues.

Latest version cannot load "vesteinn/ScandiBERT"

6 participants