Fix KeyError in convert_to_native_format for dict vocab by weiguangli-io · Pull Request #44452 · huggingface/transformers

weiguangli-io · 2026-03-05T03:34:02Z

Fix KeyError in `convert_to_native_format` for dict vocab

Problem

AutoTokenizer.from_pretrained("vesteinn/ScandiBERT") raises KeyError: 0 in convert_to_native_format.

ScandiBERT's tokenizer_config.json specifies tokenizer_class: "XLMRobertaTokenizer", which has model = Unigram. When convert_to_native_format reads the tokenizer.json, it enters the Unigram branch and executes:

if vocab and isinstance(vocab[0], (list, tuple)):

However, the vocab from tokenizer.json is a dict ({"token": score, ...}), not a list. Indexing a dict with [0] raises KeyError because there is no key 0.
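
The failure can be reproduced in isolation, without loading transformers. The following is a minimal sketch, assuming one toy vocab in each shape; the tokens, scores, and the helper name first_is_pair are made up for illustration and are not taken from ScandiBERT or the library:

```python
# Toy vocabs; tokens and scores are illustrative only.
list_vocab = [["<s>", 0.0], ["hello", -3.2]]  # list of [token, score] pairs
dict_vocab = {"<s>": 0.0, "hello": -3.2}      # dict mapping token -> score


def first_is_pair(vocab):
    # Same shape as the unguarded check quoted above (hypothetical wrapper).
    return bool(vocab and isinstance(vocab[0], (list, tuple)))


list_result = first_is_pair(list_vocab)  # a list supports positional indexing

try:
    first_is_pair(dict_vocab)  # dict lookup with key 0, which does not exist
    error = None
except KeyError as exc:
    error = exc
```

With the list vocab, vocab[0] is the first pair; with the dict vocab, the same expression is a key lookup, so it fails before isinstance is ever evaluated.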

Fix

Add an isinstance(vocab, list) check before attempting to index vocab[0]:

if isinstance(vocab, list) and vocab and isinstance(vocab[0], (list, tuple)):

This is consistent with the guards used in the other branches (e.g., BPE/WordPiece branch on line 163). When vocab is already a dict, no conversion is needed and it is passed through as-is.
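
The guarded check can be sketched as a standalone function. This assumes a hypothetical helper named normalize_unigram_vocab; it is not a function in transformers, and only the condition on its first line is taken from this PR:

```python
def normalize_unigram_vocab(vocab):
    """Hypothetical helper illustrating the guard; not the transformers code."""
    # Only index into vocab once we know it is a non-empty list.
    if isinstance(vocab, list) and vocab and isinstance(vocab[0], (list, tuple)):
        # List-of-pairs vocab: convert each [token, score] entry to a tuple.
        return [tuple(item) for item in vocab]
    # Dict vocabs (and empty or already-flat inputs) pass through unchanged.
    return vocab


converted = normalize_unigram_vocab([["<s>", 0.0], ["hello", -3.2]])
passthrough = normalize_unigram_vocab({"<s>": 0.0, "hello": -3.2})
```

Because isinstance(vocab, list) is evaluated first, the and chain short-circuits for a dict and vocab[0] is never reached.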

Regression

Introduced in v5 by #42894 (use TokenizersBackend).

Rocketknight1 · 2026-03-05T13:50:12Z

cc @itazap @ArthurZucker since it seems like a tokenizer regression (see also the original issue at #44451)

itazap

good catch, thank you!

AngledLuffa · 2026-03-05T20:32:10Z

Thank you, looking forward to it! This was the best model I'd found so far for annotation tasks on DA and IS. (Will keep exploring other options anyway)

HuggingFaceDocBuilderDev · 2026-03-10T13:54:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

AngledLuffa · 2026-03-16T06:39:40Z

Can this be merged? It would be great to use this transformer again with newer versions of the package

When loading tokenizers like vesteinn/ScandiBERT whose tokenizer_config specifies XLMRobertaTokenizer (model=Unigram) but whose tokenizer.json contains a dict-type vocab, the expression vocab[0] raises KeyError because dict keys are strings, not integers. Add an isinstance(vocab, list) guard so the list-to-tuple conversion is only attempted on list vocabs.

Rocketknight1 · 2026-03-18T16:07:13Z

Looks like it was ready to go and just had automerge issues, will try to merge it.

AngledLuffa · 2026-03-18T16:27:02Z

That's great, thank you!

AngledLuffa · 2026-03-18T21:51:32Z

Is this fixable?

itazap approved these changes Mar 5, 2026

ArthurZucker approved these changes Mar 10, 2026

ArthurZucker enabled auto-merge March 10, 2026 13:45

Rocketknight1 force-pushed the codex/transformers-44451-scandibert-keyerror branch from 1317523 to cd5cfd5 Compare March 18, 2026 16:07

ArthurZucker added this pull request to the merge queue Mar 18, 2026

github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Mar 18, 2026

itazap added this pull request to the merge queue Mar 19, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026

Rocketknight1 added this pull request to the merge queue Mar 19, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026

itazap added this pull request to the merge queue Mar 19, 2026

Merged via the queue into huggingface:main with commit 25a9105 Mar 19, 2026
28 checks passed

itazap merged 1 commit into huggingface:main from weiguangli-io:codex/transformers-44451-scandibert-keyerror
