
fix(tokenizer): Avert special token property overwrites in batch add_tokens calls#43654

Open
harshaljanjani wants to merge 11 commits into huggingface:main from harshaljanjani:fix/bigbird-mask-token-registration

Conversation

@harshaljanjani
Contributor

@harshaljanjani harshaljanjani commented Jan 31, 2026

What does this PR do?

→ Fixes test_modeling_big_bird.py::BigBirdModelIntegrationTest::test_fill_mask.

For more details on reproducing the bug, please visit the linked issue!

Fixes #43653.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you fix any necessary existing tests?

CI Failure:

(screenshot)

Current Output:

(screenshot)

Output After the Fix:

(screenshot)

@harshaljanjani harshaljanjani marked this pull request as ready for review January 31, 2026 18:36
@github-actions github-actions Bot requested a review from ArthurZucker January 31, 2026 18:37
Comment on lines +139 to +143
if isinstance(mask_token_obj, AddedToken):
    mask_id = self._tokenizer.token_to_id(str(mask_token_obj))
    if mask_id is not None:
        self._tokenizer.add_special_tokens([mask_token_obj])

Collaborator

hey this is wrong 😓 the call to super() should already be adding all of the special tokens.
The reason they are skipped when decoding is probably because skip_special_tokens=True by default
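
A toy illustration of this point (not the real transformers decode implementation; the vocab and helper here are hypothetical): special tokens vanish from decoded text when skip_special_tokens=True, even though they are correctly registered.

```python
# Toy decode: special ids are dropped only because of the flag,
# not because the token was never registered.
vocab = {0: "hello", 1: "[MASK]"}
special_ids = {1}

def decode(ids, skip_special_tokens=True):
    return " ".join(
        vocab[i] for i in ids
        if not (skip_special_tokens and i in special_ids)
    )

assert decode([0, 1]) == "hello"                              # mask silently dropped
assert decode([0, 1], skip_special_tokens=False) == "hello [MASK]"
```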

Contributor Author

@harshaljanjani harshaljanjani Feb 2, 2026

Thanks for the review @ArthurZucker; as it turns out, I underestimated the bug, but I think this is a much more informed reasoning chain since it required analysis a few layers deeper.

BigBirdTokenizer defines mask_token with lstrip=True, but the batch add_tokens call processes multiple [MASK] copies with conflicting lstrip values: one with lstrip=True from _special_tokens_map, and one with lstrip=False from saved configs (I wrote code to demonstrate this; output attached with [TRACE] notes showing the conflicting values).
→ In Rust's add_tokens method, AddedToken has a Hash implementation using only the content "[MASK]", but Eq/PartialEq checks all fields. Two tokens with the same content but different lstrip are equal for hashing but unequal for equality.
→ Within a batch, existing.contains() uses full equality, so [MASK] (lstrip=False) doesn't match [MASK] (lstrip=True) and overwrites it. Whichever copy appears later is taken as the final state; in our buggy case, lstrip=False.
So the fix would be to make individual add_tokens calls for each special token, so that the correct lstrip=True version is seen as the final state.
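
The Hash/Eq mismatch described above can be sketched in Python (a toy model of the Rust AddedToken behavior, not the actual tokenizers code):

```python
class Tok:
    """Toy AddedToken: hashed by content only, compared on all fields."""
    def __init__(self, content, lstrip):
        self.content, self.lstrip = content, lstrip
    def __hash__(self):
        return hash(self.content)   # Hash: content only
    def __eq__(self, other):
        # Eq/PartialEq: all fields
        return (self.content, self.lstrip) == (other.content, other.lstrip)

good = Tok("[MASK]", lstrip=True)    # from _special_tokens_map
bad = Tok("[MASK]", lstrip=False)    # from the saved config

assert hash(good) == hash(bad)       # same hash bucket
assert good != bad                   # but a contains() check by equality misses it

# Within one batch, the later copy's properties win for the shared content.
final = {}
for tok in [good, bad]:
    final[tok.content] = tok
assert final["[MASK]"].lstrip is False   # the buggy final state
```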

The current output:

(screenshot: before_bigbird)

The fixed output would look something like the following:

(screenshot: after_bigbird)

@github-actions
Contributor

github-actions Bot commented Feb 2, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: big_bird

@harshaljanjani harshaljanjani changed the title fix(tokenizer): Register BigBird [MASK] token to fix CI failure fix(tokenizer): Avert special token property overwrites in batch add_tokens calls Feb 3, 2026
Collaborator

@ArthurZucker ArthurZucker left a comment

I need to check your answer a bit more

if not special_token_value.special:
    special_token_value.special = True
self._tokenizer.add_tokens([special_token_value])

Collaborator

above you have:

        # if some of the special tokens are not already in the tokenizer, add them
        # V5: Check both named special tokens and extra special tokens
        # Iterate over _special_tokens_map to preserve AddedToken properties (lstrip, rstrip, etc.)
        for special_token_value in self._special_tokens_map.values():
            if special_token_value is None:
                continue
            if str(special_token_value) not in encoder and special_token_value not in tokens_to_add:
                tokens_to_add.append(special_token_value)

so we are doing it twice, which I don't think makes sense.

Now, I am all in for a bug fix; for the bug you are fixing, maybe add an equivalent small test for bigbird?

Your deep dive is interesting, I did not fully check it, but whatever was serialized (in the tokenizer_config.json or tokenizer.json) takes precedence

Contributor Author

@harshaljanjani harshaljanjani Feb 3, 2026

Thanks for being so patient with the review! Hopefully this addresses the concerns :)

Config takes precedence.

The bug, to my knowledge, is that [MASK] appears with conflicting properties within the preset itself: it is present both in _special_tokens_map (with lstrip=True, from google/bigbird-roberta-base/tokenizer_config.json) and in _extra_special_tokens (with the default properties lstrip=False, normalized=False, when loaded from google/bigbird-roberta-base/spiece.model). The fix is needed precisely because the conflict lives within the preset itself.

(screenshot)

Your deep dive is interesting, I did not fully check it, but whatever was serialized (in the tokenizer_config.json or tokenizer.json) takes precedence

In this case it doesn't, which is the point of the fix. However, thanks to your review I found a more efficient way to do it: de-duplicate, since the whole problem stems from the duplication of the mask token. We can prevent the duplicate [MASK] from entering the batch in the first place instead of checking after the fact. That works like a charm as well. I initially tried to solve this by making individual add_tokens calls for each _special_tokens_map entry after the batch, to ensure the correct properties reach the backend, but I think this is the better solution. I've also added a test case per your suggestion.

Hope this better aligns with what the norm is by design here, thanks for taking the time!!
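
A minimal sketch of the de-duplication approach described above (the names, token properties, and data here are hypothetical stand-ins, not the actual transformers code): the copy carrying the configured properties is queued first, so a later duplicate with default properties never enters the batch.

```python
tokens_to_add = []
seen = set()

# (content, lstrip) pairs: configured specials first, then extras
special_map = {"mask_token": ("[MASK]", True)}   # from tokenizer_config.json
extra_specials = ["[MASK]", "[SEP]"]             # loaded with default lstrip=False

for content, lstrip in special_map.values():
    if content not in seen:
        seen.add(content)
        tokens_to_add.append((content, lstrip))

for content in extra_specials:
    if content not in seen:     # the duplicate "[MASK]" is skipped here
        seen.add(content)
        tokens_to_add.append((content, False))

# The lstrip=True copy survives as the only "[MASK]" in the batch.
assert tokens_to_add == [("[MASK]", True), ("[SEP]", False)]
```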

@ArthurZucker
Collaborator

→ Within a batch, existing.contains() uses full equality, so [MASK] (lstrip=False) doesn't match [MASK] (lstrip=True) and overwrites it. Whichever copy appears later is taken as the final state; in our buggy case, lstrip=False.

This is the expected behavior yes.

@ArthurZucker
Collaborator

one with lstrip=False from saved configs

Config always takes precedence

@harshaljanjani
Contributor Author

harshaljanjani commented Feb 3, 2026

Thanks for taking the time to review! I've left a reply here :)
Also, ci/circleci: tests_exotic_models seems to be flaky; the failing test appears unrelated to this change and passes locally with the latest version of the repo (commit: affcf45).

(screenshot)

  # Also check extra special tokens
  for token in self._extra_special_tokens:
-     if str(token) not in encoder and token not in tokens_to_add:
+     if str(token) not in encoder and str(token) not in {str(t) for t in tokens_to_add}:
Collaborator

Can you fix the complexity? :)
Create {str(t) for t in tokens_to_add} once, then check existence in O(1) instead of O(len(tokens_to_add)) :)
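
The suggestion above can be sketched as follows (toy data and hypothetical names; the set is built once outside the loop and kept in sync, so each membership check is O(1)):

```python
tokens_to_add = ["[CLS]", "[SEP]"]
extra = ["[MASK]", "[SEP]"]
encoder = {"hello": 0}

added = {t for t in tokens_to_add}   # built once, not per iteration
for token in extra:
    if token not in encoder and token not in added:   # O(1) lookups
        tokens_to_add.append(token)
        added.add(token)             # keep the set in sync with the list

assert tokens_to_add == ["[CLS]", "[SEP]", "[MASK]"]
```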

@harshaljanjani
Contributor Author

Resolved :)

Collaborator

@ArthurZucker ArthurZucker left a comment

Can you work a bit more on the test please? Let's make explicit what is not passing on main today:

# on main:
In [3]: tokenizer._special_tokens_map.get("mask_token")
Out[3]: AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True)

but added tokens decoder has:

AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),

Comment on lines +370 to +371
# Check that the backend also has lstrip=True
backend_mask = tokenizer._tokenizer.get_added_tokens_decoder()[mask_id]
Collaborator

this is not the test we should enforce; it was already working

# Check that mask_token in _special_tokens_map has lstrip=True
mask_in_special = tokenizer._special_tokens_map.get("mask_token")
self.assertIsNotNone(mask_in_special)
self.assertTrue(mask_in_special.lstrip, "mask_token in _special_tokens_map should have lstrip=True")
Collaborator

this already passes for me locally


@harshaljanjani
Contributor Author

harshaljanjani commented Feb 10, 2026

@ArthurZucker Thanks for your time. As I mentioned in this comment, [MASK] was registered a second time, so without the fix, tokenizing text containing [MASK] produces '_' artifacts. I think this is a good testing benchmark since it doesn't peek at internals and focuses on user-level behavior. Please let me know if this suffices as a better test; I've added both examples to the test for robustness, thanks!
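
A toy illustration of why lstrip matters for the artifact described above (a regex stand-in, not the real SentencePiece logic; the helper name is hypothetical): with lstrip=False, the space before [MASK] is left dangling and gets encoded as a stray piece, whereas lstrip=True lets the token absorb it.

```python
import re

def split_on_mask(text, lstrip):
    # With lstrip=True the token pattern swallows preceding whitespace.
    pattern = r"\s*\[MASK\]" if lstrip else r"\[MASK\]"
    return re.split(f"({pattern})", text)

# lstrip=False leaves "the " with a trailing space -> stray piece artifact
assert split_on_mask("the [MASK] of", lstrip=False) == ["the ", "[MASK]", " of"]
# lstrip=True attaches the space to the mask token -> clean segments
assert split_on_mask("the [MASK] of", lstrip=True) == ["the", " [MASK]", " of"]
```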

Before Fix:

(screenshots)

After Fix:

(screenshots)

@harshaljanjani
Contributor Author

Happy to make changes until it's production-ready, please do let me know; thanks a ton for your time!

@harshaljanjani
Contributor Author

Following up on this PR, happy to make changes or add context if helpful!

@harshaljanjani
Contributor Author

harshaljanjani commented Mar 3, 2026

@ArthurZucker, just a gentle ping; I got a notification that the issue was unfortunately marked as stale, but the PR is ready for review :)
I left this explanation of how I tried to resolve the comments :)

@harshaljanjani
Contributor Author

@ArthurZucker Just a gentle ping on this :)

@harshaljanjani
Contributor Author

harshaljanjani commented Mar 26, 2026

Good day @ArthurZucker, just checking if this fix is still being considered?
cc: @Rocketknight1 @vasqu



Development

Successfully merging this pull request may close these issues.

[BUG][CI] BigBirdTokenizer mask token not registered as special token, gives empty decode output

2 participants