fix(tokenizer): Avert special token property overwrites in batch add_tokens calls #43654
harshaljanjani wants to merge 11 commits into huggingface:main from
Conversation
```python
if isinstance(mask_token_obj, AddedToken):
    mask_id = self._tokenizer.token_to_id(str(mask_token_obj))
    if mask_id is not None:
        self._tokenizer.add_special_tokens([mask_token_obj])
```
hey this is wrong 😓 the call to super() should already be adding all of the special tokens.
The reason they are skipped when decoding is probably because skip_special_tokens=True by default
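(For context, the flag in question; a generic usage sketch with an illustrative checkpoint and sentence, not code from this PR:)

```python
from transformers import BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
ids = tokenizer("Paris is [MASK].").input_ids
print(tokenizer.decode(ids, skip_special_tokens=False))  # [MASK] is kept
print(tokenizer.decode(ids, skip_special_tokens=True))   # [MASK] is dropped
```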
Thanks for the review @ArthurZucker; as it turns out, I underestimated the bug, but I think this is a much more informed reasoning chain, since it required analysis a few layers deeper.
→ BigBirdTokenizer defines mask_token with lstrip=True, but the batch add_tokens call processes multiple [MASK] copies with conflicting lstrip values: one with lstrip=True from _special_tokens_map, and one with lstrip=False from saved configs (I wrote code to demonstrate this; output attached with [TRACE] notes showing the conflicting values).
→ In Rust's add_tokens method, AddedToken has a Hash implementation that uses only the content ("[MASK]"), but Eq/PartialEq checks all fields. Two tokens with the same content but different lstrip values hash identically yet compare unequal.
→ Within a batch, existing.contains() uses full equality, so [MASK] (lstrip=False) doesn't match [MASK] (lstrip=True) and overwrites it. Whichever appears later is seen as the final state; in our buggy case, lstrip=False.
So the fix would be to make add_tokens calls for each special token, so that the correct lstrip=True version is seen as the final state.
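To illustrate the asymmetry, here is a minimal Python model of the behavior described above (the Token class is a hypothetical stand-in for the Rust AddedToken, not the actual implementation):

```python
class Token:
    """Hypothetical stand-in for AddedToken: hashes on content only,
    while equality checks every field."""

    def __init__(self, content, lstrip=False):
        self.content = content
        self.lstrip = lstrip

    def __hash__(self):
        return hash(self.content)  # hash ignores lstrip

    def __eq__(self, other):
        return (self.content, self.lstrip) == (other.content, other.lstrip)


existing = [Token("[MASK]", lstrip=True)]
incoming = Token("[MASK]", lstrip=False)

# Full-field equality: the lstrip=False copy does not match, so it is
# treated as a new token rather than a duplicate...
print(incoming in existing)  # False

# ...while any content-keyed structure sees both as the same entry, so
# whichever copy is processed later silently wins.
decoder = {t.content: t for t in existing + [incoming]}
print(decoder["[MASK]"].lstrip)  # False -> the lstrip=True version is lost
```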
The current output and the fixed output were attached as images.
[For maintainers] Suggested jobs to run (before merge): run-slow: big_bird
ArthurZucker left a comment
I need to check your answer a bit more
```python
if not special_token_value.special:
    special_token_value.special = True
self._tokenizer.add_tokens([special_token_value])
```
above you have:

```python
# if some of the special tokens are not already in the tokenizer, add them
# V5: Check both named special tokens and extra special tokens
# Iterate over _special_tokens_map to preserve AddedToken properties (lstrip, rstrip, etc.)
for special_token_value in self._special_tokens_map.values():
    if special_token_value is None:
        continue
    if str(special_token_value) not in encoder and special_token_value not in tokens_to_add:
        tokens_to_add.append(special_token_value)
```

so we are doing it twice, which I don't think makes sense.
Now, I am all in for a bug fix; for the bug you are fixing, could you add an equivalent small test for bigbird maybe?
Your deep dive is interesting, I did not fully check it, but whatever was serialized (in the tokenizer_config.json or tokenizer.json) takes precedence
Thanks for being so patient with the review! Hopefully this addresses the concerns :)
> Config takes precedence.

The bug, to my knowledge, is that [MASK] appears with conflicting properties within the preset itself: in _special_tokens_map (with lstrip=True, from google/bigbird-roberta-base/tokenizer_config.json) and in _extra_special_tokens (with the default properties lstrip=False, normalized=False, when loaded from google/bigbird-roberta-base/spiece.model). The fix is needed precisely because the preset conflicts with itself.
> Your deep dive is interesting, I did not fully check it, but whatever was serialized (in the tokenizer_config.json or tokenizer.json) takes precedence

In this case it doesn't, which is the motivation for the fix. However, your review led me to a more efficient approach: de-duplicate, since the whole problem stems from the duplicated mask. We can prevent the duplicate [MASK] from entering the batch in the first place instead of checking after the fact, and that works a charm as well (see the sketch below). I initially tried to solve this by making individual add_tokens calls for each _special_tokens_map entry after the batch, to ensure the correct properties reach the backend, but I think this is a better solution. I've also added a test case per your suggestion.
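Roughly, the de-duplication looks like this (a minimal sketch of the idea, with the maps passed in as plain arguments rather than read off self as in the actual patch):

```python
def collect_special_tokens(special_tokens_map, extra_special_tokens, encoder):
    """Gather special tokens to add, preventing a duplicate [MASK]
    (same content, different properties) from entering the batch."""
    tokens_to_add = []
    # Named special tokens enter first, carrying their configured
    # properties (e.g. lstrip=True for BigBird's [MASK]).
    for special_token_value in special_tokens_map.values():
        if special_token_value is None:
            continue
        if str(special_token_value) not in encoder and special_token_value not in tokens_to_add:
            tokens_to_add.append(special_token_value)
    # Extra special tokens are compared by string content, so a second
    # [MASK] with different properties never makes it into the batch.
    for token in extra_special_tokens:
        if str(token) not in encoder and str(token) not in {str(t) for t in tokens_to_add}:
            tokens_to_add.append(token)
    return tokens_to_add
```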
Hope this better aligns with the intended design here, thanks for taking the time!!
This is the expected behavior, yes.
Config always takes precedence.
Thanks for taking the time to review! I've left a reply here :)
```diff
  # Also check extra special tokens
  for token in self._extra_special_tokens:
-     if str(token) not in encoder and token not in tokens_to_add:
+     if str(token) not in encoder and str(token) not in {str(t) for t in tokens_to_add}:
```
Can you fix the complexity? :)
Create {str(t) for t in tokens_to_add} once, then check existence in O(1) instead of O(len(tokens_to_add)) :)
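In code, the hoisted version would look roughly like this (same names as in the diff above; encoder, tokens_to_add, and extra_special_tokens are assumed to exist in the enclosing scope):

```python
# Build the content set once, before the loop...
queued = {str(t) for t in tokens_to_add}
for token in extra_special_tokens:
    # ...so each membership test is O(1) instead of rebuilding the set
    # (an O(len(tokens_to_add)) pass) on every iteration.
    if str(token) not in encoder and str(token) not in queued:
        tokens_to_add.append(token)
        queued.add(str(token))
```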
Resolved :)
ArthurZucker left a comment
Can you work a bit more on the test please? Let's make explicit what is not passing on main today:
```python
# on main:
In [3]: tokenizer._special_tokens_map.get("mask_token")
Out[3]: AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True)
```

but the added tokens decoder has:

```python
AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)
```
```python
# Check that the backend also has lstrip=True
backend_mask = tokenizer._tokenizer.get_added_tokens_decoder()[mask_id]
```
this is not the test we should enforce; it was already working
```python
# Check that mask_token in _special_tokens_map has lstrip=True
mask_in_special = tokenizer._special_tokens_map.get("mask_token")
self.assertIsNotNone(mask_in_special)
self.assertTrue(mask_in_special.lstrip, "mask_token in _special_tokens_map should have lstrip=True")
```
this already passes for me locally
@ArthurZucker Thanks for your time. As I mentioned in this comment, [MASK] was registered a second time, so without the fix, tokenizing text containing [MASK] produces '_' artifacts. I think this makes a good testing benchmark, since it doesn't peek at internals and focuses on user-level behavior. Please let me know if this suffices as a better test; I've added both examples to the test for robustness, thanks!
The before-fix and after-fix outputs were attached as images.
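For reference, the user-level check is roughly of this shape (a sketch only; the checkpoint, sentence, and assertion are illustrative assumptions, not the exact test added in the PR):

```python
from transformers import BigBirdTokenizerFast

def test_mask_token_leaves_no_artifacts():
    tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
    tokens = tokenizer.tokenize("Paris is [MASK].")
    # Without the fix, the backend registers [MASK] with lstrip=False and
    # the space before it survives as a lone '▁' piece; with lstrip=True
    # (as configured) the whitespace is absorbed into the mask token.
    assert "▁" not in tokens
```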
Happy to make changes until it's production-ready, please do let me know; thanks a ton for your time!
Following up on this PR, happy to make changes or add context if helpful!
@ArthurZucker, just a gentle ping; I got a notification that the issue was unfortunately marked as stale, but the PR is ready for review :)
@ArthurZucker Just a gentle ping on this :)
Good day @ArthurZucker, just checking if this fix is still being considered?







What does this PR do?
→ Fixes test_modeling_big_bird.py::BigBirdModelIntegrationTest::test_fill_mask. For more details on reproducing the bug, please visit the linked issue!
Fixes #43653.
Screenshots attached: the CI failure, the current output, and the output after the fix.