fix(tokenizer): Avert special token property overwrites in batch add_tokens calls #43654
harshaljanjani wants to merge 11 commits into huggingface:main from
Conversation
```python
if isinstance(mask_token_obj, AddedToken):
    mask_id = self._tokenizer.token_to_id(str(mask_token_obj))
    if mask_id is not None:
        self._tokenizer.add_special_tokens([mask_token_obj])
```
hey this is wrong 😓 the call to super() should already be adding all of the special tokens.
The reason they are skipped when decoding is probably because skip_special_tokens=True by default
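(For context, the flag in question; a generic usage sketch with an illustrative checkpoint and sentence, not code from this PR:)

```python
from transformers import BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
ids = tokenizer("Paris is [MASK].").input_ids
print(tokenizer.decode(ids, skip_special_tokens=False))  # [MASK] is kept
print(tokenizer.decode(ids, skip_special_tokens=True))   # [MASK] is dropped
```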
Thanks for the review @ArthurZucker; as it turns out, I underestimated the bug, but I think this is a much more informed reasoning chain, since it required analysis a few layers deeper.
→ BigBirdTokenizer defines mask_token with lstrip=True, but the batch add_tokens call processes multiple [MASK] copies with conflicting lstrip values: one with lstrip=True from _special_tokens_map, and one with lstrip=False from saved configs (I wrote code to demonstrate this; output attached with [TRACE] notes showing the conflicting values).
→ In Rust's add_tokens method, AddedToken has a Hash implementation that uses only the content ("[MASK]"), but Eq/PartialEq checks all fields. Two tokens with the same content but different lstrip values hash identically yet compare unequal.
→ Within a batch, existing.contains() uses full equality, so [MASK] (lstrip=False) doesn't match [MASK] (lstrip=True) and overwrites it. Whichever appears later is seen as the final state; in our buggy case, lstrip=False.
So the fix would be to make add_tokens calls for each special token, so that the correct lstrip=True version is seen as the final state.
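To illustrate the asymmetry, here is a minimal Python model of the behavior described above (the Token class is a hypothetical stand-in for the Rust AddedToken, not the actual implementation):

```python
class Token:
    """Hypothetical stand-in for AddedToken: hashes on content only,
    while equality checks every field."""

    def __init__(self, content, lstrip=False):
        self.content = content
        self.lstrip = lstrip

    def __hash__(self):
        return hash(self.content)  # hash ignores lstrip

    def __eq__(self, other):
        return (self.content, self.lstrip) == (other.content, other.lstrip)


existing = [Token("[MASK]", lstrip=True)]
incoming = Token("[MASK]", lstrip=False)

# Full-field equality: the lstrip=False copy does not match, so it is
# treated as a new token rather than a duplicate...
print(incoming in existing)  # False

# ...while any content-keyed structure sees both as the same entry, so
# whichever copy is processed later silently wins.
decoder = {t.content: t for t in existing + [incoming]}
print(decoder["[MASK]"].lstrip)  # False -> the lstrip=True version is lost
```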
The current output and the fixed output were attached as images.
[For maintainers] Suggested jobs to run (before merge): run-slow: big_bird
ArthurZucker left a comment
I need to check your answer a bit more
```python
if not special_token_value.special:
    special_token_value.special = True
self._tokenizer.add_tokens([special_token_value])
```
above you have:

```python
# if some of the special tokens are not already in the tokenizer, add them
# V5: Check both named special tokens and extra special tokens
# Iterate over _special_tokens_map to preserve AddedToken properties (lstrip, rstrip, etc.)
for special_token_value in self._special_tokens_map.values():
    if special_token_value is None:
        continue
    if str(special_token_value) not in encoder and special_token_value not in tokens_to_add:
        tokens_to_add.append(special_token_value)
```

so we are doing it twice, which I don't think makes sense.
Now, I am all in for a bug fix; for the bug you are fixing, could you add an equivalent small test for bigbird maybe?
Your deep dive is interesting, I did not fully check it, but whatever was serialized (in the tokenizer_config.json or tokenizer.json) takes precedence
Thanks for being so patient with the review! Hopefully this addresses the concerns :)
> Config takes precedence.

The bug, to my knowledge, is that [MASK] appears with conflicting properties within the preset itself: in _special_tokens_map (with lstrip=True, from google/bigbird-roberta-base/tokenizer_config.json) and in _extra_special_tokens (with the default properties lstrip=False, normalized=False, when loaded from google/bigbird-roberta-base/spiece.model). The fix is needed precisely because the preset conflicts with itself.
> Your deep dive is interesting, I did not fully check it, but whatever was serialized (in the tokenizer_config.json or tokenizer.json) takes precedence

In this case it doesn't, which is the motivation for the fix. However, your review led me to a more efficient approach: de-duplicate, since the whole problem stems from the duplicated mask. We can prevent the duplicate [MASK] from entering the batch in the first place instead of checking after the fact, and that works a charm as well (see the sketch below). I initially tried to solve this by making individual add_tokens calls for each _special_tokens_map entry after the batch, to ensure the correct properties reach the backend, but I think this is a better solution. I've also added a test case per your suggestion.
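Roughly, the de-duplication looks like this (a minimal sketch of the idea, with the maps passed in as plain arguments rather than read off self as in the actual patch):

```python
def collect_special_tokens(special_tokens_map, extra_special_tokens, encoder):
    """Gather special tokens to add, preventing a duplicate [MASK]
    (same content, different properties) from entering the batch."""
    tokens_to_add = []
    # Named special tokens enter first, carrying their configured
    # properties (e.g. lstrip=True for BigBird's [MASK]).
    for special_token_value in special_tokens_map.values():
        if special_token_value is None:
            continue
        if str(special_token_value) not in encoder and special_token_value not in tokens_to_add:
            tokens_to_add.append(special_token_value)
    # Extra special tokens are compared by string content, so a second
    # [MASK] with different properties never makes it into the batch.
    for token in extra_special_tokens:
        if str(token) not in encoder and str(token) not in {str(t) for t in tokens_to_add}:
            tokens_to_add.append(token)
    return tokens_to_add
```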
Hope this better aligns with the intended design here, thanks for taking the time!!
This is the expected behavior, yes.
Config always takes precedence.
Thanks for taking the time to review! I've left a reply here :)
```diff
  # Also check extra special tokens
  for token in self._extra_special_tokens:
-     if str(token) not in encoder and token not in tokens_to_add:
+     if str(token) not in encoder and str(token) not in {str(t) for t in tokens_to_add}:
```
Can you fix the complexity? :)
Create {str(t) for t in tokens_to_add} once, then check existence in O(1) instead of O(len(tokens_to_add)) :)
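In code, the hoisted version would look roughly like this (same names as in the diff above; encoder, tokens_to_add, and extra_special_tokens are assumed to exist in the enclosing scope):

```python
# Build the content set once, before the loop...
queued = {str(t) for t in tokens_to_add}
for token in extra_special_tokens:
    # ...so each membership test is O(1) instead of rebuilding the set
    # (an O(len(tokens_to_add)) pass) on every iteration.
    if str(token) not in encoder and str(token) not in queued:
        tokens_to_add.append(token)
        queued.add(str(token))
```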
Resolved :)
ArthurZucker left a comment
Can you work a bit more on the test please? Let's make explicit what is not passing on main today:
```python
# on main:
In [3]: tokenizer._special_tokens_map.get("mask_token")
Out[3]: AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True)
```

but the added tokens decoder has:

```python
AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)
```
```python
# Check that the backend also has lstrip=True
backend_mask = tokenizer._tokenizer.get_added_tokens_decoder()[mask_id]
```
this is not the test we should enforce; it was already working
```python
# Check that mask_token in _special_tokens_map has lstrip=True
mask_in_special = tokenizer._special_tokens_map.get("mask_token")
self.assertIsNotNone(mask_in_special)
self.assertTrue(mask_in_special.lstrip, "mask_token in _special_tokens_map should have lstrip=True")
```
this already passes for me locally
@ArthurZucker Thanks for your time. As I mentioned in this comment, [MASK] was registered a second time, so without the fix, tokenizing text containing [MASK] produces '_' artifacts. I think this makes a good testing benchmark, since it doesn't peek at internals and focuses on user-level behavior. Please let me know if this suffices as a better test; I've added both examples to the test for robustness, thanks!
The before-fix and after-fix outputs were attached as images.
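For reference, the user-level check is roughly of this shape (a sketch only; the checkpoint, sentence, and assertion are illustrative assumptions, not the exact test added in the PR):

```python
from transformers import BigBirdTokenizerFast

def test_mask_token_leaves_no_artifacts():
    tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
    tokens = tokenizer.tokenize("Paris is [MASK].")
    # Without the fix, the backend registers [MASK] with lstrip=False and
    # the space before it survives as a lone '▁' piece; with lstrip=True
    # (as configured) the whitespace is absorbed into the mask token.
    assert "▁" not in tokens
```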
Happy to make changes until it's production-ready, please do let me know; thanks a ton for your time!
Following up on this PR, happy to make changes or add context if helpful!
@ArthurZucker, just a gentle ping; I got a notification that the issue was unfortunately marked as stale, but the PR is ready for review :)
@ArthurZucker Just a gentle ping on this :)
Good day @ArthurZucker, just checking if this fix is still being considered?







What does this PR do?
→ Fixes test_modeling_big_bird.py::BigBirdModelIntegrationTest::test_fill_mask. For more details on reproducing the bug, please visit the linked issue!
Fixes #43653.
Screenshots attached: the CI failure, the current output, and the output after the fix.