Update default values of bos/eos token ids in CLIPTextConfig #24773

Conversation
```python
# (TODO): remove this comment
# we can't just reset `eos_token_id` to `vocab_size - 1`!
# - we need to respect the value in the config file.
# - (even if we want to reset, `eos_token_id` is not just `vocab_size - 1` when a user adds more tokens)
# Before all the config files in Hub repos are updated, we can't use `config.eos_token_id` in the modeling file.
```
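To illustrate the second point, here is a minimal sketch (assuming the `openai/clip-vit-base-patch32` checkpoint; `<my_new_token>` is a hypothetical added token): adding tokens grows the vocabulary but leaves `eos_token_id` untouched, so it is no longer `vocab_size - 1`.

```python
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(len(tok), tok.eos_token_id)  # 49408 49407 -> eos is the last id here

tok.add_tokens(["<my_new_token>"])  # hypothetical extra token
print(len(tok), tok.eos_token_id)  # 49409 49407 -> eos is no longer len(tok) - 1
```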
This PR just updates the default values. For existing Hub repo config files, their values are still used.
I will remove this TODO block once the PR has been reviewed.
```diff
@@ -106,10 +106,16 @@ def __init__(
     initializer_range=0.02,
     initializer_factor=1.0,
     pad_token_id=1,
```
When I use the tokenizer from the openai/clip repository, padding is done with 0. But in the HF CLIP tokenizer it is `eos_token_id`, and the config here has a default value of 1.
Although padding is not used for pooling in the CLIP text model, and should not affect the text-image similarity loss computation, I am afraid to change it here.
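To make the mismatch concrete, a minimal sketch (assuming the `openai/clip-vit-base-patch32` checkpoint; the original openai/clip codebase pads with 0 instead):

```python
from transformers import CLIPTextConfig, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
cfg = CLIPTextConfig()

# In the HF tokenizer, the pad token is the eos token "<|endoftext|>".
print(tok.pad_token_id == tok.eos_token_id)  # True (id 49407)

# The config default discussed here is 1, which matches neither side.
print(cfg.pad_token_id)  # 1
```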
The documentation is not available anymore as the PR was closed or merged.
Regarding the padding token (copy/paste from a (partial) internal discussion given by @patil-suraj):
amyeroberts left a comment
Thanks for updating, and for the detailed comments & explanations! 🤗
What does this PR do?
Currently the default values are not the ones from the corresponding tokenizers.
See discussion in #24650
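For illustration, a minimal sketch of the mismatch (assuming the `openai/clip-vit-base-patch32` checkpoint; the printed config values reflect the pre-PR defaults and may differ depending on the installed transformers version):

```python
from transformers import CLIPTextConfig, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
cfg = CLIPTextConfig()  # freshly constructed, so only the library defaults apply

# The tokenizer's special token ids sit at the end of its 49408-token vocab.
print(tok.bos_token_id, tok.eos_token_id)  # 49406 49407

# The config defaults before this PR do not match the tokenizer.
print(cfg.bos_token_id, cfg.eos_token_id)  # 0 2 before this PR
```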
However, we can't use `config.eos_token_id` in the modeling file (which is the ultimate goal of #24650) with only the change in this PR. We will have to update all the Hub repo config files first 😢. (Probably there is something easier to do.)