
Update default values of bos/eos token ids in CLIPTextConfig #24773

Merged
ydshieh merged 3 commits into main from break_everything_hope_not
Jul 12, 2023

Conversation

ydshieh (Collaborator) commented Jul 12, 2023

What does this PR do?

Currently, the default values are not the ones used by the corresponding tokenizers.

See discussion in #24650

However, with only the change in this PR we still can't use config.eos_token_id in the modeling file (which is the ultimate goal in #24650). We will first have to update the config files of all the Hub repos 😢. (Probably there is something easier to do.)
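To make the mismatch concrete, here is a minimal sketch (assuming the public openai/clip-vit-base-patch32 checkpoint; the token id values in the comments are the usual ones for the standard CLIP vocab, not quoted from this diff):

```python
from transformers import CLIPTextConfig, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
config = CLIPTextConfig()  # library defaults only, no Hub config involved

# The HF CLIP tokenizer uses <|startoftext|>/<|endoftext|> for bos/eos
# (49406/49407 in the standard CLIP vocab), which the old config defaults
# did not match.
print(tokenizer.bos_token_id, tokenizer.eos_token_id)
print(config.bos_token_id, config.eos_token_id)
```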

Comment on lines +113 to +117
# (TODO): remove this comment
# we can't just reset `eos_token_id` to `vocab_size - 1`!
# - we need to respect the value in the config file.
# - (even if we want to reset, `eos_token_id` is not just `vocab_size - 1` when a user adds more tokens)
# Before all the config files on the Hub repos are updated, we can't use `config.eos_token_id` in the modeling code.
ydshieh (Collaborator, Author)

This PR just updates the default values. For existing Hub repos, the values stored in their config files are still used.
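A small sketch of what that means in practice (hedged: the exact value stored in any given Hub config may differ):

```python
from transformers import CLIPTextConfig

default_config = CLIPTextConfig()  # picks up the new library defaults
hub_config = CLIPTextConfig.from_pretrained("openai/clip-vit-base-patch32")  # values come from the repo's config.json

# Until the Hub config files are updated, hub_config.eos_token_id keeps
# whatever value the repo declares, regardless of the new default.
print(default_config.eos_token_id)
print(hub_config.eos_token_id)
```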

ydshieh (Collaborator, Author)

I will remove this TODO block once the PR has been reviewed.

@@ -106,10 +106,16 @@ def __init__(
initializer_range=0.02,
initializer_factor=1.0,
pad_token_id=1,
ydshieh (Collaborator, Author)

When I use the tokenizer from the openai/clip repository, the padding is done with 0. But in the HF CLIP tokenizer, padding uses the eos_token_id, while the config here has the default value 1.

Although the padding token is not used for pooling in the CLIP text model, and should not affect the text-image similarity loss computation, I am hesitant to change it here.
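A quick way to see the HF side of that discrepancy (a sketch, assuming the public openai/clip-vit-base-patch32 tokenizer):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
batch = tokenizer(["a photo of a cat", "a dog"], padding=True)

# For the HF CLIP tokenizer the pad token is expected to be <|endoftext|>,
# i.e. pad_token_id == eos_token_id, not 0 (openai/clip) and not 1 (config default).
print(tokenizer.pad_token_id, tokenizer.eos_token_id)
print(batch["input_ids"])  # trailing positions of the shorter text are filled with pad_token_id
```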

HuggingFaceDocBuilderDev commented Jul 12, 2023

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator, Author) commented Jul 12, 2023

Regarding the padding token:

(copy-pasted from a (partial) internal discussion, originally from @patil-suraj)

When we added CLIP, I tested the text_projection, logits_per_image and logits_per_text. For the text_projection, the model pulls the embedding of the last token, i.e. the eos token. The rest of the tokens, i.e. the padding tokens, are ignored. We can see in this colab that text_projection, logits_per_image and logits_per_text match the OAI model because we only take the pooled embeddings. And when CLIP was released it was intended for these features, which are needed for contrastive tasks. Hence I didn't test against all token embeddings.

IMO the wrong padding token will only affect inference when using all token embeddings, i.e. Stable Diffusion. For training, even if the padding token is wrong, it shouldn't matter because:

  • CLIP did not use attention_mask during training.
  • CLIPTextEncoder uses a causal mask, so the tokens to the right don't influence the hidden states of tokens to the left.
  • CLIP is trained with a contrastive loss which is computed using the projections, and as said above the text_projection is computed by pooling the eos token embedding, which will always be the same no matter what the padding token is, because CLIPTextEncoder is causal, so the eos embedding won't be affected by tokens on its right.
  • Hence, for downstream training (like SD), as long as a consistent token is used for padding it shouldn't severely affect the training. But for inference we will need to use the same token, as Patrick explained. This could also be the reason that we haven't had any issue related to this.

As far as I can understand, it'll only affect inference if a different token (compared to the padding token used for training) is used for padding.
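A minimal sketch of that argument (assuming the public openai/clip-vit-base-patch32 checkpoint and the eos-pooling behavior described above):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

text = ["a photo of a cat"]
unpadded = tokenizer(text, return_tensors="pt")
padded = tokenizer(text, padding="max_length", max_length=16, return_tensors="pt")

with torch.no_grad():
    pooled_unpadded = model(**unpadded).pooler_output  # pulled from the eos position
    pooled_padded = model(**padded).pooler_output

# Because the encoder is causal, tokens to the right of eos (the padding) cannot
# change the eos hidden state, so the pooled embeddings should match closely.
print(torch.allclose(pooled_unpadded, pooled_padded, atol=1e-5))
```

For a downstream consumer of all token embeddings (as in Stable Diffusion), the same comparison would not hold, which matches the point above about inference with a different padding token.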

amyeroberts (Contributor) left a comment

Thanks for updating, and for the detailed comments & explanations! 🤗
