
Update default values of bos/eos token ids in CLIPTextConfig #24773

Merged
ydshieh merged 3 commits into main from break_everything_hope_not
Jul 12, 2023

Conversation

ydshieh (Collaborator) commented Jul 12, 2023

What does this PR do?

Currently, the default values are not the ones used by the corresponding tokenizers.

See discussion in #24650

However, with only the change in this PR we still can't use config.eos_token_id in the modeling file (which is the ultimate goal in #24650). We will first have to update the config files of all the Hub repos 😢. (Probably there is something easier to do.)
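To make the mismatch concrete, here is a minimal sketch (assuming the public openai/clip-vit-base-patch32 checkpoint; the token id values in the comments are the usual ones for the standard CLIP vocab, not quoted from this diff):

```python
from transformers import CLIPTextConfig, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
config = CLIPTextConfig()  # library defaults only, no Hub config involved

# The HF CLIP tokenizer uses <|startoftext|>/<|endoftext|> for bos/eos
# (49406/49407 in the standard CLIP vocab), which the old config defaults
# did not match.
print(tokenizer.bos_token_id, tokenizer.eos_token_id)
print(config.bos_token_id, config.eos_token_id)
```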

Comment on lines +113 to +117
# (TODO): remove this comment
# we can't just reset `eos_token_id` to `vocab_size - 1`!
# - we need to respect the value in the config file.
# - (even if we want to reset, `eos_token_id` is not just `vocab_size - 1` when a user adds more tokens)
# Before all the config files on the Hub repos are updated, we can't use `config.eos_token_id` in the modeling code.
ydshieh (Collaborator, Author)

This PR just updates the default values. For existing Hub repos, the values stored in their config files are still used.
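A small sketch of what that means in practice (hedged: the exact value stored in any given Hub config may differ):

```python
from transformers import CLIPTextConfig

default_config = CLIPTextConfig()  # picks up the new library defaults
hub_config = CLIPTextConfig.from_pretrained("openai/clip-vit-base-patch32")  # values come from the repo's config.json

# Until the Hub config files are updated, hub_config.eos_token_id keeps
# whatever value the repo declares, regardless of the new default.
print(default_config.eos_token_id)
print(hub_config.eos_token_id)
```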

ydshieh (Collaborator, Author)

I will remove this TODO block once the PR has been reviewed.

@@ -106,10 +106,16 @@ def __init__(
initializer_range=0.02,
initializer_factor=1.0,
pad_token_id=1,
ydshieh (Collaborator, Author)

When I use the tokenizer from the openai/clip repository, the padding is done with 0. But in the HF CLIP tokenizer, padding uses the eos_token_id, while the config here has the default value 1.

Although the padding token is not used for pooling in the CLIP text model, and should not affect the text-image similarity loss computation, I am hesitant to change it here.
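A quick way to see the HF side of that discrepancy (a sketch, assuming the public openai/clip-vit-base-patch32 tokenizer):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
batch = tokenizer(["a photo of a cat", "a dog"], padding=True)

# For the HF CLIP tokenizer the pad token is expected to be <|endoftext|>,
# i.e. pad_token_id == eos_token_id, not 0 (openai/clip) and not 1 (config default).
print(tokenizer.pad_token_id, tokenizer.eos_token_id)
print(batch["input_ids"])  # trailing positions of the shorter text are filled with pad_token_id
```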

HuggingFaceDocBuilderDev commented Jul 12, 2023

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator, Author) commented Jul 12, 2023

Regarding the padding token:

(copy-pasted from a (partial) internal discussion, originally from @patil-suraj)

When we added CLIP, I tested the text_projection, logits_per_image and logits_per_text. For the text_projection, the model pulls the embedding of the last token, i.e. the eos token. The rest of the tokens, i.e. the padding tokens, are ignored. We can see in this colab that text_projection, logits_per_image and logits_per_text match the OAI model because we only take the pooled embeddings. And when CLIP was released it was intended for these features, which are needed for contrastive tasks. Hence I didn't test against all token embeddings.

IMO the wrong padding token will only affect inference when using all token embeddings, i.e. Stable Diffusion. For training, even if the padding token is wrong, it shouldn't matter because:

  • CLIP did not use attention_mask during training.
  • CLIPTextEncoder uses a causal mask, so the tokens to the right don't influence the hidden states of tokens to the left.
  • CLIP is trained with a contrastive loss which is computed using the projections, and as said above the text_projection is computed by pooling the eos token embedding, which will always be the same no matter what the padding token is, because CLIPTextEncoder is causal, so the eos embedding won't be affected by tokens on its right.
  • Hence, for downstream training (like SD), as long as a consistent token is used for padding it shouldn't severely affect the training. But for inference we will need to use the same token, as Patrick explained. This could also be the reason that we haven't had any issue related to this.

As far as I can understand, it'll only affect inference if a different token (compared to the padding token used for training) is used for padding.
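A minimal sketch of that argument (assuming the public openai/clip-vit-base-patch32 checkpoint and the eos-pooling behavior described above):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

text = ["a photo of a cat"]
unpadded = tokenizer(text, return_tensors="pt")
padded = tokenizer(text, padding="max_length", max_length=16, return_tensors="pt")

with torch.no_grad():
    pooled_unpadded = model(**unpadded).pooler_output  # pulled from the eos position
    pooled_padded = model(**padded).pooler_output

# Because the encoder is causal, tokens to the right of eos (the padding) cannot
# change the eos hidden state, so the pooled embeddings should match closely.
print(torch.allclose(pooled_unpadded, pooled_padded, atol=1e-5))
```

For a downstream consumer of all token embeddings (as in Stable Diffusion), the same comparison would not hold, which matches the point above about inference with a different padding token.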

amyeroberts (Contributor) left a comment

Thanks for updating, and for the detailed comments & explanations! 🤗
