From 949e034b5ab6f8b4d1a6ffed2fbf3f63e2724fe0 Mon Sep 17 00:00:00 2001 From: Purusharth Malik Date: Thu, 27 Mar 2025 15:50:08 +0530 Subject: [PATCH 1/8] Update clip.md --- docs/source/en/model_doc/clip.md | 270 ++++++++++++------------------- 1 file changed, 99 insertions(+), 171 deletions(-) diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md index 2e1c5168ce71..c924be57a88b 100644 --- a/docs/source/en/model_doc/clip.md +++ b/docs/source/en/model_doc/clip.md @@ -14,221 +14,149 @@ rendered properly in your Markdown viewer. --> -# CLIP - -
-PyTorch -TensorFlow -Flax -FlashAttention -SDPA +
+
+ PyTorch + TensorFlow + Flax + FlashAttention + SDPA +
-## Overview - -The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, -Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP -(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be -instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing -for the task, similarly to the zero-shot capabilities of GPT-2 and 3. - -The abstract from the paper is the following: - -*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This -restricted form of supervision limits their generality and usability since additional labeled data is needed to specify -any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a -much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes -with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 -million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference -learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study -the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks -such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The -model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need -for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot -without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained -model weights at this https URL.* - -This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP). - -## Usage tips and example - -CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image -classification. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text -features. Both the text and visual features are then projected to a latent space with identical dimension. The dot -product between the projected image and text features is then used as a similar score. - -To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, -which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors -also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. -The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model. - -The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps -[`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance to both -encode the text and prepare the images. 
The following example shows how to get the image-text similarity scores using -[`CLIPProcessor`] and [`CLIPModel`]. - - -```python ->>> from PIL import Image ->>> import requests +# CLIP ->>> from transformers import CLIPProcessor, CLIPModel +[CLIP](https://huggingface.co/papers/2103.00020) is a is a multi-modal vision and language model trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. It can also be used for zero-shot image classification. The crux of the ideology behind the model is to use a ViT like transformer to get the visual features and a causal language model to get the text features. Both these features are then projected to a latent space with the same number of dimensions and their dot product gives us a similarity score. ->>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") ->>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") +You can find all the original CLIP checkpoints under the `Models` section of [OpenAI's Company Page](https://huggingface.co/openai). ->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" ->>> image = Image.open(requests.get(url, stream=True).raw) +> [!TIP] +> Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks. ->>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) +The example below demonstrates how to calculate similarity scores between multiple textual options and a given image with [`Pipeline`], [`AutoModel`], and [`CLIPModel`]. ->>> outputs = model(**inputs) ->>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score ->>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities -``` + + +```py +import requests +from PIL import Image +from transformers import pipeline -### Combining CLIP and Flash Attention 2 +clip = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32") -First, make sure to install the latest version of Flash Attention 2. +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) +labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] -```bash -pip install -U flash-attn --no-build-isolation +similarity = clip(image, candidate_labels=labels) +print(similarity) ``` -Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. `torch.float16`) - - - -For small batch sizes, you might notice a slowdown in your model when using flash attention. Refer to the section [Expected speedups with Flash Attention and SDPA](#Expected-speedups-with-Flash-Attention-and-SDPA) below and select an appropriate attention implementation. 
- - - -To load and run a model using Flash Attention 2, refer to the snippet below: - -```python ->>> import torch ->>> import requests ->>> from PIL import Image + + ->>> from transformers import CLIPProcessor, CLIPModel +```py +import requests +from PIL import Image +from transformers import AutoProcessor, AutoModel +from PIL import Image +import torch ->>> device = "cuda" ->>> torch_dtype = torch.float16 +model = AutoModel.from_pretrained("openai/clip-vit-base-patch32") +processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32") ->>> model = CLIPModel.from_pretrained( -... "openai/clip-vit-base-patch32", -... attn_implementation="flash_attention_2", -... device_map=device, -... torch_dtype=torch_dtype, -... ) ->>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) +text = ["a photo of a cat", "a photo of a dog", "a photo of a car"] ->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" ->>> image = Image.open(requests.get(url, stream=True).raw) +inputs = processor(text=text, images=image, return_tensors="pt", padding=True) +outputs = model(**inputs) ->>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) ->>> inputs.to(device) +image_features = outputs.image_embeds +text_features = outputs.text_embeds ->>> with torch.no_grad(): -... with torch.autocast(device): -... outputs = model(**inputs) - ->>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score ->>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities ->>> print(probs) -tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16) +similarity = torch.cosine_similarity(image_features, text_features) +print(similarity) ``` + + -### Using Scaled Dot Product Attention (SDPA) - -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) -or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) -page for more information. +```py +from PIL import Image +import requests -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set -`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. +from transformers import CLIPProcessor, CLIPModel -```python -from transformers import CLIPModel +model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") +processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") -model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa") -``` - -For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`). 
+url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) -### Expected speedups with Flash Attention and SDPA +inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) -On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with `float16`, we saw the following speedups during inference for `"openai/clip-vit-large-patch14"` checkpoint ([code](https://gist.github.com/qubvel/ac691a54e54f9fae8144275f866a7ff8)): - -#### CLIPTextModel +outputs = model(**inputs) +logits_per_image = outputs.logits_per_image # this is the image-text similarity score +probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities +``` -| Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | -|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| -| 4 | 0.009 | 0.012 | 0.737 | 0.007 | 1.269 | -| 16 | 0.009 | 0.014 | 0.659 | 0.008 | 1.187 | -| 32 | 0.018 | 0.021 | 0.862 | 0.016 | 1.142 | -| 64 | 0.034 | 0.034 | 1.001 | 0.03 | 1.163 | -| 128 | 0.063 | 0.058 | 1.09 | 0.054 | 1.174 | + + -![clip_text_model_viz_3](https://github.com/user-attachments/assets/e9826b43-4e66-4f4c-952b-af4d90bd38eb) +## Using CLIP with Quantization and Flash Attention 2 -#### CLIPVisionModel + -| Image batch size | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | -|-------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| -| 1 | 0.016 | 0.013 | 1.247 | 0.012 | 1.318 | -| 4 | 0.025 | 0.021 | 1.198 | 0.021 | 1.202 | -| 16 | 0.093 | 0.075 | 1.234 | 0.075 | 1.24 | -| 32 | 0.181 | 0.147 | 1.237 | 0.146 | 1.241 | +For small batch sizes, you might notice a slowdown in your model when using flash attention. Refer to the section [Expected speedups with Flash Attention and SDPA](#Expected-speedups-with-Flash-Attention-and-SDPA) below and select an appropriate attention implementation. 
-![clip_image_model_viz_3](https://github.com/user-attachments/assets/50a36206-e3b9-4adc-ac8e-926b8b071d63) + -#### CLIPModel +To load and run a model using Flash Attention 2 in a quantized manner, refer to the snippet below: -| Image batch size | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | -|-------------------:|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| -| 1 | 4 | 0.025 | 0.026 | 0.954 | 0.02 | 1.217 | -| 1 | 16 | 0.026 | 0.028 | 0.918 | 0.02 | 1.287 | -| 1 | 64 | 0.042 | 0.046 | 0.906 | 0.036 | 1.167 | -| 4 | 4 | 0.028 | 0.033 | 0.849 | 0.024 | 1.189 | -| 4 | 16 | 0.034 | 0.035 | 0.955 | 0.029 | 1.169 | -| 4 | 64 | 0.059 | 0.055 | 1.072 | 0.05 | 1.179 | -| 16 | 4 | 0.096 | 0.088 | 1.091 | 0.078 | 1.234 | -| 16 | 16 | 0.102 | 0.09 | 1.129 | 0.083 | 1.224 | -| 16 | 64 | 0.127 | 0.11 | 1.157 | 0.105 | 1.218 | -| 32 | 4 | 0.185 | 0.159 | 1.157 | 0.149 | 1.238 | -| 32 | 16 | 0.19 | 0.162 | 1.177 | 0.154 | 1.233 | -| 32 | 64 | 0.216 | 0.181 | 1.19 | 0.176 | 1.228 | +```py +import requests +import torch +from PIL import Image +from transformers import AutoModel, AutoProcessor, BitsAndBytesConfig -## Resources +# Set up quantization config +quant_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True +) -A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP. +model = AutoModel.from_pretrained( + "openai/clip-vit-base-patch32", + quantization_config=quant_config, + device_map="auto", + attn_implementation="flash_attention_2" # FlashAttention only supports Ampere GPUs or newer +) -- [Fine tuning CLIP with Remote Sensing (Satellite) images and captions](https://huggingface.co/blog/fine-tune-clip-rsicd), a blog post about how to fine-tune CLIP with [RSICD dataset](https://github.com/201528014227051/RSICD_optimal) and comparison of performance changes due to data augmentation. -- This [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) shows how to train a CLIP-like vision-text dual encoder model using a pre-trained vision and text encoder using [COCO dataset](https://cocodataset.org/#home). +processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32") - +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) +text = ["a photo of a cat", "a photo of a dog"] -- A [notebook](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing) on how to use a pretrained CLIP for inference with beam search for image captioning. 🌎 +inputs = processor(text=text, images=image, return_tensors="pt", padding=True) +inputs = {k: v.to("cuda") for k, v in inputs.items()} -**Image retrieval** +with torch.no_grad(): + outputs = model(**inputs) -- A [notebook](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing) on image retrieval using pretrained CLIP and computing MRR(Mean Reciprocal Rank) score. 🌎 -- A [notebook](https://colab.research.google.com/github/deep-diver/image_search_with_natural_language/blob/main/notebooks/Image_Search_CLIP.ipynb) on image retrieval and showing the similarity score. 🌎 -- A [notebook](https://colab.research.google.com/drive/1xO-wC_m_GNzgjIBQ4a4znvQkvDoZJvH4?usp=sharing) on how to map images and texts to the same vector space using Multilingual CLIP. 
🌎 -- A [notebook](https://colab.research.google.com/github/vivien000/clip-demo/blob/master/clip.ipynb#scrollTo=uzdFhRGqiWkR) on how to run CLIP on semantic image search using [Unsplash](https://unsplash.com) and [TMDB](https://www.themoviedb.org/) datasets. 🌎 +image_features = outputs.image_embeds +text_features = outputs.text_embeds -**Explainability** +similarity = torch.cosine_similarity(image_features, text_features) +``` -- A [notebook](https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb) on how to visualize similarity between input token and image segment. 🌎 +## Notes -If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. -The resource should ideally demonstrate something new instead of duplicating an existing resource. +- For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16). ## CLIPConfig From 0c54a852e8090eccecdedf8806889d9082757141 Mon Sep 17 00:00:00 2001 From: Purusharth Malik <56820986+purusharthmalik@users.noreply.github.com> Date: Sat, 29 Mar 2025 16:50:43 +0530 Subject: [PATCH 2/8] Update docs/source/en/model_doc/clip.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/clip.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md index c924be57a88b..36c54ab3f84d 100644 --- a/docs/source/en/model_doc/clip.md +++ b/docs/source/en/model_doc/clip.md @@ -27,7 +27,7 @@ rendered properly in your Markdown viewer. # CLIP -[CLIP](https://huggingface.co/papers/2103.00020) is a is a multi-modal vision and language model trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. It can also be used for zero-shot image classification. The crux of the ideology behind the model is to use a ViT like transformer to get the visual features and a causal language model to get the text features. Both these features are then projected to a latent space with the same number of dimensions and their dot product gives us a similarity score. +[CLIP](https://huggingface.co/papers/2103.00020) is a is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs with an image encoder and text encoder. Both features are projected to a latent space with the same number of dimensions and their dot product gives a similarity score. You can find all the original CLIP checkpoints under the `Models` section of [OpenAI's Company Page](https://huggingface.co/openai). 
From dc56c9c7d3a94b08f79fabd984f5dd44def9a806 Mon Sep 17 00:00:00 2001 From: Purusharth Malik <56820986+purusharthmalik@users.noreply.github.com> Date: Sat, 29 Mar 2025 16:51:07 +0530 Subject: [PATCH 3/8] Update docs/source/en/model_doc/clip.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/clip.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md index 36c54ab3f84d..4a6c04f41f2a 100644 --- a/docs/source/en/model_doc/clip.md +++ b/docs/source/en/model_doc/clip.md @@ -29,7 +29,7 @@ rendered properly in your Markdown viewer. [CLIP](https://huggingface.co/papers/2103.00020) is a is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs with an image encoder and text encoder. Both features are projected to a latent space with the same number of dimensions and their dot product gives a similarity score. -You can find all the original CLIP checkpoints under the `Models` section of [OpenAI's Company Page](https://huggingface.co/openai). +You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai) organization. > [!TIP] > Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks. From 1641265bbc697e882eaeca0ec52a65cedebc8b12 Mon Sep 17 00:00:00 2001 From: Purusharth Malik <56820986+purusharthmalik@users.noreply.github.com> Date: Sat, 29 Mar 2025 16:51:27 +0530 Subject: [PATCH 4/8] Update docs/source/en/model_doc/clip.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/clip.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md index 4a6c04f41f2a..9d6a6e33ef38 100644 --- a/docs/source/en/model_doc/clip.md +++ b/docs/source/en/model_doc/clip.md @@ -34,7 +34,7 @@ You can find all the original CLIP checkpoints under the [OpenAI](https://huggin > [!TIP] > Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks. -The example below demonstrates how to calculate similarity scores between multiple textual options and a given image with [`Pipeline`], [`AutoModel`], and [`CLIPModel`]. +The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [`Pipeline`] or the [`AutoModel`] class. 
From c4a2ab6f98aac4890d9b1fd3df9f94c25cefd022 Mon Sep 17 00:00:00 2001 From: Purusharth Malik Date: Sat, 29 Mar 2025 16:58:59 +0530 Subject: [PATCH 5/8] Incorporated suggested changes --- docs/source/en/model_doc/clip.md | 109 ++++++------------------------- 1 file changed, 19 insertions(+), 90 deletions(-) diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md index 9d6a6e33ef38..3c7a3599ade3 100644 --- a/docs/source/en/model_doc/clip.md +++ b/docs/source/en/model_doc/clip.md @@ -40,18 +40,17 @@ The example below demonstrates how to calculate similarity scores between multip ```py -import requests -from PIL import Image +import torch from transformers import pipeline -clip = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32") - -url = "http://images.cocodataset.org/val2017/000000039769.jpg" -image = Image.open(requests.get(url, stream=True).raw) +clip = pipeline( + task="zero-shot-image-classification", + model="openai/clip-vit-base-patch32", + torch_dtype=torch.bfloat16, + device=0 +) labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] - -similarity = clip(image, candidate_labels=labels) -print(similarity) +clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels) ``` @@ -59,104 +58,34 @@ print(similarity) ```py import requests +import torch from PIL import Image from transformers import AutoProcessor, AutoModel -from PIL import Image -import torch -model = AutoModel.from_pretrained("openai/clip-vit-base-patch32") +model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.bfloat16, attn_implementation="sdpa") processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32") url = "http://images.cocodataset.org/val2017/000000039769.jpg" image = Image.open(requests.get(url, stream=True).raw) -text = ["a photo of a cat", "a photo of a dog", "a photo of a car"] - -inputs = processor(text=text, images=image, return_tensors="pt", padding=True) -outputs = model(**inputs) - -image_features = outputs.image_embeds -text_features = outputs.text_embeds - -similarity = torch.cosine_similarity(image_features, text_features) -print(similarity) -``` - - - - -```py -from PIL import Image -import requests - -from transformers import CLIPProcessor, CLIPModel - -model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") -processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") - -url = "http://images.cocodataset.org/val2017/000000039769.jpg" -image = Image.open(requests.get(url, stream=True).raw) +labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] -inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) +inputs = processor(text=labels, images=image, return_tensors="pt", padding=True) outputs = model(**inputs) -logits_per_image = outputs.logits_per_image # this is the image-text similarity score -probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities +logits_per_image = outputs.logits_per_image +probs = logits_per_image.softmax(dim=1) +most_likely_idx = probs.argmax(dim=1).item() +most_likely_label = labels[most_likely_idx] +print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}") ``` -## Using CLIP with Quantization and Flash Attention 2 - - - -For small batch sizes, you might notice a slowdown in your model when using flash attention. 
Refer to the section [Expected speedups with Flash Attention and SDPA](#Expected-speedups-with-Flash-Attention-and-SDPA) below and select an appropriate attention implementation.
-
-</Tip>
-
-To load and run a model using Flash Attention 2 in a quantized manner, refer to the snippet below:
-
-```py
-import requests
-import torch
-from PIL import Image
-from transformers import AutoModel, AutoProcessor, BitsAndBytesConfig
-
-# Set up quantization config
-quant_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_compute_dtype=torch.float16,
-    bnb_4bit_use_double_quant=True
-)
-
-model = AutoModel.from_pretrained(
-    "openai/clip-vit-base-patch32",
-    quantization_config=quant_config,
-    device_map="auto",
-    attn_implementation="flash_attention_2" # FlashAttention only supports Ampere GPUs or newer
-)
-
-processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
-
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-text = ["a photo of a cat", "a photo of a dog"]
-
-inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
-inputs = {k: v.to("cuda") for k, v in inputs.items()}
-
-with torch.no_grad():
-    outputs = model(**inputs)
-
-image_features = outputs.image_embeds
-text_features = outputs.text_embeds
-
-similarity = torch.cosine_similarity(image_features, text_features)
-```
-
 ## Notes
 
-- For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16).
+- If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
+- The resource should ideally demonstrate something new instead of duplicating an existing resource.
 
 ## CLIPConfig

From aff85d0b139f15b39592c96ef27513e6d3e9d0b8 Mon Sep 17 00:00:00 2001
From: Purusharth Malik <56820986+purusharthmalik@users.noreply.github.com>
Date: Thu, 3 Apr 2025 00:56:55 +0530
Subject: [PATCH 6/8] Update docs/source/en/model_doc/clip.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/clip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md
index 3c7a3599ade3..592b0e080db6 100644
--- a/docs/source/en/model_doc/clip.md
+++ b/docs/source/en/model_doc/clip.md
@@ -27,7 +27,7 @@ rendered properly in your Markdown viewer.
 
 # CLIP
 
-[CLIP](https://huggingface.co/papers/2103.00020) is a is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs with an image encoder and text encoder. Both features are projected to a latent space with the same number of dimensions and their dot product gives a similarity score.
+[CLIP](https://huggingface.co/papers/2103.00020) is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and text encoder to get visual features and text features. Both features are projected to a latent space with the same number of dimensions and their dot product gives a similarity score.
 You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai) organization.

From e56b1757c0a9c5171d3206d5e5277f76c689a749 Mon Sep 17 00:00:00 2001
From: Purusharth Malik <56820986+purusharthmalik@users.noreply.github.com>
Date: Thu, 3 Apr 2025 00:57:16 +0530
Subject: [PATCH 7/8] Update docs/source/en/model_doc/clip.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/clip.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md
index 592b0e080db6..0aa7b885c553 100644
--- a/docs/source/en/model_doc/clip.md
+++ b/docs/source/en/model_doc/clip.md
@@ -84,8 +84,7 @@ print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_
 
 ## Notes
 
-- If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
-- The resource should ideally demonstrate something new instead of duplicating an existing resource.
+- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model.
 
 ## CLIPConfig

From 86daaefe9a61f07c26cf63b470b1f00dc8a37548 Mon Sep 17 00:00:00 2001
From: Purusharth Malik <56820986+purusharthmalik@users.noreply.github.com>
Date: Thu, 3 Apr 2025 00:57:31 +0530
Subject: [PATCH 8/8] Update docs/source/en/model_doc/clip.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/clip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md
index 0aa7b885c553..4ab9fe3f21ac 100644
--- a/docs/source/en/model_doc/clip.md
+++ b/docs/source/en/model_doc/clip.md
@@ -29,7 +29,7 @@ rendered properly in your Markdown viewer.
 
 [CLIP](https://huggingface.co/papers/2103.00020) is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and text encoder to get visual features and text features. Both features are projected to a latent space with the same number of dimensions and their dot product gives a similarity score.
 
-You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai) organization.
+You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai?search_models=clip) organization.
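
The [`CLIPImageProcessor`] note introduced in PATCH 7/8 describes the preprocessing step without showing it in isolation. A minimal sketch of that usage is below; the checkpoint and the printed output shape are illustrative assumptions based on the base-patch32 configuration and are not part of the patches above.

```py
import requests
from PIL import Image
from transformers import CLIPImageProcessor

# Load the image processor tied to an assumed CLIP checkpoint
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resizes, center-crops, rescales, and normalizes the image into model-ready pixel values
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # expected to be roughly torch.Size([1, 3, 224, 224]) for this config
```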