From 972d2e40bc6da896dbeb41b3f83719ef2527da3b Mon Sep 17 00:00:00 2001
From: devkade
Date: Sun, 1 Jun 2025 17:34:18 +0900
Subject: [PATCH 1/5] Update docs/source/en/model_doc/blip.md

---
 docs/source/en/model_doc/blip.md | 77 +++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 15 deletions(-)

diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md
index efb6b27082af..83e9cd23e064 100644
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@@ -14,31 +14,78 @@ rendered properly in your Markdown viewer.

 -->
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
+    </div>
+</div>

 # BLIP

-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
-</div>
+[BLIP](https://huggingface.co/papers/2201.12086) BLIP (Bootstrapped Language-Image Pretraining) is a multimodal model that can understand and generate images and text together, and it is distinctive in that one model can handle various vision-language tasks such as image captioning, image-text matching, and VQA all together. Unlike existing VLP models that were specialized in only one of understanding or generation, BLIP is designed to be flexibly transferred to both domains. In particular, without using noisy image-text pairs collected from the web as they are, it introduces a ‘bootstrapping caption’ technique that generates sentences with its own captioner and goes through a filtering process to increase learning quality, thereby securing cleaner and more meaningful data. As a result, BLIP achieves high performance and excellent generalization despite little artificial manual labor, and can be effectively utilized for various multimodal tasks.
+
+
+You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.
+
+> [!TIP]
+> Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different tasks.
+
+The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

-## Overview
+<hfoptions id="usage">
+<hfoption id="Pipeline">

-The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
+```python
+import torch
+from transformers import pipeline

-BLIP is a model that is able to perform various multi-modal tasks including:
-- Visual Question Answering
-- Image-Text retrieval (Image-text matching)
-- Image Captioning
+pipeline = pipeline(
+    task="visual-question-answering",
+    model="Salesforce/blip-vqa-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+print(pipeline(question="What is cat doing?", image=url))
+```

-The abstract from the paper is the following:
+</hfoption>
+<hfoption id="AutoModel">

-*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks.
-However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.*
+```python
+import requests
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
+
+processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
+model = AutoModelForVisualQuestionAnswering.from_pretrained(
+    "Salesforce/blip-vqa-base",
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+question = "What is cat doing?"
+inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
+
+output = model.generate(**inputs)
+print(processor.batch_decode(output, skip_special_tokens=True)[0])
+```
+
+</hfoption>
+</hfoptions>

 ![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)

-This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
-The original code can be found [here](https://github.com/salesforce/BLIP).
+## Notes
+
+- This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
+- The original code can be found [here](https://github.com/salesforce/BLIP).

 ## Resources

From 7fe590ccb7164c19bf28c5b3cf7cdb03acad3e1f Mon Sep 17 00:00:00 2001
From: devkade
Date: Sun, 1 Jun 2025 17:52:01 +0900
Subject: [PATCH 2/5] fix(docs/source/en/model_doc/blip.md): fix redundent typo error

---
 docs/source/en/model_doc/blip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md
index 83e9cd23e064..e6391950396d 100644
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.

 # BLIP

-[BLIP](https://huggingface.co/papers/2201.12086) BLIP (Bootstrapped Language-Image Pretraining) is a multimodal model that can understand and generate images and text together, and it is distinctive in that one model can handle various vision-language tasks such as image captioning, image-text matching, and VQA all together. Unlike existing VLP models that were specialized in only one of understanding or generation, BLIP is designed to be flexibly transferred to both domains. In particular, without using noisy image-text pairs collected from the web as they are, it introduces a ‘bootstrapping caption’ technique that generates sentences with its own captioner and goes through a filtering process to increase learning quality, thereby securing cleaner and more meaningful data. As a result, BLIP achieves high performance and excellent generalization despite little artificial manual labor, and can be effectively utilized for various multimodal tasks.
+[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a multimodal model that can understand and generate images and text together, and it is distinctive in that one model can handle various vision-language tasks such as image captioning, image-text matching, and VQA all together. Unlike existing VLP models that were specialized in only one of understanding or generation, BLIP is designed to be flexibly transferred to both domains. In particular, without using noisy image-text pairs collected from the web as they are, it introduces a ‘bootstrapping caption’ technique that generates sentences with its own captioner and goes through a filtering process to increase learning quality, thereby securing cleaner and more meaningful data. As a result, BLIP achieves high performance and excellent generalization despite little artificial manual labor, and can be effectively utilized for various multimodal tasks.


 You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.

From de1d8f471bb3ba9e1a137cd23b8c7e36a52bc124 Mon Sep 17 00:00:00 2001
From: devkade
Date: Fri, 20 Jun 2025 17:51:33 +0900
Subject: [PATCH 3/5] fix (docs/source/en/model_doc/blip.md): modify of review contents

---
 docs/source/en/model_doc/blip.md | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md
index e6391950396d..05e2ccada064 100644
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@@ -23,15 +23,15 @@ rendered properly in your Markdown viewer.

 # BLIP

-[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a multimodal model that can understand and generate images and text together, and it is distinctive in that one model can handle various vision-language tasks such as image captioning, image-text matching, and VQA all together. Unlike existing VLP models that were specialized in only one of understanding or generation, BLIP is designed to be flexibly transferred to both domains. In particular, without using noisy image-text pairs collected from the web as they are, it introduces a ‘bootstrapping caption’ technique that generates sentences with its own captioner and goes through a filtering process to increase learning quality, thereby securing cleaner and more meaningful data. As a result, BLIP achieves high performance and excellent generalization despite little artificial manual labor, and can be effectively utilized for various multimodal tasks.
+[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks. Most existing pretrained models are only good at one or the other. It uses a captioner to generate captions and a filter to remove the noisy captions. This increases training data quality and more effectively uses the messy web data.


 You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.

 > [!TIP]
-> Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different tasks.
+> Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different vision language tasks.

-The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
+The example below demonstrates visual question answering with [`Pipeline`] or the [`AutoModel`] class.

 <hfoptions id="usage">
 <hfoption id="Pipeline">
@@ -82,14 +82,9 @@ print(processor.batch_decode(output, skip_special_tokens=True)[0])
 ```

 ![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)

-## Notes
-
-- This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
-- The original code can be found [here](https://github.com/salesforce/BLIP).
-
 ## Resources

-- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset
+Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) to learn how to fine-tune BLIP for image captioning on a custom dataset.

 ## BlipConfig

From 11431a0da000aa556942d15b32ce77d9340bb880 Mon Sep 17 00:00:00 2001
From: devkade
Date: Fri, 20 Jun 2025 17:56:49 +0900
Subject: [PATCH 4/5] fix(docs/source/en/model_doc/blip.md): modify code block

---
 docs/source/en/model_doc/blip.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md
index 05e2ccada064..212581054ccf 100644
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@@ -46,8 +46,8 @@ pipeline = pipeline(
     torch_dtype=torch.float16,
     device=0
 )
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-print(pipeline(question="What is cat doing?", image=url))
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+pipeline(question="What is the weather in this image?", image=url)
 ```

 </hfoption>
@@ -66,14 +66,14 @@ model = AutoModelForVisualQuestionAnswering.from_pretrained(
     device_map="auto"
 )

-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
 image = Image.open(requests.get(url, stream=True).raw)

-question = "What is cat doing?"
+question = "What is the weather in this image?"
 inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)

 output = model.generate(**inputs)
-print(processor.batch_decode(output, skip_special_tokens=True)[0])
+processor.batch_decode(output, skip_special_tokens=True)[0]
 ```

From 7766fe658ca55ff9250e9ddf4d5bf9258adba596 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Fri, 20 Jun 2025 13:30:35 -0700
Subject: [PATCH 5/5] Update blip.md

---
 docs/source/en/model_doc/blip.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md
index 212581054ccf..a8d4c5a14bbd 100644
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@@ -29,6 +29,8 @@ rendered properly in your Markdown viewer.
 You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.

 > [!TIP]
+> This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
+>
 > Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different vision language tasks.

 The example below demonstrates visual question answering with [`Pipeline`] or the [`AutoModel`] class.
@@ -77,11 +79,8 @@ processor.batch_decode(output, skip_special_tokens=True)[0]
 ```

 </hfoption>
 </hfoptions>
-
-![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)
-
 ## Resources

 Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) to learn how to fine-tune BLIP for image captioning on a custom dataset.
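
The Resources entry that the series settles on points to a notebook for fine-tuning BLIP on image captioning. As a rough sketch of the inference side of that task (an illustration, not part of the patches; the `Salesforce/blip-image-captioning-base` checkpoint, `BlipProcessor`, and `BlipForConditionalGeneration` are assumed here rather than taken from the diffs):

```python
# Minimal captioning sketch; the checkpoint and model class are assumptions, not taken from the patches above.
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base",
    torch_dtype=torch.float16,
    device_map="auto"
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Passing only the image produces an unconditional caption.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
```

Supplying `text=` to the processor as well makes the generated caption continue from that prefix.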