
[EMNLP 2025] RICO: An Enhanced Image Recaptioning Method via Visual Reconstruction


This is the official implementation of the paper RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction.

📆 TODO List

  • Code for the full RICO pipeline.
  • Pretrained checkpoint for RICO-Flash.
  • Training method for your own DPO-based model.

💡 Introduction

Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but the resulting captions often suffer from inaccuracy caused by hallucinations and from incompleteness caused by missing fine-grained details.

To address these issues, we propose RICO, a novel framework that enhances captions through an iterative visual reconstruction-and-refinement pipeline. Our key idea is to:

  1. Reconstruct the caption into an image using a text-to-image model.

  2. Compare the original image with the reconstructed image using an MLLM.

  3. Refine the caption based on detected discrepancies.

This iterative process leads to more accurate and comprehensive captions. To further reduce the computational cost of multiple iterations, we introduce RICO-Flash, a lightweight variant trained with Direct Preference Optimization (DPO) to emulate RICO’s behavior in a single step.
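The three numbered steps above can be sketched as a simple loop. The helper names (generate_image, compare_images, refine_caption) are hypothetical placeholders for the text-to-image model, the MLLM comparator, and the caption refiner, not the actual functions in the repo:

```python
# A minimal sketch of the RICO reconstruction-and-refinement loop.
# The three callables are hypothetical stand-ins for the models
# plugged in via models_util/.
def rico_loop(image, init_caption, generate_image, compare_images,
              refine_caption, iter_steps=2):
    caption = init_caption
    for _ in range(iter_steps):
        # 1. Reconstruct the caption into an image.
        reconstructed = generate_image(caption)
        # 2. Compare the original image with the reconstructed one.
        discrepancies = compare_images(image, reconstructed)
        # 3. Refine the caption based on the detected discrepancies.
        caption = refine_caption(caption, discrepancies)
    return caption
```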

⚙️ Installation & Setup

Environment Setup

Required packages and dependencies are listed in the requirements.txt file. You can install the environment using Conda with the following command:

conda create -n rico python=3.11
conda activate rico
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Add Your GPT-4o Integration

To use GPT-4o in the caption refinement process, you need to implement your own API call method in models_util/gpt4o.py. The current implementation is a placeholder and should be replaced with your actual API logic:

def call_gpt4o(orig_img_path, new_img_path, prompt):
    pass
    #! Please implement the function to call GPT-4o with the provided prompt and return the response.
    #! This function should handle the API call to GPT-4o, passing the original and reconstructed image paths along with the prompt.
    #! The function should return the revised caption generated by GPT-4o.
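As a reference, here is one possible implementation using the OpenAI Chat Completions REST API over the standard library. The endpoint, model name, and the OPENAI_API_KEY environment variable are assumptions; adapt them to your own provider or client library:

```python
import base64
import json
import os
import urllib.request

# Assumed endpoint; swap in your own provider's URL if needed.
OPENAI_URL = "https://api.openai.com/v1/chat/completions"


def encode_image(path):
    # Base64-encode an image file for inline transmission as a data URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_messages(orig_b64, new_b64, prompt):
    # A single user turn carrying the prompt plus both images.
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + orig_b64}},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + new_b64}},
        ],
    }]


def call_gpt4o(orig_img_path, new_img_path, prompt):
    # Reads the API key from the OPENAI_API_KEY environment variable
    # (an assumption of this sketch).
    payload = {
        "model": "gpt-4o",
        "messages": build_messages(encode_image(orig_img_path),
                                   encode_image(new_img_path),
                                   prompt),
    }
    req = urllib.request.Request(
        OPENAI_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    # Return the revised caption produced by the model.
    return data["choices"][0]["message"]["content"]
```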

Plug-and-Play Design

RICO is designed with a modular and extensible architecture, making it easy to plug in your own models for various components. Specifically, you can:

  • Replace the text-to-image model by modifying models_util/flux.py.

  • Swap out the initial captioning model in models_util/qwen_single.py.

  • Integrate your own caption refinement model by editing models_util/gpt4o.py.

Simply follow the existing interfaces and structures defined in these files to ensure compatibility.

🚀 Usage

To run the RICO pipeline, you can use the provided main_loop.py script.

We list some important arguments below; the full list is available in main_loop.py:

import argparse

# number of iterations for the reconstruction-refinement process
ITER_STEPS = 2

parser = argparse.ArgumentParser()

# path to the folder containing the images to be processed
parser.add_argument('--image_folder', type=str, default='datasets/capsbench')

# optional path to a JSON file of pre-generated captions
parser.add_argument('--caption_json', type=str, default=None)

# output directory for saving the iteratively reconstructed images
parser.add_argument('--output_video_dir', type=str, default='results/outputs')

# output directory for saving the iteratively refined captions
parser.add_argument('--output_json_dir', type=str, default='results/records')

⚡ RICO-Flash Model

RICO-Flash is a lightweight variant of RICO that performs the reconstruction-and-refinement process in a single step, substantially reducing computational overhead. It fine-tunes Qwen2-VL with LoRA and DPO, built on top of LLaMA-Factory.


Setup

We recommend cloning the official LLaMA-Factory repository and installing its dependencies:

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt

LoRA Checkpoint for RICO-Flash

You can download and extract the trained LoRA checkpoint for RICO-Flash from this link.

After training, you may merge the LoRA adapter weights into the base Qwen2-VL model. Please refer to the LLaMA-Factory documentation for details. An example command is shown below:

llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

You can modify the file path in examples/merge_lora/llama3_lora_sft.yaml to match your configuration, then run the command to perform the merge. The final merged model will be saved to the output directory you specify.

Inference with the Final Model

After merging the LoRA weights, you can load the final model just like a standard Qwen2-VL checkpoint:

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Qwen2-VL is a vision-language model, so it is loaded with its
# conditional-generation class and a processor (which wraps the
# tokenizer and the image preprocessor) rather than a text-only
# AutoModelForCausalLM/AutoTokenizer pair.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/merged/ckpt"
)
processor = AutoProcessor.from_pretrained(
    "path/to/merged/ckpt"
)

Training Your Own DPO-Based Model

To train your own model with LoRA and DPO, your dataset should follow the structure below. Each sample must include:

  • conversations: A dialogue that includes an <image> placeholder for multimodal input.
  • images: The local file path(s) to associated image(s).
  • chosen / rejected: The preferred and less preferred responses used as the DPO training pair.

Example:

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "<image>Describe this image in detail. Your answer should be concise and informative."
      }
    ],
    "images": [
      "/path/to/image1.jpg"
    ],
    "chosen": {
      "from": "gpt",
      "value": "The preferred response here..."
    },
    "rejected": {
      "from": "gpt",
      "value": "The less preferred response here..."
    }
  },
  ...
]
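A minimal sketch for assembling such samples programmatically. The field names follow the structure above; the paths, prompt text, and the output file name rico_dpo_data.json are placeholders, and the resulting dataset still needs to be registered in LLaMA-Factory's data/dataset_info.json (see its README) before training:

```python
import json


def make_dpo_sample(image_path, prompt, chosen_caption, rejected_caption):
    # Build one preference sample in the format shown above.
    return {
        "conversations": [
            {"from": "human", "value": "<image>" + prompt},
        ],
        "images": [image_path],
        "chosen": {"from": "gpt", "value": chosen_caption},
        "rejected": {"from": "gpt", "value": rejected_caption},
    }


samples = [
    make_dpo_sample(
        "/path/to/image1.jpg",
        "Describe this image in detail. "
        "Your answer should be concise and informative.",
        "The preferred response here...",
        "The less preferred response here...",
    )
]

# Placeholder output path; point your dataset config at this file.
with open("rico_dpo_data.json", "w") as f:
    json.dump(samples, f, indent=2, ensure_ascii=False)
```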

We provide a reference configuration file, qwen2vl_lora_dpo.yaml, which is tailored for training Qwen2-VL with LoRA and DPO. You can modify the config file to adjust parameters such as dataset paths, model names, training epochs, batch size, and learning rates.

To start multi-GPU training with LoRA and DPO, run the following command:

FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train qwen2vl_lora_dpo.yaml

For additional configuration options, refer to the official LLaMA-Factory README.

☕ Citation

If you find our project helpful to your research, please consider citing our paper:

@misc{wang2025ricoimprovingaccuracycompleteness,
      title={RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction}, 
      author={Yuchi Wang and Yishuo Cai and Shuhuai Ren and Sihan Yang and Linli Yao and Yuanxin Liu and Yuanxing Zhang and Pengfei Wan and Xu Sun},
      year={2025},
      eprint={2505.22613},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.22613}, 
}

For any issues or further discussion, feel free to contact wangyuchi369@gmail.com.
