Describe the bug
When using the InternVL3 processor with multiple videos (e.g., two videos in one prompt), there appears to be an off-by-one bug in processing_internvl.py, starting around line 131: the second video's frame count comes out one frame short of the requested num_frames, which triggers the following error downstream:
raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 4096, features 3840
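The numbers in the error are consistent with the off-by-one hypothesis. A minimal arithmetic sketch, assuming 256 image tokens per frame (this per-frame count is inferred from the error values, not confirmed from the processor source) and the `num_frames=8` used in the reproduction below:

```python
# Hypothetical arithmetic illustrating the suspected off-by-one.
# TOKENS_PER_FRAME = 256 is an assumption inferred from the error message.
TOKENS_PER_FRAME = 256
NUM_FRAMES = 8  # num_frames passed to apply_chat_template

# Text side: placeholder tokens expanded for two videos of 8 frames each.
expected_tokens = 2 * NUM_FRAMES * TOKENS_PER_FRAME

# Vision side: features as if the second video contributed only 7 frames.
actual_features = (NUM_FRAMES + (NUM_FRAMES - 1)) * TOKENS_PER_FRAME

print(expected_tokens, actual_features)  # 4096 3840
```

The 256-token gap is exactly one frame's worth of features, matching the "second video is one frame short" reading of the error.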
Reproduction
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
import os
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
system_prompt = """
You are a helpful multimodal assistant tasked with evaluating the quality of videos generated for a given input caption. Your goal is to determine which video better matches the caption. Choose either Video A or Video B as the better option.
Do NOT say both / neither are good.
Here are some rules for the evaluation:
(1) If the caption does not contain harmful content, prioritize evaluating whether the video faithfully and precisely reflects the caption, then consider its helpfulness, accuracy, level of detail, harmlessness, realism, motion consistency, and overall visual quality.
(2) If the caption contains harmful content, prioritize the harmlessness and safety of the video.
(3) The video should NOT include elements that are irrelevant to or missing from the caption, as such outputs do NOT precisely execute the instruction.
(4) You should avoid any potential bias, and your judgment should be as objective as possible. Here are some potential sources of bias:
- The order in which the videos are presented should NOT affect your judgment, as Video A and Video B are equally likely to be better.
- The rendering style (e.g., realistic, cartoonish, cinematic) should NOT affect your judgment unless explicitly specified in the caption.
- Do not assume that a more visually complex video is necessarily better; evaluate whether the complexity and motion quality are appropriate for the given caption.
Your reply should strictly follow this format:
Feedback:
Comparison:
Conclusion:
A or B
Here is the data.
"""
video_a_path = "https://huggingface.co/datasets/tianleliphoebe/genai-arena-video-mp4/blob/main/aadd660eca2b4d4788195c729920121f.mp4"
video_b_path = "https://huggingface.co/datasets/tianleliphoebe/genai-arena-video-mp4/blob/main/86aa85eec53c4b1293ef632e6183fe93.mp4"
user_input = "A polar bear is playing guitar\n"
messages = [
{
"role": "system",
"content": [
{"type": "text", "text": system_prompt}
]
},
{
"role": "user",
"content": [
{"type": "text", "text": "[User Input]\n"},
{"type": "text", "text": user_input},
{"type": "text", "text": "[The Start of Video A]\n"},
{"type": "video", "url": video_a_path.replace(
"/blob/", "/resolve/")},
{"type": "text", "text": "[The End of Video A]\n"},
{"type": "text", "text": "[The Start of Video B]\n"},
{"type": "video", "url": video_b_path.replace(
"/blob/", "/resolve/")},
{"type": "text", "text": "[The End of Video B]\n"},
],
},
]
inputs = processor.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
num_frames=8
).to(model.device, dtype=torch.bfloat16)
generate_ids = model.generate(**inputs, max_new_tokens=1000)
decoded_output = processor.decode(
generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
Environment
transformers == 4.56.1
PyTorch == 2.8.0+cu126
Error traceback