Thanks for your contribution to the community!
We're attempting to reproduce the results reported in your paper, but we've run into some problems along the way. We hope you can help us align the results.
(1) Possible bugs in the provided code that may lead to overestimated results. We used the evaluation code provided in this repo for the two high-resolution MCQ benchmarks (V* and HRBench) and found some bugs. For more details, please refer to #66.
(2) What is the prompt for the VQA benchmarks (more specifically, the math-reasoning benchmarks)? We used the same prompt as for MCQ, i.e., we append the customized user prompt to the question prompt, like this:
DEEP_EYES_USER_PROMPT = "\nThink first, call **image_zoom_in_tool** if needed, then answer. Format strictly as: <think>...</think> <tool_call>...</tool_call> (if tools needed) <answer>...</answer> "
prompt = {question_prompt} + "\n" + DEEP_EYES_USER_PROMPT
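To make our setup unambiguous, here is a minimal sketch of the prompt assembly described above. The template string is copied verbatim from this issue; the function name and the example question are ours, not from the repo:

```python
# Template quoted above, copied verbatim.
DEEP_EYES_USER_PROMPT = (
    "\nThink first, call **image_zoom_in_tool** if needed, then answer. "
    "Format strictly as: <think>...</think> <tool_call>...</tool_call> "
    "(if tools needed) <answer>...</answer> "
)

def build_prompt(question_prompt: str) -> str:
    # Same concatenation we used for the MCQ benchmarks:
    # question first, then the customized user prompt.
    return question_prompt + "\n" + DEEP_EYES_USER_PROMPT
```

If the math-reasoning benchmarks are supposed to use a different template or concatenation order, please let us know.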
We use VLMEvalKit to evaluate these benchmarks, and the results do not match the reported ones. Some results (MathVista, MathVerse, DynaMath, and LogicVista) are noticeably lower, while others are higher, as follows:
| Benchmark | MathVista_MINI | MathVerse_MINI | MathVision_MINI | WeMath | DynaMath | LogicVista |
|---|---|---|---|---|---|---|
| Paper report | 70.1 | 47.3 | 26.6 | 38.9 | 55.0 | 47.7 |
| Reproduction | 68.4 | 46.0 | 29.3 | 39.7 | 51.4 | 45.2 |
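For quick reference, the per-benchmark gaps (reproduction minus paper) implied by the table can be computed like this; the numbers are taken directly from the table above:

```python
# Scores copied from the comparison table above.
paper = {"MathVista_MINI": 70.1, "MathVerse_MINI": 47.3,
         "MathVision_MINI": 26.6, "WeMath": 38.9,
         "DynaMath": 55.0, "LogicVista": 47.7}
repro = {"MathVista_MINI": 68.4, "MathVerse_MINI": 46.0,
         "MathVision_MINI": 29.3, "WeMath": 39.7,
         "DynaMath": 51.4, "LogicVista": 45.2}

# Positive delta = reproduction above paper, negative = below.
deltas = {k: round(repro[k] - paper[k], 1) for k in paper}
```

Four benchmarks come out lower (DynaMath worst at -3.6) and two higher (MathVision_MINI at +2.7), so the mismatch is not a uniform offset.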
Is this expected? Could you share the evaluation scripts or clarify more details about the evaluation process for these benchmarks?