Thanks for your contribution to the community!
We're attempting to reproduce the results reported in your paper, but we've run into some problems along the way. We hope you can help us align the results.
(1) Possible bugs in the provided code that may lead to overestimated results. We used the evaluation code provided in this repo for the two high-resolution MCQ benchmarks (V* and HRBench) and found some bugs. For more details, please refer to #66.
(2) What is the prompt for the VQA benchmarks (more specifically, the math-reasoning benchmarks)? We used the same prompt as for MCQ, i.e., we append the customized user prompt to the question prompt, like this:
DEEP_EYES_USER_PROMPT = "\nThink first, call **image_zoom_in_tool** if needed, then answer. Format strictly as: <think>...</think> <tool_call>...</tool_call> (if tools needed) <answer>...</answer> "
prompt = {question_prompt} + "\n" + DEEP_EYES_USER_PROMPT
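To make our setup unambiguous, here is a minimal sketch of the prompt assembly described above. The template string is copied verbatim from this issue; the function name and the example question are ours, not from the repo:

```python
# Template quoted above, copied verbatim.
DEEP_EYES_USER_PROMPT = (
    "\nThink first, call **image_zoom_in_tool** if needed, then answer. "
    "Format strictly as: <think>...</think> <tool_call>...</tool_call> "
    "(if tools needed) <answer>...</answer> "
)

def build_prompt(question_prompt: str) -> str:
    # Same concatenation we used for the MCQ benchmarks:
    # question first, then the customized user prompt.
    return question_prompt + "\n" + DEEP_EYES_USER_PROMPT
```

If the math-reasoning benchmarks are supposed to use a different template or concatenation order, please let us know.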
We use VLMEvalKit to evaluate these benchmarks, and the results do not match the reported ones. Some results (MathVista, MathVerse, DynaMath, and LogicVista) are noticeably lower, while others are higher, as follows:
| Benchmark | MathVista_MINI | MathVerse_MINI | MathVision_MINI | WeMath | DynaMath | LogicVista |
|---|---|---|---|---|---|---|
| Paper report | 70.1 | 47.3 | 26.6 | 38.9 | 55.0 | 47.7 |
| Reproduction | 68.4 | 46.0 | 29.3 | 39.7 | 51.4 | 45.2 |
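For quick reference, the per-benchmark gaps (reproduction minus paper) implied by the table can be computed like this; the numbers are taken directly from the table above:

```python
# Scores copied from the comparison table above.
paper = {"MathVista_MINI": 70.1, "MathVerse_MINI": 47.3,
         "MathVision_MINI": 26.6, "WeMath": 38.9,
         "DynaMath": 55.0, "LogicVista": 47.7}
repro = {"MathVista_MINI": 68.4, "MathVerse_MINI": 46.0,
         "MathVision_MINI": 29.3, "WeMath": 39.7,
         "DynaMath": 51.4, "LogicVista": 45.2}

# Positive delta = reproduction above paper, negative = below.
deltas = {k: round(repro[k] - paper[k], 1) for k in paper}
```

Four benchmarks come out lower (DynaMath worst at -3.6) and two higher (MathVision_MINI at +2.7), so the mismatch is not a uniform offset.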
Is this expected? Could you share the evaluation scripts or clarify more details about the evaluation process for these benchmarks?