Questions about reproduction of reported results #67

@xjtupanda

Thanks for your contribution to the community!
We're attempting to reproduce the results reported in your paper, and we've run into some problems in the process. We hope you can help us align the results.

(1) Possible bugs in the provided code that may lead to overestimated results. We used the evaluation code provided in this repo for the two high-resolution MCQ benchmarks (V* and HRBench) and found some bugs. For more details, please refer to #66.

(2) What is the prompt for the VQA benchmarks (more specifically, the math reasoning benchmarks)? We used the same prompt as for MCQ, i.e., we appended the customized user prompt to the question prompt, like this:

```python
DEEP_EYES_USER_PROMPT = "\nThink first, call **image_zoom_in_tool** if needed, then answer. Format strictly as:  <think>...</think>  <tool_call>...</tool_call> (if tools needed)  <answer>...</answer> "

prompt = question_prompt + "\n" + DEEP_EYES_USER_PROMPT
```

We used VLMEvalKit to evaluate these benchmarks and found that the results do not match the reported ones. Some results (MathVista, MathVerse, DynaMath, and LogicVista) are noticeably lower, while others (MathVision, WeMath) are higher, as follows:

| Benchmark | MathVista_MINI | MathVerse_MINI | MathVision_MINI | WeMath | DynaMath | LogicVista |
| --- | --- | --- | --- | --- | --- | --- |
| Paper report | 70.1 | 47.3 | 26.6 | 38.9 | 55.0 | 47.7 |
| Reproduction | 68.4 | 46.0 | 29.3 | 39.7 | 51.4 | 45.2 |
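Concretely, the per-benchmark gaps (reproduction minus paper) run in both directions; a quick sketch to tabulate them, with the benchmark names and scores copied from the table above:

```python
paper = {"MathVista_MINI": 70.1, "MathVerse_MINI": 47.3, "MathVision_MINI": 26.6,
         "WeMath": 38.9, "DynaMath": 55.0, "LogicVista": 47.7}
repro = {"MathVista_MINI": 68.4, "MathVerse_MINI": 46.0, "MathVision_MINI": 29.3,
         "WeMath": 39.7, "DynaMath": 51.4, "LogicVista": 45.2}

# Signed gap per benchmark: negative means the reproduction underperforms the paper.
delta = {k: round(repro[k] - paper[k], 1) for k in paper}
print(delta)
```

The largest gap is on DynaMath (-3.6 points), while MathVision is actually 2.7 points higher than reported, which makes a single systematic cause (e.g., a prompt mismatch) harder to pin down.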

Is this expected? Could you provide the eval scripts or share more details about the evaluation process for these benchmarks?
