
Fix bug in evaluation code #66

Open
xjtupanda wants to merge 1 commit into Visual-Agent:main from xjtupanda:refine

Conversation

@xjtupanda

Thank you for open-sourcing your work and making it easy to reproduce results.

This PR mainly fixes some bugs in the code and inappropriate evaluation settings that may cause overestimated results:

(1) In V Star Benchmark evaluation, the options should be shuffled (as in the annotation file provided by the official repo), instead of always putting the correct choice on 'A'. Here's the difference:

No shuffle:
Run 1:
"direct_attributes": 91.30434782608695,
"relative_position": 85.52631578947368,
"overall": 89.00523560209425

Run 2:
"direct_attributes": 92.17391304347827,
"relative_position": 90.78947368421053,
"overall": 91.62303664921467

Shuffle options:
Run 1:
"direct_attributes": 86.08695652173914,
"relative_position": 81.57894736842105,
"overall": 84.29319371727748

Run 2:
"direct_attributes": 86.08695652173914,
"relative_position": 85.52631578947368,
"overall": 85.86387434554975

The variance might be attributed to a small sample size (only 191 samples in total for this bench) and the model itself.
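To make the shuffling concrete, here is a minimal sketch of building a shuffled MCQ prompt, in the spirit of the official annotation file; the function name and prompt format are illustrative, not the benchmark's actual code:

```python
import random

def build_mcq(question, options, correct_idx, rng=None):
    """Build an MCQ prompt with shuffled options.

    Returns the prompt text and the letter ('A'/'B'/...) that the
    correct option lands on after shuffling, so the answer key is
    no longer always 'A'. Illustrative sketch only.
    """
    rng = rng or random.Random()
    order = list(range(len(options)))
    rng.shuffle(order)
    lines = [question]
    answer_letter = None
    for letter, idx in zip("ABCD", order):
        lines.append(f"{letter}. {options[idx]}")
        if idx == correct_idx:
            answer_letter = letter
    return "\n".join(lines), answer_letter
```

With shuffling, the answer key must be recomputed per sample instead of hard-coding 'A'.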

(2) In HRBench evaluation, the rule-based check should test whether the option string appears in the prediction, rather than the option letter itself (in DeepEyes/eval/judge_result_hrbench.py):

  # elif answer in pred_ans:      # buggy: matches the bare option letter
  elif answer_str in pred_ans:    # fixed: matches the full option text
      acc_reward = 1.0

Here's an example of a False Positive:

hr_bench_8k
No.30:
"question": "Which side of the car is the person sitting on?", 
"answer": "B", 
"answer_str": "Front (hood)", 
"pred_ans": "A. Back (trunk)"

The model predicts a wrong answer ("A. Back (trunk)") that happens to contain the correct option letter ('B'), so it is falsely counted as correct.
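A minimal sketch of the fixed check (names are illustrative; the actual logic lives in DeepEyes/eval/judge_result_hrbench.py):

```python
def is_correct(pred_ans, answer_str):
    """Rule-based check: match the full option string, not the letter.

    Checking only the bare letter ('B') can fire on unrelated text:
    "B" in "A. Back (trunk)" is True because "Back" contains a 'B'.
    Matching the option text avoids that false positive. Sketch only.
    """
    return answer_str.lower() in pred_ans.lower()

# The buggy letter check accepts the wrong prediction:
#   "B" in "A. Back (trunk)"  -> True (false positive)
# The string check rejects it:
#   is_correct("A. Back (trunk)", "Front (hood)")  -> False
```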

Here's the difference made by the fix:

Before fixing:
"hr_bench_4k": {
    "single": 0.915,
    "cross": 0.585,
    "overall": 0.75
},
"hr_bench_8k": {
    "single": 0.8475,
    "cross": 0.565,
    "overall": 0.70625
}

After fixing:
"hr_bench_4k": {
    "single": 0.91,
    "cross": 0.565,
    "overall": 0.7375
},
"hr_bench_8k": {
    "single": 0.8475,
    "cross": 0.54,
    "overall": 0.6937500000000001
}

@JaaackHongggg
Contributor

Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar by putting the correct choice on 'A'.

Thanks for pointing out the bug in the HRBench evaluation. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot.

@xjtupanda
Author

> Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar by putting the correct choice on 'A'.
>
> Thanks for pointing out the bug in evaluating HRBench. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot.

Thanks for your prompt reply. But I'm afraid it's not the case. The official setting (https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py) adopts a likelihood-based evaluation method, which evaluates the likelihood of each option (P(option | question)) and chooses the one with the largest likelihood. Thus, the answer is invariant to the position of the option in the prompt (i.e., irrelevant to "A", "B", "C", "D"). For example,

"question": "Is the flag red or white?",
"options": [
"The color of the flag is white.",
"The color of the flag is red."
]

If they evaluate the likelihood and find that:
P("The color of the flag is white." | "Is the flag red or white?") > P("The color of the flag is red." |"Is the flag red or white?")

Then they would consider that the model picks the option "The color of the flag is white.". This is different from putting the choice in the prompt and asking the model to pick one. Again, the final annotation file provided by V* bench curators casts the prompt into a common MCQ format and shuffles the options.
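A minimal sketch of what likelihood-based selection looks like, assuming the per-option log-probabilities P(option | question) have already been computed by the model (the scores below are hypothetical):

```python
def pick_by_likelihood(option_logprobs):
    """Choose the option with the highest P(option | question).

    No option letters appear in any prompt, so the result is
    invariant to option ordering. In practice each score would be
    the sum of per-token log-probs under the model; the numbers
    here are hypothetical. Illustrative sketch only.
    """
    return max(option_logprobs, key=option_logprobs.get)

scores = {
    "The color of the flag is white.": -4.2,  # hypothetical log-prob
    "The color of the flag is red.": -6.9,    # hypothetical log-prob
}
# pick_by_likelihood(scores) -> "The color of the flag is white."
```

Because the ranking depends only on the scores, reordering the options cannot change the picked answer, unlike a letter-based MCQ prompt.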

@dddraxxx

dddraxxx commented Aug 20, 2025

> This PR mainly fixes some bugs in the code and inappropriate evaluation settings that may cause overestimated results:
>
> (1) In V Star Benchmark evaluation, the options should be shuffled (as in the annotation file provided by the official repo), instead of always putting the correct choice on 'A'. [...]
>
> The variance might be attributed to a small sample size (only 191 samples in total for this bench) and the model itself.

What I find very interesting is that the variance of the relative_position score on V* is huge! I tried it myself and got the same results. The temperature is 0 in the eval, so I don't know why this happens.
