Conversation
|
Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar with the correct choice placed at 'A'. Thanks for pointing out the bug in the HRBench evaluation. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot. |
Thanks for your prompt reply, but I'm afraid that's not the case. The official setting (https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py) adopts a likelihood-based evaluation method: it computes the likelihood of each option (P(option | question)) and chooses the option with the largest likelihood. The answer is therefore invariant to the position of the option in the prompt (i.e., independent of the labels "A", "B", "C", "D"). For example, Then they would consider that the model picks the option |
One thing I find very interesting is that the variance with respect to the relative option position in Vstar is huge! I tried it myself and got the same results. The temperature is 0 during evaluation, so I don't know why this happens. |
Thank you for open-sourcing your work and making it easy to reproduce results.
This PR mainly fixes some bugs in the code and inappropriate evaluation settings that may cause overestimated results:
(1) In the V Star Benchmark evaluation, the options should be shuffled (as in the annotation file provided by the official repo), instead of always placing the correct choice at 'A'. Here's the difference:
The variance might be attributed to a small sample size (only 191 samples in total for this bench) and the model itself.
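As a rough illustration of point (1), the sketch below builds a multiple-choice prompt with the options in a seeded random order, so the correct choice no longer always lands on 'A'. The function name `build_shuffled_prompt`, the prompt layout, and the sample question are assumptions made for illustration only; the PR itself relies on the shuffled order already present in the official annotation file.

```python
import random
import string


def build_shuffled_prompt(question, options, correct_index, rng):
    """Return (prompt, correct_letter) with the options in shuffled order.

    Hypothetical helper: `options` is a list of option texts and
    `correct_index` is the index of the correct one in that list.
    """
    order = list(range(len(options)))
    rng.shuffle(order)  # a seeded rng keeps the order reproducible

    letters = string.ascii_uppercase
    lines = [question]
    correct_letter = None
    for position, original_index in enumerate(order):
        lines.append(f"({letters[position]}) {options[original_index]}")
        if original_index == correct_index:
            correct_letter = letters[position]
    return "\n".join(lines), correct_letter


# Hypothetical sample; the correct option is "red" (index 0).
prompt, answer = build_shuffled_prompt(
    "What is the color of the car?",
    ["red", "blue", "green", "yellow"],
    correct_index=0,
    rng=random.Random(0),
)
print(prompt)
print("correct letter:", answer)
```

Seeding the generator per sample (for example with the sample index) keeps the evaluation reproducible while still spreading the correct answer across positions.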
(2) In the HRBench evaluation, the rule-based check should test whether the option string is in the prediction result, instead of the option letter alone (in DeepEyes/eval/judge_result_hrbench.py). Here's an example of a False Positive:
The model predicts a wrong result ("A. Back (trunk)") that accidentally contains the correct option letter ('B', in the word "Back"), so it is wrongly counted as correct.
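The false positive can be reproduced with a few lines. This is a minimal sketch, not the code in DeepEyes/eval/judge_result_hrbench.py: the function names are hypothetical, and the text of the correct option ("Front (hood)") is invented for illustration, since only the wrong prediction is given here.

```python
def naive_is_correct(prediction: str, answer_letter: str) -> bool:
    # Substring test on the bare letter: any 'B' in the text matches,
    # including the one in "Back".
    return answer_letter in prediction


def option_string_is_correct(prediction: str, answer_letter: str, answer_text: str) -> bool:
    # Test for the full option string instead, e.g. "B. Front (hood)".
    return f"{answer_letter}. {answer_text}" in prediction


prediction = "A. Back (trunk)"  # wrong prediction from the example
print(naive_is_correct(prediction, "B"))                          # True (false positive)
print(option_string_is_correct(prediction, "B", "Front (hood)"))  # False
```

Matching on the full option string is stricter than matching on the letter, so a prediction that states only the letter would need separate handling.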
Here's the difference made by the fix: