Hello.
I'm trying to reproduce the results on the leaderboard.
For each model, I ran the following commands, following the README.md.
The scripts were run in a Python 3.10.15 environment created with Anaconda, after `pip install -r requirements.txt` (plus a manual install of torch, which is not pinned in requirements.txt):
```shell
python generate_cov_hf.py --model <model_name> --num_tests 20
python format.py --mode overall --path totalcov_<model_name>.jsonl
python eval_overall.py --path totalcov_<model_name>format.jsonl
```
Here are my results compared with the reported results.
**Branch Coverage**

| Model | Overall (Reported) | Overall (Reproduced) | Cov@1 (Reported) | Cov@1 (Reproduced) | Cov@2 (Reported) | Cov@2 (Reproduced) | Cov@5 (Reported) | Cov@5 (Reproduced) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMa3-8B-Instruct | 89.02 | 77.91 | 69.47 | 59.97 | 73.37 | 64.06 | 79.22 | 69.1 |
| CodeLLaMa-7B-Instruct | 81.56 | 75.78 | 72.28 | 66.77 | 73.96 | 68.43 | 75.9 | 70.61 |
| CodeLLaMa-13B-Instruct | 80.55 | 80.38 | 73.21 | 70.56 | 75.54 | 72.89 | 77.13 | 75.13 |
| CodeLLaMa-34B-Instruct | 83.74 | 83.94 | 71.37 | 71.99 | 74.5 | 74.22 | 77.8 | 78.38 |
| gemma-1.1-7b-it | 91.46 | 90.52 | 67.15 | 66.09 | 72.94 | 72.76 | 90.29 | 79.69 |
| CodeQwen1.5-7B-Chat | 86.9 | 83.68 | 77.66 | 77.27 | 78.94 | 78.29 | 80.95 | 79.38 |
| DeepSeek-Coder-1.3B-instruct | 75.99 | 75.18 | 69.06 | 70.07 | 69.9 | 70.34 | 70.7 | 70.94 |
| DeepSeek-Coder-6.7B-instruct | 91.6 | 93.29 | 75.29 | 75.19 | 78.73 | 79.12 | 83.46 | 84.26 |
**Line Coverage**

| Model | Overall (Reported) | Overall (Reproduced) | Cov@1 (Reported) | Cov@1 (Reproduced) | Cov@2 (Reported) | Cov@2 (Reproduced) | Cov@5 (Reported) | Cov@5 (Reproduced) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMa3-8B-Instruct | 90.98 | 80.34 | 77.4 | 68.66 | 80.08 | 71.35 | 84.42 | 74.78 |
| CodeLLaMa-7B-Instruct | 86.09 | 81.02 | 79.46 | 74.97 | 80.72 | 76.08 | 82.04 | 77.59 |
| CodeLLaMa-13B-Instruct | 85.66 | 86.22 | 80.49 | 79.84 | 82.26 | 81.5 | 83.44 | 82.91 |
| CodeLLaMa-34B-Instruct | 87.96 | 88.92 | 78.83 | 80.86 | 81.25 | 82.48 | 83.71 | 85.23 |
| gemma-1.1-7b-it | 93.16 | 93.05 | 76.23 | 76.64 | 80.54 | 81.43 | 85.9 | 85.99 |
| CodeQwen1.5-7B-Chat | 90.73 | 89.31 | 84.53 | 85.42 | 85.33 | 86.05 | 86.71 | 86.75 |
| DeepSeek-Coder-1.3B-instruct | 81.22 | 81.09 | 75.89 | 77.59 | 76.5 | 77.78 | 77.09 | 78.2 |
| DeepSeek-Coder-6.7B-instruct | 93.48 | 95.27 | 82.4 | 83.92 | 84.74 | 86.48 | 87.97 | 89.84 |
I verified that the generation results are deterministic, so I expected to reproduce exactly the same scores as the leaderboard, but they differ.
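For reference, this is roughly how I checked that two independent generation runs produce identical output (the file names are placeholders, not my actual output paths):

```python
import hashlib

def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Two runs of generate_cov_hf.py with identical arguments;
# equal digests mean the generations are byte-for-byte the same.
# file_sha256("totalcov_run1.jsonl") == file_sha256("totalcov_run2.jsonl")
```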
For some models, such as LLaMa3-8B and CodeLLaMa-7B, the difference is significant.
For the gemma-1.1-7b-it model, the overall branch coverage is similar, but Cov@5 is significantly lower than the reported score.
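To make the gap concrete, here are the Reported minus Reproduced deltas for overall branch coverage, computed from the table above:

```python
# Overall branch coverage (Reported, Reproduced), copied from the table above.
branch_overall = {
    "LLaMa3-8B-Instruct": (89.02, 77.91),
    "CodeLLaMa-7B-Instruct": (81.56, 75.78),
    "CodeLLaMa-13B-Instruct": (80.55, 80.38),
    "CodeLLaMa-34B-Instruct": (83.74, 83.94),
    "gemma-1.1-7b-it": (91.46, 90.52),
    "CodeQwen1.5-7B-Chat": (86.90, 83.68),
    "DeepSeek-Coder-1.3B-instruct": (75.99, 75.18),
    "DeepSeek-Coder-6.7B-instruct": (91.60, 93.29),
}

# Positive delta = reported score is higher than my reproduction.
deltas = {m: round(rep - repro, 2) for m, (rep, repro) in branch_overall.items()}
# LLaMa3-8B is off by about 11 points and CodeLLaMa-7B by about 6,
# while most other models differ by less than 1 point.
```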
Could you provide the exact commands, arguments, and library versions needed to reproduce the results on the leaderboard?
Thank you very much for your time and assistance.