
Problems in reproducing the results in the leaderboard #11

@yeomjy

Description


Hello.
I'm trying to reproduce the results in the leaderboard.

For each model, I run the following scripts, following the README.md.
They are run in a Python 3.10.15 environment created with Anaconda, with dependencies installed via `pip install -r requirements.txt` (plus torch, which is not listed in requirements.txt).

```shell
python generate_cov_hf.py --model <model_name> --num_tests 20
python format.py --mode overall --path totalcov_<model_name>.jsonl
python eval_overall.py --path totalcov_<model_name>format.jsonl
```
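For completeness, this is roughly how the environment was set up (a sketch; `repro` is just a placeholder environment name, and the torch version is whatever pip resolved at install time, since it is not pinned anywhere):

```shell
# Create and activate a fresh conda environment with Python 3.10.15
conda create -n repro python=3.10.15 -y
conda activate repro

# Install the listed dependencies, then torch (not in requirements.txt)
pip install -r requirements.txt
pip install torch

# Record the resolved versions for later comparison
pip freeze > environment.txt
```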

Here are my results compared with the reported ones.

| Branch Coverage | Overall (Reported) | Overall (Reproduced) | Cov@1 (Reported) | Cov@1 (Reproduced) | Cov@2 (Reported) | Cov@2 (Reproduced) | Cov@5 (Reported) | Cov@5 (Reproduced) |
|---|---|---|---|---|---|---|---|---|
| LLaMa3-8B-Instruct | 89.02 | 77.91 | 69.47 | 59.97 | 73.37 | 64.06 | 79.22 | 69.1 |
| CodeLLaMa-7B-Instruct | 81.56 | 75.78 | 72.28 | 66.77 | 73.96 | 68.43 | 75.9 | 70.61 |
| CodeLLaMa-13B-Instruct | 80.55 | 80.38 | 73.21 | 70.56 | 75.54 | 72.89 | 77.13 | 75.13 |
| CodeLLaMa-34B-Instruct | 83.74 | 83.94 | 71.37 | 71.99 | 74.5 | 74.22 | 77.8 | 78.38 |
| gemma-1.1-7b-it | 91.46 | 90.52 | 67.15 | 66.09 | 72.94 | 72.76 | 90.29 | 79.69 |
| CodeQwen1.5-7B-Chat | 86.9 | 83.68 | 77.66 | 77.27 | 78.94 | 78.29 | 80.95 | 79.38 |
| DeepSeek-Coder-1.3B-instruct | 75.99 | 75.18 | 69.06 | 70.07 | 69.9 | 70.34 | 70.7 | 70.94 |
| DeepSeek-Coder-6.7B-instruct | 91.6 | 93.29 | 75.29 | 75.19 | 78.73 | 79.12 | 83.46 | 84.26 |
| Line Coverage | Overall (Reported) | Overall (Reproduced) | Cov@1 (Reported) | Cov@1 (Reproduced) | Cov@2 (Reported) | Cov@2 (Reproduced) | Cov@5 (Reported) | Cov@5 (Reproduced) |
|---|---|---|---|---|---|---|---|---|
| LLaMa3-8B-Instruct | 90.98 | 80.34 | 77.4 | 68.66 | 80.08 | 71.35 | 84.42 | 74.78 |
| CodeLLaMa-7B-Instruct | 86.09 | 81.02 | 79.46 | 74.97 | 80.72 | 76.08 | 82.04 | 77.59 |
| CodeLLaMa-13B-Instruct | 85.66 | 86.22 | 80.49 | 79.84 | 82.26 | 81.5 | 83.44 | 82.91 |
| CodeLLaMa-34B-Instruct | 87.96 | 88.92 | 78.83 | 80.86 | 81.25 | 82.48 | 83.71 | 85.23 |
| gemma-1.1-7b-it | 93.16 | 93.05 | 76.23 | 76.64 | 80.54 | 81.43 | 85.9 | 85.99 |
| CodeQwen1.5-7B-Chat | 90.73 | 89.31 | 84.53 | 85.42 | 85.33 | 86.05 | 86.71 | 86.75 |
| DeepSeek-Coder-1.3B-instruct | 81.22 | 81.09 | 75.89 | 77.59 | 76.5 | 77.78 | 77.09 | 78.2 |
| DeepSeek-Coder-6.7B-instruct | 93.48 | 95.27 | 82.4 | 83.92 | 84.74 | 86.48 | 87.97 | 89.84 |

I checked that the generation results are deterministic, so I would expect to reproduce exactly the same scores as reported in the leaderboard, but they differ.
For some models, such as LLaMa3-8B and CodeLLaMa-7B, the difference is significant.
For gemma-1.1-7b-it, the overall branch coverage is similar, but Cov@5 is significantly lower than the reported score.
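To make the gaps concrete, here is a quick check over the overall branch-coverage column (values copied from the table above):

```python
# Reported vs. reproduced overall branch coverage, copied from the table above.
overall_branch = {
    "LLaMa3-8B-Instruct": (89.02, 77.91),
    "CodeLLaMa-7B-Instruct": (81.56, 75.78),
    "CodeLLaMa-13B-Instruct": (80.55, 80.38),
    "CodeLLaMa-34B-Instruct": (83.74, 83.94),
    "gemma-1.1-7b-it": (91.46, 90.52),
    "CodeQwen1.5-7B-Chat": (86.90, 83.68),
    "DeepSeek-Coder-1.3B-instruct": (75.99, 75.18),
    "DeepSeek-Coder-6.7B-instruct": (91.60, 93.29),
}

# Difference (reported - reproduced) per model, largest absolute gap first.
gaps = sorted(
    ((name, round(reported - reproduced, 2))
     for name, (reported, reproduced) in overall_branch.items()),
    key=lambda item: -abs(item[1]),
)
for name, gap in gaps:
    print(f"{name}: {gap:+.2f}")
```

The two largest gaps are LLaMa3-8B-Instruct (+11.11) and CodeLLaMa-7B-Instruct (+5.78), far beyond rounding noise, while several other models agree to within about one point.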

Could you provide the exact commands, arguments, and library versions needed to reproduce the results in the leaderboard?

Thank you very much for your time and assistance.
