Hello.
I'm trying to reproduce the results on the leaderboard.
For each model, I ran the following commands, following the README.md.
The scripts were run in a Python 3.10.15 environment created with Anaconda, after `pip install -r requirements.txt` (plus a manual install of torch, which is not pinned in requirements.txt):
```shell
python generate_cov_hf.py --model <model_name> --num_tests 20
python format.py --mode overall --path totalcov_<model_name>.jsonl
python eval_overall.py --path totalcov_<model_name>format.jsonl
```
Here are my results compared with the reported results.
**Branch Coverage**

| Model | Overall (Reported) | Overall (Reproduced) | Cov@1 (Reported) | Cov@1 (Reproduced) | Cov@2 (Reported) | Cov@2 (Reproduced) | Cov@5 (Reported) | Cov@5 (Reproduced) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMa3-8B-Instruct | 89.02 | 77.91 | 69.47 | 59.97 | 73.37 | 64.06 | 79.22 | 69.1 |
| CodeLLaMa-7B-Instruct | 81.56 | 75.78 | 72.28 | 66.77 | 73.96 | 68.43 | 75.9 | 70.61 |
| CodeLLaMa-13B-Instruct | 80.55 | 80.38 | 73.21 | 70.56 | 75.54 | 72.89 | 77.13 | 75.13 |
| CodeLLaMa-34B-Instruct | 83.74 | 83.94 | 71.37 | 71.99 | 74.5 | 74.22 | 77.8 | 78.38 |
| gemma-1.1-7b-it | 91.46 | 90.52 | 67.15 | 66.09 | 72.94 | 72.76 | 90.29 | 79.69 |
| CodeQwen1.5-7B-Chat | 86.9 | 83.68 | 77.66 | 77.27 | 78.94 | 78.29 | 80.95 | 79.38 |
| DeepSeek-Coder-1.3B-instruct | 75.99 | 75.18 | 69.06 | 70.07 | 69.9 | 70.34 | 70.7 | 70.94 |
| DeepSeek-Coder-6.7B-instruct | 91.6 | 93.29 | 75.29 | 75.19 | 78.73 | 79.12 | 83.46 | 84.26 |
**Line Coverage**

| Model | Overall (Reported) | Overall (Reproduced) | Cov@1 (Reported) | Cov@1 (Reproduced) | Cov@2 (Reported) | Cov@2 (Reproduced) | Cov@5 (Reported) | Cov@5 (Reproduced) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMa3-8B-Instruct | 90.98 | 80.34 | 77.4 | 68.66 | 80.08 | 71.35 | 84.42 | 74.78 |
| CodeLLaMa-7B-Instruct | 86.09 | 81.02 | 79.46 | 74.97 | 80.72 | 76.08 | 82.04 | 77.59 |
| CodeLLaMa-13B-Instruct | 85.66 | 86.22 | 80.49 | 79.84 | 82.26 | 81.5 | 83.44 | 82.91 |
| CodeLLaMa-34B-Instruct | 87.96 | 88.92 | 78.83 | 80.86 | 81.25 | 82.48 | 83.71 | 85.23 |
| gemma-1.1-7b-it | 93.16 | 93.05 | 76.23 | 76.64 | 80.54 | 81.43 | 85.9 | 85.99 |
| CodeQwen1.5-7B-Chat | 90.73 | 89.31 | 84.53 | 85.42 | 85.33 | 86.05 | 86.71 | 86.75 |
| DeepSeek-Coder-1.3B-instruct | 81.22 | 81.09 | 75.89 | 77.59 | 76.5 | 77.78 | 77.09 | 78.2 |
| DeepSeek-Coder-6.7B-instruct | 93.48 | 95.27 | 82.4 | 83.92 | 84.74 | 86.48 | 87.97 | 89.84 |
I verified that the generation results are deterministic, so I expected to reproduce exactly the same scores as the leaderboard, but they differ.
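For reference, this is roughly how I checked that two independent generation runs produce identical output (the file names are placeholders, not my actual output paths):

```python
import hashlib

def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Two runs of generate_cov_hf.py with identical arguments;
# equal digests mean the generations are byte-for-byte the same.
# file_sha256("totalcov_run1.jsonl") == file_sha256("totalcov_run2.jsonl")
```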
For some models, such as LLaMa3-8B and CodeLLaMa-7B, the difference is significant.
For the gemma-1.1-7b-it model, the overall branch coverage is similar, but Cov@5 is significantly lower than the reported score.
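To make the gap concrete, here are the Reported minus Reproduced deltas for overall branch coverage, computed from the table above:

```python
# Overall branch coverage (Reported, Reproduced), copied from the table above.
branch_overall = {
    "LLaMa3-8B-Instruct": (89.02, 77.91),
    "CodeLLaMa-7B-Instruct": (81.56, 75.78),
    "CodeLLaMa-13B-Instruct": (80.55, 80.38),
    "CodeLLaMa-34B-Instruct": (83.74, 83.94),
    "gemma-1.1-7b-it": (91.46, 90.52),
    "CodeQwen1.5-7B-Chat": (86.90, 83.68),
    "DeepSeek-Coder-1.3B-instruct": (75.99, 75.18),
    "DeepSeek-Coder-6.7B-instruct": (91.60, 93.29),
}

# Positive delta = reported score is higher than my reproduction.
deltas = {m: round(rep - repro, 2) for m, (rep, repro) in branch_overall.items()}
# LLaMa3-8B is off by about 11 points and CodeLLaMa-7B by about 6,
# while most other models differ by less than 1 point.
```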
Could you provide the exact commands, arguments, and library versions needed to reproduce the results on the leaderboard?
Thank you very much for your time and assistance.