Dear authors,
Thanks for your amazing work in this field.
I am trying to evaluate the connectivity task using Llama2-7b. The result I get is about 40% to 50%, which is far lower than what is shown in Figure 4 of your paper. The version I am using is Llama2-7b-chat, with temperature = 0 and top_p = 0.7.
I am wondering whether we are using the same parameters, or whether you also fine-tuned Llama as described in Section 5.1?
Thank you!
DM