[EVAL] Long Horizon Execution #1072
akshathmangudi wants to merge 8 commits into huggingface:main from akshathmangudi:akshath/issue-1056
Conversation
Tagging #1069 for better readability.
Great!! I pulled and tested it, modified a few things (mainly just to have multiple prompt lengths), and pushed a log dir to check the prompt and results. @akshathmangudi are you ok with me pushing them? https://huggingface.co/spaces/SaylorTwift/long_horizon_execution cc: @shash42 does it look good to you? :)
Yes, of course. Is this something that I will have to keep note of in future PRs as well when I integrate benchmarks?
And no, your PR was great! It's just that it makes sense for this eval to have multiple prompt lengths.
Hi! @akshathmangudi @NathanHB, I was just taking a look at the implementation here, and while the implementation itself looks correct, I will put down some notes to maybe keep in mind:
edit (addition): the multi-turn evaluation should be much simpler to implement, as the only differences arise in the loop-calling of the LLM. We implemented an online evaluation, i.e., we updated our metrics after each turn (because the evaluations took a while, and we did not want to wait for all turns to be done!), but it is easy to have an offline evaluation similar (actually, identical) to the single-turn evaluation. Thank you for your efforts! I will be happy to discuss more if needed!
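A minimal sketch of the online-vs-offline distinction described above; all function names here (`run_turn`, `evaluate_online`, `evaluate_offline`) are hypothetical and illustrative, not from the lighteval codebase:

```python
# Illustrative sketch of online vs. offline multi-turn evaluation.
# Names are hypothetical; a real run_turn would call the LLM.

def run_turn(history, turn_input):
    """Stand-in for one LLM call; echoes the input as the 'answer'."""
    history.append(turn_input)
    return turn_input  # a real model would generate a response here

def evaluate_online(turns, score_fn):
    """Update metrics after every turn, so partial results are
    available without waiting for all turns to finish."""
    history, scores = [], []
    for turn_input, expected in turns:
        answer = run_turn(history, turn_input)
        scores.append(score_fn(answer, expected))  # metric updated per turn
    return scores

def evaluate_offline(turns, score_fn):
    """Collect all answers first, then score at the end --
    structurally identical to a single-turn evaluation."""
    history = []
    answers = [run_turn(history, t) for t, _ in turns]
    return [score_fn(a, e) for a, (_, e) in zip(answers, turns)]
```

Both paths produce the same final scores; the online variant just makes them visible turn by turn.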
Ahh, I see. My initial thought was to have a hardcoded MAX_ITEMS value, since the prompt might be too long to construct if we add all 50,000 examples, and to truncate our input, output and value keys. @viciousAegis thank you for the feedback! I will make the necessary fixes and you can let me know what you think!
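A rough sketch of the MAX_ITEMS idea mentioned above, assuming the dataset rows carry `input`, `output`, and `value` string fields; the helper name and both caps are illustrative, not from the PR:

```python
# Hypothetical sketch: cap the number of examples and clip long
# string fields so the constructed prompt stays bounded.

MAX_ITEMS = 100          # illustrative cap on examples used in the prompt
MAX_FIELD_CHARS = 200    # illustrative cap on each field's length

def truncate_examples(rows, max_items=MAX_ITEMS, max_chars=MAX_FIELD_CHARS):
    """Keep at most `max_items` rows and clip the 'input', 'output',
    and 'value' fields to `max_chars` characters each."""
    truncated = []
    for row in rows[:max_items]:
        truncated.append({
            key: (val[:max_chars] if isinstance(val, str) else val)
            for key, val in row.items()
            if key in ("input", "output", "value")
        })
    return truncated
```

This keeps prompt construction O(max_items) regardless of how large the source dataset is.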
Hi @viciousAegis, I would like you to review the code now. Previously, we had our single-turn implementation in a single file, but it has now been split to support single-turn and multi-turn approaches. For single turn: For multi-turn: Section 3.3's approach was also integrated into the implementation. @NathanHB, I would like your review on this as well.
|
Tagging #1074, as that's the current PR. Sorry, guys.

Approach described in #1056.
Tasks:
./tasks/tasks/long_horizon_execution.py
STATUS: ready for review.
Current behavior:
When we run `lighteval tasks inspect long_horizon_execution`, the output is shown below: