[EVAL] MultiChallenge #1075
Conversation
cc: @NathanHB, let me know your feedback on this!
Looking very good, thanks for the PR!! Only some nits that I think would make the definition better.
Tagging @kdesh0399 and @ekwinox117 in case you want to chime in 🤗
Thanks for the feedback! I have addressed your comments. Let me know if this is okay! P.S. I tried to run the evaluation task, but I ran out of credits 😅 Update: nvm, got it up and running.
Nice work, looks good to me!
NathanHB left a comment
Only a few quick nits and we can merge!
Thank you for this, it's very helpful 🤗
Pull request overview
This PR integrates the MultiChallenge benchmark, a multi-turn conversational evaluation dataset designed to test LLMs' ability to handle complex conversations with human users. The implementation follows the lighteval framework's patterns for inspect-ai integration, using custom scorers and solvers to evaluate model responses against specific pass criteria using a judge LLM.
Key Changes
- Adds a new task configuration for the MultiChallenge benchmark with judge-based evaluation
- Implements custom scorer using GPT-4o as a judge model to evaluate responses against pass/fail criteria
- Creates a conversation solver to handle multi-turn dialogue context properly
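For readers unfamiliar with the pattern, the sketch below shows what a judge-based scorer and a conversation solver can look like in inspect-ai. It assumes inspect-ai's public solver/scorer API; the judge template, the metadata layout, and names like `conversation_solver` are illustrative, not the PR's actual code.

```python
# A minimal sketch of the judge-scorer / conversation-solver pattern, assuming
# the inspect-ai API. JUDGE_TEMPLATE and the metadata layout are hypothetical.
from inspect_ai.model import ChatMessageAssistant, ChatMessageUser, get_model
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver

JUDGE_TEMPLATE = """You are given a model response and a pass criterion.
Answer YES if the response meets the criterion, otherwise answer NO.

Criterion: {criterion}
Response: {response}"""


@solver
def conversation_solver():
    """Replay the prior dialogue turns before generating the final reply."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Hypothetical metadata layout: a list of {"role", "content"} turns.
        turns = state.metadata.get("conversation", [])
        state.messages = [
            ChatMessageUser(content=t["content"])
            if t["role"] == "user"
            else ChatMessageAssistant(content=t["content"])
            for t in turns
        ]
        return await generate(state)

    return solve


@scorer(metrics=[accuracy()])
def judge_scorer(judge_model: str = "openai/gpt-4o"):
    """Ask a judge LLM whether the response satisfies the pass criterion."""

    async def score(state: TaskState, target: Target) -> Score:
        judge = get_model(judge_model)
        prompt = JUDGE_TEMPLATE.format(
            criterion=target.text, response=state.output.completion
        )
        verdict = await judge.generate(prompt)
        passed = verdict.completion.strip().upper().startswith("YES")
        return Score(
            value=CORRECT if passed else INCORRECT,
            explanation=verdict.completion,
        )

    return score
```

In inspect-ai, these two pieces are wired into a task through the `solver=` and `scorer=` arguments of `Task`.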
Hey @NathanHB, I've made the changes. Let me know what you think! https://huggingface.co/spaces/akshathmangudi/multi_challenge-gpt
Hey @akshathmangudi, thanks for the changes! I will have some days off for the end of the year, so I will be reviewing/testing your PRs at the beginning of the year. Have a nice holiday 🤗
Perfect, happy holidays!!
Hey @NathanHB, please let me know if there are any changes to be made. I would love to get this merged!
Hey @akshathmangudi! Thank you so much for the great work on this. I only had a few modifications to make to take better advantage of the inspect API and reduce the number of lines needed to define the task. Is it okay if I push my modifications to a branch (while keeping your contributions to the repo)? :)
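For context, this is roughly the kind of compression the inspect API allows: the built-in `model_graded_qa` scorer can stand in for a hand-rolled judge. The sketch below is a hypothetical illustration, not the actual code from #1120; the dataset path and record fields are assumptions.

```python
# Hypothetical compressed task definition using inspect-ai built-ins.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate


def record_to_sample(record: dict) -> Sample:
    # Field names are assumptions about the dataset schema.
    return Sample(input=record["conversation"], target=record["pass_criteria"])


@task
def multi_challenge() -> Task:
    return Task(
        dataset=hf_dataset(
            "ScaleAI/MultiChallenge",  # assumed dataset location
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=generate(),
        scorer=model_graded_qa(model="openai/gpt-4o"),
    )
```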
Absolutely, thanks for the help! Looking forward to contributing more to the repo.
Opened it here: #1120
Sounds good!

Overview
Resolves #1019
Current status: READY FOR REVIEW
This PR integrates MultiChallenge, a difficult benchmark that tests the ability of models to handle multi-turn conversations with human users. The PR consists of a single file, multi_challenge.py, which contains the general structure of the task, including a prompt function, and loads it as a configuration with LightEvalTaskConfig. The implementation was also tested, and results have been uploaded to my space here.
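For reference, here is the general shape of such a file, assuming lighteval's documented custom-task layout; the dataset repo, record fields, and metric below are illustrative placeholders rather than the PR's actual values.

```python
# A minimal sketch of a lighteval custom-task file, under assumed field names.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightEvalTaskConfig
from lighteval.tasks.requests import Doc


def multi_challenge_prompt(line: dict, task_name: str = None) -> Doc:
    """Turn one dataset record into a Doc the evaluator can run."""
    return Doc(
        task_name=task_name,
        query=line["conversation"],       # hypothetical field name
        choices=[line["pass_criteria"]],  # hypothetical field name
        gold_index=0,
    )


multi_challenge = LightEvalTaskConfig(
    name="multi_challenge",
    prompt_function=multi_challenge_prompt,
    suite=["community"],
    hf_repo="ScaleAI/MultiChallenge",  # assumed dataset location
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.exact_match],  # placeholder; the PR uses a judge-based metric
    generation_size=2048,
    stop_sequence=[],
)

# lighteval discovers custom tasks through this module-level table.
TASKS_TABLE = [multi_challenge]
```

A file like this is typically passed to lighteval through its custom-tasks mechanism, which picks up the configs listed in `TASKS_TABLE`.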