[EVAL] MultiChallenge #1075
Conversation
cc: @NathanHB, let me know your feedback on this!
Looking very good, thanks for the PR!! Only some nits that I think would make the definition better.
Tagging @kdesh0399 and @ekwinox117 in case you want to chime in 🤗
Thanks for the feedback! I have addressed your comments. Let me know if this is okay! P.S. I tried to run the evaluation task, but I ran out of credits 😅 Update: nvm, got it up and running.
Nice work, looks good to me!
NathanHB left a comment
Only a few quick nits and we can merge!
Thank you for this, it's very helpful 🤗
Pull request overview
This PR integrates the MultiChallenge benchmark, a multi-turn conversational evaluation dataset designed to test LLMs' ability to handle complex conversations with human users. The implementation follows the lighteval framework's patterns for inspect-ai integration, using custom scorers and solvers to evaluate model responses against specific pass criteria using a judge LLM.
Key Changes
- Adds a new task configuration for the MultiChallenge benchmark with judge-based evaluation
- Implements custom scorer using GPT-4o as a judge model to evaluate responses against pass/fail criteria
- Creates a conversation solver to handle multi-turn dialogue context properly
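For readers unfamiliar with the pattern, the sketch below shows what a judge-based scorer and a conversation solver can look like in inspect-ai. It assumes inspect-ai's public solver/scorer API; the judge template, the metadata layout, and names like `conversation_solver` are illustrative, not the PR's actual code.

```python
# A minimal sketch of the judge-scorer / conversation-solver pattern, assuming
# the inspect-ai API. JUDGE_TEMPLATE and the metadata layout are hypothetical.
from inspect_ai.model import ChatMessageAssistant, ChatMessageUser, get_model
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver

JUDGE_TEMPLATE = """You are given a model response and a pass criterion.
Answer YES if the response meets the criterion, otherwise answer NO.

Criterion: {criterion}
Response: {response}"""


@solver
def conversation_solver():
    """Replay the prior dialogue turns before generating the final reply."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Hypothetical metadata layout: a list of {"role", "content"} turns.
        turns = state.metadata.get("conversation", [])
        state.messages = [
            ChatMessageUser(content=t["content"])
            if t["role"] == "user"
            else ChatMessageAssistant(content=t["content"])
            for t in turns
        ]
        return await generate(state)

    return solve


@scorer(metrics=[accuracy()])
def judge_scorer(judge_model: str = "openai/gpt-4o"):
    """Ask a judge LLM whether the response satisfies the pass criterion."""

    async def score(state: TaskState, target: Target) -> Score:
        judge = get_model(judge_model)
        prompt = JUDGE_TEMPLATE.format(
            criterion=target.text, response=state.output.completion
        )
        verdict = await judge.generate(prompt)
        passed = verdict.completion.strip().upper().startswith("YES")
        return Score(
            value=CORRECT if passed else INCORRECT,
            explanation=verdict.completion,
        )

    return score
```

In inspect-ai, these two pieces are wired into a task through the `solver=` and `scorer=` arguments of `Task`.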
Hey @NathanHB, I've made the changes. Let me know what you think! https://huggingface.co/spaces/akshathmangudi/multi_challenge-gpt
Hey @akshathmangudi, thanks for the changes! I will have some days off for the end of the year, so I will be reviewing/testing your PRs at the beginning of the year. Have a nice holiday 🤗
Perfect, happy holidays!!
Hey @NathanHB, please let me know if there are any changes to be made. I would love to get this merged!
Hey @akshathmangudi! Thank you so much for the great work on this. I only had a few modifications to make to take better advantage of the inspect API and reduce the number of lines needed to define the task. Is it okay if I push my modifications to a branch (while keeping your contributions to the repo)? :)
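For context, this is roughly the kind of compression the inspect API allows: the built-in `model_graded_qa` scorer can stand in for a hand-rolled judge. The sketch below is a hypothetical illustration, not the actual code from #1120; the dataset path and record fields are assumptions.

```python
# Hypothetical compressed task definition using inspect-ai built-ins.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate


def record_to_sample(record: dict) -> Sample:
    # Field names are assumptions about the dataset schema.
    return Sample(input=record["conversation"], target=record["pass_criteria"])


@task
def multi_challenge() -> Task:
    return Task(
        dataset=hf_dataset(
            "ScaleAI/MultiChallenge",  # assumed dataset location
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=generate(),
        scorer=model_graded_qa(model="openai/gpt-4o"),
    )
```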
Absolutely, thanks for the help! Looking forward to contributing more to the repo.
Opened it here: #1120
Sounds good!

Overview
Resolves #1019
Current status: READY FOR REVIEW
This PR integrates MultiChallenge, a difficult benchmark that tests the ability of models to handle multi-turn conversations with human users. The PR consists of a single file, multi_challenge.py, which contains the general structure of the task, including a prompt function, and loads it as a configuration with LightEvalTaskConfig. The implementation was also tested, and results have been uploaded to my space here.
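For reference, here is the general shape of such a file, assuming lighteval's documented custom-task layout; the dataset repo, record fields, and metric below are illustrative placeholders rather than the PR's actual values.

```python
# A minimal sketch of a lighteval custom-task file, under assumed field names.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightEvalTaskConfig
from lighteval.tasks.requests import Doc


def multi_challenge_prompt(line: dict, task_name: str = None) -> Doc:
    """Turn one dataset record into a Doc the evaluator can run."""
    return Doc(
        task_name=task_name,
        query=line["conversation"],       # hypothetical field name
        choices=[line["pass_criteria"]],  # hypothetical field name
        gold_index=0,
    )


multi_challenge = LightEvalTaskConfig(
    name="multi_challenge",
    prompt_function=multi_challenge_prompt,
    suite=["community"],
    hf_repo="ScaleAI/MultiChallenge",  # assumed dataset location
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.exact_match],  # placeholder; the PR uses a judge-based metric
    generation_size=2048,
    stop_sequence=[],
)

# lighteval discovers custom tasks through this module-level table.
TASKS_TABLE = [multi_challenge]
```

A file like this is typically passed to lighteval through its custom-tasks mechanism, which picks up the configs listed in `TASKS_TABLE`.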