Sync fork from main#31

Merged
ljvmiranda921 merged 13 commits into filbench:main from huggingface:main
Jul 7, 2025

Conversation

@ljvmiranda921

No description provided.

NathanHB and others added 13 commits June 20, 2025 15:04
* set default temperature to 0 in generation config

* issue warning when temperature == 0 with multiple samples

* fix test
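The "issue warning when temperature == 0 with multiple samples" commit can be sketched as below. The class and field names (`GenerationConfig`, `num_samples`) are illustrative stand-ins, not lighteval's actual API; the point is only the guard condition.

```python
import warnings
from dataclasses import dataclass

# Hypothetical generation config; field names are illustrative,
# not lighteval's real config class.
@dataclass
class GenerationConfig:
    temperature: float = 0.0  # greedy decoding by default
    num_samples: int = 1

def check_sampling_config(cfg: GenerationConfig) -> None:
    # Sampling several outputs at temperature 0 is pointless:
    # every sample is the identical greedy continuation.
    if cfg.temperature == 0 and cfg.num_samples > 1:
        warnings.warn(
            "temperature == 0 with num_samples > 1: all samples "
            "will be identical; set temperature > 0 to sample."
        )
```

A warning (rather than a hard error) lets existing configs keep running while flagging the likely mistake.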
* Catch ROCm/HIP OOM in should_reduce_batch_size

* fix formatting

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* Update german_rag_evals.py

* Update saving-and-reading-results.mdx
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* add TUMLU-mini benchmark, solves #577

* add benchmark info for tumlu-mini

* Update community_tasks/turkic_evals.py

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
## What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥 

### Highlights

- **Prompt Manager Overhaul:** Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
   - **system-prompt**: now part of the model config
   - **use-chat-template**: now part of the model config
- **Metrics Slimdown:** Metrics now only care about `samplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- **Request Layer Gone:** Models get the raw `Doc` directly; no more unnecessary `request` wrappers bloating the code.
- **Unified ModelResponse:** All models return a single `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- **Consistent Metric Signatures:** Every metric now uses the same function signature: `compute(doc: Doc, model_response: ModelResponse)`.
- **Standardized Details:** Each sample’s details now always include three fields: doc, metric, and model_response.
- **Generative Metrics Unified:** All generative metrics now work the same way. Users who want greedy generation must set temperature to 0. **An exception is raised if the user tries to run a sampling metric with temperature 0.**
- **Removed Loglikelihood Single Token:** bloated and rarely used.
- **Tests:** All tests pass, and no changes were needed to expected values.
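The unified metric signature described above can be sketched as follows. `Doc` and `ModelResponse` here are minimal illustrative stand-ins (the real lighteval classes carry many more fields); the point is that generative and loglikelihood metrics now share one `compute(doc, model_response)` shape.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for lighteval's Doc and ModelResponse;
# the real classes carry more fields than shown here.
@dataclass
class Doc:
    query: str
    choices: list
    gold_index: int

@dataclass
class ModelResponse:
    text: list = field(default_factory=list)      # generative outputs
    logprobs: list = field(default_factory=list)  # loglikelihood scores

# A generative metric and a loglikelihood metric, same signature.
def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    gold = doc.choices[doc.gold_index]
    return float(model_response.text[0].strip() == gold.strip())

def loglikelihood_acc(doc: Doc, model_response: ModelResponse) -> float:
    # Pick the choice the model assigned the highest log-probability.
    best = max(range(len(model_response.logprobs)),
               key=model_response.logprobs.__getitem__)
    return float(best == doc.gold_index)
```

Because every metric takes the same two arguments, the logging layer can record `doc`, `metric`, and `model_response` uniformly for each sample, as the "Standardized Details" bullet describes.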

### Why?

- Less code, fewer headaches.
- Easier to add new benchmarks (including weird and wonderful ones).
- More user-friendly inspection tools.
- A single, unified way to handle prompts, responses, and metrics.

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: clementine@huggingface.co <clementine@huggingface.co>
* too many false positives with the current gpqa metric extraction, making it more strict

* fixing whitespace and instruction in prompt

* better to have a strict extraction for index extraction in general actually

* added comment

* fix tests, need to invert condition
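The stricter index extraction described in these commits could look something like the sketch below: accept only a line that is essentially just an answer letter (optionally prefixed with "Answer:"), rather than matching any stray A–D anywhere in the output. The regex and function name are hypothetical, not lighteval's actual extraction code.

```python
import re

# Hypothetical strict extractor: the whole line must be an isolated
# answer letter, optionally wrapped in parentheses or prefixed with
# "Answer:". Loose matching (any A-D anywhere) caused false positives.
STRICT_ANSWER = re.compile(
    r"^\s*(?:answer\s*:\s*)?\(?([A-D])\)?\s*\.?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_answer_index(text: str) -> "int | None":
    m = STRICT_ANSWER.search(text)
    return "ABCD".index(m.group(1).upper()) if m else None
```

Under this scheme, ambiguous prose like "maybe A or B" extracts nothing instead of silently matching the first letter, which is the trade-off the commits describe: fewer false positives at the cost of stricter formatting requirements.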
Translations provided by Kairit Sirts
* Update extractive_match_utils.py for words where `:` is preceded by a space (#831)

* Update extractive_match_utils.py for words where `:` is preceded by a space

* fix style
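The "`:` preceded by a space" fix handles prompts in languages where a space precedes the colon (e.g. French "Réponse : B"). A minimal sketch of the idea, with a hypothetical prefix pattern rather than lighteval's actual extractive-match code:

```python
import re

# Illustrative only: allow optional whitespace before the colon when
# stripping an answer prefix, so "Réponse : B" behaves like "Answer: B".
PREFIX = re.compile(r"^\s*(?:answer|réponse)\s*:\s*", re.IGNORECASE)

def strip_answer_prefix(line: str) -> str:
    # Remove at most one leading "<word> :"-style prefix.
    return PREFIX.sub("", line, count=1)
```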
@ljvmiranda921 ljvmiranda921 merged commit ad948ce into filbench:main Jul 7, 2025

10 participants