Sync fork from main #31
Merged
ljvmiranda921 merged 13 commits into filbench:main on Jul 7, 2025
Conversation
* Set default temperature to 0 in the generation config
* Issue a warning when temperature == 0 with multiple samples
* Fix test
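The temperature check described above can be sketched as follows. This is a hypothetical helper (`check_sampling_config` is not a lighteval function), assuming the warning fires when greedy decoding is combined with multiple samples:

```python
import logging

logger = logging.getLogger(__name__)

def check_sampling_config(temperature: float, num_samples: int) -> bool:
    """Warn (and return True) when sampling multiple times at temperature 0.

    Hypothetical helper illustrating the described check; lighteval's
    actual implementation may differ.
    """
    if temperature == 0 and num_samples > 1:
        logger.warning(
            "temperature is 0 but %d samples requested; greedy decoding "
            "will produce identical outputs",
            num_samples,
        )
        return True
    return False
```

The point of the check is user-facing: sampling several completions greedily silently wastes compute, since every sample is identical.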
* Catch ROCm/HIP OOM in `should_reduce_batch_size`
* Fix formatting

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
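The idea behind the ROCm/HIP fix is that OOM detection based only on CUDA error strings misses AMD GPUs, where PyTorch raises a `RuntimeError` with HIP-specific wording. A minimal sketch of such a check (the marker strings are assumptions; the real helper in accelerate's memory utilities covers more cases):

```python
def is_device_oom(exc: BaseException) -> bool:
    """Heuristic OOM detection covering both CUDA and ROCm/HIP wording.

    A sketch of the idea behind the fix, not accelerate's actual code.
    """
    markers = (
        "CUDA out of memory",    # NVIDIA allocator message
        "HIP out of memory",     # ROCm/HIP allocator message
        "hipErrorOutOfMemory",
    )
    return isinstance(exc, RuntimeError) and any(m in str(exc) for m in markers)
```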
* Update german_rag_evals.py
* Update saving-and-reading-results.mdx

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
## What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning. The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

### Highlights

- **Prompt Manager Overhaul:** Each model now owns its own `PromptManager` instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
- **system-prompt:** Now part of the model config.
- **use-chat-template:** Now part of the model config.
- **Metrics Slimdown:** Metrics now only care about `samplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- **Request Layer Gone:** Models get the raw `Doc` directly; no more unnecessary `request` wrappers bloating the code.
- **Unified ModelResponse:** All models return a single `ModelResponse` type, whether generative or loglikelihood, which means simpler logging and metric computation.
- **Consistent Metric Signatures:** Every metric now uses the same function signature: `compute(doc: Doc, model_response: ModelResponse)`.
- **Standardized Details:** Each sample's details now always include three fields: `doc`, `metric`, and `model_response`.
- **Generative Metrics Unified:** All generative metrics now work the same way. Users who want greedy generation must set the temperature to 0. **An exception is raised if the user tries to run a sampling metric with temperature = 0.**
- **Removed Loglikelihood Single Token:** Bloated and almost unused.
- **Tests:** All tests pass, and no changes were needed to expected values.

### Why?

- Less code, fewer headaches.
- Easier to add new benchmarks (including weird and wonderful ones).
- More user-friendly inspection tools.
- A single, unified way to handle prompts, responses, and metrics.
---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: clementine@huggingface.co <clementine@huggingface.co>
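The unified `compute(doc: Doc, model_response: ModelResponse)` signature described in the PR can be illustrated with a toy metric. The `Doc` and `ModelResponse` stand-ins below are minimal sketches; their field names (`query`, `choices`, `gold_index`, `text`) are illustrative assumptions, not the library's exact schema:

```python
from dataclasses import dataclass

# Minimal stand-ins for lighteval's Doc and ModelResponse; the field
# names below are illustrative, not the library's exact schema.
@dataclass
class Doc:
    query: str
    choices: list[str]
    gold_index: int

@dataclass
class ModelResponse:
    text: list[str]  # generated completions

def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    """Toy metric following the unified compute(doc, model_response) shape."""
    gold = doc.choices[doc.gold_index]
    return float(model_response.text[0].strip() == gold.strip())
```

Because every metric receives the same two objects, the logger can record `doc`, `metric`, and `model_response` for each sample without per-metric special cases.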
* Too many false positives with the current GPQA metric extraction; make it more strict
* Fix whitespace and instruction in prompt
* Better to have a strict extraction for index extraction in general, actually
* Added comment
* Fix tests; needed to invert the condition
Translations provided by Kairit Sirts
… space (#831)

* Update extractive_match_utils.py for words where `:` is preceded by a space
* Fix style