Sync fork from main#31

Merged
ljvmiranda921 merged 13 commits into filbench:main from huggingface:main
Jul 7, 2025

Conversation

@ljvmiranda921

No description provided.

NathanHB and others added 13 commits June 20, 2025 15:04
* set default temperature to 0 in generation config

* issue warning when temperature == 0 with multiple samples

* fix test
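The "issue warning when temperature == 0 with multiple samples" commit can be sketched as below. The class and field names (`GenerationConfig`, `num_samples`) are illustrative stand-ins, not lighteval's actual API; the point is only the guard condition.

```python
import warnings
from dataclasses import dataclass

# Hypothetical generation config; field names are illustrative,
# not lighteval's real config class.
@dataclass
class GenerationConfig:
    temperature: float = 0.0  # greedy decoding by default
    num_samples: int = 1

def check_sampling_config(cfg: GenerationConfig) -> None:
    # Sampling several outputs at temperature 0 is pointless:
    # every sample is the identical greedy continuation.
    if cfg.temperature == 0 and cfg.num_samples > 1:
        warnings.warn(
            "temperature == 0 with num_samples > 1: all samples "
            "will be identical; set temperature > 0 to sample."
        )
```

A warning (rather than a hard error) lets existing configs keep running while flagging the likely mistake.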
* Catch ROCm/HIP OOM in should_reduce_batch_size

* fix formatting

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* Update german_rag_evals.py

* Update saving-and-reading-results.mdx
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* add TUMLU-mini benchmark, solves #577

* add benchmark info for tumlu-mini

* Update community_tasks/turkic_evals.py

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
## What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥 

### Highlights

- **Prompt Manager Overhaul:** Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
   - **system-prompt**: now part of the model config
   - **use-chat-template**: now part of the model config
- **Metrics Slimdown:** Metrics now only care about `samplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- **Request Layer Gone:** Models get the raw `Doc` directly; no more unnecessary `request` wrappers bloating the code.
- **Unified ModelResponse:** All models return a single `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- **Consistent Metric Signatures:** Every metric now uses the same function signature: `compute(doc: Doc, model_response: ModelResponse)`.
- **Standardized Details:** Each sample’s details now always include three fields: doc, metric, and model_response.
- **Generative Metrics Unified:** All generative metrics now work the same way. Users who want greedy generation must set temperature to 0. **An exception is raised if the user tries to run a sampling metric with temperature 0.**
- **Removed Loglikelihood Single Token:** bloated and rarely used.
- **Tests:** All tests pass, and no changes were needed to expected values.
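The unified metric signature described above can be sketched as follows. `Doc` and `ModelResponse` here are minimal illustrative stand-ins (the real lighteval classes carry many more fields); the point is that generative and loglikelihood metrics now share one `compute(doc, model_response)` shape.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for lighteval's Doc and ModelResponse;
# the real classes carry more fields than shown here.
@dataclass
class Doc:
    query: str
    choices: list
    gold_index: int

@dataclass
class ModelResponse:
    text: list = field(default_factory=list)      # generative outputs
    logprobs: list = field(default_factory=list)  # loglikelihood scores

# A generative metric and a loglikelihood metric, same signature.
def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    gold = doc.choices[doc.gold_index]
    return float(model_response.text[0].strip() == gold.strip())

def loglikelihood_acc(doc: Doc, model_response: ModelResponse) -> float:
    # Pick the choice the model assigned the highest log-probability.
    best = max(range(len(model_response.logprobs)),
               key=model_response.logprobs.__getitem__)
    return float(best == doc.gold_index)
```

Because every metric takes the same two arguments, the logging layer can record `doc`, `metric`, and `model_response` uniformly for each sample, as the "Standardized Details" bullet describes.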

### Why?

- Less code, fewer headaches.
- Easier to add new benchmarks (including weird and wonderful ones).
- More user-friendly inspection tools.
- A single, unified way to handle prompts, responses, and metrics.

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: clementine@huggingface.co <clementine@huggingface.co>
* too many false positives with the current gpqa metric extraction, making it more strict

* fixing whitespace and instruction in prompt

* better to have a strict extraction for index extraction in general actually

* added comment

* fix tests, need to invert condition
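The stricter index extraction described in these commits could look something like the sketch below: accept only a line that is essentially just an answer letter (optionally prefixed with "Answer:"), rather than matching any stray A–D anywhere in the output. The regex and function name are hypothetical, not lighteval's actual extraction code.

```python
import re

# Hypothetical strict extractor: the whole line must be an isolated
# answer letter, optionally wrapped in parentheses or prefixed with
# "Answer:". Loose matching (any A-D anywhere) caused false positives.
STRICT_ANSWER = re.compile(
    r"^\s*(?:answer\s*:\s*)?\(?([A-D])\)?\s*\.?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_answer_index(text: str) -> "int | None":
    m = STRICT_ANSWER.search(text)
    return "ABCD".index(m.group(1).upper()) if m else None
```

Under this scheme, ambiguous prose like "maybe A or B" extracts nothing instead of silently matching the first letter, which is the trade-off the commits describe: fewer false positives at the cost of stricter formatting requirements.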
Translations provided by Kairit Sirts
* Update extractive_match_utils.py for words where `:` is preceded by a space (#831)

* Update extractive_match_utils.py for words where `:` is preceded by a space

* fix style
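The "`:` preceded by a space" fix handles prompts in languages where a space precedes the colon (e.g. French "Réponse : B"). A minimal sketch of the idea, with a hypothetical prefix pattern rather than lighteval's actual extractive-match code:

```python
import re

# Illustrative only: allow optional whitespace before the colon when
# stripping an answer prefix, so "Réponse : B" behaves like "Answer: B".
PREFIX = re.compile(r"^\s*(?:answer|réponse)\s*:\s*", re.IGNORECASE)

def strip_answer_prefix(line: str) -> str:
    # Remove at most one leading "<word> :"-style prefix.
    return PREFIX.sub("", line, count=1)
```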
@ljvmiranda921 ljvmiranda921 merged commit ad948ce into filbench:main Jul 7, 2025

10 participants