Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull Request Overview
This PR refactors the prompt-building and evaluation logic in lighteval by removing legacy request wrappers, unifying data structures (Doc and ModelResponse), and simplifying pipeline and registry handling.
- Introduces a single `Doc` dataclass for all task inputs and a unified `ModelResponse`
- Replaces multiple request types and response classes with `SamplingMethod` and `ModelResponse`
- Updates `Pipeline`, `Registry`, and prompt management to work with the new structures
Reviewed Changes
Copilot reviewed 84 out of 89 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/utils.py | Update FakeModel to return ModelResponse and use Doc |
| src/lighteval/tasks/default_prompts.py | Changed default prompt construction, removed instructions |
| src/lighteval/tasks/requests.py | Replaced old request classes with a large Doc dataclass |
| src/lighteval/models/model_output.py | Consolidated response types into a single, expanded ModelResponse |
Comments suppressed due to low confidence (1)
src/lighteval/tasks/default_prompts.py:64
- The `instructions` variable was removed from the default prompt, so any task-specific instructions will no longer appear. Consider restoring `instructions` (e.g. `f"{instructions}\n{question}\n{formatted_choices}"`) or explicitly handling when `instructions` is empty.

```python
prompt = f"\n{question}\n{formatted_choices}"
```
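A minimal sketch of the suggested fix, handling the case where `instructions` is empty (the helper function is hypothetical; variable names follow the snippet above):

```python
def build_prompt(question: str, formatted_choices: str, instructions: str = "") -> str:
    # Prepend task-specific instructions only when they are non-empty,
    # so the prompt does not start with a stray leading newline.
    if instructions:
        return f"{instructions}\n{question}\n{formatted_choices}"
    return f"{question}\n{formatted_choices}"
```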
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```python
return LogprobCorpusMetricInput(golds=gold_ixs, preds=np.argmax(choices_logprob))
```

```python
class TargetPerplexityPreparator:
```
Why introduce a new class instead of adding an `is_target` parameter (`False` by default) to the next one? (especially when so much of the code is the same)
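A rough sketch of the suggested alternative (class, method, and field names are hypothetical, modeled on the preparator pattern under discussion):

```python
class PerplexityPreparator:
    """Prepares text for perplexity metrics over either the full prompt or the target only."""

    def __init__(self, is_target: bool = False):
        # False by default: score the whole prompt, as before.
        self.is_target = is_target

    def prepare(self, doc_text: str, target_text: str) -> str:
        # With is_target=True this covers the former TargetPerplexityPreparator,
        # so the two near-identical classes collapse into one.
        return target_text if self.is_target else doc_text
```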
```python
if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
    raise ValueError(
        "You cannot generate multiple samples with temperature=0. Please set temperature > 0. Or use a non sampling metric."
    )
```
I wonder if we could put this check in the abstract class instead
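For illustration, a minimal sketch of hoisting the check into an abstract base class (class and attribute names are assumptions, not lighteval's actual API):

```python
from abc import ABC, abstractmethod


class AbstractModel(ABC):
    """Base class: the shared sampling-parameter guard lives here once."""

    def __init__(self, generation_config_dict: dict):
        self.generation_config_dict = generation_config_dict

    def validate_sampling(self, num_samples: int) -> None:
        # Every concrete model inherits this instead of re-implementing it.
        if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
            raise ValueError(
                "You cannot generate multiple samples with temperature=0. "
                "Please set temperature > 0, or use a non-sampling metric."
            )

    @abstractmethod
    def generate(self, num_samples: int):
        ...


class FakeModel(AbstractModel):
    def generate(self, num_samples: int):
        self.validate_sampling(num_samples)
        return ["sample"] * num_samples
```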
…ace/lighteval into nathan-refactor-prompt-building
```python
pad_amount = global_max_choices - cont_batch.shape[0]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
```
Shouldn't it be

```python
pad_amount = global_max_choices - cont_batch.shape[1]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
```

here?
Hum, then I get other shape errors in torch.stack. Something looks wrong here
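To illustrate the shape question being debated: `F.pad(x, (0, n))` pads the *last* dimension, so the pad amount must be computed from the last axis's size, and stacking only works once every batch shares the same shape. A numpy analogue (numpy is used here instead of torch so the snippet is self-contained; `np.pad`'s pad widths play the role of `F.pad`'s tuple):

```python
import numpy as np

global_max_choices = 5

# Each batch has a different number of choices along the last axis.
batches = [np.zeros((2, 3), dtype=int), np.zeros((2, 5), dtype=int)]

padded = []
for cont_batch in batches:
    # Pad amount must come from the LAST axis (shape[1] here),
    # since the padding is appended to the end of that axis.
    pad_amount = global_max_choices - cont_batch.shape[1]
    padded.append(np.pad(cont_batch, ((0, 0), (0, pad_amount)), constant_values=-1))

stacked = np.stack(padded)  # only possible once every batch has the same shape
```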
## What does this PR do?

This PR gives the prompt building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

### Highlights

- **Prompt Manager Overhaul:** Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
- **system-prompt:** now part of the model config
- **use-chat-template:** now part of the model config
- **Metrics Slimdown:** Metrics now only care about `SamplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- **Request Layer Gone:** Models get the raw `Doc` directly, with no more unnecessary `request` wrappers bloating the code.
- **Unified ModelResponse:** All models return a single `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- **Consistent Metric Signatures:** Every metric now uses the same function signature: `compute(doc: Doc, model_response: ModelResponse)`.
- **Standardized Details:** Each sample's details now always include three fields: doc, metric, and model_response.
- **Generative Metrics Unified:** All generative metrics now work the same way. If users want greedy generation, they need to set temperature to 0. **An exception will be raised if the user tries to run a sampling metric with temp = 0.**
- **Removed Loglikelihood Single Token:** bloated and almost unused.
- **Tests:** All tests pass, and no changes were needed to expected values.

### Why?

- Less code, fewer headaches.
- Easier to add new benchmarks (including weird and wonderful ones).
- More user-friendly inspection tools.
- A single, unified way to handle prompts, responses, and metrics.
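A minimal sketch of what the unified metric signature enables (the `Doc` and `ModelResponse` fields shown are assumptions for illustration, not lighteval's exact dataclasses):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    query: str
    choices: list[str] = field(default_factory=list)
    gold_index: int = 0


@dataclass
class ModelResponse:
    text: list[str] = field(default_factory=list)       # generative outputs
    logprobs: list[float] = field(default_factory=list)  # loglikelihood outputs


def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    # Every metric shares the compute(doc, model_response) shape,
    # so the pipeline can invoke any of them uniformly.
    gold = doc.choices[doc.gold_index]
    return float(model_response.text[0].strip() == gold)
```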
--------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: clementine@huggingface.co <clementine@huggingface.co>
[Image: architecture of lighteval]

[Image: Example details dataset]