Fix `SFTTrainer` crash when `train_dataset=None` (#5131)
albertvillanova wants to merge 1 commit into huggingface:main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This is a somewhat unusual use case. However, if it’s straightforward to support (as it appears to be) I think it would be reasonable to add it. If we decide to move forward, we should:
One concern I have is that
Thanks for your quick feedback, @qgallouedec. The main point here is to decide whether we want to officially support the evaluation-only use case. From my perspective, I'm not fully convinced that we should support it:
So philosophically, I lean toward not expanding the surface area unless we see strong evidence that this is a common and justified workflow. That said, in #2004 it does seem there was at least some expectation (or implicit support) for evaluation-only usage. So before deciding, I think we should clarify:
If we do decide to support it, I agree with your proposed steps above. So my suggestion would be:
Happy to align either way, but I think we should make this a deliberate design decision rather than just fixing the crash.
Fix `SFTTrainer` crash when `train_dataset=None`.

This PR improves the robustness and flexibility of `SFTTrainer` initialization when no training dataset is provided. The changes ensure that loss configuration and dataset checks are only performed when a training dataset is present, preventing errors and improving compatibility with different training scenarios.

Problem
`SFTTrainer.__init__` accepts `train_dataset=None` in its signature, but crashes with a `TypeError` immediately on construction:
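For illustration, the unguarded pattern can be reproduced with a self-contained sketch (the function and field names below are hypothetical stand-ins, not the actual TRL source):

```python
def detect_completion_only_loss(train_dataset, configured=None):
    """Sketch of the unguarded init logic: peek at the first sample to
    decide whether to compute loss on completion tokens only."""
    if configured is not None:
        return configured
    # Crashes when train_dataset is None:
    # TypeError: 'NoneType' object is not subscriptable
    dataset_sample = train_dataset[0]
    return "completion" in dataset_sample


# Works when a dataset is supplied...
print(detect_completion_only_loss([{"prompt": "hi", "completion": "there"}]))

# ...but an evaluation-only setup crashes on construction
try:
    detect_completion_only_loss(None)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The same unbound/`None` access happens before the user ever calls `train()`, which is why even evaluation-only construction fails.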
Two further sites share the same root cause:
- `is_conversational(dataset_sample)`: references `dataset_sample`, which is unbound when `train_dataset` is `None`
- `self._prepare_dataset(train_dataset, ...)`: crashes inside the method on `dataset.map()`

Solution
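In outline, the guarded version of this logic might look like the following self-contained sketch (hypothetical helper and field names, mirroring but not copying the actual diff):

```python
def configure_from_dataset(train_dataset, args_completion_only_loss=None):
    """Sketch of the guarded pattern: only inspect the data when it exists."""
    if train_dataset is not None:
        dataset_sample = train_dataset[0]
        if args_completion_only_loss is not None:
            completion_only_loss = args_completion_only_loss
        else:
            # Auto-detect from the first sample, as before
            completion_only_loss = "completion" in dataset_sample
        is_vision_dataset = "images" in dataset_sample
    else:
        # Sensible defaults when there is no data to inspect:
        # respect an explicit config value, else fall back to standard LM loss
        completion_only_loss = args_completion_only_loss or False
        is_vision_dataset = False
    return completion_only_loss, is_vision_dataset


# No crash with train_dataset=None; defaults are used
print(configure_from_dataset(None))
print(configure_from_dataset(None, args_completion_only_loss=True))
```

The key design choice is that an explicitly configured value always wins, while auto-detection is attempted only when there is a sample to inspect.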
Guard all three sites with `if train_dataset is not None`, with sensible defaults in the `else` branch:

- `completion_only_loss = args.completion_only_loss or False`: respects an explicit config value, otherwise falls back to standard LM loss (the safe choice when there is no data to inspect)
- `_is_vision_dataset = False`: cannot be detected without data; users relying on vision collation must supply `data_collator` explicitly

Key improvements:
- Loss configuration (`completion_only_loss`, `_is_vision_dataset`) is only performed when `train_dataset` is present; otherwise, sensible defaults are set.
- Dataset preparation (`_prepare_dataset`) is only called if `train_dataset` is not `None`.

Discussion: should evaluation-only be a supported use case?
Related PR: `SFTTrainer.evaluate()` and `SFTTrainer.predict()` with null train_dataset (#2004)

PR #2004 explicitly added support for `SFTTrainer.evaluate()` and `SFTTrainer.predict()` without a `train_dataset`. This fix preserves that intent.

However, the same issue exists in several other trainers (`DPOTrainer`, `RewardTrainer`, and multiple experimental trainers), each with different degrees of fitness for evaluation-only workflows.
Before addressing those, it would be good to align on whether evaluation-only is an officially supported pattern for SFTTrainer and, if so, which other trainers should follow suit.
Once this is agreed and merged, I will fix the remaining trainers consistently with the agreed approach in a follow-up PR.
CC: @qgallouedec, @lewtun (who approved #2004)