
Fix SFTTrainer crash when train_dataset=None #5131

Open

albertvillanova wants to merge 1 commit into huggingface:main from albertvillanova:fix-train-dataset-none-sft

Conversation

@albertvillanova (Member) commented Feb 19, 2026

Fix SFTTrainer crash when train_dataset=None.

This PR improves the robustness and flexibility of the SFTTrainer initialization when handling cases where no training dataset is provided. The changes ensure that loss configuration and dataset checks are only performed when a training dataset is present, preventing errors and improving compatibility with different training scenarios.

Problem

SFTTrainer.__init__ accepts train_dataset=None in its signature, but crashes with a TypeError immediately on construction:

dataset_sample = next(iter(train_dataset))  # TypeError: 'NoneType' object is not iterable

Two further sites share the same root cause:

  • is_conversational(dataset_sample): references dataset_sample, which is unbound when train_dataset is None
  • self._prepare_dataset(train_dataset, ...): crashes inside the method on dataset.map()
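
A minimal reproduction sketch (the model id and the tiny eval split are illustrative, not part of the diff):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Illustrative eval-only split in the plain "text" format SFTTrainer accepts.
eval_dataset = Dataset.from_dict({"text": ["Hello world", "Goodbye world"]})

# Before this fix, construction itself raises:
#   TypeError: 'NoneType' object is not iterable
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # illustrative model id
    args=SFTConfig(output_dir="sft-eval-only", report_to="none"),
    train_dataset=None,
    eval_dataset=eval_dataset,
)
```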

Solution

Guard all three sites with if train_dataset is not None, setting sensible defaults in the else branch (sketched after this list):

  • completion_only_loss = args.completion_only_loss or False: respects an explicit config value, otherwise falls back to standard LM loss (the safe choice when there is no data to inspect)
  • _is_vision_dataset = False: cannot be detected without data; users relying on vision collation must supply data_collator explicitly
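
A simplified sketch of the guarded initialization (paraphrased, not the exact diff; the auto-detection heuristics inside the if branch are an assumption about the existing implementation):

```python
# Inside SFTTrainer.__init__ (simplified sketch)
if train_dataset is not None:
    dataset_sample = next(iter(train_dataset))
    # Auto-detect only when the config leaves completion_only_loss unset.
    if args.completion_only_loss is None:
        self.completion_only_loss = (
            "prompt" in dataset_sample and "completion" in dataset_sample
        )
    else:
        self.completion_only_loss = args.completion_only_loss
    self._is_vision_dataset = (
        "image" in dataset_sample or "images" in dataset_sample
    )
else:
    # No data to inspect: respect an explicit config value, otherwise fall
    # back to standard LM loss; vision collation cannot be auto-detected.
    self.completion_only_loss = args.completion_only_loss or False
    self._is_vision_dataset = False
```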

Key improvements:

  • Added checks to ensure loss configuration (completion_only_loss, _is_vision_dataset) is only performed when train_dataset is present; otherwise, sensible defaults are set.
  • Guarded the assistant-only loss validation to only run if a training dataset is provided, preventing errors when no dataset is available.
  • Ensured dataset preparation (_prepare_dataset) is only called if train_dataset is not None.

Discussion: should evaluation-only be a supported use case?

PR #2004 explicitly added support for SFTTrainer.evaluate() and SFTTrainer.predict() without a train_dataset. This fix preserves that intent.

However, the same issue exists in several other trainers (DPOTrainer, RewardTrainer, and multiple experimental trainers), each with different degrees of fitness for evaluation-only workflows.

Before addressing those, it would be good to align on whether evaluation-only is an officially supported pattern for SFTTrainer and, if so, which other trainers should follow suit.

Once this is agreed and merged, I will fix the remaining trainers consistently with this approach in a follow-up PR.

CC: @qgallouedec, @lewtun (who approved #2004)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member)

This is a somewhat unusual use case. However, if it’s straightforward to support (as it appears to be), I think it would be reasonable to add it.

If we decide to move forward, we should:

  • Update the documentation (by adding evaluate and predict to the [[autodoc]] SFTTrainer)
  • Add a dedicated test case (a sketch follows below)
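
Such a test could look roughly like this (a hypothetical sketch; the tiny model id and dataset shape are illustrative):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer


def test_evaluate_without_train_dataset(tmp_path):
    # Hypothetical sketch: eval-only construction must not crash,
    # and evaluate() should return metrics.
    eval_dataset = Dataset.from_dict({"text": ["Hello world", "Goodbye world"]})
    trainer = SFTTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",  # illustrative tiny model
        args=SFTConfig(output_dir=str(tmp_path), report_to="none"),
        train_dataset=None,
        eval_dataset=eval_dataset,
    )
    metrics = trainer.evaluate()
    assert "eval_loss" in metrics
```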

One concern I have is that predict currently does not prepare the dataset. That goes against the design philosophy of SFTTrainer, which is intended to work with unprocessed data out of the box.

@albertvillanova (Member, Author)

Thanks for your fast feedback, @qgallouedec.

The main point here is to discuss whether we want to officially support the evaluation-only use case.

From my perspective, I’m not fully convinced that we should support it:

  • SFTTrainer is conceptually designed as a training-oriented abstraction
  • If users only want evaluation, they can already use Trainer directly, which avoids adding special cases to SFTTrainer

So philosophically, I lean toward not expanding the surface area unless we see strong evidence that this is a common and justified workflow.

That said, in #2004 it does seem there was at least some expectation (or implicit support) for evaluation-only usage. So before deciding, I think we should clarify:

  • Was evaluation-only indeed intentionally supported?
  • Do we know of real downstream users depending on this pattern?
  • Is this aligned with the long-term direction of SFTTrainer, or does it blur responsibilities?

If we do decide to support it, I agree with your proposed steps above.

So my suggestion would be:

  • First, align on whether evaluation-only is an officially supported pattern.
  • If yes, we should support it properly and consistently (docs + tests + dataset prep parity).
  • If not, we should explicitly document that train_dataset is required and recommend using Trainer for pure evaluation workflows (a minimal sketch of that fallback follows below).

Happy to align either way, but I think we should make this a deliberate design decision rather than just fixing the crash.
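
For reference, the plain-Trainer fallback would look roughly like this (an illustrative sketch; note the data must already be tokenized, since Trainer does not prepare raw text the way SFTTrainer does):

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "Qwen/Qwen2.5-0.5B"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Unlike SFTTrainer, Trainer expects tokenized inputs up front.
raw = Dataset.from_dict({"text": ["Hello world", "Goodbye world"]})
eval_dataset = raw.map(lambda ex: tokenizer(ex["text"]), remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval-only", report_to="none"),
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
metrics = trainer.evaluate()  # works without any train_dataset
```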
