
Fix: check TrainerState file exists before loading during resume #39599

Open
Petecheco wants to merge 2 commits into huggingface:main from Petecheco:fix_trainer_state_load_when_file_missing


@Petecheco

What does this PR do?

When resuming training from a checkpoint, the Trainer attempts to load `trainer_state.json` to recover the train batch size. However, if the file does not exist, a `FileNotFoundError` is raised, causing resume to fail.

This patch adds a check using `os.path.isfile` before loading the state, consistent with other parts of the codebase. If the file is missing, a warning is logged and batch size recovery is skipped, allowing training to continue.

Fixes a potential crash and improves robustness when resuming from incomplete or custom checkpoint directories.
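
For reviewers, the guard described above could look roughly like the sketch below. It is a standalone illustration, not the actual diff: the helper name `recover_train_batch_size` is hypothetical, and it reads the JSON directly instead of going through `TrainerState`, on the assumption that `trainer_state.json` stores the value under a `train_batch_size` key.

```python
import json
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

# File name taken from the PR description.
TRAINER_STATE_NAME = "trainer_state.json"


def recover_train_batch_size(checkpoint_dir: str) -> Optional[int]:
    """Hypothetical helper: return the saved train batch size, or None if unavailable."""
    state_path = os.path.join(checkpoint_dir, TRAINER_STATE_NAME)

    # Check for the file first so a missing state file does not raise FileNotFoundError.
    if not os.path.isfile(state_path):
        logger.warning(
            "%s not found in %s; skipping train batch size recovery.",
            TRAINER_STATE_NAME,
            checkpoint_dir,
        )
        return None

    with open(state_path, encoding="utf-8") as f:
        state = json.load(f)

    # Assumed key name; returns None if the key is absent.
    return state.get("train_batch_size")
```

A caller would keep its configured batch size whenever the helper returns `None`, so resuming from a checkpoint directory without a state file continues instead of crashing.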

Related PR #27568

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

trainer: @zach-huggingface, @SunMarc and @qgallouedec
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Petecheco and others added 2 commits July 23, 2025 17:27