Handle loading non-existent checkpoints or corrupted checkpoints. #40790

Open
zhengchenyu wants to merge 4 commits into huggingface:main from zhengchenyu:load.checkpoint.not.raise

@zhengchenyu zhengchenyu commented Sep 10, 2025

What does this PR do?

  • 1 Handle loading non-existent checkpoints

Setting `resume_from_checkpoint` to true at the start of training currently results in an error, because no latest checkpoint can be found yet. Users therefore have to set it to false or null for the first run, and then switch it to true after an interruption in order to resume from the latest checkpoint. This manual switch should not be necessary. If the latest checkpoint cannot be found at the start of training, it is enough to print a message; there is no need to raise an exception.
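The intended behavior can be sketched as follows. This is a minimal, standard-library-only illustration, not the code in this PR: the helper names `find_latest_checkpoint` and `resolve_resume_checkpoint` are hypothetical, and the sketch assumes checkpoint folders are named `checkpoint-<step>` inside the output directory.

```python
import logging
import os
import re
import tempfile

logger = logging.getLogger(__name__)

# Assumption: checkpoint folders are named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


def resolve_resume_checkpoint(output_dir, resume_from_checkpoint):
    """Hypothetical helper: turn resume_from_checkpoint into a path or None.

    A missing checkpoint is logged instead of raising an exception.
    """
    if not resume_from_checkpoint:
        return None  # false or null: always start from scratch
    if isinstance(resume_from_checkpoint, str):
        return resume_from_checkpoint  # explicit path given by the user
    latest = find_latest_checkpoint(output_dir)
    if latest is None:
        logger.info("No checkpoint found in %s; starting from scratch.", output_dir)
    return latest


# Small demonstration on a temporary output directory.
with tempfile.TemporaryDirectory() as tmp:
    first_run = resolve_resume_checkpoint(tmp, True)  # nothing saved yet
    os.makedirs(os.path.join(tmp, "checkpoint-10"))
    os.makedirs(os.path.join(tmp, "checkpoint-200"))
    resumed = os.path.basename(resolve_resume_checkpoint(tmp, True))
```

With this shape, the same configuration (`resume_from_checkpoint` set to true) works for both the first run and every later restart.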

  • 2 Handle loading corrupted checkpoints

I've noticed that an exception raised while a checkpoint is being saved can interrupt the save and leave corrupted files behind. Loading such a checkpoint then fails. This PR adds a "latest" tag to the checkpoint to ensure that the checkpoint is complete.
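One common way to get this guarantee is a completion marker that is written only after every checkpoint file has been saved. The sketch below illustrates that general idea under stated assumptions and may differ from how this PR implements the tag: the marker file name `latest`, its location inside the checkpoint folder, and the helper names are all hypothetical.

```python
import os
import tempfile

MARKER_NAME = "latest"  # hypothetical marker file name


def save_checkpoint(checkpoint_dir, files):
    """Write every payload file first, then the marker as the final step."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for name, payload in files.items():
        with open(os.path.join(checkpoint_dir, name), "wb") as handle:
            handle.write(payload)
    # Reached only if all writes above succeeded, so the marker implies completeness.
    with open(os.path.join(checkpoint_dir, MARKER_NAME), "w") as handle:
        handle.write("ok")


def is_complete(checkpoint_dir):
    """A checkpoint counts as complete only if its marker file exists."""
    return os.path.isfile(os.path.join(checkpoint_dir, MARKER_NAME))


def latest_complete_checkpoint(output_dir):
    """Return the highest-step complete checkpoint, skipping interrupted ones."""
    candidates = []
    for name in os.listdir(output_dir):
        prefix, _, step = name.partition("-")
        path = os.path.join(output_dir, name)
        if prefix == "checkpoint" and step.isdigit() and is_complete(path):
            candidates.append((int(step), path))
    return max(candidates)[1] if candidates else None


# Demonstration: checkpoint-200 simulates a save that was interrupted midway.
with tempfile.TemporaryDirectory() as tmp:
    save_checkpoint(os.path.join(tmp, "checkpoint-100"), {"model.bin": b"weights"})
    broken = os.path.join(tmp, "checkpoint-200")
    os.makedirs(broken)
    with open(os.path.join(broken, "model.bin"), "wb") as handle:
        handle.write(b"partial")  # no marker was written for this one
    chosen = os.path.basename(latest_complete_checkpoint(tmp))
    broken_is_complete = is_complete(broken)
```

Because the interrupted folder never receives its marker, the loader falls back to the most recent checkpoint that finished saving instead of failing on partial files.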

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

cc @SunMarc

@zhengchenyu zhengchenyu changed the title Avoid exceptions when loading non-existent checkpoints Handle loading non-existent checkpoints or corrupted checkpoints. Sep 25, 2025