
Save scaler state dict when checkpointing#11663

Merged
LysandreJik merged 1 commit into master from checkpoint_scaler
May 10, 2021

Conversation

@sgugger (Collaborator) commented May 10, 2021

What does this PR do?

One last piece was missing for resuming from a checkpoint with exactly the same results as an uninterrupted training: the gradient scaler state when using mixed precision with AMP in PyTorch. This PR addresses that.

Fixes #11323
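The fix boils down to treating the `GradScaler` like the model and optimizer: serialize its `state_dict()` at checkpoint time and restore it on resume, so the dynamic loss scale does not reset to its default. A minimal sketch of that pattern follows; the helper names `save_checkpoint`/`load_checkpoint` are illustrative, not the actual `Trainer` API, and the `scaler.pt` filename is an assumption about how the state is stored.

```python
# Sketch (assumed names, not the Trainer internals): persist the AMP
# GradScaler state alongside the model and optimizer when checkpointing.
import os
import torch

SCALER_NAME = "scaler.pt"  # assumed filename for the scaler state


def save_checkpoint(output_dir, model, optimizer, scaler=None):
    os.makedirs(output_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(output_dir, "pytorch_model.bin"))
    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    if scaler is not None:
        # Without this, the dynamic loss scale restarts from its default on
        # resume, so a resumed run diverges slightly from an uninterrupted one.
        torch.save(scaler.state_dict(), os.path.join(output_dir, SCALER_NAME))


def load_checkpoint(checkpoint_dir, model, optimizer, scaler=None):
    model.load_state_dict(
        torch.load(os.path.join(checkpoint_dir, "pytorch_model.bin"))
    )
    optimizer.load_state_dict(
        torch.load(os.path.join(checkpoint_dir, "optimizer.pt"))
    )
    scaler_path = os.path.join(checkpoint_dir, SCALER_NAME)
    if scaler is not None and os.path.isfile(scaler_path):
        scaler.load_state_dict(torch.load(scaler_path))
```

`GradScaler.state_dict()` and `load_state_dict()` are the standard PyTorch API for this; the scaler only carries a handful of scalars (current scale, growth tracker), but they determine the exact gradient values after unscaling, which is why omitting them changes results.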

sgugger requested a review from LysandreJik, May 10, 2021 14:44
@LysandreJik (Member) left a comment

Great! LGTM, thanks @sgugger!

LysandreJik merged commit 05a9306 into master May 10, 2021
LysandreJik deleted the checkpoint_scaler branch May 10, 2021 14:58
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021


Development

Successfully merging this pull request may close these issues.

Bug in trainer: substantially different results from restarting from a checkpoint and without

2 participants