feat: add --resume support to training script #315

Merged
CalebisGross merged 1 commit into main from feat/training-resume on Mar 21, 2026

@CalebisGross (Collaborator)

Summary

  • Save full checkpoint state (model weights, optimizer state, global step, recent losses) at each save interval
  • Add --resume flag to resume training from any checkpoint
  • Auto-detect legacy (model-only) vs new-format checkpoints
  • Fast-forward data loader to resumed step position
  • Progress bar starts at resumed step
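The fast-forward behavior in the last two bullets can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `fast_forward` and `resume_loop` are hypothetical names, and the sketch assumes one batch is consumed per optimizer step (no gradient accumulation) and that the data order is deterministic across runs.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def fast_forward(batches: Iterable[T], resumed_step: int) -> Iterator[T]:
    """Skip the batches that were already consumed before the interruption.

    Assumption: step N means N batches have been consumed, so the next
    batch to train on is the one at index N.
    """
    if resumed_step < 0:
        raise ValueError("resumed_step must be non-negative")
    return islice(iter(batches), resumed_step, None)


def resume_loop(
    batches: Iterable[T], resumed_step: int, total_steps: int
) -> List[Tuple[int, T]]:
    """Iterate from the resumed position and number steps from there.

    Returns (step, batch) pairs instead of training, to keep the sketch
    self-contained. A real loop would run forward/backward here.
    """
    seen = []
    remaining = fast_forward(batches, resumed_step)
    # Step numbering continues from the checkpoint rather than from 1.
    for step, batch in enumerate(remaining, start=resumed_step + 1):
        if step > total_steps:
            break
        seen.append((step, batch))
    return seen
```

If the progress bar is tqdm (an assumption, the PR does not say), passing `initial=resumed_step` together with `total=total_steps` makes the bar start at the resumed step.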

This is needed because the 100M pretraining run (~59 hours) was accidentally interrupted at step 20K. The existing step_20000.pt checkpoint contains model weights but no optimizer state, so resuming from it will work, but Adam's moment estimates restart from zero (a brief loss spike is expected, recovering within ~200 steps).
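One way the legacy vs. new-format detection could work is sketched below. The key names (`model`, `optimizer`, `global_step`, `recent_losses`), the helper names, and the `fallback_step` parameter are all assumptions for illustration; the actual checkpoint layout in this PR may differ. The sketch operates on already-loaded dictionaries, which in a PyTorch script would typically come from `torch.load` and be written with `torch.save`.

```python
from typing import Any, Dict, Mapping, Sequence

# Hypothetical key names; a checkpoint is treated as "new format"
# only if all of these are present at the top level.
NEW_FORMAT_KEYS = ("model", "optimizer", "global_step")


def build_checkpoint(
    model_state: Mapping[str, Any],
    optimizer_state: Mapping[str, Any],
    global_step: int,
    recent_losses: Sequence[float],
) -> Dict[str, Any]:
    """Bundle everything needed to resume into one dictionary."""
    return {
        "model": dict(model_state),
        "optimizer": dict(optimizer_state),
        "global_step": global_step,
        "recent_losses": list(recent_losses),
    }


def is_new_format(ckpt: Mapping[str, Any]) -> bool:
    """True if the checkpoint carries optimizer state and a step counter."""
    return all(key in ckpt for key in NEW_FORMAT_KEYS)


def unpack_checkpoint(
    ckpt: Mapping[str, Any], fallback_step: int = 0
) -> Dict[str, Any]:
    """Normalize either format into the same resume-state shape.

    For a legacy checkpoint the whole mapping is assumed to be the model
    state dict, the optimizer is left as None (it restarts cold), and
    the step must be supplied by the caller via fallback_step.
    """
    if is_new_format(ckpt):
        return {
            "model": ckpt["model"],
            "optimizer": ckpt["optimizer"],
            "global_step": int(ckpt["global_step"]),
            "recent_losses": list(ckpt.get("recent_losses", [])),
        }
    return {
        "model": dict(ckpt),
        "optimizer": None,
        "global_step": fallback_step,
        "recent_losses": [],
    }
```

With this shape, the caller only restores optimizer state when `optimizer` is not `None`, which matches the cold-restart behavior described above for step_20000.pt.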

Test plan

  • --help shows --resume flag
  • make build and make check pass
  • Resume from legacy checkpoint (step_20000.pt) — verify training continues
  • Verify new checkpoints include optimizer state
  • Verify resume from new-format checkpoint preserves optimizer state

🤖 Generated with Claude Code

Save full checkpoint state (model + optimizer + step + losses) at each
save interval. Support resuming from both new-format and legacy
(model-only) checkpoints with automatic detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CalebisGross merged commit 71c72b9 into main on Mar 21, 2026
@CalebisGross deleted the feat/training-resume branch on March 21, 2026 05:48