Minimal Checkpointing #88

Merged: anagainaru merged 2 commits into main from checkpoints on Mar 4, 2026

@rz4 (Collaborator) commented Mar 4, 2026

Summary

Adds model.toml options for saving model state checkpoints.
Adds a save_ckpt method to the model harness.
The driver saves a checkpoint at the end of continual learning.

Motivation & Context

Teams need the model weights after drift adaptation.

Approach

In a model.toml, under [model]:
  • Set max_ckpts, the number of checkpoints to keep. The oldest checkpoints are removed based on file age.
  • Set ckpts_path, the directory where checkpoints are stored.

The model harness now has a property that reports whether checkpointing is enabled.
Checkpointing is disabled if max_ckpts is set to 0 or if ckpts_path is unspecified.
The default config disables checkpointing.
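
The enable/disable rule above can be sketched as follows. This is only an illustration, not the harness code from this PR: the class name CheckpointConfig, the helper from_model_table, and the property name checkpointing_enabled are hypothetical, and the sample values in the comment are made up. Treating a negative max_ckpts or an empty ckpts_path as "disabled" is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Hypothetical [model] table, as it might appear in a model.toml:
#
#   [model]
#   max_ckpts = 3
#   ckpts_path = "output/mnist"


@dataclass
class CheckpointConfig:
    """Illustrative holder for the two settings named in this PR."""

    max_ckpts: int = 0                # 0 disables checkpointing (default)
    ckpts_path: Optional[str] = None  # unspecified disables checkpointing

    @classmethod
    def from_model_table(cls, table: Mapping[str, Any]) -> "CheckpointConfig":
        # Missing keys fall back to the disabled defaults.
        return cls(
            max_ckpts=int(table.get("max_ckpts", 0)),
            ckpts_path=table.get("ckpts_path"),
        )

    @property
    def checkpointing_enabled(self) -> bool:
        # Enabled only when a positive count and a non-empty path are both set
        # (handling of negative counts and empty strings is an assumption).
        return self.max_ckpts > 0 and bool(self.ckpts_path)
```

With this sketch, CheckpointConfig() reports checkpointing as disabled, matching the default described above, while a table that sets both keys enables it.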

The model harness now has a save_ckpt method, which
ensures the checkpoint directory exists, saves the current model, and
removes the oldest checkpoints so that at most max_ckpts remain.
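
A minimal sketch of the three steps above (create the directory, save, prune by file age), using only the standard library. The function name save_and_prune and the write_fn callback are hypothetical stand-ins, since the actual save_ckpt signature is not shown here; in the real harness the write step would presumably serialize the model rather than call a callback. Using file modification time as "file age" is an assumption.

```python
import os
from pathlib import Path
from typing import Callable


def save_and_prune(
    ckpts_path: str,
    max_ckpts: int,
    event_id: int,
    write_fn: Callable[[Path], None],
) -> Path:
    """Save one checkpoint, then delete the oldest files beyond max_ckpts."""
    ckpt_dir = Path(ckpts_path)
    # Step 1: make sure the checkpoint directory exists.
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    # Step 2: write the checkpoint using the naming scheme from this PR.
    target = ckpt_dir / f"drift_adaptation_{event_id}.pt"
    write_fn(target)  # hypothetical stand-in for the real model save

    # Step 3: sort newest first by modification time and drop the rest.
    ckpts = sorted(
        ckpt_dir.glob("drift_adaptation_*.pt"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for stale in ckpts[max_ckpts:]:
        os.remove(stale)
    return target
```

Pruning after the write means the newest checkpoint is always among the survivors, at the cost of briefly holding max_ckpts + 1 files on disk.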

Checkpoint files contain the model state graph and are saved to {ckpts_path}/drift_adaptation_{event_id}.pt.
Choose max_ckpts based on the available disk space.

A future PR should consider more comprehensive checkpointing:
a directory containing the config (for reproducing the experiment), the driver state (dataloaders, drift detector),
and the model state. Such checkpoints should restore training at the point where the driver decided to
save, for example when resuming a training run across sequential jobs.
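
One possible shape for such a checkpoint directory is sketched below, purely as an illustration of the idea and not as a design this PR commits to. The function names, the step_{n} directory naming, the file names, and the use of JSON for the config and driver state are all assumptions; the model state is treated as opaque bytes.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def write_full_checkpoint(
    root: str,
    step: int,
    config: Dict[str, Any],
    driver_state: Dict[str, Any],
    model_bytes: bytes,
) -> Path:
    """Write config, driver state, and model state into one directory."""
    ckpt_dir = Path(root) / f"step_{step}"  # hypothetical naming scheme
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    (ckpt_dir / "config.json").write_text(json.dumps(config, indent=2))
    (ckpt_dir / "driver_state.json").write_text(json.dumps(driver_state, indent=2))
    (ckpt_dir / "model.pt").write_bytes(model_bytes)
    return ckpt_dir


def read_full_checkpoint(ckpt_dir: Path) -> Tuple[Dict[str, Any], Dict[str, Any], bytes]:
    """Load the three parts back so a later job can resume from them."""
    config = json.loads((ckpt_dir / "config.json").read_text())
    driver_state = json.loads((ckpt_dir / "driver_state.json").read_text())
    model_bytes = (ckpt_dir / "model.pt").read_bytes()
    return config, driver_state, model_bytes
```

Keeping the three parts as separate files would let a later job inspect the config or driver state without loading the model weights.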

Screenshots / Logs (optional)

API / CLI Changes

❯ poetry run python -m src.main --config examples/mnist/mnist.toml
Loaded pretrained MNIST model from output/mnist/drift_adaptation_4.pt
INFO:0 | 10:37:37 | step=0 | continuous_monitor | ==== ContinuousMonitor initialized ====
INFO:0 | 10:37:37 | step=0 | continuous_monitor | ==== Starting Continuous Monitoring ====
Mutating the picture further using an angle of 0.7831710577011108 and a scale of 0.9978083670139313
Mutating the picture further using an angle of 7.27046012878418 and a scale of 0.9618054032325745
Mutating the picture further using an angle of 9.169278740882874 and a scale of 1.138285905122757
Mutating the picture further using an angle of 5.671741366386414 and a scale of 0.9080807566642761
Mutating the picture further using an angle of 5.156046152114868 and a scale of 1.2487933039665222
Processing batches: 1it [00:04,  4.47s/it]
INFO:0 | 10:38:10 | step=702 | continuous_monitor | ==== DRIFT DETECTED (Event #1)! ====
INFO:0 | 10:38:10 | step=702 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:38:21 | step=702 | continuous_trainer | ==== Continual Learning ====
CL Updates (drift_event_id=1): 100%|██████████████████████████████████████████████████████████████| 600/600 [00:23<00:00, 25.39it/s]
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            30.43 μs        0 FLOP/s
infer           1.62 GFLOPs        7.19 ms         224.95 GFLOP/s
optimizer       31.92 MFLOPs       3.21 ms         9.95 GFLOP/s
update_fwd_bwd  4.78 GFLOPs        12.13 ms        393.81 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.43 GFLOPs        22.56 ms        284.89 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:38:55 | step=1304 | continuous_monitor | * Checkpoint saved to: output/mnist/drift_adaptation_1.pt
INFO:0 | 10:38:55 | step=1304 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:38:55 | step=1304 | continuous_monitor | ==== RESUMING MONITORING! ====

Breaking Changes

  • None

Performance (optional)

Security & Privacy

  • No secrets committed
  • Input validation added where needed

Dependencies

Testing Plan

  • Unit tests
  • Integration tests
  • e2e / smoke test
  • Manual steps: python -m app --help

Documentation

  • Docstrings updated
  • User docs / README updated
  • CHANGELOG entry

Checklist

  • Code formatted (Ruff) → ruff format --check
  • Lint passes (Ruff) → ruff check .
  • Types pass (mypy/pyright) → mypy src
  • Tests pass (pytest) → pytest -q
  • Backward compatibility considered
  • Adequate comments for tricky parts
  • CI green

Risk & Rollback Plan

Probably not needed at this early stage.

Notes for Reviewers

@anagainaru (Collaborator) left a comment

Looks good

@anagainaru merged commit b9a8282 into main on Mar 4, 2026
3 checks passed
@anagainaru deleted the checkpoints branch on March 4, 2026 at 20:05