Minimal Checkpointing by rz4 · Pull Request #88 · AI-ModCon/BaseSIM_APEIRON

rz4 · 2026-03-04T17:19:20Z

Summary

Adds model.toml configs for saving model state checkpoints.
Adds save_ckpt to model harness.
Driver saves ckpts at the end of continual learning.

Motivation & Context

Teams need the model weights after drift adaptation.

Approach

In a model.toml, under [model]:
set max_ckpts, the number of checkpoints to keep. When the limit is exceeded, the oldest files (by file age) are removed.
set ckpts_path, the directory where checkpoints will be stored.

The model harness now has a property to check if checkpointing is enabled.
Checkpointing is disabled if max_ckpts is set to 0 or if ckpts_path is unspecified.
The default config disables checkpointing.
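
The enabled check described above could look roughly like the following. HarnessConfig and checkpointing_enabled are hypothetical names chosen for illustration (the PR does not state the property's name), and treating negative counts or an empty path as "disabled" is an assumption:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarnessConfig:
    """Hypothetical stand-in for the [model] settings; defaults disable checkpointing."""

    max_ckpts: int = 0
    ckpts_path: Optional[str] = None

    @property
    def checkpointing_enabled(self) -> bool:
        # Enabled only when a positive count and a non-empty directory are both set.
        return self.max_ckpts > 0 and bool(self.ckpts_path)


print(HarnessConfig().checkpointing_enabled)
print(HarnessConfig(max_ckpts=3, ckpts_path="output/mnist").checkpointing_enabled)
```

Keeping the check in one property means the driver can guard the save call with a single condition instead of repeating both tests.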

The model harness now has a save_ckpt method which
ensures the checkpoint directory exists, saves the current model, and
removes the oldest checkpoints so that at most max_ckpts remain.
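
The three steps can be sketched as a standalone function. This is not the harness code: the signature is hypothetical, pickle stands in for whatever serializer the harness actually uses, and "oldest" is taken to mean oldest modification time:

```python
import pickle
from pathlib import Path


def save_ckpt(state: dict, ckpts_path: str, event_id: int, max_ckpts: int) -> Path:
    """Sketch: write one checkpoint, then prune the oldest files by age."""
    ckpt_dir = Path(ckpts_path)
    ckpt_dir.mkdir(parents=True, exist_ok=True)  # ensure the directory exists

    target = ckpt_dir / f"drift_adaptation_{event_id}.pt"
    with target.open("wb") as fh:
        pickle.dump(state, fh)  # stand-in for the real model serializer

    # Sort existing checkpoints oldest first by modification time,
    # then delete everything beyond the max_ckpts newest files.
    ckpts = sorted(
        ckpt_dir.glob("drift_adaptation_*.pt"),
        key=lambda p: p.stat().st_mtime,
    )
    excess = max(len(ckpts) - max_ckpts, 0)
    for stale in ckpts[:excess]:
        stale.unlink()
    return target
```

Pruning after the write means the newest checkpoint is always on disk before any older one is deleted, at the cost of briefly holding max_ckpts + 1 files.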

Checkpoint files contain the model state graph and are saved to {ckpts_path}/drift_adaptation_{event_id}.pt
How many checkpoints to keep (max_ckpts) should be chosen based on available disk space.
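
One way to pick a value from a disk budget, as a rough sketch. The helper name, the 50 MiB checkpoint size, and the 10% budget are all illustrative assumptions and not part of this PR:

```python
import shutil


def max_ckpts_for_budget(ckpt_bytes: int, budget_bytes: int) -> int:
    """Return how many checkpoints of ckpt_bytes each fit in budget_bytes."""
    if ckpt_bytes <= 0:
        raise ValueError("ckpt_bytes must be positive")
    return max(budget_bytes // ckpt_bytes, 0)


# Example: allow checkpoints to use at most 10% of the free space on the
# current volume, assuming each checkpoint is about 50 MiB.
free_bytes = shutil.disk_usage(".").free
print(max_ckpts_for_budget(ckpt_bytes=50 * 1024**2, budget_bytes=free_bytes // 10))
```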

A future PR should consider more comprehensive checkpointing:
a directory with the config (for reproducing the experiment), driver states (dataloaders, drift detector),
and model state. Checkpoints should allow training to be restored at the point where the driver decided to
save a checkpoint, for example when resuming training runs across sequential jobs.

Screenshots / Logs (optional)

API / CLI Changes

❯ poetry run python -m src.main --config examples/mnist/mnist.toml
Loaded pretrained MNIST model from output/mnist/drift_adaptation_4.pt
INFO:0 | 10:37:37 | step=0 | continuous_monitor | ==== ContinuousMonitor initialized ====
INFO:0 | 10:37:37 | step=0 | continuous_monitor | ==== Starting Continuous Monitoring ====
Mutating the picture further using an angle of 0.7831710577011108 and a scale of 0.9978083670139313
Mutating the picture further using an angle of 7.27046012878418 and a scale of 0.9618054032325745
Mutating the picture further using an angle of 9.169278740882874 and a scale of 1.138285905122757
Mutating the picture further using an angle of 5.671741366386414 and a scale of 0.9080807566642761
Mutating the picture further using an angle of 5.156046152114868 and a scale of 1.2487933039665222
Processing batches: 1it [00:04,  4.47s/it]
INFO:0 | 10:38:10 | step=702 | continuous_monitor | ==== DRIFT DETECTED (Event #1)! ====
INFO:0 | 10:38:10 | step=702 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:38:21 | step=702 | continuous_trainer | ==== Continual Learning ====
CL Updates (drift_event_id=1): 100%|██████████████████████████████████████████████████████████████| 600/600 [00:23<00:00, 25.39it/s]
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            30.43 μs        0 FLOP/s
infer           1.62 GFLOPs        7.19 ms         224.95 GFLOP/s
optimizer       31.92 MFLOPs       3.21 ms         9.95 GFLOP/s
update_fwd_bwd  4.78 GFLOPs        12.13 ms        393.81 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.43 GFLOPs        22.56 ms        284.89 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:38:55 | step=1304 | continuous_monitor | * Checkpoint saved to: output/mnist/drift_adaptation_1.pt
INFO:0 | 10:38:55 | step=1304 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:38:55 | step=1304 | continuous_monitor | ==== RESUMING MONITORING! ====

Breaking Changes

None

Performance (optional)

Security & Privacy

No secrets committed
Input validation added where needed

Dependencies

Testing Plan

Unit tests
Integration tests
e2e / smoke test
Manual steps: python -m app --help

Documentation

Docstrings updated
User docs / README updated
CHANGELOG entry

Checklist

Code formatted (Ruff) → ruff format --check
Lint passes (Ruff) → ruff check .
Types pass (mypy/pyright) → mypy src
Tests pass (pytest) → pytest -q

Backward compatibility considered
Adequate comments for tricky parts
CI green

Risk & Rollback Plan

Probably not needed in the beginning

Notes for Reviewers

anagainaru

Looks good

Rafael Zamora-Resendiz (AMCRD) added 2 commits March 4, 2026 10:47

checkpointing model state at end of continual learning app.

61db0b2

Passes ruff and mypy.

ed749cf

anagainaru approved these changes Mar 4, 2026

anagainaru mentioned this pull request Mar 4, 2026

Set frequency of checkpointing the model #90

Open

anagainaru merged commit b9a8282 into main Mar 4, 2026
3 checks passed

anagainaru deleted the checkpoints branch March 4, 2026 20:05

anagainaru mentioned this pull request Mar 4, 2026

Add SLAC FEL model files and preliminary workflow #83

Open

14 tasks
