Running MNIST example on Perlmutter by rz4 · Pull Request #78 · AI-ModCon/BaseSIM_APEIRON

rz4 · 2026-02-17T19:52:07Z

Summary

Runs the framework on Perlmutter. The setup guide is documented in the deployment README.

Motivation & Context

Shortest way to start running experiments on Perlmutter.

Approach

cd to the $SCRATCH directory.
Clone the repo, create a Python venv, and install Poetry.
Install dependencies with Poetry.
Pull the MNIST data.
Submit the MNIST example job.
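
The venv and Poetry steps above can be sketched as a dry-run plan. This is only an illustration: the helper name `setup_plan`, the `.venv` location, and the exact commands are assumptions, not the contents of the repository's install script. The clone, data-pull, and job-submission commands are left out because the PR description does not spell them out.

```python
import shlex
from pathlib import Path


def setup_plan(repo_dir, venv_dir=".venv"):
    """Return the assumed setup commands as argv lists, in execution order.

    Hypothetical helper: the real steps live in the deployment README.
    """
    venv = Path(repo_dir) / venv_dir
    pip = str(venv / "bin" / "pip")
    poetry = str(venv / "bin" / "poetry")
    return [
        ["python3", "-m", "venv", str(venv)],  # create the virtual environment
        [pip, "install", "poetry"],            # install Poetry into that venv
        [poetry, "install"],                   # resolve deps (run from repo root)
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in setup_plan("BaseSIM_APEIRON"):
        print(shlex.join(argv))
```

Printing the plan rather than running it keeps the sketch safe to try on a login node. Each argv list could be passed to `subprocess.run(argv, check=True, cwd=repo_dir)` once the commands have been checked against the README.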

Screenshots / Logs (optional)

$ cat output/mnist_example.o49041554

==============================================
MNIST Example
==============================================
Date: Tue 17 Feb 2026 10:07:34 AM PST
Hostname: nid003964
==============================================
Loaded pretrained MNIST model from examples/mnist/mnist.pth
INFO:0 | 10:08:11 | step=0 | continuous_monitor | ==== ContinuousMonitor initialized ====
INFO:0 | 10:08:11 | step=0 | continuous_monitor | ==== Starting Continuous Monitoring ====
Mutating the picture further using an angle of 0.7831710577011108 and a scale of 0.9978083670139313
Mutating the picture further using an angle of 7.27046012878418 and a scale of 0.9618054032325745
Mutating the picture further using an angle of 9.169278740882874 and a scale of 1.138285905122757
Mutating the picture further using an angle of 5.671741366386414 and a scale of 0.9080807566642761
Mutating the picture further using an angle of 5.156046152114868 and a scale of 1.2487933039665222
INFO:0 | 10:08:19 | step=702 | continuous_monitor | ==== DRIFT DETECTED (Event #2)! ====
INFO:0 | 10:08:19 | step=702 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:08:23 | step=702 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            49.01 μs        0 FLOP/s
infer           1.62 GFLOPs        3.15 ms         513.16 GFLOP/s
optimizer       2.28 MFLOPs        4.00 ms         570.71 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.90 ms        441.62 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.43 GFLOPs        18.10 ms        355.53 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:08:44 | step=1304 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:08:44 | step=1304 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 1.4816296100616455 and a scale of 0.8649424314498901
Mutating the picture further using an angle of 2.0256882905960083 and a scale of 1.034351795911789
Mutating the picture further using an angle of 1.0241323709487915 and a scale of 1.1528273522853851
Mutating the picture further using an angle of 3.903886079788208 and a scale of 0.9972822070121765
Mutating the picture further using an angle of 5.877098441123962 and a scale of 0.7850101292133331
INFO:0 | 10:08:57 | step=2008 | continuous_monitor | ==== DRIFT DETECTED (Event #3)! ====
INFO:0 | 10:08:57 | step=2008 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:09:04 | step=2008 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            50.98 μs        0 FLOP/s
infer           1.62 GFLOPs        3.22 ms         502.33 GFLOP/s
optimizer       2.28 MFLOPs        3.97 ms         573.70 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.44 ms        461.42 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.44 GFLOPs        17.68 ms        363.95 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:09:37 | step=2610 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:09:37 | step=2610 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 8.823065757751465 and a scale of 0.8285264372825623
Mutating the picture further using an angle of 4.25514817237854 and a scale of 1.124777913093567
Mutating the picture further using an angle of 2.6921188831329346 and a scale of 1.1192893385887146
Mutating the picture further using an angle of 9.351488947868347 and a scale of 1.011735051870346
Mutating the picture further using an angle of 7.003557085990906 and a scale of 0.9170688390731812
INFO:0 | 10:09:55 | step=3314 | continuous_monitor | ==== DRIFT DETECTED (Event #4)! ====
INFO:0 | 10:09:55 | step=3314 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:10:05 | step=3314 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            52.98 μs        0 FLOP/s
infer           1.62 GFLOPs        3.21 ms         504.52 GFLOP/s
optimizer       2.28 MFLOPs        4.01 ms         569.02 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.67 ms        451.40 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.44 GFLOPs        17.93 ms        358.83 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:10:51 | step=3916 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:10:51 | step=3916 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 8.485900163650513 and a scale of 0.7678219676017761
Mutating the picture further using an angle of 9.587537050247192 and a scale of 0.8052024245262146
Mutating the picture further using an angle of 0.629265308380127 and a scale of 1.1789422631263733
INFO:0 | 10:11:03 | step=4268 | continuous_monitor | ==== DRIFT DETECTED (Event #5)! ====
INFO:0 | 10:11:03 | step=4268 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:11:14 | step=4268 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            52.83 μs        0 FLOP/s
infer           1.62 GFLOPs        3.22 ms         501.82 GFLOP/s
optimizer       2.28 MFLOPs        4.01 ms         568.12 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.67 ms        451.44 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.44 GFLOPs        17.96 ms        358.38 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:12:08 | step=4870 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:12:08 | step=4870 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 8.28125774860382 and a scale of 1.1350123882293701
Mutating the picture further using an angle of 9.250980615615845 and a scale of 1.1172855496406555
Mutating the picture further using an angle of 4.6211546659469604 and a scale of 0.9335004687309265
INFO:0 | 10:12:21 | step=5215 | continuous_monitor | ==== Continuous Monitoring Complete ====
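
To summarise a log like the one above, a small parser can pair each drift event with the step at which continual learning completes, and recompute throughput as FLOPs divided by time. This is a minimal sketch: the function names and regular expressions are assumptions based only on the line formats visible in this log, not part of the framework.

```python
import re

# Patterns inferred from the log lines shown in this PR (an assumption).
DRIFT = re.compile(
    r"step=(\d+) \| continuous_monitor \| ==== DRIFT DETECTED \(Event #(\d+)\)! ===="
)
DONE = re.compile(
    r"step=(\d+) \| continuous_monitor \| <- Continual learning complete\."
)


def learning_windows(log_text):
    """Return (event_number, drift_step, completion_step) tuples."""
    windows = []
    pending = None  # (event_number, drift_step) awaiting its completion line
    for line in log_text.splitlines():
        match = DRIFT.search(line)
        if match:
            pending = (int(match.group(2)), int(match.group(1)))
            continue
        match = DONE.search(line)
        if match and pending is not None:
            windows.append((pending[0], pending[1], int(match.group(1))))
            pending = None
    return windows


def throughput(flops, seconds):
    """Throughput in FLOP/s from a FLOP count and a duration in seconds."""
    return flops / seconds


SAMPLE = """\
INFO:0 | 10:08:19 | step=702 | continuous_monitor | ==== DRIFT DETECTED (Event #2)! ====
INFO:0 | 10:08:19 | step=702 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:08:44 | step=1304 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:08:57 | step=2008 | continuous_monitor | ==== DRIFT DETECTED (Event #3)! ====
INFO:0 | 10:09:37 | step=2610 | continuous_monitor | <- Continual learning complete.
"""

if __name__ == "__main__":
    for event, start, end in learning_windows(SAMPLE):
        print(f"event #{event}: steps {start} -> {end} ({end - start} steps)")
    # 'infer' row of the first table: 1.62 GFLOPs in 3.15 ms
    print(f"{throughput(1.62e9, 3.15e-3) / 1e9:.2f} GFLOP/s")
```

Applied to the full log, all four continual learning windows span 602 steps (702 to 1304, 2008 to 2610, 3314 to 3916, 4268 to 4870). Throughput recomputed from the rounded table entries lands within about 1% of the logged values. For example, the recomputed `infer` figure is roughly 514 GFLOP/s against the logged 513.16 GFLOP/s, and the gap is consistent with the table showing rounded FLOPs and times.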

API / CLI Changes

N/A

Breaking Changes

None

Performance (optional)

N/A

Security & Privacy

N/A

Dependencies

N/A

Testing Plan

Described in the deployment README.

Documentation

Added README under src/deployment/perlmutter/.

Checklist

Code formatted (Ruff) → ruff format --check
Lint passes (Ruff) → ruff check .
Types pass (mypy/pyright) → mypy src
Tests pass (pytest) → pytest -q

Backward compatibility considered
Adequate comments for tricky parts
CI green

Risk & Rollback Plan

Probably not needed in the beginning

Notes for Reviewers

ScSteffen · 2026-02-18T15:18:32Z

Looking good.
I'd leave this PR open for now, for two reasons:

I'm about to add torch DDP support, and we should test whether the Perlmutter scripts work fine with DDP too.
Fix docker CI build size #79 fixed the CI. Can you please pull and see if the CI turns green here?


rz4 · 2026-03-09T17:13:11Z

I have added a link in the main README to the instructions for setting up the codebase and environment on Perlmutter and for submitting jobs. The document is a README under deployment. Does it make sense to move the deployment documentation to a folder under docs/?

The deployment on Perlmutter has been streamlined. In their scratch directory, the user clones the repo and enters its root directory. The user then runs source src/deployment/perlmutter/instal_venv.sh to create a Python virtual environment, install Poetry, and then use Poetry to resolve the remaining dependencies.

Poetry installs the project dependencies within the virtual environment. The virtual environment can then be loaded near the start of the SLURM script with source .venv/bin/activate.
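
As an illustration of where the activation line sits, the sketch below renders a batch script as text. Everything except `source .venv/bin/activate` is an assumption: the helper name, the job name, the resource values, and the run command are placeholders, and site-specific directives (account, constraint, QOS) are omitted and would need to come from the deployment README. The `%x` and `%j` filename patterns are standard `sbatch` placeholders for the job name and job ID.

```python
def render_job_script(run_command, job_name="mnist_example", venv_dir=".venv"):
    """Render a SLURM batch script that activates the venv before running.

    Hypothetical helper for illustration; not part of the repository.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --output=output/%x.o%j",  # e.g. output/<job name>.o<job id>
        "#SBATCH --nodes=1",               # placeholder resource request
        "#SBATCH --time=00:10:00",         # placeholder time limit
        "",
        "# Load the Poetry-managed virtual environment near the start of the job.",
        f"source {venv_dir}/bin/activate",
        "",
        run_command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The run command is a made-up placeholder, not the project's entry point.
    print(render_job_script("python run_mnist_example.py"), end="")
```

Keeping the activation ahead of the run command means the job picks up the interpreter and packages that Poetry installed, rather than whatever Python is on the compute node's default path.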

rfgeek

Ran the MNIST example on Perlmutter and it worked.

rz4 added 3 commits February 17, 2026 10:18

Perlmutter slurm script for mnist.

8247394

Merge remote-tracking branch 'origin/main' into deploy_perlmutter

2bde340

Removed commented lines.

57e4a91

rz4 mentioned this pull request Feb 17, 2026

Create scripts to run things on NERSC/Frontier/Aurora #49

Open

rz4 mentioned this pull request Feb 18, 2026

Running MNIST example on Frontier #67

Merged

10 tasks

Update README.md

f244ed2

Added missing poetry install command.

ScSteffen added the Deployment label (Issues and PRs related to the deployment of the model back in the system) Mar 4, 2026

rz4 and others added 4 commits March 5, 2026 10:15

Update README.md

381f4e6

standardized .venv as enviroment for jobs

02edee6

Update README.md

bf177c9

Update README.md with deployment guide link.

b5e8bb9

rfgeek approved these changes Mar 9, 2026


Merge branch 'main' into deploy_perlmutter

62ca316

anagainaru approved these changes Mar 11, 2026


anagainaru merged commit cbd3c56 into main Mar 11, 2026
3 checks passed

anagainaru deleted the deploy_perlmutter branch March 11, 2026 15:57

anagainaru merged 9 commits into main from deploy_perlmutter

4 participants
