Running MNIST example on Perlmutter #78

Merged
anagainaru merged 9 commits into main from deploy_perlmutter
Mar 11, 2026

@rz4 rz4 commented Feb 17, 2026

Summary

Runs the framework on Perlmutter. The setup guide is documented in the deployment README.

Motivation & Context

Shortest way to start running experiments on Perlmutter.

Approach

  1. cd to the $SCRATCH directory.
  2. Clone the repo, create a Python venv, and install Poetry.
  3. Install the dependencies with Poetry.
  4. Pull the MNIST data.
  5. Submit the MNIST example job.

Screenshots / Logs (optional)

$ cat output/mnist_example.o49041554

==============================================
MNIST Example
==============================================
Date: Tue 17 Feb 2026 10:07:34 AM PST
Hostname: nid003964
==============================================
Loaded pretrained MNIST model from examples/mnist/mnist.pth
INFO:0 | 10:08:11 | step=0 | continuous_monitor | ==== ContinuousMonitor initialized ====
INFO:0 | 10:08:11 | step=0 | continuous_monitor | ==== Starting Continuous Monitoring ====
Mutating the picture further using an angle of 0.7831710577011108 and a scale of 0.9978083670139313
Mutating the picture further using an angle of 7.27046012878418 and a scale of 0.9618054032325745
Mutating the picture further using an angle of 9.169278740882874 and a scale of 1.138285905122757
Mutating the picture further using an angle of 5.671741366386414 and a scale of 0.9080807566642761
Mutating the picture further using an angle of 5.156046152114868 and a scale of 1.2487933039665222
INFO:0 | 10:08:19 | step=702 | continuous_monitor | ==== DRIFT DETECTED (Event #2)! ====
INFO:0 | 10:08:19 | step=702 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:08:23 | step=702 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            49.01 μs        0 FLOP/s
infer           1.62 GFLOPs        3.15 ms         513.16 GFLOP/s
optimizer       2.28 MFLOPs        4.00 ms         570.71 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.90 ms        441.62 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.43 GFLOPs        18.10 ms        355.53 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:08:44 | step=1304 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:08:44 | step=1304 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 1.4816296100616455 and a scale of 0.8649424314498901
Mutating the picture further using an angle of 2.0256882905960083 and a scale of 1.034351795911789
Mutating the picture further using an angle of 1.0241323709487915 and a scale of 1.1528273522853851
Mutating the picture further using an angle of 3.903886079788208 and a scale of 0.9972822070121765
Mutating the picture further using an angle of 5.877098441123962 and a scale of 0.7850101292133331
INFO:0 | 10:08:57 | step=2008 | continuous_monitor | ==== DRIFT DETECTED (Event #3)! ====
INFO:0 | 10:08:57 | step=2008 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:09:04 | step=2008 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            50.98 μs        0 FLOP/s
infer           1.62 GFLOPs        3.22 ms         502.33 GFLOP/s
optimizer       2.28 MFLOPs        3.97 ms         573.70 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.44 ms        461.42 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.44 GFLOPs        17.68 ms        363.95 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:09:37 | step=2610 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:09:37 | step=2610 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 8.823065757751465 and a scale of 0.8285264372825623
Mutating the picture further using an angle of 4.25514817237854 and a scale of 1.124777913093567
Mutating the picture further using an angle of 2.6921188831329346 and a scale of 1.1192893385887146
Mutating the picture further using an angle of 9.351488947868347 and a scale of 1.011735051870346
Mutating the picture further using an angle of 7.003557085990906 and a scale of 0.9170688390731812
INFO:0 | 10:09:55 | step=3314 | continuous_monitor | ==== DRIFT DETECTED (Event #4)! ====
INFO:0 | 10:09:55 | step=3314 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:10:05 | step=3314 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            52.98 μs        0 FLOP/s
infer           1.62 GFLOPs        3.21 ms         504.52 GFLOP/s
optimizer       2.28 MFLOPs        4.01 ms         569.02 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.67 ms        451.40 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.44 GFLOPs        17.93 ms        358.83 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:10:51 | step=3916 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:10:51 | step=3916 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 8.485900163650513 and a scale of 0.7678219676017761
Mutating the picture further using an angle of 9.587537050247192 and a scale of 0.8052024245262146
Mutating the picture further using an angle of 0.629265308380127 and a scale of 1.1789422631263733
INFO:0 | 10:11:03 | step=4268 | continuous_monitor | ==== DRIFT DETECTED (Event #5)! ====
INFO:0 | 10:11:03 | step=4268 | continuous_monitor | -> Dispatching continual learning module...
INFO:0 | 10:11:14 | step=4268 | continuous_trainer | ==== Continual Learning ====
---------------------------------------------------------------------------
Compute Performance Metrics (Averaged per Update)
---------------------------------------------------------------------------
Operation       FLOPs              Time            Throughput
---------------------------------------------------------------------------
detector        0 FLOPs            52.83 μs        0 FLOP/s
infer           1.62 GFLOPs        3.22 ms         501.82 GFLOP/s
optimizer       2.28 MFLOPs        4.01 ms         568.12 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.67 ms        451.44 GFLOP/s
---------------------------------------------------------------------------
TOTAL           6.44 GFLOPs        17.96 ms        358.38 GFLOP/s
---------------------------------------------------------------------------
INFO:0 | 10:12:08 | step=4870 | continuous_monitor | <- Continual learning complete.
INFO:0 | 10:12:08 | step=4870 | continuous_monitor | ==== RESUMING MONITORING! ====
Mutating the picture further using an angle of 8.28125774860382 and a scale of 1.1350123882293701
Mutating the picture further using an angle of 9.250980615615845 and a scale of 1.1172855496406555
Mutating the picture further using an angle of 4.6211546659469604 and a scale of 0.9335004687309265
INFO:0 | 10:12:21 | step=5215 | continuous_monitor | ==== Continuous Monitoring Complete ====
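
The throughput column in the tables above can be sanity-checked by hand. The sketch below assumes that throughput is simply FLOPs divided by time and that the prefixes are decimal (G = 1e9, M = 1e6). How the framework actually computes the column is not shown in this PR, and parse_row and recomputed_throughput are hypothetical helper names. Because the logged values are rounded, small differences are expected.

```python
import re

# Assumed decimal SI prefixes for FLOP counts and throughput.
PREFIX = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}
# Time units seen in the log, plus common spellings of "microseconds".
TIME_UNIT = {"s": 1.0, "ms": 1e-3, "μs": 1e-6, "µs": 1e-6, "us": 1e-6}

# One table row: operation, FLOPs, time, throughput.
ROW = re.compile(
    r"^(?P<op>\S+)\s+"
    r"(?P<flops>[\d.]+)\s+(?P<fp>[KMGT]?)FLOPs\s+"
    r"(?P<time>[\d.]+)\s+(?P<tu>[mμµu]?s)\s+"
    r"(?P<thr>[\d.]+)\s+(?P<tp>[KMGT]?)FLOP/s\s*$"
)


def parse_row(line):
    """Return (operation, flops, seconds, logged_flops_per_s), or None if the line is not a table row."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    flops = float(m["flops"]) * PREFIX[m["fp"]]
    seconds = float(m["time"]) * TIME_UNIT[m["tu"]]
    logged = float(m["thr"]) * PREFIX[m["tp"]]
    return m["op"], flops, seconds, logged


def recomputed_throughput(flops, seconds):
    """Throughput in FLOP/s under the FLOPs / time assumption."""
    return flops / seconds if seconds > 0 else 0.0


# Rows copied from the first metrics table in the log above.
SAMPLE = """\
detector        0 FLOPs            49.01 μs        0 FLOP/s
infer           1.62 GFLOPs        3.15 ms         513.16 GFLOP/s
optimizer       2.28 MFLOPs        4.00 ms         570.71 MFLOP/s
update_fwd_bwd  4.82 GFLOPs        10.90 ms        441.62 GFLOP/s
TOTAL           6.43 GFLOPs        18.10 ms        355.53 GFLOP/s
"""

for line in SAMPLE.splitlines():
    row = parse_row(line)
    if row is None:
        continue
    op, flops, seconds, logged = row
    redo = recomputed_throughput(flops, seconds)
    print(f"{op:<15} logged={logged:.3e} FLOP/s  recomputed={redo:.3e} FLOP/s")
```

For the sample rows, the recomputed values agree with the logged ones to within about 1%, which is consistent with rounding in the printed FLOPs and time columns.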

API / CLI Changes

N/A

Breaking Changes

  • None

Performance (optional)

N/A

Security & Privacy

N/A

Dependencies

N/A

Testing Plan

Described in the deployment README.

Documentation

Added README under src/deployment/perlmutter/.

Checklist

  • Code formatted (Ruff) → ruff format --check
  • Lint passes (Ruff) → ruff check .
  • Types pass (mypy/pyright) → mypy src
  • Tests pass (pytest) → pytest -q
  • Backward compatibility considered
  • Adequate comments for tricky parts
  • CI green

Risk & Rollback Plan

Probably not needed in the beginning

Notes for Reviewers

@ScSteffen
Collaborator

Looking good.
I'd leave this PR open for now for 2 reasons.

  1. I'm about to add torch DDP support, and we should test whether the Perlmutter scripts work with DDP too.
  2. Fix docker CI build size #79 fixed the CI. Can you please pull and see if the CI turns green here?

@rz4 rz4 mentioned this pull request Feb 18, 2026
10 tasks
Added missing poetry install command.
@ScSteffen ScSteffen added the Deployment label (Issues and PRs related to the deployment of the model back in the system) Mar 4, 2026
@rz4
Collaborator Author

rz4 commented Mar 9, 2026

I have added a link in the main README to the instructions for setting up the codebase and environment on Perlmutter and for submitting jobs. The document is a README under deployment. Does it make sense to move deployment documentation to a folder under docs/?

The deployment on Perlmutter has been streamlined. In their scratch directory, the user clones the repo and enters its root directory. The user then runs source src/deployment/perlmutter/instal_venv.sh to create a Python virtual environment, install Poetry, and use Poetry to resolve the remaining dependencies.

Poetry installs the project dependencies within the virtual environment. The virtual environment can then be loaded near the start of the SLURM script with source .venv/bin/activate.
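
One way a job could confirm that the activation step took effect is a small Python check run right after source .venv/bin/activate. This is only a sketch and not part of the PR: in_virtualenv is a hypothetical helper name, and it relies on the standard behaviour that sys.prefix differs from sys.base_prefix inside a venv-style environment.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    status = "active" if in_virtualenv() else "NOT active"
    print(f"virtual environment {status}: {sys.prefix}")
```

If the check reports that the environment is not active, the job is using a Python interpreter outside .venv, so the dependencies installed by Poetry may not be importable.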

Collaborator

@rfgeek rfgeek left a comment

Ran the MNIST example on Perlmutter and it worked.

@anagainaru anagainaru merged commit cbd3c56 into main Mar 11, 2026
3 checks passed
@anagainaru anagainaru deleted the deploy_perlmutter branch March 11, 2026 15:57
4 participants