From 82473945b1d9e7402a8ae7a37968f4a030e71cf3 Mon Sep 17 00:00:00 2001
From: rz4
Date: Tue, 17 Feb 2026 10:18:03 -0800
Subject: [PATCH 1/7] Perlmutter slurm script for mnist.

---
 src/deployment/perlmutter/README.md | 29 +++++++++++++++++
 .../perlmutter/mnist_example.sbatch | 31 +++++++++++++++++++
 2 files changed, 60 insertions(+)
 create mode 100644 src/deployment/perlmutter/README.md
 create mode 100644 src/deployment/perlmutter/mnist_example.sbatch

diff --git a/src/deployment/perlmutter/README.md b/src/deployment/perlmutter/README.md
new file mode 100644
index 0000000..995a7a7
--- /dev/null
+++ b/src/deployment/perlmutter/README.md
@@ -0,0 +1,29 @@
+# Deployment
+
+## NERSC's Perlmutter
+
+### Setup
+First, create a local virtual enviroment in scratch directory and clone repo:
+
+```bash
+cd $SCRATCH # User scratch space
+module load python # Load stable python
+python -m venv my_env # Create a virtual environment
+source ./my_env/bin/activate
+pip install poetry
+git clone https://github.com/AI-ModCon/BaseSim_Framework.git
+```
+
+> Note: Testing model harness and jvp update requires MNIST dataset download on first run.
+> Download the dataset before submitting the run using:
+
+```bash
+poetry run python -c "from examples.mnist.utils import get_mnist_data; get_mnist_data()"
+```
+
+### Submit Job
+To submit run from project root:
+
+```bash
+sbatch -A amsc002 src/deployment/perlmutter/mnist_example.sbatch
+```
diff --git a/src/deployment/perlmutter/mnist_example.sbatch b/src/deployment/perlmutter/mnist_example.sbatch
new file mode 100644
index 0000000..95c8c7b
--- /dev/null
+++ b/src/deployment/perlmutter/mnist_example.sbatch
@@ -0,0 +1,31 @@
+#!/bin/bash -l
+#SBATCH -J modcon_basesim
+#SBATCH -t 0:20:00
+#SBATCH -C gpu
+#SBATCH -q debug
+#SBATCH -n 1
+#SBATCH --gpus 1
+#SBATCH -o output/mnist_example.o%j
+#SBATCH -e output/mnist_example.e%j
+
+# Load required modules
+#module load PrgEnv-gnu
+#module load gcc/12.2.0
+#module load rocm/6.4.2
+
+# ROCm/MIOpen flags
+#mkdir -p $MEMBERWORK/$SBATCH_ACCOUNT/miopen
+#export MIOPEN_USER_DB_PATH=$MEMBERWORK/$SBATCH_ACCOUNT/miopen
+#export MIOPEN_CUSTOM_CACHE_DIR=$MEMBERWORK/$SBATCH_ACCOUNT/miopen
+export WANDB_MODE=offline
+
+# Print environment info
+echo "=============================================="
+echo "MNIST Example"
+echo "=============================================="
+echo "Date: $(date)"
+echo "Hostname: $(hostname)"
+echo "=============================================="
+
+# Run example
+poetry run python -m src.main --config ./examples/mnist/mnist.toml

From 57e4a91854500708a09f2fcbe1106aa2928b4245 Mon Sep 17 00:00:00 2001
From: rz4
Date: Tue, 17 Feb 2026 11:44:47 -0800
Subject: [PATCH 2/7] Removed commented lines.
---
 src/deployment/perlmutter/mnist_example.sbatch | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/src/deployment/perlmutter/mnist_example.sbatch b/src/deployment/perlmutter/mnist_example.sbatch
index 95c8c7b..7cc38e3 100644
--- a/src/deployment/perlmutter/mnist_example.sbatch
+++ b/src/deployment/perlmutter/mnist_example.sbatch
@@ -8,15 +8,7 @@
 #SBATCH -o output/mnist_example.o%j
 #SBATCH -e output/mnist_example.e%j
 
-# Load required modules
-#module load PrgEnv-gnu
-#module load gcc/12.2.0
-#module load rocm/6.4.2
-
-# ROCm/MIOpen flags
-#mkdir -p $MEMBERWORK/$SBATCH_ACCOUNT/miopen
-#export MIOPEN_USER_DB_PATH=$MEMBERWORK/$SBATCH_ACCOUNT/miopen
-#export MIOPEN_CUSTOM_CACHE_DIR=$MEMBERWORK/$SBATCH_ACCOUNT/miopen
+# WANDB flag
 export WANDB_MODE=offline
 
 # Print environment info

From f244ed2fd7d1b7719c8057e0aadc29df12637982 Mon Sep 17 00:00:00 2001
From: Rafael Zamora-Resendiz <15003285+rz4@users.noreply.github.com>
Date: Fri, 27 Feb 2026 16:45:32 -0500
Subject: [PATCH 3/7] Update README.md

Added missing poetry install command.
---
 src/deployment/perlmutter/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/deployment/perlmutter/README.md b/src/deployment/perlmutter/README.md
index 995a7a7..e7184c2 100644
--- a/src/deployment/perlmutter/README.md
+++ b/src/deployment/perlmutter/README.md
@@ -12,6 +12,8 @@ python -m venv my_env # Create a virtual environment
 source ./my_env/bin/activate
 pip install poetry
 git clone https://github.com/AI-ModCon/BaseSim_Framework.git
+cd ./BaseSim_Framework
+poetry install
 ```
 
 > Note: Testing model harness and jvp update requires MNIST dataset download on first run.
From 381f4e6edfdb41f250b243726536655d1309f97d Mon Sep 17 00:00:00 2001
From: Rafael Zamora-Resendiz <15003285+rz4@users.noreply.github.com>
Date: Thu, 5 Mar 2026 10:15:41 -0500
Subject: [PATCH 4/7] Update README.md

---
 src/deployment/perlmutter/README.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/deployment/perlmutter/README.md b/src/deployment/perlmutter/README.md
index e7184c2..c871bd3 100644
--- a/src/deployment/perlmutter/README.md
+++ b/src/deployment/perlmutter/README.md
@@ -7,7 +7,7 @@ First, create a local virtual enviroment in scratch directory and clone repo:
 
 ```bash
 cd $SCRATCH # User scratch space
-module load python # Load stable python
+module load python/3.13-26.1.0 # Load stable python
 python -m venv my_env # Create a virtual environment
 source ./my_env/bin/activate
 pip install poetry
@@ -29,3 +29,12 @@ To submit run from project root:
 ```bash
 sbatch -A amsc002 src/deployment/perlmutter/mnist_example.sbatch
 ```
+
+### Common Issues
+
+1. If running `poetry install` produces errors connecting to PyPi, run `poetry lock` then retry `poetry install`.
+Poetry's lock file contains the packages download spec. It may be stale and needs an update for the new host.
+
+2. If running `poetry install` produces errors regarding disk space quota limits, there is not enough space in
+poetry's default cache location (home directory). Retry with `poetry install --no-cache` or free up space in
+home directory.
From 02edee696c6738b4e58f289417786d4e6323e691 Mon Sep 17 00:00:00 2001
From: rz4
Date: Thu, 5 Mar 2026 11:09:51 -0800
Subject: [PATCH 5/7] standardized .venv as enviroment for jobs

---
 src/deployment/perlmutter/README.md       | 21 +++++++++++++------
 src/deployment/perlmutter/install_venv.sh |  7 +++++++
 .../perlmutter/mnist_example.sbatch       |  5 ++++-
 3 files changed, 26 insertions(+), 7 deletions(-)
 create mode 100644 src/deployment/perlmutter/install_venv.sh

diff --git a/src/deployment/perlmutter/README.md b/src/deployment/perlmutter/README.md
index c871bd3..429764f 100644
--- a/src/deployment/perlmutter/README.md
+++ b/src/deployment/perlmutter/README.md
@@ -3,17 +3,26 @@
 ## NERSC's Perlmutter
 
 ### Setup
-First, create a local virtual enviroment in scratch directory and clone repo:
+First, within your scratch directory, the clone the repo, and run install script.:
 
 ```bash
 cd $SCRATCH # User scratch space
-module load python/3.13-26.1.0 # Load stable python
-python -m venv my_env # Create a virtual environment
-source ./my_env/bin/activate
-pip install poetry
 git clone https://github.com/AI-ModCon/BaseSim_Framework.git
 cd ./BaseSim_Framework
-poetry install
+source ./src/deployment/perlmutter/install_venv.sh
+```
+
+`install_venv.sh` creates a virtual enviroment, installs poetry, then poetry installs
+the project dependencies. The virtual enviroment is saved under `.venv` in the root directory.
+
+The following commands are run in `install_venv.sh`:
+```bash
+module load python/3.13-26.1.0 # Load supported python version
+python -m venv .venv # Create a virtual environment
+source ./venv/bin/activate
+pip install poetry
+poetry lock
+poetry install --no_cache
 ```
 
 > Note: Testing model harness and jvp update requires MNIST dataset download on first run.
diff --git a/src/deployment/perlmutter/install_venv.sh b/src/deployment/perlmutter/install_venv.sh
new file mode 100644
index 0000000..2d77771
--- /dev/null
+++ b/src/deployment/perlmutter/install_venv.sh
@@ -0,0 +1,7 @@
+#
+module load python/3.13-26.1.0 # Load supported python version
+python -m venv .venv # Create a virtual environment
+source ./.venv/bin/activate # Activate environment
+pip install poetry # Install poetry
+poetry lock # Refresh the lock file
+poetry install --no-cache # Install project dependencies
diff --git a/src/deployment/perlmutter/mnist_example.sbatch b/src/deployment/perlmutter/mnist_example.sbatch
index 7cc38e3..07381a8 100644
--- a/src/deployment/perlmutter/mnist_example.sbatch
+++ b/src/deployment/perlmutter/mnist_example.sbatch
@@ -8,6 +8,9 @@
 #SBATCH -o output/mnist_example.o%j
 #SBATCH -e output/mnist_example.e%j
 
+#
+source .venv/bin/activate
+
 # WANDB flag
 export WANDB_MODE=offline
 
@@ -20,4 +23,4 @@ echo "Hostname: $(hostname)"
 echo "=============================================="
 
 # Run example
-poetry run python -m src.main --config ./examples/mnist/mnist.toml
+python -m src.main --config ./examples/mnist/mnist.toml

From bf177c9837a91ec2a8782be9430ee5b7f0653dfc Mon Sep 17 00:00:00 2001
From: Rafael Zamora-Resendiz <15003285+rz4@users.noreply.github.com>
Date: Mon, 9 Mar 2026 12:56:58 -0400
Subject: [PATCH 6/7] Update README.md

---
 src/deployment/perlmutter/README.md | 47 ++++++++++++++---------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/src/deployment/perlmutter/README.md b/src/deployment/perlmutter/README.md
index 429764f..ab4b891 100644
--- a/src/deployment/perlmutter/README.md
+++ b/src/deployment/perlmutter/README.md
@@ -1,49 +1,46 @@
 # Deployment
 
-## NERSC's Perlmutter
+## NERSC Perlmutter
 
 ### Setup
-First, within your scratch directory, the clone the repo, and run install script.:
+
+Clone the repo into your scratch directory and run the install script:
 
 ```bash
-cd $SCRATCH # User scratch space
+cd $SCRATCH
 git clone https://github.com/AI-ModCon/BaseSim_Framework.git
-cd ./BaseSim_Framework
+cd BaseSim_Framework
 source ./src/deployment/perlmutter/install_venv.sh
 ```
 
-`install_venv.sh` creates a virtual enviroment, installs poetry, then poetry installs
-the project dependencies. The virtual enviroment is saved under `.venv` in the root directory.
+`install_venv.sh` creates a virtual environment, installs Poetry, and uses it to resolve and install project dependencies. The environment is saved to `.venv` in the project root. The script runs the following:
 
-The following commands are run in `install_venv.sh`:
 ```bash
-module load python/3.13-26.1.0 # Load supported python version
-python -m venv .venv # Create a virtual environment
-source ./venv/bin/activate
+module load python/3.13-26.1.0
+python -m venv .venv
+source .venv/bin/activate
 pip install poetry
 poetry lock
-poetry install --no_cache
+poetry install --no-cache
 ```
 
-> Note: Testing model harness and jvp update requires MNIST dataset download on first run.
-> Download the dataset before submitting the run using:
+> **Note:** The MNIST example requires the dataset, which is downloaded on first run. Download it before submitting a batch job:
+>
+> ```bash
+> poetry run python -c "from examples.mnist.utils import get_mnist_data; get_mnist_data()"
+> ```
 
-```bash
-poetry run python -c "from examples.mnist.utils import get_mnist_data; get_mnist_data()"
-```
+### Submitting a Job
 
-### Submit Job
-To submit run from project root:
+The virtual environment can be sourced directly at the top of your SLURM script (`source .venv/bin/activate`), so Poetry is not needed at runtime — jobs run against the installed environment.
+
+From the project root:
 
 ```bash
 sbatch -A amsc002 src/deployment/perlmutter/mnist_example.sbatch
 ```
 
-### Common Issues
-
-1. If running `poetry install` produces errors connecting to PyPi, run `poetry lock` then retry `poetry install`.
-Poetry's lock file contains the packages download spec. It may be stale and needs an update for the new host.
+### Troubleshooting
 
-2. If running `poetry install` produces errors regarding disk space quota limits, there is not enough space in
-poetry's default cache location (home directory). Retry with `poetry install --no-cache` or free up space in
-home directory.
+- **`poetry install` fails to connect to PyPI** — Run `poetry lock` first, then retry. The lock file caches package download specs and may be stale on a new host.
+- **`poetry install` fails with disk quota errors** — Poetry's default cache is in the home directory, which has limited space. Retry with `poetry install --no-cache` or free up space in `$HOME`.

From b5e8bb90e52d27180a7be361a8f5c390c3c2be34 Mon Sep 17 00:00:00 2001
From: Rafael Zamora-Resendiz <15003285+rz4@users.noreply.github.com>
Date: Mon, 9 Mar 2026 13:00:10 -0400
Subject: [PATCH 7/7] Update README.md with deployment guide link.

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 9b28e0b..2ac62c7 100644
--- a/README.md
+++ b/README.md
@@ -55,6 +55,12 @@ To run the project's tests, execute the following command from the project root:
 poetry run pytest
 ```
 
+## Deployment
+
+Platform-specific deployment guides:
+
+- [NERSC Perlmutter](./src/deployment/perlmutter/README.md)
+
 ## What `main.py` Does
 - Builds the `DummyCNN_MNIST` model defined in `src/model/DummyCNN_MNIST.py`, a cross-entropy loss, and an Adam optimizer.
 - Loads the MNIST training split, stacks the tensors, and iterates over 10 tasks (digits 0–9). Each task applies random rotation and translation to encourage continual adaptation.