diff --git a/README.md b/README.md
index 8ee96c7..a757aee 100644
--- a/README.md
+++ b/README.md
@@ -101,7 +101,24 @@ poetry run ruff check .
 poetry run mypy .
 ```
 
-## Current Integration Notes
+## Deployment
 
-- `EnsembleDetector` exists but is not wired in `load_drift_detector(...)`.
-- `ModelPerformanceDetector` and `EvalDetector` need additional runtime wiring for full monitor integration.
+Platform-specific deployment guides:
+
+- [NERSC Perlmutter](./src/deployment/perlmutter/README.md)
+
+## What `main.py` Does
+- Builds the `DummyCNN_MNIST` model defined in `src/model/DummyCNN_MNIST.py`, a cross-entropy loss, and an Adam optimizer.
+- Loads the MNIST training split, stacks the tensors, and iterates over 10 tasks (digits 0–9). Each task applies random rotation and translation to encourage continual adaptation.
+- Maintains replay buffers (`memory_image`, `memory_label`, etc.) so past samples remain available for rehearsal while training new tasks.
+- Calls `CL(...)` to assemble task-specific dataloaders and drive the `One_task_CL` loop. The loop trains for five epochs, records loss/accuracy metrics, and prints periodic progress reports.
+- Computes sensitivity scores with `src/validation/validation_utils/return_score` after each task; you can repurpose these values for analysis or adaptive triggers.
+
+## Tuning Tips
+- Change the number of epochs by editing `n_epoch` inside `CL`.
+- Adjust replay/adversarial update counts through the `params` dictionaries in `One_task_CL` and `util.update_CL_`.
+- Experiment with different transforms or task definitions by modifying `data.py`.
+- Update batch sizes by changing the `batch_size` parameter used when constructing the dataloaders.
+
+## Output
+Training logs report the task id, training/test accuracy, and replay-memory accuracy every five epochs. Accuracy is computed via `test(...)` on both the current task and the accumulated memory set.
diff --git a/src/deployment/perlmutter/README.md b/src/deployment/perlmutter/README.md
new file mode 100644
index 0000000..ab4b891
--- /dev/null
+++ b/src/deployment/perlmutter/README.md
@@ -0,0 +1,46 @@
+# Deployment
+
+## NERSC Perlmutter
+
+### Setup
+
+Clone the repo into your scratch directory and run the install script:
+
+```bash
+cd $SCRATCH
+git clone https://github.com/AI-ModCon/BaseSim_Framework.git
+cd BaseSim_Framework
+source ./src/deployment/perlmutter/install_venv.sh
+```
+
+`install_venv.sh` creates a virtual environment, installs Poetry, and uses it to resolve and install project dependencies. The environment is saved to `.venv` in the project root. The script runs the following:
+
+```bash
+module load python/3.13-26.1.0
+python -m venv .venv
+source .venv/bin/activate
+pip install poetry
+poetry lock
+poetry install --no-cache
+```
+
+> **Note:** The MNIST example requires the MNIST dataset, which is downloaded on first run. Download it before submitting a batch job:
+>
+> ```bash
+> poetry run python -c "from examples.mnist.utils import get_mnist_data; get_mnist_data()"
+> ```
+
+### Submitting a Job
+
+Activate the virtual environment at the top of your SLURM script (`source .venv/bin/activate`); Poetry is not needed at runtime because jobs run against the installed environment.
+
+From the project root:
+
+```bash
+sbatch -A amsc002 src/deployment/perlmutter/mnist_example.sbatch
+```
+
+### Troubleshooting
+
+- **`poetry install` fails to connect to PyPI**: Run `poetry lock` first, then retry. The lock file records resolved package versions and sources, and may be stale on a new host.
+- **`poetry install` fails with disk quota errors**: Poetry's default cache is in the home directory, which has limited space. Retry with `poetry install --no-cache` or free up space in `$HOME`.
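For the disk-quota case in the troubleshooting list, one further option is to move Poetry's cache off `$HOME` rather than disabling it. The sketch below assumes `$SCRATCH` is set, as it is on Perlmutter, and falls back to the current directory otherwise. `use_scratch_cache` and the `poetry-cache` directory name are hypothetical and not part of this repository. `POETRY_CACHE_DIR` is Poetry's environment override for its `cache-dir` setting.

```bash
# Hypothetical helper, not part of the repository: redirect Poetry's cache
# to scratch space so package downloads do not count against the $HOME quota.
use_scratch_cache() {
    # Fall back to the current directory when $SCRATCH is unset (assumption).
    local base="${SCRATCH:-$PWD}"
    # POETRY_CACHE_DIR overrides Poetry's cache-dir configuration value.
    export POETRY_CACHE_DIR="$base/poetry-cache"
    mkdir -p "$POETRY_CACHE_DIR"
    echo "$POETRY_CACHE_DIR"
}
```

Calling `use_scratch_cache` before `poetry install` in the same shell session should keep downloaded packages out of the home directory, at the cost of the cache being subject to scratch purge policies.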
diff --git a/src/deployment/perlmutter/install_venv.sh b/src/deployment/perlmutter/install_venv.sh
new file mode 100644
index 0000000..2d77771
--- /dev/null
+++ b/src/deployment/perlmutter/install_venv.sh
@@ -0,0 +1,7 @@
+# Set up the project virtual environment on Perlmutter; source this from the project root
+module load python/3.13-26.1.0 # Load supported python version
+python -m venv .venv # Create a virtual environment
+source ./.venv/bin/activate # Activate environment
+pip install poetry # Install poetry
+poetry lock # Resolve dependencies and refresh the lock file
+poetry install --no-cache # Install project dependencies
diff --git a/src/deployment/perlmutter/mnist_example.sbatch b/src/deployment/perlmutter/mnist_example.sbatch
new file mode 100644
index 0000000..07381a8
--- /dev/null
+++ b/src/deployment/perlmutter/mnist_example.sbatch
@@ -0,0 +1,26 @@
+#!/bin/bash -l
+#SBATCH -J modcon_basesim
+#SBATCH -t 0:20:00
+#SBATCH -C gpu
+#SBATCH -q debug
+#SBATCH -n 1
+#SBATCH --gpus 1
+#SBATCH -o output/mnist_example.o%j
+#SBATCH -e output/mnist_example.e%j
+
+# Activate the virtual environment (paths are relative to the project root)
+source .venv/bin/activate
+
+# Run Weights & Biases in offline mode
+export WANDB_MODE=offline
+
+# Print environment info
+echo "=============================================="
+echo "MNIST Example"
+echo "=============================================="
+echo "Date: $(date)"
+echo "Hostname: $(hostname)"
+echo "=============================================="
+
+# Run example
+python -m src.main --config ./examples/mnist/mnist.toml
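The batch script writes its logs to `output/` and activates `.venv` by relative path, so both need to exist in the submission directory. SLURM does not create the directory named in `-o`/`-e`, and a job whose output path is missing typically fails without leaving a log. A minimal pre-submission check might look like the following. `preflight` is a hypothetical helper and is not part of the repository.

```bash
# Hypothetical helper, not part of the repository: verify the submission
# directory before calling sbatch.
# Usage: preflight [project_root]   (defaults to the current directory)
preflight() {
    local root="${1:-.}"
    # mnist_example.sbatch runs `source .venv/bin/activate` relative to the
    # submission directory, so the environment must already be installed there.
    if [ ! -f "$root/.venv/bin/activate" ]; then
        echo "missing $root/.venv; run install_venv.sh first" >&2
        return 1
    fi
    # SLURM will not create the directory used by `#SBATCH -o output/...`.
    mkdir -p "$root/output"
    echo "ready: $root"
}
```

From the project root, `preflight && sbatch -A amsc002 src/deployment/perlmutter/mnist_example.sbatch` would then submit only when both prerequisites are in place.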