23 changes: 20 additions & 3 deletions README.md
@@ -101,7 +101,24 @@ poetry run ruff check .
poetry run mypy .
```

## Current Integration Notes
## Deployment

- `EnsembleDetector` exists but is not wired in `load_drift_detector(...)`.
- `ModelPerformanceDetector` and `EvalDetector` need additional runtime wiring for full monitor integration.
Platform-specific deployment guides:

- [NERSC Perlmutter](./src/deployment/perlmutter/README.md)

## What `main.py` Does
- Builds the `DummyCNN_MNIST` model defined in `src/model/DummyCNN_MNIST.py`, a cross-entropy loss, and an Adam optimizer.
- Loads the MNIST training split, stacks the tensors, and iterates over 10 tasks (digits 0–9). Each task applies random rotation and translation to encourage continual adaptation.
- Maintains replay buffers (`memory_image`, `memory_label`, etc.) so past samples remain available for rehearsal while training new tasks.
- Calls `CL(...)` to assemble task-specific dataloaders and drive the `One_task_CL` loop. The loop trains for five epochs, records loss/accuracy metrics, and prints periodic progress reports.
- Computes sensitivity scores with `src/validation/validation_utils/return_score` after each task; you can repurpose these values for analysis or adaptive triggers.

## Tuning Tips
- Change the number of epochs by editing `n_epoch` inside `CL`.
- Adjust replay/adversarial update counts through the `params` dictionaries in `One_task_CL` and `util.update_CL_`.
- Experiment with different transforms or task definitions by modifying `data.py`.
- Update batch sizes by changing the `batch_size` parameter used when constructing the dataloaders.

## Output
Training logs report the task id, training/test accuracy, and replay-memory accuracy every five epochs. Accuracy is computed via `test(...)` on both the current task and the accumulated memory set.
46 changes: 46 additions & 0 deletions src/deployment/perlmutter/README.md
@@ -0,0 +1,46 @@
# Deployment

## NERSC Perlmutter

### Setup

Clone the repo into your scratch directory and run the install script:

```bash
cd $SCRATCH
git clone https://github.com/AI-ModCon/BaseSim_Framework.git
cd BaseSim_Framework
source ./src/deployment/perlmutter/install_venv.sh
```

`install_venv.sh` creates a virtual environment, installs Poetry, and uses it to resolve and install project dependencies. The environment is saved to `.venv` in the project root. The script runs the following:

```bash
module load python/3.13-26.1.0
python -m venv .venv
source .venv/bin/activate
pip install poetry
poetry lock
poetry install --no-cache
```

> **Note:** The MNIST example requires the dataset, which is downloaded on first run. Download it before submitting a batch job:
>
> ```bash
> poetry run python -c "from examples.mnist.utils import get_mnist_data; get_mnist_data()"
> ```

### Submitting a Job

The virtual environment can be sourced directly at the top of your SLURM script (`source .venv/bin/activate`), so Poetry is not needed at runtime; jobs run against the already-installed environment.

From the project root:

```bash
sbatch -A amsc002 src/deployment/perlmutter/mnist_example.sbatch
```
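The sbatch script above writes its logs to `output/` (the `-o`/`-e` flags); SLURM does not create missing log directories, so create it once before the first submission:

```shell
# Create the log directory referenced by the sbatch -o/-e flags;
# SLURM will not create it for you, and the job fails silently without it.
mkdir -p output
```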

### Troubleshooting

- **`poetry install` fails to connect to PyPI** — Run `poetry lock` first, then retry. The lock file pins resolved package versions and download sources, and may be stale on a new host.
- **`poetry install` fails with disk quota errors** — Poetry's default cache is in the home directory, which has limited space. Retry with `poetry install --no-cache` or free up space in `$HOME`.
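As an alternative to `--no-cache`, Poetry's cache can be redirected out of `$HOME` via its standard `POETRY_CACHE_DIR` environment variable; the scratch path below is only an example:

```shell
# Point Poetry's cache at scratch instead of $HOME (example path;
# falls back to /tmp when $SCRATCH is unset, e.g. off-cluster).
export POETRY_CACHE_DIR="${SCRATCH:-/tmp}/.poetry-cache"
mkdir -p "$POETRY_CACHE_DIR"
echo "Poetry cache: $POETRY_CACHE_DIR"
```

Set this before running `poetry lock`/`poetry install` so downloads land on the larger filesystem.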
7 changes: 7 additions & 0 deletions src/deployment/perlmutter/install_venv.sh
@@ -0,0 +1,7 @@
# install_venv.sh: create and populate the project virtual environment (source from the project root)
module load python/3.13-26.1.0 # Load supported python version
python -m venv .venv # Create a virtual environment
source ./.venv/bin/activate # Activate environment
pip install poetry # Install poetry
poetry lock # Resolve and pin dependency versions
poetry install --no-cache # Install project dependencies
26 changes: 26 additions & 0 deletions src/deployment/perlmutter/mnist_example.sbatch
@@ -0,0 +1,26 @@
#!/bin/bash -l
#SBATCH -J modcon_basesim
#SBATCH -t 0:20:00
#SBATCH -C gpu
#SBATCH -q debug
#SBATCH -n 1
#SBATCH --gpus 1
#SBATCH -o output/mnist_example.o%j
#SBATCH -e output/mnist_example.e%j

# Activate the virtual environment (submit the job from the project root)
source .venv/bin/activate

# WANDB flag
export WANDB_MODE=offline

# Print environment info
echo "=============================================="
echo "MNIST Example"
echo "=============================================="
echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo "=============================================="

# Run example
python -m src.main --config ./examples/mnist/mnist.toml