Merged
65 changes: 54 additions & 11 deletions examples/k8s-configs/README.md
@@ -373,21 +373,45 @@ kubectl exec -it <pod-name> -- ls -lh /data/
| **NAS** | SSH/rsync | NAS credentials in `credential.json` |
| **Local** | Filesystem | Pre-mounted PVC |

### Storage Classes
### Storage Classes (local-path vs NFS)

madengine separates **per-job results** from **long-lived shared data**:

| Volume | Typical use | Single-node (`nnodes: 1`) | Multi-node (`nnodes > 1`) |
|--------|-------------|----------------------------|----------------------------|
| **`{job}-results`** | Benchmark artifacts (`/results`) | **RWO** — `local_path_storage_class` or `single_node_results_storage_class` (e.g. `local-path`) | **RWX** — `nfs_storage_class` or `multi_node_results_storage_class` (e.g. `nfs-banff`) |
| **`madengine-shared-data`** | Dataset cache (`/data`) | **RWX** — always `ReadWriteMany` + NFS class | Same PVC |

**Built-in defaults (Banff-oriented)** are in `presets/k8s/defaults.json`: `nfs_storage_class` / `data_storage_class` → `nfs-banff`, `local_path_storage_class` → `local-path`, `recreate_shared_data_pvc` → `false`. You only need to set these keys on a cluster that uses different storage classes; in that case, override them in the additional context.

Example override for a different cluster:

```json
{
  "k8s": {
    "nfs_storage_class": "nfs-client",
    "local_path_storage_class": "standard",
    "data_storage_class": "nfs-client"
  }
}
```

- **`nfs_storage_class`**: RWX class (e.g. `nfs-banff`) — used for shared-data (with `data_storage_class`) and multi-node results unless overridden.
- **`local_path_storage_class`**: RWO class for **single-node only** results PVC.
- **`data_storage_class`**: Optional override for `madengine-shared-data` only (falls back to `nfs_storage_class`, then `storage_class`).
- **`single_node_results_storage_class`** / **`multi_node_results_storage_class`**: Optional fine-grained overrides for results PVCs.
- **`recreate_shared_data_pvc`**: If `true`, deletes the existing `madengine-shared-data` PVC before creating it again (**destroys data**; back up first). Use when migrating from RWO `local-path` to RWX NFS.
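
The precedence described in the list above can be sketched in a few lines of Python. This is only an illustration of the documented fallback order, not madengine's actual implementation: the function names (`resolve_results_pvc`, `resolve_shared_data_pvc`) and the `PRESET_DEFAULTS` dictionary are hypothetical, and the sketch assumes that the merged `k8s` config is a plain dictionary in which unset keys are absent or `None`.

```python
from typing import Optional

# Preset values as quoted in this README (presets/k8s/defaults.json).
# The dictionary itself is a hypothetical stand-in for the merged config.
PRESET_DEFAULTS = {
    "nfs_storage_class": "nfs-banff",
    "data_storage_class": "nfs-banff",
    "local_path_storage_class": "local-path",
}


def _first_set(k8s: dict, *keys: str) -> Optional[str]:
    """Return the value of the first key that is present and non-empty."""
    for key in keys:
        value = k8s.get(key)
        if value:
            return value
    return None


def resolve_results_pvc(k8s: dict, nnodes: int) -> tuple:
    """Pick (access mode, storage class) for the `{job}-results` PVC."""
    if nnodes > 1:
        # Multi-node results must be shared across nodes, so RWX.
        return "ReadWriteMany", _first_set(
            k8s, "multi_node_results_storage_class", "nfs_storage_class", "storage_class"
        )
    # Single-node results only need RWO.
    return "ReadWriteOnce", _first_set(
        k8s, "single_node_results_storage_class", "local_path_storage_class", "storage_class"
    )


def resolve_shared_data_pvc(k8s: dict) -> tuple:
    """Pick (access mode, storage class) for `madengine-shared-data` (always RWX)."""
    return "ReadWriteMany", _first_set(
        k8s, "data_storage_class", "nfs_storage_class", "storage_class"
    )


if __name__ == "__main__":
    # Apply the "different cluster" override on top of the presets.
    cfg = {
        **PRESET_DEFAULTS,
        "nfs_storage_class": "nfs-client",
        "local_path_storage_class": "standard",
        "data_storage_class": "nfs-client",
    }
    print(resolve_results_pvc(cfg, nnodes=1))
    print(resolve_results_pvc(cfg, nnodes=2))
    print(resolve_shared_data_pvc(cfg))
```

Under these assumptions, overriding only `nfs_storage_class` would leave `madengine-shared-data` on the preset `data_storage_class`, which may be why the override example sets both keys.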

**Single-Node (RWO)** (results only):

**Single-Node (RWO)**:
- ✅ `local-path` (Rancher)
- ✅ AWS EBS (`gp3`, `io2`)
- ✅ Azure Disk
- ✅ Any RWO storage class

**Multi-Node (RWX)**:
- ✅ NFS (`nfs-client`)
- ✅ CephFS
- ✅ GlusterFS
- ✅ AWS EFS
- ✅ Azure Files
- ❌ `local-path` (RWO only)
**Multi-node & shared-data (RWX)**:

- ✅ NFS (e.g. `nfs-banff`, `nfs-client`)
- ✅ CephFS, GlusterFS, AWS EFS, Azure Files
- ❌ `local-path` (RWO only — not for shared-data or multi-node results)

### Custom PVC (Optional)

@@ -513,6 +537,13 @@ To use an existing PVC instead of auto-creation:
|-------|------|---------|-------------|
| `data_pvc` | string | `null` | Data PVC name (auto-created if using data provider) |
| `results_pvc` | string | `null` | Results PVC name (auto-created by default) |
| `storage_class` | string | `null` | Optional fallback if the keys below are unset |
| `nfs_storage_class` | string | **`nfs-banff`** (preset) | RWX class for shared-data / multi-node results |
| `local_path_storage_class` | string | **`local-path`** (preset) | RWO class for single-node `{job}-results` |
| `data_storage_class` | string | **`nfs-banff`** (preset) | Overrides SC for shared-data only |
| `single_node_results_storage_class` | string | `null` | Overrides single-node results SC (`local_path_storage_class` if unset) |
| `multi_node_results_storage_class` | string | `null` | Overrides multi-node results SC (`nfs_storage_class` if unset) |
| `recreate_shared_data_pvc` | boolean | **`false`** (preset) | If `true`, delete `madengine-shared-data` before creating it (data loss) |

#### Distributed Execution Fields

@@ -594,6 +625,18 @@ host_ipc: true
PVCs: Recommended for data and results
```

### Local `k8s_results` layout (after `madengine run`)

Artifacts are written under `./k8s_results/<job_name>/`:

| Path | Contents |
|------|----------|
| `<job_name>/<pod_name>/pod.log` | Container log from the Kubernetes API |
| `<job_name>/<pod_name>/pvc/` | Copy of `/results/<subdir>/` from the results PVC, matched to that pod |
| `<job_name>/pvc_unmapped/<subdir>/` | PVC folders that could not be matched to a pod name |

Write durable outputs under `/results/<replica-id>/` in the container so that each replica’s files land in a predictable PVC subdirectory (e.g. the hostname or `jobname-<index>`). madengine maps that subdirectory to the full pod name when copying to the host.
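
As a minimal sketch of the container-side convention, a workload could create its per-replica directory like this. The helper name `replica_results_dir` is hypothetical, and using the container hostname as the replica id is an assumption based on the "hostname" example above; any stable per-replica identifier would serve the same purpose.

```python
import socket
from pathlib import Path
from typing import Optional


def replica_results_dir(root: str = "/results", replica_id: Optional[str] = None) -> Path:
    """Create and return `<root>/<replica-id>/` for this replica.

    Assumption: when no id is given, the container hostname is used,
    so each replica writes to its own subdirectory of the results PVC.
    """
    rid = replica_id or socket.gethostname()
    path = Path(root) / rid
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    import tempfile

    # A temporary directory stands in for the mounted /results PVC here.
    out_dir = replica_results_dir(root=tempfile.mkdtemp(), replica_id="jobname-0")
    (out_dir / "metrics.json").write_text('{"status": "ok"}\n')
    print(out_dir)
```

Inside a real pod the call would simply be `replica_results_dir()`, which writes under the mounted `/results` volume.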

### Distributed Launchers

**Training Launchers:**
@@ -17,7 +17,8 @@
"cpu_limit": "32",

"image_pull_policy": "Always",
"backoff_limit": 3
"backoff_limit": 3,
"recreate_shared_data_pvc": true
},

"distributed": {
3 changes: 2 additions & 1 deletion examples/k8s-configs/basic/03-torchrun-multi-node-basic.json
@@ -18,7 +18,8 @@

"image_pull_policy": "Always",
"backoff_limit": 3,
"host_ipc": true
"host_ipc": true,
"recreate_shared_data_pvc": true
},

"distributed": {