NERSC adapter: queue_name attribute ignored — jobs always submitted with gpu_debug QOS #49

@osmiumzero

Description

Summary

The attributes.queue_name field in the PSI/J JobSpec is ignored by the NERSC adapter. Regardless of the value provided (gpu_regular, regular, shared, etc.), all jobs are submitted with Slurm QOS gpu_debug, which has a 30-minute wall time limit.

This effectively caps all GPU jobs submitted through the API to 30 minutes, even though the user's account and the requested QOS allow much longer wall times.

Evidence

Job submitted via the API with queue_name: "gpu_regular":

$ scontrol show job 49622634 | grep QOS
   Priority=69119 Nice=0 Account=m3792_g QOS=gpu_debug

Actual Slurm QOS limits on Perlmutter (from sacctmgr show qos):

QOS          MaxWall
gpu_debug    00:30:00
gpu_regular  2-00:00:00
gpu_shared   2-00:00:00

The user's account (m3792_g) has no per-association MaxWall limit (sacctmgr show assoc shows empty MaxWall), so the 30-min cap comes entirely from the API forcing gpu_debug.
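
For reference, both limits can be verified with sacctmgr queries along these lines (a sketch; the format fields are one reasonable choice of columns, not necessarily the exact invocation used above):

$ sacctmgr show qos gpu_debug,gpu_regular,gpu_shared format=Name,MaxWall
$ sacctmgr show assoc where account=m3792_g format=Account,User,MaxWall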

Reproduction

POST /api/v1/compute/job/6d00f875-dfc1-4a41-9309-456c5f2048df
{
    "executable": "/path/to/script.sh",
    "resources": {"node_count": 1, "gpu_cores_per_process": 4},
    "attributes": {
        "queue_name": "gpu_regular",
        "account": "m3792_g",
        "duration": 5400
    }
}

Expected: Job submitted with sbatch -q gpu_regular -A m3792_g -t 01:30:00
Actual: Job submitted with QOS=gpu_debug, fails with QOSMaxWallDurationPerJobLimit for any duration > 1800s
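
The same request can be replayed with curl (a sketch; the Authorization header and $SFAPI_TOKEN are placeholders for whatever token scheme the deployment actually uses):

$ curl -X POST "https://api.iri.nersc.gov/api/v1/compute/job/6d00f875-dfc1-4a41-9309-456c5f2048df" \
    -H "Authorization: Bearer $SFAPI_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
          "executable": "/path/to/script.sh",
          "resources": {"node_count": 1, "gpu_cores_per_process": 4},
          "attributes": {"queue_name": "gpu_regular", "account": "m3792_g", "duration": 5400}
        }'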

Workaround

Limit wall time to 1800s (30 min). For longer GPU jobs, submit directly via sbatch on Perlmutter, bypassing the API.
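
For example, the 90-minute job from the reproduction can be submitted directly with the sbatch line the API was expected to generate:

$ sbatch -q gpu_regular -A m3792_g -t 01:30:00 /path/to/script.sh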

Environment

  • NERSC Perlmutter
  • API: https://api.iri.nersc.gov/api/v1
  • Date: 2026-03-03
