
fix: ray.sub will exit early if any srun fails to launch #1022

Merged
terrykong merged 5 commits into main from tk/error-on-meta-failure on Sep 3, 2025


@terrykong
Collaborator

Closes: #1019

Example output:

+ echo '[ERROR] Background srun '\''ray-worker-1'\'' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.'
[ERROR] Background srun 'ray-worker-1' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.
+ touch /logdir/5336464-logs/ENDED
+ exit 1
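
For readers unfamiliar with the pattern behind this trace, here is a minimal sketch of how a launcher script could detect a dead background process and bail out. It is not the code from this PR: the function name `check_background_pids`, the `NAME=PID` argument format, and the `log_dir` parameter are all hypothetical, chosen only to reproduce the behavior shown in the example output (print an error, touch `ENDED`, signal failure).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the actual ray.sub implementation.
#
# check_background_pids LOG_DIR NAME=PID [NAME=PID ...]
# Returns 1 (after creating LOG_DIR/ENDED) as soon as one of the listed
# background processes is found to be no longer running; returns 0 otherwise.
check_background_pids() {
  local log_dir="$1"
  shift
  local entry name pid
  for entry in "$@"; do
    name="${entry%%=*}"   # text before the first '='
    pid="${entry##*=}"    # text after the last '='
    # kill -0 delivers no signal; it only tests whether the PID still exists.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "[ERROR] Background srun '$name' died (pid=$pid). Attempting to exit."
      touch "$log_dir/ENDED"
      return 1
    fi
  done
  return 0
}
```

A caller would typically record `$!` after each backgrounded `srun ... &` and then run something like `check_background_pids "$LOG_DIR" "ray-worker-1=$worker_pid" || exit 1` inside its existing polling loop, so that a failed launch ends the job instead of leaving it waiting indefinitely.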

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
This reverts commit 63e2f6a.

Signed-off-by: Terry Kong <terryk@nvidia.com>
This reverts commit c7ad390.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Example output:
```
+ echo '[ERROR] Background srun '\''ray-worker-1'\'' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.'
[ERROR] Background srun 'ray-worker-1' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.
+ touch /logdir/5336464-logs/ENDED
+ exit 1
```

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from hemildesai August 29, 2025 05:40
@terrykong terrykong enabled auto-merge August 29, 2025 05:40
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Sep 2, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 3, 2025
Merged via the queue into main with commit acabc79 Sep 3, 2025
21 checks passed
@terrykong terrykong deleted the tk/error-on-meta-failure branch September 3, 2025 22:19
wangshangsam pushed a commit that referenced this pull request Sep 4, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>
terrykong added a commit that referenced this pull request Sep 6, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

Development

Successfully merging this pull request may close these issues.

ray.sub hangs unnecessarily if one node is busted
