revert: "chore: improve ray.sub generalization across clusters"#1505
Merged
This reverts commit 855151b. Signed-off-by: Terry Kong <terryk@nvidia.com>
ccba26e to 8e21f2a
Walkthrough
Detects SLURM GRES support via a new maybe_gres_arg; composes COMMON_SRUN_ARGS dynamically (including GRES, container mounts, and workdir); replaces static srun flags with --ntasks=1 and --cpus-per-task (CPUS_PER_WORKER = GPUS_PER_NODE*16); updates all srun uses; and improves node IP resolution with fallbacks and an ip_addresses_array.
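The argument composition described in the walkthrough might look roughly like the sketch below. It is not the actual contents of ray.sub: only the names maybe_gres_arg, COMMON_SRUN_ARGS, CPUS_PER_WORKER, and GPUS_PER_NODE come from the walkthrough, while the "yes"/"no" GRES indicator, the default of 8 GPUs, and the helper compose_common_srun_args are assumptions made for illustration. Container mounts and workdir are left out.

```shell
#!/bin/sh
# Hypothetical sketch; the real ray.sub may differ in every detail.

GPUS_PER_NODE=${GPUS_PER_NODE:-8}          # assumed default for illustration
CPUS_PER_WORKER=$((GPUS_PER_NODE * 16))    # formula quoted in the walkthrough

# Print a --gres flag only when GRES is configured.
# $1: GPUs per node
# $2: "yes" when the cluster uses GRES (detection itself is assumed to happen elsewhere)
maybe_gres_arg() {
  if [ "$2" = "yes" ]; then
    printf '%s\n' "--gres=gpu:$1"
  fi
}

# Build the shared srun arguments; an empty GRES argument adds nothing.
# $1: "yes" or "no", forwarded to maybe_gres_arg
compose_common_srun_args() {
  COMMON_SRUN_ARGS="--ntasks=1 --cpus-per-task=${CPUS_PER_WORKER}"
  gres_arg=$(maybe_gres_arg "$GPUS_PER_NODE" "$1")
  if [ -n "$gres_arg" ]; then
    COMMON_SRUN_ARGS="$COMMON_SRUN_ARGS $gres_arg"
  fi
}

compose_common_srun_args yes
echo "srun $COMMON_SRUN_ARGS <command>"
```

Keeping the GRES flag in its own function means clusters without GRES simply get a shorter argument list instead of an srun error.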
Sequence Diagram(s)
sequenceDiagram
participant Init as startup
participant GRES as maybe_gres_arg
participant Composer as COMMON_SRUN_ARGS composer
participant Head as srun head
participant Worker as srun worker
participant Resolver as IP resolver
Init->>GRES: detect GRES (SLURM_NODE_GRES/GRES config)
GRES-->>Init: GRES_ARG or empty (validate GPUS_PER_NODE)
Init->>Composer: compose COMMON_SRUN_ARGS (include GRES_ARG, mounts, workdir, CPUS_PER_WORKER)
Composer-->>Head: head srun command (--ntasks=1 --cpus-per-task, GRES_ARG)
Composer-->>Worker: worker srun command (--ntasks=1 --cpus-per-task, --exact, GRES_ARG)
Head->>Resolver: resolve head IP (host/getent/nslookup/ping)
Worker->>Resolver: resolve worker IPs (build ip_addresses_array)
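The resolver step in the diagram names a chain of fallbacks (host/getent/nslookup/ping). A minimal sketch of that idea follows, trying only getent and host. The function name resolve_ip and the loose IPv4 check are assumptions, not taken from ray.sub.

```shell
#!/bin/sh
# Hypothetical sketch of hostname-to-IP resolution with fallbacks.

# $1: hostname or IPv4 literal. Prints one IPv4 address, or returns 1.
resolve_ip() {
  if [ -z "$1" ]; then
    return 1
  fi

  # Input that already looks like a dotted IPv4 address is passed through.
  case "$1" in
    *[!0-9.]*) ;;                                # contains other characters: resolve it
    *.*.*.*) printf '%s\n' "$1"; return 0 ;;
  esac

  # First choice: getent, which consults /etc/hosts and DNS via NSS.
  if command -v getent >/dev/null 2>&1; then
    ip=$(getent ahostsv4 "$1" 2>/dev/null | awk 'NR==1 {print $1}')
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
  fi

  # Fallback: the host utility, when installed.
  if command -v host >/dev/null 2>&1; then
    ip=$(host "$1" 2>/dev/null | awk '/has address/ {print $4; exit}')
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
  fi

  return 1
}

resolve_ip 127.0.0.1
```

A caller could loop over the allocated node names and collect the results, which is presumably what ip_addresses_array holds.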
Estimated code review effort: 4 (Complex), ~45 minutes
hemildesai approved these changes on Nov 11, 2025
zpqiu pushed a commit to sharonyu-115/RL that referenced this pull request on Nov 17, 2025:
…IA-NeMo#1505) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request on Nov 30, 2025:
…IA-NeMo#1505) Signed-off-by: Terry Kong <terryk@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request on Jan 8, 2026:
…IA-NeMo#1505) Signed-off-by: Terry Kong <terryk@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request on Feb 21, 2026:
…IA-NeMo#1505) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Reverts #1451
We discovered on our cluster that, while each Ray node claims all the CPUs, the CPU affinity of a Ray task for some reason covers only two of them rather than all of them. This reverts the problematic change until we can figure
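One way to observe this kind of symptom is to compare the CPUs a process is allowed to run on with the CPUs installed on the node. The sketch below does that with nproc from GNU coreutils. The function name cpu_affinity_report and the variables ALLOWED_CPUS and TOTAL_CPUS are hypothetical, and the check reports the affinity of the shell that runs it, so it would need to be launched from inside the srun step or Ray task under investigation.

```shell
#!/bin/sh
# Hypothetical diagnostic; not part of ray.sub.

cpu_affinity_report() {
  # nproc honors the affinity mask, but also OMP_NUM_THREADS and
  # OMP_THREAD_LIMIT, so those are unset in a subshell first.
  ALLOWED_CPUS=$( (unset OMP_NUM_THREADS OMP_THREAD_LIMIT; nproc) )
  # nproc --all counts installed processors regardless of affinity.
  TOTAL_CPUS=$(nproc --all)

  printf 'allowed=%s total=%s\n' "$ALLOWED_CPUS" "$TOTAL_CPUS"
  if [ "$ALLOWED_CPUS" -lt "$TOTAL_CPUS" ]; then
    printf 'warning: process is restricted to %s of %s CPUs\n' \
      "$ALLOWED_CPUS" "$TOTAL_CPUS" >&2
  fi
}

cpu_affinity_report
```

If the allowed count printed inside a task is far below the value passed to --cpus-per-task, the restriction is coming from the launch path rather than from the node itself.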