Skip to content

revert: "chore: improve ray.sub generalization across clusters"#1505

Merged
terrykong merged 1 commit intomainfrom
revert-1451-tk/ray-sub-lambda
Nov 11, 2025
Merged

revert: "chore: improve ray.sub generalization across clusters"#1505
terrykong merged 1 commit intomainfrom
revert-1451-tk/ray-sub-lambda

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

@terrykong terrykong commented Nov 11, 2025

Reverts #1451

We discovered on our cluster that while each ray node claims all the cpus, for some reason the CPU affinity of a ray task will not be all the CPUS and only two of them. This reverts this problematic change until we can figure

Summary by CodeRabbit

  • New Features

    • Automatic GPU resource detection with validation and clear logs
    • Dynamic CPU-per-worker defaulting based on GPUs and applied across cluster
  • Bug Fixes

    • Consistent task and CPU allocation for head and worker processes
    • Improved node IP discovery with multiple resolution methods and mapping
  • Improvements

    • Unified resource argument handling for all cluster commands and scripts
    • More robust cluster startup and validation behavior

@terrykong terrykong requested a review from a team as a code owner November 11, 2025 03:24
This reverts commit 855151b.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong force-pushed the revert-1451-tk/ray-sub-lambda branch from ccba26e to 8e21f2a Compare November 11, 2025 03:27
@terrykong terrykong added the CI:docs Run doctest label Nov 11, 2025
@terrykong terrykong requested a review from hemildesai November 11, 2025 03:27
@terrykong terrykong changed the title Revert "chore: improve ray.sub generalization across clusters" revert: "chore: improve ray.sub generalization across clusters" Nov 11, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 11, 2025

📝 Walkthrough

Walkthrough

Detects SLURM GRES support via a new maybe_gres_arg, composes COMMON_SRUN_ARGS dynamically (including GRES, container mounts, workdir), replaces static srun flags with --ntasks=1 and --cpus-per-task (CPUS_PER_WORKER = GPUS_PER_NODE*16), updates all srun uses, and improves node IP resolution with fallbacks and an ip_addresses_array.

Changes

Cohort / File(s) Summary
GRES detection & init
ray.sub
Adds maybe_gres_arg to detect SLURM GRES/gpu support, validate GPUS_PER_NODE against detected GRES GPUs, set GRES_ARG, and log presence/absence.
COMMON srun args composition
ray.sub
Replaces static COMMON_SRUN_ARGS with dynamic composition including GRES_ARG, container settings, mounts, container workdir; derives CPUS_PER_WORKER = GPUS_PER_NODE * 16.
srun invocation updates
ray.sub
Changes head/worker/utility srun calls to include --ntasks=1, --cpus-per-task=$CPUS_PER_WORKER; removes --exclusive, adds --exact for workers, and ensures GRES_ARG is applied consistently.
Attach / driver scripts
ray.sub
Ensures generated attach/driver commands include GRES_ARG, --ntasks=1, and --cpus-per-task for head and workers.
IP resolution & mapping
ray.sub
Enhances node IP resolution with fallbacks (host, getent, nslookup, ping), detailed logging, and builds ip_addresses_array mapping for head and workers.

Sequence Diagram(s)

sequenceDiagram
    participant Init as startup
    participant GRES as maybe_gres_arg
    participant Composer as COMMON_SRUN_ARGS composer
    participant Head as srun head
    participant Worker as srun worker
    participant Resolver as IP resolver

    Init->>GRES: detect GRES (SLURM_NODE_GRES/GRES config)
    GRES-->>Init: GRES_ARG or empty (validate GPUS_PER_NODE)
    Init->>Composer: compose COMMON_SRUN_ARGS (include GRES_ARG, mounts, workdir, CPUS_PER_WORKER)
    Composer-->>Head: head srun command (--ntasks=1 --cpus-per-task, GRES_ARG)
    Composer-->>Worker: worker srun command (--ntasks=1 --cpus-per-task, --exact, GRES_ARG)
    Head->>Resolver: resolve head IP (host/getent/nslookup/ping)
    Worker->>Resolver: resolve worker IPs (build ip_addresses_array)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review maybe_gres_arg detection and mismatch/error behavior between detected GRES and GPUS_PER_NODE.
  • Verify all srun call sites were updated to use --ntasks=1 and correct --cpus-per-task values (head, worker, monitoring, status checks).
  • Validate CPUS_PER_WORKER derivation and any assumptions (GPUS_PER_NODE presence/zero).
  • Confirm IP resolution fallbacks produce stable, consistent node-to-IP mappings in target environments.

Possibly related PRs

Suggested reviewers

  • hemildesai
  • parthmannan

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR is a major revert of ray.sub changes (SLURM GRES, GPU detection, CPU allocation) with no test results or testing documentation in the description. Add testing documentation including reason for revert, test results, validation that no regressions occur, and evidence the reverted code works correctly.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title clearly indicates this is a revert of a previous change (#1451), which is directly reflected in the raw_summary showing removal of GRES support and dynamic SLURM argument composition that were added in that change.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch revert-1451-tk/ray-sub-lambda

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a035bc and 8e21f2a.

📒 Files selected for processing (1)
  • ray.sub (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Docs_Tests
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment thread ray.sub
@terrykong terrykong requested a review from joyang-nv November 11, 2025 03:37
@terrykong terrykong enabled auto-merge (squash) November 11, 2025 03:40
@terrykong terrykong merged commit 6a40247 into main Nov 11, 2025
42 of 45 checks passed
@terrykong terrykong deleted the revert-1451-tk/ray-sub-lambda branch November 11, 2025 17:28
zpqiu pushed a commit to sharonyu-115/RL that referenced this pull request Nov 17, 2025
…IA-NeMo#1505)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…IA-NeMo#1505)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:docs Run doctest

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants