
chore: improve ray.sub generalization across clusters#1451

Merged
terrykong merged 3 commits into main from tk/ray-sub-lambda on Oct 31, 2025
Merged

chore: improve ray.sub generalization across clusters#1451
terrykong merged 3 commits intomainfrom
tk/ray-sub-lambda

Conversation

Collaborator

@terrykong terrykong commented Oct 31, 2025

What does this PR do ?

  1. Removes --cpus-per-task and just claims the whole node.
  2. Removes --gres, since GPUs are already claimed as part of --exclusive.

related #1339
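The two changes above can be sketched as a before/after of the srun resource flags. This is an illustrative sketch based only on the PR description; the variable names and the specific flag values (gpu:8, 16 CPUs) are assumptions, not the actual ray.sub contents.

```shell
# Illustrative before/after of the srun resource flags; a sketch based on the
# PR description, not the actual ray.sub contents.

# Before: fine-grained, cluster-specific claims (values here are examples)
BEFORE_ARGS="--gres=gpu:8 --cpus-per-task=16 --exact"

# After: claim the whole node; its GPUs and CPUs come along with it
AFTER_ARGS="--exclusive"

echo "srun ${AFTER_ARGS} ... ray start --head"
```

With --exclusive, SLURM allocates the entire node to the job, so there is no per-cluster GRES or CPU count to get wrong.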

Issues

List the issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • Refactor
    • Removed GPU detection and complex CPU/memory/GPU affinity specifications to simplify overall resource management.
    • Replaced dynamic resource handling with a streamlined, container-focused architecture supporting full-node resource allocation.
    • Updated cluster deployment pipelines and node launch commands for improved efficiency in distributed process management.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 31, 2025 19:31
Contributor

coderabbitai Bot commented Oct 31, 2025

📝 Walkthrough

Walkthrough

The ray.sub script is refactored to remove GPU resource detection and GRES-based affinity logic, replacing it with a simplified, container-focused approach using a centralized COMMON_SRUN_ARGS configuration that includes container mounting and --exclusive node resource claims.

Changes

SLURM invocation simplification — ray.sub
Removed the maybe_gres_arg() function and its GRES detection logic; introduced COMMON_SRUN_ARGS with container-related flags (--no-container-mount-home, --mpi=pmix, --container-mounts, --container-image, --container-workdir, --exclusive); removed --cpus-per-task, --exact, and GPU affinity arguments from head and worker node launches; simplified the status-check srun invocations; updated head and worker lifecycle management to use the centralized container arguments.
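The centralized arguments described above can be sketched as follows. The flag names come from the walkthrough; the container image and mount values below are placeholders, not values from the PR.

```shell
# Sketch of COMMON_SRUN_ARGS assembly; flag names come from the walkthrough,
# while the image and mount values below are placeholders.
CONTAINER_IMAGE="${CONTAINER_IMAGE:-nvcr.io/nvidia/pytorch:latest}"  # placeholder
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-$PWD:$PWD}"                    # placeholder

COMMON_SRUN_ARGS="--no-container-mount-home --mpi=pmix --container-mounts=${CONTAINER_MOUNTS} --container-image=${CONTAINER_IMAGE} --container-workdir=${PWD} --exclusive"

echo "$COMMON_SRUN_ARGS"
```

Every launch-type srun call (head, worker) then uses this single string, so container wiring and the whole-node claim stay consistent across invocations.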

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Focus areas:
    • Verify that removal of CPU/memory/GPU affinity logic (--cpus-per-task, GRES flags) doesn't break intended resource isolation in containerized environments
    • Confirm --exclusive flag properly claims full node resources as intended replacement for affinity-based logic
    • Ensure container argument propagation (COMMON_SRUN_ARGS) is applied consistently to all srun invocations (head launch, worker launch, status checks, attach scripts)
    • Validate that simplified wait conditions (-w flag changes) maintain correct ordering and node targeting

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
  • Test Results For Major Changes ⚠️ Warning — The PR significantly refactors resource allocation in ray.sub, removing GRES detection and CPU affinity specifications in favor of whole-node claims, but the description contains no test results and every contributor checklist item remains unchecked. The description should be updated with test results or validation evidence across cluster configurations, demonstrating that the simplified approach introduces no regressions.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed — The title "improve ray.sub generalization across clusters" aligns with the PR objectives: removing cluster-specific GRES detection and CPU affinity handling, dropping --cpus-per-task and --exact, and claiming resources generically with --exclusive. While high-level, it accurately captures the intent of the changeset.
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ray.sub (1)

104-108: Address or remove the TODO comment.

The debugging flags (-p and -A) are marked for deletion. These flags are redundant within the same SLURM job context since the partition and account are already inherited from the job allocation.

Should these be removed now, or is there a specific reason they're needed for debugging? If they're still required, please document why; otherwise, they should be removed to avoid confusion.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 1faa3cf.

📒 Files selected for processing (1)
  • ray.sub (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • ray.sub
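The NUM_RUNS formula quoted in the learning above is a standard integer ceiling division. A small worked example (the step values are arbitrary, chosen only to exercise the rounding):

```shell
# Worked example of the NUM_RUNS ceiling division from the learning above:
# NUM_RUNS = ceil(MAX_STEPS / STEPS_PER_RUN), done in integer arithmetic.
STEPS_PER_RUN=20
MAX_STEPS=50
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"  # ceil(50/20) = 3
```

Adding STEPS_PER_RUN - 1 before dividing rounds any partial run up to a full one, so the last partial batch of steps still gets its own run.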
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
ray.sub (5)

98-108: LGTM! Centralized resource allocation approach.

The introduction of COMMON_SRUN_ARGS with --exclusive effectively simplifies the resource allocation model by claiming whole nodes instead of managing fine-grained CPU/GPU/memory allocations. This approach generalizes better across different cluster configurations.

Also applies to: 278-278, 377-377


383-386: LGTM! Simplified wait condition.

The simplified srun --overlap call for checking the STARTED_RAY_HEAD file is appropriate. The --overlap flag correctly allows sharing the node with the existing ray-head container.


391-399: LGTM! Simplified status extraction.

The status check correctly uses --overlap and --container-name to attach to the running ray-head container. This is the appropriate way to query the Ray cluster status from within an existing container.


422-450: LGTM! Attach operations correctly use a subset of flags.

The attach script and driver launch commands intentionally don't use the full COMMON_SRUN_ARGS because they're attaching to already-running containers via --container-name. They correctly:

  • Use --overlap to share nodes with existing containers
  • Use --container-workdir to set the working directory
  • Omit --container-image and --container-mounts (already set in running containers)
  • Omit --exclusive (would conflict with existing containers)

This is the correct approach for attach operations.
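A hedged sketch of what such an attach invocation looks like, following the flag subset described in the comment above. The container name "ray-head", the workdir, and the HEAD_NODE variable are assumptions for illustration.

```shell
# Hypothetical attach invocation mirroring the review comment above; the
# container name "ray-head" and the HEAD_NODE variable are assumptions.
ATTACH_ARGS="--overlap --container-name=ray-head --container-workdir=${PWD}"

# Deliberately omitted: --container-image and --container-mounts (inherited
# from the already-running container) and --exclusive (which would conflict
# with the existing head allocation).
echo "srun ${ATTACH_ARGS} -w \${HEAD_NODE} bash"
```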


96-96: GPUS_PER_NODE is configurable, not hardcoded.

The variable uses bash parameter expansion ${GPUS_PER_NODE:-8}, allowing users to override the default via environment variable. Documentation already instructs users to determine the correct GPU count for their cluster and pass it when submitting jobs (e.g., GPUS_PER_NODE=N sbatch ray.sub). No action needed.

Likely an incorrect or invalid review comment.
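The ${GPUS_PER_NODE:-8} pattern mentioned above is plain bash parameter expansion: the environment value wins when set, otherwise the default applies.

```shell
# Demonstrating the ${VAR:-default} expansion behind GPUS_PER_NODE.
unset GPUS_PER_NODE
DEFAULTED="${GPUS_PER_NODE:-8}"    # variable unset -> default of 8

GPUS_PER_NODE=4
OVERRIDDEN="${GPUS_PER_NODE:-8}"   # variable set -> override wins

echo "$DEFAULTED $OVERRIDDEN"
```

This is why `GPUS_PER_NODE=N sbatch ray.sub` works without any edits to the script itself.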

Contributor

@parthmannan parthmannan left a comment


LGTM. Thanks for adding this.

@terrykong terrykong added the CI:docs Run doctest label Oct 31, 2025
@terrykong terrykong enabled auto-merge (squash) October 31, 2025 21:40
@terrykong terrykong merged commit 855151b into main Oct 31, 2025
42 of 43 checks passed
@terrykong terrykong deleted the tk/ray-sub-lambda branch October 31, 2025 21:46
terrykong added a commit that referenced this pull request Nov 2, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
terrykong added a commit that referenced this pull request Nov 11, 2025
terrykong added a commit that referenced this pull request Nov 11, 2025
This reverts commit 855151b.

Signed-off-by: Terry Kong <terryk@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>