
chore: improve ray.sub generalization across clusters#1451

Merged
terrykong merged 3 commits into main from tk/ray-sub-lambda on Oct 31, 2025
Merged

chore: improve ray.sub generalization across clusters#1451
terrykong merged 3 commits intomainfrom
tk/ray-sub-lambda

Conversation

Collaborator

@terrykong terrykong commented Oct 31, 2025

What does this PR do ?

  1. Removes --cpus-per-task and just claims the whole node.
  2. Removes --gres, since GPUs are already claimed as part of --exclusive.

related #1339
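The two changes above can be sketched as a before/after of the srun resource flags. This is an illustrative sketch based only on the PR description; the variable names and the specific flag values (gpu:8, 16 CPUs) are assumptions, not the actual ray.sub contents.

```shell
# Illustrative before/after of the srun resource flags; a sketch based on the
# PR description, not the actual ray.sub contents.

# Before: fine-grained, cluster-specific claims (values here are examples)
BEFORE_ARGS="--gres=gpu:8 --cpus-per-task=16 --exact"

# After: claim the whole node; its GPUs and CPUs come along with it
AFTER_ARGS="--exclusive"

echo "srun ${AFTER_ARGS} ... ray start --head"
```

With --exclusive, SLURM allocates the entire node to the job, so there is no per-cluster GRES or CPU count to get wrong.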

Issues

List the issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • Refactor
    • Removed GPU detection and complex CPU/memory/GPU affinity specifications to simplify overall resource management.
    • Replaced dynamic resource handling with a streamlined, container-focused architecture supporting full-node resource allocation.
    • Updated cluster deployment pipelines and node launch commands for improved efficiency in distributed process management.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 31, 2025 19:31
Contributor

coderabbitai Bot commented Oct 31, 2025

📝 Walkthrough

Walkthrough

The ray.sub script is refactored to remove GPU resource detection and GRES-based affinity logic, replacing it with a simplified, container-focused approach using a centralized COMMON_SRUN_ARGS configuration that includes container mounting and --exclusive node resource claims.

Changes

SLURM invocation simplification — ray.sub
Removed the maybe_gres_arg() function and its GRES detection logic; introduced COMMON_SRUN_ARGS with container-related flags (--no-container-mount-home, --mpi=pmix, --container-mounts, --container-image, --container-workdir, --exclusive); removed --cpus-per-task, --exact, and GPU affinity arguments from head and worker node launches; simplified the status-check srun invocations; updated head and worker lifecycle management to use the centralized container arguments.
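The centralized arguments described above can be sketched as follows. The flag names come from the walkthrough; the container image and mount values below are placeholders, not values from the PR.

```shell
# Sketch of COMMON_SRUN_ARGS assembly; flag names come from the walkthrough,
# while the image and mount values below are placeholders.
CONTAINER_IMAGE="${CONTAINER_IMAGE:-nvcr.io/nvidia/pytorch:latest}"  # placeholder
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-$PWD:$PWD}"                    # placeholder

COMMON_SRUN_ARGS="--no-container-mount-home --mpi=pmix --container-mounts=${CONTAINER_MOUNTS} --container-image=${CONTAINER_IMAGE} --container-workdir=${PWD} --exclusive"

echo "$COMMON_SRUN_ARGS"
```

Every launch-type srun call (head, worker) then uses this single string, so container wiring and the whole-node claim stay consistent across invocations.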

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Focus areas:
    • Verify that removal of CPU/memory/GPU affinity logic (--cpus-per-task, GRES flags) doesn't break intended resource isolation in containerized environments
    • Confirm --exclusive flag properly claims full node resources as intended replacement for affinity-based logic
    • Ensure container argument propagation (COMMON_SRUN_ARGS) is applied consistently to all srun invocations (head launch, worker launch, status checks, attach scripts)
    • Validate that simplified wait conditions (-w flag changes) maintain correct ordering and node targeting

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
  • Test Results For Major Changes ⚠️ Warning — The PR significantly refactors resource allocation in ray.sub, removing GRES detection and CPU affinity specifications in favor of whole-node claims, but the description contains no test results and every contributor checklist item remains unchecked. The description should be updated with test results or validation evidence across cluster configurations, demonstrating that the simplified approach introduces no regressions.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed — The title "improve ray.sub generalization across clusters" aligns with the PR objectives: removing cluster-specific GRES detection and CPU affinity handling, dropping --cpus-per-task and --exact, and claiming resources generically with --exclusive. While high-level, it accurately captures the intent of the changeset.
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ray.sub (1)

104-108: Address or remove the TODO comment.

The debugging flags (-p and -A) are marked for deletion. These flags are redundant within the same SLURM job context since the partition and account are already inherited from the job allocation.

Should these be removed now, or is there a specific reason they're needed for debugging? If they're still required, please document why; otherwise, they should be removed to avoid confusion.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 1faa3cf.

📒 Files selected for processing (1)
  • ray.sub (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • ray.sub
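The NUM_RUNS formula quoted in the learning above is a standard integer ceiling division. A small worked example (the step values are arbitrary, chosen only to exercise the rounding):

```shell
# Worked example of the NUM_RUNS ceiling division from the learning above:
# NUM_RUNS = ceil(MAX_STEPS / STEPS_PER_RUN), done in integer arithmetic.
STEPS_PER_RUN=20
MAX_STEPS=50
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "$NUM_RUNS"  # ceil(50/20) = 3
```

Adding STEPS_PER_RUN - 1 before dividing rounds any partial run up to a full one, so the last partial batch of steps still gets its own run.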
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
ray.sub (5)

98-108: LGTM! Centralized resource allocation approach.

The introduction of COMMON_SRUN_ARGS with --exclusive effectively simplifies the resource allocation model by claiming whole nodes instead of managing fine-grained CPU/GPU/memory allocations. This approach generalizes better across different cluster configurations.

Also applies to: 278-278, 377-377


383-386: LGTM! Simplified wait condition.

The simplified srun --overlap call for checking the STARTED_RAY_HEAD file is appropriate. The --overlap flag correctly allows sharing the node with the existing ray-head container.


391-399: LGTM! Simplified status extraction.

The status check correctly uses --overlap and --container-name to attach to the running ray-head container. This is the appropriate way to query the Ray cluster status from within an existing container.


422-450: LGTM! Attach operations correctly use a subset of flags.

The attach script and driver launch commands intentionally don't use the full COMMON_SRUN_ARGS because they're attaching to already-running containers via --container-name. They correctly:

  • Use --overlap to share nodes with existing containers
  • Use --container-workdir to set the working directory
  • Omit --container-image and --container-mounts (already set in running containers)
  • Omit --exclusive (would conflict with existing containers)

This is the correct approach for attach operations.
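A hedged sketch of what such an attach invocation looks like, following the flag subset described in the comment above. The container name "ray-head", the workdir, and the HEAD_NODE variable are assumptions for illustration.

```shell
# Hypothetical attach invocation mirroring the review comment above; the
# container name "ray-head" and the HEAD_NODE variable are assumptions.
ATTACH_ARGS="--overlap --container-name=ray-head --container-workdir=${PWD}"

# Deliberately omitted: --container-image and --container-mounts (inherited
# from the already-running container) and --exclusive (which would conflict
# with the existing head allocation).
echo "srun ${ATTACH_ARGS} -w \${HEAD_NODE} bash"
```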


96-96: GPUS_PER_NODE is configurable, not hardcoded.

The variable uses bash parameter expansion ${GPUS_PER_NODE:-8}, allowing users to override the default via environment variable. Documentation already instructs users to determine the correct GPU count for their cluster and pass it when submitting jobs (e.g., GPUS_PER_NODE=N sbatch ray.sub). No action needed.

Likely an incorrect or invalid review comment.
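The ${GPUS_PER_NODE:-8} pattern mentioned above is plain bash parameter expansion: the environment value wins when set, otherwise the default applies.

```shell
# Demonstrating the ${VAR:-default} expansion behind GPUS_PER_NODE.
unset GPUS_PER_NODE
DEFAULTED="${GPUS_PER_NODE:-8}"    # variable unset -> default of 8

GPUS_PER_NODE=4
OVERRIDDEN="${GPUS_PER_NODE:-8}"   # variable set -> override wins

echo "$DEFAULTED $OVERRIDDEN"
```

This is why `GPUS_PER_NODE=N sbatch ray.sub` works without any edits to the script itself.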

Contributor

@parthmannan parthmannan left a comment


LGTM. Thanks for adding this.

@terrykong terrykong added the CI:docs Run doctest label Oct 31, 2025
@terrykong terrykong enabled auto-merge (squash) October 31, 2025 21:40
@terrykong terrykong merged commit 855151b into main Oct 31, 2025
42 of 43 checks passed
@terrykong terrykong deleted the tk/ray-sub-lambda branch October 31, 2025 21:46
terrykong added a commit that referenced this pull request Nov 2, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
terrykong added a commit that referenced this pull request Nov 11, 2025
terrykong added a commit that referenced this pull request Nov 11, 2025
This reverts commit 855151b.

Signed-off-by: Terry Kong <terryk@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>