chore: improve ray.sub generalization across clusters #1451
Conversation
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 Walkthrough
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ray.sub (1)
104-108: Address or remove the TODO comment. The debugging flags (`-p` and `-A`) are marked for deletion. These flags are redundant within the same SLURM job context, since the partition and account are already inherited from the job allocation. Should these be removed now, or is there a specific reason they're needed for debugging? If they're still required, please document why; otherwise, remove them to avoid confusion.
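To illustrate the redundancy: inside an sbatch allocation, SLURM exports the job's partition and account to the job environment, so an inner srun inherits them without explicit flags. This is a minimal sketch; the default values below only simulate a SLURM environment and are not taken from ray.sub.

```shell
#!/bin/bash
# Inside an sbatch allocation, SLURM sets these for the job; the defaults
# here only stand in for that environment when run outside a cluster.
SLURM_JOB_PARTITION="${SLURM_JOB_PARTITION:-batch}"
SLURM_JOB_ACCOUNT="${SLURM_JOB_ACCOUNT:-myaccount}"

# Redundant form (what the TODO marks for deletion):
echo "srun -p ${SLURM_JOB_PARTITION} -A ${SLURM_JOB_ACCOUNT} hostname"
# Equivalent form relying on inheritance from the job allocation:
echo "srun hostname"
```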
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ray.sub (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
ray.sub
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
ray.sub (5)
98-108: LGTM! Centralized resource allocation approach. The introduction of `COMMON_SRUN_ARGS` with `--exclusive` effectively simplifies the resource allocation model by claiming whole nodes instead of managing fine-grained CPU/GPU/memory allocations. This approach generalizes better across different cluster configurations. Also applies to: 278-278, 377-377
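As a hedged sketch of the pattern (only `--exclusive` is confirmed by the review; the other flags and the consumer command are assumptions for illustration):

```shell
# Hypothetical sketch: one shared argument bundle instead of per-step
# resource flags. --exclusive claims the whole node, so fine-grained
# CPU/GPU/memory carve-outs are unnecessary.
COMMON_SRUN_ARGS="--exclusive --nodes=1 --ntasks=1"

# Consumers expand the same bundle, keeping invocations consistent
# across clusters:
LAUNCH_CMD="srun ${COMMON_SRUN_ARGS} bash -c 'ray start --head'"
echo "${LAUNCH_CMD}"
```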
383-386: LGTM! Simplified wait condition. The simplified `srun --overlap` call for checking the `STARTED_RAY_HEAD` file is appropriate. The `--overlap` flag correctly allows sharing the node with the existing ray-head container.
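The wait pattern can be sketched as below. The marker path and the `SRUN` indirection are assumptions added so the loop can be exercised off-cluster; ray.sub calls srun directly.

```shell
# SRUN is overridable so the loop can run without a SLURM cluster
# (assumed indirection, not the script's literal code).
SRUN="${SRUN:-srun --overlap --nodes=1 --ntasks=1}"
STARTED_RAY_HEAD="${STARTED_RAY_HEAD:-/tmp/started_ray_head}"

wait_for_ray_head() {
  # --overlap lets this step share the head node with the ray-head container
  # while polling for the marker file the head writes when it is up.
  until ${SRUN} test -f "${STARTED_RAY_HEAD}"; do
    sleep 2
  done
}
```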
391-399: LGTM! Simplified status extraction. The status check correctly uses `--overlap` and `--container-name` to attach to the running ray-head container. This is the appropriate way to query the Ray cluster status from within an existing container.
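For reference, the shape of such a status query (the container name `ray-head` comes from the review; the rest is an assumed sketch, not the script's exact command):

```shell
# --overlap shares the existing allocation; --container-name re-enters the
# already-running head container to run `ray status` inside it.
STATUS_CMD="srun --overlap --container-name=ray-head bash -c 'ray status'"
echo "${STATUS_CMD}"
```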
422-450: LGTM! Attach operations correctly use a subset of flags. The attach script and driver launch commands intentionally don't use the full `COMMON_SRUN_ARGS` because they're attaching to already-running containers via `--container-name`. They correctly:
- Use `--overlap` to share nodes with existing containers
- Use `--container-workdir` to set the working directory
- Omit `--container-image` and `--container-mounts` (already set in running containers)
- Omit `--exclusive` (would conflict with existing containers)
This is the correct approach for attach operations.
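The flag split above can be sketched side by side. The IMAGE, mount, and workdir values are placeholders; the real values live in ray.sub.

```shell
# Launch step: needs the image/mounts and exclusive node ownership
# (placeholder values for illustration only).
LAUNCH_ARGS="--exclusive --container-image=IMAGE --container-mounts=SRC:DST"

# Attach step: re-enter the running container, so no image, no mounts,
# and no --exclusive; --overlap shares the node instead of claiming it.
ATTACH_ARGS="--overlap --container-name=ray-head --container-workdir=/workdir"

echo "srun ${ATTACH_ARGS} bash"
```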
96-96: GPUS_PER_NODE is configurable, not hardcoded. The variable uses bash parameter expansion `${GPUS_PER_NODE:-8}`, allowing users to override the default via environment variable. Documentation already instructs users to determine the correct GPU count for their cluster and pass it when submitting jobs (e.g., `GPUS_PER_NODE=N sbatch ray.sub`). No action needed. Likely an incorrect or invalid review comment.
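The parameter-expansion pattern in question, as a standalone sketch:

```shell
# ${GPUS_PER_NODE:-8}: take the environment value when set and non-empty,
# otherwise fall back to a default of 8.
GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
echo "using ${GPUS_PER_NODE} GPUs per node"
```

So a cluster with 4 GPUs per node would submit with `GPUS_PER_NODE=4 sbatch ray.sub`, matching the documented workflow.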
parthmannan
left a comment
LGTM. Thanks for adding this.
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Lawrence Lane <llane@nvidia.com>
This reverts commit 855151b. Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
related #1339
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes