feat: vLLM multinode elastic EP scaling support test#8183

Merged
tzulingk merged 2 commits into main from tzulingk/multinode-elastic-ep on Apr 20, 2026

@tzulingk tzulingk commented Apr 14, 2026

Overview:

Adds test infrastructure for validating vLLM's native elastic expert parallelism (elastic EP) across multiple nodes, without Dynamo operator involvement. This is a bare vLLM test that confirms the cross-node scaling capability works before wiring it into the operator.

Validated on AKS with 2 nodes and deepseek-ai/DeepSeek-V2-Lite (vLLM 0.18.1rc1.dev59+g2488a82f8): all 6 scale steps dp=2→3→4→3→2→4→2 succeeded. Inference verified at dp=3.

Details:

bare_multinode_elastic_ep.yaml — Kubernetes pod specs for the warm-standby topology:

  • Leader pod (Node 1, 2 GPUs): starts Ray head, then vLLM at dp=2 — both DP workers land on Node 1
  • Worker pod (Node 2, 2 GPUs): polls /health every 15s, joins Ray only after vLLM is serving — GPUs stay idle until scale-up
  • Headless service so the worker can reach the Ray head by DNS
  • NCCL configured for TCP socket transport (NCCL_IB_DISABLE=1, NCCL_SOCKET_IFNAME=eth0) — AKS IB GIDs are fe80:: link-local and not cross-pod routable
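
The worker's warm-standby gate described above can be sketched roughly as follows. This is a minimal illustration, not the script embedded in the manifest: the function name `wait_for_leader_then_join` and the `POLL_INTERVAL` and `MAX_POLLS` variables are hypothetical, and the upper bound on polls is an assumption. Only the `/health` endpoint, the 15s interval, and the two NCCL settings come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the worker pod's warm-standby gate; names are illustrative.

# NCCL settings quoted in the PR description (TCP sockets instead of InfiniBand).
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

# Poll "<leader_url>/health" until it succeeds, then run the join command.
# Arguments: leader URL, followed by the command (and arguments) that joins Ray.
wait_for_leader_then_join() {
  local leader_url="$1"; shift
  local interval="${POLL_INTERVAL:-15}"   # seconds between polls (15s per the PR)
  local max_polls="${MAX_POLLS:-240}"     # assumed upper bound, not from the PR
  local i
  for ((i = 1; i <= max_polls; i++)); do
    # --fail makes curl exit non-zero on HTTP >= 400; --max-time bounds each probe.
    if curl --silent --fail --max-time 5 "${leader_url}/health" >/dev/null; then
      echo "leader healthy after ${i} poll(s); joining Ray"
      "$@"
      return $?
    fi
    sleep "$interval"
  done
  echo "leader never became healthy" >&2
  return 1
}
```

A call might look like `wait_for_leader_then_join "http://<leader-service>:8000" ray start --address="<leader-service>:6379" --block`, where `<leader-service>` stands for the headless service name. Gating on `/health` rather than on Ray head availability is what keeps the worker GPUs out of the initial dp=2 placement.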

run_bare_multinode_elastic_ep_scale_test.sh — scale test driver:

  • Waits for leader pod ready and vLLM /health before starting
  • Blocks until the worker joins the Ray cluster (guards against Python bytecode compilation delay of up to 10 min on first start)
  • Drives 6 scale steps via POST /scale_elastic_ep with configurable timeouts
  • Captures nvidia-smi from both pods and runs inference after each step
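
The step loop in the driver could look something like the sketch below. It is not the script from this PR: `post_scale`, `run_scale_steps`, `BASE_URL`, and `SCALE_TIMEOUT` are hypothetical names, the 600s default is an assumption, and the JSON field `new_data_parallel_size` is an assumption about the vLLM endpoint that should be checked against the vLLM version under test. The nvidia-smi capture and inference call after each step are omitted for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a scale-step driver; not the script from this PR.

BASE_URL="${BASE_URL:-http://localhost:8001}"   # port-forwarded leader
SCALE_TIMEOUT="${SCALE_TIMEOUT:-600}"           # assumed per-step timeout, seconds

# POST one scale request. The JSON field name is an assumption about the
# vLLM endpoint and should be verified against the version being tested.
post_scale() {
  local dp="$1"
  curl --silent --show-error --fail --max-time "$SCALE_TIMEOUT" \
    -X POST "${BASE_URL}/scale_elastic_ep" \
    -H 'Content-Type: application/json' \
    -d "{\"new_data_parallel_size\": ${dp}}"
}

# Walk through the target DP sizes in order, stopping at the first failure.
run_scale_steps() {
  local step=0 dp
  for dp in "$@"; do
    step=$((step + 1))
    echo "step ${step}: scaling to dp=${dp}"
    if ! post_scale "$dp" >/dev/null; then
      echo "step ${step} (dp=${dp}) failed" >&2
      return 1
    fi
  done
  echo "all ${step} steps complete"
}
```

Starting from the dp=2 baseline, the sequence validated in this PR would correspond to `run_scale_steps 3 4 3 2 4 2`.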

Where should the reviewer start?

Summary by CodeRabbit

  • Tests
    • Added test infrastructure for multi-node elastic data-parallel scaling validation in vLLM environments, including Kubernetes manifests and test orchestration scripts.

Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>
@tzulingk tzulingk requested a review from a team as a code owner April 14, 2026 19:41
@github-actions github-actions Bot added the feat, documentation, and deployment::k8s labels Apr 14, 2026
@tzulingk tzulingk force-pushed the tzulingk/multinode-elastic-ep branch from 1bd289c to 5b22dab Compare April 14, 2026 19:42
@tzulingk tzulingk requested a review from nnshah1 April 14, 2026 19:43
coderabbitai Bot commented Apr 14, 2026

Walkthrough

Adds test infrastructure for vLLM's cross-node elastic data-parallel scaling: a Kubernetes manifest defining a headless Service and two Pods (leader and worker) with Ray and vLLM configuration, plus a Bash orchestration script that validates pod readiness, forwards ports, confirms cluster formation, and executes scaling test sequences with health checks and inference validation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kubernetes Test Manifest: tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml | New manifest defining a headless Service exposing Ray (6379) and vLLM (8000), a leader Pod installing Ray, starting head node, and launching vLLM with elastic EP enabled and 2 GPUs, and a worker Pod joining the Ray cluster after health endpoint verification. Both Pods mount /dev/shm, apply GPU tolerations and node selectors. |
| Test Orchestration Script: tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh | New Bash script orchestrating the elastic EP scaling test: validates pod readiness, establishes port-forward to leader (localhost 8001 → pod 8000), waits for Ray cluster formation, defines helper functions for GPU snapshots and inference validation, and executes scaling sequence (dp: 2 → 3 → 4 → 3 → 2 → 4 → 2) with health checks and inference calls. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The PR description includes all required template sections (Overview, Details, Where should the reviewer start) with comprehensive information about the changes, rationale, and validation details. |
| Title check | ✅ Passed | The title clearly and concisely describes the primary change: adding test support for vLLM's multinode elastic EP scaling capability. |

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


@tzulingk tzulingk changed the title from "feat(operator): vLLM multinode elastic EP scaling support" to "feat: vLLM multinode elastic EP scaling support" Apr 14, 2026
@tzulingk tzulingk changed the title from "feat: vLLM multinode elastic EP scaling support" to "feat: vLLM multinode elastic EP scaling support test" Apr 14, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh (1)

33-33: Use a dynamic local port and guaranteed cleanup for the port-forward loop.

Requiring port 8001 to be free is brittle. The broad pkill -f can kill an unrelated kubectl port-forward using the same mapping, and the exit 1 path in wait_worker_in_ray leaves the port-forward (PF) running because cleanup only happens on the happy path. Allocate a free port and register an EXIT trap for the background loop. Based on learnings, hard-coded constants that reduce portability should be avoided in shell scripts; prefer dynamic port allocation such as alloc_port over static ports.

Also applies to: 70-79, 123-124, 204-204
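
One way to address this nitpick is sketched below, under stated assumptions. The repository's `alloc_port` helper is not shown in this conversation, so `pick_free_port` is a hypothetical stand-in that probes a port range through bash's `/dev/tcp` pseudo-device; `start_port_forward` and `cleanup_port_forward` are likewise illustrative names. The sketch backgrounds a single command rather than a retry loop.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pick a local port and guarantee port-forward cleanup.

# Print the first port in [start, end] that nothing is listening on.
# A failed connect via /dev/tcp is treated as "free" (inherently racy).
pick_free_port() {
  local start="${1:-18000}" end="${2:-18100}" port
  for ((port = start; port <= end; port++)); do
    if ! (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
      echo "$port"
      return 0
    fi
  done
  return 1
}

PF_PID=""

# Kill only the process we started, instead of a broad `pkill -f` match.
cleanup_port_forward() {
  if [ -n "$PF_PID" ] && kill -0 "$PF_PID" 2>/dev/null; then
    kill "$PF_PID" 2>/dev/null || true
    wait "$PF_PID" 2>/dev/null || true
  fi
  PF_PID=""
}

# Start the given command in the background and register cleanup on EXIT,
# so early `exit 1` paths tear it down as well.
start_port_forward() {
  "$@" &
  PF_PID=$!
  trap cleanup_port_forward EXIT
}
```

Usage might look like `LOCAL_PORT="$(pick_free_port)"` followed by `start_port_forward kubectl port-forward "pod/<leader-pod>" "${LOCAL_PORT}:8000"`, where `<leader-pod>` is a placeholder. Because the trap fires on any exit, no failure path needs its own kill call.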

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh`
at line 33, The script currently uses a hard-coded port 8001 and an unsafe
pkill/cleanup pattern; replace all uses of the static port (lines referenced)
with a dynamically allocated port via alloc_port, start the kubectl port-forward
loop in background capturing its PID into a variable (e.g., PF), and register a
trap on EXIT that kills PF to guarantee cleanup; also update wait_worker_in_ray
to ensure it kills PF before any early exit (remove reliance on pkill -f and
broad process matching) so the port-forward loop is always terminated whether
the function succeeds or errors.
tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml (1)

49-50: Avoid baking AKS VMSS hostnames into a templates manifest.

These kubernetes.io/hostname selectors only schedule on the author's two AKS node names, so the pods will sit Pending anywhere else. If this file is meant to be reusable, parameterize the hostnames or switch to stable node labels/affinity.

Also applies to: 154-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml`
around lines 49 - 50, The manifest currently hardcodes a VMSS host via the
nodeSelector key (kubernetes.io/hostname), which will leave pods Pending on
other clusters; update the templates to remove the fixed kubernetes.io/hostname
selector and either (a) replace it with a parameterized value (expose a template
variable like nodeHostname or nodeSelectorMap and use that in the manifest) or
(b) use stable labels or nodeAffinity (e.g., a custom node label such as
gpu-type: a100 or a nodeAffinity rule) so scheduling works across clusters;
apply the same change for the second occurrence of the kubernetes.io/hostname
selector in this template.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml`:
- Around line 123-126: Replace the fixed "sleep 8" delay with a readiness loop
that probes the Ray head's port before starting vLLM: after launching the head
(the block that runs "ray start --head --port=6379 --block &" and sets
RAY_HEAD_PID), poll until the head is accepting connections on port 6379 (e.g.,
with nc/ss/tcp probe or "ray status" if available), include a sensible timeout
and error handling, and only echo "=== Ray head started ... launching vllm serve
===" and run "vllm serve --data-parallel-backend ray" after the probe succeeds
to avoid the race.
- Around line 194-206: The worker script hard-codes the namespace in LEADER_URL
and the ray start address, breaking service discovery in other namespaces;
change LEADER_URL and the ray start target to use same-namespace resolution
(e.g., reference the service by its short DNS name or build it using the pod's
current namespace from an env var like POD_NAMESPACE) so both the curl health
check against LEADER_URL and the ray start --address use the service without the
fixed "tzulingk-multinode-elastic" namespace (update the LEADER_URL definition
and the ray start invocation where they appear).
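
As a rough illustration of the readiness loop suggested for the Ray head, the sketch below waits for a TCP port to accept connections instead of sleeping for a fixed 8 seconds. The helper name `wait_for_port` and its default timeout are hypothetical, and it relies on bash's `/dev/tcp` support rather than on any Ray command.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: wait for a TCP port instead of a fixed `sleep 8`.

# Succeed as soon as host:port accepts a TCP connection; fail after timeout_s.
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-60}"
  local deadline=$((SECONDS + timeout_s))
  while ((SECONDS < deadline)); do
    # Opening /dev/tcp in a subshell attempts a connect and closes it on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  echo "timed out waiting for ${host}:${port}" >&2
  return 1
}
```

In the leader script this would sit between the backgrounded `ray start --head --port=6379 --block &` and the `vllm serve` launch, for example `wait_for_port 127.0.0.1 6379 120 || exit 1`. An open port only shows that the head is listening, so a stricter check may still be worth adding.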

In
`@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh`:
- Line 7: The script header in run_bare_multinode_elastic_ep_scale_test.sh
contains stale usage/manifest filenames (references to pre-rename
manifest/script names) — update the usage text and any commented example file
paths in the header (including the block around lines 24-32) to the current
manifest and script names used in this PR so they match the actual files (search
for the old filenames in the header and replace them with the new
manifest/script names referenced elsewhere in the repo).
- Around line 146-177: The test script lets failed HTTP calls silently pass;
update infer() and scale() to detect curl failures and non-2xx responses and
exit non‑zero so the test fails: in infer() check curl’s exit status and parse
RESP for error conditions (empty/malformed JSON or HTTP error fields) and call
exit 1 on failure; in scale() check curl’s exit status and ensure RESP indicates
success (non-error HTTP/body) and call exit 1 on failure; also propagate these
failures from callers (baseline/step/any place invoking infer()/scale()) so the
script stops and does not print "ALL STEPS COMPLETE" on timeouts or 4xx/5xx
responses.
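
A minimal sketch of the failure handling requested for infer() and scale() follows. It is an assumption-laden illustration rather than the fix that landed: `http_ok` and `HTTP_TIMEOUT` are hypothetical names, and treating an empty body as an error only suits endpoints expected to return JSON.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: make HTTP failures fatal to the test run.

# Run curl with the given arguments, then return non-zero on transport errors,
# non-2xx statuses, or an empty body. Prints the body on success.
http_ok() {
  local out code body nl=$'\n'
  # -w appends the status code on its own final line after the body.
  if ! out="$(curl --silent --show-error --max-time "${HTTP_TIMEOUT:-120}" \
                   -w '\n%{http_code}' "$@")"; then
    echo "curl transport failure: $*" >&2
    return 1
  fi
  code="${out##*"$nl"}"   # last line: HTTP status code
  body="${out%"$nl"*}"    # everything before it: response body
  if [[ ! "$code" =~ ^2[0-9][0-9]$ ]]; then
    echo "HTTP ${code}: ${body}" >&2
    return 1
  fi
  if [ -z "$body" ]; then
    echo "empty response body" >&2
    return 1
  fi
  printf '%s\n' "$body"
}
```

A caller would then write something like `RESP="$(http_ok -X POST "$URL" -H 'Content-Type: application/json' -d "$PAYLOAD")" || exit 1`, so a timeout or a 4xx/5xx response stops the run before any completion banner is printed.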

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 286c07aa-8fd5-4363-b43d-c548fc1a7551

📥 Commits

Reviewing files that changed from the base of the PR and between e2371a4 and 5b22dab.

📒 Files selected for processing (2)
  • tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml
  • tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh

Comment thread tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml Outdated
…hardcoded namespace, fix stale filenames, fail test on curl errors

Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>

@nnshah1 nnshah1 left a comment

LGTM

@tzulingk tzulingk merged commit f460f4e into main Apr 20, 2026
88 checks passed
@tzulingk tzulingk deleted the tzulingk/multinode-elastic-ep branch April 20, 2026 14:49
Labels

deployment::k8s (Relates to dynamo deployment in kubernetes), documentation (Improvements or additions to documentation), feat, size/L

2 participants