feat: vLLM multinode elastic EP scaling support test by tzulingk · Pull Request #8183 · ai-dynamo/dynamo

tzulingk · 2026-04-14T19:41:37Z

Overview:

Adds test infrastructure for validating vLLM's native elastic expert parallelism (elastic EP) across multiple nodes, without Dynamo operator involvement. This is a bare vLLM test that confirms the cross-node scaling capability works before wiring it into the operator.

Validated on AKS with 2 nodes and deepseek-ai/DeepSeek-V2-Lite (vLLM 0.18.1rc1.dev59+g2488a82f8): all 6 scale steps dp=2→3→4→3→2→4→2 succeeded. Inference verified at dp=3.

Details:

bare_multinode_elastic_ep.yaml: Kubernetes pod specs for the warm-standby topology:

- Leader pod (Node 1, 2 GPUs): starts the Ray head, then vLLM at dp=2, so both DP workers land on Node 1
- Worker pod (Node 2, 2 GPUs): polls /health every 15s and joins Ray only after vLLM is serving, so its GPUs stay idle until scale-up
- Headless service so the worker can reach the Ray head by DNS
- NCCL configured for TCP socket transport (NCCL_IB_DISABLE=1, NCCL_SOCKET_IFNAME=eth0), because AKS IB GIDs are fe80:: link-local and not cross-pod routable
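
The worker's warm-standby gate can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the manifest's actual script: the function name `wait_for_leader_health`, the `leader-svc` host name, and the attempt limit are hypothetical. Only the /health endpoint, the 15s interval, ports 6379 and 8000, and the two NCCL variables come from the PR text.

```shell
#!/usr/bin/env bash
# Sketch of the worker-side gate: stay out of the Ray cluster until the
# leader's vLLM server answers /health. Names and defaults are illustrative.

# NCCL settings quoted in the PR description (TCP sockets instead of IB).
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

# wait_for_leader_health URL [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Returns 0 once GET URL succeeds, 1 if every poll fails.
wait_for_leader_health() {
  local url="$1" interval="${2:-15}" max_attempts="${3:-120}" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      return 0
    fi
    echo "leader not healthy yet (attempt ${attempt}/${max_attempts})" >&2
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example wiring (commented out; the service host name is an assumption):
# wait_for_leader_health "http://leader-svc:8000/health" 15 \
#   && ray start --address="leader-svc:6379" --block
```

Bounding the number of attempts means a leader that never becomes healthy produces a failed worker pod rather than one that polls forever.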

run_bare_multinode_elastic_ep_scale_test.sh: scale test driver:

- Waits for the leader pod to be ready and for vLLM /health before starting
- Blocks until the worker joins the Ray cluster (guards against a Python bytecode compilation delay of up to 10 min on first start)
- Drives 6 scale steps via POST /scale_elastic_ep with configurable timeouts
- Captures nvidia-smi from both pods and runs inference after each step
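
The scale-step loop could look roughly like the sketch below. It is not the PR's script: `scale_to`, `run_scale_sequence`, and the timeout default are hypothetical names and values, and the `new_data_parallel_size` JSON field is an assumption about the request body that should be checked against the vLLM version in use. The `localhost:8001` base URL follows the port-forward mapping described in the review walkthrough.

```shell
#!/usr/bin/env bash
# Sketch of a scale-step driver. Helper names, the timeout default, and
# the JSON field name are assumptions, not taken from the PR's script.

BASE_URL="${BASE_URL:-http://localhost:8001}"

# scale_to NEW_DP [REQUEST_TIMEOUT_SECONDS]
# POSTs one scale request; returns non-zero on transport or HTTP errors.
scale_to() {
  local new_dp="$1" timeout_s="${2:-600}"
  curl --silent --show-error --fail --max-time "$timeout_s" \
    -X POST "${BASE_URL}/scale_elastic_ep" \
    -H "Content-Type: application/json" \
    -d "{\"new_data_parallel_size\": ${new_dp}}"
}

# run_scale_sequence DP...
# Applies each target in order and stops at the first failing step.
run_scale_sequence() {
  local step=0 dp
  for dp in "$@"; do
    step=$((step + 1))
    echo "step ${step}: scaling to dp=${dp}"
    if ! scale_to "$dp"; then
      echo "step ${step} failed (dp=${dp})" >&2
      return 1
    fi
  done
  echo "all ${step} steps complete"
}

# The six steps validated in the PR, starting from dp=2 (commented out):
# run_scale_sequence 3 4 3 2 4 2
```

Stopping at the first failed step keeps a later success message from hiding an earlier failure.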

Where should the reviewer start?

- tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml: NCCL env block and worker delay rationale
- tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh: wait_worker_in_ray function and the 6-step scale sequence at the bottom

Summary by CodeRabbit

Tests
- Added test infrastructure for multi-node elastic data-parallel scaling validation in vLLM environments, including Kubernetes manifests and test orchestration scripts.

Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>

github-actions · 2026-04-14T19:43:48Z

🌿 Fern Docs Preview: https://nvidia-preview-d6e86cf1-0b72-4240-9473-f05cd4251791.docs.buildwithfern.com/dynamo/dev

coderabbitai · 2026-04-14T19:44:11Z

Walkthrough

Adds test infrastructure for vLLM's cross-node elastic data-parallel scaling: a Kubernetes manifest defining a headless Service and two Pods (leader and worker) with Ray and vLLM configuration, plus a Bash orchestration script that validates pod readiness, forwards ports, confirms cluster formation, and executes scaling test sequences with health checks and inference validation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Kubernetes Test Manifest `tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml` | New manifest defining a headless Service exposing Ray (6379) and vLLM (8000), a leader Pod installing Ray, starting head node, and launching vLLM with elastic EP enabled and 2 GPUs, and a worker Pod joining the Ray cluster after health endpoint verification. Both Pods mount `/dev/shm`, apply GPU tolerations and node selectors. |
| Test Orchestration Script `tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh` | New Bash script orchestrating the elastic EP scaling test: validates pod readiness, establishes port-forward to leader (localhost 8001 → pod 8000), waits for Ray cluster formation, defines helper functions for GPU snapshots and inference validation, and executes scaling sequence (dp: 2 → 3 → 4 → 3 → 2 → 4 → 2) with health checks and inference calls. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description includes all required template sections (Overview, Details, Where should the reviewer start) with comprehensive information about the changes, rationale, and validation details. |
| Title check | ✅ Passed | The title clearly and concisely describes the primary change: adding test support for vLLM's multinode elastic EP scaling capability. |

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh (1)
33-33: Use a dynamic local port and guaranteed cleanup for the port-forward loop.

Requiring 8001 to be free is brittle, and the broad pkill -f can kill an unrelated kubectl port-forward using the same mapping. The exit 1 path in wait_worker_in_ray also leaves PF running, because cleanup only happens on the happy path. Allocate a free port and register an EXIT trap for the background loop. Based on learnings, hard-coded constants that reduce portability should be avoided in shell scripts; prefer dynamic port allocation such as alloc_port over static ports.

Also applies to: 70-79, 123-124, 204-204
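
One way to combine a dynamically chosen port with guaranteed cleanup is sketched below. It rests on several assumptions: `alloc_port` here is a hypothetical stand-in that needs python3 on PATH (it is not the repository helper mentioned above), `cleanup_pf` and `start_port_forward` are illustrative names, and a single `kubectl port-forward` process replaces the script's retry loop.

```shell
#!/usr/bin/env bash
# Sketch: dynamic local port plus guaranteed cleanup for a background
# port-forward. alloc_port is a hypothetical stand-in that assumes
# python3 is on PATH; the pod name and remote port are caller-supplied.

alloc_port() {
  # Bind to port 0 so the kernel picks a free port, then report it.
  python3 -c 'import socket; s = socket.socket(); s.bind(("127.0.0.1", 0)); print(s.getsockname()[1]); s.close()'
}

PF_PID=""

# Kill the tracked background process, if any. Safe to call repeatedly.
cleanup_pf() {
  if [ -n "$PF_PID" ] && kill -0 "$PF_PID" 2>/dev/null; then
    kill "$PF_PID" 2>/dev/null || true
    wait "$PF_PID" 2>/dev/null || true
  fi
  PF_PID=""
}

# start_port_forward POD REMOTE_PORT
# Sets LOCAL_PORT and PF_PID, and registers cleanup for every exit path.
start_port_forward() {
  LOCAL_PORT="$(alloc_port)" || return 1
  kubectl port-forward "pod/$1" "${LOCAL_PORT}:$2" >/dev/null 2>&1 &
  PF_PID=$!
  trap cleanup_pf EXIT
}

# Example (commented out; the pod name is a placeholder):
# start_port_forward leader-pod 8000 && echo "forwarding on ${LOCAL_PORT}"
```

Because the trap fires on any exit, an early `exit 1` no longer leaves the forward running. Two caveats: the port is released before kubectl binds it, so a small race remains, and a retry loop run in a subshell would also need to terminate its kubectl child.
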
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh`
at line 33, The script currently uses a hard-coded port 8001 and an unsafe
pkill/cleanup pattern; replace all uses of the static port (lines referenced)
with a dynamically allocated port via alloc_port, start the kubectl port-forward
loop in background capturing its PID into a variable (e.g., PF), and register a
trap on EXIT that kills PF to guarantee cleanup; also update wait_worker_in_ray
to ensure it kills PF before any early exit (remove reliance on pkill -f and
broad process matching) so the port-forward loop is always terminated whether
the function succeeds or errors.
tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml (1)
49-50: Avoid baking AKS VMSS hostnames into a templates manifest.

These kubernetes.io/hostname selectors only schedule on the author's two AKS node names, so the pods will sit Pending anywhere else. If this file is meant to be reusable, parameterize the hostnames or switch to stable node labels/affinity.

Also applies to: 154-155
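
Option (a), parameterizing the hostnames, could be done at apply time with a small render step. The sketch below is an illustration only: the `__LEADER_NODE__` and `__WORKER_NODE__` placeholders and the `render_manifest` helper are hypothetical and do not appear in the PR's manifest.

```shell
#!/usr/bin/env bash
# Sketch: render node names into a manifest template at apply time.
# The __LEADER_NODE__/__WORKER_NODE__ placeholders are hypothetical and
# are not present in the PR's manifest.

# render_manifest LEADER_NODE WORKER_NODE < template > rendered
render_manifest() {
  local leader="$1" worker="$2"
  if [ -z "$leader" ] || [ -z "$worker" ]; then
    echo "usage: render_manifest LEADER_NODE WORKER_NODE" >&2
    return 1
  fi
  # '|' as the sed delimiter keeps '/' in values from breaking the script.
  sed -e "s|__LEADER_NODE__|${leader}|g" -e "s|__WORKER_NODE__|${worker}|g"
}

# Example (commented out; node names are placeholders):
# render_manifest node-a node-b < bare_multinode_elastic_ep.yaml | kubectl apply -f -
```

The template then carries no cluster-specific names, and the two target nodes become explicit inputs to the test run.
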
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml`
around lines 49 - 50, The manifest currently hardcodes a VMSS host via the
nodeSelector key (kubernetes.io/hostname), which will cause pods to Pending on
other clusters; update the templates to remove the fixed kubernetes.io/hostname
selector and either (a) replace it with a parameterized value (expose a template
variable like nodeHostname or nodeSelectorMap and use that in the manifest) or
(b) use stable labels or nodeAffinity (e.g., a custom node label such as
gpu-type: a100 or a nodeAffinity rule) so scheduling works across clusters;
apply the same change for the second occurrence of the kubernetes.io/hostname
selector in this template.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml`:
- Around line 123-126: Replace the fixed "sleep 8" delay with a readiness loop
that probes the Ray head's port before starting vLLM: after launching the head
(the block that runs "ray start --head --port=6379 --block &" and sets
RAY_HEAD_PID), poll until the head is accepting connections on port 6379 (e.g.,
with nc/ss/tcp probe or "ray status" if available), include a sensible timeout
and error handling, and only echo "=== Ray head started ... launching vllm serve
===" and run "vllm serve --data-parallel-backend ray" after the probe succeeds
to avoid the race.
- Around line 194-206: The worker script hard-codes the namespace in LEADER_URL
and the ray start address, breaking service discovery in other namespaces;
change LEADER_URL and the ray start target to use same-namespace resolution
(e.g., reference the service by its short DNS name or build it using the pod's
current namespace from an env var like POD_NAMESPACE) so both the curl health
check against LEADER_URL and the ray start --address use the service without the
fixed "tzulingk-multinode-elastic" namespace (update the LEADER_URL definition
and the ray start invocation where they appear).
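
A bounded readiness probe of the kind requested for the Ray head might look like this sketch. It assumes bash and coreutils `timeout` are present in the image; `wait_for_tcp` and its 60-second default are hypothetical, and `ray status` would be an alternative probe where available.

```shell
#!/usr/bin/env bash
# Sketch: wait for a TCP port to accept connections instead of sleeping a
# fixed time. Assumes bash and coreutils `timeout` are available.

# wait_for_tcp HOST PORT [MAX_WAIT_SECONDS]
# Returns 0 as soon as a connection succeeds, 1 after MAX_WAIT_SECONDS.
wait_for_tcp() {
  local host="$1" port="$2" max_wait="${3:-60}" waited=0
  while [ "$waited" -lt "$max_wait" ]; do
    # Opening /dev/tcp/HOST/PORT succeeds only if something is listening.
    if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out after ${max_wait}s waiting for ${host}:${port}" >&2
  return 1
}

# Example wiring (commented out), mirroring the commands quoted above;
# the remaining vllm serve arguments are omitted:
# ray start --head --port=6379 --block &
# RAY_HEAD_PID=$!
# wait_for_tcp 127.0.0.1 6379 60 || { kill "$RAY_HEAD_PID"; exit 1; }
# vllm serve --data-parallel-backend ray
```
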

In
`@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh`:
- Line 7: The script header in run_bare_multinode_elastic_ep_scale_test.sh
contains stale usage/manifest filenames (references to pre-rename
manifest/script names) — update the usage text and any commented example file
paths in the header (including the block around lines 24-32) to the current
manifest and script names used in this PR so they match the actual files (search
for the old filenames in the header and replace them with the new
manifest/script names referenced elsewhere in the repo).
- Around line 146-177: The test script lets failed HTTP calls silently pass;
update infer() and scale() to detect curl failures and non-2xx responses and
exit non‑zero so the test fails: in infer() check curl’s exit status and parse
RESP for error conditions (empty/malformed JSON or HTTP error fields) and call
exit 1 on failure; in scale() check curl’s exit status and ensure RESP indicates
success (non-error HTTP/body) and call exit 1 on failure; also propagate these
failures from callers (baseline/step/any place invoking infer()/scale()) so the
script stops and does not print "ALL STEPS COMPLETE" on timeouts or 4xx/5xx
responses.
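
The failure handling asked for in infer() and scale() can be sketched as a shared helper that captures the HTTP status alongside the body. This is an assumption-laden illustration, not the fix that was committed: `post_json`, `is_2xx`, and the 120-second timeout are hypothetical, and the base URL follows the localhost 8001 mapping described in the walkthrough.

```shell
#!/usr/bin/env bash
# Sketch: make HTTP failures fatal instead of silently passing.
# BASE_URL and the helper names are assumptions, not the PR's code.

BASE_URL="${BASE_URL:-http://localhost:8001}"

# is_2xx STATUS -> success only for HTTP codes 200-299.
is_2xx() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# post_json PATH JSON_BODY
# Prints the response body; returns 1 on curl errors, non-2xx codes,
# or an empty body.
post_json() {
  local path="$1" body="$2" raw status resp nl
  nl='
'
  # -w appends the status code on its own final line after the body.
  raw="$(curl --silent --max-time 120 -X POST "${BASE_URL}${path}" \
    -H "Content-Type: application/json" -d "$body" \
    -w '\n%{http_code}')" || { echo "curl failed for ${path}" >&2; return 1; }
  status="${raw##*"$nl"}"   # last line: the status code
  resp="${raw%"$nl"*}"      # everything before the last line: the body
  if ! is_2xx "$status"; then
    echo "HTTP ${status} from ${path}: ${resp}" >&2
    return 1
  fi
  if [ -z "$resp" ]; then
    echo "empty response body from ${path}" >&2
    return 1
  fi
  printf '%s\n' "$resp"
}

# Example caller (commented out): abort the run on any failed request.
# post_json /scale_elastic_ep "$payload" || exit 1
```

Callers that end each request with `|| exit 1` cannot reach a final success message after a timeout or a 4xx/5xx response.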

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 286c07aa-8fd5-4363-b43d-c548fc1a7551

📥 Commits

Reviewing files that changed from the base of the PR and between e2371a4 and 5b22dab.

📒 Files selected for processing (2)

tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml
tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh

nnshah1

LGTM

vLLM multinode elastic ep scaling script and yaml.

5b22dab

Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>

tzulingk requested a review from a team as a code owner April 14, 2026 19:41

pull-request-size Bot added the size/XXL label Apr 14, 2026

github-actions Bot added the feat, documentation, and deployment::k8s labels Apr 14, 2026

tzulingk force-pushed the tzulingk/multinode-elastic-ep branch from 1bd289c to 5b22dab Compare April 14, 2026 19:42

pull-request-size Bot added size/L and removed size/XXL labels Apr 14, 2026

tzulingk requested a review from nnshah1 April 14, 2026 19:43

tzulingk changed the title ~~feat(operator): vLLM multinode elastic EP scaling support~~ feat: vLLM multinode elastic EP scaling support Apr 14, 2026

tzulingk changed the title ~~feat: vLLM multinode elastic EP scaling support~~ feat: vLLM multinode elastic EP scaling support test Apr 14, 2026

coderabbitai Bot reviewed Apr 14, 2026

fix(test): address coderabbitai review — Ray readiness probe, remove …

3a0f1f9

…hardcoded namespace, fix stale filenames, fail test on curl errors Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>

copy-pr-bot Bot temporarily deployed to GITLAB April 14, 2026 20:16 Inactive

tzulingk enabled auto-merge (squash) April 14, 2026 20:16

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 00:27 Inactive

nnshah1 approved these changes Apr 20, 2026

tzulingk merged commit f460f4e into main Apr 20, 2026
88 checks passed

tzulingk deleted the tzulingk/multinode-elastic-ep branch April 20, 2026 14:49

tzulingk merged 2 commits into main from tzulingk/multinode-elastic-ep