feat: vLLM multinode elastic EP scaling support test#8183
Conversation
Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>
1bd289c to
5b22dab
Compare
WalkthroughAdds test infrastructure for vLLM's cross-node elastic data-parallel scaling: a Kubernetes manifest defining a headless Service and two Pods (leader and worker) with Ray and vLLM configuration, plus a Bash orchestration script that validates pod readiness, forwards ports, confirms cluster formation, and executes scaling test sequences with health checks and inference validation. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsTimed out fetching pipeline failures after 30000ms Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh (1)
33-33: Use a dynamic local port and guaranteed cleanup for the port-forward loop.Requiring
8001to be free is brittle, the broadpkill -fcan kill an unrelatedkubectl port-forwardusing the same mapping, and theexit 1path inwait_worker_in_rayleavesPFrunning because cleanup only happens on the happy path. Allocate a free port and register anEXITtrap for the background loop. Based on learnings, hard-coded portability-reducing constants in shell/scripts should be avoided; prefer dynamic port allocation such asalloc_portinstead of static ports.Also applies to: 70-79, 123-124, 204-204
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh` at line 33, The script currently uses a hard-coded port 8001 and an unsafe pkill/cleanup pattern; replace all uses of the static port (lines referenced) with a dynamically allocated port via alloc_port, start the kubectl port-forward loop in background capturing its PID into a variable (e.g., PF), and register a trap on EXIT that kills PF to guarantee cleanup; also update wait_worker_in_ray to ensure it kills PF before any early exit (remove reliance on pkill -f and broad process matching) so the port-forward loop is always terminated whether the function succeeds or errors.tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml (1)
49-50: Avoid baking AKS VMSS hostnames into atemplatesmanifest.These
kubernetes.io/hostnameselectors only schedule on the author's two AKS node names, so the pods will sitPendinganywhere else. If this file is meant to be reusable, parameterize the hostnames or switch to stable node labels/affinity.Also applies to: 154-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml` around lines 49 - 50, The manifest currently hardcodes a VMSS host via the nodeSelector key (kubernetes.io/hostname), which will cause pods to Pending on other clusters; update the templates to remove the fixed kubernetes.io/hostname selector and either (a) replace it with a parameterized value (expose a template variable like nodeHostname or nodeSelectorMap and use that in the manifest) or (b) use stable labels or nodeAffinity (e.g., a custom node label such as gpu-type: a100 or a nodeAffinity rule) so scheduling works across clusters; apply the same change for the second occurrence of the kubernetes.io/hostname selector in this template.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml`:
- Around line 123-126: Replace the fixed "sleep 8" delay with a readiness loop
that probes the Ray head's port before starting vLLM: after launching the head
(the block that runs "ray start --head --port=6379 --block &" and sets
RAY_HEAD_PID), poll until the head is accepting connections on port 6379 (e.g.,
with nc/ss/tcp probe or "ray status" if available), include a sensible timeout
and error handling, and only echo "=== Ray head started ... launching vllm serve
===" and run "vllm serve --data-parallel-backend ray" after the probe succeeds
to avoid the race.
- Around line 194-206: The worker script hard-codes the namespace in LEADER_URL
and the ray start address, breaking service discovery in other namespaces;
change LEADER_URL and the ray start target to use same-namespace resolution
(e.g., reference the service by its short DNS name or build it using the pod's
current namespace from an env var like POD_NAMESPACE) so both the curl health
check against LEADER_URL and the ray start --address use the service without the
fixed "tzulingk-multinode-elastic" namespace (update the LEADER_URL definition
and the ray start invocation where they appear).
In
`@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh`:
- Line 7: The script header in run_bare_multinode_elastic_ep_scale_test.sh
contains stale usage/manifest filenames (references to pre-rename
manifest/script names) — update the usage text and any commented example file
paths in the header (including the block around lines 24-32) to the current
manifest and script names used in this PR so they match the actual files (search
for the old filenames in the header and replace them with the new
manifest/script names referenced elsewhere in the repo).
- Around line 146-177: The test script lets failed HTTP calls silently pass;
update infer() and scale() to detect curl failures and non-2xx responses and
exit non‑zero so the test fails: in infer() check curl’s exit status and parse
RESP for error conditions (empty/malformed JSON or HTTP error fields) and call
exit 1 on failure; in scale() check curl’s exit status and ensure RESP indicates
success (non-error HTTP/body) and call exit 1 on failure; also propagate these
failures from callers (baseline/step/any place invoking infer()/scale()) so the
script stops and does not print "ALL STEPS COMPLETE" on timeouts or 4xx/5xx
responses.
---
Nitpick comments:
In `@tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml`:
- Around line 49-50: The manifest currently hardcodes a VMSS host via the
nodeSelector key (kubernetes.io/hostname), which will cause pods to Pending on
other clusters; update the templates to remove the fixed kubernetes.io/hostname
selector and either (a) replace it with a parameterized value (expose a template
variable like nodeHostname or nodeSelectorMap and use that in the manifest) or
(b) use stable labels or nodeAffinity (e.g., a custom node label such as
gpu-type: a100 or a nodeAffinity rule) so scheduling works across clusters;
apply the same change for the second occurrence of the kubernetes.io/hostname
selector in this template.
In
`@tests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh`:
- Line 33: The script currently uses a hard-coded port 8001 and an unsafe
pkill/cleanup pattern; replace all uses of the static port (lines referenced)
with a dynamically allocated port via alloc_port, start the kubectl port-forward
loop in background capturing its PID into a variable (e.g., PF), and register a
trap on EXIT that kills PF to guarantee cleanup; also update wait_worker_in_ray
to ensure it kills PF before any early exit (remove reliance on pkill -f and
broad process matching) so the port-forward loop is always terminated whether
the function succeeds or errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 286c07aa-8fd5-4363-b43d-c548fc1a7551
📒 Files selected for processing (2)
tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yamltests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh
…hardcoded namespace, fix stale filenames, fail test on curl errors Signed-off-by: Tzu-Ling <tzulingk@nvidia.com>
Overview:
Adds test infrastructure for validating vLLM's native elastic expert parallelism (elastic EP) across multiple nodes, without Dynamo operator involvement. This is a bare vLLM test that confirms the cross-node scaling capability works before wiring it into the operator.
Validated on AKS with 2 nodes and
deepseek-ai/DeepSeek-V2-Lite(vLLM0.18.1rc1.dev59+g2488a82f8): all 6 scale stepsdp=2→3→4→3→2→4→2succeeded. Inference verified at dp=3.Details:
bare_multinode_elastic_ep.yaml— Kubernetes pod specs for the warm-standby topology:/healthevery 15s, joins Ray only after vLLM is serving — GPUs stay idle until scale-upNCCL_IB_DISABLE=1,NCCL_SOCKET_IFNAME=eth0) — AKS IB GIDs arefe80::link-local and not cross-pod routablerun_bare_multinode_elastic_ep_scale_test.sh— scale test driver:/healthbefore startingPOST /scale_elastic_epwith configurable timeoutsnvidia-smifrom both pods and runs inference after each stepWhere should the reviewer start?
tests/fault_tolerance/deploy/templates/vllm/bare_multinode_elastic_ep.yaml— NCCL env block and worker delay rationaletests/fault_tolerance/deploy/templates/vllm/run_bare_multinode_elastic_ep_scale_test.sh—wait_worker_in_rayfunction and the 6-step scale sequence at the bottomSummary by CodeRabbit