fix(operator): emit Kubernetes syntax from LWS deployer so kubelet expands the leader hostname in direct-python container args#8369
Conversation
WalkthroughThe PR changes LWS multinode variable handling to use Kubernetes command-substitution syntax ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
deploy/operator/internal/dynamo/utils.go (1)
20-31: Update the stale “env vars need shell” wording.The new
$(VAR)contract is right, but the earlier bullet still says shell wrapping is needed “for env vars”. Please narrow that to shell-only constructs like arithmetic expansion to avoid reintroducing unnecessarysh -cwrapping.Suggested doc tweak
- * - If shell interpretation is needed (for env vars): Wrap in "sh -c" with exec + * - If shell interpretation is needed (for shell-only expressions such as arithmetic): Wrap in "sh -c" with execBased on learnings, shell wrapping should only be used for arithmetic operations like
$((BLAH + 1)); regular$(VAR)environment variable substitutions are handled natively by Kubernetes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/dynamo/utils.go` around lines 20 - 31, The comment in utils.go incorrectly implies shell wrapping is needed for env var substitution; update the bullet under "Direct Python Command" to narrow the rationale so it only recommends shell wrapping for shell-only constructs (e.g., arithmetic expansion $((...))) rather than regular Kubernetes env vars like $(VAR). Keep the new $(VAR) contract language and explicitly state that GetLeaderHostname / GetNodeRank and all MultinodeDeployer implementations must return Kubernetes env-var expansion syntax ("$(VAR)") so the kubelet performs substitution, and mention that shell wrapping is only required for true shell expansion cases (arithmetic, command substitution), not plain $(VAR).deploy/operator/internal/dynamo/lws.go (1)
21-26: Consider returningneedsShell=falsefor LWS node rank.Now that
GetNodeRankreturns$(LWS_WORKER_INDEX), kubelet can expand it directly in container args; keepingtruestill forces unnecessarysh -cwrapping for LWS workers.Suggested adjustment
-// GetNodeRank returns the current pod's rank within its LWS group in -// Kubernetes env-var expansion syntax. needsShell remains true for -// parity with Grove's arithmetic-expanding worker rank; the flag is -// ultimately about whether downstream code wraps the command in sh -c. +// GetNodeRank returns the current pod's rank within its LWS group in +// Kubernetes env-var expansion syntax; kubelet expands this directly in +// container Args/Command, so no shell wrapper is required. func (d *LWSMultinodeDeployer) GetNodeRank() (string, bool) { - return "$(LWS_WORKER_INDEX)", true + return "$(LWS_WORKER_INDEX)", false }Based on learnings, shell wrapping should only be used for arithmetic operations like
$((BLAH + 1)); regular$(VAR)environment variable substitutions are handled natively by Kubernetes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/dynamo/lws.go` around lines 21 - 26, The GetNodeRank method on LWSMultinodeDeployer currently returns ("$(LWS_WORKER_INDEX)", true) which forces sh -c wrapping; change it to return ("$(LWS_WORKER_INDEX)", false) so kubelet can perform native env-var expansion without unnecessary shell wrapping—update the return tuple in GetNodeRank of LWSMultinodeDeployer accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@deploy/operator/internal/dynamo/lws.go`:
- Around line 21-26: The GetNodeRank method on LWSMultinodeDeployer currently
returns ("$(LWS_WORKER_INDEX)", true) which forces sh -c wrapping; change it to
return ("$(LWS_WORKER_INDEX)", false) so kubelet can perform native env-var
expansion without unnecessary shell wrapping—update the return tuple in
GetNodeRank of LWSMultinodeDeployer accordingly.
In `@deploy/operator/internal/dynamo/utils.go`:
- Around line 20-31: The comment in utils.go incorrectly implies shell wrapping
is needed for env var substitution; update the bullet under "Direct Python
Command" to narrow the rationale so it only recommends shell wrapping for
shell-only constructs (e.g., arithmetic expansion $((...))) rather than regular
Kubernetes env vars like $(VAR). Keep the new $(VAR) contract language and
explicitly state that GetLeaderHostname / GetNodeRank and all MultinodeDeployer
implementations must return Kubernetes env-var expansion syntax ("$(VAR)") so
the kubelet performs substitution, and mention that shell wrapping is only
required for true shell expansion cases (arithmetic, command substitution), not
plain $(VAR).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d32dabf-a717-4717-93b4-12b7941cffe2
📒 Files selected for processing (6)
deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/operator/internal/dynamo/backend_sglang.godeploy/operator/internal/dynamo/backend_trtllm_test.godeploy/operator/internal/dynamo/backend_vllm_test.godeploy/operator/internal/dynamo/lws.godeploy/operator/internal/dynamo/utils.go
💤 Files with no reviewable changes (1)
- deploy/operator/internal/dynamo/backend_sglang.go
…pands the leader hostname in direct-python container args
b2894a1 to
0484a3b
Compare
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/operator/internal/dynamo/lws.go (1)
36-49:⚠️ Potential issue | 🟡 MinorAdd documentation to
GetHostNamesclarifying that returned hostnames require shell evaluation.The returned hostnames mix kubelet-expanded variables like
$(LWS_LEADER_ADDRESS)(expanded by Kubernetes before shell runs) with shell command substitutions like$(echo ... | sed ...)(evaluated by the shell). This works correctly when callers wrap the output insh -c(as the current TRTLLM usage does), but would produce literal strings like"$(echo ... | sed ...)"if consumed directly without shell interpretation.Add a brief comment to the function or interface documenting that this helper's return value must flow through shell evaluation to be fully resolved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/dynamo/lws.go` around lines 36 - 49, Update the GetHostNames function documentation to clearly state that the returned hostname strings include shell substitutions (e.g., "$(echo ... | sed ...)" and kubelet-expanded variables) and therefore must be passed through a shell for evaluation (for example via sh -c) by the caller; reference the GetHostNames method name in the comment and mention that callers must not treat the returned values as final literal hostnames but must perform shell evaluation to resolve worker indices.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@deploy/operator/internal/dynamo/lws.go`:
- Around line 36-49: Update the GetHostNames function documentation to clearly
state that the returned hostname strings include shell substitutions (e.g.,
"$(echo ... | sed ...)" and kubelet-expanded variables) and therefore must be
passed through a shell for evaluation (for example via sh -c) by the caller;
reference the GetHostNames method name in the comment and mention that callers
must not treat the returned values as final literal hostnames but must perform
shell evaluation to resolve worker indices.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7a931431-aa13-4de3-afc0-7706946d98c9
📒 Files selected for processing (6)
deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/operator/internal/dynamo/backend_sglang.godeploy/operator/internal/dynamo/backend_trtllm_test.godeploy/operator/internal/dynamo/backend_vllm_test.godeploy/operator/internal/dynamo/lws.godeploy/operator/internal/dynamo/utils.go
💤 Files with no reviewable changes (1)
- deploy/operator/internal/dynamo/backend_sglang.go
…pands the leader hostname in direct-python container args
Overview:
LWS multinode deployments (vLLM / SGLang / TRT-LLM) fail on the leader pod because
$LWS_LEADER_ADDRESSis passed through as a literal string to the container args. Distributed init then crashes with:Root cause:
LWSMultinodeDeployer.GetLeaderHostnamereturned the bare-shell form$LWS_LEADER_ADDRESS. When flags are appended directly to apython ...command (nosh -cwrapper), the kubelet only expands the Kubernetes$(VAR)form, so the variable was never substituted.Details:
LWSMultinodeDeployer.GetLeaderHostnamenow returns$(LWS_LEADER_ADDRESS)(Kubernetes env-var expansion syntax). The kubelet substitutes$(VAR)in containerArgs/Commandbefore the container starts, so the same string works whether flags are appended directly to a python command or wrapped insh -c.LWSMultinodeDeployer.GetNodeRanknow returns$(LWS_WORKER_INDEX), needsShell=false. Since the kubelet expands$(VAR)directly inArgs, LWS workers no longer get a gratuitoussh -c "exec python ..."wrapper. Grove still returnsneedsShell=truebecause its rank is a shell-arithmetic expression$((GROVE_PCLQ_POD_INDEX + 1))that the kubelet does not evaluate.GroveMultinodeDeployer, which already emits$(VAR)syntax for the leader hostname, and makes theneedsShellflag genuinely reflect whether the returned string requires a shell (rather than being a blanket "true" for LWS).shellVarsToK8sSyntax+bareShellVarPatterninutils.goconvertIfShellVar+shellVarReinbackend_sglang.go(also drops the now-unusedstringsimport)k8sToShellVarSyntaxstill transforms$(LWS_LEADER_ADDRESS)into${LWS_LEADER_ADDRESS}for theexportshell script, which is needed because K8s env-var ordering can't be relied on for dynamically injected vars there.needsShell=truebecause it builds a$(( N * <rank> ))shell-arithmetic expression. The kubelet still pre-expands$(LWS_WORKER_INDEX)beforeshruns, so the arithmetic evaluates correctly.sed/mpirunshell pipelines keeps its bare$LWS_LEADER_ADDRESSform on purpose — that string is always evaluated by a shell, never handed to the kubelet.backend_vllm_test.go,backend_trtllm_test.go, anddynamocomponentdeployment_controller_test.goto match the new$(LWS_LEADER_ADDRESS)/$(LWS_WORKER_INDEX)/${LWS_LEADER_ADDRESS}outputs, and the LWS mp-worker expectation now has flatArgsinstead of ansh -c exec ...wrapper.Where should the reviewer start?
deploy/operator/internal/dynamo/lws.go— the actual behavior change + updated doc comments explaining the K8s vs shell expansion contract and whyneedsShellis nowfalsefor LWS but remainstruefor Grove.deploy/operator/internal/dynamo/utils.go— header comment now documents the contract that allMultinodeDeployerimplementations must return$(VAR)syntax; confirms the simplifiedinjectFlagsIntoContainerCommandno longer needs a conversion step.deploy/operator/internal/dynamo/backend_sglang.go— verify the removed helper wasn't silently relied on elsewhere.$(VAR)at the container-arg boundary, shell${VAR}/$VARonly insidesh -cbodies) and that the LWS worker case no longer wraps the command insh -c.Related Issues:
Summary by CodeRabbit