fix(operator): emit Kubernetes syntax from LWS deployer so kubelet expands the leader hostname in direct-python container args #8369

Merged
julienmancuso merged 2 commits into main from jsm/dyn-543 on Apr 20, 2026

Conversation

@julienmancuso
Contributor

@julienmancuso julienmancuso commented Apr 20, 2026

Overview:

LWS multinode deployments (vLLM / SGLang / TRT-LLM) fail on the leader pod because $LWS_LEADER_ADDRESS is passed through as a literal string to the container args. Distributed init then crashes with:

socket.cpp:764] The IPv6 network addresses of ($lws_leader_address, 29500) cannot be retrieved

Root cause: LWSMultinodeDeployer.GetLeaderHostname returned the bare-shell form $LWS_LEADER_ADDRESS. When flags are appended directly to a python ... command (no sh -c wrapper), the kubelet only expands the Kubernetes $(VAR) form, so the variable was never substituted.
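
To illustrate the difference between the two forms, here is a minimal sketch of kubelet-style expansion. It is a simplified model, not the kubelet's actual code: `expandK8sVars`, the `--master-addr` flag, and the `leader-0.svc` value are hypothetical, and the `$$` escape that Kubernetes supports is ignored.

```go
package main

import (
	"fmt"
	"regexp"
)

// k8sVarRef matches only the Kubernetes $(VAR) reference form.
// Bare shell forms such as $VAR or ${VAR} deliberately do not match.
var k8sVarRef = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// expandK8sVars is a simplified, hypothetical model of kubelet expansion:
// defined $(VAR) references are substituted, everything else stays literal.
func expandK8sVars(arg string, env map[string]string) string {
	return k8sVarRef.ReplaceAllStringFunc(arg, func(ref string) string {
		name := ref[2 : len(ref)-1] // strip the leading "$(" and trailing ")"
		if val, ok := env[name]; ok {
			return val
		}
		return ref // undefined references are left untouched
	})
}

func main() {
	env := map[string]string{"LWS_LEADER_ADDRESS": "leader-0.svc"}

	// Kubernetes form: substituted before the container starts.
	fmt.Println(expandK8sVars("--master-addr=$(LWS_LEADER_ADDRESS)", env))
	// prints: --master-addr=leader-0.svc

	// Bare shell form: passed through as a literal string.
	fmt.Println(expandK8sVars("--master-addr=$LWS_LEADER_ADDRESS", env))
	// prints: --master-addr=$LWS_LEADER_ADDRESS
}
```

The second call shows the failure mode described above: without a shell in the chain, nothing ever resolves the bare `$VAR` form.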

Details:

  • LWSMultinodeDeployer.GetLeaderHostname now returns $(LWS_LEADER_ADDRESS) (Kubernetes env-var expansion syntax). The kubelet substitutes $(VAR) in container Args/Command before the container starts, so the same string works whether flags are appended directly to a python command or wrapped in sh -c.
  • LWSMultinodeDeployer.GetNodeRank now returns $(LWS_WORKER_INDEX), needsShell=false. Since the kubelet expands $(VAR) directly in Args, LWS workers no longer get a gratuitous sh -c "exec python ..." wrapper. Grove still returns needsShell=true because its rank is a shell-arithmetic expression $((GROVE_PCLQ_POD_INDEX + 1)) that the kubelet does not evaluate.
  • This aligns LWS with GroveMultinodeDeployer, which already emits $(VAR) syntax for the leader hostname, and makes the needsShell flag genuinely reflect whether the returned string requires a shell (rather than being a blanket "true" for LWS).
  • Removed redundant conversion helpers that existed only to patch up the previous inconsistency:
    • shellVarsToK8sSyntax + bareShellVarPattern in utils.go
    • convertIfShellVar + shellVarRe in backend_sglang.go (also drops the now-unused strings import)
  • vLLM init-container path is unaffected: k8sToShellVarSyntax still transforms $(LWS_LEADER_ADDRESS) into ${LWS_LEADER_ADDRESS} for the export shell script, which is needed because K8s env-var ordering can't be relied on for dynamically injected vars there.
  • vLLM data-parallel worker path is also unaffected: it unconditionally sets needsShell=true because it builds a $(( N * <rank> )) shell-arithmetic expression. The kubelet still pre-expands $(LWS_WORKER_INDEX) before sh runs, so the arithmetic evaluates correctly.
  • TRT-LLM worker-hostname derivation inside sed/mpirun shell pipelines keeps its bare $LWS_LEADER_ADDRESS form on purpose — that string is always evaluated by a shell, never handed to the kubelet.
  • Test expectations updated across backend_vllm_test.go, backend_trtllm_test.go, and dynamocomponentdeployment_controller_test.go to match the new $(LWS_LEADER_ADDRESS) / $(LWS_WORKER_INDEX) / ${LWS_LEADER_ADDRESS} outputs, and the LWS mp-worker expectation now has flat Args instead of an sh -c exec ... wrapper.

Where should the reviewer start?

  • deploy/operator/internal/dynamo/lws.go — the actual behavior change + updated doc comments explaining the K8s vs shell expansion contract and why needsShell is now false for LWS but remains true for Grove.
  • deploy/operator/internal/dynamo/utils.go — header comment now documents the contract that all MultinodeDeployer implementations must return $(VAR) syntax; confirms the simplified injectFlagsIntoContainerCommand no longer needs a conversion step.
  • deploy/operator/internal/dynamo/backend_sglang.go — verify the removed helper wasn't silently relied on elsewhere.
  • Test files — sanity check that the expected strings reflect the correct expansion context (K8s $(VAR) at the container-arg boundary, shell ${VAR} / $VAR only inside sh -c bodies) and that the LWS worker case no longer wraps the command in sh -c.
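
As a rough aid for checking the two expansion contexts, the conversion from the container-arg form to the shell-body form might look like the sketch below. `toShellVarSyntax` is a hypothetical name written in the spirit of k8sToShellVarSyntax; the real helper's implementation is not shown in this PR description, and the `export LEADER=...` line is only an example input.

```go
package main

import (
	"fmt"
	"regexp"
)

// k8sVarRef matches the Kubernetes $(VAR) reference form. It does not match
// shell arithmetic such as $((X + 1)), because "(" cannot start a name.
var k8sVarRef = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// toShellVarSyntax rewrites each $(VAR) reference as the shell form ${VAR},
// for strings that will be evaluated inside a shell script body.
func toShellVarSyntax(s string) string {
	return k8sVarRef.ReplaceAllStringFunc(s, func(ref string) string {
		name := ref[2 : len(ref)-1] // strip the leading "$(" and trailing ")"
		return "${" + name + "}"
	})
}

func main() {
	fmt.Println(toShellVarSyntax("export LEADER=$(LWS_LEADER_ADDRESS)"))
	// prints: export LEADER=${LWS_LEADER_ADDRESS}

	fmt.Println(toShellVarSyntax("$((GROVE_PCLQ_POD_INDEX + 1))"))
	// prints: $((GROVE_PCLQ_POD_INDEX + 1))
}
```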

Related Issues:

Summary by CodeRabbit

  • Bug Fixes
    • Fixed environment variable substitution in multinode distributed deployments by adopting Kubernetes-native variable expansion syntax, so the leader address resolves reliably across the vLLM, SGLang, and TRT-LLM backends.

@julienmancuso julienmancuso requested a review from a team as a code owner April 20, 2026 16:08
@github-actions github-actions Bot added the fix and deployment::k8s (Relates to dynamo deployment in kubernetes) labels Apr 20, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

Walkthrough

The PR changes LWS multinode variable handling to use Kubernetes env-var expansion syntax ($(VAR)) instead of bare shell variables ($VAR), removes the helper logic that converted shell variables to that form, changes the needsShell value returned by GetNodeRank to false, and updates tests and doc comments to match.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| LWS Multinode Deployer Core: deploy/operator/internal/dynamo/lws.go | Return leader hostname as $(LWS_LEADER_ADDRESS) (was $LWS_LEADER_ADDRESS); change GetNodeRank to return $(LWS_WORKER_INDEX), false (was ..., true). |
| Shell-var conversion removal: deploy/operator/internal/dynamo/backend_sglang.go | Removed convertIfShellVar logic and strings import; getMultinodeFlags now uses the leader hostname verbatim as returned by the deployer (no post-conversion). |
| Flag-injection docs / utils: deploy/operator/internal/dynamo/utils.go | Updated comments to require that MultinodeDeployer implementations return Kubernetes env-var expansion syntax $(VAR); removed outdated needsShell usage in doc comments. |
| Controller & Backend tests: deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go, deploy/operator/internal/dynamo/backend_trtllm_test.go, deploy/operator/internal/dynamo/backend_vllm_test.go | Updated expected command/arg strings and host lists to use $(LWS_LEADER_ADDRESS) / $(LWS_WORKER_INDEX); added/adjusted vLLM test cases asserting env-var expansion syntax and updated init-container env expectations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: fixing LWS deployer to emit Kubernetes env-var expansion syntax so kubelet expands the leader hostname in container args. |
| Description check | ✅ Passed | The PR description comprehensively covers all required template sections: Overview explains the bug and root cause, Details describes all changes with context, and Where should the reviewer start identifies critical files. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
deploy/operator/internal/dynamo/utils.go (1)

20-31: Update the stale “env vars need shell” wording.

The new $(VAR) contract is right, but the earlier bullet still says shell wrapping is needed “for env vars”. Please narrow that to shell-only constructs like arithmetic expansion to avoid reintroducing unnecessary sh -c wrapping.

Suggested doc tweak
- *    - If shell interpretation is needed (for env vars): Wrap in "sh -c" with exec
+ *    - If shell interpretation is needed (for shell-only expressions such as arithmetic): Wrap in "sh -c" with exec

Based on learnings, shell wrapping should only be used for arithmetic operations like $((BLAH + 1)); regular $(VAR) environment variable substitutions are handled natively by Kubernetes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/dynamo/utils.go` around lines 20 - 31, The comment
in utils.go incorrectly implies shell wrapping is needed for env var
substitution; update the bullet under "Direct Python Command" to narrow the
rationale so it only recommends shell wrapping for shell-only constructs (e.g.,
arithmetic expansion $((...))) rather than regular Kubernetes env vars like
$(VAR). Keep the new $(VAR) contract language and explicitly state that
GetLeaderHostname / GetNodeRank and all MultinodeDeployer implementations must
return Kubernetes env-var expansion syntax ("$(VAR)") so the kubelet performs
substitution, and mention that shell wrapping is only required for true shell
expansion cases (arithmetic, command substitution), not plain $(VAR).
deploy/operator/internal/dynamo/lws.go (1)

21-26: Consider returning needsShell=false for LWS node rank.

Now that GetNodeRank returns $(LWS_WORKER_INDEX), kubelet can expand it directly in container args; keeping true still forces unnecessary sh -c wrapping for LWS workers.

Suggested adjustment
-// GetNodeRank returns the current pod's rank within its LWS group in
-// Kubernetes env-var expansion syntax. needsShell remains true for
-// parity with Grove's arithmetic-expanding worker rank; the flag is
-// ultimately about whether downstream code wraps the command in sh -c.
+// GetNodeRank returns the current pod's rank within its LWS group in
+// Kubernetes env-var expansion syntax; kubelet expands this directly in
+// container Args/Command, so no shell wrapper is required.
 func (d *LWSMultinodeDeployer) GetNodeRank() (string, bool) {
-	return "$(LWS_WORKER_INDEX)", true
+	return "$(LWS_WORKER_INDEX)", false
 }

Based on learnings, shell wrapping should only be used for arithmetic operations like $((BLAH + 1)); regular $(VAR) environment variable substitutions are handled natively by Kubernetes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/dynamo/lws.go` around lines 21 - 26, The GetNodeRank
method on LWSMultinodeDeployer currently returns ("$(LWS_WORKER_INDEX)", true)
which forces sh -c wrapping; change it to return ("$(LWS_WORKER_INDEX)", false)
so kubelet can perform native env-var expansion without unnecessary shell
wrapping—update the return tuple in GetNodeRank of LWSMultinodeDeployer
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d32dabf-a717-4717-93b4-12b7941cffe2

📥 Commits

Reviewing files that changed from the base of the PR and between 7f58231 and b2894a1.

📒 Files selected for processing (6)
  • deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/operator/internal/dynamo/backend_sglang.go
  • deploy/operator/internal/dynamo/backend_trtllm_test.go
  • deploy/operator/internal/dynamo/backend_vllm_test.go
  • deploy/operator/internal/dynamo/lws.go
  • deploy/operator/internal/dynamo/utils.go
💤 Files with no reviewable changes (1)
  • deploy/operator/internal/dynamo/backend_sglang.go

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 potential issues.

View 3 additional findings in Devin Review.

…pands the leader hostname in direct-python container args
@julienmancuso
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/operator/internal/dynamo/lws.go (1)

36-49: ⚠️ Potential issue | 🟡 Minor

Add documentation to GetHostNames clarifying that returned hostnames require shell evaluation.

The returned hostnames mix kubelet-expanded variables like $(LWS_LEADER_ADDRESS) (expanded by Kubernetes before shell runs) with shell command substitutions like $(echo ... | sed ...) (evaluated by the shell). This works correctly when callers wrap the output in sh -c (as the current TRTLLM usage does), but would produce literal strings like "$(echo ... | sed ...)" if consumed directly without shell interpretation.

Add a brief comment to the function or interface documenting that this helper's return value must flow through shell evaluation to be fully resolved.
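
One way to make this requirement checkable is a heuristic along the lines sketched below. `needsShellEval` is a hypothetical helper, not part of the operator, and the sed expression is a placeholder rather than the real one used by GetHostNames.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// plainVarRef matches plain Kubernetes $(VAR) references, which the kubelet
// can resolve on its own.
var plainVarRef = regexp.MustCompile(`\$\([A-Za-z_][A-Za-z0-9_]*\)`)

// needsShellEval is a rough heuristic: after setting aside plain $(VAR)
// references, any remaining "$" suggests a construct (command substitution,
// arithmetic, bare $VAR) that only a shell would evaluate.
func needsShellEval(s string) bool {
	rest := plainVarRef.ReplaceAllString(s, "")
	return strings.Contains(rest, "$")
}

func main() {
	// Plain env-var reference: no shell needed.
	fmt.Println(needsShellEval("$(LWS_LEADER_ADDRESS)"))

	// Command substitution wrapping a kubelet-expanded variable: shell needed.
	fmt.Println(needsShellEval(`$(echo "$(LWS_LEADER_ADDRESS)" | sed 's/-0/-1/')`))

	// Shell arithmetic: shell needed.
	fmt.Println(needsShellEval("$((GROVE_PCLQ_POD_INDEX + 1))"))
}
```

Note the limits of this check: a command substitution with a single bare word, such as `$(hostname)`, is textually identical to a `$(VAR)` reference, so the heuristic cannot tell them apart. A doc comment on GetHostNames remains the more reliable way to communicate the contract.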

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/dynamo/lws.go` around lines 36 - 49, Update the
GetHostNames function documentation to clearly state that the returned hostname
strings include shell substitutions (e.g., "$(echo ... | sed ...)" and
kubelet-expanded variables) and therefore must be passed through a shell for
evaluation (for example via sh -c) by the caller; reference the GetHostNames
method name in the comment and mention that callers must not treat the returned
values as final literal hostnames but must perform shell evaluation to resolve
worker indices.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a931431-aa13-4de3-afc0-7706946d98c9

📥 Commits

Reviewing files that changed from the base of the PR and between 0c98d9a and 0484a3b.

📒 Files selected for processing (6)
  • deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/operator/internal/dynamo/backend_sglang.go
  • deploy/operator/internal/dynamo/backend_trtllm_test.go
  • deploy/operator/internal/dynamo/backend_vllm_test.go
  • deploy/operator/internal/dynamo/lws.go
  • deploy/operator/internal/dynamo/utils.go
💤 Files with no reviewable changes (1)
  • deploy/operator/internal/dynamo/backend_sglang.go

Comment thread deploy/operator/internal/dynamo/lws.go
Comment thread deploy/operator/internal/dynamo/utils.go Outdated
Comment thread deploy/operator/internal/dynamo/backend_trtllm_test.go
…pands the leader hostname in direct-python container args
@pull-request-size pull-request-size Bot added size/L and removed size/M labels Apr 20, 2026
@julienmancuso julienmancuso merged commit da7d672 into main Apr 20, 2026
65 of 66 checks passed
@julienmancuso julienmancuso deleted the jsm/dyn-543 branch April 20, 2026 18:40
nv-nmailhot pushed a commit that referenced this pull request Apr 20, 2026
…pands the leader hostname in direct-python container args (#8369) (#8386)
Labels

deployment::k8s (Relates to dynamo deployment in kubernetes), fix, size/L

2 participants