fix: [cherry-pick] fix TRT-LLM worker SSH crash in non-root containers (http://nvbugs/5936491/1) #6772

Merged
saturley-hall merged 1 commit into release/1.0.0 from mkosec/cherrypick-fix-trtllm-worker-ssh-non-root on Mar 2, 2026

@MatejKosec
Contributor

Cherry-pick of #6694 onto release/1.0.0

Original PR: #6694
NVBugs: http://nvbugs/5936491/1

Summary

TRT-LLM multinode worker pods crash at startup in v1.0.0 when running as a non-root container user:

mkdir: cannot create directory '/run/sshd': Permission denied

Fix

Two changes to backend_trtllm.go:

  1. Remove /run/sshd from the worker mkdir command. Creating this directory requires root. It was added in #6225 (fix(operator): fix SSH setup bugs in TRT-LLM multinode workers) to satisfy OpenSSH's privilege separation requirement, but it is not needed when sshd runs as a non-root user: per sshd.c, the privsep directory check is gated on getuid() == 0, so non-root containers skip it entirely.

  2. Remove the deprecated UsePrivilegeSeparation no option. It was added as a belt-and-suspenders measure but never had any useful effect (the option has been deprecated since OpenSSH 7.5 and generates a deprecation warning in OpenSSH 9.x).
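
To illustrate the getuid() == 0 gate described in item 1, here is a minimal Go sketch. It is not the code in backend_trtllm.go: the helper names (sshDirsForUID, mkdirCommand) and the home-directory layout are assumptions made for illustration. Note also that the actual change removes /run/sshd unconditionally, whereas this sketch shows a conditional variant that mirrors the sshd check.

```go
package main

import (
	"fmt"
	"strings"
)

// sshDirsForUID returns the directories a worker startup command would
// need to create before launching sshd. Hypothetical helper, not the
// function used in backend_trtllm.go.
func sshDirsForUID(uid int, home string) []string {
	dirs := []string{home + "/.ssh"}
	if uid == 0 {
		// Mirrors the gate described above: only sshd running as root
		// checks for the privilege separation directory.
		dirs = append(dirs, "/run/sshd")
	}
	return dirs
}

// mkdirCommand joins the directories into a single mkdir invocation.
func mkdirCommand(uid int, home string) string {
	return "mkdir -p " + strings.Join(sshDirsForUID(uid, home), " ")
}

func main() {
	// Non-root user: /run/sshd is never requested, so no permission error.
	fmt.Println(mkdirCommand(1000, "/home/dynamo"))
	// Root user: the privsep directory is still created.
	fmt.Println(mkdirCommand(0, "/root"))
}
```

For a non-root UID the generated command touches only the user's home directory, which is why the Permission denied error from the summary cannot occur in that path.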

Test

Verified on Nebius (H200, Kubernetes v1.31.9) with a 2-node TRT-LLM decode worker DGD and no runAsUser: 0 set. Both worker pods reached 1/1 Running with zero restarts. SSH keys were generated under /home/dynamo/.ssh/, confirming non-root operation.

@MatejKosec MatejKosec requested a review from a team as a code owner March 2, 2026 21:40
@github-actions github-actions Bot added the fix and deployment::k8s labels Mar 2, 2026
@saturley-hall saturley-hall merged commit 38db289 into release/1.0.0 Mar 2, 2026
43 checks passed
@saturley-hall saturley-hall deleted the mkosec/cherrypick-fix-trtllm-worker-ssh-non-root branch March 2, 2026 22:33
Labels

fix, deployment::k8s (Relates to dynamo deployment in kubernetes), size/S
