fix(onboard): improve gateway startup diagnostics (#1605)#1812
fix(onboard): improve gateway startup diagnostics (#1605)#1812ericksoa merged 8 commits intoNVIDIA:mainfrom
Conversation
Three issues affected all reporters (hopex-ai, g4ur4vs, devlewiso): 1. Gateway output was truncated at 240 chars, hiding the real error. Also, compactText was applied to the whole string before splitting, collapsing newlines to spaces and preventing per-line display. Fix: split on newlines first, then apply compactText per-line, so each line from the gateway process is printed with 4-space indent under a "Gateway start returned before healthy:" header. 2. The \naiting rendering artifact (missing "W" from "Waiting") was caused by a \r embedded in the gateway output being passed through to console.log, overwriting the terminal cursor position. Fix: the per-line split+compactText approach cleans each line before printing, eliminating the carriage-return overwrite. 3. On final failure NemoClaw only told users to run `openshell doctor logs --name nemoclaw` manually; the bug reporters left the "Logs" section empty because they had to know to run it. Fix: automatically capture and print doctor logs in the catch block so users see the actual k3s/flannel/AppArmor errors (the root cause reported by devlewiso for Ubuntu 24.04) without extra steps. Closes NVIDIA#1605 Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughGateway startup failures now redact captured output, strip ANSI sequences per-line, and print an indented multi-line block; on final failure the flow best-effort runs Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Onboard as Onboard (startGateway)
participant Gateway as Local Gateway Process
participant CLI as openshell CLI
participant Logger as StdErr / Logger
Onboard->>Gateway: spawn/start gateway
Gateway-->>Onboard: returns non-zero with combined stdout/stderr
Onboard->>Onboard: redact, split lines, strip ANSI, filter empties, indent
Onboard->>Logger: print indented multi-line startup output
Onboard->>CLI: run "openshell doctor logs --name nemoclaw" (best-effort)
CLI-->>Onboard: doctor logs output
Onboard->>Onboard: redact/strip doctor logs per-line
Onboard->>Logger: print "Gateway logs:" and cleaned doctor logs
Onboard->>Logger: continue with fallback troubleshooting output
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/onboard.test.ts (1)
833-905: Add deterministic cleanup for the temporary test directory.Line 835 creates
tmpDir, but this test never removes it. Please wrap the body intry/finallyand clean it up to prevent temp-dir buildup in long/local CI runs.Proposed patch
it("prints doctor logs automatically when gateway fails to start (`#1605`)", () => { const repoRoot = path.join(import.meta.dirname, ".."); const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-gateway-diag-")); @@ - fs.mkdirSync(fakeBin, { recursive: true }); + try { + fs.mkdirSync(fakeBin, { recursive: true }); @@ - assert.ok( - result.stderr.includes("OOMKilled"), - `expected doctor log output in stderr:\n${result.stderr}`, - ); + assert.ok( + result.stderr.includes("OOMKilled"), + `expected doctor log output in stderr:\n${result.stderr}`, + ); + } finally { + fs.rmSync(tmpDir, { recursive: true, force: true }); + } });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.ts` around lines 833 - 905, Wrap the test body that begins after creating tmpDir (created via fs.mkdtempSync and used to build fakeBin and scriptPath) in a try/finally so temporary files are always removed; in the finally block call fs.rmSync(tmpDir, { recursive: true, force: true }) (or equivalent cleanup) after the spawnSync completes to remove fakeBin, scriptPath and tmpDir, ensuring deterministic cleanup even on assertion failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2091-2097: The printed gateway/doctor logs are currently emitted
raw and may leak secrets; before converting startResult.output (and the similar
doctor output around the 2137–2145 block) to a string and splitting, pass the
raw text through the redact(...) sanitizer and then use compactText/redact
output for subsequent .split("\n"), .map and the console.log; update the code
paths that reference startResult.output and the analogous doctor output so they
call redact(...) first and then proceed with compactText and join to ensure
sensitive data is sanitized before printing.
---
Nitpick comments:
In `@test/onboard.test.ts`:
- Around line 833-905: Wrap the test body that begins after creating tmpDir
(created via fs.mkdtempSync and used to build fakeBin and scriptPath) in a
try/finally so temporary files are always removed; in the finally block call
fs.rmSync(tmpDir, { recursive: true, force: true }) (or equivalent cleanup)
after the spawnSync completes to remove fakeBin, scriptPath and tmpDir, ensuring
deterministic cleanup even on assertion failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6f5c710e-13c2-4b91-93ee-110e03dc9ac7
📒 Files selected for processing (2)
src/lib/onboard.tstest/onboard.test.ts
Apply redact() to startResult.output and doctor logs before emitting to stdout/stderr to prevent credential or token leakage into CI logs and support bundles. Also strip \r from doctor log lines. Addresses CodeRabbit security review on NVIDIA#1812. Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this a malicious or misbehaving gateway process could inject terminal escape sequences (cursor movement, screen clear, colour codes) into the "Gateway start returned before healthy" and "Gateway logs" output paths, overwriting terminal content or faking NemoClaw log lines. Define ANSI_RE at module level (same pattern as gateway-state.ts and runtime-recovery.ts) and apply it per-line in both output paths before compactText, in addition to the existing redact() call. Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openshell provider create/update runs with the credential env in scope; openshell inference set may echo API keys or base URLs in error messages. Both paths were calling compactText() on raw stderr/stdout without first passing through redact(), creating the same credential-leak risk that was fixed for the gateway diagnostic paths in the preceding commits. Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 26-27: The ANSI_RE constant only removes SGR color codes and
leaves other ANSI control sequences (CSI, OSC, erase-line, cursor movements)
intact; update the ANSI_RE definition to use a more comprehensive regex that
matches all standard ANSI escape sequences (including CSI sequences like ESC [
... (not just those ending in m), OSC sequences starting with ESC ] ... BEL or
ESC \, and other single-character ESC sequences) so terminal control codes are
fully stripped before printing; locate and replace the ANSI_RE constant in
onboard.ts (symbol: ANSI_RE) with the expanded pattern and ensure any code paths
using ANSI_RE (e.g., diagnostic/gateway output) continue to call it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d0d0b060-7e71-4ca7-8484-490f7c899b29
📒 Files selected for processing (1)
src/lib/onboard.ts
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
26-27:⚠️ Potential issue | 🟠 MajorBroaden ANSI stripping beyond SGR color codes
Line 27 only strips
ESC[..msequences. CSI cursor/erase and OSC sequences can still leak into log output and cause rendering artifacts in the new diagnostics paths.🔧 Proposed fix
-const ANSI_RE = /\x1b\[[0-9;]*m/g; +// Match CSI, OSC, and single-character ANSI escape sequences. +const ANSI_RE = /\x1B(?:\[[0-?]*[ -/]*[`@-`~]|\][^\x07]*(?:\x07|\x1B\\)|[`@-Z`\\-_])/g;#!/bin/bash # Verify current sanitizer misses non-SGR ANSI control sequences. python - <<'PY' import re, pathlib src = pathlib.Path("src/lib/onboard.ts").read_text() print("SGR-only regex present:", r"const ANSI_RE = /\x1b\[[0-9;]*m/g;" in src) pat = re.compile(r'\x1b\[[0-9;]*m') samples = { "SGR color": "\x1b[31mred\x1b[0m", "Erase line": "\x1b[2Kline", "Cursor up": "\x1b[1Aup", "OSC title": "\x1b]0;title\x07text", } for name, value in samples.items(): cleaned = pat.sub("", value) print(f"{name:11} -> {cleaned.encode()}") PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 26 - 27, The current ANSI_RE only strips SGR sequences (ESC [ ... m) so CSI cursor/erase sequences and OSC sequences still leak; update the ANSI_RE constant in onboard.ts to a broader pattern that removes CSI sequences and OSC/STRING sequences (e.g., match ESC [ ... final-byte not just 'm', ESC ] ... BEL, and other single-byte ESC sequences), replace the existing /\x1b\[[0-9;]*m/g with a comprehensive regex that covers CSI and OSC (and stray ESC sequences) to sanitize logs (ensure you update the ANSI_RE identifier and run the existing tests or the provided sanitizer check to verify samples "Erase line", "Cursor up", and "OSC title" are stripped).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 26-27: The current ANSI_RE only strips SGR sequences (ESC [ ... m)
so CSI cursor/erase sequences and OSC sequences still leak; update the ANSI_RE
constant in onboard.ts to a broader pattern that removes CSI sequences and
OSC/STRING sequences (e.g., match ESC [ ... final-byte not just 'm', ESC ] ...
BEL, and other single-byte ESC sequences), replace the existing
/\x1b\[[0-9;]*m/g with a comprehensive regex that covers CSI and OSC (and stray
ESC sequences) to sanitize logs (ensure you update the ANSI_RE identifier and
run the existing tests or the provided sanitizer check to verify samples "Erase
line", "Cursor up", and "OSC title" are stripped).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e3426ca6-f846-49b8-85b2-52d0c5882c03
📒 Files selected for processing (1)
src/lib/onboard.ts
The previous regex only stripped SGR color codes (\x1b[Nm). Sequences like erase-line (\x1b[2K), cursor-movement (\x1b[1A), and OSC title (\x1b]0;...\x07) passed through unchanged, allowing terminal rendering artifacts or output-spoofing in the gateway diagnostic output paths. Replace with a comprehensive regex that covers: - CSI: \x1B\[[0-?]*[ -/]*[`@-~] - OSC: \x1B\][^\x07]*(\x07|\x1B\\) - Single-char: \x1B[`@-Z\\-_] Fixes CodeRabbit review on PR NVIDIA#1812. Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous regex only stripped SGR color codes (\x1b[Nm). Sequences like erase-line (\x1b[2K), cursor-movement (\x1b[1A), hide-cursor (\x1b[?25l), and OSC title (\x1b]0;...\x07) passed through unchanged, allowing terminal rendering artifacts or output-spoofing in the gateway diagnostic output paths. Replace with a comprehensive regex per ECMA-48: - CSI: \x1B\[[0-?]*[ -/]*[@-~] (final bytes 0x40–0x7E) - OSC: \x1B\][^\x07]*(\x07|\x1B\\) - C1: \x1B[@-_] (two-byte escapes 0x40–0x5F) All 13 cases pass: SGR colors (m=109), erase-line (K), cursor-move, hide-cursor (l=108), OSC, C1, and normal-text passthrough. Fixes CodeRabbit review on PR NVIDIA#1812. Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
c3ff850 to
7c72bc6
Compare
…daction The previous test only checked that doctor logs appeared in stderr on gateway failure. Three gaps existed: 1. No ANSI codes in fake openshell output — ANSI_RE stripping in both the gateway-start output path and the doctor-log path was untested. 2. No credential token in fake doctor logs — a future redaction regression would go undetected. 3. The gateway-start per-line output path (Fix 1 / Fix 2) had no assertions against the \naiting artifact or the separate-lines format. Add to the fake openshell: - \033[...m color codes and \r\n in gateway start output - \033[...m codes, OOMKilled text, and a fake nvapi- token in doctor logs Add assertions: - No raw \x1b in stdout or stderr (ANSI stripped) - nvapi-fakecredential-9999 not present verbatim in stderr (redacted) - \naiting not present in stdout (Fix 2: \r\n artifact eliminated) - "Deploying" and "Waiting" appear on separate lines (Fix 1: per-line) Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
✨ Thanks for submitting this PR, which proposes a fix for a bug with the gateway startup process and may improve the overall quality of the codebase. Possibly related open issues: |
Conflict 1 (upsertProvider): upstream refactored two separate create/update code paths into a single unified `result` variable; our branch added redact() around the stderr/stdout output. Resolution: keep redact() but use the upstream `result` variable and dynamic `action` message string. Conflict 2 (exports): both sides added independent exports — upstream added dashboard helpers (buildAuthenticatedDashboardUrl, getDashboardAccessInfo, etc.); our branch added startGateway. Resolution: keep all. Signed-off-by: Dongni Yang <dongniy@nvidia.com>
4fe7e8f to
99810eb
Compare
ericksoa
left a comment
There was a problem hiding this comment.
LGTM — diagnostics and redaction improvements look solid. Approving as-is.
Two items to address in a follow-up PR:
-
Missing timeout on
openshell doctor logs— TherunCaptureOpenshell(["doctor", "logs", ...])call in the final-failure catch block has no timeout. If k3s is wedged, this will block the exit path indefinitely and make the failure experience worse. Please add a timeout (e.g.,timeout: 10_000in the opts). -
Unbounded gateway start output — The old code truncated to 240 chars; the new code prints all lines. On a gateway that emits hundreds of lines, this could produce very noisy output. Consider capping to ~50 lines.
…IA#1812) ## Summary - **Remove 240-char truncation** on gateway start output; split on newlines first, apply \`compactText\` per-line, and print each line indented under a header — fixes the \`\naiting\` rendering artifact caused by embedded \`\r\` overwriting the terminal cursor - **Auto-capture \`openshell doctor logs\`** in the final-failure catch block so users see the actual k3s/flannel/AppArmor errors without having to run a separate command manually - **Export \`startGateway\`** and add a test that verifies doctor log output appears in stderr when the gateway fails - **Redact all new output paths** with \`redact()\` + ANSI stripping before printing (addresses CodeRabbit security review) - **Redact \`upsertProvider\` and \`setupInference\` stderr** — \`openshell provider create/update\` and \`openshell inference set\` run with credential env in scope; their stderr was passed through \`compactText()\` without \`redact()\`, same class of bug as NVIDIA#1423 **Note:** The underlying gateway startup failure on Ubuntu 24.04 (overlayfs/AppArmor/flannel incompatibility reported in NVIDIA#1605) is an OpenShell upstream issue. This PR improves diagnostics so users and maintainers can see the actual error — it does not fix the root cause. Issue NVIDIA#1605 should remain open until an upstream fix or workaround lands. Improves diagnostics for NVIDIA#1605 ## Test plan - [ ] \`npm test\` passes (only pre-existing \`runtime-shell.test.ts\` failures) - [ ] New test \`prints doctor logs automatically when gateway fails to start (NVIDIA#1605)\` passes - [ ] Manual: \`nemoclaw onboard\` on a system where the gateway fails now shows \`Gateway logs:\` section with k3s output before the troubleshooting hints Signed-off-by: Dongni Yang <dongniy@nvidia.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com> Signed-off-by: ColinM-sys <cmcdonough@50words.com>
Summary
Note: The underlying gateway startup failure on Ubuntu 24.04 (overlayfs/AppArmor/flannel incompatibility reported in #1605) is an OpenShell upstream issue. This PR improves diagnostics so users and maintainers can see the actual error — it does not fix the root cause. Issue #1605 should remain open until an upstream fix or workaround lands.
Improves diagnostics for #1605
Test plan
Signed-off-by: Dongni Yang dongniy@nvidia.com
🤖 Generated with Claude Code