Skip to content

fix: improve gateway lifecycle recovery#952

Closed
kjw3 wants to merge 30 commits intomainfrom
fix/gateway-lifecycle-recovery-signed
Closed

fix: improve gateway lifecycle recovery#952
kjw3 wants to merge 30 commits intomainfrom
fix/gateway-lifecycle-recovery-signed

Conversation

@kjw3
Copy link
Copy Markdown
Contributor

@kjw3 kjw3 commented Mar 26, 2026

Supersedes #908 because the branch had to be rewritten onto signed commits to satisfy verified-signature requirements.

Summary

  • preserve a healthy shared nemoclaw gateway across repeat onboarding
  • reconcile live OpenShell sandbox state during connect and status instead of trusting stale local registry entries
  • classify restart/rebuild lifecycle failures so users get deterministic guidance instead of generic transport errors
  • extend double-onboard coverage so creating a second sandbox does not break the first

Issues

  • addresses #849
  • addresses part of #859
  • improves #888 by distinguishing:
    • gateway trust rotation / handshake failure
    • gateway metadata exists but the restarted API still refuses connections
    • gateway rebuilt and the old sandbox no longer exists

Security

  • no secret persistence added
  • no TLS downgrade or bypass added
  • no destructive auto-recovery of healthy sandboxes
  • no new shell-out paths beyond resolved openshell wrappers

Validation

npx vitest run test/cli.test.js test/onboard.test.js test/onboard-readiness.test.js test/registry.test.js
npx eslint bin/nemoclaw.js bin/lib/onboard.js test/cli.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
bash -n test/e2e/test-double-onboard.sh
shellcheck test/e2e/test-double-onboard.sh

Brev CPU Validation

Environment:

  • brev-cpu
  • instance: kj-nemoclaw-cpu-20260325-155447
  • branch commit tested: 43cf8eb

Validated on a real disposable Linux host:

  • onboard sandbox A / onboard sandbox B path preserves the shared gateway and keeps the first sandbox reachable
  • after openshell gateway stop + openshell gateway start --name nemoclaw, NemoClaw now surfaces a precise post-restart classification instead of a generic transport failure
  • after destructive gateway rebuild, NemoClaw removes the stale local sandbox entry when the old sandbox is gone
  • rerunning onboard after that rebuild recreates the sandbox cleanly and returns to Ready

Residual

  • This PR does not make OpenShell gateway restarts durable. It makes the failure modes explicit, safer, and easier to recover from.
  • On the tested Brev CPU host, I did not find a safe non-destructive recovery once the restarted gateway API entered the persistent Connection refused state. That remains a gateway/runtime limitation, not something this PR tries to bypass.

Summary by CodeRabbit

Release Notes

  • New Features

    • Intelligent gateway reuse: System now detects and reuses healthy gateways, eliminating unnecessary recreation and enabling faster reconnection
    • Enhanced sandbox reconciliation: Live state probing and graceful recovery from sandbox/gateway mismatches with targeted error messages
  • Bug Fixes

    • Improved handling of stale sandbox registry entries and gateway states
    • Better recovery from temporary gateway connectivity issues
  • Chores

    • Build process cleanup now removes Python cache and virtual environment directories

kjw3 and others added 30 commits March 25, 2026 15:08
# Conflicts:
#	bin/lib/onboard.js
#	test/e2e-gateway-isolation.sh
# Conflicts:
#	test/e2e-gateway-isolation.sh
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 26, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

The changes introduce gateway health checking and sandbox-to-gateway reconciliation logic in the NemoClaw CLI tooling. When connecting or querying sandbox status, the system now probes live OpenShell state, detects reconciliation issues, and attempts runtime recovery before hard-failing. Core library exports expand, and test coverage validates lifecycle messaging and recovery flows.

Changes

Cohort / File(s) Summary
Core Library Refactoring
bin/lib/onboard.js
Added ANSI-stripping and gateway name parsing helpers. Introduced isGatewayHealthy() to assess gateway reusability. Refactored gateway start logic into startGatewayWithOptions() with conditional reuse and failure modes. Updated preflight to capture status and reuse healthy gateways. Added new exports: isGatewayHealthy, preflight, startGateway, startGatewayForRecovery.
CLI Reconciliation & Recovery
bin/nemoclaw.js
Replaced shell-based OpenShell commands with direct CLI invocation helpers. Added sandbox/gateway reconciliation logic that probes live OpenShell state before connect/status operations. Integrated runtime recovery (gateway select and startGatewayForRecovery) for gateway-related failures. Converted sandboxConnect and sandboxStatus to async and added ensureLiveSandboxOrExit for targeted error guidance.
Build Context Cleanup
scripts/setup.sh
Extended sandbox staging to remove Python virtual environment, test cache, and __pycache__ directories from the copied nemoclaw-blueprint.
Unit Tests
test/onboard.test.js, test/gateway-cleanup.test.js
Added unit tests for isGatewayHealthy function covering gateway name, status, and metadata validation. Updated gateway start failure test to target startGatewayWithOptions and validate stale gateway cleanup logic.
CLI Dispatch Tests
test/cli.test.js
Added comprehensive test cases validating CLI lifecycle behaviors: gateway/sandbox health detection, registry reconciliation, stale entry removal, recovery messaging, and ANSI-colored error output matching.
E2E Test Refactoring
test/e2e/test-double-onboard.sh
Replaced stale state recovery approach with fake OpenAI-compatible endpoint testing. Added execution time limiting, server startup/cleanup helpers, and expanded verification phases to cover sandbox persistence, registry reconciliation, and gateway lifecycle response messaging.

Sequence Diagram(s)

sequenceDiagram
    actor User as User/CLI
    participant NemoClaw as nemoclaw.js
    participant OpenShell as OpenShell CLI
    participant Registry as Local Registry
    
    User->>NemoClaw: status/connect
    
    rect rgba(100, 150, 200, 0.5)
    Note over NemoClaw,OpenShell: Sandbox/Gateway Reconciliation
    NemoClaw->>OpenShell: openshell sandbox get
    OpenShell-->>NemoClaw: sandbox state
    NemoClaw->>OpenShell: openshell status
    OpenShell-->>NemoClaw: gateway status
    NemoClaw->>OpenShell: openshell gateway info
    OpenShell-->>NemoClaw: gateway metadata
    end
    
    alt Healthy Gateway Present
        NemoClaw->>Registry: Query registry
        Registry-->>NemoClaw: OK
        NemoClaw-->>User: Status/Connect success
    else Gateway Issues Detected
        rect rgba(200, 100, 100, 0.5)
        Note over NemoClaw,OpenShell: Runtime Recovery
        NemoClaw->>OpenShell: gateway select nemoclaw
        NemoClaw->>OpenShell: startGatewayForRecovery
        OpenShell-->>NemoClaw: Recovery result
        end
        
        NemoClaw->>OpenShell: Re-query sandbox state
        OpenShell-->>NemoClaw: Updated state
        
        alt Recovery Successful
            NemoClaw->>Registry: Update/reconcile registry
            Registry-->>NemoClaw: OK
            NemoClaw-->>User: Recovered + Status/Connect
        else Recovery Failed
            NemoClaw->>Registry: Mark invalid entries
            Registry-->>NemoClaw: OK
            NemoClaw-->>User: Error + Guidance
        end
    end
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A gateway that's healthy, we check with great care,
With reconciliation logic floating through air!
When the sandbox forgets, we recover with grace,
Runtime reattachment for each API place.
No stale sessions linger—we probe and rebuild,
NemoClaw's pipeline, with wisdom now filled! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main objective of the PR—implementing improved gateway lifecycle recovery across multiple files including onboarding, CLI dispatch, and test coverage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gateway-lifecycle-recovery-signed

Comment @coderabbitai help to get the list of available commands and usage tips.

@kjw3
Copy link
Copy Markdown
Contributor Author

kjw3 commented Mar 26, 2026

Closing this in favor of #953. #952 still contained unverified historical merge commits; #953 rebuilds the same work as a clean single verified signed commit so it can satisfy the branch signature requirements.

@kjw3 kjw3 closed this Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants