Skip to content

fix: detect conflicting host kubelet before gateway start#433

Merged
ericksoa merged 7 commits intoNVIDIA:mainfrom
BenediktSchackenberg:fix/kubelet-cgroup-conflict
Apr 17, 2026
Merged

fix: detect conflicting host kubelet before gateway start#433
ericksoa merged 7 commits intoNVIDIA:mainfrom
BenediktSchackenberg:fix/kubelet-cgroup-conflict

Conversation

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor

@BenediktSchackenberg BenediktSchackenberg commented Mar 19, 2026

Summary

When another Kubernetes distribution (MicroK8s, k3s, kubeadm) is running on the host, the gateway's embedded k3s inside Docker (started with --cgroupns=host) fights over /sys/fs/cgroup/kubepods with the host kubelet. Both kubelets see each other's pods as orphaned and kill them, resulting in rapid CrashLoopBackOff cycles.

Changes

setup-spark.sh:

  • Added detect_conflicting_kubelet() function that checks for:
    • Running kubelet/kubelite processes
    • Active MicroK8s installation (microk8s status)
    • Active k3s systemd service
  • Interactive: prompts user to stop the conflicting service or abort
  • Non-interactive: fails with clear instructions

setup.sh:

  • Added lightweight preflight warning (non-blocking) before openshell gateway start
  • Warns about detected kubelet but doesn't block — the user may know what they're doing

Why not just switch to --cgroupns=private?

That would be the ideal long-term fix (suggested in #431), but it requires changes in OpenShell's gateway container startup logic, which is outside this repo's scope. The preflight check is the safest fix we can ship now without upstream changes.

Fixes #431

AI-assisted: Built with Claude, reviewed by human.

Summary by CodeRabbit

  • New Features
    • Pre-flight detection of a running host Kubernetes (kubelet/k3s/MicroK8s) now warns about cgroup namespace conflicts that can cause pods to CrashLoopBackOff.
    • Warnings appear before restart/destroy actions; interactive runs prompt to continue, non-interactive runs fail with clear guidance to stop the conflicting Kubernetes or re-run interactively.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Added host kubelet detection and warning: new detect_kubelet_conflict() and warn_kubelet_conflict() in scripts/lib/runtime.sh. scripts/setup-spark.sh now sources the helpers and runs an interactive/non‑interactive preflight that may abort on conflict. scripts/setup.sh emits a non‑blocking warning when a host kubelet is detected.

Changes

Cohort / File(s) Summary
Runtime helpers
scripts/lib/runtime.sh
Added detect_kubelet_conflict() to detect host kubelet/kubelite/k3s, microk8s running status, and relevant systemd services; added warn_kubelet_conflict() and KUBELET_CONFLICT_DETAIL for standardized warnings.
Spark setup gating
scripts/setup-spark.sh
Now resolves script dir and sources the runtime helpers; runs detect_kubelet_conflict before changing default-cgroupns-mode; on conflict calls warn_kubelet_conflict "$KUBELET_CONFLICT_DETAIL" and either prompts and may abort in TTY, or fails fast in non‑TTY.
Gateway setup warning
scripts/setup.sh
Added an early, non‑blocking detect_kubelet_conflict invocation that emits warn_kubelet_conflict "$KUBELET_CONFLICT_DETAIL" prior to gateway restart/destroy steps; does not gate or change restart flow.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(220,230,241,0.5)
    participant User as User (TTY / Non‑TTY)
    participant Script as Setup Script
    participant Host as Host (processes / systemd)
    participant Docker as Docker / container runtime
  end

  Script->>Host: detect_kubelet_conflict() — probe for kubelet/kubelite/k3s,\nmicrok8s status, systemd services
  alt conflict found
    Script->>User: warn_kubelet_conflict(detail) — emit multi-line guidance
    alt interactive (TTY)
      User-->>Script: reply (Y/N)
      alt confirmed (Y)
        Script->>Docker: proceed with cgroupns change / restart
      else not confirmed
        Script-->>User: abort/exit
      end
    else non-interactive
      Script-->>User: fail fast with stop/retry instructions
    end
  else no conflict
    Script->>Docker: proceed with cgroupns change / restart
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I sniffed the host for kubelets in the night,
Found tiny clashes that made containers fight.
I called out a warning, then waited for say,
Prompting in terminal or failing the way.
Hop lightly, restart, and keep clusters bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: detect conflicting host kubelet before gateway start' accurately and concisely summarizes the main change: adding kubelet conflict detection before the gateway starts.
Linked Issues check ✅ Passed The PR implements mitigation #3 from issue #431: detecting conflicting host Kubernetes distributions (MicroK8s, k3s, kubeadm) and warning/blocking on detection with clear remediation instructions.
Out of Scope Changes check ✅ Passed All changes are directly related to adding kubelet conflict detection: two new utility functions in lib/runtime.sh and their integration into setup-spark.sh and setup.sh. No unrelated modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds host-side preflight detection of conflicting Kubernetes kubelets to reduce openshell gateway start CrashLoopBackOff scenarios caused by cgroupns=host cgroup contention (kubepods), per issue #431.

Changes:

  • Add a non-blocking kubelet/k3s warning in scripts/setup.sh before starting the gateway.
  • Add a blocking (with interactive override) conflicting-kubelet detector in scripts/setup-spark.sh before configuring Docker for cgroupns=host.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
scripts/setup.sh Adds a lightweight preflight warning if a host kubelet/k3s is detected prior to gateway start.
scripts/setup-spark.sh Adds a preflight detector that aborts (or prompts) when a conflicting host Kubernetes is running before applying Docker cgroupns changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/setup-spark.sh Outdated
Comment on lines +88 to +91
# Check for k3s service
if systemctl is-active --quiet k3s 2>/dev/null; then
found="k3s service is active"
fi
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The k3s detection only checks systemctl is-active k3s, which can miss hosts running k3s as k3s-agent (common on agent-only nodes) or as a k3s process without that specific unit. Consider checking k3s-agent as well (and/or pgrep -x k3s) so the conflict warning triggers reliably for k3s installations.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — added pgrep -x k3s and systemctl is-active --quiet k3s-agent to both scripts.

Comment thread scripts/setup-spark.sh Outdated
warn ""

if [ -t 0 ]; then
read -rp "Continue anyway? [y/N] " reply
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the script runs with set -e, read -rp ... will cause an immediate exit on EOF/error (e.g., user hits Ctrl-D) before your intended abort message runs. Handle read failure explicitly (e.g., if ! read ...; then fail ...; fi) so the script exits with a clear, consistent error.

Suggested change
read -rp "Continue anyway? [y/N] " reply
if ! read -rp "Continue anyway? [y/N] " reply; then
fail "Aborted. Stop the conflicting Kubernetes service and retry."
fi

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — read failure is now handled explicitly with if ! read ... so Ctrl-D/EOF gets a clean error message instead of a cryptic set -e exit.

Comment thread scripts/setup.sh Outdated
Comment on lines +97 to +102
if pgrep -x kubelet > /dev/null 2>&1 || pgrep -x kubelite > /dev/null 2>&1 || systemctl is-active --quiet k3s 2>/dev/null; then
warn "⚠️ A Kubernetes kubelet is running on this host."
warn "The gateway's embedded k3s may conflict over cgroup paths (kubepods)."
warn "If the gateway fails with CrashLoopBackOff, stop the host Kubernetes first:"
warn " sudo microk8s stop / sudo systemctl stop k3s / sudo systemctl stop kubelet"
fi
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preflight check looks for kubelet/kubelite processes and an active k3s systemd unit, but it can miss k3s installations running as k3s-agent (and/or just a k3s process). Consider checking systemctl is-active --quiet k3s-agent as well (and/or pgrep -x k3s) to avoid false negatives when k3s is the conflicting host Kubernetes.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the same commit — both pgrep -x k3s and k3s-agent systemd unit are now checked.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
scripts/setup.sh (1)

95-102: Consider centralizing kubelet-conflict detection to avoid drift.

This block now overlaps with scripts/setup-spark.sh but with different checks/behavior. Moving shared detection/message formatting into scripts/lib/runtime.sh would keep both setup paths consistent and easier to maintain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 95 - 102, The kubelet-conflict detection block
in scripts/setup.sh is duplicated with different behavior in
scripts/setup-spark.sh; extract the logic into a shared function placed in
scripts/lib/runtime.sh (e.g., detect_kubelet_conflict() which runs the
pgrep/systemctl checks and returns status, and warn_kubelet_conflict() which
emits the standardized warn messages), then replace the inline checks in both
scripts (setup.sh and setup-spark.sh) with calls to detect_kubelet_conflict &&
warn_kubelet_conflict to keep detection and message formatting centralized and
consistent across both setup paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/setup.sh`:
- Around line 95-102: The kubelet-conflict detection block in scripts/setup.sh
is duplicated with different behavior in scripts/setup-spark.sh; extract the
logic into a shared function placed in scripts/lib/runtime.sh (e.g.,
detect_kubelet_conflict() which runs the pgrep/systemctl checks and returns
status, and warn_kubelet_conflict() which emits the standardized warn messages),
then replace the inline checks in both scripts (setup.sh and setup-spark.sh)
with calls to detect_kubelet_conflict && warn_kubelet_conflict to keep detection
and message formatting centralized and consistent across both setup paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e8e7be6-e6bc-40c7-a32e-aa55b532778d

📥 Commits

Reviewing files that changed from the base of the PR and between ffdccaf and 0214255.

📒 Files selected for processing (2)
  • scripts/setup-spark.sh
  • scripts/setup.sh

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

Addressed CodeRabbit's nitpick — extracted detect_kubelet_conflict() and warn_kubelet_conflict() into scripts/lib/runtime.sh. Both setup.sh and setup-spark.sh now call the shared functions instead of duplicating the detection logic.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/setup-spark.sh (1)

67-87: Consider moving conflict preflight before user/group mutation.

Right now, a detected conflict can abort after usermod -aG docker has already changed host state. Running this preflight earlier would make failure paths fully fail-fast and side-effect free.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-spark.sh` around lines 67 - 87, Move the Kubernetes conflict
preflight (the detect_kubelet_conflict check and its
warn_kubelet_conflict/interactive fail path) to run before any host user/group
mutation occurs (e.g., before the usermod -aG docker invocation); specifically,
relocate the entire if detect_kubelet_conflict ... fi block so it executes prior
to any calls that change system state, ensuring the script fails fast and leaves
the host unmodified when a conflict is detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/lib/runtime.sh`:
- Line 260: The assignment of local detail in warn_kubelet_conflict() uses
${1:-$KUBELET_CONFLICT_DETAIL} which can trigger an unbound-variable error under
set -u if KUBELET_CONFLICT_DETAIL is not defined; change the assignment to use a
nested default so it falls back to an empty value when both $1 and
KUBELET_CONFLICT_DETAIL are unset (i.e., set detail to first argument, otherwise
to KUBELET_CONFLICT_DETAIL, otherwise to an empty string) so the function is
defensive when called before detect_kubelet_conflict() initializes the variable.

---

Nitpick comments:
In `@scripts/setup-spark.sh`:
- Around line 67-87: Move the Kubernetes conflict preflight (the
detect_kubelet_conflict check and its warn_kubelet_conflict/interactive fail
path) to run before any host user/group mutation occurs (e.g., before the
usermod -aG docker invocation); specifically, relocate the entire if
detect_kubelet_conflict ... fi block so it executes prior to any calls that
change system state, ensuring the script fails fast and leaves the host
unmodified when a conflict is detected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e6b4e37b-5c8b-48e1-ad11-68542573dd65

📥 Commits

Reviewing files that changed from the base of the PR and between 7561c3c and 8a1a285.

📒 Files selected for processing (3)
  • scripts/lib/runtime.sh
  • scripts/setup-spark.sh
  • scripts/setup.sh
✅ Files skipped from review due to trivial changes (1)
  • scripts/setup.sh

Comment thread scripts/lib/runtime.sh Outdated
@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

All bot feedback addressed across 4 commits:

  1. Initial fix — preflight kubelet detection in setup-spark.sh + setup.sh
  2. k3s-agent + pgrep k3s — Copilot caught missing k3s detection paths
  3. Shared lib — CodeRabbit nitpick: extracted into lib/runtime.sh to avoid duplication
  4. set -u safety — CodeRabbit caught unbound variable edge case

The final diff is clean:

  • scripts/lib/runtime.sh: +44 lines — detect_kubelet_conflict() + warn_kubelet_conflict()
  • scripts/setup-spark.sh: +24 lines — interactive blocking check before cgroupns config
  • scripts/setup.sh: +6 lines — non-blocking warning before gateway start

Detects: kubelet, kubelite, k3s processes + microk8s status + k3s/k3s-agent systemd units.

Ready for human review 🙂

@psorensen-nvidia psorensen-nvidia added bug Something isn't working Platform: DGX Spark Support for DGX Spark Platform: Ubuntu Support for Linux Ubuntu priority: high Important issue that should be resolved in the next release Docker Support for Docker containerization K8s Use this label to identify Kubernetes deployment issues with NemoClaw. labels Mar 19, 2026
@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch 3 times, most recently from 7303195 to c6cb2da Compare March 27, 2026 19:59
@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

@cv Rebased onto latest main (clean, 4 commits → 3 files). All review comments from Copilot were already addressed in earlier commits (k3s-agent detection, read EOF handling). Would appreciate a CI trigger when you get a chance — first-time contributor. Thanks!

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

Rebased onto latest main — squashed to 1 clean commit with conventional format and Signed-off-by. Note: setup.sh was deleted in main (#1235), so the kubelet detection now lives purely in scripts/lib/runtime.sh (shared helper) and scripts/setup-spark.sh.

@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch 3 times, most recently from 6abdffe to d417a2a Compare April 6, 2026 13:59
@wscurran wscurran added status: needs-info For issues/PRs that lack a description. (Signals to the author that more detail is required). status: rebase PR needs to be rebased against main before review can continue and removed status: needs-info For issues/PRs that lack a description. (Signals to the author that more detail is required). labels Apr 14, 2026
Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch from d417a2a to 02dd445 Compare April 14, 2026 05:04
@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

Rebased on latest main — no conflicts.

The detection logic is still valid and in the right place. setup-spark.sh is now a deprecated alias for nemoclaw onboard, but the kubelet check runs at the host level (before openshell gateway start) through scripts/lib/runtime.shdetect_kubelet_conflict() / warn_kubelet_conflict(), sourced by scripts/setup-spark.sh. That's exactly the right call site — inside the container (nemoclaw-start.sh) is too late.

The refactored onboard flow still triggers setup-spark.sh for DGX Spark paths, so the check fires before gateway bootstrap.

Pipe continuation style and redirect spacing to match project
shfmt config (-i 2 -ci).

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@wscurran wscurran removed the status: rebase PR needs to be rebased against main before review can continue label Apr 15, 2026
@cv cv added the v0.0.18 Release target label Apr 16, 2026
@prekshivyas prekshivyas self-assigned this Apr 16, 2026
Copy link
Copy Markdown
Contributor

@prekshivyas prekshivyas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid kubelet conflict detection — covers pgrep, MicroK8s, k3s/k3s-agent. Review comments all addressed (Copilot + CodeRabbit). CI needs a rebase + DCO sign-off in PR body. @BenediktSchackenberg can you rebase onto main and add Signed-off-by to the PR description?

@prekshivyas prekshivyas requested a review from ericksoa April 16, 2026 22:17
@ericksoa ericksoa added v0.0.19 Release target and removed v0.0.18 Release target labels Apr 17, 2026
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving to unblock merge. The new detect_kubelet_conflict() / warn_kubelet_conflict() additions to runtime.sh are safe (additive-only, no caller impact), and setup-spark.sh is well-isolated (Linux + root gates).

Follow-up items tracked for a subsequent PR:

  • Add tests for detect_kubelet_conflict() in runtime-shell.test.ts
  • Add Linux guard to detect_kubelet_conflict() so it's safe if called from macOS-sourced scripts
  • Fix bare except: in setup-spark.sh Python blocks → except (IOError, json.JSONDecodeError):
  • Consider atomic write for daemon.json manipulation

@ericksoa ericksoa merged commit 38a65b5 into NVIDIA:main Apr 17, 2026
ericksoa added a commit that referenced this pull request Apr 17, 2026
Follow-up to #433. Addresses four items from regression risk review:

1. Add Linux guard to detect_kubelet_conflict() so it safely no-ops
   when runtime.sh is sourced on macOS
2. Replace bare `except:` in setup-spark.sh Python blocks with
   `except (IOError, json.JSONDecodeError):`
3. Use atomic writes (tempfile + os.replace) for daemon.json to
   prevent corruption from partial writes
4. Add tests for detect_kubelet_conflict() and warn_kubelet_conflict()
   in runtime-shell.test.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ericksoa added a commit that referenced this pull request Apr 17, 2026
…2028)

## Summary

Follow-up to #433. Addresses four items identified during regression
risk review:

- Add Linux platform guard to `detect_kubelet_conflict()` so it safely
returns "no conflict" when `runtime.sh` is sourced on macOS
- Replace bare `except:` in `setup-spark.sh` Python blocks with `except
(IOError, json.JSONDecodeError):` to avoid swallowing
`KeyboardInterrupt`/`SystemExit`
- Use atomic writes (`tempfile.mkstemp` + `os.replace`) for
`daemon.json` manipulation to prevent corruption from partial writes or
concurrent reads
- Add 4 tests for `detect_kubelet_conflict()` and
`warn_kubelet_conflict()` in `runtime-shell.test.ts`

## Test plan

- [x] 4 new tests in `runtime-shell.test.ts` (all passing)
- [x] ShellCheck clean on both modified scripts
- [x] All 23 runtime-shell tests pass (16 existing + 4 new + 3 others)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Docker daemon configuration updates now use atomic file writes for
safer upgrades.
* Kubelet conflict detection now skips non-Linux systems to avoid
inappropriate checks.
* Error handling for configuration reads/parsing refined to catch
specific errors.

* **Tests**
* Added tests covering kubelet conflict detection across OS variations,
process/service states, and warning output.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Docker Support for Docker containerization K8s Use this label to identify Kubernetes deployment issues with NemoClaw. Platform: DGX Spark Support for DGX Spark Platform: Ubuntu Support for Linux Ubuntu priority: high Important issue that should be resolved in the next release v0.0.19 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

openshell gateway start fails with CrashLoopBackOff when another Kubernetes distribution is running on the host

7 participants