Skip to content

security: add runtime kill switch for the managed-skills reconciler#291

Merged
mbektas merged 1 commit into
plmbr:mainfrom
pjdoland:fix/skills-reconciler-force-off
May 18, 2026
Merged

security: add runtime kill switch for the managed-skills reconciler#291
mbektas merged 1 commit into
plmbr:mainfrom
pjdoland:fix/skills-reconciler-force-off

Conversation

@pjdoland
Copy link
Copy Markdown
Collaborator

Summary

The admin guide currently documents that setting `skills_management_policy = force-off` (or `NBI_SKILLS_MANAGEMENT_POLICY=force-off`) will "suppress the managed-skills reconciler," but also concedes the suppression "takes effect on JupyterLab server restart" and that a live config-reload won't stop a reconciler that's already running. For an incident-response kill switch (compromised manifest URL, leaked managed-skills token), needing a server restart is too slow.

This PR adds two non-restart paths to stop the reconciler.

Solution

1. HTTP kill switch. New endpoint `POST /notebook-intelligence/skills/reconciler/stop`. Authenticated with the standard Jupyter session token. Returns `{"stopped": true, "was_running": }` so callers know whether the call actually interrupted work. Idempotent: a second POST after the reconciler is already stopped returns the same success shape, so admins can script the call across pods without special-casing already-stopped pods.

The endpoint is deliberately not gated by the Skills management policy. Inheriting from `SkillsBaseHandler` would mean the policy gate returns 403 in the exact `force-off` state where the kill switch is supposed to fire. A test pins this contract structurally (`SkillsBaseHandler` not in the MRO) so a future refactor can't silently re-introduce the deadlock.

2. Per-cycle env-var recheck. The reconciler's background loop now reads `NBI_SKILLS_MANAGEMENT_POLICY` at the top of each iteration via a new `_policy_force_off()` helper. If force-off is observed, the loop self-stops and logs. Latency is bounded by the cycle interval (default 24h), so this is the slow path for deployments that don't or can't call the HTTP endpoint. Adds one env-var read per cycle; zero ongoing cost.

A new `SkillReconciler.is_running()` helper supports the new endpoint's `was_running` payload field. `AIServiceManager.handle_stop_request` continues to call `reconciler.stop()` at shutdown; double-stop is safe and the early self-stop leaves `_thread` non-null so the shutdown join is harmless.

Testing

Eight new pytest cases across two files:

  • `TestPolicyForceOffStop`: force-off at start prevents any reconcile pass; force-off mid-loop stops before the next pass; user-choice leaves the loop running; case-insensitive policy parsing including the `force_off`-vs-`force-off` underscore-vs-dash typo guard.
  • `TestBackgroundThread::test_is_running_reflects_thread_state`: `is_running()` returns False before start, True during run, False after stop.
  • `TestSkillsReconcilerStopHandler`: idempotent in both running and stopped states, returns the correct `was_running` value, and a structural MRO assertion pinning the deliberate gate-bypass.

`pytest tests/` (730 passing), `jlpm tsc --noEmit`, `jlpm lint:check`, `jlpm jest` (154 passing). All four green.

Six-agent review applied: added the MRO contract test, added the no-live-restart and per-tenant trust notes to the admin guide, added a clarifying comment at the route-registration site.

Risks / follow-ups

  • No live restart of the reconciler. Once stopped, only a server bounce re-enables it in the same pod. Intentional: a kill switch another script could flip back on isn't a kill switch. Documented as a sharp edge.
  • Per-tenant trust posture. The endpoint authenticates with the user's Jupyter session token, not an admin claim. In hub deployments where the pod owner is not the policy admin, a tenant can stop their own pod's reconciler. The reconciler is per-pod, so this only affects that user's managed-skills delivery. Documented.
  • Capabilities surfacing. The Skills panel does not currently show a "reconciler stopped" status indicator. Worth a follow-up so users can verify the kill switch worked from the UI, but not load-bearing for this PR.

The admin guide previously documented that setting the Skills management
policy to force-off would "suppress the managed-skills reconciler," but
also conceded that a JupyterLab server restart was required for it to
take effect on an already-running pod. For an incident-response kill
switch (compromised manifest URL, leaked managed_skills_token), needing
a restart is too slow.

This adds two non-restart paths to stop the reconciler:

1. POST /notebook-intelligence/skills/reconciler/stop. Authenticated via
   the standard Jupyter session token. Idempotent: returns
   {"stopped": true, "was_running": <bool>} regardless of whether a
   reconciler was running, so admins can fan the call out across pods
   from a script without special-casing already-stopped pods. Deliberately
   NOT gated by SkillsBaseHandler's policy attribute. The kill switch
   must remain reachable in the exact force-off state it exists to
   support.

2. Per-cycle re-read of NBI_SKILLS_MANAGEMENT_POLICY at the top of the
   reconciler's _run_loop. If force-off is observed, the loop self-stops.
   Latency is bounded by the cycle interval (default 24h), so this is the
   slow path; the HTTP endpoint is the fast path.

Adds SkillReconciler.is_running() so the new endpoint can report whether
the call actually stopped a thread. Existing shutdown path still calls
reconciler.stop(); double-stop is safe.

Tests cover is_running() reflecting thread state, force-off at start
preventing any reconcile pass, force-off mid-loop stopping before the
next pass, user-choice leaving the loop running, case-insensitive policy
parsing, the HTTP endpoint's idempotent behavior in both running and
stopped states, and a structural assertion that the stop handler is not
in SkillsBaseHandler's MRO (pins the deliberate gate-bypass contract).

Admin guide updated with the new endpoint, the per-cycle behavior, the
no-live-restart sharp edge, and a per-tenant trust posture note for
JupyterHub deployments.
@pjdoland pjdoland added the bug Something isn't working label May 18, 2026
@mbektas mbektas merged commit 9e0e3b1 into plmbr:main May 18, 2026
5 checks passed
pjdoland added a commit to pjdoland/notebook-intelligence that referenced this pull request May 22, 2026
Promotes the [Unreleased] CHANGELOG snapshot to [5.0.0] - 2026-05-22
and expands it to cover everything merged into upstream/main after
PR plmbr#287's docs refresh. Bumps package.json to 5.0.0.

CHANGELOG additions cover the post-plmbr#287 surface:

- Settings tabs: plugin marketplace picker (plmbr#284), plugin marketplace
  details + Update button (plmbr#303), per-workspace MCP disable (plmbr#286),
  JSON-paste path in Add MCP server (plmbr#285).
- Launchers: hide-with-policy (plmbr#288), brand icons for Codex / opencode
  (plmbr#325, plmbr#333), per-launch directory picker (plmbr#332).
- Chat sidebar and agentic UX: workspace @-mention in Claude mode
  (plmbr#327), reload-open-files-on-disk (plmbr#330), steered system prompt
  away from over-eager notebook creation (plmbr#336).
- Skills: multi-manifest support (plmbr#321), tracks-upstream for user-
  imported skills (plmbr#322), HTTP kill switch for the reconciler (plmbr#291).
- Accessibility: full sub-section covering plmbr#305-plmbr#320.
- Security: shell-tool sandbox (plmbr#290), Claude UI-bridge sandbox (plmbr#323),
  0o600 on encrypted token (plmbr#293), env-secret scrubbing (plmbr#295), MCP
  config shape validation (plmbr#299), XSS allowlist (plmbr#296), Copilot WS
  auth + origin (plmbr#301), GHE host detection (plmbr#292), fastmcp -> mcp SDK
  swap (plmbr#324).
- Fixed: session listing unification (plmbr#310), session preview unwrap
  (plmbr#331), down-area runtime throw (plmbr#330 follow-up), WS message-handler
  leak (plmbr#294).
- Removed: fastmcp dependency, history.jsonl session gate.

Adds a Migration note covering the five behavior changes operators
should review before upgrading from 4.x: fastmcp swap, path
sandboxes, history.jsonl gate removal, workspace @-mention pointer
shape, and the Copilot WebSocket auth/origin tightening.

Two reviewer rounds (six personas each) applied:
- Round 1 caught security overclaims (plmbr#293, plmbr#299, plmbr#323), the
  plmbr#284/plmbr#303 mis-attribution, missing migration note, 3 em dashes,
  and the stale `fastmcp==2.x.*` recommendation in the admin guide.
- Round 2 caught the missing plmbr#301 migration bullet, missing version-
  matrix 5.0.x row, missing README TOC entry, and a couple of style
  nits (sub-heading overpromise, orphan bullet).

Skipped (deferred to future PRs):
- README first-run tour mention.
- Admin guide HTTP kill-switch row in Failure-modes table.
- Terminal drag-drop trust-model precision update after plmbr#327.
- Cipher description nit in plmbr#293 (Fernet AES-128-CBC+HMAC, not
  AES-GCM).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants