security: add runtime kill switch for the managed-skills reconciler#291
Merged
Merged
Conversation
The admin guide previously documented that setting the Skills management
policy to force-off would "suppress the managed-skills reconciler," but
also conceded that a JupyterLab server restart was required for it to
take effect on an already-running pod. For an incident-response kill
switch (compromised manifest URL, leaked managed_skills_token), needing
a restart is too slow.
This adds two non-restart paths to stop the reconciler:
1. POST /notebook-intelligence/skills/reconciler/stop. Authenticated via
the standard Jupyter session token. Idempotent: returns
{"stopped": true, "was_running": <bool>} regardless of whether a
reconciler was running, so admins can fan the call out across pods
from a script without special-casing already-stopped pods. Deliberately
NOT gated by SkillsBaseHandler's policy attribute. The kill switch
must remain reachable in the exact force-off state it exists to
support.
2. Per-cycle re-read of NBI_SKILLS_MANAGEMENT_POLICY at the top of the
reconciler's _run_loop. If force-off is observed, the loop self-stops.
Latency is bounded by the cycle interval (default 24h), so this is the
slow path; the HTTP endpoint is the fast path.
Adds SkillReconciler.is_running() so the new endpoint can report whether
the call actually stopped a thread. Existing shutdown path still calls
reconciler.stop(); double-stop is safe.
Tests cover is_running() reflecting thread state, force-off at start
preventing any reconcile pass, force-off mid-loop stopping before the
next pass, user-choice leaving the loop running, case-insensitive policy
parsing, the HTTP endpoint's idempotent behavior in both running and
stopped states, and a structural assertion that the stop handler is not
in SkillsBaseHandler's MRO (pins the deliberate gate-bypass contract).
Admin guide updated with the new endpoint, the per-cycle behavior, the
no-live-restart sharp edge, and a per-tenant trust posture note for
JupyterHub deployments.
mbektas
approved these changes
May 18, 2026
pjdoland
added a commit
to pjdoland/notebook-intelligence
that referenced
this pull request
May 22, 2026
Promotes the [Unreleased] CHANGELOG snapshot to [5.0.0] - 2026-05-22 and expands it to cover everything merged into upstream/main after PR plmbr#287's docs refresh. Bumps package.json to 5.0.0. CHANGELOG additions cover the post-plmbr#287 surface: - Settings tabs: plugin marketplace picker (plmbr#284), plugin marketplace details + Update button (plmbr#303), per-workspace MCP disable (plmbr#286), JSON-paste path in Add MCP server (plmbr#285). - Launchers: hide-with-policy (plmbr#288), brand icons for Codex / opencode (plmbr#325, plmbr#333), per-launch directory picker (plmbr#332). - Chat sidebar and agentic UX: workspace @-mention in Claude mode (plmbr#327), reload-open-files-on-disk (plmbr#330), steered system prompt away from over-eager notebook creation (plmbr#336). - Skills: multi-manifest support (plmbr#321), tracks-upstream for user- imported skills (plmbr#322), HTTP kill switch for the reconciler (plmbr#291). - Accessibility: full sub-section covering plmbr#305-plmbr#320. - Security: shell-tool sandbox (plmbr#290), Claude UI-bridge sandbox (plmbr#323), 0o600 on encrypted token (plmbr#293), env-secret scrubbing (plmbr#295), MCP config shape validation (plmbr#299), XSS allowlist (plmbr#296), Copilot WS auth + origin (plmbr#301), GHE host detection (plmbr#292), fastmcp -> mcp SDK swap (plmbr#324). - Fixed: session listing unification (plmbr#310), session preview unwrap (plmbr#331), down-area runtime throw (plmbr#330 follow-up), WS message-handler leak (plmbr#294). - Removed: fastmcp dependency, history.jsonl session gate. Adds a Migration note covering the five behavior changes operators should review before upgrading from 4.x: fastmcp swap, path sandboxes, history.jsonl gate removal, workspace @-mention pointer shape, and the Copilot WebSocket auth/origin tightening. Two reviewer rounds (six personas each) applied: - Round 1 caught security overclaims (plmbr#293, plmbr#299, plmbr#323), the plmbr#284/plmbr#303 mis-attribution, missing migration note, 3 em dashes, and the stale `fastmcp==2.x.*` recommendation in the admin guide. - Round 2 caught the missing plmbr#301 migration bullet, missing version- matrix 5.0.x row, missing README TOC entry, and a couple of style nits (sub-heading overpromise, orphan bullet). Skipped (deferred to future PRs): - README first-run tour mention. - Admin guide HTTP kill-switch row in Failure-modes table. - Terminal drag-drop trust-model precision update after plmbr#327. - Cipher description nit in plmbr#293 (Fernet AES-128-CBC+HMAC, not AES-GCM).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The admin guide currently documents that setting `skills_management_policy = force-off` (or `NBI_SKILLS_MANAGEMENT_POLICY=force-off`) will "suppress the managed-skills reconciler," but also concedes the suppression "takes effect on JupyterLab server restart" and that a live config-reload won't stop a reconciler that's already running. For an incident-response kill switch (compromised manifest URL, leaked managed-skills token), needing a server restart is too slow.
This PR adds two non-restart paths to stop the reconciler.
Solution
1. HTTP kill switch. New endpoint `POST /notebook-intelligence/skills/reconciler/stop`. Authenticated with the standard Jupyter session token. Returns `{"stopped": true, "was_running": }` so callers know whether the call actually interrupted work. Idempotent: a second POST after the reconciler is already stopped returns the same success shape, so admins can script the call across pods without special-casing already-stopped pods.
The endpoint is deliberately not gated by the Skills management policy. Inheriting from `SkillsBaseHandler` would mean the policy gate returns 403 in the exact `force-off` state where the kill switch is supposed to fire. A test pins this contract structurally (`SkillsBaseHandler` not in the MRO) so a future refactor can't silently re-introduce the deadlock.
2. Per-cycle env-var recheck. The reconciler's background loop now reads `NBI_SKILLS_MANAGEMENT_POLICY` at the top of each iteration via a new `_policy_force_off()` helper. If force-off is observed, the loop self-stops and logs. Latency is bounded by the cycle interval (default 24h), so this is the slow path for deployments that don't or can't call the HTTP endpoint. Adds one env-var read per cycle; zero ongoing cost.
A new `SkillReconciler.is_running()` helper supports the new endpoint's `was_running` payload field. `AIServiceManager.handle_stop_request` continues to call `reconciler.stop()` at shutdown; double-stop is safe and the early self-stop leaves `_thread` non-null so the shutdown join is harmless.
Testing
Eight new pytest cases across two files:
`pytest tests/` (730 passing), `jlpm tsc --noEmit`, `jlpm lint:check`, `jlpm jest` (154 passing). All four green.
Six-agent review applied: added the MRO contract test, added the no-live-restart and per-tenant trust notes to the admin guide, added a clarifying comment at the route-registration site.
Risks / follow-ups