security: add runtime kill switch for the managed-skills reconciler by pjdoland · Pull Request #291 · plmbr/notebook-intelligence

pjdoland · 2026-05-18T01:45:24Z

Summary

The admin guide currently documents that setting `skills_management_policy = force-off` (or `NBI_SKILLS_MANAGEMENT_POLICY=force-off`) will "suppress the managed-skills reconciler," but also concedes the suppression "takes effect on JupyterLab server restart" and that a live config-reload won't stop a reconciler that's already running. For an incident-response kill switch (compromised manifest URL, leaked managed-skills token), needing a server restart is too slow.

This PR adds two non-restart paths to stop the reconciler.

Solution

1. HTTP kill switch. New endpoint `POST /notebook-intelligence/skills/reconciler/stop`. Authenticated with the standard Jupyter session token. Returns `{"stopped": true, "was_running": }` so callers know whether the call actually interrupted work. Idempotent: a second POST after the reconciler is already stopped returns the same success shape, so admins can script the call across pods without special-casing already-stopped pods.

The endpoint is deliberately not gated by the Skills management policy. Inheriting from `SkillsBaseHandler` would mean the policy gate returns 403 in the exact `force-off` state where the kill switch is supposed to fire. A test pins this contract structurally (`SkillsBaseHandler` not in the MRO) so a future refactor can't silently re-introduce the deadlock.

2. Per-cycle env-var recheck. The reconciler's background loop now reads `NBI_SKILLS_MANAGEMENT_POLICY` at the top of each iteration via a new `_policy_force_off()` helper. If force-off is observed, the loop self-stops and logs. Latency is bounded by the cycle interval (default 24h), so this is the slow path for deployments that don't or can't call the HTTP endpoint. Adds one env-var read per cycle; zero ongoing cost.

A new `SkillReconciler.is_running()` helper supports the new endpoint's `was_running` payload field. `AIServiceManager.handle_stop_request` continues to call `reconciler.stop()` at shutdown; double-stop is safe and the early self-stop leaves `_thread` non-null so the shutdown join is harmless.

Testing

Eight new pytest cases across two files:

`TestPolicyForceOffStop`: force-off at start prevents any reconcile pass; force-off mid-loop stops before the next pass; user-choice leaves the loop running; case-insensitive policy parsing including the `force_off`-vs-`force-off` underscore-vs-dash typo guard.
`TestBackgroundThread::test_is_running_reflects_thread_state`: `is_running()` returns False before start, True during run, False after stop.
`TestSkillsReconcilerStopHandler`: idempotent in both running and stopped states, returns the correct `was_running` value, and a structural MRO assertion pinning the deliberate gate-bypass.

`pytest tests/` (730 passing), `jlpm tsc --noEmit`, `jlpm lint:check`, `jlpm jest` (154 passing). All four green.

Six-agent review applied: added the MRO contract test, added the no-live-restart and per-tenant trust notes to the admin guide, added a clarifying comment at the route-registration site.

Risks / follow-ups

No live restart of the reconciler. Once stopped, only a server bounce re-enables it in the same pod. Intentional: a kill switch another script could flip back on isn't a kill switch. Documented as a sharp edge.
Per-tenant trust posture. The endpoint authenticates with the user's Jupyter session token, not an admin claim. In hub deployments where the pod owner is not the policy admin, a tenant can stop their own pod's reconciler. The reconciler is per-pod, so this only affects that user's managed-skills delivery. Documented.
Capabilities surfacing. The Skills panel does not currently show a "reconciler stopped" status indicator. Worth a follow-up so users can verify the kill switch worked from the UI, but not load-bearing for this PR.

The admin guide previously documented that setting the Skills management policy to force-off would "suppress the managed-skills reconciler," but also conceded that a JupyterLab server restart was required for it to take effect on an already-running pod. For an incident-response kill switch (compromised manifest URL, leaked managed_skills_token), needing a restart is too slow. This adds two non-restart paths to stop the reconciler: 1. POST /notebook-intelligence/skills/reconciler/stop. Authenticated via the standard Jupyter session token. Idempotent: returns {"stopped": true, "was_running": <bool>} regardless of whether a reconciler was running, so admins can fan the call out across pods from a script without special-casing already-stopped pods. Deliberately NOT gated by SkillsBaseHandler's policy attribute. The kill switch must remain reachable in the exact force-off state it exists to support. 2. Per-cycle re-read of NBI_SKILLS_MANAGEMENT_POLICY at the top of the reconciler's _run_loop. If force-off is observed, the loop self-stops. Latency is bounded by the cycle interval (default 24h), so this is the slow path; the HTTP endpoint is the fast path. Adds SkillReconciler.is_running() so the new endpoint can report whether the call actually stopped a thread. Existing shutdown path still calls reconciler.stop(); double-stop is safe. Tests cover is_running() reflecting thread state, force-off at start preventing any reconcile pass, force-off mid-loop stopping before the next pass, user-choice leaving the loop running, case-insensitive policy parsing, the HTTP endpoint's idempotent behavior in both running and stopped states, and a structural assertion that the stop handler is not in SkillsBaseHandler's MRO (pins the deliberate gate-bypass contract). Admin guide updated with the new endpoint, the per-cycle behavior, the no-live-restart sharp edge, and a per-tenant trust posture note for JupyterHub deployments.

Promotes the [Unreleased] CHANGELOG snapshot to [5.0.0] - 2026-05-22 and expands it to cover everything merged into upstream/main after PR plmbr#287's docs refresh. Bumps package.json to 5.0.0. CHANGELOG additions cover the post-plmbr#287 surface: - Settings tabs: plugin marketplace picker (plmbr#284), plugin marketplace details + Update button (plmbr#303), per-workspace MCP disable (plmbr#286), JSON-paste path in Add MCP server (plmbr#285). - Launchers: hide-with-policy (plmbr#288), brand icons for Codex / opencode (plmbr#325, plmbr#333), per-launch directory picker (plmbr#332). - Chat sidebar and agentic UX: workspace @-mention in Claude mode (plmbr#327), reload-open-files-on-disk (plmbr#330), steered system prompt away from over-eager notebook creation (plmbr#336). - Skills: multi-manifest support (plmbr#321), tracks-upstream for user- imported skills (plmbr#322), HTTP kill switch for the reconciler (plmbr#291). - Accessibility: full sub-section covering plmbr#305-plmbr#320. - Security: shell-tool sandbox (plmbr#290), Claude UI-bridge sandbox (plmbr#323), 0o600 on encrypted token (plmbr#293), env-secret scrubbing (plmbr#295), MCP config shape validation (plmbr#299), XSS allowlist (plmbr#296), Copilot WS auth + origin (plmbr#301), GHE host detection (plmbr#292), fastmcp -> mcp SDK swap (plmbr#324). - Fixed: session listing unification (plmbr#310), session preview unwrap (plmbr#331), down-area runtime throw (plmbr#330 follow-up), WS message-handler leak (plmbr#294). - Removed: fastmcp dependency, history.jsonl session gate. Adds a Migration note covering the five behavior changes operators should review before upgrading from 4.x: fastmcp swap, path sandboxes, history.jsonl gate removal, workspace @-mention pointer shape, and the Copilot WebSocket auth/origin tightening. Two reviewer rounds (six personas each) applied: - Round 1 caught security overclaims (plmbr#293, plmbr#299, plmbr#323), the plmbr#284/plmbr#303 mis-attribution, missing migration note, 3 em dashes, and the stale `fastmcp==2.x.*` recommendation in the admin guide. - Round 2 caught the missing plmbr#301 migration bullet, missing version- matrix 5.0.x row, missing README TOC entry, and a couple of style nits (sub-heading overpromise, orphan bullet). Skipped (deferred to future PRs): - README first-run tour mention. - Admin guide HTTP kill-switch row in Failure-modes table. - Terminal drag-drop trust-model precision update after plmbr#327. - Cipher description nit in plmbr#293 (Fernet AES-128-CBC+HMAC, not AES-GCM).

pjdoland added the bug Something isn't working label May 18, 2026

mbektas approved these changes May 18, 2026

View reviewed changes

mbektas merged commit 9e0e3b1 into plmbr:main May 18, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

security: add runtime kill switch for the managed-skills reconciler#291

security: add runtime kill switch for the managed-skills reconciler#291
mbektas merged 1 commit into
plmbr:mainfrom
pjdoland:fix/skills-reconciler-force-off

pjdoland commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pjdoland commented May 18, 2026

Summary

Solution

Testing

Risks / follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants