Skip to content

feat(skills): accept comma-separated list of managed-skills manifests#321

Merged
mbektas merged 5 commits into
plmbr:mainfrom
pjdoland:feat/multi-skills-manifest
May 19, 2026
Merged

feat(skills): accept comma-separated list of managed-skills manifests#321
mbektas merged 5 commits into
plmbr:mainfrom
pjdoland:feat/multi-skills-manifest

Conversation

@pjdoland
Copy link
Copy Markdown
Collaborator

Summary

NBI_SKILLS_MANIFEST (and the matching skills_manifest traitlet) now accept a comma-separated list of YAML/JSON manifests. The reconciler loads each independently and unions the entries, with first-wins URL dedupe and per-entry-error name dedupe. Single-manifest deployments behave identically; the wire format is strictly additive.

The use case: a JupyterHub deployment can ship an org-wide baseline manifest plus a per-team manifest in one env var, without baking a single combined file at deploy time.

Solution

Three layers of change, one commit each, plus a follow-up commit applying the convergent review findings.

  • load_manifests(sources, *, token) in notebook_intelligence/skill_manifest.py. Iterates sources, catches ManifestError per source so a transient outage on one manifest doesn't block the others, and merges entries with two passes of dedupe:
    • By URL: same url: in two manifests means the earlier one wins; a warning names both manifests.
    • By installed name: predicted via the same _slug() step the real installer applies (subpath casing, whitespace, capitals all normalize), so two URLs that would install to the same name collide before staging. The second entry drops with a per-entry error.
  • SkillReconciler now takes manifest_sources: List[str] instead of a single string. When any source fails to load, the cycle skips stale-removal so the missing source's skills don't get falsely orphaned. Per-entry installs still run against whatever did load.
  • Wire format resolved through a new util.split_csv helper that both _resolve_csv_appended and the new _resolve_skills_manifest_sources share. The new helper means \" ,,, \" and other all-separator inputs degrade cleanly to "no manifests configured" rather than constructing phantom sources.

Testing

  • tests/test_skill_manifest.py: 13 new cases. split_sources (5), load_manifests (8 covering union, URL dedupe with warning, name collision with explicit names, name collision via slugging, non-GitHub URL warning, partial fetch failure both orderings, URLError isolation, all-fail, order preservation, token-per-source).
  • tests/test_skill_reconciler.py: 7 new cases in a TestMultipleManifests class. Multi-source union, URL dedupe through the reconciler, name collision through the reconciler, partial failure preserving managed skills (both [good, missing] and [missing, good] orderings), all-sources-fail, empty-sources no-op. Plus 15 existing tests adapted from the single-string signature to the new list signature.
  • tests/test_config_integration.py: 7 new cases in TestSkillsManifestResolver pinning the env-vs-traitlet contract end to end, including the all-separator-input edge case.

Full suite green: pytest tests/ --ignore=tests/test_claude_client.py → 908 passed. jlpm tsc --noEmit, jlpm lint:check, jlpm jest all clean.

Risks and follow-ups

  • Permanent partial-failure means stale skills accumulate. A manifest source that fails forever pins stale-removal off indefinitely. Documented as accepted behavior in docs/skills.md; a follow-up could add a "force stale removal after K consecutive partial-failure cycles" escape hatch.
  • Stop-signal latency. SkillReconciler.stop() can block up to N * FETCH_TIMEOUT_SECONDS if the first manifest fetch hangs. Pre-existing single-manifest behavior; scales linearly with N. Follow-up could thread the stop event through load_manifests so the loop bails between sources.
  • Per-source ownership of installed skills. Today Skill.managed_source records the entry URL, not the owning manifest URL. With per-source ownership the reconciler could remove only the skills owned by the loaded manifests on partial failure, instead of skipping removal entirely. That's a schema change in serialize_skill_md and warrants its own PR.
  • Token fanout. NBI_MANAGED_SKILLS_TOKEN is shipped as a Bearer header to every manifest URL in the list. Pre-existing single-manifest behavior; the multi-manifest path multiplies the blast radius of a misconfigured host. Worth a one-line caveat in the docs; deferred to keep this PR scoped to the wire-format change.

The full review pass (six agents: code-reuse, code-quality, efficiency, backend/contract, distributed-systems, test-architecture) produced fix-before-merge items that all landed in the final commit. Remaining items above are follow-ups.

pjdoland added 4 commits May 18, 2026 20:57
Adds `load_manifests` that takes a list of manifest URLs/paths, loads
each independently, and merges the entries with dedupe-by-URL
(first-wins + warning) and dedupe-by-predicted-name (skip second +
per-entry error). A `split_sources` helper handles the comma-separated
env-var / traitlet wire format with whitespace tolerance.

The new `MergedManifest` return type carries the deduped entries,
per-source errors, advisory warnings, and the set of sources that
loaded successfully. Reconciler will consume `loaded_sources` to
decide whether stale removal is safe this cycle (a missing source
shouldn't orphan its skills).

`ManifestEntry` gains a `source` field so dedupe diagnostics can name
both the kept and the ignored manifest. Single-manifest callers
continue to work unchanged.
Switches SkillReconciler from a single `manifest_source: str` to
`manifest_sources: List[str]` and routes the per-cycle load through
`load_manifests`. Per-source errors and dedupe warnings surface in
ReconcileResult and the server log.

Partial-failure safety: if any source fails to load, the reconciler
skips stale-removal for that cycle so a transient outage on one
manifest can't orphan the skills it owned. Per-entry installs still
run against whatever sources did load. Single-source deployments
behave identically (1/1 sources loaded → stale removal runs).

AIServiceManager now consumes `skills_manifest_sources` (a list)
from its options rather than the single-string `skills_manifest`.
`NBI_SKILLS_MANIFEST` and the matching traitlet now accept either a
single source (URL or filesystem path) or a comma-separated list of
either. Empty input still disables the feature; `,,,` and other
all-separator inputs degrade cleanly to "no manifests configured."

Docs (`admin-guide.md`, `skills.md`) and the handler 409 message
explain the new format. The reconciler endpoint's "no manifest"
response now mentions the multi-source contract so operators don't
hit it and wonder why their second manifest stops loading.
Six findings from the parallel review pass.

- `_predicted_name` now runs the same `_slug()` step the real installer
  uses, so subpath casing/whitespace variants dedupe correctly. Two
  manifests pointing at .../skills/data-eda and .../Data_EDA collide
  before install rather than letting the second managed install
  silently overwrite the first.

- When `_predicted_name` returns None (non-GitHub URL or unparseable
  entry), `load_manifests` now surfaces a warning so operators know
  cross-manifest name dedupe was skipped for that entry. The
  install-time collision check in SkillManager remains the backstop.

- Extracted `util.split_csv` and have both `_resolve_csv_appended`
  (extension.py) and `skill_manifest.split_sources` use it. Single
  source of truth for the comma/strip/drop-empties wire format.

- Extracted `_resolve_skills_manifest_sources(traitlet_value)` so
  the env-var-vs-traitlet contract can be unit-tested without
  spinning up the extension harness. Seven new tests pin env-wins,
  traitlet-fallback, comma splitting, whitespace edges, and the
  all-separator-input disable path.

- Added test coverage for: name collision via slugging (the bug fixed
  above), non-GitHub URL warning, URLError isolation across sources,
  and a reverse-order partial-failure case ([missing, good] alongside
  the existing [good, missing]).

Skipped findings noted as follow-ups: stop-signal latency on hanging
fetches (pre-existing single-manifest behavior), permanent partial-
failure stale accumulation (documented contract), per-source skill
ownership for precise stale handling (schema change), failure-
visibility metric, and the `MergedManifest.all_loaded()` helper.
@pjdoland pjdoland added the enhancement New feature or request label May 19, 2026
Second review pass surfaced a pre-existing thread-safety hole in
SkillReconciler.start() / stop(): two concurrent start() calls can
spawn an untracked daemon thread, and a start() racing with stop()
can orphan threads or flip the wrong stop event. Probability is low
but the admin-reachable kill-switch endpoint
(SkillsReconcilerStopHandler) is documented as an incident-response
tool, so the race is reachable in practice.

Adds a `_lifecycle_lock` around the `_thread` / `_stop` mutations.
start() is a no-op when a thread already exists; stop() captures
the thread reference under the lock then joins outside the lock so a
second stop() doesn't block on a hung join. Two new tests pin the
contract (concurrent-start does-not-spawn-two-threads, and an
empty-manifest-among-loaded test that documents the stale-removal
behavior the test architect flagged as silently destructive).

Also adds a trust-boundary note to docs/skills.md near the
multi-manifest section: NBI_MANAGED_SKILLS_TOKEN is sent to every
URL in the comma-separated list, including non-GitHub hosts. The
no-redirect handler blocks server-side leaks but cannot stop a
typo'd entry from receiving the token. The pre-existing single-
manifest contract had the same property; multi-manifest just
multiplies the surface and operators should know.
@mbektas mbektas merged commit 3f4e7cc into plmbr:main May 19, 2026
4 checks passed
pjdoland added a commit to pjdoland/notebook-intelligence that referenced this pull request May 22, 2026
Promotes the [Unreleased] CHANGELOG snapshot to [5.0.0] - 2026-05-22
and expands it to cover everything merged into upstream/main after
PR plmbr#287's docs refresh. Bumps package.json to 5.0.0.

CHANGELOG additions cover the post-plmbr#287 surface:

- Settings tabs: plugin marketplace picker (plmbr#284), plugin marketplace
  details + Update button (plmbr#303), per-workspace MCP disable (plmbr#286),
  JSON-paste path in Add MCP server (plmbr#285).
- Launchers: hide-with-policy (plmbr#288), brand icons for Codex / opencode
  (plmbr#325, plmbr#333), per-launch directory picker (plmbr#332).
- Chat sidebar and agentic UX: workspace @-mention in Claude mode
  (plmbr#327), reload-open-files-on-disk (plmbr#330), steered system prompt
  away from over-eager notebook creation (plmbr#336).
- Skills: multi-manifest support (plmbr#321), tracks-upstream for user-
  imported skills (plmbr#322), HTTP kill switch for the reconciler (plmbr#291).
- Accessibility: full sub-section covering plmbr#305-plmbr#320.
- Security: shell-tool sandbox (plmbr#290), Claude UI-bridge sandbox (plmbr#323),
  0o600 on encrypted token (plmbr#293), env-secret scrubbing (plmbr#295), MCP
  config shape validation (plmbr#299), XSS allowlist (plmbr#296), Copilot WS
  auth + origin (plmbr#301), GHE host detection (plmbr#292), fastmcp -> mcp SDK
  swap (plmbr#324).
- Fixed: session listing unification (plmbr#310), session preview unwrap
  (plmbr#331), down-area runtime throw (plmbr#330 follow-up), WS message-handler
  leak (plmbr#294).
- Removed: fastmcp dependency, history.jsonl session gate.

Adds a Migration note covering the five behavior changes operators
should review before upgrading from 4.x: fastmcp swap, path
sandboxes, history.jsonl gate removal, workspace @-mention pointer
shape, and the Copilot WebSocket auth/origin tightening.

Two reviewer rounds (six personas each) applied:
- Round 1 caught security overclaims (plmbr#293, plmbr#299, plmbr#323), the
  plmbr#284/plmbr#303 mis-attribution, missing migration note, 3 em dashes,
  and the stale `fastmcp==2.x.*` recommendation in the admin guide.
- Round 2 caught the missing plmbr#301 migration bullet, missing version-
  matrix 5.0.x row, missing README TOC entry, and a couple of style
  nits (sub-heading overpromise, orphan bullet).

Skipped (deferred to future PRs):
- README first-run tour mention.
- Admin guide HTTP kill-switch row in Failure-modes table.
- Terminal drag-drop trust-model precision update after plmbr#327.
- Cipher description nit in plmbr#293 (Fernet AES-128-CBC+HMAC, not
  AES-GCM).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants