
feat: unblock SP calls for UC in production#310

Draft
atilafassina wants to merge 15 commits into main from sp-files

Conversation

@atilafassina
Contributor

No description provided.

atilafassina and others added 15 commits April 23, 2026 15:03
Phase 1 of 2 — remove pre-policy 401 on missing x-forwarded-user; let the
volume policy decide via { id: <sp-id>, isServicePrincipal: true }.
asUser(req) keeps strict throw semantics; single logger.debug on fallback
replaces the dev-mode logger.warn. Startup no-explicit-policy warning
broadened to mention header-less HTTP. Tests rewritten to codify new
contract. Two pre-existing auto-generated files reformatted to match biome.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Phase 2 of 2 — non-behavioral polish. JSDoc on FilePolicyUser.id /
isServicePrincipal broadened to describe header-less HTTP as a valid
SP call origin. VolumeHandle JSDoc notes asUser(req) throws
AuthenticationError.missingToken regardless of NODE_ENV. Files-plugin
docs paragraphs that implied x-forwarded-user was mandatory have
been reworded. Auto-regenerated typedoc for FilePolicyUser included.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Adds optional `auth: "service-principal" | "on-behalf-of-user"` to both
VolumeConfig and IFilesConfig. Resolution order:
volume.auth ?? plugin.auth ?? "service-principal".
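
The resolution order above can be sketched as a single nullish-coalescing chain. This is a minimal illustration, not the plugin's code: `FilesConfig`, its `volumes` map, and `resolveAuth` are hypothetical names; only the `auth` field, its two values, and the fallback order come from the commit message.

```typescript
// Hypothetical config shapes; only `auth` and its two values are from the commit.
type AuthMode = "service-principal" | "on-behalf-of-user";

interface VolumeConfig {
  auth?: AuthMode;
}

interface FilesConfig {
  auth?: AuthMode;
  volumes: Record<string, VolumeConfig>;
}

// Volume-level setting wins, then plugin-level, then the SP default.
function resolveAuth(config: FilesConfig, volumeKey: string): AuthMode {
  return config.volumes[volumeKey]?.auth ?? config.auth ?? "service-principal";
}

const config: FilesConfig = {
  auth: "on-behalf-of-user",
  volumes: {
    reports: {}, // inherits the plugin-level mode
    exports: { auth: "service-principal" }, // overrides it
  },
};
```

With this config, `resolveAuth(config, "reports")` falls through to the plugin-level mode, while `resolveAuth(config, "exports")` uses the volume's own setting.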

Removes the undocumented `bypassPolicy` parameter from createVolumeAPI
(zero callsites in packages/ or apps/, so no consumers to migrate).

Phase 1 of files-per-volume-auth-mode — config surface only, no routing
or SDK identity change yet. _resolveAuth is unused at this point and
will be wired in Phase 2.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Adds `_extractObiUser` to the files plugin and wires identity selection
into `_enforcePolicy` so it branches on `_resolveAuth(volumeKey)`:
- service-principal volumes: existing inline extraction (unchanged)
- on-behalf-of-user volumes: require x-forwarded-access-token + user
  headers; 401 in production when missing, dev-fallback to SP with a
  single warn in development
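
A rough sketch of the branching described above, under stated assumptions: `selectPolicyUser`, the `Identity` result type, and the plain header record are hypothetical stand-ins for `_enforcePolicy` and `_extractObiUser`, whose real signatures are not shown here. The sketch also warns on every fallback, whereas the commit describes a single warn.

```typescript
type AuthMode = "service-principal" | "on-behalf-of-user";

interface PolicyUser {
  id: string;
  isServicePrincipal?: boolean;
}

type Identity = { ok: true; user: PolicyUser } | { ok: false; status: 401 };

// Hypothetical helper: picks the identity the volume policy will see.
function selectPolicyUser(
  mode: AuthMode,
  headers: Record<string, string | undefined>,
  spId: string,
  nodeEnv: string,
): Identity {
  const userId = headers["x-forwarded-user"];
  const token = headers["x-forwarded-access-token"];
  const servicePrincipal: PolicyUser = { id: spId, isServicePrincipal: true };

  if (mode === "service-principal") {
    // Header-less calls are treated as SP calls; the policy decides.
    return { ok: true, user: userId ? { id: userId } : servicePrincipal };
  }
  if (userId && token) {
    return { ok: true, user: { id: userId } };
  }
  if (nodeEnv === "development") {
    console.warn("OBO headers missing; falling back to the service principal");
    return { ok: true, user: servicePrincipal };
  }
  return { ok: false, status: 401 };
}
```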

Only the policy-user identity changes here. UC SDK calls still execute
as the service principal until Phase 3 wires `runInUserContext`.

Phase 2 of files-per-volume-auth-mode.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Wires the seven read handlers (list, read, download, raw, exists,
metadata, preview) through a new `_runWithAuth(req, volumeKey, fn)`
helper that:
- runs `fn` directly on SP volumes — byte-for-byte identical to today
- wraps `fn` in `runInUserContext` on OBO volumes when both x-forwarded
  headers are present, so the SDK call and `getCurrentUserId()` resolve
  to the end user's identity. Cache keys use `getCurrentUserId()`, so
  per-user cache isolation falls out for free.
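
To illustrate why per-user cache isolation follows from the wrap, here is a small sketch that assumes the user context is carried in Node's `AsyncLocalStorage`. The helper bodies, the `SP_ID` constant, and the cache-key shape are assumptions for illustration; only the names `runInUserContext` and `getCurrentUserId` appear in the commit.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface UserContext {
  userId: string;
  token: string;
}

const SP_ID = "sp-123"; // hypothetical service-principal id
const userStore = new AsyncLocalStorage<UserContext>();

function runInUserContext<T>(ctx: UserContext, fn: () => T): T {
  return userStore.run(ctx, fn);
}

function getCurrentUserId(): string {
  return userStore.getStore()?.userId ?? SP_ID;
}

// SP volumes pass null and run fn directly; OBO volumes pass a user context.
function runWithAuth<T>(userCtx: UserContext | null, fn: () => T): T {
  return userCtx ? runInUserContext(userCtx, fn) : fn();
}

// Hypothetical key shape: the current user id is part of the key.
function listCacheKey(volumeKey: string, path: string): string {
  return `${getCurrentUserId()}:${volumeKey}:${path}`;
}
```

Because the key is computed inside the wrap, two users listing the same path produce different keys, and an unwrapped SP call produces a third.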

Policy still gates first; `_enforcePolicy` already 401s in production
when OBO headers are missing, so the dev-fallback path inside
`_runWithAuth` is only reachable under `NODE_ENV === "development"`.

Also rewrites the cache invalidation comment and the `VolumeAPI` /
`VolumeHandle` / `exports()` JSDoc to describe both modes (previously
all three claimed "operations execute as the service principal").

Phase 3 of files-per-volume-auth-mode — first end-to-end demoable
slice. Write routes (Phase 4) and `VolumeHandle.asUser` SDK identity
(Phase 5) are not yet wired.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Wires `_handleUpload`, `_handleMkdir`, and `_handleDelete` through
`_runWithAuth` so write traffic on OBO volumes executes as the end
user. SP volumes are byte-for-byte identical (no-op wrap fall-through).

The upload handler does a hand-rolled `fetch PUT` and calls
`client.config.authenticate(headers)` to populate auth headers — the
whole body now runs inside the user-context wrap, so the user-token
client is what authenticates the outgoing request.

Adds the non-negotiable upload-headers contract test: triggers an OBO
upload and asserts the outgoing fetch carries `Authorization: Bearer
USER-TOKEN-FOO` (not the SP's marker), proving the chain
runInUserContext → getWorkspaceClient → client.config.authenticate →
fetch headers is intact end-to-end. Verified the test fails meaningfully
if the wrap is removed.

Phase 4 of files-per-volume-auth-mode. Phase 5 (`VolumeHandle.asUser`
SDK identity fix) is next.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
`appKit.files("vol").asUser(req).<op>()` previously only swapped the
user identity passed to the volume policy — the underlying SDK call
still ran as the service principal. After this change, every method on
the returned handle is wrapped in `runInUserContext` with the request's
user identity, so the UC SDK call genuinely executes as the end user.
This is a hard override at the SDK level: it applies even on
`auth: "service-principal"` volumes, where it was previously a no-op
for SDK identity.

Implementation: a new `_wrapVolumeAPIInUserContext` helper wraps each
method in the returned `VolumeAPI`. Reuses `_buildUserContextOrNull` so
the dev-mode fallback (`NODE_ENV === "development"` + missing token)
skips the wrap and continues to execute as the SP — matching pre-OBO
behavior locally without a reverse proxy.
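
The wrapping step could look roughly like the sketch below. It is not the plugin's `_wrapVolumeAPIInUserContext`: the `Api` type, the `currentUser` store, and the use of `null` to signal the dev-mode fallback are assumptions made for this example.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical ambient store holding the current user id.
const currentUser = new AsyncLocalStorage<string>();

type Api = Record<string, (...args: any[]) => unknown>;

// Wraps every method so it runs with userId as the ambient identity.
// A null userId models the dev-mode fallback: the API is returned unwrapped.
function wrapInUserContext<T extends Api>(api: T, userId: string | null): T {
  if (userId === null) return api;
  const wrapped: Api = {};
  for (const [name, method] of Object.entries(api)) {
    wrapped[name] = (...args: unknown[]) =>
      currentUser.run(userId, () => method(...args));
  }
  return wrapped as T;
}

// Toy API standing in for VolumeAPI: reports who the call runs as.
const volumeApi = {
  whoami: (): string => currentUser.getStore() ?? "service-principal",
};
```

Calling `wrapInUserContext(volumeApi, "alice").whoami()` resolves the identity inside the wrap, while the unwrapped fallback keeps reporting the service principal.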

Strict `_extractUser` is unchanged — `asUser` still throws
`AuthenticationError.missingToken` in production when the user header
is missing.

Programmatic OBO note: `appKit.files("obo-vol").<op>()` (no req, no
asUser) cannot synthesize a user identity and continues to execute
against whatever client `getWorkspaceClient()` resolves to at the call
site (typically the SP at the top level). The OBO volume default
applies to HTTP route traffic via `_runWithAuth`. For programmatic
per-user execution, `asUser(req)` is the supported path.

Phase 5 of files-per-volume-auth-mode.

BREAKING CHANGE: programmatic callers of
appKit.files("vol").asUser(req).<op>() that expected the SDK call to
execute as the service principal (with only the policy seeing the
user) now see the SDK call execute as the user. Audit programmatic
asUser callers and remove the asUser wrap if SP credentials were the
desired behavior.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
…phase 6)

Tags every Files plugin trace span with `files.auth_mode` set to either
`"service-principal"` or `"on-behalf-of-user"`, reflecting what
operationally happened (whether the SDK call ran inside
`runInUserContext`). Two complementary plumbing paths land on the same
attribute key:

- HTTP routes: route handlers compute the effective mode via a new
  `_effectiveAuthMode(req, volumeKey)` helper and thread it into the
  existing `PluginExecutionSettings.telemetryInterceptor.attributes`
  shape — `TelemetryInterceptor` already passes those attributes to
  `tracer.startActiveSpan`, so the attribute lands on the existing
  `plugin.execute` span without new infrastructure.
- Programmatic API (`exports()` SP path + `asUser` OBO path): wraps each
  VolumeAPI method in a new `_withAuthModeSpan(operation, mode, fn)`
  helper that calls `this.telemetry.startActiveSpan("files.<op>", ...)`.
  Programmatic calls previously created NO span, so this introduces new
  `files.<op>` spans on programmatic surface — a small observability
  enrichment beyond pure attribute plumbing. Spans respect the plugin's
  `traces` config; the noop tracer takes over when traces are disabled.

Updates `getResourceRequirements()` JSDoc to document the per-volume
permission split: SP volumes need the SP to hold `WRITE_VOLUME`; OBO
volumes need the end user (not the SP) to hold it. Adds a
`// TODO: extend plugin-manifest.schema.json` marker for the eventual
schema-level fix.

Phase 6 of files-per-volume-auth-mode. Phase 7 (docs, playground,
changelog) is next.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Final polish phase for files-per-volume-auth-mode:

JSDoc:
- Expand `VolumeConfig.auth` and `IFilesConfig.auth` with SP and OBO
  example blocks plus resolution-order notes.
- Extend `FilePolicyUser.isServicePrincipal` JSDoc with the full
  six-row matrix covering SP/OBO x HTTP/programmatic x header
  presence, including the `asUser(req)` rows.

Docs (`docs/docs/plugins/files.md`):
- New "Auth modes" section: two modes, resolution order, per-mode
  permission requirements, prod/dev OBO behavior, manifest-scope
  limitation, side-by-side config examples.
- New `usersOnly` policy example using `isServicePrincipal: false`
  and a "Policy user matrix" subsection.
- Rewrites every "all operations execute as the service principal"
  paragraph to describe both modes.
- Dedicated `asUser(req)` subsection with the honest limitation that
  programmatic OBO defaults don't apply without `asUser`.

Dev-playground:
- Server: new `obo_demo` volume with `auth: "on-behalf-of-user"` and
  a `usersOnly` policy; new smoke route `GET /policy/obo-volume`
  using `asUser(req)`. `app.yaml` gets the matching
  `DATABRICKS_VOLUME_OBO_DEMO` resource binding.
- Client: `policy-matrix.route.tsx` gets a "Per-volume OBO mode"
  section with two probes (HTTP route + programmatic smoke) so the
  end-to-end OBO path is exercisable from the UI.

Changelog:
- New `## Unreleased` section in `CHANGELOG.md` (lives above the
  release-it generated `[0.24.0]` block) covering the per-volume auth
  feature, the `files.auth_mode` span, the `asUser` SDK identity
  behavior change, the `bypassPolicy` removal, and the honest
  limitation about programmatic OBO without `asUser`.

Generated:
- `types.generated.ts` and `plugin-manifest.generated.ts` contain
  formatting-only changes (line-collapsing) from `pnpm build`'s
  generator step. Content unchanged.

Backpressure: pnpm build, pnpm docs:build, pnpm check:fix,
pnpm -r typecheck, pnpm test (1666 passing) all clean.

Phase 7 of files-per-volume-auth-mode — feature is shippable.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
…che, spans

Addresses four findings from the multi-model review of the files OBO
feature.

1. asUser privilege confusion in production (CRITICAL, security)
   `_extractUser` now requires BOTH `x-forwarded-user` AND
   `x-forwarded-access-token` in production — throws
   `AuthenticationError.missingToken` when either is missing. Previously
   a request with the user header but no token returned a SP-wrapped
   API where the policy saw the request as a real end user
   (isServicePrincipal: undefined) but the SDK ran with SP credentials,
   so policies like `usersOnly: !user.isServicePrincipal` were
   satisfiable while wielding the SP's broader UC grants. CWE-639.
   Dev fallback now marks the returned policy user as
   `isServicePrincipal: true` and warns (was debug) so policies can't
   be tricked even in dev.
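
   A sketch of the stricter check, assuming a plain header record. `extractUserStrict` and `MissingTokenError` are hypothetical stand-ins for `_extractUser` and `AuthenticationError.missingToken`; the `usersOnly` predicate mirrors the policy quoted above.

```typescript
interface PolicyUser {
  id: string;
  isServicePrincipal?: boolean;
}

// Hypothetical stand-in for AuthenticationError.missingToken.
class MissingTokenError extends Error {}

// The policy from the finding: passes for anyone not flagged as the SP.
const usersOnly = (user: PolicyUser): boolean => !user.isServicePrincipal;

function extractUserStrict(
  headers: Record<string, string | undefined>,
  spId: string,
  nodeEnv: string,
): PolicyUser {
  const id = headers["x-forwarded-user"];
  const token = headers["x-forwarded-access-token"];
  if (id && token) return { id };
  if (nodeEnv === "development") {
    // Dev fallback is explicitly marked as the SP so usersOnly rejects it.
    console.warn("missing user header or token; running as the service principal");
    return { id: spId, isServicePrincipal: true };
  }
  throw new MissingTokenError("x-forwarded-user and x-forwarded-access-token are both required");
}
```

   Under the old behavior, a user header without a token produced `{ id: "alice" }`, which passes `usersOnly` even though the SDK call would have run with SP credentials.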

2. Double WorkspaceClient allocation per OBO request (HIGH, perf)
   New `_resolveAuthForRequest(req, volumeKey)` returns
   { mode, userCtx } and builds the UserContext at most once per
   request. `_runWithAuth(userCtx, fn)` now takes the pre-built ctx.
   Removes `_effectiveAuthMode` (no longer needed). Every OBO HTTP
   request previously constructed two WorkspaceClients; cache hits
   paid that cost too. Now: one allocation, even on cache hits.

3. Write-cache invalidation on OBO + wrong path arg (HIGH, correctness)
   Two related bugs at `_invalidateListCache`: (a) cache keys included
   `getCurrentUserId()`, so user A's write only busted user A's list
   cache — user B continued to see stale listings; (b) the `path` arg
   was the file path instead of the parent directory.

   Fix: list-cache is disabled on OBO volumes (no cross-user staleness
   possible). On SP volumes, invalidation now derives the parent
   directory via `parentDirectory()`, falling back to the `"__root__"`
   sentinel for root-level writes — matches `_handleList`'s key shape.
   Cache delete + path resolution are wrapped in best-effort try/catch
   so an invalidation failure cannot convert a successful write into
   HTTP 500.
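
   The SP-volume fix can be sketched as follows. The key shape, the `Map`-backed cache, and this `parentDirectory` body are assumptions; only the `"__root__"` sentinel, the parent-directory derivation, and the best-effort try/catch come from the commit.

```typescript
const ROOT_SENTINEL = "__root__";

// Hypothetical implementation: returns "" for root-level paths.
function parentDirectory(filePath: string): string {
  const trimmed = filePath.replace(/\/+$/, "");
  const idx = trimmed.lastIndexOf("/");
  return idx <= 0 ? "" : trimmed.slice(0, idx);
}

// Assumed key shape; the point is that it is keyed by the parent directory.
function listCacheKey(volumeKey: string, filePath: string): string {
  const parent = parentDirectory(filePath) || ROOT_SENTINEL;
  return `list:${volumeKey}:${parent}`;
}

// Best effort: a cache failure is logged, never rethrown into the write path.
function invalidateListCache(
  cache: Map<string, unknown>,
  volumeKey: string,
  filePath: string,
): void {
  try {
    cache.delete(listCacheKey(volumeKey, filePath));
  } catch (err) {
    console.warn("list-cache invalidation failed", err);
  }
}
```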

4. Duplicate `files.<op>` spans on programmatic calls (HIGH, perf)
   `FilesConnector.<op>` already opens a `files.<op>` span via its
   `traced()` decorator. The Phase 6 `_withAuthModeSpan` opened an
   identical span on top, doubling span allocation/export volume on
   every programmatic call.

   Fix: new `runWithFilesSpanAttributes(attrs, fn)` exported from
   `connectors/files/client.ts` uses AsyncLocalStorage to make ambient
   attributes available to the connector's `traced()` decorator.
   `_withAuthModeSpan` renamed to `_withAuthModeAttributes` — no longer
   creates a span; just sets `files.auth_mode` on the connector's
   existing span via the ALS channel. Programmatic calls now produce
   exactly one `files.<op>` span. HTTP route attribute path
   (`PluginExecutionSettings.telemetryInterceptor.attributes`) is
   unchanged.
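
   A minimal sketch of the ambient-attribute channel, assuming Node's `AsyncLocalStorage`. Here `traced` is a plain wrapper function and `spans` is an in-memory recorder, both stand-ins for the connector's decorator and a real tracer; only the name `runWithFilesSpanAttributes` and the `files.auth_mode` key come from the commit.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type Attributes = Record<string, string>;

const ambientAttributes = new AsyncLocalStorage<Attributes>();

// Makes attrs visible to any span opened while fn runs.
function runWithFilesSpanAttributes<T>(attrs: Attributes, fn: () => T): T {
  return ambientAttributes.run(attrs, fn);
}

interface RecordedSpan {
  name: string;
  attributes: Attributes;
}

// Stand-in for a tracer: collects the spans that were opened.
const spans: RecordedSpan[] = [];

// Stand-in for the connector's traced() decorator: opens one span and
// copies whatever ambient attributes are present at that moment.
function traced<T>(name: string, fn: () => T): T {
  spans.push({ name, attributes: { ...(ambientAttributes.getStore() ?? {}) } });
  return fn();
}
```

   The caller sets the attribute without opening a span of its own, so each operation still yields exactly one span.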

Tests: 8 new regression guards covering each fix's failure mode (1674
total, was 1666). The pre-existing `DownloadResponse` unused-import
warning is also resolved as cleanup fallout.

User-facing notes (for changelog follow-up):
- asUser(req) throws on missing token in production (was silent SP
  fallback — the previous behavior was a privilege-confusion bug).
- OBO volume reads are no longer cached. Trade-off: every OBO read
  hits the SDK; in exchange, no cross-user staleness. SP volume reads
  still cache.
- Programmatic appKit.files(vol).<op>() emits one `files.<op>` span
  instead of two. The `files.auth_mode` attribute now lands on the
  connector's existing span.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Two follow-up review fixes on top of 3f98628.

A — _invalidateListCache not awaited (medium, correctness)
   Write handlers (_handleUpload, _handleMkdir, _handleDelete) called
   _invalidateListCache without `await`, so res.json() shipped before
   cache invalidation completed. A client that issued a follow-up
   GET /list in the same tick could hit the stale cache.

   Fix: _invalidateListCache is now genuinely `async` and all three
   call sites `await` it before sending the response. Best-effort
   try/catch around cache.delete and connector.resolvePath remains, so
   an invalidation failure cannot convert a successful write into 500.

   Regression test installs a deferred-promise cache.delete and
   asserts res.json is NOT called until the delete settles. Verified
   the test fails when the `await` is removed and passes when it is
   added back.

B — hardcoded ports in integration tests (medium, CI hygiene)
   plugin.integration.test.ts bound to fixed offsets of TEST_PORT
   (9880, 9881, 9882). Concurrent CI runs or stale test processes
   risked EADDRINUSE flakiness.

   Fix: bind `port: 0` so the OS assigns ephemeral ports. New
   getListeningPort() helper waits for the server's `listening` event
   and returns the assigned port for building localBase URLs. Chose
   port-0 over supertest because the tests build their own AppKit
   instance, hold the Server for afterAll cleanup, and use real fetch
   calls — moving to supertest would have required restructuring the
   fixture lifecycle.
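
   The port-0 approach can be sketched with Node's built-in `http` server. This `getListeningPort` body is an assumption about what such a helper might look like, and `demo` is a hypothetical usage; the real fixture builds an AppKit instance instead.

```typescript
import { createServer } from "node:http";
import type { Server } from "node:http";
import { once } from "node:events";

// Waits for the server to be listening, then returns the OS-assigned port.
async function getListeningPort(server: Server): Promise<number> {
  if (!server.listening) {
    await once(server, "listening");
  }
  const address = server.address();
  if (address === null || typeof address === "string") {
    throw new Error("server is not bound to a TCP port");
  }
  return address.port;
}

// Hypothetical usage: bind to port 0 and read back the ephemeral port.
async function demo(): Promise<number> {
  const server = createServer((_req, res) => {
    res.end("ok");
  });
  server.listen(0, "127.0.0.1");
  try {
    return await getListeningPort(server);
  } finally {
    server.close();
  }
}
```

   A test would then build its base URL from the returned port, for example `http://127.0.0.1:${port}`.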

Tests: 1675 (was 1674; 1 new). typecheck + biome clean.

Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Signed-off-by: Atila Fassina <atila@fassina.eu>

# Conflicts:
#	CHANGELOG.md
#	apps/dev-playground/server/index.ts
- _readPathQuery helper rejects array/object query params with 400 across
  all 7 path-bearing handlers (was: req.query.path raw-cast)
- /read streams via connector.download + size-enforcing TransformStream
  capped at maxReadSize (new VolumeConfig field, default 10MB); /read no
  longer participates in the read-tier cache
- Upload TransformStream enforces bytesReceived <= contentLength when
  declared, closing a per-user policy bypass where small Content-Length
  + larger body would exceed an approved upload size
- _enforcePolicy 401 and _handleApiError 4xx now return generic public
  bodies (Unauthorized / standard HTTP STATUS_CODES); raw error.message
  goes to server-side logs only (CWE-209)
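
Two of the checks above can be sketched in isolation. Both helpers are hypothetical: `readPathQuery` stands in for `_readPathQuery` (treating a missing `path` as allowed is an assumption), and `createByteLimiter` shows only the per-chunk accounting that a size-enforcing `TransformStream` could run, not the stream wiring itself.

```typescript
type PathResult =
  | { ok: true; path: string | undefined }
  | { ok: false; status: 400 };

// Accepts only a plain string (or a missing value); arrays and objects,
// which query parsers can produce for ?path[]=a or ?path[x]=a, are rejected.
function readPathQuery(query: Record<string, unknown>): PathResult {
  const raw = query["path"];
  if (raw === undefined) return { ok: true, path: undefined };
  if (typeof raw !== "string") return { ok: false, status: 400 };
  return { ok: true, path: raw };
}

// Returns a function to call once per chunk; it throws as soon as the
// running total exceeds the limit (maxReadSize or the declared Content-Length).
function createByteLimiter(limit: number): (chunk: Uint8Array) => void {
  let received = 0;
  return (chunk: Uint8Array): void => {
    received += chunk.byteLength;
    if (received > limit) {
      throw new RangeError(`body exceeds the ${limit}-byte limit`);
    }
  };
}
```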

Co-authored-by: Isaac

Co-authored-by: Orca <help@stably.ai>
Signed-off-by: Atila Fassina <atila@fassina.eu>