
RFC: stash inline Arrow IPC server-side, serve via /arrow-result #320

Draft
jamesbroadhead wants to merge 17 commits into main from proposal/arrow-stream-stash-and-serve

Conversation

@jamesbroadhead
Contributor

Status: DRAFT — for design discussion only, not for immediate merge.

This PR proposes an alternative architecture for the inline-Arrow path landed in #256.

Background

#256 lands ARROW_STREAM + INLINE support by emitting a new { type: "arrow_inline", attachment: <base64> } SSE message and decoding the Arrow IPC bytes on the client. To make that work for non-trivial result sizes, #256 also bumps both the server and client SSE event-size caps from 1 MiB to 8 MiB.

That works, but has two architectural concerns:

  1. SSE was designed for streams of small control messages. Pushing multi-MiB Arrow IPC bytes through SSE inflates the per-event memory ceiling on both server and client.
  2. Proxies and load balancers sometimes have their own SSE event-size limits, independent of our defaults. The 8 MiB bump makes us depend on those tolerating larger events.

What this PR does

Delivers ARROW_STREAM + INLINE responses out-of-band via the existing /arrow-result/:jobId endpoint:

  1. The connector remains unchanged — still passes result.attachment through, still synthesizes empty schemas as base64 IPC.
  2. The analytics route base64-decodes the attachment once, stashes the resulting buffer in InlineArrowStash (bounded LRU + TTL), and emits the same { type: "arrow", statement_id } SSE message that EXTERNAL_LINKS already uses — but with a synthetic id prefixed inline-.
  3. The client makes no distinction. It calls /arrow-result/:jobId for any type: "arrow" message (unchanged from main).
  4. The route handler checks the stash first; if the id has the inline- prefix and is still present, it serves bytes from memory. Otherwise it falls through to the existing warehouse-fetch path for real EXTERNAL_LINKS statement_ids.

The arrow_inline SSE message type, the client-side base64 decode helper, and the SSE buffer-size bumps from #256 all go away. SSE returns to the 1 MiB default and only carries control messages.
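The stash-first dispatch in step 4 can be sketched as follows. This is an illustrative sketch, not the PR's actual handler: `Stash`, `serveArrowResult`, and `fetchFromWarehouse` are stand-in names, and only the prefix-routing logic is taken from the description above.

```typescript
// Hypothetical sketch of the /arrow-result/:jobId dispatch described above.
type Stash = { take(id: string): Uint8Array | undefined };

function serveArrowResult(
  jobId: string,
  stash: Stash,
  fetchFromWarehouse: (statementId: string) => Uint8Array,
): Uint8Array {
  // Stash-first: synthetic ids minted by the analytics route carry an
  // "inline-" prefix; one-shot take() removes the entry on read.
  if (jobId.startsWith("inline-")) {
    const bytes = stash.take(jobId);
    if (bytes) return bytes;
    // TTL already evicted the entry, or it was read once; treat as gone.
    throw new Error(`inline Arrow result ${jobId} expired or already read`);
  }
  // Real EXTERNAL_LINKS statement_id: existing warehouse-fetch path.
  return fetchFromWarehouse(jobId);
}
```

Because the fallthrough is driven purely by the id prefix, the client's single `type: "arrow"` code path needs no knowledge of which disposition produced the result.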

Why this is cleaner

| Aspect | #256 (arrow_inline over SSE) | This PR (stash + /arrow-result) |
| --- | --- | --- |
| SSE event size pressure | 8 MiB per event | <1 KiB per event (just statement_id) |
| Proxy compatibility | Depends on proxy SSE event limits | Standard HTTP — works everywhere |
| Inline result size cap | 8 MiB (matches SSE) | 25 MiB (matches Databricks API hard limit) |
| Client wire-protocol branches | 2 (arrow, arrow_inline) | 1 (arrow for both INLINE and EXTERNAL_LINKS) |
| HTTP/2 streaming, gzip, background-fetch | Not applicable | Works naturally |

Tradeoffs

  • Server-side memory. Each stash entry holds the decoded IPC buffer until the client fetches it. take() is one-shot and removes the entry on read; TTL evicts unread entries. Worst case with default bounds (100 entries × 25 MiB) is ~2.5 GiB peak. Steady state is much smaller because the client fetches within a few seconds.
  • Single-process only. A multi-server deployment would need a shared store (Redis or equivalent) keyed on the synthetic id. The stash interface is intentionally small (put / take / size) to make this swap straightforward.
  • One-shot reads. Clients cannot retry a failed /arrow-result fetch — by the time they retry, the stash entry is gone. Acceptable for our case (the warehouse query returned data; the hook either succeeds or surfaces a generic error).
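A minimal sketch of the `put` / `take` / `size` surface described above, under stated assumptions: the class body, default bounds, and the injectable clock are illustrative, not the PR's implementation. Eviction here is oldest-insertion-first, which approximates LRU given that reads are one-shot.

```typescript
// Illustrative bounded stash with TTL and one-shot reads.
class InlineArrowStash {
  private entries = new Map<string, { bytes: Uint8Array; expiresAt: number }>();

  constructor(
    private maxEntries = 100,
    private ttlMs = 30_000,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  put(id: string, bytes: Uint8Array): void {
    // Map iteration order is insertion order, so the first key is the oldest.
    if (this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(id, { bytes, expiresAt: this.now() + this.ttlMs });
  }

  /** One-shot read: the entry is removed whether or not it has expired. */
  take(id: string): Uint8Array | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    this.entries.delete(id);
    return entry.expiresAt > this.now() ? entry.bytes : undefined;
  }

  size(): number {
    return this.entries.size;
  }
}
```

Keeping the interface this small is what makes a later swap to a shared store (Redis or equivalent) straightforward.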

Implementation surface vs #256

+1 file:  packages/appkit/src/plugins/analytics/inline-arrow-stash.ts
+1 file:  packages/appkit/src/plugins/analytics/tests/inline-arrow-stash.test.ts (10 tests)
~1 file:  packages/appkit/src/plugins/analytics/analytics.ts (route handler + stash wiring)
~1 file:  packages/appkit-ui/src/react/hooks/use-analytics-query.ts (drop arrow_inline branch)
~3 files: SSE/connector caps reverted to pre-bump values

10 net files, +296/−125 vs #256.

Test status

  • 1862/1862 tests pass
  • Typecheck clean
  • 10 new stash unit tests (LRU eviction, TTL, one-shot reads, prefix routing, UUID uniqueness)

What I want to know

  1. Is the server-side memory profile acceptable for the SDK's expected scale? (How many concurrent SSE clients does an AppKit deployment realistically have?)
  2. Is single-process state a hard blocker, or is "single replica is fine, multi-replica needs Redis" an acceptable starting position?
  3. Should the stash bounds (maxEntries, ttlMs) be configurable through the analytics plugin config?

If we like the direction, I'll cherry-pick this on top of #256 once it lands; otherwise we keep #256's 5b approach.

This pull request was AI-assisted by Isaac.

Some serverless warehouses only support ARROW_STREAM with INLINE
disposition, but the analytics plugin only offered JSON_ARRAY (INLINE)
and ARROW_STREAM (EXTERNAL_LINKS). This adds a new "ARROW_STREAM"
format option that uses INLINE disposition, making the plugin
compatible with these warehouses.

Fixes #242
Tests verify:
- ARROW_STREAM format passes INLINE disposition + ARROW_STREAM format
- ARROW format passes EXTERNAL_LINKS disposition + ARROW_STREAM format
- Default JSON format does not pass disposition or format overrides
The server-side ARROW_STREAM format added in the previous commit was
not exposed to the frontend or typegen:

- Add "ARROW_STREAM" to AnalyticsFormat in appkit-ui hooks
- Add "arrow_stream" to DataFormat in chart types
- Handle "arrow_stream" in useChartData's resolveFormat()
- Make typegen resilient to ARROW_STREAM-only warehouses by
  retrying DESCRIBE QUERY without format when JSON_ARRAY is rejected

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…compatibility

ARROW_STREAM with INLINE disposition is the only format that works
across all warehouse types, including serverless warehouses that reject
JSON_ARRAY. Change the default from JSON to ARROW_STREAM throughout:

- Server: defaults.ts, analytics plugin request handler
- Client: useAnalyticsQuery, UseAnalyticsQueryOptions, useChartData
- Tests: update assertions for new default

JSON and ARROW formats remain available via explicit format parameter.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
When using the default ARROW_STREAM format, the analytics plugin now
automatically falls back through formats if the warehouse rejects one:
ARROW_STREAM → JSON → ARROW.

This handles warehouses that only support a subset of format/disposition
combinations without requiring users to know their warehouse's
capabilities. Explicit format requests (JSON, ARROW) are respected
without fallback.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Previously, _transformDataArray unconditionally called updateWithArrowStatus
for any ARROW_STREAM response, which discards inline data and returns only
statement_id + status. This was designed for EXTERNAL_LINKS (where data is
fetched separately) but broke INLINE disposition where data is in data_array.

Changes:
- _transformDataArray now checks for data_array before routing to the
  EXTERNAL_LINKS path: if data_array is present, it falls through to the
  standard row-to-object transform.
- JSON format now explicitly sends JSON_ARRAY + INLINE rather than relying
  on connector defaults. This prevents the connector default format from
  leaking into explicit JSON requests.
- Connector defaults reverted to JSON_ARRAY for backward compatibility with
  classic warehouses (the analytics plugin sets formats explicitly).
- Added connector-level tests for _transformDataArray covering ARROW_STREAM
  + INLINE, ARROW_STREAM + EXTERNAL_LINKS, and JSON_ARRAY paths.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Some serverless warehouses return ARROW_STREAM + INLINE results as base64
Arrow IPC in `result.attachment` rather than `result.data_array`. This adds
server-side decoding using apache-arrow's tableFromIPC to convert the
attachment into row objects, producing the same response shape as JSON_ARRAY
regardless of warehouse backend.

This abstracts a Databricks internal implementation detail (different
warehouses returning different response formats) so app developers get a
consistent `type: "result"` response with named row objects.

Changes:
- Add apache-arrow@21.1.0 as a server dependency (already used client-side)
- _transformDataArray detects `attachment` field and decodes via tableFromIPC
- Connector tests use real base64 Arrow IPC captured from a live serverless
  warehouse, covering: classic JSON_ARRAY, classic EXTERNAL_LINKS,
  serverless INLINE attachment, data_array fallback, and edge cases

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…ctor, files plugin

New test files covering major coverage gaps:
- context/tests/service-context.test.ts (35 tests, 7% -> 100%)
- stream/tests/stream-registry.test.ts (34 tests, 32% -> 100%)
- connectors/genie/tests/client.test.ts (28 tests, 61% -> 97%)
- plugins/files/tests/upload-and-write.test.ts (50 tests, 69% -> 89%)

Total: 1566 -> 1713 tests, all passing.

Co-authored-by: Isaac
…format model

Address Mario's review: collapse three formats (JSON, ARROW, ARROW_STREAM) into
two that match the Databricks API enums. ARROW_STREAM supports both INLINE and
EXTERNAL_LINKS dispositions with automatic fallback. Default remains JSON_ARRAY
per reviewer request.

Co-authored-by: Isaac
Fixes issues found by parallel review across Claude harsh-reviewer, GPT 5.4
xhigh, and Gemini 3.1 Pro on PR #256.

- Validate `format` at the route boundary (reject anything other than
  JSON_ARRAY / ARROW_STREAM with 400) so legacy `"JSON"` / `"ARROW"` callers
  fail loudly instead of silently falling through to ARROW_STREAM.
- Disable cache for ARROW_STREAM queries — EXTERNAL_LINKS pre-signed URLs
  expire much sooner than the default 1h TTL, so caching the statement_id
  hands out dead URLs on cache hits.
- Re-check `signal.aborted` before issuing the EXTERNAL_LINKS fallback —
  previously an aborted query would still bill a second statement that the
  client never reads.
- Replace fragile `msg.includes("ARROW_STREAM") || msg.includes("INLINE")`
  format-error detection with `_isInlineArrowUnsupported()` that requires
  both keywords plus a marker phrase, or a structured `error_code` of
  INVALID_PARAMETER_VALUE / NOT_IMPLEMENTED. Reduces false-positives on
  legitimate SQL/permission errors that mention those words.
- Normalize BigInt values from `row.toJSON()` when decoding inline Arrow
  attachments — `apache-arrow` returns BigInt for INT64/DECIMAL columns,
  which JSON.stringify rejects when the SSE writer serializes the response.
- Cap inline Arrow attachment decode at 64 MiB and wrap decode in try/catch
  so malformed payloads surface a typed ExecutionError instead of a generic
  500.
- Detect external_links explicitly in `_transformDataArray` rather than
  treating any "no attachment + no data_array" response as EXTERNAL_LINKS.
  Previously an empty INLINE response would return a phantom statement_id
  the client could not fetch.
- Remove the typegen DESCRIBE QUERY fallback that retried without `format`
  and got ARROW_STREAM responses with no `data_array`, silently degrading
  generated types to `unknown`. The surrounding allSettled already converts
  thrown errors to `unknown` types via `generateUnknownResultQuery`.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
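The BigInt normalization mentioned above can be sketched as a recursive walk. This is a hedged illustration: `normalizeBigInts` is an assumed name, and the safe-range cutover to strings is one reasonable policy, not necessarily the one the commit implements.

```typescript
// apache-arrow yields BigInt for INT64/DECIMAL columns, and JSON.stringify
// throws on BigInt. Convert values to JSON-safe numbers or strings.
function normalizeBigInts(value: unknown): unknown {
  if (typeof value === "bigint") {
    // Preserve exactness: use a string outside Number's safe integer range.
    return value >= BigInt(Number.MIN_SAFE_INTEGER) &&
      value <= BigInt(Number.MAX_SAFE_INTEGER)
      ? Number(value)
      : value.toString();
  }
  if (Array.isArray(value)) return value.map(normalizeBigInts);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, normalizeBigInts(v)]),
    );
  }
  return value;
}
```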
useAnalyticsQuery({ format: "ARROW_STREAM" }) is typed as TypedArrowTable,
but the previous implementation decoded inline IPC attachments to row objects
on the server and emitted them as { type: "result", data: rows }. Same
requested format produced a TypedArrowTable for EXTERNAL_LINKS warehouses
and a plain Row[] for INLINE/serverless warehouses, breaking direct callers
that invoked Arrow Table methods.

This commit makes ARROW_STREAM always deliver an Arrow Table to the client:

- SQL connector forwards the base64 IPC attachment through unchanged (with
  a 64 MiB size cap) instead of decoding it server-side.
- Analytics route emits a new SSE message { type: "arrow_inline",
  attachment: <base64> } when ARROW_STREAM + INLINE returns an attachment.
- useAnalyticsQuery decodes the base64 payload via the existing
  ArrowClient.processArrowBuffer pipeline, producing a Table with the same
  runtime shape as the EXTERNAL_LINKS path.
- Drops apache-arrow from the appkit (server) dependencies — decoding now
  lives only in appkit-ui where Arrow was already loaded.

Tests:
- Updated client.test.ts to assert the connector preserves the attachment.
- Added a size-cap test for oversized inline attachments.
- Added analytics.test.ts coverage for: arrow_inline SSE emission, format
  validation rejecting unknown values with 400, and aborted-signal
  short-circuit before fallback retry.
- Added use-analytics-query.test.ts (new) covering arrow_inline decoding,
  decode-failure error path, and JSON_ARRAY result handling.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
upload-and-write.test.ts (added in 7164d3b) covers FilesPlugin internals
(_handleApiError mapping, upload stream size enforcement, cache
invalidation, etc.) — entirely separate from this PR's analytics/Arrow
scope. The file ships with 9 failing tests against the current files
plugin (assertion mismatches on HTTP status codes, Readable.toWeb mock
type errors) that block CI here.

Removing it from this PR. The coverage push will come back as its own
PR where the failures can be properly debugged.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Some serverless warehouses reject JSON_ARRAY + INLINE for DESCRIBE QUERY
and return ARROW_STREAM by default. The previous behavior just removed
the broken fallback, which meant typegen produced `unknown` types for
those warehouses' queries.

This restores the fallback (retry without explicit format if JSON_ARRAY
is rejected) and teaches `convertToQueryType` to decode an inline base64
Arrow IPC attachment when `data_array` is empty. The DESCRIBE QUERY
result is itself a table with rows shaped (col_name, data_type, comment),
so the decode reads `table.toArray().map(r => r.toJSON())` rather than
`table.schema.fields` — reading the schema would yield bogus types
(every query would come out shaped like the metadata columns).

Re-adds apache-arrow as an appkit dependency (only the typegen uses it;
the runtime SDK does not).

Tests cover: schema extraction from data rows, lowercase type
normalization, data_array taking precedence when both are present, and
graceful degradation on malformed attachments.

Supersedes #316.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
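The schema-extraction step above (reading data rows, not `table.schema.fields`) can be sketched like this. `DescribeRow` and `columnTypesFromDescribeRows` are illustrative names; only the row shape (col_name, data_type, comment) and the lowercase normalization come from the commit message.

```typescript
// The DESCRIBE QUERY result is itself a table whose *rows* describe the
// user's query columns, so schema extraction reads rows, not the table schema.
interface DescribeRow {
  col_name: string;
  data_type: string;
  comment?: string | null;
}

function columnTypesFromDescribeRows(
  rows: DescribeRow[],
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const row of rows) {
    // Normalize Databricks type_text (e.g. "BIGINT", "DECIMAL(10,2)")
    // to lowercase, per the normalization the commit mentions.
    out[row.col_name] = row.data_type.toLowerCase();
  }
  return out;
}

// With apache-arrow, rows would come from a decoded inline attachment:
//   const table = tableFromIPC(Buffer.from(attachmentBase64, "base64"));
//   const rows = table.toArray().map((r) => r.toJSON()) as DescribeRow[];
```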
Tightens correctness gaps surfaced by the second ACE review pass
(GPT 5.4 + Gemini 3.1 Pro + Claude harsh-reviewer):

- _isInlineArrowUnsupported now requires both INLINE and ARROW_STREAM in
  the message regardless of whether a structured error_code is present.
  The previous shape would fall back on any INVALID_PARAMETER_VALUE that
  mentioned just one keyword.
- useAnalyticsQuery validates that arrow_inline messages carry a
  non-empty string attachment before invoking atob, so a malformed
  payload surfaces a clear error instead of a confusing decode crash.
- ARROW_STREAM cache TTL is capped at 600s rather than disabling the
  cache outright. Pre-signed EXTERNAL_LINKS URLs typically expire in
  ~15min, so 10min is a safe upper bound that still preserves caching
  for INLINE attachment responses.
- Type generator's JSON_ARRAY-rejection retry now (a) requires both
  JSON_ARRAY and a marker phrase before retrying, mirroring the
  analytics-plugin tightening, and (b) explicitly requests
  ARROW_STREAM + INLINE on the retry rather than letting the warehouse
  default kick in. Previously the warehouse default could return
  EXTERNAL_LINKS, in which case neither data_array nor attachment were
  populated and types silently degraded to `unknown`.
- _validateArrowAttachment strips whitespace and accounts for `=`
  padding so the size check is exact rather than an upper bound. Also
  uses `Math.floor` so the cap matches the actual decoded byte count.

Tests:
- New: structured error_code path triggers fallback when both keywords
  are present.
- New: regression — error mentioning only one of INLINE / ARROW_STREAM
  must not escalate to EXTERNAL_LINKS even if the retry interceptor
  attempts the query multiple times.
- New: hook rejects arrow_inline with attachment values that are
  undefined / null / empty / non-string / object, never invoking
  atob or processArrowBuffer on bad input.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
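The exact-size computation described above can be sketched in a few lines; the function name is an assumption, but the arithmetic (strip whitespace, subtract `=` padding, `Math.floor` for unpadded input) matches the commit.

```typescript
// Exact decoded byte count of a base64 string, without decoding it.
function decodedBase64Bytes(attachment: string): number {
  const stripped = attachment.replace(/\s/g, "");
  const padding = stripped.endsWith("==") ? 2 : stripped.endsWith("=") ? 1 : 0;
  // 4 base64 chars encode 3 bytes; floor handles unpadded trailing chars.
  return Math.floor((stripped.length * 3) / 4) - padding;
}
```

A cap check then compares this value against the limit directly, e.g. `decodedBase64Bytes(att) <= MAX_INLINE_ATTACHMENT_BYTES`, rather than an over-estimate from raw string length.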
The hook is typed `TypedArrowTable` for ARROW_STREAM, but empty results
(zero rows / no inline attachment / no external_links) previously fell
through to `{ type: "result", ...result }` with no `data` field. Direct
useAnalyticsQuery callers expecting a Table would crash on `.numRows`
or `.getChild()`.

This commit makes ARROW_STREAM truly contract-consistent: when the
response has a manifest schema but no row payload, the connector
synthesizes a zero-row Arrow IPC stream from the schema and stashes it
in `result.attachment`. The existing arrow_inline path emits it to the
client, which decodes it as a real Arrow Table — same shape as a
non-empty result, just with `numRows === 0` and empty vectors per
column.

A new module `connectors/sql-warehouse/arrow-schema.ts` parses
Databricks SQL `type_text` values into Apache Arrow `DataType`s. It is
fully exhaustive:
  - Scalars (STRING, BIGINT, DECIMAL, TIMESTAMP, BOOLEAN, …)
  - Parameterized: DECIMAL(p,s), VARCHAR(n), CHAR(n)
  - Nested: ARRAY<T>, MAP<K,V>, STRUCT<f1:T1, f2:T2 COMMENT '…', …>
  - INTERVAL year-month and day-time variants
  - Backtick-quoted struct field names (incl. doubled-backtick escapes)
  - Recursive nesting (verified up to MAP<STRING,ARRAY<STRUCT<…>>>)
  - NOT NULL annotations on struct fields

Unknown / unparseable types degrade to Utf8 rather than throwing, so
future Databricks types don't crash empty-result handling.

Tests: 72 cases covering every scalar, parameterization edge case,
deeply nested combinations, struct field annotations / comments /
backtick escapes, INTERVAL variants, and IPC round-trip. 7 connector
tests covering attachment synthesis + the cases where it must NOT
synthesize (external_links present, schema absent).

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
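A shape-only sketch of the `type_text` parsing described above. The real module maps to apache-arrow `DataType`s and covers MAP, STRUCT, and INTERVAL; this illustration handles only scalars and ARRAY so the recursive structure and the degrade-to-string fallback are visible without the dependency. All names here are assumptions.

```typescript
// Minimal recursive parser for Databricks SQL type_text strings.
type TypeDesc =
  | { kind: "scalar"; name: string }
  | { kind: "array"; element: TypeDesc };

function parseTypeText(typeText: string): TypeDesc {
  const t = typeText.trim();
  const arrayMatch = /^ARRAY<(.+)>$/i.exec(t);
  if (arrayMatch) {
    return { kind: "array", element: parseTypeText(arrayMatch[1]) };
  }
  // Parameterized scalars like DECIMAL(10,2) keep their base name.
  const scalarMatch = /^([A-Z_]+)/i.exec(t);
  if (scalarMatch) {
    return { kind: "scalar", name: scalarMatch[1].toUpperCase() };
  }
  // Unknown input degrades to a string type rather than throwing,
  // mirroring the Utf8 fallback described above.
  return { kind: "scalar", name: "STRING" };
}
```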
The previous SSE event-size limits (1 MiB on both server and client)
were below the connector's 64 MiB cap on inline Arrow IPC attachments.
Anything between 1 MiB and 64 MiB would pass the connector check but
get rejected at the SSE layer with a confusing "Buffer size exceeded"
error. The 64 MiB cap was effectively unreachable.

This commit aligns all three caps at 8 MiB:

- `streamDefaults.maxEventSize` (server SSE event cap)
- `connectSSE.maxBufferSize` (client SSE buffer cap)
- `MAX_INLINE_ATTACHMENT_BYTES` (connector inline attachment cap)

8 MiB is well above what analytics queries return in practice (most are
well under 1 MiB), gives ~3x headroom for compact ARROW_STREAM payloads,
and stays below the Databricks Statement Execution API hard limit on
INLINE disposition (~25 MiB). Anything larger than 8 MiB now fails fast
at the connector with a typed `ExecutionError`, and ARROW_STREAM falls
back to EXTERNAL_LINKS which has no size pressure.
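The alignment can be stated as constants; the values are from the commit message above, and the comments tie each one to the setting it cites (the declarations themselves are simplified for illustration).

```typescript
const MIB = 1024 * 1024;

// The three caps this commit aligns:
const maxEventSize = 8 * MIB;                // streamDefaults.maxEventSize (server SSE event cap)
const maxBufferSize = 8 * MIB;               // connectSSE.maxBufferSize (client SSE buffer cap)
const MAX_INLINE_ATTACHMENT_BYTES = 8 * MIB; // connector inline attachment cap
```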

A follow-up draft PR explores stash-and-serve via /arrow-result/:jobId
as a stronger architectural solution that removes the SSE size cap
entirely.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
DRAFT proposal. Replaces the `arrow_inline` SSE message type from #256
with an out-of-band delivery mechanism. Motivation: remove the SSE
event-size cap as a constraint on inline Arrow result size.

Architecture
------------
ARROW_STREAM responses are now delivered uniformly:
1. The connector returns the base64 Arrow IPC attachment in
   `result.attachment` for INLINE responses (unchanged from #256).
2. The analytics route base64-decodes it once, stashes the resulting
   buffer in `InlineArrowStash` (bounded LRU + TTL), and emits the
   same `{ type: "arrow", statement_id }` SSE message that
   EXTERNAL_LINKS already uses — with a synthetic id prefixed
   `inline-`.
3. The client (unchanged from main) calls `/arrow-result/:jobId` for
   any `type: "arrow"` message.
4. The route handler checks the stash first; if the id has the
   `inline-` prefix and is still present, it serves the bytes from
   memory. Otherwise it falls through to the existing warehouse
   fetch path for real EXTERNAL_LINKS statement_ids.

The `arrow_inline` SSE message type, the client-side base64 decode
helper, and the SSE buffer-size bumps from #256 all go away.

Reasoning
---------
SSE was designed for streams of small control messages. Pushing
multi-MB Arrow IPC bytes through it has two unavoidable costs:
- single-event memory ceiling on both server and client,
- proxy/load-balancer compatibility issues for large events.

Bumping the SSE buffer (5b in the design discussion) raises both
caps but doesn't fix the architectural mismatch. This PR moves bulk
bytes to HTTP, where they belong:
- HTTP/2 streaming, gzip, and browser background-fetch all work
  naturally.
- SSE buffer can stay at the conservative 1 MiB default.
- The wire protocol unifies INLINE and EXTERNAL_LINKS — the client
  has a single code path for ARROW_STREAM data.
- Inline result size cap goes back up to the Databricks API limit
  (25 MiB) instead of being constrained by SSE.

Tradeoffs
---------
- Server holds IPC buffers in memory until the client fetches them
  (one-shot `take()` removes on read; passive TTL evicts otherwise).
  For 100 entries × 25 MiB worst case, that's 2.5 GiB peak — but
  steady state is much smaller because entries are removed as soon
  as the client fetches.
- Single-process only. A multi-server deployment would need a
  shared store (Redis or equivalent) keyed on the synthetic id. The
  stash interface is small enough to swap implementations.
- One-shot reads mean the client cannot retry a failed fetch.
  Acceptable for our use case (warehouse query returned data, the
  hook either succeeds in fetching or surfaces a generic error).

Status
------
DRAFT: opened to discuss the architectural direction before landing.
#256 ships with the simpler 5b SSE-buffer-bump approach (8 MiB) which
is sufficient for the realistic analytics range. This proposal is the
cleaner long-term solution if the team wants to invest in the
server-side state.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
jamesbroadhead added a commit that referenced this pull request Apr 28, 2026
#2 — replace Object.defineProperty(signal, "aborted", ...) with
vi.spyOn(signal, "aborted", "get").mockReturnValue(true). The native
AbortSignal.aborted is a non-configurable getter; the previous form
worked under vitest's Node env but could throw "Cannot redefine
property" on stricter engines. The getter spy is vitest's idiomatic
way to mock non-configurable accessors and works regardless of
configurability.

#3 — defensive client-side size cap on arrow_inline attachments
(8 MiB), mirroring the connector cap. Prevents a misconfigured proxy
or future server bug from pushing the browser into allocating an
unbounded Uint8Array. Both caps carry a comment pointing to PR #320
which removes the arrow_inline SSE path entirely; that PR can drop
the client cap and raise the connector cap to the API's 25 MiB hard
limit when it lands.

Tests: 1 new — hook rejects oversized arrow_inline payloads before
decodeBase64 or processArrowBuffer is called.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>