RFC: stash inline Arrow IPC server-side, serve via /arrow-result #320
Draft
jamesbroadhead wants to merge 17 commits into main from
Conversation
Some serverless warehouses only support ARROW_STREAM with INLINE disposition, but the analytics plugin only offered JSON_ARRAY (INLINE) and ARROW_STREAM (EXTERNAL_LINKS). This adds a new "ARROW_STREAM" format option that uses INLINE disposition, making the plugin compatible with these warehouses. Fixes #242
Tests verify:
- ARROW_STREAM format passes INLINE disposition + ARROW_STREAM format
- ARROW format passes EXTERNAL_LINKS disposition + ARROW_STREAM format
- Default JSON format does not pass disposition or format overrides
The server-side ARROW_STREAM format added in the previous commit was not exposed to the frontend or typegen:
- Add "ARROW_STREAM" to AnalyticsFormat in appkit-ui hooks
- Add "arrow_stream" to DataFormat in chart types
- Handle "arrow_stream" in useChartData's resolveFormat()
- Make typegen resilient to ARROW_STREAM-only warehouses by retrying DESCRIBE QUERY without format when JSON_ARRAY is rejected

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…compatibility

ARROW_STREAM with INLINE disposition is the only format that works across all warehouse types, including serverless warehouses that reject JSON_ARRAY. Change the default from JSON to ARROW_STREAM throughout:
- Server: defaults.ts, analytics plugin request handler
- Client: useAnalyticsQuery, UseAnalyticsQueryOptions, useChartData
- Tests: update assertions for new default

JSON and ARROW formats remain available via explicit format parameter.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
When using the default ARROW_STREAM format, the analytics plugin now automatically falls back through formats if the warehouse rejects one: ARROW_STREAM → JSON → ARROW. This handles warehouses that only support a subset of format/disposition combinations without requiring users to know their warehouse's capabilities. Explicit format requests (JSON, ARROW) are respected without fallback. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
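For illustration, a minimal sketch of what such a fallback chain can look like. `executeWithFormat` is a hypothetical stand-in for the plugin's statement-execution call, and the error handling is deliberately simplified (any rejection advances the chain):

```ts
// Hedged sketch of the fallback behavior described above; names are illustrative,
// not the plugin's actual internals.
type AnalyticsFormat = "ARROW_STREAM" | "JSON" | "ARROW";

const FALLBACK_CHAIN: AnalyticsFormat[] = ["ARROW_STREAM", "JSON", "ARROW"];

async function executeWithFallback<T>(
  sql: string,
  requested: AnalyticsFormat | undefined,
  executeWithFormat: (sql: string, format: AnalyticsFormat) => Promise<T>,
): Promise<T> {
  if (requested !== undefined) {
    // Explicit format requests are respected without fallback.
    return executeWithFormat(sql, requested);
  }
  let lastError: unknown;
  for (const format of FALLBACK_CHAIN) {
    try {
      return await executeWithFormat(sql, format);
    } catch (err) {
      lastError = err; // assume the rejection was a format/disposition mismatch
    }
  }
  throw lastError;
}
```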
Previously, _transformDataArray unconditionally called updateWithArrowStatus for any ARROW_STREAM response, which discards inline data and returns only statement_id + status. This was designed for EXTERNAL_LINKS (where data is fetched separately) but broke INLINE disposition where data is in data_array.

Changes:
- _transformDataArray now checks for data_array before routing to the EXTERNAL_LINKS path: if data_array is present, it falls through to the standard row-to-object transform.
- JSON format now explicitly sends JSON_ARRAY + INLINE rather than relying on connector defaults. This prevents the connector default format from leaking into explicit JSON requests.
- Connector defaults reverted to JSON_ARRAY for backward compatibility with classic warehouses (the analytics plugin sets formats explicitly).
- Added connector-level tests for _transformDataArray covering ARROW_STREAM + INLINE, ARROW_STREAM + EXTERNAL_LINKS, and JSON_ARRAY paths.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
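A rough sketch of that routing decision, under an assumed result shape; the real `_transformDataArray` lives on the SQL connector and handles more cases:

```ts
// Illustrative routing sketch; the actual connector method differs in detail.
interface StatementResult {
  statement_id: string;
  status: { state: string };
  manifest?: { schema?: { columns: { name: string }[] } };
  result?: { data_array?: unknown[][] };
}

function transformDataArray(res: StatementResult): unknown {
  const rows = res.result?.data_array;
  const columns = res.manifest?.schema?.columns ?? [];
  if (rows) {
    // INLINE disposition: data is present, convert positional rows to named objects.
    return rows.map((row) =>
      Object.fromEntries(columns.map((c, i) => [c.name, row[i]])),
    );
  }
  // EXTERNAL_LINKS: no inline data, hand back statement_id + status so the
  // client can fetch the Arrow result separately.
  return { statement_id: res.statement_id, status: res.status };
}
```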
Some serverless warehouses return ARROW_STREAM + INLINE results as base64 Arrow IPC in `result.attachment` rather than `result.data_array`. This adds server-side decoding using apache-arrow's tableFromIPC to convert the attachment into row objects, producing the same response shape as JSON_ARRAY regardless of warehouse backend. This abstracts a Databricks internal implementation detail (different warehouses returning different response formats) so app developers get a consistent `type: "result"` response with named row objects.

Changes:
- Add apache-arrow@21.1.0 as a server dependency (already used client-side)
- _transformDataArray detects `attachment` field and decodes via tableFromIPC
- Connector tests use real base64 Arrow IPC captured from a live serverless warehouse, covering: classic JSON_ARRAY, classic EXTERNAL_LINKS, serverless INLINE attachment, data_array fallback, and edge cases

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
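The decode itself is small; a hedged sketch assuming Node's `Buffer` and the `apache-arrow` package (the helper name is illustrative):

```ts
import { tableFromIPC } from "apache-arrow";

// Decodes an INLINE base64 Arrow IPC attachment into plain row objects.
function decodeInlineAttachment(attachmentBase64: string): Record<string, unknown>[] {
  const bytes = Buffer.from(attachmentBase64, "base64");
  const table = tableFromIPC(bytes);
  // toArray() yields one row proxy per record; toJSON() turns it into a plain object.
  return table.toArray().map((row) => row.toJSON() as Record<string, unknown>);
}
```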
…ctor, files plugin

New test files covering major coverage gaps:
- context/tests/service-context.test.ts (35 tests, 7% -> 100%)
- stream/tests/stream-registry.test.ts (34 tests, 32% -> 100%)
- connectors/genie/tests/client.test.ts (28 tests, 61% -> 97%)
- plugins/files/tests/upload-and-write.test.ts (50 tests, 69% -> 89%)

Total: 1566 -> 1713 tests, all passing.

Co-authored-by: Isaac
…format model Address Mario's review: collapse three formats (JSON, ARROW, ARROW_STREAM) into two that match the Databricks API enums. ARROW_STREAM supports both INLINE and EXTERNAL_LINKS dispositions with automatic fallback. Default remains JSON_ARRAY per reviewer request. Co-authored-by: Isaac
Fixes issues found by parallel review across Claude harsh-reviewer, GPT 5.4 xhigh, and Gemini 3.1 Pro on PR #256.
- Validate `format` at the route boundary (reject anything other than JSON_ARRAY / ARROW_STREAM with 400) so legacy `"JSON"` / `"ARROW"` callers fail loudly instead of silently falling through to ARROW_STREAM.
- Disable cache for ARROW_STREAM queries — EXTERNAL_LINKS pre-signed URLs expire much sooner than the default 1h TTL, so caching the statement_id hands out dead URLs on cache hits.
- Re-check `signal.aborted` before issuing the EXTERNAL_LINKS fallback — previously an aborted query would still bill a second statement that the client never reads.
- Replace fragile `msg.includes("ARROW_STREAM") || msg.includes("INLINE")` format-error detection with `_isInlineArrowUnsupported()` that requires both keywords plus a marker phrase, or a structured `error_code` of INVALID_PARAMETER_VALUE / NOT_IMPLEMENTED. Reduces false-positives on legitimate SQL/permission errors that mention those words.
- Normalize BigInt values from `row.toJSON()` when decoding inline Arrow attachments — `apache-arrow` returns BigInt for INT64/DECIMAL columns, which JSON.stringify rejects when the SSE writer serializes the response.
- Cap inline Arrow attachment decode at 64 MiB and wrap decode in try/catch so malformed payloads surface a typed ExecutionError instead of a generic 500.
- Detect external_links explicitly in `_transformDataArray` rather than treating any "no attachment + no data_array" response as EXTERNAL_LINKS. Previously an empty INLINE response would return a phantom statement_id the client could not fetch.
- Remove the typegen DESCRIBE QUERY fallback that retried without `format` and got ARROW_STREAM responses with no `data_array`, silently degrading generated types to `unknown`. The surrounding allSettled already converts thrown errors to `unknown` types via `generateUnknownResultQuery`.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
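Two of these fixes are easiest to see in code. The following is an illustrative sketch rather than the plugin's actual source — in particular the marker phrase is a placeholder, not the real string:

```ts
// Sketch of the tightened format-error detection: structured error codes win,
// otherwise both keywords plus a marker phrase are required.
function isInlineArrowUnsupported(err: { message?: string; error_code?: string }): boolean {
  if (err.error_code === "INVALID_PARAMETER_VALUE" || err.error_code === "NOT_IMPLEMENTED") {
    return true;
  }
  const msg = (err.message ?? "").toUpperCase();
  return (
    msg.includes("ARROW_STREAM") &&
    msg.includes("INLINE") &&
    msg.includes("NOT SUPPORTED") // placeholder marker phrase, not the real one
  );
}

// Sketch of BigInt normalization: apache-arrow yields BigInt for INT64/DECIMAL,
// which JSON.stringify rejects, so convert to strings before serializing.
function normalizeBigInts(value: unknown): unknown {
  if (typeof value === "bigint") return value.toString();
  if (Array.isArray(value)) return value.map(normalizeBigInts);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, normalizeBigInts(v)]),
    );
  }
  return value;
}
```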
useAnalyticsQuery({ format: "ARROW_STREAM" }) is typed as TypedArrowTable,
but the previous implementation decoded inline IPC attachments to row objects
on the server and emitted them as { type: "result", data: rows }. Same
requested format produced a TypedArrowTable for EXTERNAL_LINKS warehouses
and a plain Row[] for INLINE/serverless warehouses, breaking direct callers
that invoked Arrow Table methods.
This commit makes ARROW_STREAM always deliver an Arrow Table to the client:
- SQL connector forwards the base64 IPC attachment through unchanged (with
a 64 MiB size cap) instead of decoding it server-side.
- Analytics route emits a new SSE message { type: "arrow_inline",
attachment: <base64> } when ARROW_STREAM + INLINE returns an attachment.
- useAnalyticsQuery decodes the base64 payload via the existing
ArrowClient.processArrowBuffer pipeline, producing a Table with the same
runtime shape as the EXTERNAL_LINKS path.
- Drops apache-arrow from the appkit (server) dependencies — decoding now
lives only in appkit-ui where Arrow was already loaded.
Tests:
- Updated client.test.ts to assert the connector preserves the attachment.
- Added a size-cap test for oversized inline attachments.
- Added analytics.test.ts coverage for: arrow_inline SSE emission, format
validation rejecting unknown values with 400, and aborted-signal
short-circuit before fallback retry.
- Added use-analytics-query.test.ts (new) covering arrow_inline decoding,
decode-failure error path, and JSON_ARRAY result handling.
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
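For reference, a browser-side sketch of the arrow_inline handling described above; the message shape comes from the commit, but `processArrowBuffer`'s exact signature is an assumption:

```ts
// Sketch only: ArrowClient.processArrowBuffer is assumed to accept raw IPC bytes
// and return the same Arrow Table the EXTERNAL_LINKS path produces.
type ArrowInlineMessage = { type: "arrow_inline"; attachment: string };

function decodeBase64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

async function handleArrowInline(
  msg: ArrowInlineMessage,
  processArrowBuffer: (bytes: Uint8Array) => Promise<unknown>,
) {
  if (typeof msg.attachment !== "string" || msg.attachment.length === 0) {
    throw new Error("arrow_inline message is missing a base64 attachment");
  }
  const bytes = decodeBase64ToBytes(msg.attachment);
  return processArrowBuffer(bytes); // yields an Arrow Table, same shape as EXTERNAL_LINKS
}
```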
upload-and-write.test.ts (added in 7164d3b) covers FilesPlugin internals (_handleApiError mapping, upload stream size enforcement, cache invalidation, etc.) — entirely separate from this PR's analytics/Arrow scope. The file ships with 9 failing tests against the current files plugin (assertion mismatches on HTTP status codes, Readable.toWeb mock type errors) that block CI here. Removing it from this PR. The coverage push will come back as its own PR where the failures can be properly debugged. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Some serverless warehouses reject JSON_ARRAY + INLINE for DESCRIBE QUERY and return ARROW_STREAM by default. The previous fix simply removed the broken fallback, which meant typegen produced `unknown` types for those warehouses' queries. This restores the fallback (retry without explicit format if JSON_ARRAY is rejected) and teaches `convertToQueryType` to decode an inline base64 Arrow IPC attachment when `data_array` is empty. The DESCRIBE QUERY result is itself a table with rows shaped (col_name, data_type, comment), so the decode reads `table.toArray().map(r => r.toJSON())` rather than `table.schema.fields` — reading the schema would yield bogus types (every query would come out shaped like the metadata columns). Re-adds apache-arrow as an appkit dependency (only the typegen uses it; the runtime SDK does not).

Tests cover: schema extraction from data rows, lowercase type normalization, data_array taking precedence when both are present, and graceful degradation on malformed attachments. Supersedes #316.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
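A sketch of the decode-and-extract step, assuming the (col_name, data_type, comment) row shape described above; the helper name is illustrative:

```ts
import { tableFromIPC } from "apache-arrow";

// Extracts column metadata from a DESCRIBE QUERY result delivered as an inline
// Arrow attachment. The rows (not the Arrow schema!) describe the target query.
function describeColumnsFromAttachment(
  attachmentBase64: string,
): { name: string; type: string }[] {
  const table = tableFromIPC(Buffer.from(attachmentBase64, "base64"));
  return table.toArray().map((r) => {
    const row = r.toJSON() as { col_name: string; data_type: string };
    return { name: row.col_name, type: row.data_type.toLowerCase() };
  });
}
```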
Tightens correctness gaps surfaced by the second ACE review pass (GPT 5.4 + Gemini 3.1 Pro + Claude harsh-reviewer):
- _isInlineArrowUnsupported now requires both INLINE and ARROW_STREAM in the message regardless of whether a structured error_code is present. The previous shape would fall back on any INVALID_PARAMETER_VALUE that mentioned just one keyword.
- useAnalyticsQuery validates that arrow_inline messages carry a non-empty string attachment before invoking atob, so a malformed payload surfaces a clear error instead of a confusing decode crash.
- ARROW_STREAM cache TTL is capped at 600s rather than disabling the cache outright. Pre-signed EXTERNAL_LINKS URLs typically expire in ~15min, so 10min is a safe upper bound that still preserves caching for INLINE attachment responses.
- Type generator's JSON_ARRAY-rejection retry now (a) requires both JSON_ARRAY and a marker phrase before retrying, mirroring the analytics-plugin tightening, and (b) explicitly requests ARROW_STREAM + INLINE on the retry rather than letting the warehouse default kick in. Previously the warehouse default could return EXTERNAL_LINKS, in which case neither data_array nor attachment were populated and types silently degraded to `unknown`.
- _validateArrowAttachment strips whitespace and accounts for `=` padding so the size check is exact rather than an upper bound. Also uses `Math.floor` so the cap matches the actual decoded byte count.

Tests:
- New: structured error_code path triggers fallback when both keywords are present.
- New: regression — error mentioning only one of INLINE / ARROW_STREAM must not escalate to EXTERNAL_LINKS even if the retry interceptor attempts the query multiple times.
- New: hook rejects arrow_inline with attachment values that are undefined / null / empty / non-string / object, never invoking atob or processArrowBuffer on bad input.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
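The exact-size check is worth spelling out. A sketch, with the byte cap passed in rather than hard-coded since the limit changes in a later commit:

```ts
// Computes the exact decoded size of a base64 payload (whitespace stripped,
// '=' padding subtracted) and rejects anything over the cap.
function validateArrowAttachment(attachmentBase64: string, maxBytes: number): void {
  const cleaned = attachmentBase64.replace(/\s/g, "");
  const padding = cleaned.endsWith("==") ? 2 : cleaned.endsWith("=") ? 1 : 0;
  // Every 4 base64 characters encode 3 bytes; padding characters encode nothing.
  const decodedBytes = Math.floor((cleaned.length * 3) / 4) - padding;
  if (decodedBytes > maxBytes) {
    throw new Error(
      `Inline Arrow attachment is ${decodedBytes} bytes, exceeding the ${maxBytes}-byte cap`,
    );
  }
}
```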
The hook is typed `TypedArrowTable` for ARROW_STREAM, but empty results
(zero rows / no inline attachment / no external_links) previously fell
through to `{ type: "result", ...result }` with no `data` field. Direct
useAnalyticsQuery callers expecting a Table would crash on `.numRows`
or `.getChild()`.
This commit makes ARROW_STREAM truly contract-consistent: when the
response has a manifest schema but no row payload, the connector
synthesizes a zero-row Arrow IPC stream from the schema and stashes it
in `result.attachment`. The existing arrow_inline path emits it to the
client, which decodes it as a real Arrow Table — same shape as a
non-empty result, just with `numRows === 0` and empty vectors per
column.
A new module `connectors/sql-warehouse/arrow-schema.ts` parses
Databricks SQL `type_text` values into Apache Arrow `DataType`s. It is
fully exhaustive:
- Scalars (STRING, BIGINT, DECIMAL, TIMESTAMP, BOOLEAN, …)
- Parameterized: DECIMAL(p,s), VARCHAR(n), CHAR(n)
- Nested: ARRAY<T>, MAP<K,V>, STRUCT<f1:T1, f2:T2 COMMENT '…', …>
- INTERVAL year-month and day-time variants
- Backtick-quoted struct field names (including doubled-backtick escapes)
- Recursive nesting (verified up to MAP<STRING,ARRAY<STRUCT<…>>>)
- NOT NULL annotations on struct fields
Unknown / unparseable types degrade to Utf8 rather than throwing, so
future Databricks types don't crash empty-result handling.
Tests: 72 cases covering every scalar, parameterization edge case,
deeply nested combinations, struct field annotations / comments /
backtick escapes, INTERVAL variants, and IPC round-trip. 7 connector
tests covering attachment synthesis + the cases where it must NOT
synthesize (external_links present, schema absent).
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
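A highly simplified sketch of the synthesis path. Only a handful of scalar `type_text` values are mapped here, it assumes apache-arrow's empty-table constructor (`new Table(schema)`), and the function names are illustrative; the real arrow-schema.ts parser is far more complete:

```ts
import { Bool, DataType, Field, Float64, Int64, Schema, Table, Utf8, tableToIPC } from "apache-arrow";

// Maps a few scalar Databricks type_text values to Arrow types; everything else
// degrades to Utf8, mirroring the commit's fallback behavior.
function toArrowType(typeText: string): DataType {
  switch (typeText.toUpperCase()) {
    case "BIGINT": return new Int64();
    case "DOUBLE": return new Float64();
    case "BOOLEAN": return new Bool();
    default: return new Utf8();
  }
}

// Builds a zero-row Arrow IPC stream from manifest columns and returns it
// base64-encoded, so the existing arrow_inline path can ship it to the client.
function synthesizeEmptyAttachment(columns: { name: string; type_text: string }[]): string {
  const fields = columns.map((c) => new Field(c.name, toArrowType(c.type_text), true));
  const emptyTable = new Table(new Schema(fields)); // schema only, numRows === 0
  return Buffer.from(tableToIPC(emptyTable, "stream")).toString("base64");
}
```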
The previous SSE event-size limits (1 MiB on both server and client) were below the connector's 64 MiB cap on inline Arrow IPC attachments. Anything between 1 MiB and 64 MiB would pass the connector check but get rejected at the SSE layer with a confusing "Buffer size exceeded" error. The 64 MiB cap was effectively unreachable.

This commit aligns all three caps at 8 MiB:
- `streamDefaults.maxEventSize` (server SSE event cap)
- `connectSSE.maxBufferSize` (client SSE buffer cap)
- `MAX_INLINE_ATTACHMENT_BYTES` (connector inline attachment cap)

8 MiB is well above what analytics queries return in practice (most are well under 1 MiB), gives ~3x headroom for compact ARROW_STREAM payloads, and stays below the Databricks Statement Execution API hard limit on INLINE disposition (~25 MiB). Anything larger than 8 MiB now fails fast at the connector with a typed `ExecutionError`, and ARROW_STREAM falls back to EXTERNAL_LINKS, which has no size pressure.

A follow-up draft PR explores stash-and-serve via /arrow-result/:jobId as a stronger architectural solution that removes the SSE size cap entirely.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
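In configuration terms the alignment amounts to three constants agreeing; the option names follow the commit text, but where each constant lives in the codebase is an assumption:

```ts
// Illustrative alignment of the three caps at 8 MiB.
const EIGHT_MIB = 8 * 1024 * 1024;

export const streamDefaults = { maxEventSize: EIGHT_MIB };   // server SSE event cap
export const sseClientOptions = { maxBufferSize: EIGHT_MIB }; // client connectSSE buffer cap
export const MAX_INLINE_ATTACHMENT_BYTES = EIGHT_MIB;         // connector inline attachment cap
```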
DRAFT proposal. Replaces the `arrow_inline` SSE message type from #256 with an out-of-band delivery mechanism. Motivation: remove the SSE event-size cap as a constraint on inline Arrow result size.

Architecture
------------
ARROW_STREAM responses are now delivered uniformly:
1. The connector returns the base64 Arrow IPC attachment in `result.attachment` for INLINE responses (unchanged from #256).
2. The analytics route base64-decodes it once, stashes the resulting buffer in `InlineArrowStash` (bounded LRU + TTL), and emits the same `{ type: "arrow", statement_id }` SSE message that EXTERNAL_LINKS already uses — with a synthetic id prefixed `inline-`.
3. The client (unchanged from main) calls `/arrow-result/:jobId` for any `type: "arrow"` message.
4. The route handler checks the stash first; if the id has the `inline-` prefix and is still present, it serves the bytes from memory. Otherwise it falls through to the existing warehouse fetch path for real EXTERNAL_LINKS statement_ids.

The `arrow_inline` SSE message type, the client-side base64 decode helper, and the SSE buffer-size bumps from #256 all go away.

Reasoning
---------
SSE was designed for streams of small control messages. Pushing multi-MB Arrow IPC bytes through it has two unavoidable costs:
- single-event memory ceiling on both server and client,
- proxy/load-balancer compatibility issues for large events.

Bumping the SSE buffer (5b in the design discussion) raises both caps but doesn't fix the architectural mismatch. This PR moves bulk bytes to HTTP, where they belong:
- HTTP/2 streaming, gzip, and browser background-fetch all work naturally.
- SSE buffer can stay at the conservative 1 MiB default.
- The wire protocol unifies INLINE and EXTERNAL_LINKS — the client has a single code path for ARROW_STREAM data.
- Inline result size cap goes back up to the Databricks API limit (25 MiB) instead of being constrained by SSE.

Tradeoffs
---------
- Server holds IPC buffers in memory until the client fetches them (one-shot `take()` removes on read; passive TTL evicts otherwise). For 100 entries × 25 MiB worst case, that's 2.5 GiB peak — but steady state is much smaller because entries are removed as soon as the client fetches.
- Single-process only. A multi-server deployment would need a shared store (Redis or equivalent) keyed on the synthetic id. The stash interface is small enough to swap implementations.
- One-shot reads mean the client cannot retry a failed fetch. Acceptable for our use case (warehouse query returned data, the hook either succeeds in fetching or surfaces a generic error).

Status
------
DRAFT: opened to discuss the architectural direction before landing. #256 ships with the simpler 5b SSE-buffer-bump approach (8 MiB), which is sufficient for the realistic analytics range. This proposal is the cleaner long-term solution if the team wants to invest in the server-side state.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
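A sketch of the stash described above — the `put`/`take`/`size` surface follows the PR text, while the internals (a Map used for insertion-order eviction, a default TTL) are assumptions:

```ts
// Bounded, TTL-evicting, one-shot in-memory store for inline Arrow IPC buffers.
interface StashEntry { bytes: Uint8Array; expiresAt: number }

class InlineArrowStash {
  private entries = new Map<string, StashEntry>(); // insertion order ≈ LRU eviction

  constructor(private maxEntries = 100, private ttlMs = 60_000) {}

  put(id: string, bytes: Uint8Array): void {
    this.evictExpired();
    if (this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(id, { bytes, expiresAt: Date.now() + this.ttlMs });
  }

  /** One-shot read: returns the bytes and removes the entry. */
  take(id: string): Uint8Array | undefined {
    const entry = this.entries.get(id);
    this.entries.delete(id);
    if (!entry || entry.expiresAt < Date.now()) return undefined;
    return entry.bytes;
  }

  size(): number {
    this.evictExpired();
    return this.entries.size;
  }

  private evictExpired(): void {
    const now = Date.now();
    for (const [id, e] of this.entries) if (e.expiresAt < now) this.entries.delete(id);
  }
}
```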
jamesbroadhead added a commit that referenced this pull request on Apr 28, 2026
#2 — replace Object.defineProperty(signal, "aborted", ...) with vi.spyOn(signal, "aborted", "get").mockReturnValue(true). The native AbortSignal.aborted is a non-configurable getter; the previous form worked under vitest's Node env but could throw "Cannot redefine property" on stricter engines. The getter spy is vitest's idiomatic way to mock non-configurable accessors and works regardless of configurability.

#3 — defensive client-side size cap on arrow_inline attachments (8 MiB), mirroring the connector cap. Prevents a misconfigured proxy or future server bug from pushing the browser into allocating an unbounded Uint8Array.

Both caps carry a comment pointing to PR #320, which removes the arrow_inline SSE path entirely; that PR can drop the client cap and raise the connector cap to the API's 25 MiB hard limit when it lands.

Tests: 1 new — hook rejects oversized arrow_inline payloads before decodeBase64 or processArrowBuffer is called.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
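For reference, a minimal vitest sketch of the getter-spy approach from the commit:

```ts
import { expect, it, vi } from "vitest";

it("treats the signal as already aborted", () => {
  const signal = new AbortController().signal;
  // Spy on the accessor instead of redefining the non-configurable property.
  vi.spyOn(signal, "aborted", "get").mockReturnValue(true);
  expect(signal.aborted).toBe(true);
});
```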
Status: DRAFT — for design discussion only, not for immediate merge.
This PR proposes an alternative architecture for the inline-Arrow path landed in #256.
Background
#256 lands ARROW_STREAM + INLINE support by emitting a new `{ type: "arrow_inline", attachment: <base64> }` SSE message and decoding the Arrow IPC bytes on the client. To make that work for non-trivial result sizes, #256 also bumps both the server and client SSE event-size caps from 1 MiB to 8 MiB.

That works, but has two architectural concerns:
- a single-event memory ceiling on both server and client, and
- proxy/load-balancer compatibility issues for large SSE events.
What this PR does
Delivers ARROW_STREAM + INLINE responses out-of-band via the existing `/arrow-result/:jobId` endpoint:

1. The connector passes `result.attachment` through and still synthesizes empty schemas as base64 IPC.
2. The analytics route base64-decodes the attachment once, stashes the buffer in `InlineArrowStash` (bounded LRU + TTL), and emits the same `{ type: "arrow", statement_id }` SSE message that EXTERNAL_LINKS already uses — but with a synthetic id prefixed `inline-`.
3. The client calls `/arrow-result/:jobId` for any `type: "arrow"` message (unchanged from `main`).
4. The route handler checks the stash first; if the id has the `inline-` prefix and is still present, it serves bytes from memory (see the sketch after this list). Otherwise it falls through to the existing warehouse-fetch path for real EXTERNAL_LINKS statement_ids.
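A sketch of that stash-first lookup, assuming an Express-style route; `stash` and `fetchExternalLinksResult` are illustrative stand-ins for the existing handler's internals:

```ts
import express from "express";

// Assumed stand-ins: `stash` follows the put/take/size interface described above,
// and fetchExternalLinksResult represents the existing warehouse-fetch path.
declare const stash: { take(id: string): Uint8Array | undefined };
declare function fetchExternalLinksResult(statementId: string): Promise<Uint8Array>;

const app = express();

app.get("/arrow-result/:jobId", async (req, res) => {
  const { jobId } = req.params;
  if (jobId.startsWith("inline-")) {
    const bytes = stash.take(jobId); // one-shot: the entry is removed on read
    if (!bytes) return res.status(404).end();
    res.setHeader("Content-Type", "application/vnd.apache.arrow.stream");
    return res.end(Buffer.from(bytes));
  }
  // A real EXTERNAL_LINKS statement_id falls through to the existing warehouse fetch.
  const arrowBytes = await fetchExternalLinksResult(jobId);
  res.setHeader("Content-Type", "application/vnd.apache.arrow.stream");
  res.end(Buffer.from(arrowBytes));
});
```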
The `arrow_inline` SSE message type, the client-side base64 decode helper, and the SSE buffer-size bumps from #256 all go away. SSE returns to the 1 MiB default and only carries control messages.

Why this is cleaner
- Bulk Arrow bytes move off SSE (no more `arrow_inline` over SSE) onto plain HTTP (`/arrow-result`), keyed by a `statement_id`.
- One SSE message type instead of two (`arrow`, `arrow_inline`): the client sees `arrow` for both INLINE and EXTERNAL_LINKS.

Tradeoffs
- Server memory: the stash holds IPC buffers until the client fetches them. `take()` is one-shot and removes the entry on read; TTL evicts unread entries. Worst case with default bounds (100 entries × 25 MiB) is ~2.5 GiB peak. Steady state is much smaller because the client fetches within a few seconds.
- Single-process only: a multi-server deployment would need a shared store (Redis or equivalent) keyed on the synthetic id. The stash interface is deliberately small (`put`/`take`/`size`) to make this swap straightforward.
- One-shot reads: clients cannot retry a failed `/arrow-result` fetch — by the time they retry, the stash entry is gone. Acceptable for our case (the warehouse query returned data; the hook either succeeds or surfaces a generic error).

Implementation surface vs #256
10 net files, +296/−125 vs #256.
Test status
What I want to know
- Should the stash bounds (`maxEntries`, `ttlMs`) be configurable through the analytics plugin config?

If we like the direction, I'll cherry-pick this on top of #256 once it lands; otherwise we keep #256's 5b approach.
This pull request was AI-assisted by Isaac.