feat: add INLINE + ARROW_STREAM format support for analytics plugin #256
jamesbroadhead wants to merge 21 commits into main from
Conversation
/** Supported data formats for analytics queries */
- export type DataFormat = "json" | "arrow" | "auto";
+ export type DataFormat = "json" | "arrow" | "arrow_stream" | "auto";
in theory arrow is the same as arrow_stream, so I'm not following what's the problem?
/** Format configurations in fallback order. */
private static readonly FORMAT_CONFIGS = {
  ARROW_STREAM: {
    formatParameters: { disposition: "INLINE", format: "ARROW_STREAM" },
from this URL
https://docs.databricks.com/api/workspace/statementexecution/executestatement#format
Important: The formats ARROW_STREAM and CSV are supported only with EXTERNAL_LINKS disposition. JSON_ARRAY is supported in INLINE and EXTERNAL_LINKS disposition.
So before changing anything: this was already supporting arrow. Can I know the case where this was failing? I would like to see it.
Seeing that there's a case of Arrow + inline, let's refactor what we had instead of introducing a new format. Let's change the format "ARROW" to "ARROW_STREAM" and allow it to use both "EXTERNAL_LINKS" and "INLINE". Then for now let's keep JSON + inline as the default. This might require some UI hooks changes too.
Hi Mario — thanks for the review, you were right on both counts. I've rebased and reworked the PR.

What's happening: serverless warehouses return ARROW_STREAM + INLINE results as base64 Arrow IPC in result.attachment rather than result.data_array, so the previous code path silently returned empty results.

What I changed (per your suggestion): collapsed the three formats into JSON_ARRAY and ARROW_STREAM, with ARROW_STREAM supporting both INLINE and EXTERNAL_LINKS dispositions, and kept JSON_ARRAY as the default.
Also added 147 new unit tests covering major coverage gaps (service-context 7%→100%, stream-registry 32%→100%, genie connector 61%→97%, files plugin 69%→89%). All 1711 tests pass. Cleaned up the branch — it's now TS-only, no unrelated changes.
Some serverless warehouses only support ARROW_STREAM with INLINE disposition, but the analytics plugin only offered JSON_ARRAY (INLINE) and ARROW_STREAM (EXTERNAL_LINKS). This adds a new "ARROW_STREAM" format option that uses INLINE disposition, making the plugin compatible with these warehouses. Fixes #242
Tests verify:
- ARROW_STREAM format passes INLINE disposition + ARROW_STREAM format
- ARROW format passes EXTERNAL_LINKS disposition + ARROW_STREAM format
- Default JSON format does not pass disposition or format overrides
The server-side ARROW_STREAM format added in the previous commit was not exposed to the frontend or typegen:
- Add "ARROW_STREAM" to AnalyticsFormat in appkit-ui hooks
- Add "arrow_stream" to DataFormat in chart types
- Handle "arrow_stream" in useChartData's resolveFormat()
- Make typegen resilient to ARROW_STREAM-only warehouses by retrying DESCRIBE QUERY without format when JSON_ARRAY is rejected

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…compatibility

ARROW_STREAM with INLINE disposition is the only format that works across all warehouse types, including serverless warehouses that reject JSON_ARRAY. Change the default from JSON to ARROW_STREAM throughout:
- Server: defaults.ts, analytics plugin request handler
- Client: useAnalyticsQuery, UseAnalyticsQueryOptions, useChartData
- Tests: update assertions for the new default

JSON and ARROW formats remain available via the explicit format parameter.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
When using the default ARROW_STREAM format, the analytics plugin now automatically falls back through formats if the warehouse rejects one: ARROW_STREAM → JSON → ARROW. This handles warehouses that only support a subset of format/disposition combinations without requiring users to know their warehouse's capabilities. Explicit format requests (JSON, ARROW) are respected without fallback. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
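The fallback chain described in this commit can be sketched as below. `executeWithFallback`, `FALLBACK_ORDER`, and `isFormatRejection` are illustrative names under the later two-format model, not the plugin's actual internals:

```typescript
// Illustrative sketch of the format fallback: try each format/disposition
// pair in order, falling through only on format rejections.
type FormatConfig = { format: string; disposition: string };

const FALLBACK_ORDER: FormatConfig[] = [
  { format: "ARROW_STREAM", disposition: "INLINE" },
  { format: "JSON_ARRAY", disposition: "INLINE" },
  { format: "ARROW_STREAM", disposition: "EXTERNAL_LINKS" },
];

async function executeWithFallback(
  execute: (cfg: FormatConfig) => Promise<unknown>,
  isFormatRejection: (err: unknown) => boolean,
): Promise<unknown> {
  let lastError: unknown;
  for (const cfg of FALLBACK_ORDER) {
    try {
      return await execute(cfg);
    } catch (err) {
      // Only keep falling back on format/disposition rejections;
      // real SQL or permission errors propagate immediately.
      if (!isFormatRejection(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

The key design point from the commit is the second argument: explicit format requests bypass this loop entirely, so only the default path pays the retry cost.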
Previously, _transformDataArray unconditionally called updateWithArrowStatus for any ARROW_STREAM response, which discards inline data and returns only statement_id + status. This was designed for EXTERNAL_LINKS (where data is fetched separately) but broke INLINE disposition, where data is in data_array.

Changes:
- _transformDataArray now checks for data_array before routing to the EXTERNAL_LINKS path: if data_array is present, it falls through to the standard row-to-object transform.
- JSON format now explicitly sends JSON_ARRAY + INLINE rather than relying on connector defaults. This prevents the connector default format from leaking into explicit JSON requests.
- Connector defaults reverted to JSON_ARRAY for backward compatibility with classic warehouses (the analytics plugin sets formats explicitly).
- Added connector-level tests for _transformDataArray covering ARROW_STREAM + INLINE, ARROW_STREAM + EXTERNAL_LINKS, and JSON_ARRAY paths.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Some serverless warehouses return ARROW_STREAM + INLINE results as base64 Arrow IPC in `result.attachment` rather than `result.data_array`. This adds server-side decoding using apache-arrow's tableFromIPC to convert the attachment into row objects, producing the same response shape as JSON_ARRAY regardless of warehouse backend. This abstracts a Databricks internal implementation detail (different warehouses returning different response formats) so app developers get a consistent `type: "result"` response with named row objects.

Changes:
- Add apache-arrow@21.1.0 as a server dependency (already used client-side)
- _transformDataArray detects the `attachment` field and decodes via tableFromIPC
- Connector tests use real base64 Arrow IPC captured from a live serverless warehouse, covering: classic JSON_ARRAY, classic EXTERNAL_LINKS, serverless INLINE attachment, data_array fallback, and edge cases

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…ctor, files plugin

New test files covering major coverage gaps:
- context/tests/service-context.test.ts (35 tests, 7% -> 100%)
- stream/tests/stream-registry.test.ts (34 tests, 32% -> 100%)
- connectors/genie/tests/client.test.ts (28 tests, 61% -> 97%)
- plugins/files/tests/upload-and-write.test.ts (50 tests, 69% -> 89%)

Total: 1566 -> 1713 tests, all passing.

Co-authored-by: Isaac
…format model

Address Mario's review: collapse three formats (JSON, ARROW, ARROW_STREAM) into two that match the Databricks API enums. ARROW_STREAM supports both INLINE and EXTERNAL_LINKS dispositions with automatic fallback. Default remains JSON_ARRAY per reviewer request.

Co-authored-by: Isaac
d4e4cea to a003274
Fixes issues found by parallel review across Claude harsh-reviewer, GPT 5.4 xhigh, and Gemini 3.1 Pro on PR #256.

- Validate `format` at the route boundary (reject anything other than JSON_ARRAY / ARROW_STREAM with 400) so legacy `"JSON"` / `"ARROW"` callers fail loudly instead of silently falling through to ARROW_STREAM.
- Disable cache for ARROW_STREAM queries — EXTERNAL_LINKS pre-signed URLs expire much sooner than the default 1h TTL, so caching the statement_id hands out dead URLs on cache hits.
- Re-check `signal.aborted` before issuing the EXTERNAL_LINKS fallback — previously an aborted query would still bill a second statement that the client never reads.
- Replace fragile `msg.includes("ARROW_STREAM") || msg.includes("INLINE")` format-error detection with `_isInlineArrowUnsupported()`, which requires both keywords plus a marker phrase, or a structured `error_code` of INVALID_PARAMETER_VALUE / NOT_IMPLEMENTED. Reduces false positives on legitimate SQL/permission errors that mention those words.
- Normalize BigInt values from `row.toJSON()` when decoding inline Arrow attachments — `apache-arrow` returns BigInt for INT64/DECIMAL columns, which JSON.stringify rejects when the SSE writer serializes the response.
- Cap inline Arrow attachment decode at 64 MiB and wrap decode in try/catch so malformed payloads surface a typed ExecutionError instead of a generic 500.
- Detect external_links explicitly in `_transformDataArray` rather than treating any "no attachment + no data_array" response as EXTERNAL_LINKS. Previously an empty INLINE response would return a phantom statement_id the client could not fetch.
- Remove the typegen DESCRIBE QUERY fallback that retried without `format` and got ARROW_STREAM responses with no `data_array`, silently degrading generated types to `unknown`. The surrounding allSettled already converts thrown errors to `unknown` types via `generateUnknownResultQuery`.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
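The BigInt-normalization fix above exists because `JSON.stringify` throws a `TypeError` on BigInt values. A minimal sketch of the idea, assuming a recursive walk over the decoded rows (`normalizeBigInts` is a hypothetical helper name, not the actual code):

```typescript
// apache-arrow returns BigInt for INT64/DECIMAL columns; JSON.stringify
// throws on BigInt, so values must be normalized before the SSE writer
// serializes the row objects. (Illustrative helper, not the real one.)
function normalizeBigInts(value: unknown): unknown {
  if (typeof value === "bigint") {
    // Keep exactness as a string outside the safe-integer range;
    // otherwise a plain number is more ergonomic for clients.
    return value >= BigInt(Number.MIN_SAFE_INTEGER) &&
      value <= BigInt(Number.MAX_SAFE_INTEGER)
      ? Number(value)
      : value.toString();
  }
  if (Array.isArray(value)) return value.map(normalizeBigInts);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        k,
        normalizeBigInts(v),
      ]),
    );
  }
  return value;
}
```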
useAnalyticsQuery({ format: "ARROW_STREAM" }) is typed as TypedArrowTable,
but the previous implementation decoded inline IPC attachments to row objects
on the server and emitted them as { type: "result", data: rows }. Same
requested format produced a TypedArrowTable for EXTERNAL_LINKS warehouses
and a plain Row[] for INLINE/serverless warehouses, breaking direct callers
that invoked Arrow Table methods.
This commit makes ARROW_STREAM always deliver an Arrow Table to the client:
- SQL connector forwards the base64 IPC attachment through unchanged (with
a 64 MiB size cap) instead of decoding it server-side.
- Analytics route emits a new SSE message { type: "arrow_inline",
attachment: <base64> } when ARROW_STREAM + INLINE returns an attachment.
- useAnalyticsQuery decodes the base64 payload via the existing
ArrowClient.processArrowBuffer pipeline, producing a Table with the same
runtime shape as the EXTERNAL_LINKS path.
- Drops apache-arrow from the appkit (server) dependencies — decoding now
lives only in appkit-ui where Arrow was already loaded.
Tests:
- Updated client.test.ts to assert the connector preserves the attachment.
- Added a size-cap test for oversized inline attachments.
- Added analytics.test.ts coverage for: arrow_inline SSE emission, format
validation rejecting unknown values with 400, and aborted-signal
short-circuit before fallback retry.
- Added use-analytics-query.test.ts (new) covering arrow_inline decoding,
decode-failure error path, and JSON_ARRAY result handling.
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
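The client-side decode step from this commit can be sketched as follows. `decodeArrowInlinePayload` and `MAX_INLINE_BYTES` are illustrative names, and the real hook hands the resulting bytes to the existing `ArrowClient.processArrowBuffer` pipeline:

```typescript
// Sketch of the arrow_inline decode in the browser: validate the payload,
// then turn base64 into a Uint8Array for the Arrow decode pipeline.
const MAX_INLINE_BYTES = 8 * 1024 * 1024; // mirrors the assumed server cap

function decodeArrowInlinePayload(attachment: unknown): Uint8Array {
  if (typeof attachment !== "string" || attachment.length === 0) {
    throw new Error("arrow_inline message carried a malformed attachment");
  }
  const binary = atob(attachment); // throws on invalid base64
  if (binary.length > MAX_INLINE_BYTES) {
    throw new Error(`inline Arrow attachment exceeds ${MAX_INLINE_BYTES} bytes`);
  }
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}
```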
upload-and-write.test.ts (added in 7164d3b) covers FilesPlugin internals (_handleApiError mapping, upload stream size enforcement, cache invalidation, etc.) — entirely separate from this PR's analytics/Arrow scope. The file ships with 9 failing tests against the current files plugin (assertion mismatches on HTTP status codes, Readable.toWeb mock type errors) that block CI here. Removing it from this PR. The coverage push will come back as its own PR where the failures can be properly debugged. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Some serverless warehouses reject JSON_ARRAY + INLINE for DESCRIBE QUERY and return ARROW_STREAM by default. The previous behavior just removed the broken fallback, which meant typegen produced `unknown` types for those warehouses' queries. This restores the fallback (retry without explicit format if JSON_ARRAY is rejected) and teaches `convertToQueryType` to decode an inline base64 Arrow IPC attachment when `data_array` is empty. The DESCRIBE QUERY result is itself a table with rows shaped (col_name, data_type, comment), so the decode reads `table.toArray().map(r => r.toJSON())` rather than `table.schema.fields` — reading the schema would yield bogus types (every query would come out shaped like the metadata columns). Re-adds apache-arrow as an appkit dependency (only the typegen uses it; the runtime SDK does not). Tests cover: schema extraction from data rows, lowercase type normalization, data_array taking precedence when both are present, and graceful degradation on malformed attachments. Supersedes #316. Co-authored-by: Isaac Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Tightens correctness gaps surfaced by the second ACE review pass (GPT 5.4 + Gemini 3.1 Pro + Claude harsh-reviewer):

- _isInlineArrowUnsupported now requires both INLINE and ARROW_STREAM in the message regardless of whether a structured error_code is present. The previous shape would fall back on any INVALID_PARAMETER_VALUE that mentioned just one keyword.
- useAnalyticsQuery validates that arrow_inline messages carry a non-empty string attachment before invoking atob, so a malformed payload surfaces a clear error instead of a confusing decode crash.
- ARROW_STREAM cache TTL is capped at 600s rather than disabling the cache outright. Pre-signed EXTERNAL_LINKS URLs typically expire in ~15min, so 10min is a safe upper bound that still preserves caching for INLINE attachment responses.
- The type generator's JSON_ARRAY-rejection retry now (a) requires both JSON_ARRAY and a marker phrase before retrying, mirroring the analytics-plugin tightening, and (b) explicitly requests ARROW_STREAM + INLINE on the retry rather than letting the warehouse default kick in. Previously the warehouse default could return EXTERNAL_LINKS, in which case neither data_array nor attachment were populated and types silently degraded to `unknown`.
- _validateArrowAttachment strips whitespace and accounts for `=` padding so the size check is exact rather than an upper bound. Also uses `Math.floor` so the cap matches the actual decoded byte count.

Tests:
- New: structured error_code path triggers fallback when both keywords are present.
- New: regression — an error mentioning only one of INLINE / ARROW_STREAM must not escalate to EXTERNAL_LINKS even if the retry interceptor attempts the query multiple times.
- New: the hook rejects arrow_inline attachment values that are undefined / null / empty / non-string / object, never invoking atob or processArrowBuffer on bad input.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
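The exact-size arithmetic in the last bullet is easy to get subtly wrong. A minimal sketch of the described check (the helper name is hypothetical): each 4 base64 characters encode 3 bytes, minus one byte per `=` padding character, after whitespace is stripped.

```typescript
// Exact decoded byte count of a base64 string: strip whitespace, then
// floor(len * 3 / 4) minus one byte per '=' padding character.
function base64DecodedByteLength(b64: string): number {
  const clean = b64.replace(/\s+/g, "");
  const padding = clean.endsWith("==") ? 2 : clean.endsWith("=") ? 1 : 0;
  return Math.floor((clean.length * 3) / 4) - padding;
}
```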
The hook is typed `TypedArrowTable` for ARROW_STREAM, but empty results
(zero rows / no inline attachment / no external_links) previously fell
through to `{ type: "result", ...result }` with no `data` field. Direct
useAnalyticsQuery callers expecting a Table would crash on `.numRows`
or `.getChild()`.
This commit makes ARROW_STREAM truly contract-consistent: when the
response has a manifest schema but no row payload, the connector
synthesizes a zero-row Arrow IPC stream from the schema and stashes it
in `result.attachment`. The existing arrow_inline path emits it to the
client, which decodes it as a real Arrow Table — same shape as a
non-empty result, just with `numRows === 0` and empty vectors per
column.
A new module `connectors/sql-warehouse/arrow-schema.ts` parses
Databricks SQL `type_text` values into Apache Arrow `DataType`s. It is
fully exhaustive:
- Scalars (STRING, BIGINT, DECIMAL, TIMESTAMP, BOOLEAN, …)
- Parameterized: DECIMAL(p,s), VARCHAR(n), CHAR(n)
- Nested: ARRAY<T>, MAP<K,V>, STRUCT<f1:T1, f2:T2 COMMENT '…', …>
- INTERVAL year-month and day-time variants
- Backtick-quoted struct field names (incl. doubled-backtick escapes)
- Recursive nesting (verified up to MAP<STRING,ARRAY<STRUCT<…>>>)
- NOT NULL annotations on struct fields
Unknown / unparseable types degrade to Utf8 rather than throwing, so
future Databricks types don't crash empty-result handling.
Tests: 72 cases covering every scalar, parameterization edge case,
deeply nested combinations, struct field annotations / comments /
backtick escapes, INTERVAL variants, and IPC round-trip. 7 connector
tests covering attachment synthesis + the cases where it must NOT
synthesize (external_links present, schema absent).
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The previous SSE event-size limits (1 MiB on both server and client) were below the connector's 64 MiB cap on inline Arrow IPC attachments. Anything between 1 MiB and 64 MiB would pass the connector check but get rejected at the SSE layer with a confusing "Buffer size exceeded" error. The 64 MiB cap was effectively unreachable.

This commit aligns all three caps at 8 MiB:
- `streamDefaults.maxEventSize` (server SSE event cap)
- `connectSSE.maxBufferSize` (client SSE buffer cap)
- `MAX_INLINE_ATTACHMENT_BYTES` (connector inline attachment cap)

8 MiB is well above what analytics queries return in practice (most are well under 1 MiB), gives ~3x headroom for compact ARROW_STREAM payloads, and stays below the Databricks Statement Execution API hard limit on INLINE disposition (~25 MiB). Anything larger than 8 MiB now fails fast at the connector with a typed `ExecutionError`, and ARROW_STREAM falls back to EXTERNAL_LINKS, which has no size pressure.

A follow-up draft PR explores stash-and-serve via /arrow-result/:jobId as a stronger architectural solution that removes the SSE size cap entirely.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
DRAFT proposal. Replaces the `arrow_inline` SSE message type from #256 with an out-of-band delivery mechanism. Motivation: remove the SSE event-size cap as a constraint on inline Arrow result size.

Architecture
------------
ARROW_STREAM responses are now delivered uniformly:
1. The connector returns the base64 Arrow IPC attachment in `result.attachment` for INLINE responses (unchanged from #256).
2. The analytics route base64-decodes it once, stashes the resulting buffer in `InlineArrowStash` (bounded LRU + TTL), and emits the same `{ type: "arrow", statement_id }` SSE message that EXTERNAL_LINKS already uses — with a synthetic id prefixed `inline-`.
3. The client (unchanged from main) calls `/arrow-result/:jobId` for any `type: "arrow"` message.
4. The route handler checks the stash first; if the id has the `inline-` prefix and is still present, it serves the bytes from memory. Otherwise it falls through to the existing warehouse fetch path for real EXTERNAL_LINKS statement_ids.

The `arrow_inline` SSE message type, the client-side base64 decode helper, and the SSE buffer-size bumps from #256 all go away.

Reasoning
---------
SSE was designed for streams of small control messages. Pushing multi-MB Arrow IPC bytes through it has two unavoidable costs:
- a single-event memory ceiling on both server and client,
- proxy/load-balancer compatibility issues for large events.

Bumping the SSE buffer (5b in the design discussion) raises both caps but doesn't fix the architectural mismatch. This PR moves bulk bytes to HTTP, where they belong:
- HTTP/2 streaming, gzip, and browser background-fetch all work naturally.
- The SSE buffer can stay at the conservative 1 MiB default.
- The wire protocol unifies INLINE and EXTERNAL_LINKS — the client has a single code path for ARROW_STREAM data.
- The inline result size cap goes back up to the Databricks API limit (25 MiB) instead of being constrained by SSE.
Tradeoffs
---------
- The server holds IPC buffers in memory until the client fetches them (one-shot `take()` removes on read; passive TTL evicts otherwise). For 100 entries × 25 MiB worst case, that's 2.5 GiB peak — but steady state is much smaller because entries are removed as soon as the client fetches.
- Single-process only. A multi-server deployment would need a shared store (Redis or equivalent) keyed on the synthetic id. The stash interface is small enough to swap implementations.
- One-shot reads mean the client cannot retry a failed fetch. Acceptable for our use case (the warehouse query returned data; the hook either succeeds in fetching or surfaces a generic error).

Status
------
DRAFT: opened to discuss the architectural direction before landing. #256 ships with the simpler 5b SSE-buffer-bump approach (8 MiB), which is sufficient for the realistic analytics range. This proposal is the cleaner long-term solution if the team wants to invest in the server-side state.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
#2 — replace Object.defineProperty(signal, "aborted", ...) with vi.spyOn(signal, "aborted", "get").mockReturnValue(true). The native AbortSignal.aborted is a non-configurable getter; the previous form worked under vitest's Node env but could throw "Cannot redefine property" on stricter engines. The getter spy is vitest's idiomatic way to mock non-configurable accessors and works regardless of configurability.

#3 — defensive client-side size cap on arrow_inline attachments (8 MiB), mirroring the connector cap. Prevents a misconfigured proxy or future server bug from pushing the browser into allocating an unbounded Uint8Array.

Both caps carry a comment pointing to PR #320, which removes the arrow_inline SSE path entirely; that PR can drop the client cap and raise the connector cap to the API's 25 MiB hard limit when it lands.

Tests: 1 new — the hook rejects oversized arrow_inline payloads before decodeBase64 or processArrowBuffer is called.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The fallback decision in analytics.ts previously depended on substring-matching warehouse error messages, which is brittle against any change in error wording from the SDK or warehouse. The SDK already exposes a structured ApiError with `errorCode`, and the warehouse's `status.error.error_code` is a typed enum on the SDK's ServiceError. Both signals were being thrown away when we wrapped them into ExecutionError.

This commit propagates the code through:
- ExecutionError gains a `readonly errorCode?: string` field and `statementFailed(message, errorCode?)` accepts it.
- The connector forwards `status.error?.error_code` at both Path B sites (the initial SUCCEEDED/FAILED switch and the polling loop).
- The connector's catch-all wrap detects an SDK ApiError-shaped object via duck-typing on `errorCode` and forwards it through Path A too.
- _isInlineArrowUnsupported now checks `err instanceof ExecutionError && err.errorCode === "INVALID_PARAMETER_VALUE"` as the primary signal. The substring backstop remains for legacy SDK builds that don't surface structured codes, but it's secondary.

Net effect: format-rejection detection becomes a structured-code check that's stable across error-message wording changes. Substring matching shrinks to a backstop role.

Tests: new — assert the route falls back correctly when the connector hands up a structured ExecutionError, not just a substring-matchable message.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
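The structured-first detection this commit describes can be sketched as below. Both the class and the function are simplified stand-ins for the connector's actual `ExecutionError` and `_isInlineArrowUnsupported`, not the real implementations:

```typescript
// Simplified stand-in: an error type that carries the warehouse/SDK
// error code alongside the human-readable message.
class ExecutionError extends Error {
  constructor(
    message: string,
    public readonly errorCode?: string,
  ) {
    super(message);
    this.name = "ExecutionError";
  }
}

function isInlineArrowUnsupported(err: unknown): boolean {
  if (!(err instanceof ExecutionError)) return false;
  // Primary signal: the structured code forwarded from the SDK /
  // warehouse status. Stable across message-wording changes.
  if (err.errorCode === "INVALID_PARAMETER_VALUE") return true;
  // Backstop for legacy SDK builds: require both keywords to avoid
  // false positives on ordinary SQL or permission errors.
  const msg = err.message.toUpperCase();
  return msg.includes("ARROW_STREAM") && msg.includes("INLINE");
}
```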
Replaces the loosely-typed Promise<{ type: string; [key: string]: any }>
return type with a Zod-validated discriminated union shared between the
server and the client.
The schema lives in `shared/src/sse/analytics.ts` and is the single
source of truth for what messages can flow over /api/analytics/query:
- result — JSON_ARRAY rows (or empty results)
- arrow — ARROW_STREAM via /arrow-result/:jobId
- arrow_inline — ARROW_STREAM with base64 IPC bytes inline
Server side:
- _executeWithFormatFallback now returns AnalyticsSseMessage.
- makeResultMessage / makeArrowMessage / makeArrowInlineMessage
builders enforce required fields at construction time. A typo on
a key fails to compile.
Client side (useAnalyticsQuery):
- Validates the parsed JSON via AnalyticsSseMessage.safeParse before
dispatching to a handler. Type narrows correctly across branches —
no more `parsed.attachment as string` casts.
- A malformed payload (server bug or version skew) surfaces a clear
error instead of allocating a buffer from `undefined` or treating
a missing field as data.
Adds zod 3.23.8 to `shared`. CLAUDE.md already lists zod as a key
dependency for runtime validation, but the codebase didn't actually
use it anywhere — this lands the first instance.
Tests:
- 13 new schema tests in shared (each variant accept/reject, builder
round-trips).
- Existing analytics route + hook tests pass unchanged because the
schema accepts the message shapes the route was already producing.
- One hook test was rewritten: bad-attachment cases now hit the
schema validator (which rejects non-string / empty / null) and
surface the same generic error as before.
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The "logs a warning and yields no columns on malformed attachment"
test name promised a logger.warn assertion but only checked
hasResults === false. A regression that silently produced empty types
(no telemetry signal) would still pass.
Mock the logger module via vi.hoisted + vi.mock and assert:
- logger.warn fires with the expected message,
- the generated type is the unknown-result fallback (contains
"unknown"),
- DESCRIBE QUERY metadata column names ("col_name", "data_type")
don't leak into the user-facing type — that would mean the parser
swallowed the failure and produced bogus columns from misinterpreted
IPC bytes instead of degrading cleanly.
Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Two CI failures in the previous run, both behind-the-scenes regressions from recent commits:

1. Playground integration tests timed out because chart canvases never rendered. Root cause: pnpm resolved zod 4.1.13 (rather than the 3.23.8 specified in package.json — peer/dedup behavior), and zod 4 changed `z.record()` to require both key and value type arguments. The single-arg form `z.record(z.unknown())` from the AnalyticsResultMessage schema crashed at runtime when validating any payload with a `data` array, safeParse returned success:false, the hook treated every mock SSE response as malformed, and the chart loading state never cleared. Fix: switch to the two-arg form `z.record(z.string(), z.unknown())`, which is valid in both zod 3 and zod 4.

2. The docs build sync check failed because Class.ExecutionError.md was stale relative to the new `errorCode` field added in 3d32491. Regenerated via `pnpm docs:build`.

No new tests needed — the existing analytics route + hook tests now exercise the corrected schema, and the regression would have been caught earlier had the schema test been driven through the SDK's JSON.stringify → JSON.parse roundtrip the way the integration test does. Worth a follow-up to add an integration-style schema test that asserts a sample mock payload validates successfully.

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Hi — Claude here, on James's behalf. Closing in favour of a 3-PR stack carved out of this branch. Same end-state, easier to review.

Merge order: #327 → #328 → #329. Verified the sum of the three branches matches the PR-touched files on this branch exactly (0 diffs across all 30 files).
Summary

Serverless warehouses return `ARROW_STREAM` + `INLINE` results as base64 Arrow IPC in `result.attachment` rather than `result.data_array`. The previous code path discarded inline data for any `ARROW_STREAM` response (designed for `EXTERNAL_LINKS`), so these warehouses silently returned empty results.

This PR makes the analytics plugin work across classic and serverless warehouses by handling both dispositions for `ARROW_STREAM`, decoding inline Arrow IPC attachments server-side, and falling back to `JSON_ARRAY` when a warehouse rejects `ARROW_STREAM` + `INLINE`.

Changes

- `JSON_ARRAY` and `ARROW_STREAM` — replacing the prior `JSON` / `ARROW` / `ARROW_STREAM` triple. `ARROW_STREAM` now supports both `INLINE` and `EXTERNAL_LINKS` dispositions.
- The names (`JSON_ARRAY`, `ARROW_STREAM`, `INLINE`, `EXTERNAL_LINKS`) are the verbatim values defined by the Statement Execution API — we no longer translate to/from local aliases. Confirmed against the Databricks SDK Go source: `Format` (JSON_ARRAY / ARROW_STREAM / CSV) and `Disposition` (INLINE / EXTERNAL_LINKS). See also the API reference.
- `_transformDataArray` detects `result.attachment` and decodes via `apache-arrow`'s `tableFromIPC`, producing the same row-object shape as `JSON_ARRAY` regardless of warehouse backend. `apache-arrow@21.1.0` added as a server dependency.
- `ARROW_STREAM` + `INLINE` requests automatically fall back to `JSON_ARRAY` if a classic warehouse rejects them. Explicit format requests are respected without fallback.
- Default remains `JSON_ARRAY` for compatibility.
- `json_array` / `arrow_stream` naming in `useAnalyticsQuery`, `useChartData`, and chart types.

Tests

Added ~147 unit tests covering coverage gaps surfaced during the rework:

- `connectors/genie/tests/client.test.ts` (new, 786 lines)
- `connectors/sql-warehouse/tests/client.test.ts` (new, 286 lines) — uses real base64 Arrow IPC captured from a live serverless warehouse
- `context/tests/service-context.test.ts` (new, 457 lines)
- `plugins/files/tests/upload-and-write.test.ts` (new, 1245 lines)
- `stream/tests/stream-registry.test.ts` (new, 582 lines)
- `plugins/analytics/tests/analytics.test.ts` (+217)

Coverage deltas: service-context 7%→100%, stream-registry 32%→100%, genie connector 61%→97%, files plugin 69%→89%.

Test plan

- `ARROW_STREAM` + `INLINE` attachments — verify `useAnalyticsQuery` returns rows
- `JSON_ARRAY` default works and `ARROW_STREAM` + `EXTERNAL_LINKS` still works
- `ARROW_STREAM` + `INLINE`

Fixes #242
This pull request was AI-assisted by Isaac.