aitools: run multiple SQL queries in parallel from one invocation by simonfaltum · Pull Request #5093 · databricks/cli

simonfaltum · 2026-04-27T13:38:16Z

Stack

This PR is part of a 4-PR stack making aitools data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order.

aitools: extract pollStatement helper and pin OnWaitTimeout #5092 — aitools: extract pollStatement helper and pin OnWaitTimeout (base: main)
aitools: run multiple SQL queries in parallel from one invocation #5093 — aitools: run multiple SQL queries in parallel from one query invocation (base: aitools: extract pollStatement helper and pin OnWaitTimeout #5092) — this PR
aitools: add 'tools statement' lifecycle commands #5095 — aitools: add 'tools statement' lifecycle commands (base: aitools: run multiple SQL queries in parallel from one invocation #5093)
aitools: parallelize discover-schema across tables and probes #5097 — aitools: parallelize discover-schema across tables and probes (base: aitools: add 'tools statement' lifecycle commands #5095)

Use git diff <base>...HEAD or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against main includes ancestor PRs.

Why

Today databricks experimental aitools tools query runs one SQL at a time. ai-dev-kit's data-exploration phase fires 5-10 probes per dashboard (cardinality, top values, distributions, trend viability) and they all run in series because each is a separate CLI invocation that blocks. End-to-end exploration takes about a minute when it could take seconds.

Quentin already wired up a bash workaround that fans out via the raw /api/2.0/sql/statements endpoint with wait_timeout=0s and harvests results separately. This PR exposes that pattern natively so the skill can drop the hack and other CLI users get the same speed-up.

Changes

Before: query accepted at most one positional SQL or a single --file. Mixing the two errored. JSON output was an array of row objects.

Now: query accepts any number of positional SQLs and/or repeated --file paths. With one input, behavior is unchanged (back-compat). With two or more, the queries run in parallel against the warehouse and the result is a JSON array of one object per input in input order:

[
  {
    "sql": "SELECT count(*) FROM t",
    "statement_id": "01ef...",
    "state": "SUCCEEDED",
    "elapsed_ms": 412,
    "columns": ["count"],
    "rows": [["12345"]]
  },
  {
    "sql": "SELECT bad_syntax",
    "statement_id": "01ef...",
    "state": "FAILED",
    "elapsed_ms": 87,
    "error": {
      "message": "near 'bad_syntax': syntax error",
      "error_code": "SYNTAX_ERROR"
    }
  }
]

Implementation:

New experimental/aitools/cmd/batch.go with executeBatch (errgroup with bounded parallelism) and runOneBatchQuery. Each goroutine submits with OnWaitTimeout: CONTINUE, polls via the helper from aitools: extract pollStatement helper and pin OnWaitTimeout #5092, and encodes its outcome into a batchResult struct. Failures don't abort siblings.
New --concurrency flag (default 8). Same value used by cmd/fs/cp.go for similar fan-out. Validated > 0 in PreRunE (a 0 value would deadlock errgroup.SetLimit).
--file is now a repeatable string slice. Previous --file + positional conflict error is removed; both compose.
resolveSQL is replaced by resolveSQLs returning []string. Result order is --file inputs first (in flag order), then positional SQLs (in arg order).
Multi-query output is JSON-only. --output text and --output csv are rejected with an actionable error before any API call.
On Ctrl+C, in-flight statements are cancelled server-side via CancelExecution after g.Wait() returns. Statements that finished normally before the cancel are left alone.
Exit code is non-zero (root.ErrAlreadyPrinted) when any statement failed; the JSON already contains the error detail, no extra stderr noise.

Test plan

go test ./experimental/aitools/... passes.
make checks clean.
make fmt no drift.
make lint 0 issues.
New unit tests cover:
- all-succeed batch with input-order preservation
- server-reported failure on one of N (others still complete)
- submission-time transport error encoded into per-result error
- explicit OnWaitTimeout: CONTINUE on every ExecuteStatement
- staggered completion (1 slow + 2 fast) preserves input order in results
- context cancellation triggers CancelExecution for each in-flight statement
- cobra-level rejection of --output text and --output csv with multiple positionals
- cobra-level rejection of --concurrency 0 and --concurrency -1
- resolveSQLs covering mixed sources, multiple files, multiple positionals, indexed-error message

Manual smoke against a real warehouse:

databricks experimental aitools tools query \
  --warehouse <wh> --output json \
  "SELECT 1" "SELECT 2" "SELECT current_timestamp()"

arsenyinfo · 2026-04-28T07:20:01Z

Cancel-RPC contexts derived from an already-cancelled parent, causing server-side cancel to be a no-op

Priority: P2
Location: experimental/aitools/cmd/batch.go:201-202, experimental/aitools/cmd/query.go:347-355
Scenario: When the parent ctx passed to executeBatch (or executeAndPoll) is itself cancelled — as exercised by TestExecuteBatchContextCancellationCancelsInFlight, which cancels the parent context before invocation — context.WithTimeout(ctx, cancelTimeout) in cancelInFlight and cancelStatement inherits the already-done parent, so the derived context is immediately cancelled. api.CancelExecution short-circuits with ctx.Err() and no cancel RPC reaches the warehouse, leaving in-flight statements running server-side. The test only passes today because the mock accepts mock.Anything for the context argument and never inspects liveness.
Potential solution: Detach the cancel timeout from any inbound cancellation by using context.WithTimeout(context.WithoutCancel(ctx), cancelTimeout) (Go 1.21+) in both cancelInFlight (batch.go:201) and cancelStatement (query.go:347); fall back to context.Background() if the minimum Go version is below 1.21.

🔍 Reviewed by nitpicker

Refactor `executeAndPoll` in `experimental/aitools/cmd/query.go` to extract a pure `pollStatement(ctx, api, resp)` helper. The helper polls until the statement reaches a terminal state and returns the response without any signal handling, spinner, or server-side cancellation; those concerns stay in `executeAndPoll` where they belong. Also pin `OnWaitTimeout: CONTINUE` explicitly on the `ExecuteStatement` call. The SDK default happens to be CONTINUE today, but relying on it is a hidden coupling: a server-side default flip would silently break the poll loop by killing the statement before our first GET. Behavior is unchanged for the existing `query` command. Follow-up PRs (parallel batch queries, statement lifecycle command tree) will reuse the helper. Co-authored-by: Isaac

Allow `databricks experimental aitools tools query` to accept several SQLs in a single invocation and run them in parallel against the warehouse. Pass multiple positional arguments and/or repeat `--file` to fan out: databricks experimental aitools tools query \ --warehouse <wh> --output json \ "SELECT count(*) FROM t" \ "SELECT min(ts), max(ts) FROM t" \ "SELECT col, count(*) FROM t GROUP BY 1" Multi-query output is always a JSON array of one object per input, preserving input order. The shape is `{sql, statement_id, state, elapsed_ms, columns, rows, error}`. Individual statement failures don't abort siblings; each is encoded in the per-result `error` field, and the exit code is non-zero when any statement failed. A new `--concurrency` flag (default 8) caps in-flight statements. On Ctrl+C the still-running statements are cancelled server-side via CancelExecution before exit. Single-query behavior is unchanged. The previous restriction that forbade mixing `--file` and a positional SQL is lifted, since both now contribute to the batch. Co-authored-by: Isaac

Address two findings from a cursor PR review: 1. --concurrency was passed straight into errgroup.SetLimit. A value of 0 deadlocks (errgroup refuses to add goroutines), and a negative value silently removes the cap. Add a PreRunE check that rejects anything <= 0 with errInvalidBatchConcurrency, matching the shape used by cmd/fs/cp.go for the same flag. 2. The Long help previously said multi-query results come back "in input order", which was ambiguous when --file and positional SQLs are mixed. The actual behavior (already covered by TestResolveSQLsMixedFileAndPositional) is: --file inputs first in flag order, then positional SQLs in arg order. Tighten the help text to state that contract precisely. Adds two unit tests that verify --concurrency 0 and -1 are rejected before any API call. Co-authored-by: Isaac

… cases Two pairs of cobra-level tests were each testing one rejection code path with two flag values. Fold them into table-driven subtests so the shared assertion lives in one place: - TestQueryCommandBatchTextOutputRejected + ...CsvOutputRejected → TestQueryCommandBatchOutputRejection (text, csv subtests) - TestQueryCommandConcurrencyZeroRejected + ...NegativeRejected → TestQueryCommandConcurrencyRejection (0, -1 subtests) Same coverage, half the test functions. Co-authored-by: Isaac

Address Arseni's P2 finding on the batch PR. cancelInFlight (batch.go) and cancelStatement (query.go) used to derive the cancel-RPC ctx via context.WithTimeout(ctx, cancelTimeout). On the actual hot path (Ctrl+C or parent ctx cancelled), the inbound ctx is already cancelled by the time we reach the cancel sweep. The SDK then short-circuits on ctx.Err() and the cancel RPC never reaches the warehouse, leaving in-flight statements running server-side. Wrap with context.WithoutCancel(ctx) (Go 1.21+) so the timeout context keeps the caller's values but drops the cancellation signal. The cancel RPC now actually fires. Also tighten the existing tests: - TestExecuteBatchContextCancellationCancelsInFlight - TestExecuteAndPollCancelledContextCallsCancelExecution Both previously matched mock.Anything for the ctx argument, so they passed regardless of whether the bug was present. They now use mock.MatchedBy(c.Err() == nil) to assert the cancel-RPC ctx is alive. This is a regression guard; reverting the production fix makes the tests fail with "unexpected call" because the matcher no longer matches. Co-authored-by: Isaac

…ks#5092) ## Stack This PR is part of a 4-PR stack making `aitools` data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order. 1. **databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout** *(base: `main`)* — **this PR** 2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query invocation *(base: databricks#5092)* 3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base: databricks#5093)* 4. databricks#5097 — aitools: parallelize discover-schema across tables and probes *(base: databricks#5095)* Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against `main` includes ancestor PRs. --- ## Why The query command in `experimental/aitools/cmd/query.go` works today, but two things make it fragile and hard to reuse: 1. The polling loop, signal handling, spinner, and server-side cancellation are entangled in one ~100-line function. Upcoming features (parallel batch queries, a statement lifecycle command tree) need pure polling without the signal-handler side effects, so the helper has to come out cleanly. 2. The `ExecuteStatement` request sets `WaitTimeout: 0s` but does not set `OnWaitTimeout`. That relies on the SDK's default being `CONTINUE`. It is today, but a flip would silently break the command: the statement would be cancelled before our first GET and we'd never see the result. This PR is a pure refactor + one explicit-default fix. No user-visible behavior change. ## Changes - Extract `pollStatement(ctx, api, resp)` from `executeAndPoll`. The helper polls until the statement reaches a terminal state and returns the response. It does not call `CancelExecution` on context cancellation, that's the caller's job (and a deliberate design choice for the upcoming `statement get` command, where Ctrl+C should stop polling without killing the server-side statement). - Pin `OnWaitTimeout: CONTINUE` explicitly on the `ExecuteStatement` call. - Update `executeAndPoll` to delegate to `pollStatement` and keep the existing signal-handling, spinner, and server-side cancel-on-Ctrl+C semantics intact. - Add five unit tests covering the new helper: - Immediate terminal short-circuit (no Get calls) - Failed terminal returned without error (caller decides) - Eventual success across multiple polls - Context cancellation returns ctx error and does NOT call CancelExecution - GetStatement transport error is wrapped and propagated - Update the existing `TestExecuteAndPollImmediateSuccess` matcher to assert `OnWaitTimeout == CONTINUE` so a future SDK default flip cannot regress us. ## Test plan - [x] `go test ./experimental/aitools/...` passes (10 polling-related cases including the 5 new ones). - [x] `make checks` clean (tidy, whitespace, dead code). - [x] `make fmt` no drift. - [x] `make lint` 0 issues. - [x] Existing `executeAndPoll` tests (immediate success, immediate failure, polling, fail-during-poll, ctx-cancellation-calls-cancel-execution) all still pass without modification beyond the matcher tweak.

## Stack This PR is part of a 4-PR stack making `aitools` data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order. 1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout *(base: `main`)* 2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query invocation *(base: databricks#5092)* 3. **databricks#5095 — aitools: add 'tools statement' lifecycle commands** *(base: databricks#5093)* — **this PR** 4. databricks#5097 — aitools: parallelize discover-schema across tables and probes *(base: databricks#5095)* Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against `main` includes ancestor PRs. --- ## Why Quentin's ai-dev-kit skill works against synchronous `tools query`. That covers most cases, but there are workflows where the agent wants a server-side handle it can poll separately: long-running maintenance queries, parallel exploration where the agent does other work in between, and any "submit-now-harvest-later" pattern. `tools query` with a single SQL is for "I want results now." This PR adds a low-level command tree, `tools statement`, for "I want a handle." Cleaner separation than overloading `query` with `--async`/`--cancel` flags (which would be semantically forced — a `query` shouldn't manage someone else's statement_id). ## Changes Four new subcommands under `databricks experimental aitools tools statement`: ```bash # Fire and exit with a handle. databricks experimental aitools tools statement submit \ --warehouse <wh> "SELECT pg_sleep(60)" # Output: # { "statement_id": "01ef...", "state": "PENDING", "warehouse_id": "..." } # Block until terminal and emit rows. databricks experimental aitools tools statement get <statement_id> # Peek at current state without polling. databricks experimental aitools tools statement status <statement_id> # Request cancellation. databricks experimental aitools tools statement cancel <statement_id> ``` Implementation notes: - All four subcommands emit a uniform `statementInfo` JSON shape: `{statement_id, state, warehouse_id, columns, rows, error}` with `omitempty` on every field except `statement_id`. So `submit` doesn't include `columns/rows`, `cancel` doesn't include `warehouse_id`, etc. Consumer parsing is uniform. - `submit` uses `WaitTimeout: "0s"` and `OnWaitTimeout: CONTINUE` (matching the helper from databricks#5092). - `get` uses `pollStatement` (from databricks#5092) and inherits its "ctx cancellation does NOT cancel server-side" semantics. This is the **important UX difference from `tools query`**: hitting Ctrl+C on `get` stops polling but leaves the statement running on the warehouse. Use `cancel` for explicit termination. That asymmetry is intentional, since `get` is poll-only by design — the user already submitted async. - `status` does a single `GetStatementByStatementId` with no polling. - `cancel` calls `CancelExecution` and optimistically reports `state=CANCELED`. The Statements API returns no body on cancel; the actual server-side state transitions asynchronously. The `Long` help points users at `status` if they need certainty. - A shared helper `statementErrorFromStatus` populates the `error` field for every non-success terminal state (FAILED, CANCELED, CLOSED), even when the server returns no `Status.Error` payload. So skill consumers can branch on `error == null` alone instead of inspecting `state`. - Each subcommand has a small testable helper (`submitStatement`, `getStatementResult`, `getStatementStatus`, `cancelStatementExecution`) extracted from the cobra `RunE`. Tests target the helpers directly with a mock `StatementExecutionInterface`. - Parent `statement.go` registers the four subcommands and is wired into `tools.go` next to `query`, `discover-schema`, and `get-default-warehouse`. - `submit` validates input (rejects mixed --file + positional) BEFORE accessing `WorkspaceClient`, so the error surfaces cleanly without an auth or warehouse roundtrip. ## Test plan - [x] `go test ./experimental/aitools/...` passes. - [x] `make checks` clean. - [x] `make fmt` no drift. - [x] `make lint` 0 issues. - [x] New tests cover: - `submit` returns the statement_id and pins `OnWaitTimeout: CONTINUE` - `submit` wraps transport errors with `execute statement: ...` - `get` polls until terminal and assembles rows - `get` reports server-side errors in the JSON without raising a Go error - `get` ctx cancellation propagates **without** calling `CancelExecution` (the deliberate UX difference from `query`) - `get` synthesizes `error` for terminal CLOSED / FAILED with no backend payload - `status` does a single GET, no polling - `status` reports server-side errors in the JSON; running/pending stay error-free - `status` synthesizes `error` for FAILED with no backend payload - `cancel` calls `CancelExecution` and reports `state=CANCELED` - `cancel` wraps API errors - `statementErrorFromStatus` table-driven across nil, succeeded, running, failed-with-error, failed/canceled/closed-without-error - `renderStatementInfo` JSON shape (full and minimal) - cobra-level: `submit` rejects mixed --file + positional, `submit` enforces MaximumNArgs(1), `get` and `cancel` require a positional statement_id - [x] Manual smoke against a real warehouse: ```bash SID=$(databricks experimental aitools tools statement submit \ --warehouse <wh> "SELECT pg_sleep(5)" | jq -r '.statement_id') databricks experimental aitools tools statement status "$SID" databricks experimental aitools tools statement get "$SID" ```

…icks#5097) ## Stack This PR is part of a 4-PR stack making `aitools` data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order. 1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout *(base: `main`)* 2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query invocation *(base: databricks#5092)* 3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base: databricks#5093)* 4. **databricks#5097 — aitools: parallelize discover-schema across tables and probes** *(base: databricks#5095)* — **this PR** Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against `main` includes ancestor PRs. --- ## Why `discover-schema` walked tables sequentially and ran each table's three probes (DESCRIBE, sample SELECT, null counts) one after the other. For ai-dev-kit's data-exploration phase that meant warehouse-bound work was idle most of the time. Same root cause as the multi-query exploration latency that databricks#5093 (batch query) fixed; same fix. This is a pure latency win. No new user-facing API surface, no output-shape change. ## Changes **Two layers of parallelism plus a shared statement budget:** 1. **Across tables.** The for-loop in `RunE` becomes an `errgroup.Group`. A failure on one table never aborts the others; it's rendered inline as `"Error discovering ..."` exactly as before. 2. **Within a table.** `discoverTable` still runs DESCRIBE first because the column list feeds the null-counts query. After DESCRIBE returns, the sample SELECT and null-counts probes run concurrently. Output text is assembled once both probes finish, preserving the existing `COLUMNS / SAMPLE DATA / NULL COUNTS` order. 3. **Single warehouse-statement budget.** A new `sqlGate` (chan struct{} of capacity N + statement_id tracking) wraps every `executeSQL` call. `--concurrency` (default 8) caps total in-flight statements globally, regardless of how many tables you pass. So `--concurrency 1` actually serializes statement load, not just table fan-out. **Switch `executeSQL` to use `pollStatement`** (the helper extracted in databricks#5092) instead of the SDK's `ExecuteAndWait`. Pins `OnWaitTimeout: CONTINUE`. Failed states flow through `checkFailedState`, yielding more specific error messages (e.g. `"query failed: SYNTAX_ERROR near 'oops'"`) than the previous hand-rolled branch. The user-visible `"SAMPLE DATA: Error - %v" / "NULL COUNTS: Error - %v"` wrapping is unchanged. Future polling-helper improvements land here for free. **Cancellation discipline mirroring batch.go (databricks#5093):** signal handler cancels a derived `pollCtx`; `sqlGate` records each `statement_id` post-submission; on cancellation the recorded IDs are swept via `CancelExecution` before returning `root.ErrAlreadyPrinted`. Without this, parallelism would orphan up to N×2 statements server-side on Ctrl+C. **`--concurrency` validation** mirrors `cmd/fs/cp.go` and databricks#5093: `PreRunE` rejects values <= 0 with `errInvalidBatchConcurrency`. Table-name validation also runs in `PreRunE` so malformed identifiers are rejected before `MustWorkspaceClient` runs (no unnecessary auth roundtrip on bad input). **Output unchanged** for any input that previously succeeded. Same dividers, same header/probe ordering, same per-probe error wrapping. ## Test plan - [x] `go test ./experimental/aitools/...` passes. - [x] `make checks` clean. - [x] `make fmt` no drift. - [x] `make lint` 0 issues. - [x] New unit tests in `discover_schema_test.go`: - `quoteTableName` table-driven (valid, missing parts, too many parts, injection attempts, empty parts, leading-digit identifiers, backtick in name) - `parseDescribeResult` skips metadata rows (`#`-prefixed and empty) - `sqlGate.run` pins `OnWaitTimeout: CONTINUE`, propagates FAILED state, wraps transport errors, records IDs, respects cancelled context - `cancelDiscoverInFlight` calls API per ID; empty list is a no-op - `discoverTable`: sample and null-count probes run concurrently after DESCRIBE (deterministic atomic-counter + sync.OnceFunc + channel-close barrier; sequential execution surfaces a timeout error) - `discoverTable`: a sample-probe failure does not abort null counts - `--concurrency 0` and `-1` rejected at PreRunE - Invalid table name (not `CATALOG.SCHEMA.TABLE`) and injection attempts rejected at PreRunE before any API call - [x] Manual smoke against a real warehouse: ```bash databricks experimental aitools tools discover-schema \ samples.nyctaxi.trips samples.tpch.orders samples.tpch.customer ```

simonfaltum temporarily deployed to test-trigger-is April 27, 2026 13:38 — with GitHub Actions Inactive

simonfaltum temporarily deployed to test-trigger-is April 27, 2026 13:50 — with GitHub Actions Inactive

This was referenced Apr 27, 2026

aitools: add 'tools statement' lifecycle commands #5095

Merged

aitools: parallelize discover-schema across tables and probes #5097

Merged

aitools: extract pollStatement helper and pin OnWaitTimeout #5092

Merged

simonfaltum temporarily deployed to test-trigger-is April 27, 2026 19:37 — with GitHub Actions Inactive

arsenyinfo approved these changes Apr 28, 2026

View reviewed changes

simonfaltum temporarily deployed to test-trigger-is April 28, 2026 08:13 — with GitHub Actions Inactive

simonfaltum force-pushed the simonfaltum/aitools-pr1-refactor branch from 8e00440 to 79fc080 Compare April 28, 2026 09:08

simonfaltum added 4 commits April 28, 2026 11:09

simonfaltum force-pushed the simonfaltum/aitools-pr2-batch branch from 3e565e9 to a1c5ca6 Compare April 28, 2026 09:10

simonfaltum temporarily deployed to test-trigger-is April 28, 2026 09:11 — with GitHub Actions Inactive

Base automatically changed from simonfaltum/aitools-pr1-refactor to main April 28, 2026 11:11

simonfaltum added this pull request to the merge queue Apr 28, 2026

Merged via the queue into main with commit b91f5ee Apr 28, 2026
18 checks passed

simonfaltum deleted the simonfaltum/aitools-pr2-batch branch April 28, 2026 11:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

aitools: run multiple SQL queries in parallel from one invocation#5093

aitools: run multiple SQL queries in parallel from one invocation#5093
simonfaltum merged 5 commits intomainfrom
simonfaltum/aitools-pr2-batch

simonfaltum commented Apr 27, 2026 •

edited

Loading

Uh oh!

arsenyinfo commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

simonfaltum commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Stack

Why

Changes

Test plan

Uh oh!

arsenyinfo commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

simonfaltum commented Apr 27, 2026 •

edited

Loading