aitools: add 'tools statement' lifecycle commands#5095

Merged
simonfaltum merged 10 commits into main from simonfaltum/aitools-pr3-statement
Apr 28, 2026

Conversation


@simonfaltum commented Apr 27, 2026

Stack

This PR is part of a 4-PR stack making aitools data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order.

  1. #5092 — aitools: extract pollStatement helper and pin OnWaitTimeout (base: main)
  2. #5093 — aitools: run multiple SQL queries in parallel from one query invocation (base: #5092)
  3. #5095 — aitools: add 'tools statement' lifecycle commands (base: #5093) — this PR
  4. #5097 — aitools: parallelize discover-schema across tables and probes (base: #5095)

Use git diff <base>...HEAD or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against main includes ancestor PRs.


Why

Quentin's ai-dev-kit skill works against the synchronous `tools query`. That covers most cases, but there are workflows where the agent wants a server-side handle it can poll separately: long-running maintenance queries, parallel exploration where the agent does other work in between, and any "submit-now-harvest-later" pattern.

`tools query` with a single SQL is for "I want results now." This PR adds a low-level command tree, `tools statement`, for "I want a handle." This is a cleaner separation than overloading `query` with `--async`/`--cancel` flags, which would be semantically forced — a query shouldn't manage someone else's statement_id.

Changes

Four new subcommands under `databricks experimental aitools tools statement`:

# Fire and exit with a handle.
databricks experimental aitools tools statement submit \
  --warehouse <wh> "SELECT pg_sleep(60)"

# Output:
# { "statement_id": "01ef...", "state": "PENDING", "warehouse_id": "..." }

# Block until terminal and emit rows.
databricks experimental aitools tools statement get <statement_id>

# Peek at current state without polling.
databricks experimental aitools tools statement status <statement_id>

# Request cancellation.
databricks experimental aitools tools statement cancel <statement_id>

Implementation notes:

  • All four subcommands emit a uniform statementInfo JSON shape: {statement_id, state, warehouse_id, columns, rows, error} with omitempty on every field except statement_id. So submit doesn't include columns/rows, cancel doesn't include warehouse_id, etc. Consumer parsing is uniform.
  • submit uses WaitTimeout: "0s" and OnWaitTimeout: CONTINUE (matching the helper from aitools: extract pollStatement helper and pin OnWaitTimeout #5092).
  • get uses pollStatement (from aitools: extract pollStatement helper and pin OnWaitTimeout #5092) and inherits its "ctx cancellation does NOT cancel server-side" semantics. This is the important UX difference from tools query: hitting Ctrl+C on get stops polling but leaves the statement running on the warehouse. Use cancel for explicit termination. That asymmetry is intentional, since get is poll-only by design — the user already submitted async.
  • status does a single GetStatementByStatementId with no polling.
  • cancel calls CancelExecution and optimistically reports state=CANCELED. The Statements API returns no body on cancel; the actual server-side state transitions asynchronously. The Long help points users at status if they need certainty.
  • A shared helper statementErrorFromStatus populates the error field for every non-success terminal state (FAILED, CANCELED, CLOSED), even when the server returns no Status.Error payload. So skill consumers can branch on error == null alone instead of inspecting state.
  • Each subcommand has a small testable helper (submitStatement, getStatementResult, getStatementStatus, cancelStatementExecution) extracted from the cobra RunE. Tests target the helpers directly with a mock StatementExecutionInterface.
  • Parent statement.go registers the four subcommands and is wired into tools.go next to query, discover-schema, and get-default-warehouse.
  • submit validates input (rejects mixed --file + positional) BEFORE accessing WorkspaceClient, so the error surfaces cleanly without an auth or warehouse roundtrip.

Test plan

  • go test ./experimental/aitools/... passes.

  • make checks clean.

  • make fmt no drift.

  • make lint 0 issues.

  • New tests cover:

    • submit returns the statement_id and pins OnWaitTimeout: CONTINUE
    • submit wraps transport errors with execute statement: ...
    • get polls until terminal and assembles rows
    • get reports server-side errors in the JSON without raising a Go error
    • get ctx cancellation propagates without calling CancelExecution (the deliberate UX difference from query)
    • get synthesizes error for terminal CLOSED / FAILED with no backend payload
    • status does a single GET, no polling
    • status reports server-side errors in the JSON; running/pending stay error-free
    • status synthesizes error for FAILED with no backend payload
    • cancel calls CancelExecution and reports state=CANCELED
    • cancel wraps API errors
    • statementErrorFromStatus table-driven across nil, succeeded, running, failed-with-error, failed/canceled/closed-without-error
    • renderStatementInfo JSON shape (full and minimal)
    • cobra-level: submit rejects mixed --file + positional, submit enforces MaximumNArgs(1), get and cancel require a positional statement_id
  • Manual smoke against a real warehouse:

    SID=$(databricks experimental aitools tools statement submit \
      --warehouse <wh> "SELECT pg_sleep(5)" | jq -r '.statement_id')
    databricks experimental aitools tools statement status "$SID"
    databricks experimental aitools tools statement get "$SID"

@arsenyinfo

getStatementResult discards partial JSON when fetchAllRows fails after a SUCCEEDED state

  • Priority: P2
  • Location: experimental/aitools/cmd/statement_get.go:78-85 and statement_get.go:32-35
  • Scenario: Statement reaches SUCCEEDED and getStatementResult populates info.StatementID, info.State, and info.Columns, but a subsequent GetStatementResultChunkN call inside fetchAllRows fails (network blip, throttling, transient 5xx). Line 82 returns (info, err); RunE at lines 33-34 discards info and surfaces only the raw Go error. The user sees fetch result chunk N: ... with no JSON, no statement_id, and no machine-readable error field — inconsistent with the terminal-failure path at lines 41-44 which renders JSON and returns root.ErrAlreadyPrinted.
  • Potential solution: Render the partially-populated statementInfo with a synthesized Error describing the chunk-fetch failure, then return root.ErrAlreadyPrinted, mirroring the terminal-state failure path just above in the same function.

submit input validation runs after WorkspaceClient setup, contradicting the PR description

  • Priority: P3
  • Location: experimental/aitools/cmd/statement_submit.go:29-44
  • Scenario: User runs statement submit --file foo.sql "select 2" (or statement submit with no source) without configured auth. PreRunE: root.MustWorkspaceClient at line 30 runs before RunE, so auth/workspace setup fails first; resolveSQLs (line 38) and the len(sqls) != 1 check (lines 42-43) never execute. The PR description's claim that submit validates input before accessing WorkspaceClient is not delivered.
  • Potential solution: Move resolveSQLs and the length check into a custom PreRunE that runs before root.MustWorkspaceClient, mirroring the pattern already used in query.go:113-118, or correct the PR description to remove the validation-order claim.

🔍 Reviewed by nitpicker

@simonfaltum force-pushed the simonfaltum/aitools-pr3-statement branch from ff58192 to a34f39e on April 28, 2026 08:14
Refactor `executeAndPoll` in `experimental/aitools/cmd/query.go` to extract
a pure `pollStatement(ctx, api, resp)` helper. The helper polls until the
statement reaches a terminal state and returns the response without any
signal handling, spinner, or server-side cancellation; those concerns stay
in `executeAndPoll` where they belong.

Also pin `OnWaitTimeout: CONTINUE` explicitly on the `ExecuteStatement`
call. The SDK default happens to be CONTINUE today, but relying on it is
a hidden coupling: a server-side default flip would silently break the
poll loop by killing the statement before our first GET.

Behavior is unchanged for the existing `query` command. Follow-up PRs
(parallel batch queries, statement lifecycle command tree) will reuse the
helper.

Co-authored-by: Isaac
Allow `databricks experimental aitools tools query` to accept several SQLs
in a single invocation and run them in parallel against the warehouse.
Pass multiple positional arguments and/or repeat `--file` to fan out:

  databricks experimental aitools tools query \
    --warehouse <wh> --output json \
    "SELECT count(*) FROM t" \
    "SELECT min(ts), max(ts) FROM t" \
    "SELECT col, count(*) FROM t GROUP BY 1"

Multi-query output is always a JSON array of one object per input,
preserving input order. The shape is `{sql, statement_id, state,
elapsed_ms, columns, rows, error}`. Individual statement failures don't
abort siblings; each is encoded in the per-result `error` field, and the
exit code is non-zero when any statement failed.

A new `--concurrency` flag (default 8) caps in-flight statements. On
Ctrl+C the still-running statements are cancelled server-side via
CancelExecution before exit.

Single-query behavior is unchanged. The previous restriction that
forbade mixing `--file` and a positional SQL is lifted, since both now
contribute to the batch.

Co-authored-by: Isaac
Address two findings from a cursor PR review:

1. --concurrency was passed straight into errgroup.SetLimit. A value of
   0 deadlocks (errgroup refuses to add goroutines), and a negative
   value silently removes the cap. Add a PreRunE check that rejects
   anything <= 0 with errInvalidBatchConcurrency, matching the shape
   used by cmd/fs/cp.go for the same flag.

2. The Long help previously said multi-query results come back "in
   input order", which was ambiguous when --file and positional SQLs
   are mixed. The actual behavior (already covered by
   TestResolveSQLsMixedFileAndPositional) is: --file inputs first in
   flag order, then positional SQLs in arg order. Tighten the help
   text to state that contract precisely.

Adds two unit tests that verify --concurrency 0 and -1 are rejected
before any API call.

Co-authored-by: Isaac
… cases

Two pairs of cobra-level tests were each testing one rejection code
path with two flag values. Fold them into table-driven subtests so the
shared assertion lives in one place:

- TestQueryCommandBatchTextOutputRejected + ...CsvOutputRejected →
  TestQueryCommandBatchOutputRejection (text, csv subtests)
- TestQueryCommandConcurrencyZeroRejected + ...NegativeRejected →
  TestQueryCommandConcurrencyRejection (0, -1 subtests)

Same coverage, half the test functions.

Co-authored-by: Isaac
Address Arseni's P2 finding on the batch PR. cancelInFlight (batch.go)
and cancelStatement (query.go) used to derive the cancel-RPC ctx via
context.WithTimeout(ctx, cancelTimeout). On the actual hot path (Ctrl+C
or parent ctx cancelled), the inbound ctx is already cancelled by the
time we reach the cancel sweep. The SDK then short-circuits on
ctx.Err() and the cancel RPC never reaches the warehouse, leaving
in-flight statements running server-side.

Wrap with context.WithoutCancel(ctx) (Go 1.21+) so the timeout context
keeps the caller's values but drops the cancellation signal. The cancel
RPC now actually fires.

Also tighten the existing tests:
- TestExecuteBatchContextCancellationCancelsInFlight
- TestExecuteAndPollCancelledContextCallsCancelExecution

Both previously matched mock.Anything for the ctx argument, so they
passed regardless of whether the bug was present. They now use
mock.MatchedBy(c.Err() == nil) to assert the cancel-RPC ctx is alive.
This is a regression guard; reverting the production fix makes the
tests fail with "unexpected call" because the matcher no longer matches.

Co-authored-by: Isaac
@simonfaltum force-pushed the simonfaltum/aitools-pr2-batch branch from 3e565e9 to a1c5ca6 on April 28, 2026 09:10
Adds a low-level command tree for asynchronous SQL statement
management, complementing the synchronous 'tools query':

  databricks experimental aitools tools statement submit  "SELECT ..."
  databricks experimental aitools tools statement get     <statement_id>
  databricks experimental aitools tools statement status  <statement_id>
  databricks experimental aitools tools statement cancel  <statement_id>

submit fires an ExecuteStatement with WaitTimeout=0s and
OnWaitTimeout=CONTINUE, returning the statement_id immediately. get
polls (via pollStatement from #5092) until terminal and emits rows on
success or an error object on failure. status performs a single GET
without polling. cancel sends CancelExecution.

All four subcommands emit a uniform JSON shape {statement_id, state,
warehouse_id, columns, rows, error} with omitempty, so the payload only
includes the fields each subcommand populates.

Important UX nuance: 'statement get' Ctrl+C stops polling but does NOT
cancel the server-side statement. Users who want server-side
termination call 'statement cancel' explicitly. (This differs from
'tools query', which cancels server-side on Ctrl+C because the user
invoked the synchronous path.) The pollStatement helper from #5092 is
already designed to propagate ctx errors without touching the server,
so 'get' inherits this behavior for free.

Co-authored-by: Isaac
Address a cursor PR review finding: 'statement get' and 'statement
status' previously only set info.Error when pollResp.Status.Error was
non-nil. The Statements API can return a non-success terminal state
(FAILED, CANCELED, CLOSED) with no Error payload, so the JSON contract
"emits rows on success or an error object on failure" wasn't actually
guaranteed. Skill consumers couldn't branch on `error == null` alone:
they had to also inspect `state`. Especially bad for 'get', which
exits non-zero on non-success terminal states without giving the
caller structured failure detail.

Add a shared helper, statementErrorFromStatus, that returns a
batchResultError for any terminal non-success state, populated from
the SDK's ServiceError when present and synthesizing
"statement reached terminal state X" when the backend doesn't supply
one. Mirrors the pattern already used by runOneBatchQuery in batch.go,
so the contract is uniform across batch and single-statement paths.

Both 'get' and 'status' now use the helper. PENDING and RUNNING still
emit no error (legitimately mid-flight).

New tests:
- table-driven coverage of statementErrorFromStatus across nil,
  succeeded, running, pending, failed-with-error, failed-no-error,
  canceled-no-error, closed-no-error
- getStatementResult with CLOSED state and no Error
- getStatementResult with FAILED state and no Error
- getStatementStatus with FAILED state and no Error
- getStatementStatus with RUNNING state confirms no error is set

Co-authored-by: Isaac
…tests

Self-review pass on the test suite found ~8 functions worth trimming
without losing coverage:

Drop (cobra built-ins, not our contract):
- TestStatementSubmitArgsBound: tests cobra's MaximumNArgs(1)
- TestStatementGetRequiresStatementID: tests cobra's ExactArgs(1)
- TestStatementCancelRequiresStatementID: tests cobra's ExactArgs(1)

Drop (already covered by TestStatementErrorFromStatus, the table-driven
helper test added with the cursor-fix commit):
- TestGetStatementResultClosedTerminalSynthesizesError
- TestGetStatementResultFailedWithoutBackendErrorSynthesizesError
- TestGetStatementStatusFailedWithoutBackendErrorSynthesizesError
- TestGetStatementStatusRunningHasNoError

Fold:
- TestRenderStatementInfo + TestRenderStatementInfoOmitsEmptyFields →
  one table-driven TestRenderStatementInfo with the full and minimal
  cases as subtests.

Kept the validation we actually wrote (TestStatementSubmitRejectsMultipleSQLs)
and the wiring tests that pin distinct contracts
(TestGetStatementResultPolls, TestGetStatementResultFailedStateReportsError,
TestGetStatementResultDoesNotCancelServerSideOnContextCancel,
TestGetStatementStatusSinglePoll, TestGetStatementStatusReportsError,
the cancel pair, and submit pair).

Co-authored-by: Isaac
…put before auth

Address two findings from Arseni's review.

P2 (statement_get.go):
getStatementResult used to return (info, err) when fetchAllRows failed
after a SUCCEEDED state. RunE then discarded the populated info and
surfaced only the raw Go error, so the user got an unstructured
"fetch result chunk N: ..." string with no statement_id and no
machine-readable error field. That contradicts the contract in the
failed-terminal path two cases above, which renders JSON and returns
root.ErrAlreadyPrinted.

Now: on chunk-fetch failure, populate info.Error with the chunk-fetch
message and return (info, nil). RunE renders the partial info as JSON
and signals exit-non-zero based on info.Error != nil. The caller still
gets statement_id and columns; the error field carries the failure
detail. New test
TestGetStatementResultChunkFetchFailureRendersPartialInfo locks this
in.

P3 (statement_submit.go):
The PR description claims submit validates input before accessing
WorkspaceClient. The code didn't actually deliver that: PreRunE was
root.MustWorkspaceClient (auth/profile setup), then RunE did the
resolveSQLs / "exactly one" checks. So a malformed invocation hit auth
errors before ever surfacing the input error.

Move resolveSQLs and the length check into a custom PreRunE that runs
before root.MustWorkspaceClient, mirroring the pattern in
query.go:113-118. The result is stashed in a closure variable
(sqlStatement) for RunE to consume. Existing test
TestStatementSubmitRejectsMultipleSQLs is renamed to
...BeforeWorkspaceClient and no longer needs to stub out PreRunE: the
new ordering means a bad invocation gets the validation error without
ever attempting workspace-client setup.

Co-authored-by: Isaac
@simonfaltum force-pushed the simonfaltum/aitools-pr3-statement branch from dba7285 to 9b52b65 on April 28, 2026 09:12
Base automatically changed from simonfaltum/aitools-pr2-batch to main April 28, 2026 11:30
@simonfaltum enabled auto-merge April 28, 2026 12:20
@simonfaltum added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 2bd698f Apr 28, 2026
19 checks passed
@simonfaltum deleted the simonfaltum/aitools-pr3-statement branch April 28, 2026 12:28
mkazia pushed a commit to mkazia/cli that referenced this pull request Apr 30, 2026
…ks#5092)

## Stack

This PR is part of a 4-PR stack making `aitools` data exploration faster
for ai-dev-kit. Each PR is independently reviewable; merge in order.

1. **databricks#5092 — aitools: extract pollStatement helper and pin
OnWaitTimeout** *(base: `main`)* — **this PR**
2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query
invocation *(base: databricks#5092)*
3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base:
databricks#5093)*
4. databricks#5097 — aitools: parallelize discover-schema across tables and probes
*(base: databricks#5095)*

Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI
to see only this PR's changes; the default "Files changed" diff against
`main` includes ancestor PRs.

---

## Why

The query command in `experimental/aitools/cmd/query.go` works today,
but two things make it fragile and hard to reuse:

1. The polling loop, signal handling, spinner, and server-side
cancellation are entangled in one ~100-line function. Upcoming features
(parallel batch queries, a statement lifecycle command tree) need pure
polling without the signal-handler side effects, so the helper has to
come out cleanly.
2. The `ExecuteStatement` request sets `WaitTimeout: 0s` but does not
set `OnWaitTimeout`. That relies on the SDK's default being `CONTINUE`.
It is today, but a flip would silently break the command: the statement
would be cancelled before our first GET and we'd never see the result.

This PR is a pure refactor + one explicit-default fix. No user-visible
behavior change.

## Changes

- Extract `pollStatement(ctx, api, resp)` from `executeAndPoll`. The
helper polls until the statement reaches a terminal state and returns
the response. It does not call `CancelExecution` on context
cancellation, that's the caller's job (and a deliberate design choice
for the upcoming `statement get` command, where Ctrl+C should stop
polling without killing the server-side statement).
- Pin `OnWaitTimeout: CONTINUE` explicitly on the `ExecuteStatement`
call.
- Update `executeAndPoll` to delegate to `pollStatement` and keep the
existing signal-handling, spinner, and server-side cancel-on-Ctrl+C
semantics intact.
- Add five unit tests covering the new helper:
  - Immediate terminal short-circuit (no Get calls)
  - Failed terminal returned without error (caller decides)
  - Eventual success across multiple polls
- Context cancellation returns ctx error and does NOT call
CancelExecution
  - GetStatement transport error is wrapped and propagated
- Update the existing `TestExecuteAndPollImmediateSuccess` matcher to
assert `OnWaitTimeout == CONTINUE` so a future SDK default flip cannot
regress us.

## Test plan

- [x] `go test ./experimental/aitools/...` passes (10 polling-related
cases including the 5 new ones).
- [x] `make checks` clean (tidy, whitespace, dead code).
- [x] `make fmt` no drift.
- [x] `make lint` 0 issues.
- [x] Existing `executeAndPoll` tests (immediate success, immediate
failure, polling, fail-during-poll,
ctx-cancellation-calls-cancel-execution) all still pass without
modification beyond the matcher tweak.
mkazia pushed a commit to mkazia/cli that referenced this pull request Apr 30, 2026
…tabricks#5093)

## Stack

This PR is part of a 4-PR stack making `aitools` data exploration faster
for ai-dev-kit. Each PR is independently reviewable; merge in order.

1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout
*(base: `main`)*
2. **databricks#5093 — aitools: run multiple SQL queries in parallel from one
query invocation** *(base: databricks#5092)* — **this PR**
3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base:
databricks#5093)*
4. databricks#5097 — aitools: parallelize discover-schema across tables and probes
*(base: databricks#5095)*

Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI
to see only this PR's changes; the default "Files changed" diff against
`main` includes ancestor PRs.

---

## Why

Today `databricks experimental aitools tools query` runs one SQL at a
time. ai-dev-kit's data-exploration phase fires 5-10 probes per
dashboard (cardinality, top values, distributions, trend viability) and
they all run in series because each is a separate CLI invocation that
blocks. End-to-end exploration takes about a minute when it could take
seconds.

Quentin already wired up a bash workaround that fans out via the raw
`/api/2.0/sql/statements` endpoint with `wait_timeout=0s` and harvests
results separately. This PR exposes that pattern natively so the skill
can drop the hack and other CLI users get the same speed-up.

## Changes

**Before:** `query` accepted at most one positional SQL or a single
`--file`. Mixing the two errored. JSON output was an array of row
objects.

**Now:** `query` accepts any number of positional SQLs and/or repeated
`--file` paths. With one input, behavior is unchanged (back-compat).
With two or more, the queries run in parallel against the warehouse and
the result is a JSON array of one object per input in input order:

```json
[
  {
    "sql": "SELECT count(*) FROM t",
    "statement_id": "01ef...",
    "state": "SUCCEEDED",
    "elapsed_ms": 412,
    "columns": ["count"],
    "rows": [["12345"]]
  },
  {
    "sql": "SELECT bad_syntax",
    "statement_id": "01ef...",
    "state": "FAILED",
    "elapsed_ms": 87,
    "error": {
      "message": "near 'bad_syntax': syntax error",
      "error_code": "SYNTAX_ERROR"
    }
  }
]
```

Implementation:

- New `experimental/aitools/cmd/batch.go` with `executeBatch` (errgroup
with bounded parallelism) and `runOneBatchQuery`. Each goroutine submits
with `OnWaitTimeout: CONTINUE`, polls via the helper from databricks#5092, and
encodes its outcome into a `batchResult` struct. Failures don't abort
siblings.
- New `--concurrency` flag (default 8). Same value used by
`cmd/fs/cp.go` for similar fan-out. Validated `> 0` in `PreRunE` (a 0
value would deadlock `errgroup.SetLimit`).
- `--file` is now a repeatable string slice. Previous `--file` +
positional conflict error is removed; both compose.
- `resolveSQL` is replaced by `resolveSQLs` returning `[]string`. Result
order is `--file` inputs first (in flag order), then positional SQLs (in
arg order).
- Multi-query output is JSON-only. `--output text` and `--output csv`
are rejected with an actionable error before any API call.
- On Ctrl+C, in-flight statements are cancelled server-side via
`CancelExecution` after `g.Wait()` returns. Statements that finished
normally before the cancel are left alone.
- Exit code is non-zero (`root.ErrAlreadyPrinted`) when any statement
failed; the JSON already contains the error detail, no extra stderr
noise.

## Test plan

- [x] `go test ./experimental/aitools/...` passes.
- [x] `make checks` clean.
- [x] `make fmt` no drift.
- [x] `make lint` 0 issues.
- [x] New unit tests cover:
  - all-succeed batch with input-order preservation
  - server-reported failure on one of N (others still complete)
  - submission-time transport error encoded into per-result error
  - explicit `OnWaitTimeout: CONTINUE` on every `ExecuteStatement`
- staggered completion (1 slow + 2 fast) preserves input order in
results
- context cancellation triggers `CancelExecution` for each in-flight
statement
- cobra-level rejection of `--output text` and `--output csv` with
multiple positionals
  - cobra-level rejection of `--concurrency 0` and `--concurrency -1`
- `resolveSQLs` covering mixed sources, multiple files, multiple
positionals, indexed-error message
- [x] Manual smoke against a real warehouse:

  ```bash
  databricks experimental aitools tools query \
    --warehouse <wh> --output json \
    "SELECT 1" "SELECT 2" "SELECT current_timestamp()"
  ```
mkazia pushed a commit to mkazia/cli that referenced this pull request Apr 30, 2026
…icks#5097)

## Stack

This PR is part of a 4-PR stack making `aitools` data exploration faster
for ai-dev-kit. Each PR is independently reviewable; merge in order.

1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout
*(base: `main`)*
2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query
invocation *(base: databricks#5092)*
3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base:
databricks#5093)*
4. **databricks#5097 — aitools: parallelize discover-schema across tables and
probes** *(base: databricks#5095)* — **this PR**

Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI
to see only this PR's changes; the default "Files changed" diff against
`main` includes ancestor PRs.

---

## Why

`discover-schema` walked tables sequentially and ran each table's three
probes (DESCRIBE, sample SELECT, null counts) one after the other. For
ai-dev-kit's data-exploration phase that meant warehouse-bound work was
idle most of the time. Same root cause as the multi-query exploration
latency that databricks#5093 (batch query) fixed; same fix.

This is a pure latency win. No new user-facing API surface, no
output-shape change.

## Changes

**Two layers of parallelism plus a shared statement budget:**

1. **Across tables.** The for-loop in `RunE` becomes an
`errgroup.Group`. A failure on one table never aborts the others; it's
rendered inline as `"Error discovering ..."` exactly as before.
2. **Within a table.** `discoverTable` still runs DESCRIBE first because
the column list feeds the null-counts query. After DESCRIBE returns, the
sample SELECT and null-counts probes run concurrently. Output text is
assembled once both probes finish, preserving the existing `COLUMNS /
SAMPLE DATA / NULL COUNTS` order.
3. **Single warehouse-statement budget.** A new `sqlGate` (chan struct{}
of capacity N + statement_id tracking) wraps every `executeSQL` call.
`--concurrency` (default 8) caps total in-flight statements globally,
regardless of how many tables you pass. So `--concurrency 1` actually
serializes statement load, not just table fan-out.

**Switch `executeSQL` to use `pollStatement`** (the helper extracted in
databricks#5092) instead of the SDK's `ExecuteAndWait`. Pins `OnWaitTimeout:
CONTINUE`. Failed states flow through `checkFailedState`, yielding more
specific error messages (e.g. `"query failed: SYNTAX_ERROR near
'oops'"`) than the previous hand-rolled branch. The user-visible
`"SAMPLE DATA: Error - %v" / "NULL COUNTS: Error - %v"` wrapping is
unchanged. Future polling-helper improvements land here for free.

**Cancellation discipline mirroring batch.go (databricks#5093):** signal handler
cancels a derived `pollCtx`; `sqlGate` records each `statement_id`
post-submission; on cancellation the recorded IDs are swept via
`CancelExecution` before returning `root.ErrAlreadyPrinted`. Without
this, parallelism would orphan up to N×2 statements server-side on
Ctrl+C.

**`--concurrency` validation** mirrors `cmd/fs/cp.go` and databricks#5093:
`PreRunE` rejects values <= 0 with `errInvalidBatchConcurrency`.
Table-name validation also runs in `PreRunE` so malformed identifiers
are rejected before `MustWorkspaceClient` runs (no unnecessary auth
roundtrip on bad input).

**Output unchanged** for any input that previously succeeded. Same
dividers, same header/probe ordering, same per-probe error wrapping.

## Test plan

- [x] `go test ./experimental/aitools/...` passes.
- [x] `make checks` clean.
- [x] `make fmt` no drift.
- [x] `make lint` 0 issues.
- [x] New unit tests in `discover_schema_test.go`:
- `quoteTableName` table-driven (valid, missing parts, too many parts,
injection attempts, empty parts, leading-digit identifiers, backtick in
name)
  - `parseDescribeResult` skips metadata rows (`#`-prefixed and empty)
- `sqlGate.run` pins `OnWaitTimeout: CONTINUE`, propagates FAILED state,
wraps transport errors, records IDs, respects cancelled context
  - `cancelDiscoverInFlight` calls API per ID; empty list is a no-op
- `discoverTable`: sample and null-count probes run concurrently after
DESCRIBE (deterministic atomic-counter + sync.OnceFunc + channel-close
barrier; sequential execution surfaces a timeout error)
  - `discoverTable`: a sample-probe failure does not abort null counts
  - `--concurrency 0` and `-1` rejected at PreRunE
- Invalid table name (not `CATALOG.SCHEMA.TABLE`) and injection attempts
rejected at PreRunE before any API call
- [x] Manual smoke against a real warehouse:

  ```bash
  databricks experimental aitools tools discover-schema \
    samples.nyctaxi.trips samples.tpch.orders samples.tpch.customer
  ```