
aitools: extract pollStatement helper and pin OnWaitTimeout#5092

Merged
simonfaltum merged 1 commit into main from simonfaltum/aitools-pr1-refactor
Apr 28, 2026

Conversation

@simonfaltum
Member

@simonfaltum simonfaltum commented Apr 27, 2026

Stack

This PR is part of a 4-PR stack making aitools data exploration faster for ai-dev-kit. Each PR is independently reviewable; merge in order.

  1. #5092 — aitools: extract pollStatement helper and pin OnWaitTimeout (base: main) — this PR
  2. #5093 — aitools: run multiple SQL queries in parallel from one query invocation (base: #5092)
  3. #5095 — aitools: add 'tools statement' lifecycle commands (base: #5093)
  4. #5097 — aitools: parallelize discover-schema across tables and probes (base: #5095)

Use git diff <base>...HEAD or set the comparison base in the GitHub UI to see only this PR's changes; the default "Files changed" diff against main includes ancestor PRs.


Why

The query command in experimental/aitools/cmd/query.go works today, but two things make it fragile and hard to reuse:

  1. The polling loop, signal handling, spinner, and server-side cancellation are entangled in one ~100-line function. Upcoming features (parallel batch queries, a statement lifecycle command tree) need pure polling without the signal-handler side effects, so the helper has to come out cleanly.
  2. The ExecuteStatement request sets WaitTimeout: 0s but does not set OnWaitTimeout. That relies on the SDK's default being CONTINUE. It is today, but a flip would silently break the command: the statement would be cancelled before our first GET and we'd never see the result.

This PR is a pure refactor + one explicit-default fix. No user-visible behavior change.

Changes

  • Extract pollStatement(ctx, api, resp) from executeAndPoll. The helper polls until the statement reaches a terminal state and returns the response. It does not call CancelExecution on context cancellation; that's the caller's job (and a deliberate design choice for the upcoming statement get command, where Ctrl+C should stop polling without killing the server-side statement).
  • Pin OnWaitTimeout: CONTINUE explicitly on the ExecuteStatement call.
  • Update executeAndPoll to delegate to pollStatement and keep the existing signal-handling, spinner, and server-side cancel-on-Ctrl+C semantics intact.
  • Add five unit tests covering the new helper:
    • Immediate terminal short-circuit (no Get calls)
    • Failed terminal returned without error (caller decides)
    • Eventual success across multiple polls
    • Context cancellation returns ctx error and does NOT call CancelExecution
    • GetStatement transport error is wrapped and propagated
  • Update the existing TestExecuteAndPollImmediateSuccess matcher to assert OnWaitTimeout == CONTINUE so a future SDK default flip cannot regress us.

Test plan

  • go test ./experimental/aitools/... passes (10 polling-related cases including the 5 new ones).
  • make checks clean (tidy, whitespace, dead code).
  • make fmt no drift.
  • make lint 0 issues.
  • Existing executeAndPoll tests (immediate success, immediate failure, polling, fail-during-poll, ctx-cancellation-calls-cancel-execution) all still pass without modification beyond the matcher tweak.

simonfaltum added a commit that referenced this pull request Apr 27, 2026
Adds a low-level command tree for asynchronous SQL statement
management, complementing the synchronous 'tools query':

  databricks experimental aitools tools statement submit  "SELECT ..."
  databricks experimental aitools tools statement get     <statement_id>
  databricks experimental aitools tools statement status  <statement_id>
  databricks experimental aitools tools statement cancel  <statement_id>

submit fires an ExecuteStatement with WaitTimeout=0s and
OnWaitTimeout=CONTINUE, returning the statement_id immediately. get
polls (via pollStatement from #5092) until terminal and emits rows on
success or an error object on failure. status performs a single GET
without polling. cancel sends CancelExecution.

All four subcommands emit a uniform JSON shape {statement_id, state,
warehouse_id, columns, rows, error} with omitempty so the payload only
includes the fields that subcommand produces.

Important UX nuance: 'statement get' Ctrl+C stops polling but does NOT
cancel the server-side statement. Users who want server-side
termination call 'statement cancel' explicitly. (This differs from
'tools query', which cancels server-side on Ctrl+C because the user
invoked the synchronous path.) The pollStatement helper from #5092 is
already designed to propagate ctx errors without touching the server,
so 'get' inherits this behavior for free.

Co-authored-by: Isaac
simonfaltum added a commit that referenced this pull request Apr 27, 2026
discover-schema previously walked tables sequentially and ran each
table's three probes (DESCRIBE, sample SELECT, null counts) one after
the other. For ai-dev-kit's data-exploration phase that meant
warehouse-bound work was idle most of the time. Same root cause as the
multi-query exploration latency that PR 2 fixed; same fix.

Two layers of parallelism:

1. Tables fan out via errgroup with --concurrency (default 8). A
   failure on one table never aborts the others; it gets rendered
   inline as "Error discovering ...".
2. Within a table, DESCRIBE still runs first because the column list
   feeds the null-counts query. After DESCRIBE returns, the sample
   SELECT and null-counts probes run concurrently. The output text is
   assembled once both finish, preserving the existing column order
   (COLUMNS, SAMPLE DATA, NULL COUNTS).
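The within-table ordering can be sketched like this. The probe functions are
hypothetical stand-ins for the real warehouse queries, and the real code uses
errgroup; this stdlib-only sketch uses sync.WaitGroup.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical probe functions standing in for the real warehouse queries.
func describeTable(name string) []string { return []string{"id", "amount"} }
func sampleData(name string) string      { return "SAMPLE DATA: ..." }
func nullCounts(name string, cols []string) string {
	return fmt.Sprintf("NULL COUNTS: %d columns probed", len(cols))
}

// discoverTable mirrors the ordering described above: DESCRIBE runs first
// (its column list feeds the null-counts query), then the sample and
// null-count probes run concurrently, and the output is assembled only
// after both finish.
func discoverTable(name string) string {
	cols := describeTable(name) // must complete before null counts

	var sample, nulls string
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); sample = sampleData(name) }()
	go func() { defer wg.Done(); nulls = nullCounts(name, cols) }()
	wg.Wait()

	// Fixed output order regardless of which probe finished first.
	return fmt.Sprintf("COLUMNS: %v\n%s\n%s", cols, sample, nulls)
}

func main() {
	fmt.Println(discoverTable("samples.nyctaxi.trips"))
}
```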

Switch executeSQL from the SDK's ExecuteAndWait helper to
ExecuteStatement + pollStatement (the helper extracted in #5092). This
brings discover-schema in line with query.go and statement.go: explicit
OnWaitTimeout=CONTINUE on every call, and any future polling-helper
improvement (e.g. signal handling) lands here for free. Failed states
now flow through checkFailedState, which yields more specific error
messages (e.g. "query failed: SYNTAX_ERROR ...") than the previous
hand-rolled branch. The user-visible "SAMPLE DATA: Error - %v" / "NULL
COUNTS: Error - %v" wrapping is unchanged.

Add --concurrency validation matching the cmd/fs/cp.go and
experimental/aitools/cmd/query.go pattern: PreRunE rejects values <= 0
with errInvalidBatchConcurrency.

Tests added in discover_schema_test.go:
- quoteTableName (table-driven across valid identifiers, missing
  parts, injection attempts, empty parts, leading-digit identifiers)
- parseDescribeResult skipping metadata rows
- executeSQL pins OnWaitTimeout=CONTINUE
- executeSQL propagates server-reported FAILED state
- executeSQL wraps transport errors
- discoverTable: sample and null-count probes run concurrently after
  DESCRIBE (atomic peak-counter assertion)
- discoverTable: a sample failure does not abort null counts
- --concurrency 0 and -1 rejected at PreRunE time
- invalid table name (not CATALOG.SCHEMA.TABLE) rejected at RunE
  validation before any API call

Co-authored-by: Isaac
@arsenyinfo
Contributor

No findings.


🔍 Reviewed by nitpicker

Refactor `executeAndPoll` in `experimental/aitools/cmd/query.go` to extract
a pure `pollStatement(ctx, api, resp)` helper. The helper polls until the
statement reaches a terminal state and returns the response without any
signal handling, spinner, or server-side cancellation; those concerns stay
in `executeAndPoll` where they belong.

Also pin `OnWaitTimeout: CONTINUE` explicitly on the `ExecuteStatement`
call. The SDK default happens to be CONTINUE today, but relying on it is
a hidden coupling: a server-side default flip would silently break the
poll loop by killing the statement before our first GET.

Behavior is unchanged for the existing `query` command. Follow-up PRs
(parallel batch queries, statement lifecycle command tree) will reuse the
helper.

Co-authored-by: Isaac
@simonfaltum simonfaltum force-pushed the simonfaltum/aitools-pr1-refactor branch from 8e00440 to 79fc080 Compare April 28, 2026 09:08
@simonfaltum simonfaltum added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 7ff1a81 Apr 28, 2026
19 checks passed
@simonfaltum simonfaltum deleted the simonfaltum/aitools-pr1-refactor branch April 28, 2026 11:11
mkazia pushed a commit to mkazia/cli that referenced this pull request Apr 30, 2026
…tabricks#5093)

## Stack

This PR is part of a 4-PR stack making `aitools` data exploration faster
for ai-dev-kit. Each PR is independently reviewable; merge in order.

1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout
*(base: `main`)*
2. **databricks#5093 — aitools: run multiple SQL queries in parallel from one
query invocation** *(base: databricks#5092)* — **this PR**
3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base:
databricks#5093)*
4. databricks#5097 — aitools: parallelize discover-schema across tables and probes
*(base: databricks#5095)*

Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI
to see only this PR's changes; the default "Files changed" diff against
`main` includes ancestor PRs.

---

## Why

Today `databricks experimental aitools tools query` runs one SQL at a
time. ai-dev-kit's data-exploration phase fires 5-10 probes per
dashboard (cardinality, top values, distributions, trend viability) and
they all run in series because each is a separate CLI invocation that
blocks. End-to-end exploration takes about a minute when it could take
seconds.

Quentin already wired up a bash workaround that fans out via the raw
`/api/2.0/sql/statements` endpoint with `wait_timeout=0s` and harvests
results separately. This PR exposes that pattern natively so the skill
can drop the hack and other CLI users get the same speed-up.

## Changes

**Before:** `query` accepted at most one positional SQL or a single
`--file`. Mixing the two errored. JSON output was an array of row
objects.

**Now:** `query` accepts any number of positional SQLs and/or repeated
`--file` paths. With one input, behavior is unchanged (back-compat).
With two or more, the queries run in parallel against the warehouse and
the result is a JSON array of one object per input in input order:

```json
[
  {
    "sql": "SELECT count(*) FROM t",
    "statement_id": "01ef...",
    "state": "SUCCEEDED",
    "elapsed_ms": 412,
    "columns": ["count"],
    "rows": [["12345"]]
  },
  {
    "sql": "SELECT bad_syntax",
    "statement_id": "01ef...",
    "state": "FAILED",
    "elapsed_ms": 87,
    "error": {
      "message": "near 'bad_syntax': syntax error",
      "error_code": "SYNTAX_ERROR"
    }
  }
]
```

Implementation:

- New `experimental/aitools/cmd/batch.go` with `executeBatch` (errgroup
with bounded parallelism) and `runOneBatchQuery`. Each goroutine submits
with `OnWaitTimeout: CONTINUE`, polls via the helper from databricks#5092, and
encodes its outcome into a `batchResult` struct. Failures don't abort
siblings.
- New `--concurrency` flag (default 8). Same value used by
`cmd/fs/cp.go` for similar fan-out. Validated `> 0` in `PreRunE` (a 0
value would deadlock `errgroup.SetLimit`).
- `--file` is now a repeatable string slice. Previous `--file` +
positional conflict error is removed; both compose.
- `resolveSQL` is replaced by `resolveSQLs` returning `[]string`. Result
order is `--file` inputs first (in flag order), then positional SQLs (in
arg order).
- Multi-query output is JSON-only. `--output text` and `--output csv`
are rejected with an actionable error before any API call.
- On Ctrl+C, in-flight statements are cancelled server-side via
`CancelExecution` after `g.Wait()` returns. Statements that finished
normally before the cancel are left alone.
- Exit code is non-zero (`root.ErrAlreadyPrinted`) when any statement
failed; the JSON already contains the error detail, no extra stderr
noise.
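A minimal sketch of the fan-out-with-input-order pattern described above,
using a buffered-channel semaphore in place of errgroup.SetLimit to stay
stdlib-only; `batchResult` and `run` here are stand-ins, not the real types.

```go
package main

import (
	"fmt"
	"sync"
)

// batchResult is an illustrative stand-in for the per-query result struct.
type batchResult struct {
	SQL   string
	State string
}

// executeBatch fans N queries out with bounded parallelism and writes each
// result into its input-index slot, so output order always matches input
// order no matter which query finishes first.
func executeBatch(sqls []string, concurrency int, run func(string) batchResult) []batchResult {
	results := make([]batchResult, len(sqls))
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, q := range sqls {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			results[i] = run(q)      // a failure here never aborts siblings
		}(i, q)
	}
	wg.Wait()
	return results
}

func main() {
	sqls := []string{"SELECT 1", "SELECT bad_syntax", "SELECT 2"}
	out := executeBatch(sqls, 2, func(q string) batchResult {
		state := "SUCCEEDED"
		if q == "SELECT bad_syntax" {
			state = "FAILED"
		}
		return batchResult{SQL: q, State: state}
	})
	for _, r := range out {
		fmt.Println(r.SQL, "->", r.State)
	}
}
```

Note how the failed middle query still occupies its slot in the output slice; that is the "failures don't abort siblings" behavior with input-order preservation.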

## Test plan

- [x] `go test ./experimental/aitools/...` passes.
- [x] `make checks` clean.
- [x] `make fmt` no drift.
- [x] `make lint` 0 issues.
- [x] New unit tests cover:
  - all-succeed batch with input-order preservation
  - server-reported failure on one of N (others still complete)
  - submission-time transport error encoded into per-result error
  - explicit `OnWaitTimeout: CONTINUE` on every `ExecuteStatement`
  - staggered completion (1 slow + 2 fast) preserves input order in results
  - context cancellation triggers `CancelExecution` for each in-flight statement
  - cobra-level rejection of `--output text` and `--output csv` with multiple positionals
  - cobra-level rejection of `--concurrency 0` and `--concurrency -1`
  - `resolveSQLs` covering mixed sources, multiple files, multiple positionals, indexed-error message
- [x] Manual smoke against a real warehouse:

  ```bash
  databricks experimental aitools tools query \
    --warehouse <wh> --output json \
    "SELECT 1" "SELECT 2" "SELECT current_timestamp()"
  ```
mkazia pushed a commit to mkazia/cli that referenced this pull request Apr 30, 2026
## Stack

This PR is part of a 4-PR stack making `aitools` data exploration faster
for ai-dev-kit. Each PR is independently reviewable; merge in order.

1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout
*(base: `main`)*
2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query
invocation *(base: databricks#5092)*
3. **databricks#5095 — aitools: add 'tools statement' lifecycle commands** *(base:
databricks#5093)* — **this PR**
4. databricks#5097 — aitools: parallelize discover-schema across tables and probes
*(base: databricks#5095)*

Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI
to see only this PR's changes; the default "Files changed" diff against
`main` includes ancestor PRs.

---

## Why

Quentin's ai-dev-kit skill works against synchronous `tools query`. That
covers most cases, but there are workflows where the agent wants a
server-side handle it can poll separately: long-running maintenance
queries, parallel exploration where the agent does other work in
between, and any "submit-now-harvest-later" pattern.

`tools query` with a single SQL is for "I want results now." This PR
adds a low-level command tree, `tools statement`, for "I want a handle."
Cleaner separation than overloading `query` with `--async`/`--cancel`
flags (which would be semantically forced — a `query` shouldn't manage
someone else's statement_id).

## Changes

Four new subcommands under `databricks experimental aitools tools
statement`:

```bash
# Fire and exit with a handle.
databricks experimental aitools tools statement submit \
  --warehouse <wh> "SELECT pg_sleep(60)"

# Output:
# { "statement_id": "01ef...", "state": "PENDING", "warehouse_id": "..." }

# Block until terminal and emit rows.
databricks experimental aitools tools statement get <statement_id>

# Peek at current state without polling.
databricks experimental aitools tools statement status <statement_id>

# Request cancellation.
databricks experimental aitools tools statement cancel <statement_id>
```

Implementation notes:

- All four subcommands emit a uniform `statementInfo` JSON shape:
`{statement_id, state, warehouse_id, columns, rows, error}` with
`omitempty` on every field except `statement_id`. So `submit` doesn't
include `columns/rows`, `cancel` doesn't include `warehouse_id`, etc.
Consumer parsing is uniform.
- `submit` uses `WaitTimeout: "0s"` and `OnWaitTimeout: CONTINUE`
(matching the helper from databricks#5092).
- `get` uses `pollStatement` (from databricks#5092) and inherits its "ctx
cancellation does NOT cancel server-side" semantics. This is the
**important UX difference from `tools query`**: hitting Ctrl+C on `get`
stops polling but leaves the statement running on the warehouse. Use
`cancel` for explicit termination. That asymmetry is intentional, since
`get` is poll-only by design — the user already submitted async.
- `status` does a single `GetStatementByStatementId` with no polling.
- `cancel` calls `CancelExecution` and optimistically reports
`state=CANCELED`. The Statements API returns no body on cancel; the
actual server-side state transitions asynchronously. The `Long` help
points users at `status` if they need certainty.
- A shared helper `statementErrorFromStatus` populates the `error` field
for every non-success terminal state (FAILED, CANCELED, CLOSED), even
when the server returns no `Status.Error` payload. So skill consumers
can branch on `error == null` alone instead of inspecting `state`.
- Each subcommand has a small testable helper (`submitStatement`,
`getStatementResult`, `getStatementStatus`, `cancelStatementExecution`)
extracted from the cobra `RunE`. Tests target the helpers directly with
a mock `StatementExecutionInterface`.
- Parent `statement.go` registers the four subcommands and is wired into
`tools.go` next to `query`, `discover-schema`, and
`get-default-warehouse`.
- `submit` validates input (rejects mixed --file + positional) BEFORE
accessing `WorkspaceClient`, so the error surfaces cleanly without an
auth or warehouse roundtrip.

## Test plan

- [x] `go test ./experimental/aitools/...` passes.
- [x] `make checks` clean.
- [x] `make fmt` no drift.
- [x] `make lint` 0 issues.
- [x] New tests cover:
  - `submit` returns the statement_id and pins `OnWaitTimeout: CONTINUE`
  - `submit` wraps transport errors with `execute statement: ...`
  - `get` polls until terminal and assembles rows
  - `get` reports server-side errors in the JSON without raising a Go error
  - `get` ctx cancellation propagates **without** calling `CancelExecution` (the deliberate UX difference from `query`)
  - `get` synthesizes `error` for terminal CLOSED / FAILED with no backend payload
  - `status` does a single GET, no polling
  - `status` reports server-side errors in the JSON; running/pending stay error-free
  - `status` synthesizes `error` for FAILED with no backend payload
  - `cancel` calls `CancelExecution` and reports `state=CANCELED`
  - `cancel` wraps API errors
  - `statementErrorFromStatus` table-driven across nil, succeeded, running, failed-with-error, failed/canceled/closed-without-error
  - `renderStatementInfo` JSON shape (full and minimal)
  - cobra-level: `submit` rejects mixed --file + positional, `submit` enforces MaximumNArgs(1), `get` and `cancel` require a positional statement_id
- [x] Manual smoke against a real warehouse:

  ```bash
  SID=$(databricks experimental aitools tools statement submit \
    --warehouse <wh> "SELECT pg_sleep(5)" | jq -r '.statement_id')
  databricks experimental aitools tools statement status "$SID"
  databricks experimental aitools tools statement get "$SID"
  ```
mkazia pushed a commit to mkazia/cli that referenced this pull request Apr 30, 2026
…icks#5097)

## Stack

This PR is part of a 4-PR stack making `aitools` data exploration faster
for ai-dev-kit. Each PR is independently reviewable; merge in order.

1. databricks#5092 — aitools: extract pollStatement helper and pin OnWaitTimeout
*(base: `main`)*
2. databricks#5093 — aitools: run multiple SQL queries in parallel from one query
invocation *(base: databricks#5092)*
3. databricks#5095 — aitools: add 'tools statement' lifecycle commands *(base:
databricks#5093)*
4. **databricks#5097 — aitools: parallelize discover-schema across tables and
probes** *(base: databricks#5095)* — **this PR**

Use `git diff <base>...HEAD` or set the comparison base in the GitHub UI
to see only this PR's changes; the default "Files changed" diff against
`main` includes ancestor PRs.

---

## Why

`discover-schema` walked tables sequentially and ran each table's three
probes (DESCRIBE, sample SELECT, null counts) one after the other. For
ai-dev-kit's data-exploration phase that meant warehouse-bound work was
idle most of the time. Same root cause as the multi-query exploration
latency that databricks#5093 (batch query) fixed; same fix.

This is a pure latency win. No new user-facing API surface, no
output-shape change.

## Changes

**Two layers of parallelism plus a shared statement budget:**

1. **Across tables.** The for-loop in `RunE` becomes an
`errgroup.Group`. A failure on one table never aborts the others; it's
rendered inline as `"Error discovering ..."` exactly as before.
2. **Within a table.** `discoverTable` still runs DESCRIBE first because
the column list feeds the null-counts query. After DESCRIBE returns, the
sample SELECT and null-counts probes run concurrently. Output text is
assembled once both probes finish, preserving the existing `COLUMNS /
SAMPLE DATA / NULL COUNTS` order.
3. **Single warehouse-statement budget.** A new `sqlGate` (chan struct{}
of capacity N + statement_id tracking) wraps every `executeSQL` call.
`--concurrency` (default 8) caps total in-flight statements globally,
regardless of how many tables you pass. So `--concurrency 1` actually
serializes statement load, not just table fan-out.

**Switch `executeSQL` to use `pollStatement`** (the helper extracted in
databricks#5092) instead of the SDK's `ExecuteAndWait`. Pins `OnWaitTimeout:
CONTINUE`. Failed states flow through `checkFailedState`, yielding more
specific error messages (e.g. `"query failed: SYNTAX_ERROR near
'oops'"`) than the previous hand-rolled branch. The user-visible
`"SAMPLE DATA: Error - %v" / "NULL COUNTS: Error - %v"` wrapping is
unchanged. Future polling-helper improvements land here for free.

**Cancellation discipline mirroring batch.go (databricks#5093):** signal handler
cancels a derived `pollCtx`; `sqlGate` records each `statement_id`
post-submission; on cancellation the recorded IDs are swept via
`CancelExecution` before returning `root.ErrAlreadyPrinted`. Without
this, parallelism would orphan up to N×2 statements server-side on
Ctrl+C.

**`--concurrency` validation** mirrors `cmd/fs/cp.go` and databricks#5093:
`PreRunE` rejects values <= 0 with `errInvalidBatchConcurrency`.
Table-name validation also runs in `PreRunE` so malformed identifiers
are rejected before `MustWorkspaceClient` runs (no unnecessary auth
roundtrip on bad input).
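A minimal sketch of what that table-name check might look like. The real
`quoteTableName` may differ; this assumes exactly three dot-separated parts,
each non-empty and backtick-free, which are then backtick-quoted so
identifiers cannot smuggle SQL.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteTableName validates CATALOG.SCHEMA.TABLE and backtick-quotes each
// part. Rejecting embedded backticks blocks quote-escape injection;
// leading-digit identifiers are fine once quoted.
func quoteTableName(name string) (string, error) {
	parts := strings.Split(name, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("expected CATALOG.SCHEMA.TABLE, got %q", name)
	}
	quoted := make([]string, len(parts))
	for i, p := range parts {
		if p == "" || strings.Contains(p, "`") {
			return "", fmt.Errorf("invalid identifier part %q", p)
		}
		quoted[i] = "`" + p + "`"
	}
	return strings.Join(quoted, "."), nil
}

func main() {
	q, err := quoteTableName("samples.nyctaxi.trips")
	fmt.Println(q, err) // `samples`.`nyctaxi`.`trips` <nil>
	_, err = quoteTableName("SELECT 1; DROP TABLE x")
	fmt.Println(err != nil) // true
}
```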

**Output unchanged** for any input that previously succeeded. Same
dividers, same header/probe ordering, same per-probe error wrapping.

## Test plan

- [x] `go test ./experimental/aitools/...` passes.
- [x] `make checks` clean.
- [x] `make fmt` no drift.
- [x] `make lint` 0 issues.
- [x] New unit tests in `discover_schema_test.go`:
  - `quoteTableName` table-driven (valid, missing parts, too many parts, injection attempts, empty parts, leading-digit identifiers, backtick in name)
  - `parseDescribeResult` skips metadata rows (`#`-prefixed and empty)
  - `sqlGate.run` pins `OnWaitTimeout: CONTINUE`, propagates FAILED state, wraps transport errors, records IDs, respects cancelled context
  - `cancelDiscoverInFlight` calls API per ID; empty list is a no-op
  - `discoverTable`: sample and null-count probes run concurrently after DESCRIBE (deterministic atomic-counter + sync.OnceFunc + channel-close barrier; sequential execution surfaces a timeout error)
  - `discoverTable`: a sample-probe failure does not abort null counts
  - `--concurrency 0` and `-1` rejected at PreRunE
  - Invalid table name (not `CATALOG.SCHEMA.TABLE`) and injection attempts rejected at PreRunE before any API call
- [x] Manual smoke against a real warehouse:

  ```bash
  databricks experimental aitools tools discover-schema \
    samples.nyctaxi.trips samples.tpch.orders samples.tpch.customer
  ```