Skip to content

refactor(wren): drop ibis for canner/postgres/mysql/mssql/trino/clickhouse/athena (combined)#2313

Merged
douenergy merged 38 commits into
mainfrom
refactor/native-drivers-combined
May 21, 2026
Merged

refactor(wren): drop ibis for canner/postgres/mysql/mssql/trino/clickhouse/athena (combined)#2313
douenergy merged 38 commits into
mainfrom
refactor/native-drivers-combined

Conversation

@goldmedal
Copy link
Copy Markdown
Collaborator

@goldmedal goldmedal commented May 21, 2026

Summary

Consolidates seven open PRs that each migrate one core/wren connector off ibis-framework to its native driver. They were merged into a single feature branch so conflicts on shared files (data_source.py, pyproject.toml, uv.lock, factory.py, tests/conftest.py, justfile) only had to be resolved once.

Bundled PRs:

All seven were merged with --no-ff to preserve their individual commit history.

Conflict resolutions

  • core/wren/src/wren/model/data_source.py — unified the return-type alias to BackendOrConnection = Union[BaseBackend, "PyAthenaConnection", "MySQLdb.Connection", "psycopg.Connection"]; consolidated TYPE_CHECKING imports; kept the URL-dispatch fast-paths added by mysql/doris and trino PRs; dropped the now-orphaned import ssl, SSLMode, and _create_ssl_context helper (mysql PR moved SSL handling into wren.connector.mysql); dropped the orphaned import boto3 (athena PR moved STS into wren.connector.athena).
  • core/wren/src/wren/connector/factory.py — every connector now resolves to its own module (athena/clickhouse/trino out of wren.connector.ibis); _NEEDS_DATA_SOURCE trimmed to {mysql, doris, trino}.
  • core/wren/src/wren/connector/ibis.py — deleted; no remaining datasource resolves to it after the migrations.
  • core/wren/pyproject.toml — optional-deps union: psycopg[binary], mysqlclient, pyodbc, clickhouse-connect, trino, pyathena, snowflake-connector-python (no ibis-framework[<x>] extras for the migrated connectors).
  • core/wren/uv.lock — regenerated once at the end.
  • core/wren/tests/conftest.py and core/wren/justfile — additive concatenation of new pytest markers and just test-<connector> recipes.

Test plan

  • cd core/wren && uv lock — clean resolve
  • uv run ruff check src tests — same 32 pre-existing errors as origin/main, no regressions
  • uv run pytest tests/unit/test_athena_connector.py tests/unit/test_mssql_connection.py tests/unit/test_mysql_helpers.py — 54 passed, 1 skipped
  • Per-connector integration suites (Docker required): just test-canner, test-postgres, test-mysql, test-mssql, test-clickhouse, test-trino, test-athena

Notes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Native connectors added for Athena, ClickHouse, Trino, and Canner; improved native MySQL, Postgres, and MSSQL connectors.
  • Improvements

    • Better type conversion, timeout/error mapping, and more robust connection handling across backends.
    • Connector factory updated to prefer dedicated native connectors.
  • Tests

    • Expanded integration and unit tests for all connectors and new pytest targets for added backends.

Review Change Stack

goldmedal and others added 30 commits May 14, 2026 11:18
Canner Enterprise speaks the Postgres wire protocol; the connector now
uses psycopg directly instead of the ibis postgres backend.

Changes
- `connector/canner.py`: native psycopg cursor with a self-contained
  PG OID -> Arrow type map covering the canner-flavoured types
  (VARCHAR/CHAR -> string, DECIMAL -> decimal128, BIGINT/INT/SMALLINT
  -> int, BOOLEAN -> bool, DATE/TIMESTAMP/TIMESTAMPTZ -> date/timestamp,
  ROW/ARRAY/MAP serialised as JSON strings). Errors are wrapped as
  WrenError with the dialect SQL attached, mirroring the existing
  postgres connector contract.
- `model/data_source.py::get_canner_connection`: returns a
  `psycopg.Connection` (autocommit) instead of an ibis backend.

Tests
- `tests/connectors/test_canner.py` exercises the type-mapping helpers
  and runs the connector against a PostgresContainer with the common
  canner result types (incl. JSON/JSONB and arrays). Marker `canner`
  is registered in `tests/conftest.py` and a `just test-canner` target
  is added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The trino connector now uses the `trino` python client directly,
parsing Trino type strings via sqlglot to build PyArrow schemas.
The trino extra installs `trino>=0.333,<1` instead of
`ibis-framework[trino]`.

Highlights
- `connector/trino.py`: native cursor execution; type-string -> Arrow
  via sqlglot, including row(...) / array(...) / map(...) /
  decimal(p,s) / timestamp with time zone.
- `model/data_source.py::get_trino_connection`: returns
  `trino.dbapi.Connection`; native code path no longer routes through
  ibis.
- `pyproject.toml`: trino extra -> `trino>=0.333,<1`.

Tests
- `tests/connectors/test_trino.py` covers ~36 Trino type categories
  including nested row/array/map plus testcontainer-backed query
  suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The athena connector now uses pyathena directly with a Trino-style type
lexer to materialise cursor results into PyArrow tables, removing the
ibis-framework[athena] dependency from the athena extra.

Highlights
- connector/athena.py: native pyathena cursor; type strings parsed via
  sqlglot (varchar, decimal(p,s), array<T>, row(...), map<K,V>, etc.).
- model/data_source.py::get_athena_connection: preserves the
  Web-Identity-Token (OIDC -> AssumeRoleWithWebIdentity) and access-key
  auth flows; returns a pyathena.connection.Connection.
- pyproject.toml: athena extra -> pyathena[pandas]>=3.

Tests
- tests/unit/test_athena_connector.py mocks pyathena cursor + boto3 STS
  to verify the type lexer, cursor->Arrow materialisation, error
  mapping, and all three credential resolution paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The postgres connector now uses psycopg directly instead of going through
ibis. This removes ibis-framework[postgres] from the postgres extra and
gives us direct control over Arrow conversion for pg-specific types
(numeric scale, arrays, intervals, jsonb).

Changes
- connector/postgres.py: native psycopg3 cursor execution + OID-driven
  Arrow type mapping (bool, int{2,4,8}, float{4,8}, decimal128(p,s), text,
  bytea, uuid, jsonb, timestamp[tz], date, time, array-of-T).
- model/data_source.py::get_postgres_connection: returns psycopg.Connection.
- pyproject.toml: postgres extra now installs psycopg[binary]>=3 instead
  of ibis-framework[postgres].
- connector/base.py and model/data_source.py: ibis imports made lazy so
  importing wren.connector.postgres no longer pulls in ibis.

Tests
- New direct-connector tests in tests/connectors/test_postgres.py exercise
  10+ pg types (int4, int8, numeric(38,9), text, bool, bytea, uuid, jsonb,
  timestamp, timestamptz, int4[], text[], numeric[]) against a
  testcontainer postgres, asserting Arrow schema and value round-trip.
- All existing tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mysql / doris connector now uses mysqlclient (MySQLdb) directly
instead of going through ibis. The mysql extra installs `mysqlclient`
only.

Changes
- `connector/mysql.py`: native MySQLdb cursor + field-type-code →
  Arrow type mapping covering integer (signed/unsigned variants),
  decimal, float, char/varchar/text, json, blob, datetime/timestamp,
  date, time, bit. SSL kwargs derive from MySqlConnectionInfo.
- `model/data_source.py`: `get_mysql_connection` / `get_doris_connection`
  return `MySQLdb.Connection`; SSL context construction is moved into
  the connector module.
- `pyproject.toml`: mysql extra → `mysqlclient>=2.2`.

Tests
- `tests/connectors/test_mysql_connector.py` exercises 15+ MySQL field
  types via a testcontainer mysql.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mssql connector now uses pyodbc directly with custom Arrow type
inference and a datetimeoffset output converter. The mssql extra
installs `pyodbc` instead of `ibis-framework[mssql]`.

Highlights
- `connector/mssql.py`: raw pyodbc cursor; sqlglot-based LIMIT/OFFSET
  rewrite (`OFFSET 0 ROWS FETCH NEXT n ROWS ONLY`); Arrow schema built
  from `cursor.description` + value sampling; dry_run falls back to
  `sys.dm_exec_describe_first_result_set` for precise error messages.
- `model/data_source.py`: `_connect_mssql_pyodbc` builds an ODBC
  connection string with proper `{...}` escaping; `mssql://` URL
  parsing added; output converter for DATETIMEOFFSET (type code -155)
  decodes the 20-byte payload into a tz-aware datetime.
- `pyproject.toml`: mssql extra -> `pyodbc>=5,<6`.

Tests
- `tests/connectors/test_mssql.py` covers int sizes (tinyint/smallint/
  int/bigint), bit, varchar, decimal, datetime/datetime2,
  datetimeoffset (utc + non-utc), uniqueidentifier and varbinary;
  also exercises the URL connection path and pagination rewrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… dependency

The ClickHouse connector now uses ``clickhouse-connect`` directly,
parsing ClickHouse type strings via sqlglot to build PyArrow schemas
rather than going through the ibis-project clickhouse backend.

Highlights
- ``connector/clickhouse.py``: native client; type lexer covers
  ``Nullable(T)`` / ``LowCardinality(T)`` / ``Array(T)`` / ``Tuple(...)``
  / ``Map(K,V)`` / ``DateTime64(p, 'TZ')`` / ``Decimal(p,s)``, plus
  ``Int128/256`` and ``UInt128/256`` (surfaced as string to avoid
  silent truncation past 64-bit Arrow widths).
- ``model/data_source.py::get_clickhouse_connection`` returns a
  ``clickhouse_connect.Client``; ``_handle_clickhouse_url`` now also
  accepts ``clickhouse+http://`` / ``clickhouse+https://`` URLs.
- ``pyproject.toml``: clickhouse extra now pulls
  ``clickhouse-connect>=0.8`` instead of ``ibis-framework[clickhouse]``.

Tests
- ``tests/connectors/test_clickhouse.py`` exercises the full query
  path against a ClickHouse testcontainer (TPCH sf=0.01) and
  parametrises 35+ type strings through ``_parse_clickhouse_type``,
  including ``DateTime64`` with timezone and nested ``Tuple``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bles

SQL NULL in json/jsonb columns was being coerced into the string "null",
breaking NULL semantics downstream. Drop the oid-114/3802 special case so
None passes through unchanged.

Arrow tables were built via dict(zip(names, arrays)), which silently
drops duplicate column names (e.g. self-joins projecting two `id`
columns). Switch to pa.Table.from_arrays(..., schema=schema) so
positional construction keeps duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace ``pa.table(dict(zip(...)), schema=schema)`` with positional
``pa.Table.from_arrays``. The dict-based form silently drops duplicate
column names, which corrupts join results like
``SELECT a.id, b.id FROM t a, t b``. Add a regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The postgres branch returns a raw ``psycopg.Connection`` rather than an
ibis ``BaseBackend``. Introduce a ``ConnectionLike`` alias (under
TYPE_CHECKING) covering both shapes, apply it to ``get_connection``, and
annotate ``get_postgres_connection`` with its concrete return type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit flagged that data_source.get_athena_connection re-implemented
the pyathena connect-kwargs logic that also lives in connector/athena.py,
letting them drift on schema_name / kill_on_interrupt / info.kwargs, and
the get_connection signature claimed BaseBackend even though the Athena
path returned a raw pyathena Connection.

Route data_source.get_athena_connection through the connector's shared
_build_connect_kwargs builder, and widen the return-type annotation to a
BackendOrConnection Union so the type matches reality without forcing a
runtime pyathena import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reject Trino connection URLs without a username instead of silently
  falling back to "test"; raise a clear INVALID_CONNECTION_INFO error.
- Wrap lazy ``import trino`` calls so users without the trino extra
  get a "pip install wren-engine[trino]" hint instead of a confusing
  ImportError.
- Strip trailing whitespace and a single trailing semicolon before
  wrapping user SQL in ``SELECT * FROM (...) AS _sub LIMIT N`` for
  both ``query()`` and ``dry_run()``; previously a trailing ``;``
  produced invalid SQL.
- Route ``ConnectionUrl`` for trino through the native URL handler in
  ``DataSourceExtension.get_connection()`` so it no longer hits the
  removed generic ``ibis.connect()`` path.

Adds regression unit tests covering each fix (semicolon stripping,
missing-username URL, bad-scheme URL, and the ImportError path via
monkeypatched ``__import__``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously only the password got unquote_plus() — username, the
database path, and the query-string handling left other components
encoded, so URL-encoded specials (e.g. '@' in a username, spaces in a
database name) reached the ODBC driver still percent-encoded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous branch silently mishandled asymmetric credentials:
user=None with password set crashed in password.get_secret_value(),
and user set with password=None emitted UID= without PWD=, producing
an incomplete ODBC connection string. Now: both absent falls back to
Trusted_Connection=yes, exactly one absent raises INVALID_CONNECTION_INFO,
both present uses normal SQL auth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
statement_timeout was cast to int only after pyodbc.connect() returned,
so a non-numeric value (e.g. "5s") leaked the just-opened connection
when int() raised ValueError. Validate and cast up-front; raise
INVALID_CONNECTION_INFO before any connect call.

Also adds unit tests with mocked pyodbc covering all three review
findings: URL component decoding, asymmetric auth rejection, and the
pre-connect timeout validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The native MySQL/Doris path returns a raw MySQLdb.Connection rather than an
ibis BaseBackend. Reflect that in the return type via a ConnectionHandle
union so static analysis matches runtime behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…type

Address CodeRabbit major findings on the native MySQL connector:

- Initialize ``_closed`` before opening the cursor and close the connection
  if the ``ANSI_QUOTES`` init query raises, so a failed init never leaks a
  live socket.
- Coerce the ``limit`` argument via ``int()`` and reject negative values
  before interpolating into SQL — the previous f-string was a SQL-injection
  vector when callers passed an attacker-controlled value.
- Append ``LIMIT n`` to the user's SQL (stripping a trailing semicolon)
  instead of wrapping it in ``SELECT * FROM (...) AS _sub``; the subquery
  approach failed with ``ER_DUP_FIELDNAME`` whenever the inner SELECT
  projected two columns with the same name (e.g. ``SELECT a.id, b.id`` in a
  join). ``dry_run`` switches to ``EXPLAIN <sql>`` for the same reason and
  to remain side-effect-free.
- Derive ``DECIMAL`` precision/scale from ``cursor.description`` (PEP 249
  ``precision``/``scale`` fields) instead of hard-coding ``decimal128(38, 9)``;
  that loss of precision silently truncated MySQL columns with ``D > 9``.
  Precision is recovered from MySQLdb's display-length convention and
  clamped to Arrow's ``decimal128`` maximum (38) plus MySQL's scale ceiling
  (30).

Adds regression tests for duplicate-column SELECT, SQL-injection limit
rejection, trailing semicolon, large-scale decimal round-trip, and unit
tests for the new helpers. ``just test-mysql`` now runs both the engine
suite and the type-coverage connector suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review found _decimal_type defaulted scale=9 when the column
typmod was missing, so Decimal.quantize silently rounded high-precision
values (e.g. 18-significant-figure NUMERIC). Fall back to pa.string() for
unconstrained NUMERIC and NUMERIC[] columns so the exact textual value
round-trips. Same approach Trino's connector takes for dynamic-decimal
casts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review found that query() and dry_run() wrap user SQL as
"SELECT * FROM ({sql}) AS _t LIMIT N", which Postgres/Canner reject
when the inner SQL ends in a semicolon. Add a _strip_trailing_semicolon
helper that only strips the terminating run of semicolons and
whitespace (so semicolons inside string literals are preserved) and
apply it on both call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lues

MySQL ``TIME`` ranges ``-838:59:59`` to ``838:59:59``. PyArrow
``time64("us")`` only accepts 0-24h positive values, so the previous
mapping silently truncated or corrupted any TIME value outside that
window. Map to ``duration("us")`` instead so the full MySQL range
round-trips, and convert MySQLdb's ``datetime.timedelta`` to signed
microseconds explicitly (``timedelta.days`` is signed for negative
values, ``seconds``/``microseconds`` are not — combining all three
recovers the signed total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQL Server TINYINT is an unsigned 8-bit integer (0..255), but the
arrow-type helper was branching on the sampled value sign and could
fall back to int8. Map internal_size == 1 directly to pa.uint8() so
the schema reflects the driver-declared type regardless of the rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DATETIMEOFFSET output converter assumed pyodbc would always hand it
a 20-byte payload. Truncated or malformed buffers fell through and
surfaced as a cryptic "month must be in 1..12" ValueError from
datetime(). Reject non-20-byte payloads up front with a clear message
that points at the actual length mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ettings

In the ClickHouse client-kwargs assembly, ``out.update(kwargs)`` /
``client_kwargs.update(kwargs)`` would clobber the merged ``settings`` dict
(carrying ``max_execution_time`` from ``statement_timeout``) whenever the
caller also passed their own ``settings`` via ``kwargs``. Pop ``settings``
from incoming ``kwargs`` first and merge it into the local dict so the
timeout survives. Also wrap driver ``TIMEOUT_EXCEEDED`` errors as the
existing ``DatabaseTimeoutError`` instead of re-raising the raw driver
exception, for consistency with the typed error model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap the readiness-loop client in try/finally so we close on failed
attempts, not only on success. Replace the DuckDB TPCH extension
fixture (which pulled the extension over the network on every run)
with inline-fabricated rows so the test stays hermetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous lazy initializers mutated module-level dicts/sets in place,
so a thread that raced the first call could observe a partially-populated
map. Switch each initializer to a @functools.cache'd accessor that
returns a fully-built local container — the cache slot is GIL-protected
and only published once initialization runs to completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n sites

``pa.table(dict(zip(names, arrays)), schema=...)`` silently drops one of
two columns that share a name because the intermediate dict collapses
the key. Queries like ``SELECT a.id, b.id FROM t a JOIN t b`` would
return a one-column table even though the cursor description carries
both fields. Switch to ``pa.Table.from_arrays(arrays, schema=schema)``
so the schema is name-positional and both columns survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``mysql:8`` is a moving tag — when Docker Hub publishes a new ``8.x``
the CI run can pick up an image with different default flags (e.g. the
deprecated-features warning behaviour changed between 8.0 and 8.4).
Pin to ``mysql:8.0.36`` so the connector test suite stays reproducible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit-test job runs without the mysql extra, so importing
MySQLdb fails. Guard the test with pytest.importorskip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added dependencies Pull requests that update a dependency file python Pull requests that update Python code core labels May 21, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ca0462eb-5ee1-47e5-a648-b788315dd0b7

📥 Commits

Reviewing files that changed from the base of the PR and between c1ba926 and ab79d14.

📒 Files selected for processing (4)
  • core/wren/src/wren/connector/clickhouse.py
  • core/wren/src/wren/connector/mssql.py
  • core/wren/tests/unit/test_clickhouse_helpers.py
  • core/wren/tests/unit/test_mssql_connection.py

Walkthrough

Replaces ibis-backed connectors with native DB-API implementations, updates factory routing and dependency groups, and adds extensive unit and integration tests to validate type parsing, value coercion, connection creation, and query/dry_run/close behavior.

Changes

Multi-Backend Native Connector Migration

Layer / File(s) Summary
Configuration & test tasks
core/wren/justfile, core/wren/pyproject.toml, core/wren/tests/conftest.py
Add just tasks for connector tests, register new pytest markers, and replace ibis-framework[...] extras with native driver optional deps (psycopg[binary], mysqlclient, clickhouse-connect, trino, pyodbc, pyathena).
Connector factory & base
core/wren/src/wren/connector/base.py, core/wren/src/wren/connector/factory.py
Lazy-load ibis type imports under TYPE_CHECKING and update registry to route several DataSource values to dedicated native connector modules instead of the generic ibis module.
Athena native connector
core/wren/src/wren/connector/athena.py, core/wren/tests/unit/test_athena_connector.py
New pyathena-backed connector with sqlglot-based type parsing, value coercion to PyArrow, credential resolution (static keys / web identity / boto3 chain), query/dry_run/close, and unit tests covering parsing, table building, credential flows, and error wrapping.
ClickHouse native connector
core/wren/src/wren/connector/clickhouse.py, core/wren/tests/connectors/test_clickhouse.py, core/wren/tests/unit/test_clickhouse_helpers.py
New clickhouse-connect-backed connector with type descriptor parsing, client kwargs building, statement_timeout→settings mapping, SQL wrapping/stripping, error mapping, idempotent close, plus integration tests (container) and unit tests for parsing and settings merging.
Trino native connector
core/wren/src/wren/connector/trino.py, core/wren/tests/connectors/test_trino.py
New trino.dbapi connector: sqlglot type parsing, value coercion, URL parsing/support for trino+https, lazy import with install hint, auth selection (token/basic), query/dry_run/close, and unit+integration TPCH tests.
Postgres native refactor
core/wren/src/wren/connector/postgres.py, core/wren/tests/connectors/test_postgres.py
Postgres connector rewritten to psycopg3: OID→Arrow mapping, decimal handling, subquery-based LIMIT wrapping, dry_run via LIMIT 0, idempotent close, plus integration tests validating types, round-trips, NULLs, limits, and duplicate columns.
MySQL/Doris native refactor
core/wren/src/wren/connector/mysql.py, core/wren/tests/connectors/test_mysql.py, core/wren/tests/connectors/test_mysql_connector.py, core/wren/tests/unit/test_mysql_helpers.py
MySql/Doris connectors reimplemented with mysqlclient: LIMIT helpers, cached FIELD_TYPE→Arrow mappings, DECIMAL precision/scale derivation and clamping, connection kwargs (including SSL), plus extensive integration/unit tests (type coverage, limit coercion/injection resistance, duplicate columns, TIME→duration handling, and concurrency regression for caches).
Canner native refactor
core/wren/src/wren/connector/canner.py, core/wren/tests/connectors/test_canner.py
Canner connector now uses psycopg cursor execution with OID→Arrow mapping, JSON/JSONB handling, decimal quantization, semicolon-stripping, idempotent close, and tests verifying internals and end-to-end behavior.
MSSQL native refactor
core/wren/src/wren/connector/mssql.py, core/wren/tests/connectors/test_mssql.py, core/wren/tests/unit/test_mssql_connection.py
MSSQL connector rewritten to pyodbc with cursor execution, SQL rewrite for pagination, sys.dm_exec_describe_first_result_set probing for dry_run errors, MSSQL-specific Arrow type inference (TINYINT→uint8, datetimeoffset decoding), close handling, and unit/integration tests for URL parsing, credential validation, statement_timeout, and decoding.
DataSource integration
core/wren/src/wren/model/data_source.py
Introduce BackendOrConnection union, update DataSource/DataSourceExtension return types, and wire native connection builders for Athena, Canner, ClickHouse, MSSQL, MySQL/Doris, Postgres, and Trino; expand ClickHouse URL schemes and enforce secure=True for clickhouse+https.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • Canner/WrenAI#2268: Concurrent refactor of Snowflake connector from ibis to native implementation with factory routing changes.

Suggested reviewers

  • douenergy
  • andreashimin

🐰 I hopped through code, from ibis to wire,
Native drivers now make queries sing and fly,
Types parsed, tables built, tests spin up with cheer,
Close calls idempotent — no more hidden tears,
A carrot for reviewers, speedy patches nigh!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/native-drivers-combined

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (7)
core/wren/tests/unit/test_athena_connector.py (2)

35-37: 💤 Low value

Consider using direct assignment for more reliable stubbing.

sys.modules.setdefault won't override if pyathena is already imported (e.g., in a test environment where the package is installed and another test imported it first). If test isolation is critical, consider explicit assignment or a fixture that saves/restores the original.

That said, for unit test suites with controlled import order, the current approach works.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/unit/test_athena_connector.py` around lines 35 - 37, Replace
the non-overriding stub registration using sys.modules.setdefault with an
explicit assignment so the fake module reliably replaces any installed pyathena
during the test: save the original sys.modules.get("pyathena") into a temp (so
it can be restored), assign sys.modules["pyathena"] = _pyathena_module (which
already sets _pyathena_module.connect = _fake_pyathena_connect), and restore the
saved original after the test (or use a fixture/monkeypatch to set and undo the
assignment) to ensure test isolation.

312-328: 💤 Low value

The object.__setattr__ bypass is fragile but acceptable for this test.

The comment explains the intent—testing that the kwargs builder defensively reads an optional kwargs field. If AthenaConnectionInfo later adds a proper kwargs field, this test should be updated to use the model's constructor. For now, this adequately tests the connector's defensive handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/unit/test_athena_connector.py` around lines 312 - 328, Test
uses object.__setattr__ to inject a kwargs dict which is fragile; keep the
current defensive injection for now but change the test to construct
AthenaConnectionInfo with a real kwargs argument if/when AthenaConnectionInfo
gains a kwargs field. Update the test function
test_data_source_get_athena_connection_propagates_schema_and_kwargs to prefer
calling the AthenaConnectionInfo constructor with kwargs (instead of
object.__setattr__) when the model defines a kwargs attribute, otherwise fall
back to the current object.__setattr__ injection; reference AthenaConnectionInfo
and DataSourceExtension.get_athena_connection to locate the relevant code.
core/wren/tests/unit/test_mssql_connection.py (1)

57-77: ⚡ Quick win

Add a password decode case that covers + as space.

This test currently validates %20 decoding but not + handling. Adding a + case will lock in the expected unquote_plus behavior and prevent subtle credential parsing regressions.

Based on learnings: decode URL-derived passwords using urllib.parse.unquote_plus so + is treated as a space, and keep parsed.username as-is unless a verified mismatch exists.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/unit/test_mssql_connection.py` around lines 57 - 77, Add a
test case that asserts plus-signs in URL-encoded passwords decode to spaces by
updating test_mssql_url_decodes_user_database_and_password to include a password
containing '+' (e.g., "p+ss+word") and assert
DataSourceExtension.get_mssql_connection_from_url yields PWD "p ss word"; if the
implementation does not already use urllib.parse.unquote_plus for decoding
credentials, update the decoding in
DataSourceExtension.get_mssql_connection_from_url (and any helpers used by
_FakePyodbc/_parse_conn_str) to call urllib.parse.unquote_plus for the password
(and other credential fields that should treat '+' as space) while leaving
parsed.username unchanged unless you find a verified mismatch.
core/wren/src/wren/connector/postgres.py (2)

239-249: ⚖️ Poor tradeoff

Transaction left in aborted state after query errors.

When psycopg3 encounters an error during query execution, the transaction enters an aborted state requiring rollback() to recover. The exception handlers wrap the error but don't rollback, leaving the connection unusable for subsequent queries. The test file confirms this by explicitly calling connector.connection.rollback() as a workaround.

If the connector is intended for single-use, this may be acceptable. Otherwise, consider rolling back on exceptions or enabling autocommit mode.

♻️ Option: Rollback on exception
         except Exception as e:
+            try:
+                self.connection.rollback()
+            except Exception:
+                pass
             raise WrenError(
                 ErrorCode.GENERIC_USER_ERROR,
                 str(e),
                 phase=ErrorPhase.SQL_EXECUTION,
                 metadata={DIALECT_SQL: sql},
             ) from e

Also applies to: 256-266

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/postgres.py` around lines 239 - 249, The
exception handlers that wrap DB errors (the except blocks that raise WrenError
with ErrorPhase.SQL_EXECUTION and metadata {DIALECT_SQL: sql}) do not roll back
the psycopg3 transaction, leaving the connection aborted; update those handlers
to call connection.rollback() (e.g., self.connection.rollback() or
connector.connection.rollback()) before raising so the connection is recovered
for further use, and apply the same change to the other symmetric handler block
referenced (the one around lines handling 256-266).

231-233: ⚡ Quick win

Consider validating limit defensively to prevent SQL injection.

While limit is typed as int | None, Python doesn't enforce this at runtime. A caller violating the type contract could inject arbitrary SQL. Given the PR objectives mention "limit sanitization to avoid SQL injection," consider adding explicit validation.

🛡️ Proposed defensive validation
     def query(self, sql: str, limit: int | None = None) -> pa.Table:
         if limit is not None:
+            if not isinstance(limit, int) or limit < 0:
+                raise ValueError("limit must be a non-negative integer")
             sql = f"SELECT * FROM ({sql}) AS _sub LIMIT {limit}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/postgres.py` around lines 231 - 233, The query
method builds a LIMIT clause by interpolating the limit value directly; add
defensive validation in query(sql: str, limit: int | None = None) to ensure
limit is an actual non-negative integer (e.g., check isinstance(limit, int) and
limit >= 0 or attempt safe int() conversion and reject on failure) and raise
TypeError/ValueError for invalid values before constructing sql; keep using the
local variable limit only after validation so the f"SELECT * FROM ({sql}) AS
_sub LIMIT {limit}" interpolation cannot be exploited by a non-integer caller.
core/wren/src/wren/model/data_source.py (1)

57-59: 💤 Low value

Incomplete BackendOrConnection type alias.

The union is missing types returned by other connectors: clickhouse_connect.driver.client.Client (ClickHouse), pyodbc.Connection (MSSQL), and trino.dbapi.Connection (Trino). This causes type-checker mismatches for callers.

Suggested fix
 if TYPE_CHECKING:
     import MySQLdb
     import psycopg
     from pyathena.connection import Connection as PyAthenaConnection
+    from clickhouse_connect.driver.client import Client as ClickHouseClient
+    from trino.dbapi import Connection as TrinoConnection
 
 BackendOrConnection = Union[
-    BaseBackend, "PyAthenaConnection", "MySQLdb.Connection", "psycopg.Connection"
+    BaseBackend,
+    "PyAthenaConnection",
+    "MySQLdb.Connection",
+    "psycopg.Connection",
+    "pyodbc.Connection",
+    "ClickHouseClient",
+    "TrinoConnection",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/model/data_source.py` around lines 57 - 59, The
BackendOrConnection type alias in data_source.py is missing connector return
types causing mypy failures; update the BackendOrConnection Union (symbol:
BackendOrConnection) to include clickhouse_connect.driver.client.Client,
pyodbc.Connection, and trino.dbapi.Connection, and ensure imports are added
safely (use string literals or guard imports with TYPE_CHECKING) so the new
types are available to the type-checker without causing runtime import
side-effects.
core/wren/tests/connectors/test_trino.py (1)

275-297: 💤 Low value

Consider retry logic instead of fixed sleep for container readiness.

The time.sleep(5) is a pragmatic workaround for coordinator readiness, but it could either be too short (causing flaky tests) or too long (slowing CI). A retry loop with exponential backoff would be more robust.

That said, this is test code and the current approach is documented. This is a minor improvement that can be deferred.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/connectors/test_trino.py` around lines 275 - 297, The test
fixture engine currently uses a fixed time.sleep(5) after starting
TrinoContainer which can be flaky or slow; replace that fixed sleep with a retry
loop that polls the coordinator readiness (e.g., attempting a lightweight query
or connection) with exponential backoff and a max timeout before calling
_create_tpch_tables, referencing the engine fixture, TrinoContainer, and
_create_tpch_tables to locate where to implement the retry; ensure the loop
aborts with a clear test failure if readiness isn't achieved within the timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/wren/src/wren/connector/clickhouse.py`:
- Around line 332-343: In dry_run, the input SQL isn't stripped of trailing
semicolons before being wrapped in the subquery; update the dry_run method
(function dry_run) to strip whitespace and trailing semicolons from the sql
argument (e.g., sql = sql.strip().rstrip(";")) before building the f"SELECT *
FROM ({sql}) AS _wren_sub LIMIT 0" query, preserving the existing exception
handling that maps TIMEOUT_EXCEEDED to DatabaseTimeoutError and other errors to
WrenError (ErrorCode.INVALID_SQL, phase=ErrorPhase.SQL_DRY_RUN,
metadata={DIALECT_SQL: sql}).
- Around line 264-265: The username/password extracted from the URL
(parsed.username and parsed.password) must be URL-decoded before use; update the
ClickHouse connector code that builds the auth dict to pass parsed.username and
parsed.password through urllib.parse.unquote_plus (or the repo's URL-decoding
helper) so percent-encoded characters (e.g. %40) are converted to their literal
characters before being sent to ClickHouse.
- Around line 315-330: The query method currently wraps the incoming sql into a
subquery without removing trailing semicolons, causing wrapped statements like
"SELECT * FROM (SELECT 1;) ..." to be invalid; fix by trimming trailing
semicolons and whitespace from the input sql before building the wrapped
statement (e.g., compute a cleaned_sql from the sql parameter and use that when
constructing statement when limit is set), but keep the original sql in
metadata={DIALECT_SQL: sql} if you want to preserve the caller's input; update
the logic in query (refer to the statement variable and the sql parameter) to
use the cleaned SQL for execution.

In `@core/wren/src/wren/connector/mssql.py`:
- Around line 41-50: The result construction currently zips names into a dict
which collapses duplicate column names; replace the dict(zip(names, arrays))
approach in the return of the method with building the pyarrow Table directly
from the arrays and the arrow_schema while preserving the original names (e.g.
use pa.Table.from_arrays or equivalent with names=names and
schema=arrow_schema). Locate the call site that uses cursor.description,
_build_mssql_arrow_schema, and _build_mssql_column and change the final return
to construct the table from arrays + schema + names so duplicate projected
columns (e.g. SELECT a, a) are preserved.

---

Nitpick comments:
In `@core/wren/src/wren/connector/postgres.py`:
- Around line 239-249: The exception handlers that wrap DB errors (the except
blocks that raise WrenError with ErrorPhase.SQL_EXECUTION and metadata
{DIALECT_SQL: sql}) do not roll back the psycopg3 transaction, leaving the
connection aborted; update those handlers to call connection.rollback() (e.g.,
self.connection.rollback() or connector.connection.rollback()) before raising so
the connection is recovered for further use, and apply the same change to the
other symmetric handler block referenced (the one around lines handling
256-266).
- Around line 231-233: The query method builds a LIMIT clause by interpolating
the limit value directly; add defensive validation in query(sql: str, limit: int
| None = None) to ensure limit is an actual non-negative integer (e.g., check
isinstance(limit, int) and limit >= 0 or attempt safe int() conversion and
reject on failure) and raise TypeError/ValueError for invalid values before
constructing sql; keep using the local variable limit only after validation so
the f"SELECT * FROM ({sql}) AS _sub LIMIT {limit}" interpolation cannot be
exploited by a non-integer caller.

In `@core/wren/src/wren/model/data_source.py`:
- Around line 57-59: The BackendOrConnection type alias in data_source.py is
missing connector return types causing mypy failures; update the
BackendOrConnection Union (symbol: BackendOrConnection) to include
clickhouse_connect.driver.client.Client, pyodbc.Connection, and
trino.dbapi.Connection, and ensure imports are added safely (use string literals
or guard imports with TYPE_CHECKING) so the new types are available to the
type-checker without causing runtime import side-effects.

In `@core/wren/tests/connectors/test_trino.py`:
- Around line 275-297: The test fixture engine currently uses a fixed
time.sleep(5) after starting TrinoContainer which can be flaky or slow; replace
that fixed sleep with a retry loop that polls the coordinator readiness (e.g.,
attempting a lightweight query or connection) with exponential backoff and a max
timeout before calling _create_tpch_tables, referencing the engine fixture,
TrinoContainer, and _create_tpch_tables to locate where to implement the retry;
ensure the loop aborts with a clear test failure if readiness isn't achieved
within the timeout.

In `@core/wren/tests/unit/test_athena_connector.py`:
- Around line 35-37: Replace the non-overriding stub registration using
sys.modules.setdefault with an explicit assignment so the fake module reliably
replaces any installed pyathena during the test: save the original
sys.modules.get("pyathena") into a temp (so it can be restored), assign
sys.modules["pyathena"] = _pyathena_module (which already sets
_pyathena_module.connect = _fake_pyathena_connect), and restore the saved
original after the test (or use a fixture/monkeypatch to set and undo the
assignment) to ensure test isolation.
- Around line 312-328: Test uses object.__setattr__ to inject a kwargs dict
which is fragile; keep the current defensive injection for now but change the
test to construct AthenaConnectionInfo with a real kwargs argument if/when
AthenaConnectionInfo gains a kwargs field. Update the test function
test_data_source_get_athena_connection_propagates_schema_and_kwargs to prefer
calling the AthenaConnectionInfo constructor with kwargs (instead of
object.__setattr__) when the model defines a kwargs attribute, otherwise fall
back to the current object.__setattr__ injection; reference AthenaConnectionInfo
and DataSourceExtension.get_athena_connection to locate the relevant code.

In `@core/wren/tests/unit/test_mssql_connection.py`:
- Around line 57-77: Add a test case that asserts plus-signs in URL-encoded
passwords decode to spaces by updating
test_mssql_url_decodes_user_database_and_password to include a password
containing '+' (e.g., "p+ss+word") and assert
DataSourceExtension.get_mssql_connection_from_url yields PWD "p ss word"; if the
implementation does not already use urllib.parse.unquote_plus for decoding
credentials, update the decoding in
DataSourceExtension.get_mssql_connection_from_url (and any helpers used by
_FakePyodbc/_parse_conn_str) to call urllib.parse.unquote_plus for the password
(and other credential fields that should treat '+' as space) while leaving
parsed.username unchanged unless you find a verified mismatch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6d880d85-dd2d-463e-a089-800df247ba4e

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae4eea and c1ba926.

⛔ Files ignored due to path filters (1)
  • core/wren/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (24)
  • core/wren/justfile
  • core/wren/pyproject.toml
  • core/wren/src/wren/connector/athena.py
  • core/wren/src/wren/connector/base.py
  • core/wren/src/wren/connector/canner.py
  • core/wren/src/wren/connector/clickhouse.py
  • core/wren/src/wren/connector/factory.py
  • core/wren/src/wren/connector/ibis.py
  • core/wren/src/wren/connector/mssql.py
  • core/wren/src/wren/connector/mysql.py
  • core/wren/src/wren/connector/postgres.py
  • core/wren/src/wren/connector/trino.py
  • core/wren/src/wren/model/data_source.py
  • core/wren/tests/conftest.py
  • core/wren/tests/connectors/test_canner.py
  • core/wren/tests/connectors/test_clickhouse.py
  • core/wren/tests/connectors/test_mssql.py
  • core/wren/tests/connectors/test_mysql.py
  • core/wren/tests/connectors/test_mysql_connector.py
  • core/wren/tests/connectors/test_postgres.py
  • core/wren/tests/connectors/test_trino.py
  • core/wren/tests/unit/test_athena_connector.py
  • core/wren/tests/unit/test_mssql_connection.py
  • core/wren/tests/unit/test_mysql_helpers.py
💤 Files with no reviewable changes (1)
  • core/wren/src/wren/connector/ibis.py

Comment thread core/wren/src/wren/connector/clickhouse.py Outdated
Comment thread core/wren/src/wren/connector/clickhouse.py
Comment thread core/wren/src/wren/connector/clickhouse.py
Comment thread core/wren/src/wren/connector/mssql.py Outdated
- clickhouse._build_clickhouse_client_kwargs: unquote_plus username/password
  parsed from a connection URL so percent-encoded credentials (e.g. %40 → @)
  don't reach ClickHouse verbatim and fail auth.
- clickhouse.ClickHouseConnector.{query,dry_run}: strip the terminating run
  of ;/whitespace before wrapping in `SELECT * FROM (...) AS _wren_sub LIMIT N`
  — otherwise `SELECT 1;` becomes invalid SQL. Reuse canner's
  `[;\s]+\Z` regex so semicolons inside string literals are preserved.
- mssql.MSSqlConnector.query and clickhouse._build_clickhouse_arrow_table:
  build the result table via `pa.Table.from_arrays(arrays, schema=...)`
  instead of `dict(zip(names, arrays))`, which silently collapses duplicate
  column names from projections like `SELECT a, a`.

Adds pure-Python regression tests under tests/unit/ for all four behaviours.
@goldmedal goldmedal requested a review from douenergy May 21, 2026 06:19
@douenergy douenergy merged commit 6f67124 into main May 21, 2026
10 checks passed
goldmedal added a commit that referenced this pull request May 22, 2026
Ibis is no longer a load-bearing dependency for the bulk of connectors
after the recent direct-connector refactors (canner/postgres/mysql/mssql/
trino/clickhouse/athena dropped ibis in #2313, snowflake in #2268); the
intro now only credits Apache DataFusion as the engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core dependencies Pull requests that update a dependency file python Pull requests that update Python code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants