refactor(wren): drop ibis for canner/postgres/mysql/mssql/trino/clickhouse/athena (combined)#2313
Conversation
Canner Enterprise speaks the Postgres wire protocol; the connector now uses psycopg directly instead of the ibis postgres backend. Changes - `connector/canner.py`: native psycopg cursor with a self-contained PG OID -> Arrow type map covering the canner-flavoured types (VARCHAR/CHAR -> string, DECIMAL -> decimal128, BIGINT/INT/SMALLINT -> int, BOOLEAN -> bool, DATE/TIMESTAMP/TIMESTAMPTZ -> date/timestamp, ROW/ARRAY/MAP serialised as JSON strings). Errors are wrapped as WrenError with the dialect SQL attached, mirroring the existing postgres connector contract. - `model/data_source.py::get_canner_connection`: returns a `psycopg.Connection` (autocommit) instead of an ibis backend. Tests - `tests/connectors/test_canner.py` exercises the type-mapping helpers and runs the connector against a PostgresContainer with the common canner result types (incl. JSON/JSONB and arrays). Marker `canner` is registered in `tests/conftest.py` and a `just test-canner` target is added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The trino connector now uses the `trino` python client directly, parsing Trino type strings via sqlglot to build PyArrow schemas. The trino extra installs `trino>=0.333,<1` instead of `ibis-framework[trino]`. Highlights - `connector/trino.py`: native cursor execution; type-string -> Arrow via sqlglot, including row(...) / array(...) / map(...) / decimal(p,s) / timestamp with time zone. - `model/data_source.py::get_trino_connection`: returns `trino.dbapi.Connection`; native code path no longer routes through ibis. - `pyproject.toml`: trino extra -> `trino>=0.333,<1`. Tests - `tests/connectors/test_trino.py` covers ~36 Trino type categories including nested row/array/map plus testcontainer-backed query suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The athena connector now uses pyathena directly with a Trino-style type lexer to materialise cursor results into PyArrow tables, removing the ibis-framework[athena] dependency from the athena extra. Highlights - connector/athena.py: native pyathena cursor; type strings parsed via sqlglot (varchar, decimal(p,s), array<T>, row(...), map<K,V>, etc.). - model/data_source.py::get_athena_connection: preserves the Web-Identity-Token (OIDC -> AssumeRoleWithWebIdentity) and access-key auth flows; returns a pyathena.connection.Connection. - pyproject.toml: athena extra -> pyathena[pandas]>=3. Tests - tests/unit/test_athena_connector.py mocks pyathena cursor + boto3 STS to verify the type lexer, cursor->Arrow materialisation, error mapping, and all three credential resolution paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The postgres connector now uses psycopg directly instead of going through
ibis. This removes ibis-framework[postgres] from the postgres extra and
gives us direct control over Arrow conversion for pg-specific types
(numeric scale, arrays, intervals, jsonb).
Changes
- connector/postgres.py: native psycopg3 cursor execution + OID-driven
Arrow type mapping (bool, int{2,4,8}, float{4,8}, decimal128(p,s), text,
bytea, uuid, jsonb, timestamp[tz], date, time, array-of-T).
- model/data_source.py::get_postgres_connection: returns psycopg.Connection.
- pyproject.toml: postgres extra now installs psycopg[binary]>=3 instead
of ibis-framework[postgres].
- connector/base.py and model/data_source.py: ibis imports made lazy so
importing wren.connector.postgres no longer pulls in ibis.
Tests
- New direct-connector tests in tests/connectors/test_postgres.py exercise
10+ pg types (int4, int8, numeric(38,9), text, bool, bytea, uuid, jsonb,
timestamp, timestamptz, int4[], text[], numeric[]) against a
testcontainer postgres, asserting Arrow schema and value round-trip.
- All existing tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mysql / doris connector now uses mysqlclient (MySQLdb) directly instead of going through ibis. The mysql extra installs `mysqlclient` only. Changes - `connector/mysql.py`: native MySQLdb cursor + field-type-code → Arrow type mapping covering integer (signed/unsigned variants), decimal, float, char/varchar/text, json, blob, datetime/timestamp, date, time, bit. SSL kwargs derive from MySqlConnectionInfo. - `model/data_source.py`: `get_mysql_connection` / `get_doris_connection` return `MySQLdb.Connection`; SSL context construction is moved into the connector module. - `pyproject.toml`: mysql extra → `mysqlclient>=2.2`. Tests - `tests/connectors/test_mysql_connector.py` exercises 15+ MySQL field types via a testcontainer mysql. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mssql connector now uses pyodbc directly with custom Arrow type
inference and a datetimeoffset output converter. The mssql extra
installs `pyodbc` instead of `ibis-framework[mssql]`.
Highlights
- `connector/mssql.py`: raw pyodbc cursor; sqlglot-based LIMIT/OFFSET
rewrite (`OFFSET 0 ROWS FETCH NEXT n ROWS ONLY`); Arrow schema built
from `cursor.description` + value sampling; dry_run falls back to
`sys.dm_exec_describe_first_result_set` for precise error messages.
- `model/data_source.py`: `_connect_mssql_pyodbc` builds an ODBC
connection string with proper `{...}` escaping; `mssql://` URL
parsing added; output converter for DATETIMEOFFSET (type code -155)
decodes the 20-byte payload into a tz-aware datetime.
- `pyproject.toml`: mssql extra -> `pyodbc>=5,<6`.
Tests
- `tests/connectors/test_mssql.py` covers int sizes (tinyint/smallint/
int/bigint), bit, varchar, decimal, datetime/datetime2,
datetimeoffset (utc + non-utc), uniqueidentifier and varbinary;
also exercises the URL connection path and pagination rewrite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… dependency The ClickHouse connector now uses ``clickhouse-connect`` directly, parsing ClickHouse type strings via sqlglot to build PyArrow schemas rather than going through the ibis-project clickhouse backend. Highlights - ``connector/clickhouse.py``: native client; type lexer covers ``Nullable(T)`` / ``LowCardinality(T)`` / ``Array(T)`` / ``Tuple(...)`` / ``Map(K,V)`` / ``DateTime64(p, 'TZ')`` / ``Decimal(p,s)``, plus ``Int128/256`` and ``UInt128/256`` (surfaced as string to avoid silent truncation past 64-bit Arrow widths). - ``model/data_source.py::get_clickhouse_connection`` returns a ``clickhouse_connect.Client``; ``_handle_clickhouse_url`` now also accepts ``clickhouse+http://`` / ``clickhouse+https://`` URLs. - ``pyproject.toml``: clickhouse extra now pulls ``clickhouse-connect>=0.8`` instead of ``ibis-framework[clickhouse]``. Tests - ``tests/connectors/test_clickhouse.py`` exercises the full query path against a ClickHouse testcontainer (TPCH sf=0.01) and parametrises 35+ type strings through ``_parse_clickhouse_type``, including ``DateTime64`` with timezone and nested ``Tuple``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bles SQL NULL in json/jsonb columns was being coerced into the string "null", breaking NULL semantics downstream. Drop the oid-114/3802 special case so None passes through unchanged. Arrow tables were built via dict(zip(names, arrays)), which silently drops duplicate column names (e.g. self-joins projecting two `id` columns). Switch to pa.Table.from_arrays(..., schema=schema) so positional construction keeps duplicates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace ``pa.table(dict(zip(...)), schema=schema)`` with positional ``pa.Table.from_arrays``. The dict-based form silently drops duplicate column names, which corrupts join results like ``SELECT a.id, b.id FROM t a, t b``. Add a regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The postgres branch returns a raw ``psycopg.Connection`` rather than an ibis ``BaseBackend``. Introduce a ``ConnectionLike`` alias (under TYPE_CHECKING) covering both shapes, apply it to ``get_connection``, and annotate ``get_postgres_connection`` with its concrete return type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit flagged that data_source.get_athena_connection re-implemented the pyathena connect-kwargs logic that also lives in connector/athena.py, letting them drift on schema_name / kill_on_interrupt / info.kwargs, and the get_connection signature claimed BaseBackend even though the Athena path returned a raw pyathena Connection. Route data_source.get_athena_connection through the connector's shared _build_connect_kwargs builder, and widen the return-type annotation to a BackendOrConnection Union so the type matches reality without forcing a runtime pyathena import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reject Trino connection URLs without a username instead of silently falling back to "test"; raise a clear INVALID_CONNECTION_INFO error. - Wrap lazy ``import trino`` calls so users without the trino extra get a "pip install wren-engine[trino]" hint instead of a confusing ImportError. - Strip trailing whitespace and a single trailing semicolon before wrapping user SQL in ``SELECT * FROM (...) AS _sub LIMIT N`` for both ``query()`` and ``dry_run()``; previously a trailing ``;`` produced invalid SQL. - Route ``ConnectionUrl`` for trino through the native URL handler in ``DataSourceExtension.get_connection()`` so it no longer hits the removed generic ``ibis.connect()`` path. Adds regression unit tests covering each fix (semicolon stripping, missing-username URL, bad-scheme URL, and the ImportError path via monkeypatched ``__import__``). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously only the password got unquote_plus() — username, the database path, and the query-string handling left other components encoded, so URL-encoded specials (e.g. '@' in a username, spaces in a database name) reached the ODBC driver still percent-encoded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous branch silently mishandled asymmetric credentials: user=None with password set crashed in password.get_secret_value(), and user set with password=None emitted UID= without PWD=, producing an incomplete ODBC connection string. Now: both absent falls back to Trusted_Connection=yes, exactly one absent raises INVALID_CONNECTION_INFO, both present uses normal SQL auth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
statement_timeout was cast to int only after pyodbc.connect() returned, so a non-numeric value (e.g. "5s") leaked the just-opened connection when int() raised ValueError. Validate and cast up-front; raise INVALID_CONNECTION_INFO before any connect call. Also adds unit tests with mocked pyodbc covering all three review findings: URL component decoding, asymmetric auth rejection, and the pre-connect timeout validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The native MySQL/Doris path returns a raw MySQLdb.Connection rather than an ibis BaseBackend. Reflect that in the return type via a ConnectionHandle union so static analysis matches runtime behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…type Address CodeRabbit major findings on the native MySQL connector: - Initialize ``_closed`` before opening the cursor and close the connection if the ``ANSI_QUOTES`` init query raises, so a failed init never leaks a live socket. - Coerce the ``limit`` argument via ``int()`` and reject negative values before interpolating into SQL — the previous f-string was a SQL-injection vector when callers passed an attacker-controlled value. - Append ``LIMIT n`` to the user's SQL (stripping a trailing semicolon) instead of wrapping it in ``SELECT * FROM (...) AS _sub``; the subquery approach failed with ``ER_DUP_FIELDNAME`` whenever the inner SELECT projected two columns with the same name (e.g. ``SELECT a.id, b.id`` in a join). ``dry_run`` switches to ``EXPLAIN <sql>`` for the same reason and to remain side-effect-free. - Derive ``DECIMAL`` precision/scale from ``cursor.description`` (PEP 249 ``precision``/``scale`` fields) instead of hard-coding ``decimal128(38, 9)``; that loss of precision silently truncated MySQL columns with ``D > 9``. Precision is recovered from MySQLdb's display-length convention and clamped to Arrow's ``decimal128`` maximum (38) plus MySQL's scale ceiling (30). Adds regression tests for duplicate-column SELECT, SQL-injection limit rejection, trailing semicolon, large-scale decimal round-trip, and unit tests for the new helpers. ``just test-mysql`` now runs both the engine suite and the type-coverage connector suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review found _decimal_type defaulted scale=9 when the column typmod was missing, so Decimal.quantize silently rounded high-precision values (e.g. 18-significant-figure NUMERIC). Fall back to pa.string() for unconstrained NUMERIC and NUMERIC[] columns so the exact textual value round-trips. Same approach Trino's connector takes for dynamic-decimal casts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review found that query() and dry_run() wrap user SQL as
"SELECT * FROM ({sql}) AS _t LIMIT N", which Postgres/Canner reject
when the inner SQL ends in a semicolon. Add a _strip_trailing_semicolon
helper that only strips the terminating run of semicolons and
whitespace (so semicolons inside string literals are preserved) and
apply it on both call sites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lues
MySQL ``TIME`` ranges ``-838:59:59`` to ``838:59:59``. PyArrow
``time64("us")`` only accepts 0-24h positive values, so the previous
mapping silently truncated or corrupted any TIME value outside that
window. Map to ``duration("us")`` instead so the full MySQL range
round-trips, and convert MySQLdb's ``datetime.timedelta`` to signed
microseconds explicitly (``timedelta.days`` is signed for negative
values, ``seconds``/``microseconds`` are not — combining all three
recovers the signed total).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQL Server TINYINT is an unsigned 8-bit integer (0..255), but the arrow-type helper was branching on the sampled value sign and could fall back to int8. Map internal_size == 1 directly to pa.uint8() so the schema reflects the driver-declared type regardless of the rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DATETIMEOFFSET output converter assumed pyodbc would always hand it a 20-byte payload. Truncated or malformed buffers fell through and surfaced as a cryptic "month must be in 1..12" ValueError from datetime(). Reject non-20-byte payloads up front with a clear message that points at the actual length mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ettings In the ClickHouse client-kwargs assembly, ``out.update(kwargs)`` / ``client_kwargs.update(kwargs)`` would clobber the merged ``settings`` dict (carrying ``max_execution_time`` from ``statement_timeout``) whenever the caller also passed their own ``settings`` via ``kwargs``. Pop ``settings`` from incoming ``kwargs`` first and merge it into the local dict so the timeout survives. Also wrap driver ``TIMEOUT_EXCEEDED`` errors as the existing ``DatabaseTimeoutError`` instead of re-raising the raw driver exception, for consistency with the typed error model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap the readiness-loop client in try/finally so we close on failed attempts, not only on success. Replace the DuckDB TPCH extension fixture (which pulled the extension over the network on every run) with inline-fabricated rows so the test stays hermetic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous lazy initializers mutated module-level dicts/sets in place, so a thread that raced the first call could observe a partially-populated map. Switch each initializer to a @functools.cache'd accessor that returns a fully-built local container — the cache slot is GIL-protected and only published once initialization runs to completion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n sites ``pa.table(dict(zip(names, arrays)), schema=...)`` silently drops one of two columns that share a name because the intermediate dict collapses the key. Queries like ``SELECT a.id, b.id FROM t a JOIN t b`` would return a one-column table even though the cursor description carries both fields. Switch to ``pa.Table.from_arrays(arrays, schema=schema)`` so the schema is name-positional and both columns survive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``mysql:8`` is a moving tag — when Docker Hub publishes a new ``8.x`` the CI run can pick up an image with different default flags (e.g. the deprecated-features warning behaviour changed between 8.0 and 8.4). Pin to ``mysql:8.0.36`` so the connector test suite stays reproducible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit-test job runs without the mysql extra, so importing MySQLdb fails. Guard the test with pytest.importorskip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughReplaces ibis-backed connectors with native DB-API implementations, updates factory routing and dependency groups, and adds extensive unit and integration tests to validate type parsing, value coercion, connection creation, and query/dry_run/close behavior. ChangesMulti-Backend Native Connector Migration
Estimated code review effort 🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (7)
core/wren/tests/unit/test_athena_connector.py (2)
35-37: 💤 Low valueConsider using direct assignment for more reliable stubbing.
sys.modules.setdefaultwon't override ifpyathenais already imported (e.g., in a test environment where the package is installed and another test imported it first). If test isolation is critical, consider explicit assignment or a fixture that saves/restores the original.That said, for unit test suites with controlled import order, the current approach works.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/tests/unit/test_athena_connector.py` around lines 35 - 37, Replace the non-overriding stub registration using sys.modules.setdefault with an explicit assignment so the fake module reliably replaces any installed pyathena during the test: save the original sys.modules.get("pyathena") into a temp (so it can be restored), assign sys.modules["pyathena"] = _pyathena_module (which already sets _pyathena_module.connect = _fake_pyathena_connect), and restore the saved original after the test (or use a fixture/monkeypatch to set and undo the assignment) to ensure test isolation.
312-328: 💤 Low valueThe
object.__setattr__bypass is fragile but acceptable for this test.The comment explains the intent—testing that the kwargs builder defensively reads an optional
kwargsfield. IfAthenaConnectionInfolater adds a properkwargsfield, this test should be updated to use the model's constructor. For now, this adequately tests the connector's defensive handling.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/tests/unit/test_athena_connector.py` around lines 312 - 328, Test uses object.__setattr__ to inject a kwargs dict which is fragile; keep the current defensive injection for now but change the test to construct AthenaConnectionInfo with a real kwargs argument if/when AthenaConnectionInfo gains a kwargs field. Update the test function test_data_source_get_athena_connection_propagates_schema_and_kwargs to prefer calling the AthenaConnectionInfo constructor with kwargs (instead of object.__setattr__) when the model defines a kwargs attribute, otherwise fall back to the current object.__setattr__ injection; reference AthenaConnectionInfo and DataSourceExtension.get_athena_connection to locate the relevant code.core/wren/tests/unit/test_mssql_connection.py (1)
57-77: ⚡ Quick winAdd a password decode case that covers
+as space.This test currently validates
%20decoding but not+handling. Adding a+case will lock in the expectedunquote_plusbehavior and prevent subtle credential parsing regressions.Based on learnings: decode URL-derived passwords using
urllib.parse.unquote_plusso+is treated as a space, and keepparsed.usernameas-is unless a verified mismatch exists.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/tests/unit/test_mssql_connection.py` around lines 57 - 77, Add a test case that asserts plus-signs in URL-encoded passwords decode to spaces by updating test_mssql_url_decodes_user_database_and_password to include a password containing '+' (e.g., "p+ss+word") and assert DataSourceExtension.get_mssql_connection_from_url yields PWD "p ss word"; if the implementation does not already use urllib.parse.unquote_plus for decoding credentials, update the decoding in DataSourceExtension.get_mssql_connection_from_url (and any helpers used by _FakePyodbc/_parse_conn_str) to call urllib.parse.unquote_plus for the password (and other credential fields that should treat '+' as space) while leaving parsed.username unchanged unless you find a verified mismatch.core/wren/src/wren/connector/postgres.py (2)
239-249: ⚖️ Poor tradeoffTransaction left in aborted state after query errors.
When psycopg3 encounters an error during query execution, the transaction enters an aborted state requiring
rollback()to recover. The exception handlers wrap the error but don't rollback, leaving the connection unusable for subsequent queries. The test file confirms this by explicitly callingconnector.connection.rollback()as a workaround.If the connector is intended for single-use, this may be acceptable. Otherwise, consider rolling back on exceptions or enabling autocommit mode.
♻️ Option: Rollback on exception
except Exception as e: + try: + self.connection.rollback() + except Exception: + pass raise WrenError( ErrorCode.GENERIC_USER_ERROR, str(e), phase=ErrorPhase.SQL_EXECUTION, metadata={DIALECT_SQL: sql}, ) from eAlso applies to: 256-266
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/connector/postgres.py` around lines 239 - 249, The exception handlers that wrap DB errors (the except blocks that raise WrenError with ErrorPhase.SQL_EXECUTION and metadata {DIALECT_SQL: sql}) do not roll back the psycopg3 transaction, leaving the connection aborted; update those handlers to call connection.rollback() (e.g., self.connection.rollback() or connector.connection.rollback()) before raising so the connection is recovered for further use, and apply the same change to the other symmetric handler block referenced (the one around lines handling 256-266).
231-233: ⚡ Quick winConsider validating
limitdefensively to prevent SQL injection.While
limitis typed asint | None, Python doesn't enforce this at runtime. A caller violating the type contract could inject arbitrary SQL. Given the PR objectives mention "limit sanitization to avoid SQL injection," consider adding explicit validation.🛡️ Proposed defensive validation
def query(self, sql: str, limit: int | None = None) -> pa.Table: if limit is not None: + if not isinstance(limit, int) or limit < 0: + raise ValueError("limit must be a non-negative integer") sql = f"SELECT * FROM ({sql}) AS _sub LIMIT {limit}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/connector/postgres.py` around lines 231 - 233, The query method builds a LIMIT clause by interpolating the limit value directly; add defensive validation in query(sql: str, limit: int | None = None) to ensure limit is an actual non-negative integer (e.g., check isinstance(limit, int) and limit >= 0 or attempt safe int() conversion and reject on failure) and raise TypeError/ValueError for invalid values before constructing sql; keep using the local variable limit only after validation so the f"SELECT * FROM ({sql}) AS _sub LIMIT {limit}" interpolation cannot be exploited by a non-integer caller.core/wren/src/wren/model/data_source.py (1)
57-59: 💤 Low valueIncomplete
BackendOrConnectiontype alias.The union is missing types returned by other connectors:
clickhouse_connect.driver.client.Client(ClickHouse),pyodbc.Connection(MSSQL), andtrino.dbapi.Connection(Trino). This causes type-checker mismatches for callers.Suggested fix
if TYPE_CHECKING: import MySQLdb import psycopg from pyathena.connection import Connection as PyAthenaConnection + from clickhouse_connect.driver.client import Client as ClickHouseClient + from trino.dbapi import Connection as TrinoConnection BackendOrConnection = Union[ - BaseBackend, "PyAthenaConnection", "MySQLdb.Connection", "psycopg.Connection" + BaseBackend, + "PyAthenaConnection", + "MySQLdb.Connection", + "psycopg.Connection", + "pyodbc.Connection", + "ClickHouseClient", + "TrinoConnection", ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/model/data_source.py` around lines 57 - 59, The BackendOrConnection type alias in data_source.py is missing connector return types causing mypy failures; update the BackendOrConnection Union (symbol: BackendOrConnection) to include clickhouse_connect.driver.client.Client, pyodbc.Connection, and trino.dbapi.Connection, and ensure imports are added safely (use string literals or guard imports with TYPE_CHECKING) so the new types are available to the type-checker without causing runtime import side-effects.core/wren/tests/connectors/test_trino.py (1)
275-297: 💤 Low valueConsider retry logic instead of fixed sleep for container readiness.
The
time.sleep(5)is a pragmatic workaround for coordinator readiness, but it could either be too short (causing flaky tests) or too long (slowing CI). A retry loop with exponential backoff would be more robust.That said, this is test code and the current approach is documented. This is a minor improvement that can be deferred.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/tests/connectors/test_trino.py` around lines 275 - 297, The test fixture engine currently uses a fixed time.sleep(5) after starting TrinoContainer which can be flaky or slow; replace that fixed sleep with a retry loop that polls the coordinator readiness (e.g., attempting a lightweight query or connection) with exponential backoff and a max timeout before calling _create_tpch_tables, referencing the engine fixture, TrinoContainer, and _create_tpch_tables to locate where to implement the retry; ensure the loop aborts with a clear test failure if readiness isn't achieved within the timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/src/wren/connector/clickhouse.py`:
- Around line 332-343: In dry_run, the input SQL isn't stripped of trailing
semicolons before being wrapped in the subquery; update the dry_run method
(function dry_run) to strip whitespace and trailing semicolons from the sql
argument (e.g., sql = sql.strip().rstrip(";")) before building the f"SELECT *
FROM ({sql}) AS _wren_sub LIMIT 0" query, preserving the existing exception
handling that maps TIMEOUT_EXCEEDED to DatabaseTimeoutError and other errors to
WrenError (ErrorCode.INVALID_SQL, phase=ErrorPhase.SQL_DRY_RUN,
metadata={DIALECT_SQL: sql}).
- Around line 264-265: The username/password extracted from the URL
(parsed.username and parsed.password) must be URL-decoded before use; update the
ClickHouse connector code that builds the auth dict to pass parsed.username and
parsed.password through urllib.parse.unquote_plus (or the repo's URL-decoding
helper) so percent-encoded characters (e.g. %40) are converted to their literal
characters before being sent to ClickHouse.
- Around line 315-330: The query method currently wraps the incoming sql into a
subquery without removing trailing semicolons, causing wrapped statements like
"SELECT * FROM (SELECT 1;) ..." to be invalid; fix by trimming trailing
semicolons and whitespace from the input sql before building the wrapped
statement (e.g., compute a cleaned_sql from the sql parameter and use that when
constructing statement when limit is set), but keep the original sql in
metadata={DIALECT_SQL: sql} if you want to preserve the caller's input; update
the logic in query (refer to the statement variable and the sql parameter) to
use the cleaned SQL for execution.
In `@core/wren/src/wren/connector/mssql.py`:
- Around line 41-50: The result construction currently zips names into a dict
which collapses duplicate column names; replace the dict(zip(names, arrays))
approach in the return of the method with building the pyarrow Table directly
from the arrays and the arrow_schema while preserving the original names (e.g.
use pa.Table.from_arrays or equivalent with names=names and
schema=arrow_schema). Locate the call site that uses cursor.description,
_build_mssql_arrow_schema, and _build_mssql_column and change the final return
to construct the table from arrays + schema + names so duplicate projected
columns (e.g. SELECT a, a) are preserved.
---
Nitpick comments:
In `@core/wren/src/wren/connector/postgres.py`:
- Around line 239-249: The exception handlers that wrap DB errors (the except
blocks that raise WrenError with ErrorPhase.SQL_EXECUTION and metadata
{DIALECT_SQL: sql}) do not roll back the psycopg3 transaction, leaving the
connection aborted; update those handlers to call connection.rollback() (e.g.,
self.connection.rollback() or connector.connection.rollback()) before raising so
the connection is recovered for further use, and apply the same change to the
other symmetric handler block referenced (the one around lines handling
256-266).
- Around line 231-233: The query method builds a LIMIT clause by interpolating
the limit value directly; add defensive validation in query(sql: str, limit: int
| None = None) to ensure limit is an actual non-negative integer (e.g., check
isinstance(limit, int) and limit >= 0 or attempt safe int() conversion and
reject on failure) and raise TypeError/ValueError for invalid values before
constructing sql; keep using the local variable limit only after validation so
the f"SELECT * FROM ({sql}) AS _sub LIMIT {limit}" interpolation cannot be
exploited by a non-integer caller.
In `@core/wren/src/wren/model/data_source.py`:
- Around line 57-59: The BackendOrConnection type alias in data_source.py is
missing connector return types causing mypy failures; update the
BackendOrConnection Union (symbol: BackendOrConnection) to include
clickhouse_connect.driver.client.Client, pyodbc.Connection, and
trino.dbapi.Connection, and ensure imports are added safely (use string literals
or guard imports with TYPE_CHECKING) so the new types are available to the
type-checker without causing runtime import side-effects.
In `@core/wren/tests/connectors/test_trino.py`:
- Around line 275-297: The test fixture engine currently uses a fixed
time.sleep(5) after starting TrinoContainer which can be flaky or slow; replace
that fixed sleep with a retry loop that polls the coordinator readiness (e.g.,
attempting a lightweight query or connection) with exponential backoff and a max
timeout before calling _create_tpch_tables, referencing the engine fixture,
TrinoContainer, and _create_tpch_tables to locate where to implement the retry;
ensure the loop aborts with a clear test failure if readiness isn't achieved
within the timeout.
In `@core/wren/tests/unit/test_athena_connector.py`:
- Around line 35-37: Replace the non-overriding stub registration using
sys.modules.setdefault with an explicit assignment so the fake module reliably
replaces any installed pyathena during the test: save the original
sys.modules.get("pyathena") into a temp (so it can be restored), assign
sys.modules["pyathena"] = _pyathena_module (which already sets
_pyathena_module.connect = _fake_pyathena_connect), and restore the saved
original after the test (or use a fixture/monkeypatch to set and undo the
assignment) to ensure test isolation.
- Around line 312-328: Test uses object.__setattr__ to inject a kwargs dict
which is fragile; keep the current defensive injection for now but change the
test to construct AthenaConnectionInfo with a real kwargs argument if/when
AthenaConnectionInfo gains a kwargs field. Update the test function
test_data_source_get_athena_connection_propagates_schema_and_kwargs to prefer
calling the AthenaConnectionInfo constructor with kwargs (instead of
object.__setattr__) when the model defines a kwargs attribute, otherwise fall
back to the current object.__setattr__ injection; reference AthenaConnectionInfo
and DataSourceExtension.get_athena_connection to locate the relevant code.
In `@core/wren/tests/unit/test_mssql_connection.py`:
- Around line 57-77: Add a test case that asserts plus-signs in URL-encoded
passwords decode to spaces by updating
test_mssql_url_decodes_user_database_and_password to include a password
containing '+' (e.g., "p+ss+word") and assert
DataSourceExtension.get_mssql_connection_from_url yields PWD "p ss word"; if the
implementation does not already use urllib.parse.unquote_plus for decoding
credentials, update the decoding in
DataSourceExtension.get_mssql_connection_from_url (and any helpers used by
_FakePyodbc/_parse_conn_str) to call urllib.parse.unquote_plus for the password
(and other credential fields that should treat '+' as space) while leaving
parsed.username unchanged unless you find a verified mismatch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6d880d85-dd2d-463e-a089-800df247ba4e
⛔ Files ignored due to path filters (1)
core/wren/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (24)
core/wren/justfilecore/wren/pyproject.tomlcore/wren/src/wren/connector/athena.pycore/wren/src/wren/connector/base.pycore/wren/src/wren/connector/canner.pycore/wren/src/wren/connector/clickhouse.pycore/wren/src/wren/connector/factory.pycore/wren/src/wren/connector/ibis.pycore/wren/src/wren/connector/mssql.pycore/wren/src/wren/connector/mysql.pycore/wren/src/wren/connector/postgres.pycore/wren/src/wren/connector/trino.pycore/wren/src/wren/model/data_source.pycore/wren/tests/conftest.pycore/wren/tests/connectors/test_canner.pycore/wren/tests/connectors/test_clickhouse.pycore/wren/tests/connectors/test_mssql.pycore/wren/tests/connectors/test_mysql.pycore/wren/tests/connectors/test_mysql_connector.pycore/wren/tests/connectors/test_postgres.pycore/wren/tests/connectors/test_trino.pycore/wren/tests/unit/test_athena_connector.pycore/wren/tests/unit/test_mssql_connection.pycore/wren/tests/unit/test_mysql_helpers.py
💤 Files with no reviewable changes (1)
- core/wren/src/wren/connector/ibis.py
- clickhouse._build_clickhouse_client_kwargs: unquote_plus username/password
parsed from a connection URL so percent-encoded credentials (e.g. %40 → @)
don't reach ClickHouse verbatim and fail auth.
- clickhouse.ClickHouseConnector.{query,dry_run}: strip the terminating run
of ;/whitespace before wrapping in `SELECT * FROM (...) AS _wren_sub LIMIT N`
— otherwise `SELECT 1;` becomes invalid SQL. Reuse canner's
`[;\s]+\Z` regex so semicolons inside string literals are preserved.
- mssql.MSSqlConnector.query and clickhouse._build_clickhouse_arrow_table:
build the result table via `pa.Table.from_arrays(arrays, schema=...)`
instead of `dict(zip(names, arrays))`, which silently collapses duplicate
column names from projections like `SELECT a, a`.
Adds pure-Python regression tests under tests/unit/ for all four behaviours.
Ibis is no longer a load-bearing dependency for the bulk of connectors after the recent direct-connector refactors (canner/postgres/mysql/mssql/ trino/clickhouse/athena dropped ibis in #2313, snowflake in #2268); the intro now only credits Apache DataFusion as the engine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Consolidates seven open PRs that each migrate one
core/wrenconnector offibis-frameworkto its native driver. They were merged into a single feature branch so conflicts on shared files (data_source.py,pyproject.toml,uv.lock,factory.py,tests/conftest.py,justfile) only had to be resolved once.Bundled PRs:
All seven were merged with
--no-ffto preserve their individual commit history.Conflict resolutions
core/wren/src/wren/model/data_source.py— unified the return-type alias toBackendOrConnection = Union[BaseBackend, "PyAthenaConnection", "MySQLdb.Connection", "psycopg.Connection"]; consolidated TYPE_CHECKING imports; kept the URL-dispatch fast-paths added by mysql/doris and trino PRs; dropped the now-orphanedimport ssl,SSLMode, and_create_ssl_contexthelper (mysql PR moved SSL handling intowren.connector.mysql); dropped the orphanedimport boto3(athena PR moved STS intowren.connector.athena).core/wren/src/wren/connector/factory.py— every connector now resolves to its own module (athena/clickhouse/trino out ofwren.connector.ibis);_NEEDS_DATA_SOURCEtrimmed to{mysql, doris, trino}.core/wren/src/wren/connector/ibis.py— deleted; no remaining datasource resolves to it after the migrations.core/wren/pyproject.toml— optional-deps union:psycopg[binary],mysqlclient,pyodbc,clickhouse-connect,trino,pyathena,snowflake-connector-python(noibis-framework[<x>]extras for the migrated connectors).core/wren/uv.lock— regenerated once at the end.core/wren/tests/conftest.pyandcore/wren/justfile— additive concatenation of new pytest markers andjust test-<connector>recipes.Test plan
cd core/wren && uv lock— clean resolveuv run ruff check src tests— same 32 pre-existing errors asorigin/main, no regressionsuv run pytest tests/unit/test_athena_connector.py tests/unit/test_mssql_connection.py tests/unit/test_mysql_helpers.py— 54 passed, 1 skippedjust test-canner,test-postgres,test-mysql,test-mssql,test-clickhouse,test-trino,test-athenaNotes
tests/unit/test_context_cli.py::test_validate_strict_warnsand 5tests/unit/test_cube_cli.pycases that need a rebuiltwren-core-pywheel.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Tests