refactor(canner): use psycopg native driver, drop ibis dependency #2269

Closed
goldmedal wants to merge 5 commits into main from refactor/canner-native-driver

Conversation

@goldmedal
Collaborator

@goldmedal goldmedal commented May 14, 2026

Summary

Canner Enterprise speaks the Postgres wire protocol; the connector now uses
psycopg directly instead of the ibis postgres backend.

Changes

  • connector/canner.py: native psycopg cursor with a self-contained PG OID -> Arrow type map covering the canner-flavoured types (VARCHAR/CHAR -> string, DECIMAL -> decimal128, BIGINT/INT/SMALLINT -> int, BOOLEAN -> bool, DATE/TIMESTAMP/TIMESTAMPTZ -> date/timestamp, ROW/ARRAY/MAP serialised as JSON strings). Errors are wrapped as WrenError with the dialect SQL attached, mirroring the existing postgres connector contract.
  • model/data_source.py::get_canner_connection: returns a psycopg.Connection (autocommit) instead of an ibis backend.
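
The OID -> Arrow mapping described above could look roughly like the following sketch. It is not the connector's actual code: the OIDs are the stable built-in PostgreSQL values, but the helper name `arrow_type_name`, the choice of integer widths and timestamp units, and the string fallback for unknown OIDs are assumptions made for illustration. Arrow types are represented by name so the sketch runs without pyarrow.

```python
from typing import Optional

# Built-in PostgreSQL type OIDs (stable values from the pg_type catalog).
# The Arrow type names on the right are assumptions for this sketch.
_PG_OID_TO_ARROW_NAME = {
    16: "bool",                     # boolean
    20: "int64",                    # bigint
    21: "int16",                    # smallint
    23: "int32",                    # integer
    25: "string",                   # text
    1042: "string",                 # char(n)
    1043: "string",                 # varchar
    1082: "date32",                 # date
    1114: "timestamp[us]",          # timestamp
    1184: "timestamp[us, tz=UTC]",  # timestamptz
    114: "string",                  # json, serialised as a JSON string
    3802: "string",                 # jsonb, serialised as a JSON string
}
_NUMERIC_OID = 1700


def arrow_type_name(
    oid: int, precision: Optional[int] = None, scale: Optional[int] = None
) -> str:
    """Return the (hypothetical) Arrow type name for a Postgres type OID."""
    if oid == _NUMERIC_OID:
        if precision is not None and scale is not None:
            return f"decimal128({precision}, {scale})"
        # Assumption: without a declared precision/scale, keep the text form.
        return "string"
    # Assumption: anything unmapped (ROW/ARRAY/MAP, etc.) falls back to string.
    return _PG_OID_TO_ARROW_NAME.get(oid, "string")
```

In a real connector the OID, precision, and scale would come from the driver's cursor description for each result column.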

Tests

  • tests/connectors/test_canner.py exercises the type-mapping helpers and runs the connector against a PostgresContainer with the common canner result types (incl. JSON/JSONB and arrays).
  • New canner marker registered in tests/conftest.py; just test-canner target added.

Test plan

  • just test-canner (7 passed)
  • just lint
  • Verify against a live canner endpoint in a downstream environment

Wren Engine PR template acceptance criteria:

  • No ibis import path remains through wren.connector.canner.
  • Tests green.

Summary by CodeRabbit

  • New Features

    • Connector now runs queries directly against Postgres with improved type mapping and more accurate handling of numerics, JSON, arrays, and limits; dry-run and connection-close behavior improved.
  • Tests

    • Added comprehensive unit and integration tests covering types, query execution, limits, dry-run, error handling, and regressions; pytest marker for connector tests added.
  • Chores

    • New test-canner recipe to run the connector test suite.


Canner Enterprise speaks the Postgres wire protocol; the connector now
uses psycopg directly instead of the ibis postgres backend.

Changes
- `connector/canner.py`: native psycopg cursor with a self-contained
  PG OID -> Arrow type map covering the canner-flavoured types
  (VARCHAR/CHAR -> string, DECIMAL -> decimal128, BIGINT/INT/SMALLINT
  -> int, BOOLEAN -> bool, DATE/TIMESTAMP/TIMESTAMPTZ -> date/timestamp,
  ROW/ARRAY/MAP serialised as JSON strings). Errors are wrapped as
  WrenError with the dialect SQL attached, mirroring the existing
  postgres connector contract.
- `model/data_source.py::get_canner_connection`: returns a
  `psycopg.Connection` (autocommit) instead of an ibis backend.

Tests
- `tests/connectors/test_canner.py` exercises the type-mapping helpers
  and runs the connector against a PostgresContainer with the common
  canner result types (incl. JSON/JSONB and arrays). Marker `canner`
  is registered in `tests/conftest.py` and a `just test-canner` target
  is added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c2f35e11-3805-4f24-a496-d8347a70a0e5

📥 Commits

Reviewing files that changed from the base of the PR and between bb18b0f and 43ebe33.

📒 Files selected for processing (2)
  • core/wren/src/wren/connector/canner.py
  • core/wren/tests/connectors/test_canner.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • core/wren/src/wren/connector/canner.py
  • core/wren/tests/connectors/test_canner.py

Walkthrough

Replaces ibis-backed Canner connector with direct psycopg execution, adds Postgres OID→PyArrow type mapping and Arrow table construction, updates connection creation to use psycopg with autocommit, registers a canner pytest marker, and adds unit and testcontainer-based integration tests plus a justfile recipe.

Changes

Canner psycopg migration

| Layer / File(s) | Summary |
|---|---|
| Type mapping and conversion utilities (core/wren/src/wren/connector/canner.py) | Postgres OID→PyArrow mapping, decimal precision/scale selection, decimal quantization/coercion, and Arrow array building with JSON/binary/list handling; table construction preserving duplicate names and typed empty columns. |
| Connector query execution (core/wren/src/wren/connector/canner.py) | CannerConnector executes SQL via psycopg cursor (optional LIMIT wrapper), constructs pyarrow.Table from cursor results, maps SQL errors to WrenError with SQL metadata, implements dry_run as LIMIT 0, and adds guarded close. |
| Connection factory (core/wren/src/wren/model/data_source.py) | DataSourceExtension.get_canner_connection now returns a psycopg.connect(...) with dbname=info.workspace, user/password from connection info, host/port, and autocommit=True. |
| Test setup and unit tests (core/wren/tests/conftest.py, core/wren/tests/connectors/test_canner.py) | Registered canner pytest marker; unit tests validate _arrow_type mappings, _build_column serialization/quantization, semicolon stripping; dry_run unit test; fixture DDL and rows cover scalar, numeric, temporal, array, and JSON types. |
| Integration/e2e tests (core/wren/tests/connectors/test_canner.py) | Testcontainers Postgres fixture and end-to-end tests asserting query() returns expected Arrow schema/values, limit enforcement, duplicate column-name preservation, dry_run success/failure, numeric/string round-trip regressions, and semicolon-wrapping behavior. |
| Test automation recipe (core/wren/justfile) | Added test-canner recipe to run the connector tests with pytest marker canner. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant CannerConnector
  participant psycopg
  participant PyArrow

  Caller->>CannerConnector: query(sql, limit)
  CannerConnector->>CannerConnector: wrap SQL with LIMIT if provided
  CannerConnector->>psycopg: cursor.execute(wrapped_sql)
  psycopg-->>CannerConnector: cursor.description + rows
  CannerConnector->>PyArrow: _build_arrow_table(cursor)
  PyArrow-->>CannerConnector: pyarrow.Table
  CannerConnector-->>Caller: return Table or raise WrenError (with metadata[DIALECT_SQL])

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped from ibis to psycopg land,
Rows became Arrows placed by hand,
Decimals trimmed, JSON twined as strings,
Containers hummed and the test bell rings,
A closed connection, and a recipe to stand.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: refactoring the Canner connector to use psycopg native driver instead of ibis, which is the primary objective across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions github-actions Bot added the python (Pull requests that update Python code) and core labels May 14, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
core/wren/tests/connectors/test_canner.py (1)

208-212: ⚡ Quick win

Assert the SQL payload in WrenError for invalid dry-run

Lines 211-212 verify only the exception type. Please also assert that the failing SQL is attached to the raised WrenError so the connector error contract is locked in by tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/connectors/test_canner.py` around lines 208 - 212, The test
only checks the exception type; update
test_canner_connector_dry_run_raises_for_invalid_sql to capture the raised
WrenError via pytest.raises as excinfo when calling
canner_connector.dry_run("SELECT * FROM no_such_table") and assert the failing
SQL is attached to the error—e.g., assert getattr(excinfo.value, "sql", None) ==
"SELECT * FROM no_such_table" and/or assert "SELECT * FROM no_such_table" in
str(excinfo.value) so the connector's error contract (WrenError carrying the SQL
payload) is enforced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/wren/src/wren/connector/canner.py`:
- Line 210: The current return uses dict(zip([field.name for field in schema],
arrays)) which will drop duplicate column names; instead pass the arrays and
schema directly to PyArrow so duplicate field names are preserved—replace the
dict-based construction with a call like pa.table(arrays, schema=schema) or
pa.Table.from_arrays(arrays, schema=schema), keeping the existing schema and
arrays variables and the pa.table / pa.Table API to avoid name-based
overwriting.
- Around line 124-133: The code currently maps SQL NULLs for JSON/JSONB
(pg_type_oid in {114, 3802}) to the string "null" inside the arrow_type ==
pa.string() handling, which changes NULL semantics; instead, stop coercing None
to "null" — treat None like any other NULL by appending None to processed.
Locate the block handling arrow_type == pa.string(), remove the special-case
that appends "null" for pg_type_oid 114/3802, and ensure the loop over values
appends None for SQL NULLs (value is None) so downstream logic receives actual
nulls rather than the literal string; keep existing behavior for non-None values
unchanged.
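
The NULL-handling point in the second inline comment can be illustrated with a small stdlib-only sketch. The function name `serialize_json_column` is hypothetical and not taken from the connector; it only shows the difference between passing `None` through and serialising it to the literal string "null".

```python
import json
from typing import Any, List, Optional


def serialize_json_column(values: List[Any]) -> List[Optional[str]]:
    """Serialise JSON/JSONB cell values to strings, keeping SQL NULL as None."""
    processed: List[Optional[str]] = []
    for value in values:
        if value is None:
            # SQL NULL stays a real null instead of the text "null".
            processed.append(None)
        else:
            processed.append(json.dumps(value))
    return processed


# The behaviour the review flagged: json.dumps(None) yields the 4-character
# string "null", which downstream code can no longer tell apart from data.
coerced = json.dumps(None)
```

With the pass-through version, a downstream `IS NULL`-style check on the Arrow column still behaves as it would against the database.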


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b5353458-9c2d-4703-843a-8f6eb0320e40

📥 Commits

Reviewing files that changed from the base of the PR and between 544ecab and a372ba4.

📒 Files selected for processing (5)
  • core/wren/justfile
  • core/wren/src/wren/connector/canner.py
  • core/wren/src/wren/model/data_source.py
  • core/wren/tests/conftest.py
  • core/wren/tests/connectors/test_canner.py

…bles

SQL NULL in json/jsonb columns was being coerced into the string "null",
breaking NULL semantics downstream. Drop the oid-114/3802 special case so
None passes through unchanged.

Arrow tables were built via dict(zip(names, arrays)), which silently
drops duplicate column names (e.g. self-joins projecting two `id`
columns). Switch to pa.Table.from_arrays(..., schema=schema) so
positional construction keeps duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
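
The duplicate-column problem in this commit message is a property of Python dicts rather than of Arrow, so it can be shown without pyarrow. The sketch below uses made-up column data; only the `dict(zip(names, arrays))` pattern comes from the thread.

```python
# Two result columns share the name "id", as in a self-join projection.
names = ["id", "id", "name"]
arrays = [[1, 2], [10, 20], ["a", "b"]]

# Name-keyed construction: the second "id" silently overwrites the first.
by_name = dict(zip(names, arrays))

# Positional construction: every column survives, in order, with its name.
positional = list(zip(names, arrays))
```

Here `by_name` ends up with two entries while `positional` keeps all three, which is why the commit switches to a positional API (`pa.Table.from_arrays(..., schema=schema)`).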
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/wren/src/wren/connector/canner.py`:
- Around line 91-100: The current _decimal_type function defaults unconstrained
NUMERIC columns to scale=9 which causes silent rounding later where values are
quantized; change the logic so that when column.scale is None you do NOT assign
a fixed scale (remove scale=9 default) and instead either (a) infer
precision/scale from observed row values before constructing pa.decimal128, or
(b) preserve values as strings or decimal.Decimal until an explicit
precision/scale is decided; update any downstream quantization code that
currently forces values to the _decimal_type scale (the quantization block
referenced around lines 114–116 and the type-resolution paths handling NUMERIC
and NUMERIC[] arrays) to respect the "unspecified scale" case and avoid rounding
unless an explicit target scale is chosen. Ensure references to _decimal_type,
the NUMERIC handling branch, and the array NUMERIC[] branch are updated so
unconstrained NUMERIC stays lossless until an explicit precision/scale is
determined.
- Line 221: The SQL string is embedded into subqueries at the points where sql
is wrapped (e.g., in the dry_run() path and the limited query wrapper that
builds "SELECT * FROM ({sql}) AS _t LIMIT {limit}"), and trailing semicolons
cause PostgreSQL to reject the subselect; before wrapping the user-supplied sql
variable, strip any trailing semicolons and whitespace (e.g., sql =
sql.rstrip().rstrip(';')) so the constructed subquery is valid—apply this
normalization wherever the code builds subqueries from the sql variable
(referencing the sql variable and the dry_run/limited-query construction sites).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2b93b718-223d-4ae1-884a-9809f3cb7bff

📥 Commits

Reviewing files that changed from the base of the PR and between a372ba4 and 793fdda.

📒 Files selected for processing (2)
  • core/wren/src/wren/connector/canner.py
  • core/wren/tests/connectors/test_canner.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/wren/tests/connectors/test_canner.py

goldmedal and others added 2 commits May 14, 2026 11:58
Round-2 review found _decimal_type defaulted scale=9 when the column
typmod was missing, so Decimal.quantize silently rounded high-precision
values (e.g. 18-significant-figure NUMERIC). Fall back to pa.string() for
unconstrained NUMERIC and NUMERIC[] columns so the exact textual value
round-trips. Same approach Trino's connector takes for dynamic-decimal
casts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
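
The rounding hazard described in this commit message can be reproduced with the standard `decimal` module. The sketch below is illustrative only: `to_arrow_value` is a hypothetical helper, and the text fallback mirrors the approach the commit describes rather than the connector's actual code.

```python
from decimal import Decimal
from typing import Optional, Union

# An 18-decimal-place NUMERIC value from an unconstrained column.
value = Decimal("1.234567890123456789")

# Forcing a default scale of 9 drops the trailing digits without any error.
rounded = value.quantize(Decimal(1).scaleb(-9))


def to_arrow_value(value: Decimal, scale: Optional[int]) -> Union[Decimal, str]:
    """Quantize only when the column declares a scale; otherwise keep text."""
    if scale is None:
        # Unconstrained NUMERIC: the exact textual value round-trips.
        return str(value)
    return value.quantize(Decimal(1).scaleb(-scale))
```

The trade-off is that unconstrained columns arrive as strings, so any consumer that needs arithmetic has to parse them back into a decimal type explicitly.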
Round-2 review found that query() and dry_run() wrap user SQL as
"SELECT * FROM ({sql}) AS _t LIMIT N", which Postgres/Canner reject
when the inner SQL ends in a semicolon. Add a _strip_trailing_semicolon
helper that only strips the terminating run of semicolons and
whitespace (so semicolons inside string literals are preserved) and
apply it on both call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
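
A minimal sketch of the semicolon handling described in this commit message follows. The function names are hypothetical stand-ins for the `_strip_trailing_semicolon` helper mentioned above, and the wrapper string is the one quoted in the review; the real implementation may differ.

```python
def strip_trailing_semicolon(sql: str) -> str:
    """Remove only the terminating run of semicolons and whitespace."""
    # rstrip with a character set stops at the first other character,
    # so semicolons inside string literals earlier in the text are untouched.
    return sql.rstrip("; \t\r\n")


def wrap_with_limit(sql: str, limit: int) -> str:
    """Wrap user SQL in a subquery so a LIMIT can be applied from outside."""
    inner = strip_trailing_semicolon(sql)
    return f"SELECT * FROM ({inner}) AS _t LIMIT {int(limit)}"


# dry_run is described in the walkthrough as the same wrapper with LIMIT 0.
dry_run_sql = wrap_with_limit("SELECT 1;", 0)
```

One case this sketch does not address is SQL that ends in a line comment, where trailing text is not a statement terminator at all.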
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
core/wren/tests/connectors/test_canner.py (1)

273-277: ⚡ Quick win

Assert the WrenError metadata contract in this failure-path test.

Line 276 currently verifies only exception type. The connector contract also requires dialect SQL to be attached; assert it here so regressions are caught.

Proposed test tightening
 def test_canner_connector_dry_run_raises_for_invalid_sql(canner_connector) -> None:
-    from wren.model.error import WrenError  # noqa: PLC0415
+    from wren.model.error import DIALECT_SQL, WrenError  # noqa: PLC0415

-    with pytest.raises(WrenError):
-        canner_connector.dry_run("SELECT * FROM no_such_table")
+    sql = "SELECT * FROM no_such_table"
+    with pytest.raises(WrenError) as exc:
+        canner_connector.dry_run(sql)
+    assert exc.value.metadata[DIALECT_SQL] == sql
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/connectors/test_canner.py` around lines 273 - 277, Update the
test_canner_connector_dry_run_raises_for_invalid_sql to capture the raised
WrenError (use pytest.raises as excinfo) and assert the connector metadata
contract: verify excinfo.value.metadata contains the "dialect" key (e.g., equals
"sql") and that the "sql" (or "query") metadata includes the attempted SQL
string (e.g., contains "SELECT * FROM no_such_table"); locate this change around
the test_canner_connector_dry_run_raises_for_invalid_sql function and the
canner_connector.dry_run call.
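
The error contract under discussion (a driver failure re-raised with the dialect SQL attached) can be sketched without the Wren codebase. Everything named here is a stand-in: `SketchError` plays the role of `WrenError`, and the value of `DIALECT_SQL` is an assumed placeholder, not the real constant from `wren.model.error`.

```python
from typing import Any, Callable, Dict, Optional

DIALECT_SQL = "dialect_sql"  # placeholder key; the real constant may differ


class SketchError(Exception):
    """Stand-in for WrenError: a message plus a metadata dictionary."""

    def __init__(self, message: str, metadata: Optional[Dict[str, Any]] = None):
        super().__init__(message)
        self.metadata = metadata or {}


def run_with_sql_context(execute: Callable[[str], Any], sql: str) -> Any:
    """Run `execute(sql)` and attach the SQL to any failure it raises."""
    try:
        return execute(sql)
    except Exception as exc:
        raise SketchError(str(exc), metadata={DIALECT_SQL: sql}) from exc
```

A test in the style of the proposed diff would then catch the error and compare `metadata[DIALECT_SQL]` with the SQL that was submitted.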

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0860f7a5-42d4-46b2-a4c2-59ae128f2e10

📥 Commits

Reviewing files that changed from the base of the PR and between 793fdda and bb18b0f.

📒 Files selected for processing (2)
  • core/wren/src/wren/connector/canner.py
  • core/wren/tests/connectors/test_canner.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/wren/src/wren/connector/canner.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@goldmedal
Collaborator Author

Re: CodeRabbit nitpicks on test_canner.py:208-212 and test_canner.py:273-277 (tighten WrenError contract on dry-run failure paths) — skipping. The existing pytest.raises(WrenError) checks are sufficient for the error-type contract; the DIALECT_SQL metadata is already covered by other tests in the file, and we keep the assertion style consistent across the suite.

@goldmedal
Collaborator Author

Superseded by #2313 — these seven native-driver refactors were consolidated into a single feature branch to resolve shared-file conflicts (data_source.py, pyproject.toml, uv.lock, factory.py, etc.) once instead of seven times.

@goldmedal goldmedal closed this May 21, 2026