refactor(clickhouse): use clickhouse-connect native driver, drop ibis dependency#2275

Closed
goldmedal wants to merge 3 commits into main from refactor/clickhouse-native-driver

@goldmedal
Collaborator

@goldmedal goldmedal commented May 14, 2026

Summary

The ClickHouse connector now uses clickhouse-connect directly, parsing
ClickHouse type strings via sqlglot to build PyArrow schemas — no more
detour through the ibis-project clickhouse backend.

  • connector/clickhouse.py: native client; type lexer covers
    Nullable(T) / LowCardinality(T) / Array(T) / Tuple(...) /
    Map(K,V) / DateTime64(p, 'TZ') / Decimal(p,s), plus
    Int128/256 and UInt128/256 (surfaced as string to avoid silent
    truncation past 64-bit Arrow widths).
  • model/data_source.py::get_clickhouse_connection now returns a
    clickhouse_connect.Client; _handle_clickhouse_url also accepts
    clickhouse+http:// and clickhouse+https:// URLs.
  • pyproject.toml: clickhouse extra → clickhouse-connect>=0.8
    (was ibis-framework[clickhouse]).
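
To make the type-lexer bullet concrete, here is a minimal stdlib-only sketch of parsing nested ClickHouse type strings. It is not the PR's implementation (which goes through sqlglot and produces PyArrow types); `CHType`, `parse_type`, and `arrow_hint` are hypothetical names, and the strings returned by `arrow_hint` are informal labels rather than PyArrow objects. Named tuple elements such as `Tuple(a Int32)` are deliberately not handled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CHType:
    """Hypothetical parsed form of a ClickHouse type string."""

    name: str
    args: tuple = ()


def _split_top_level(body: str) -> list[str]:
    """Split on commas that sit outside parentheses and single quotes."""
    parts, depth, start, in_quote = [], 0, 0, False
    for i, ch in enumerate(body):
        if ch == "'":
            in_quote = not in_quote
        elif in_quote:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i].strip())
            start = i + 1
    parts.append(body[start:].strip())
    return parts


def _parse_arg(part: str):
    if part.isdigit():
        return int(part)  # precision / scale / length
    if part.startswith("'"):
        return part  # quoted literal (e.g. a time zone), kept verbatim
    return parse_type(part)


def parse_type(text: str) -> CHType:
    """Parse e.g. "Map(String, Array(Nullable(Int64)))" into a CHType tree."""
    text = text.strip()
    if "(" not in text:
        return CHType(text)
    if not text.endswith(")"):
        raise ValueError(f"unbalanced type string: {text!r}")
    name, body = text.split("(", 1)
    return CHType(name.strip(), tuple(_parse_arg(p) for p in _split_top_level(body[:-1])))


# Integers wider than 64 bits are reported as strings, mirroring the PR's
# stated goal of avoiding silent truncation.
WIDE_INTS = {"Int128", "Int256", "UInt128", "UInt256"}


def arrow_hint(t: CHType) -> str:
    """Return an informal label for the Arrow-side type (illustrative only)."""
    if t.name in ("Nullable", "LowCardinality"):
        return arrow_hint(t.args[0])  # wrappers do not change the value type
    if t.name in WIDE_INTS:
        return "string"
    if t.name == "Array":
        return f"list<{arrow_hint(t.args[0])}>"
    if t.name == "Map":
        return f"map<{arrow_hint(t.args[0])}, {arrow_hint(t.args[1])}>"
    return t.name.lower()
```

For example, `arrow_hint(parse_type("Map(String, Array(Nullable(UInt256)))"))` unwraps the `Nullable` wrapper and reports the 256-bit integer as a string element.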

Test plan

  • just lint — ruff format + check clean
  • just test-clickhouse — 9 connector tests (TPCH sf=0.01 in a
    ClickHouse testcontainer) + 37 type-parser tests pass
  • Type parser parametrised over String, FixedString(N), signed
    and unsigned Int{8..256} / UInt{8..256}, Float32,
    Float64, Bool, UUID, IPv4, IPv6, Enum, Decimal(p,s),
    Date, Date32, DateTime, DateTime64(p, 'TZ'), Array(T),
    Map(K,V), Tuple(...), Nullable(T), LowCardinality(T)

Summary by CodeRabbit

  • New Features

    • Added a native ClickHouse connector with richer type mapping, query handling, and secure URL support.
  • Tests

    • Added ClickHouse integration and unit tests (container-based integration, type parsing, client kwargs).
    • Registered a pytest "clickhouse" marker and added a test recipe to run ClickHouse tests.
  • Chores

    • Updated ClickHouse optional dependency to include a native ClickHouse client.

… dependency

The ClickHouse connector now uses ``clickhouse-connect`` directly,
parsing ClickHouse type strings via sqlglot to build PyArrow schemas
rather than going through the ibis-project clickhouse backend.

Highlights
- ``connector/clickhouse.py``: native client; type lexer covers
  ``Nullable(T)`` / ``LowCardinality(T)`` / ``Array(T)`` / ``Tuple(...)``
  / ``Map(K,V)`` / ``DateTime64(p, 'TZ')`` / ``Decimal(p,s)``, plus
  ``Int128/256`` and ``UInt128/256`` (surfaced as string to avoid
  silent truncation past 64-bit Arrow widths).
- ``model/data_source.py::get_clickhouse_connection`` returns a
  ``clickhouse_connect.Client``; ``_handle_clickhouse_url`` now also
  accepts ``clickhouse+http://`` / ``clickhouse+https://`` URLs.
- ``pyproject.toml``: clickhouse extra now pulls
  ``clickhouse-connect>=0.8`` instead of ``ibis-framework[clickhouse]``.

Tests
- ``tests/connectors/test_clickhouse.py`` exercises the full query
  path against a ClickHouse testcontainer (TPCH sf=0.01) and
  parametrises 35+ type strings through ``_parse_clickhouse_type``,
  including ``DateTime64`` with timezone and nested ``Tuple``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 603f9161-2012-417d-a11e-8189b43f7863

📥 Commits

Reviewing files that changed from the base of the PR and between a8e91ef and b969ed3.

📒 Files selected for processing (1)
  • core/wren/tests/connectors/test_clickhouse.py

Walkthrough

Replace the Ibis ClickHouse backend with a native clickhouse-connect implementation: new connector (type parsing → PyArrow), data-source/client wiring, factory routing updates, optional dependency change, pytest marker and Just recipe, plus integration and unit tests.

Changes

ClickHouse Native Connector

| Layer | File(s) | Summary |
|---|---|---|
| ClickHouse connector implementation | core/wren/src/wren/connector/clickhouse.py | Adds ClickHouseConnector, _parse_clickhouse_type, AST→PyArrow mapping, _build_clickhouse_arrow_table/_build_clickhouse_column, _build_clickhouse_client_kwargs, query/dry_run/error handling, and create_connector factory. |
| DataSource connection updates | core/wren/src/wren/model/data_source.py | Replaces ibis-based connection with clickhouse_connect.get_client, supports clickhouse, clickhouse+http, clickhouse+https, maps statement_timeout → max_execution_time, sets secure for +https, merges settings and kwargs, and handles empty password. |
| Factory routing and ibis cleanup | core/wren/src/wren/connector/factory.py, core/wren/src/wren/connector/ibis.py | Registers DataSource.clickhouse → wren.connector.clickhouse in factory, removes ClickHouse from _NEEDS_DATA_SOURCE, and deletes old ClickHouse code/imports from ibis.py (keeps Trino mapping). |
| Dependencies and test setup | core/wren/pyproject.toml, core/wren/tests/conftest.py, core/wren/justfile | Updates clickhouse optional dependency to clickhouse-connect>=0.8, adds clickhouse pytest marker (requires Docker), and adds test-clickhouse Just recipe to run ClickHouse tests. |
| ClickHouse connector tests | core/wren/tests/connectors/test_clickhouse.py | Adds integration tests using testcontainers to start ClickHouse, _wait_for_http_ready, _load_tpch, TestClickHouse integration suite, unit tests for _parse_clickhouse_type, and _build_clickhouse_client_kwargs tests. |
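
The URL handling summarized in the DataSource row (three accepted schemes, `secure` set for `+https`, empty password tolerated) could look roughly like the sketch below. It uses only `urllib.parse`; the function name `client_kwargs_from_url`, the fallback user and database `default`, and the fallback ports 8123/8443 (ClickHouse's usual HTTP/HTTPS ports) are assumptions for illustration, not values taken from the PR.

```python
from urllib.parse import unquote, urlsplit

ACCEPTED_SCHEMES = ("clickhouse", "clickhouse+http", "clickhouse+https")


def client_kwargs_from_url(url: str) -> dict:
    """Turn a clickhouse[+http|+https]:// URL into client keyword arguments."""
    parts = urlsplit(url)
    if parts.scheme not in ACCEPTED_SCHEMES:
        # List every accepted scheme so the message is not misleading.
        raise ValueError(
            f"unsupported scheme {parts.scheme!r}; expected one of: "
            + ", ".join(ACCEPTED_SCHEMES)
        )
    secure = parts.scheme == "clickhouse+https"
    return {
        "host": parts.hostname,
        # Assumed defaults: 8443 for HTTPS, 8123 for plain HTTP.
        "port": parts.port or (8443 if secure else 8123),
        "username": unquote(parts.username or "default"),
        # A missing password becomes an empty string rather than None.
        "password": unquote(parts.password or ""),
        "database": parts.path.lstrip("/") or "default",
        "secure": secure,
    }
```

Listing all accepted schemes in the error message also covers the inline review comment about the misleading scheme error.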

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐰
I swapped Ibis for a native stream,
Types parsed clean like a carrot dream,
PyArrow tables hop into the night,
Docker spins data, tests take flight,
ClickHouse sings — the queries beam.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 26.47% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: replacing the ibis-based ClickHouse backend with direct use of the clickhouse-connect native driver, which is the primary objective of this pull request. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions github-actions Bot added the dependencies, python, and core labels May 14, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
core/wren/tests/connectors/test_clickhouse.py (1)

24-25: ⚡ Quick win

Scope the clickhouse marker to Docker-dependent tests only.

Line 24 currently marks the whole module as clickhouse, so the pure parser tests in TestClickHouseTypeParser are also treated as Docker-required. Consider moving the marker to TestClickHouse (integration suite) and marking parser tests as unit (or leaving unmarked), so they can run in non-Docker test jobs.

Also applies to: 154-155

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/tests/connectors/test_clickhouse.py` around lines 24 - 25, The
module-level pytestmark = pytest.mark.clickhouse is making all tests
Docker-dependent; remove that module-level marker and instead add
`@pytest.mark.clickhouse` to the integration test class TestClickHouse, and either
add `@pytest.mark.unit` (or no marker) to TestClickHouseTypeParser so its pure
parser tests run in non-Docker jobs; also remove any other module-level
pytestmark instances (the other occurrence matching the same pattern) and apply
the same class-level scoping.
core/wren/src/wren/model/data_source.py (1)

321-340: ⚡ Quick win

Merge user-provided settings instead of clobbering them.

client_kwargs.update(kwargs) lets entries from info.kwargs overwrite anything pre-populated above. If a caller (or upstream get_connection_info/URL parsing) ends up putting settings into info.kwargs, that update replaces the merged settings dict, silently dropping the max_execution_time derived from statement_timeout (and from the X_WREN_DB_STATEMENT_TIMEOUT header for the clickhouse case at lines 98–103, if that flow ever routes through kwargs). Pop settings from kwargs and merge into the local settings dict to keep the timeout semantics intact.

♻️ Proposed fix
     settings = dict(info.settings) if info.settings else {}
     kwargs = dict(info.kwargs) if info.kwargs else {}
     statement_timeout = kwargs.pop("statement_timeout", None)
     if statement_timeout is not None:
         settings["max_execution_time"] = int(statement_timeout)
+    extra_settings = kwargs.pop("settings", None)
+    if extra_settings:
+        settings.update(extra_settings)

     client_kwargs = {
         "host": info.host,
         "port": int(info.port),
         "database": info.database,
         "username": info.user,
         "password": info.password.get_secret_value() if info.password else "",
         "secure": info.secure,
         "settings": settings,
     }
     client_kwargs.update(kwargs)
     return clickhouse_connect.get_client(**client_kwargs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/model/data_source.py` around lines 321 - 340, In
get_clickhouse_connection, avoid clobbering the merged settings by popping
"settings" out of info.kwargs and merging it into the local settings dict
(preserving any values from info.settings and the computed max_execution_time
from statement_timeout) before you update client_kwargs with the remaining
kwargs; in practice: pop settings = kwargs.pop("settings", {}) (or similar),
merge that into the existing settings dict, then proceed with
client_kwargs.update(kwargs) so the statement_timeout-derived max_execution_time
is not silently dropped.
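
A self-contained sketch of the merge order described above, using plain dicts instead of the real connection-info model. `build_client_kwargs` and its parameters are hypothetical stand-ins; the point is only that `settings` is popped out of the incoming kwargs and merged before the final `update`, so the timeout-derived `max_execution_time` survives.

```python
from __future__ import annotations


def build_client_kwargs(
    base: dict, settings: dict | None, kwargs: dict | None
) -> dict:
    """Merge user settings without dropping the timeout-derived setting."""
    merged_settings = dict(settings or {})
    extra = dict(kwargs or {})  # copy so the caller's dict is not mutated

    # statement_timeout is translated into a ClickHouse setting.
    timeout = extra.pop("statement_timeout", None)
    if timeout is not None:
        merged_settings["max_execution_time"] = int(timeout)

    # Pop user-supplied settings so the update() below cannot replace the
    # whole merged dict; merge them key by key instead.
    user_settings = extra.pop("settings", None)
    if user_settings:
        merged_settings.update(user_settings)

    out = {**base, "settings": merged_settings}
    out.update(extra)  # remaining kwargs pass through unchanged
    return out
```

With this ordering, a `max_execution_time` that the caller sets explicitly inside `settings` still wins over the one derived from `statement_timeout`, which matches the order of operations in the proposed fix.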
core/wren/src/wren/connector/clickhouse.py (2)

164-186: ⚡ Quick win

Consider using strict=True in zip operations for better data integrity.

Lines 172 and 184 use strict=False which silently handles length mismatches. Since this code targets Python 3.10+ (evident from union type syntax elsewhere), using strict=True would catch potential driver bugs or data corruption early. ClickHouse should always return matching lengths for column metadata.

🔒 Proposed fix for stricter validation
     fields = [
         pa.field(name, _parse_clickhouse_type(ct.name), nullable=True)
-        for name, ct in zip(column_names, column_types, strict=False)
+        for name, ct in zip(column_names, column_types, strict=True)
     ]
     schema = pa.schema(fields)

     if not rows:
         arrays = [pa.array([], type=field.type) for field in schema]
     else:
         arrays = [
             _build_clickhouse_column([row[i] for row in rows], schema.field(i).type)
             for i in range(len(fields))
         ]
     return pa.table(
-        dict(zip([f.name for f in fields], arrays, strict=False)),
+        dict(zip([f.name for f in fields], arrays, strict=True)),
         schema=schema,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/clickhouse.py` around lines 164 - 186, The zip
calls in _build_clickhouse_arrow_table silently allow mismatched lengths (using
strict=False) which can mask driver bugs; change both zip(...) usages in
_build_clickhouse_arrow_table to use strict=True so mismatched
column_names/column_types or field names/arrays raise an error, ensuring the
construction of fields and the dict(zip(...)) for pa.table validates length
consistency (update the zip in the fields list comprehension and the final
dict(zip(...)) call).

302-330: ⚡ Quick win

Consider using a dedicated timeout exception type instead of string matching on error messages.

Lines 309 and 323 detect timeouts by checking if "TIMEOUT_EXCEEDED" appears in the error message string. This approach is fragile—if the clickhouse-connect driver changes its error message format in a future version, the timeout detection will silently fail. The codebase already has a DatabaseTimeoutError class defined in wren/model/error.py, which should be used instead of re-raising the raw driver exception.

Additionally, this creates inconsistent error handling: non-timeout errors are wrapped in WrenError with proper metadata and error codes, but timeout errors bypass this and are re-raised as raw driver exceptions. Using a dedicated exception type or checking exception attributes (if available in the driver) would be more robust than string matching.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/clickhouse.py` around lines 302 - 330, The
timeout detection in query and dry_run currently string-matches
"TIMEOUT_EXCEEDED" on the ClickHouse driver exception; replace that fragile
logic by converting driver timeouts into the project's DatabaseTimeoutError
(from wren/model/error.py) instead of re-raising the raw driver exception, and
preserve the original exception as the cause; for non-timeout errors keep
wrapping into WrenError with ErrorCode.INVALID_SQL and the appropriate
ErrorPhase (SQL_EXECUTION for query, SQL_DRY_RUN for dry_run) and
metadata={DIALECT_SQL: sql} so behavior remains consistent while using a
dedicated timeout exception type.
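
One possible shape for this wrapping, sketched with local stand-in classes so it runs on its own: `DriverError`, `QueryError`, and `run_query` are hypothetical, and `DatabaseTimeoutError` here is a placeholder for the class the review says exists in wren/model/error.py. Note that the sketch still detects the timeout from the message text; it only changes what gets raised.

```python
class DriverError(Exception):
    """Stand-in for the exception type raised by the database driver."""


class DatabaseTimeoutError(Exception):
    """Placeholder for the project's dedicated timeout error."""


class QueryError(Exception):
    """Placeholder for the project's generic wrapped SQL error."""


def run_query(execute, sql: str):
    """Run `execute(sql)` and translate driver failures into typed errors."""
    try:
        return execute(sql)
    except DriverError as exc:
        if "TIMEOUT_EXCEEDED" in str(exc):
            # `from exc` keeps the driver exception available as __cause__.
            raise DatabaseTimeoutError(f"query timed out: {sql}") from exc
        raise QueryError(f"query failed: {sql}") from exc
```

Callers can then catch `DatabaseTimeoutError` directly, and the original driver exception remains reachable through `__cause__` for logging.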
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/wren/src/wren/connector/clickhouse.py`:
- Around line 237-241: The error message raised in the ClickHouse connector when
validating parsed.scheme is misleading; update the WrenError raised in this
validation (the block checking parsed.scheme against {"clickhouse",
"clickhouse+http", "clickhouse+https"}) to list all accepted schemes (e.g.,
"clickhouse, clickhouse+http, clickhouse+https") and keep
ErrorCode.INVALID_CONNECTION_INFO and the existing raise call (reference:
parsed.scheme check and the WrenError invocation).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b129df92-f94b-477c-b440-8adb04185c40

📥 Commits

Reviewing files that changed from the base of the PR and between 544ecab and 3828bf4.

⛔ Files ignored due to path filters (1)
  • core/wren/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • core/wren/justfile
  • core/wren/pyproject.toml
  • core/wren/src/wren/connector/clickhouse.py
  • core/wren/src/wren/connector/factory.py
  • core/wren/src/wren/connector/ibis.py
  • core/wren/src/wren/model/data_source.py
  • core/wren/tests/conftest.py
  • core/wren/tests/connectors/test_clickhouse.py
💤 Files with no reviewable changes (1)
  • core/wren/src/wren/connector/ibis.py

Comment thread core/wren/src/wren/connector/clickhouse.py
…ettings

In the ClickHouse client-kwargs assembly, ``out.update(kwargs)`` /
``client_kwargs.update(kwargs)`` would clobber the merged ``settings`` dict
(carrying ``max_execution_time`` from ``statement_timeout``) whenever the
caller also passed their own ``settings`` via ``kwargs``. Pop ``settings``
from incoming ``kwargs`` first and merge it into the local dict so the
timeout survives. Also wrap driver ``TIMEOUT_EXCEEDED`` errors as the
existing ``DatabaseTimeoutError`` instead of re-raising the raw driver
exception, for consistency with the typed error model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@goldmedal
Collaborator Author

Addressing the CodeRabbit nitpicks from the review body:

  • data_source.py settings clobbering — fixed in a8e91ef. Also fixed the same pattern in two parallel spots in clickhouse.py::_build_clickhouse_client_kwargs (URL-based and ClickHouseConnectionInfo-based branches), plus added a regression test (TestClickHouseClientKwargs::test_statement_timeout_survives_user_settings).
  • DatabaseTimeoutError instead of string-match raise — applied in a8e91ef. The class exists in wren.model.error and was previously unused; both query() and dry_run() now wrap TIMEOUT_EXCEEDED driver errors as DatabaseTimeoutError with the original exception preserved as __cause__.
  • zip(..., strict=True) in _build_clickhouse_arrow_table — skipping. strict=False is intentional: clickhouse-connect occasionally returns a slightly different column count for empty result sets, and failing hard would surface as a confusing user-facing crash rather than a recoverable warning.
  • Move pytest.mark.clickhouse from module level to integration class — skipping. Keeping the module-level marker for consistency with the sibling connector test files (postgres/mysql/...). The parser tests are quick enough that the Docker prereq doesn't materially affect CI scheduling.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/wren/tests/connectors/test_clickhouse.py`:
- Around line 68-76: The readiness retry loop creates a ClickHouse client via
clickhouse_connect.get_client and currently only calls client.close() on the
success path; ensure the client is always closed by moving the client.close()
call into a finally block (or otherwise invoking it when an exception is raised)
surrounding the client.query("SELECT 1") call so the local client variable is
cleaned up on both success and failure and avoids leaking connections across
retries.
- Around line 84-86: Replace the dynamic TPCH extension install and dbgen
invocation that causes network access with loading pre-provisioned TPCH seed
files: remove the "INSTALL tpch; LOAD tpch; CALL dbgen(...)" sequence inside the
duckdb.connect() block and instead read the checked-in CSV/Parquet fixtures (or
a prebuilt DuckDB dump) into the in-memory connection before executing the query
that populates orders_rows; update the test to use duck.execute("CREATE TABLE
... AS SELECT * FROM read_csv_auto('path/to/seed/orders.csv')") or the
equivalent fixture-loading helper so orders_rows is derived from local test data
rather than downloading the tpch extension.
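
The first finding (always closing the client in the readiness loop) can be sketched as follows. `wait_until_ready` and the `make_client` factory are hypothetical; in the test file the factory would presumably wrap `clickhouse_connect.get_client`, but here it is injected so the sketch runs without a server.

```python
import time


def wait_until_ready(make_client, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll with `SELECT 1` until the server answers, closing every client."""
    last_error = None
    for _ in range(attempts):
        client = None
        try:
            client = make_client()
            client.query("SELECT 1")
            return  # the finally block below still runs before returning
        except Exception as exc:  # broad on purpose: any failure means "retry"
            last_error = exc
            sleep(delay)
        finally:
            # Runs on both success and failure, so no client leaks
            # across retries.
            if client is not None:
                client.close()
    raise TimeoutError(f"server not ready after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays.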
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3b069458-e64e-41e0-82ac-6422cf32fdc4

📥 Commits

Reviewing files that changed from the base of the PR and between 3828bf4 and a8e91ef.

📒 Files selected for processing (3)
  • core/wren/src/wren/connector/clickhouse.py
  • core/wren/src/wren/model/data_source.py
  • core/wren/tests/connectors/test_clickhouse.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • core/wren/src/wren/model/data_source.py
  • core/wren/src/wren/connector/clickhouse.py

Comment thread core/wren/tests/connectors/test_clickhouse.py
Comment thread core/wren/tests/connectors/test_clickhouse.py Outdated
Wrap the readiness-loop client in try/finally so we close on failed
attempts, not only on success. Replace the DuckDB TPCH extension
fixture (which pulled the extension over the network on every run)
with inline-fabricated rows so the test stays hermetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@goldmedal
Collaborator Author

Superseded by #2313 — these seven native-driver refactors were consolidated into a single feature branch to resolve shared-file conflicts (data_source.py, pyproject.toml, uv.lock, factory.py, etc.) once instead of seven times.

@goldmedal goldmedal closed this May 21, 2026