refactor(clickhouse): use clickhouse-connect native driver, drop ibis dependency#2275
refactor(clickhouse): use clickhouse-connect native driver, drop ibis dependency#2275goldmedal wants to merge 3 commits into
Conversation
… dependency The ClickHouse connector now uses ``clickhouse-connect`` directly, parsing ClickHouse type strings via sqlglot to build PyArrow schemas rather than going through the ibis-project clickhouse backend. Highlights - ``connector/clickhouse.py``: native client; type lexer covers ``Nullable(T)`` / ``LowCardinality(T)`` / ``Array(T)`` / ``Tuple(...)`` / ``Map(K,V)`` / ``DateTime64(p, 'TZ')`` / ``Decimal(p,s)``, plus ``Int128/256`` and ``UInt128/256`` (surfaced as string to avoid silent truncation past 64-bit Arrow widths). - ``model/data_source.py::get_clickhouse_connection`` returns a ``clickhouse_connect.Client``; ``_handle_clickhouse_url`` now also accepts ``clickhouse+http://`` / ``clickhouse+https://`` URLs. - ``pyproject.toml``: clickhouse extra now pulls ``clickhouse-connect>=0.8`` instead of ``ibis-framework[clickhouse]``. Tests - ``tests/connectors/test_clickhouse.py`` exercises the full query path against a ClickHouse testcontainer (TPCH sf=0.01) and parametrises 35+ type strings through ``_parse_clickhouse_type``, including ``DateTime64`` with timezone and nested ``Tuple``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughReplace the Ibis ClickHouse backend with a native clickhouse-connect implementation: new connector (type parsing → PyArrow), data-source/client wiring, factory routing updates, optional dependency change, pytest marker and Just recipe, plus integration and unit tests. ChangesClickHouse Native Connector
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
core/wren/tests/connectors/test_clickhouse.py (1)
24-25: ⚡ Quick winScope the
clickhousemarker to Docker-dependent tests only.Line 24 currently marks the whole module as
clickhouse, so the pure parser tests inTestClickHouseTypeParserare also treated as Docker-required. Consider moving the marker toTestClickHouse(integration suite) and marking parser tests asunit(or leaving unmarked), so they can run in non-Docker test jobs.Also applies to: 154-155
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/tests/connectors/test_clickhouse.py` around lines 24 - 25, The module-level pytestmark = pytest.mark.clickhouse is making all tests Docker-dependent; remove that module-level marker and instead add `@pytest.mark.clickhouse` to the integration test class TestClickHouse, and either add `@pytest.mark.unit` (or no marker) to TestClickHouseTypeParser so its pure parser tests run in non-Docker jobs; also remove any other module-level pytestmark instances (the other occurrence matching the same pattern) and apply the same class-level scoping.core/wren/src/wren/model/data_source.py (1)
321-340: ⚡ Quick winMerge user-provided
settingsinstead of clobbering them.
client_kwargs.update(kwargs)lets entries frominfo.kwargsoverwrite anything pre-populated above. If a caller (or upstreamget_connection_info/URL parsing) ends up puttingsettingsintoinfo.kwargs, that update replaces the mergedsettingsdict, silently dropping themax_execution_timederived fromstatement_timeout(and from theX_WREN_DB_STATEMENT_TIMEOUTheader for theclickhousecase at lines 98–103, if that flow ever routes through kwargs). Popsettingsfromkwargsand merge into the localsettingsdict to keep the timeout semantics intact.♻️ Proposed fix
settings = dict(info.settings) if info.settings else {} kwargs = dict(info.kwargs) if info.kwargs else {} statement_timeout = kwargs.pop("statement_timeout", None) if statement_timeout is not None: settings["max_execution_time"] = int(statement_timeout) + extra_settings = kwargs.pop("settings", None) + if extra_settings: + settings.update(extra_settings) client_kwargs = { "host": info.host, "port": int(info.port), "database": info.database, "username": info.user, "password": info.password.get_secret_value() if info.password else "", "secure": info.secure, "settings": settings, } client_kwargs.update(kwargs) return clickhouse_connect.get_client(**client_kwargs)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/model/data_source.py` around lines 321 - 340, In get_clickhouse_connection, avoid clobbering the merged settings by popping "settings" out of info.kwargs and merging it into the local settings dict (preserving any values from info.settings and the computed max_execution_time from statement_timeout) before you update client_kwargs with the remaining kwargs; in practice: pop settings = kwargs.pop("settings", {}) (or similar), merge that into the existing settings dict, then proceed with client_kwargs.update(kwargs) so the statement_timeout-derived max_execution_time is not silently dropped.core/wren/src/wren/connector/clickhouse.py (2)
164-186: ⚡ Quick winConsider using
strict=Truein zip operations for better data integrity.Lines 172 and 184 use
strict=Falsewhich silently handles length mismatches. Since this code targets Python 3.10+ (evident from union type syntax elsewhere), usingstrict=Truewould catch potential driver bugs or data corruption early. ClickHouse should always return matching lengths for column metadata.🔒 Proposed fix for stricter validation
fields = [ pa.field(name, _parse_clickhouse_type(ct.name), nullable=True) - for name, ct in zip(column_names, column_types, strict=False) + for name, ct in zip(column_names, column_types, strict=True) ] schema = pa.schema(fields) if not rows: arrays = [pa.array([], type=field.type) for field in schema] else: arrays = [ _build_clickhouse_column([row[i] for row in rows], schema.field(i).type) for i in range(len(fields)) ] return pa.table( - dict(zip([f.name for f in fields], arrays, strict=False)), + dict(zip([f.name for f in fields], arrays, strict=True)), schema=schema, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/connector/clickhouse.py` around lines 164 - 186, The zip calls in _build_clickhouse_arrow_table silently allow mismatched lengths (using strict=False) which can mask driver bugs; change both zip(...) usages in _build_clickhouse_arrow_table to use strict=True so mismatched column_names/column_types or field names/arrays raise an error, ensuring the construction of fields and the dict(zip(...)) for pa.table validates length consistency (update the zip in the fields list comprehension and the final dict(zip(...)) call).
302-330: ⚡ Quick winConsider using a dedicated timeout exception type instead of string matching on error messages.
Lines 309 and 323 detect timeouts by checking if
"TIMEOUT_EXCEEDED"appears in the error message string. This approach is fragile—if the clickhouse-connect driver changes its error message format in a future version, the timeout detection will silently fail. The codebase already has aDatabaseTimeoutErrorclass defined inwren/model/error.py, which should be used instead of re-raising the raw driver exception.Additionally, this creates inconsistent error handling: non-timeout errors are wrapped in
WrenErrorwith proper metadata and error codes, but timeout errors bypass this and are re-raised as raw driver exceptions. Using a dedicated exception type or checking exception attributes (if available in the driver) would be more robust than string matching.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/connector/clickhouse.py` around lines 302 - 330, The timeout detection in query and dry_run currently string-matches "TIMEOUT_EXCEEDED" on the ClickHouse driver exception; replace that fragile logic by converting driver timeouts into the project's DatabaseTimeoutError (from wren/model/error.py) instead of re-raising the raw driver exception, and preserve the original exception as the cause; for non-timeout errors keep wrapping into WrenError with ErrorCode.INVALID_SQL and the appropriate ErrorPhase (SQL_EXECUTION for query, SQL_DRY_RUN for dry_run) and metadata={DIALECT_SQL: sql} so behavior remains consistent while using a dedicated timeout exception type.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/src/wren/connector/clickhouse.py`:
- Around line 237-241: The error message raised in the ClickHouse connector when
validating parsed.scheme is misleading; update the WrenError raised in this
validation (the block checking parsed.scheme against {"clickhouse",
"clickhouse+http", "clickhouse+https"}) to list all accepted schemes (e.g.,
"clickhouse, clickhouse+http, clickhouse+https") and keep
ErrorCode.INVALID_CONNECTION_INFO and the existing raise call (reference:
parsed.scheme check and the WrenError invocation).
---
Nitpick comments:
In `@core/wren/src/wren/connector/clickhouse.py`:
- Around line 164-186: The zip calls in _build_clickhouse_arrow_table silently
allow mismatched lengths (using strict=False) which can mask driver bugs; change
both zip(...) usages in _build_clickhouse_arrow_table to use strict=True so
mismatched column_names/column_types or field names/arrays raise an error,
ensuring the construction of fields and the dict(zip(...)) for pa.table
validates length consistency (update the zip in the fields list comprehension
and the final dict(zip(...)) call).
- Around line 302-330: The timeout detection in query and dry_run currently
string-matches "TIMEOUT_EXCEEDED" on the ClickHouse driver exception; replace
that fragile logic by converting driver timeouts into the project's
DatabaseTimeoutError (from wren/model/error.py) instead of re-raising the raw
driver exception, and preserve the original exception as the cause; for
non-timeout errors keep wrapping into WrenError with ErrorCode.INVALID_SQL and
the appropriate ErrorPhase (SQL_EXECUTION for query, SQL_DRY_RUN for dry_run)
and metadata={DIALECT_SQL: sql} so behavior remains consistent while using a
dedicated timeout exception type.
In `@core/wren/src/wren/model/data_source.py`:
- Around line 321-340: In get_clickhouse_connection, avoid clobbering the merged
settings by popping "settings" out of info.kwargs and merging it into the local
settings dict (preserving any values from info.settings and the computed
max_execution_time from statement_timeout) before you update client_kwargs with
the remaining kwargs; in practice: pop settings = kwargs.pop("settings", {}) (or
similar), merge that into the existing settings dict, then proceed with
client_kwargs.update(kwargs) so the statement_timeout-derived max_execution_time
is not silently dropped.
In `@core/wren/tests/connectors/test_clickhouse.py`:
- Around line 24-25: The module-level pytestmark = pytest.mark.clickhouse is
making all tests Docker-dependent; remove that module-level marker and instead
add `@pytest.mark.clickhouse` to the integration test class TestClickHouse, and
either add `@pytest.mark.unit` (or no marker) to TestClickHouseTypeParser so its
pure parser tests run in non-Docker jobs; also remove any other module-level
pytestmark instances (the other occurrence matching the same pattern) and apply
the same class-level scoping.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b129df92-f94b-477c-b440-8adb04185c40
⛔ Files ignored due to path filters (1)
core/wren/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
core/wren/justfilecore/wren/pyproject.tomlcore/wren/src/wren/connector/clickhouse.pycore/wren/src/wren/connector/factory.pycore/wren/src/wren/connector/ibis.pycore/wren/src/wren/model/data_source.pycore/wren/tests/conftest.pycore/wren/tests/connectors/test_clickhouse.py
💤 Files with no reviewable changes (1)
- core/wren/src/wren/connector/ibis.py
…ettings In the ClickHouse client-kwargs assembly, ``out.update(kwargs)`` / ``client_kwargs.update(kwargs)`` would clobber the merged ``settings`` dict (carrying ``max_execution_time`` from ``statement_timeout``) whenever the caller also passed their own ``settings`` via ``kwargs``. Pop ``settings`` from incoming ``kwargs`` first and merge it into the local dict so the timeout survives. Also wrap driver ``TIMEOUT_EXCEEDED`` errors as the existing ``DatabaseTimeoutError`` instead of re-raising the raw driver exception, for consistency with the typed error model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressing the CodeRabbit nitpicks from the review body:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/tests/connectors/test_clickhouse.py`:
- Around line 68-76: The readiness retry loop creates a ClickHouse client via
clickhouse_connect.get_client and currently only calls client.close() on the
success path; ensure the client is always closed by moving the client.close()
call into a finally block (or otherwise invoking it when an exception is raised)
surrounding the client.query("SELECT 1") call so the local client variable is
cleaned up on both success and failure and avoids leaking connections across
retries.
- Around line 84-86: Replace the dynamic TPCH extension install and dbgen
invocation that causes network access with loading pre-provisioned TPCH seed
files: remove the "INSTALL tpch; LOAD tpch; CALL dbgen(...)" sequence inside the
duckdb.connect() block and instead read the checked-in CSV/Parquet fixtures (or
a prebuilt DuckDB dump) into the in-memory connection before executing the query
that populates orders_rows; update the test to use duck.execute("CREATE TABLE
... AS SELECT * FROM read_csv_auto('path/to/seed/orders.csv')") or the
equivalent fixture-loading helper so orders_rows is derived from local test data
rather than downloading the tpch extension.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 3b069458-e64e-41e0-82ac-6422cf32fdc4
📒 Files selected for processing (3)
core/wren/src/wren/connector/clickhouse.pycore/wren/src/wren/model/data_source.pycore/wren/tests/connectors/test_clickhouse.py
🚧 Files skipped from review as they are similar to previous changes (2)
- core/wren/src/wren/model/data_source.py
- core/wren/src/wren/connector/clickhouse.py
Wrap the readiness-loop client in try/finally so we close on failed attempts, not only on success. Replace the DuckDB TPCH extension fixture (which pulled the extension over the network on every run) with inline-fabricated rows so the test stays hermetic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Superseded by #2313 — these seven native-driver refactors were consolidated into a single feature branch to resolve shared-file conflicts (data_source.py, pyproject.toml, uv.lock, factory.py, etc.) once instead of seven times. |
Summary
The ClickHouse connector now uses
clickhouse-connectdirectly, parsingClickHouse type strings via sqlglot to build PyArrow schemas — no more
detour through the ibis-project clickhouse backend.
connector/clickhouse.py: native client; type lexer coversNullable(T)/LowCardinality(T)/Array(T)/Tuple(...)/Map(K,V)/DateTime64(p, 'TZ')/Decimal(p,s), plusInt128/256andUInt128/256(surfaced asstringto avoid silenttruncation past 64-bit Arrow widths).
model/data_source.py::get_clickhouse_connectionnow returns aclickhouse_connect.Client;_handle_clickhouse_urlalso acceptsclickhouse+http://andclickhouse+https://URLs.pyproject.toml: clickhouse extra →clickhouse-connect>=0.8(was
ibis-framework[clickhouse]).Test plan
just lint— ruff format + check cleanjust test-clickhouse— 9 connector tests (TPCH sf=0.01 in aClickHouse testcontainer) + 37 type-parser tests pass
String,FixedString(N), signedand unsigned
Int{8..256}/UInt{8..256},Float32,Float64,Bool,UUID,IPv4,IPv6,Enum,Decimal(p,s),Date,Date32,DateTime,DateTime64(p, 'TZ'),Array(T),Map(K,V),Tuple(...),Nullable(T),LowCardinality(T)Summary by CodeRabbit
New Features
Tests
Chores