feat(wren): add YTsaurus (CHYT) connector#2258
Conversation
Adds a new data source that talks to a YTsaurus cluster through its CHYT (ClickHouse-over-YT) clique. CHYT exposes a ClickHouse HTTP protocol on the YT proxy, so the connector reuses the existing ClickHouse / Ibis path with YT-flavored auth and routing. Wiring: - DataSource.ytsaurus enum + factory entry - YTsaurusConnectionInfo (proxy, clique, token, secure, port, query_path, settings, kwargs); token falls back to YT_TOKEN env var - pip extra `wren-engine[ytsaurus]` (pulls ibis-framework[clickhouse]) - sqlglot dialect map: ytsaurus -> clickhouse - docs: docs/connectors/ytsaurus.md, README install row, docs/connections.md JSON example CHYT integration details (the reason this is not a 5-line ClickHouse alias): 1. clickhouse_connect.get_client() does not accept http_headers as a kwarg. Headers are injected post-construction with a belt-and-braces fallback for clients that re-snapshot state. 2. CHYT mounts at URL path /query, not /. New `query_path` field on YTsaurusConnectionInfo (default "/query") is mapped to clickhouse_connect's proxy_path. 3. YT routes by chyt.clique_alias=<*alias> URL param, not by the `database=` field clickhouse_connect would send as a header. The connector patches HttpClient.params at construction so the alias is present before the startup `SELECT version()` runs. 4. YT proxy accepts only `Authorization: OAuth <token>` (rejects Basic and Bearer). The connector subclasses HttpClient with a one-line override of _init_common_settings that injects the OAuth header before the parent's startup query. 5. ibis backend.sql() introspects via CREATE VIEW, which CHYT rejects (std::out_of_range). YTsaurusConnector.query / dry_run bypass ibis and call clickhouse_connect directly (native protocol -> pyarrow.Table; EXPLAIN AST for dry-runs). Physical-SQL rewrite (engine.py): WrenEngine applies a YT-specific post-transpile sqlglot rewrite that replaces <schema>.<table> references with backticked YT paths sourced from each model's `properties.ytPath` in the manifest. Only runs when `data_source == ytsaurus`; path map is cached on first build, so other connectors are unaffected. Verified end-to-end against a live YT cluster via a CHYT clique: COUNT(*), JOIN-and-aggregate, and dialect-specific functions (toUnixTimestamp, startsWith, INTERVAL arithmetic, COUNT DISTINCT) all return matching results vs. the same SQL run against the cluster directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a first-class YTsaurus data source (CHYT): Pydantic connection model, connection factory with token/patching, ClickHouse-compatible connector, engine SQL rewriting for YT paths, registry wiring, optional pyproject extra, and documentation. ChangesYTsaurus connector integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/src/wren/connector/ytsaurus.py`:
- Around line 60-63: The code assumes result.column_names and
result.result_columns are same length before building a PyArrow table (variables
result, columns, data in the block after self._ch_client.query(wrapped)), which
can silently drop columns due to zip(); add an explicit check comparing
len(columns) and len(data) immediately after those lists are created, and if
they differ either raise a descriptive exception (e.g., ValueError including
both lengths and samples) or log an error and fail fast so callers won't get a
truncated/malformed table; keep the pa.table construction only when lengths
match.
- Around line 53-55: The code interpolates limit directly into the SQL (wrapped,
sql, limit) which allows injection; validate and sanitize limit before using it:
if limit is not None, ensure it's an integer (and not a bool) and optionally
non-negative (e.g., if not isinstance(limit, int) or isinstance(limit, bool) or
limit < 0: raise TypeError/ValueError), then safely build wrapped = f"SELECT *
FROM (\n{sql}\n) LIMIT {limit}"; this prevents non-integer strings like "100;
DROP ..." from being injected.
In `@core/wren/src/wren/engine.py`:
- Around line 172-189: The _rewrite function can accidentally match an
empty-string table name because _yt_path_map generates an empty key when a
model's tableReference.table is missing; update _rewrite (inside the
transformation that handles exp.Table) to early-return the original node if
tbl_name is empty or falsy (i.e., add a guard after computing tbl_name) so it
does not lookup path_map with an empty key; reference exp.Table, the local
variables db/name/tbl_name, and _yt_path_map to locate related logic and ensure
no empty-key matches occur.
- Around line 209-221: The manifest parsing can still produce mappings for
models with no valid table name; after computing table = (tr.get("table") or
m.get("name") or "").strip() in the loop over manifest.get("models", []), add an
explicit guard (if not table: continue) so models without a non-empty table/name
are skipped entirely and neither the schema.table nor the table key is written
into out; update the loop that builds out and the use of self._yt_path_map_cache
accordingly.
In `@core/wren/src/wren/model/data_source.py`:
- Around line 372-423: The code mutates global class state
(_BaseHttpClient.params, _ch_http.HttpClient, _ch_driver.HttpClient, and
_CHYTHttpClient._wren_yt_token) around the ibis.clickhouse.connect() call, which
can leak settings/tokens under concurrency; fix by introducing a module-level
lock (e.g., a threading.Lock named something like _ytsaurus_client_lock) and
wrap the whole patching/connection/finally block with that lock (using a context
manager) so only one thread can perform the patch, connect, and restore sequence
at a time; ensure the lock is declared at module scope and used to guard the
region that sets _CHYTHttpClient._wren_yt_token, mutates _BaseHttpClient.params
and replaces _ch_http.HttpClient/_ch_driver.HttpClient, and that the finally
block still restores originals while inside the locked section.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 4b259c5a-4ba6-464f-8296-069715ef549e
📒 Files selected for processing (10)
core/wren/README.mdcore/wren/docs/connections.mdcore/wren/docs/connectors/ytsaurus.mdcore/wren/pyproject.tomlcore/wren/src/wren/connector/factory.pycore/wren/src/wren/connector/ytsaurus.pycore/wren/src/wren/engine.pycore/wren/src/wren/mdl/cte_rewriter.pycore/wren/src/wren/model/__init__.pycore/wren/src/wren/model/data_source.py
- ytsaurus.py: validate `limit` is a non-negative int before f-string interpolation (defense in depth against SQL injection if a caller bypasses the type hint). - ytsaurus.py: fail fast with WrenError when CHYT returns mismatched column metadata (column_names vs result_columns length), instead of silently truncating via zip(). - engine.py: in the YT path-rewrite, early-return when the parsed table name is empty so we never look up an empty key in the path map. - engine.py: skip manifest models whose resolved `table` is empty, so the path map can't grow an empty-string entry. - data_source.py: introduce a module-level threading.Lock and hold it around the HttpClient monkey-patching block in `get_ytsaurus_connection`. Concurrent connections were racing on `HttpClient.params` and the per-class `_wren_yt_token`, which could leak one caller's token into another caller's connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed all five CodeRabbit findings in b234b45:
The lock is at module scope rather than on |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/wren/src/wren/model/data_source.py (1)
109-114:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject or normalize
connectionUrlfor YTsaurus.Adding
DataSource.ytsaurusto the structured-connection flow here is still incompatible with the existingconnectionUrlfast path. If a caller sendsconnectionUrl,_build_connection_info()returns a genericConnectionUrl, so this branch can blow up oninfo.settings, and the laterhasattr(info, "connection_url")path will bypassget_ytsaurus_connection()entirely. That skips the CHYT OAuth/clique setup and makes the new datasource behave incorrectly for a valid-looking input shape.Either explicitly reject
connectionUrlfor YTsaurus up front, or parse it into a realYTsaurusConnectionInfobefore this branch runs.Possible minimal fix
def _build_connection_info(self, data: dict) -> ConnectionInfo: if "connectionUrl" in data or "connection_url" in data: + if self == DataSource.ytsaurus: + raise WrenError( + ErrorCode.INVALID_CONNECTION_INFO, + "YTsaurus does not support connectionUrl; use proxy/clique/token fields.", + ) if self == DataSource.clickhouse: return self._handle_clickhouse_url( urllib.parse.urlparse( data.get("connectionUrl", data.get("connection_url")) ) ) return ConnectionUrl.model_validate(data)Also applies to: 189-190
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/wren/src/wren/model/data_source.py` around lines 109 - 114, The YTsaurus branch in _build_connection_info is incompatible with callers that pass a generic ConnectionUrl: before touching info.settings or branching on DataSource.ytsaurus, detect if info is a ConnectionUrl and either raise/reject connectionUrl for YTsaurus or convert/normalize it into a YTsaurusConnectionInfo (populate required fields including settings and OAuth/clique fields) so subsequent code (including get_ytsaurus_connection) runs correctly; also guard any access to info.settings (create if None) and apply the same normalization/guard to the analogous block around the later DataSource.ytsaurus handling (the code referenced at 189-190).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@core/wren/src/wren/model/data_source.py`:
- Around line 109-114: The YTsaurus branch in _build_connection_info is
incompatible with callers that pass a generic ConnectionUrl: before touching
info.settings or branching on DataSource.ytsaurus, detect if info is a
ConnectionUrl and either raise/reject connectionUrl for YTsaurus or
convert/normalize it into a YTsaurusConnectionInfo (populate required fields
including settings and OAuth/clique fields) so subsequent code (including
get_ytsaurus_connection) runs correctly; also guard any access to info.settings
(create if None) and apply the same normalization/guard to the analogous block
around the later DataSource.ytsaurus handling (the code referenced at 189-190).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9132143d-dad2-4006-ac60-3525c6d359e7
📒 Files selected for processing (3)
core/wren/src/wren/connector/ytsaurus.pycore/wren/src/wren/engine.pycore/wren/src/wren/model/data_source.py
🚧 Files skipped from review as they are similar to previous changes (2)
- core/wren/src/wren/engine.py
- core/wren/src/wren/connector/ytsaurus.py
Adding YTsaurus to the connectionUrl fast path returned a generic ConnectionUrl that lacks .settings (used in the clickhouse/ytsaurus session-timeout branch) and bypassed get_ytsaurus_connection() via the hasattr(info, "connection_url") path, skipping CHYT OAuth/clique setup. Reject connectionUrl upfront with INVALID_CONNECTION_INFO. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/src/wren/model/data_source.py`:
- Around line 409-437: get_clickhouse_connection currently calls
ibis.clickhouse.connect without holding _YTSAURUS_PATCH_LOCK, allowing a race
that can leak _CHYTHttpClient._wren_yt_token and modified _BaseHttpClient.params
into plain ClickHouse clients; to fix, ensure the ibis.clickhouse.connect
invocation in get_clickhouse_connection is executed while holding
_YTSAURUS_PATCH_LOCK (same lock used by get_ytsaurus_connection) or refactor
both into a shared helper that acquires _YTSAURUS_PATCH_LOCK before temporarily
patching _BaseHttpClient and swapping _ch_http.HttpClient/_ch_driver.HttpClient
to _CHYTHttpClient and restores them in a finally block, making sure
_CHYTHttpClient._wren_yt_token and original_class_params handling mirrors the
existing pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 425b1d47-a9bc-47ce-9d8e-9fe41b0d978d
📒 Files selected for processing (1)
core/wren/src/wren/model/data_source.py
get_ytsaurus_connection temporarily monkey-patches clickhouse_connect's HttpClient class and `params` dict to inject CHYT auth. Without holding the same lock in get_clickhouse_connection, a concurrent plain ClickHouse connect could snapshot the patched state and leak the YT OAuth token and chyt.clique_alias into a regular ClickHouse client. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit docstring-coverage check flagged the new YTsaurus code as under the 80% threshold. Add one-line docstrings to YTsaurusConnector methods, the create_connector factory, get_ytsaurus_connection (and the serialization comment for get_clickhouse_connection), the inner _CHYTHttpClient hook, and the inner _rewrite transformer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit docstring coverage was still under the 80% threshold because the modified DataSource / DataSourceExtension classes and their public dispatch methods (get_connection, get_connection_info, _build_connection_info) lacked docstrings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/src/wren/engine.py`:
- Around line 223-224: The unqualified-table mapping (the out dict entry created
via out.setdefault(table, yt_path)) can silently collide; change the
construction of the path_map/out so you detect when table is already present
with a different yt_path and handle it explicitly: either log a warning (use the
module logger) and keep both qualified mappings while removing or avoiding the
unqualified key, or omit creating the unqualified mapping when a collision is
detected; update the code that currently does out[f"{schema}.{table}"] = yt_path
and out.setdefault(table, yt_path) to check existing out.get(table) and compare
values before setting, and ensure this behavior is documented where path_map is
used by the rewrite logic that calls path_map.get(f"{db_name}.{tbl_name}") or
path_map.get(tbl_name).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8a0d1862-4543-4605-858c-a5e5495c0ce5
📒 Files selected for processing (3)
core/wren/src/wren/connector/ytsaurus.pycore/wren/src/wren/engine.pycore/wren/src/wren/model/data_source.py
🚧 Files skipped from review as they are similar to previous changes (2)
- core/wren/src/wren/connector/ytsaurus.py
- core/wren/src/wren/model/data_source.py
When two models in different schemas share the same table name, out.setdefault(table, ...) silently kept the first-seen yt_path. A query referencing the bare table name could then rewrite to the wrong YT path. Detect the conflict, drop the bare-name mapping, and log a warning telling the user to qualify with a schema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/wren/src/wren/engine.py`:
- Around line 117-118: dry_plan currently returns SQL with synthetic
schema.table identifiers that differ from what YTsaurus actually executes
because _apply_physical_overrides is only applied later in query() and
dry_run(); update dry_plan (or the code path used by DataSource.ytsaurus) to
call _apply_physical_overrides(dialect_sql) before returning so the public
dry_plan API returns the same physically-overridden SQL as executed by
query()/dry_run(). Ensure you modify the dry_plan implementation (and any helper
used by DataSource.ytsaurus) so the returned dialect_sql has had
_apply_physical_overrides applied, keeping behavior consistent with query() and
dry_run().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fdb5f585-0325-47c3-b780-99144b24cb9d
📒 Files selected for processing (1)
core/wren/src/wren/engine.py
dry_plan() returned the raw dialect SQL without the YT path substitution that query() and dry_run() applied later. Callers that used dry_plan as the public planning API for YTsaurus saw synthetic schema.table identifiers that don't match what CHYT actually executes. Move the override into dry_plan so all three APIs agree, and drop the now- redundant call sites in query()/dry_run(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
Thanks @nar3k, there are some CI failures. Could you fix them before I review this PR? |
| return ibis.clickhouse.connect( | ||
| host=info.host, | ||
| port=int(info.port), | ||
| database=info.database, | ||
| user=info.user, | ||
| password=(info.password and info.password.get_secret_value()), | ||
| settings=info.settings if info.settings else {}, | ||
| **info.kwargs if info.kwargs else {}, | ||
| ) |
There was a problem hiding this comment.
I'm not familer with ytsaurus. Is it fully compatible with ClickHouse? Because we're planning to remove the ibis-framework dependency and use the native clickhouse_connect driver instead.
I think you can connect with ytsaurus through clickhouse_connect directly in this PR.
There was a problem hiding this comment.
Hi, i will check it out and get back!
CI surfaced two failures on the YTsaurus PR: - ui tests: test_all_datasources_covered failed because the new DataSource.ytsaurus enum value was missing from field_registry.DATASOURCE_MODELS. Added the entry plus UI label overrides for proxy/clique/token/query_path. - lint: ruff format wanted ytsaurus.py reflowed (one ValueError onto a single line) and ruff check (PLC0415) flagged the deferred clickhouse_connect imports inside get_ytsaurus_connection. The imports are intentionally deferred because clickhouse_connect is an optional extra, so noqa them and document the reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@nar3k, could you add some tests or document how to test YTSaurus? It will be helpful to protect the follow-up change so that it won't break the YTsaurus connector. |
Closes #2257.
Summary
Adds a first-class
ytsaurusdata source that talks to a YTsaurus cluster through its CHYT (ClickHouse-over-YT) clique. CHYT exposes a ClickHouse-compatible HTTP protocol on the YT proxy, so the bulk of Wren's existing ClickHouse / Ibis / sqlglot path is reused — only the auth and routing shim is new.What's in scope
DataSource.ytsaurusenum + factory entry + sqlglot dialect map (ytsaurus -> clickhouse)YTsaurusConnectionInfoPydantic model (proxy,clique,tokenw/YT_TOKENenv fallback,secure,port,query_path,settings,kwargs)wren.connector.ytsaurus.YTsaurusConnector— subclass of the Ibis ClickHouse connectorwren-engine[ytsaurus]pip extra (pullsibis-framework[clickhouse])engine.py: whendata_source == ytsaurus,<schema>.<table>references are replaced with backticked YT paths sourced from each model'sproperties.ytPath. Path map is cached; gated entirely ondata_source == ytsaurus, so other connectors are unaffectedcore/wren/docs/connectors/ytsaurus.md,READMEinstall row,docs/connections.mdJSON exampleCHYT-specific details (why this is more than a ClickHouse alias)
clickhouse_connect.get_client()does not accepthttp_headersas a kwarg → headers are injected post-construction, with a belt-and-braces fallback for clients that re-snapshot state./query, not/→ newquery_pathfield (default/query) maps toclickhouse_connect'sproxy_path.chyt.clique_alias=<*alias>URL param, not bydatabase=→ the connector patchesHttpClient.paramsat construction so the alias is present before the startupSELECT version()runs.Authorization: OAuth <token>(rejects Basic / Bearer) → the connector subclassesHttpClientwith a one-line override of_init_common_settingsthat injects the OAuth header before the parent's startup query.backend.sql()introspects viaCREATE VIEW, which CHYT rejects withstd::out_of_range→YTsaurusConnector.query/dry_runbypass ibis and callclickhouse_connectdirectly (native protocol →pyarrow.Table;EXPLAIN ASTfor dry-runs).Test plan
wren context buildandwren --sql ...against a live YT cluster with a CHYT cliqueCOUNT(*), JOIN-and-aggregate, and dialect-specific functions (toUnixTimestamp,startsWith,INTERVALarithmetic,COUNT(DISTINCT ...)) all return matching results vs. the same SQL run against the cluster directlyconnector.dry_run(sql)returns the CHYTEXPLAIN ASTwithout executingcargo/ruff/ existingwrentest suite untouched — no other connector paths alteredclickhouse_connect-mock-based unit test if usefulOut of scope
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores