refactor(athena): use pyathena native driver, drop ibis dependency #2271

Closed
goldmedal wants to merge 2 commits into main from refactor/athena-native-driver

Conversation

@goldmedal
Collaborator

@goldmedal goldmedal commented May 14, 2026

Summary

The athena connector now uses pyathena directly with a Trino-style type lexer for cursor results. The athena extra no longer pulls in ibis-framework[athena].

Highlights

  • connector/athena.py (new): native pyathena DB-API cursor; type strings parsed via sqlglot (varchar, decimal(p,s), array<T>, row(...), map<K,V>, etc.) and materialised into PyArrow.
  • connector/factory.py: routes DataSource.athena to the new wren.connector.athena module.
  • model/data_source.py::get_athena_connection: preserves the Web-Identity-Token (OIDC -> AssumeRoleWithWebIdentity) and explicit access-key auth flows; falls back to the default boto3 credential chain otherwise. Now returns a pyathena.connection.Connection.
  • pyproject.toml: athena extra -> pyathena[pandas]>=3.
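
To illustrate what the type lexer has to cope with, here is a minimal sketch that parses Trino-style type strings into nested tuples. It is not the PR's implementation: the connector parses with sqlglot and emits PyArrow types, whereas this sketch swaps in a small hand-rolled recursive parser so it runs on the standard library alone. The names `parse_trino_type`, `_split_top_level` and `_PRIMITIVES`, the tuple layout, and the choice to map unknown types to a string are all assumptions made for illustration.

```python
# Assumed primitive names; anything unrecognised falls back to string in this sketch.
_PRIMITIVES = {"boolean", "tinyint", "smallint", "int", "integer", "bigint",
               "real", "double", "date", "timestamp", "varbinary"}


def _split_top_level(s: str) -> list:
    """Split on commas that are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    tail = s[start:].strip()
    if tail:
        parts.append(tail)
    return parts


def parse_trino_type(type_str: str) -> tuple:
    """Parse a Trino-style type string into a nested tuple (hypothetical layout)."""
    t = type_str.strip().lower()
    if t.startswith("array<") and t.endswith(">"):
        return ("array", parse_trino_type(t[6:-1]))
    if t.startswith("map<") and t.endswith(">"):
        key, value = _split_top_level(t[4:-1])
        return ("map", parse_trino_type(key), parse_trino_type(value))
    if t.startswith("row(") and t.endswith(")"):
        # Each field is "<name> <type>"; split on the first space only.
        fields = tuple(
            (name, parse_trino_type(ftype))
            for name, _, ftype in (f.partition(" ") for f in _split_top_level(t[4:-1]))
        )
        return ("row", fields)
    if t.startswith("decimal(") and t.endswith(")"):
        precision, scale = (int(p) for p in _split_top_level(t[8:-1]))
        return ("decimal", precision, scale)
    if t.startswith(("varchar", "char")):
        return ("string",)
    if t in _PRIMITIVES:
        return (t,)
    return ("string",)  # assumed fallback for unknown type names
```

The recursion mirrors the nesting of the type string, so `array<decimal(10,2)>` becomes an array node wrapping a decimal node. A real implementation would map each node to the corresponding PyArrow type instead of a tuple.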

Tests

  • tests/unit/test_athena_connector.py (new, mocked) covers:
    • Type lexer for primitives, decimal, array, map, row and unknown/null fallbacks.
    • Cursor -> Arrow materialisation (mixed types, empty result).
    • Connector query/dry_run/close: kwargs forwarding, EXPLAIN dry-run, limit slicing, error wrapping (INVALID_SQL + correct ErrorPhase).
    • Credential resolution: explicit access keys, OIDC via STS AssumeRoleWithWebIdentity (mocked boto3), and default credential chain fallback.
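
The three credential paths listed above can be sketched as a single kwargs builder. This is a hedged illustration, not the PR's `_build_connect_kwargs`: the `AthenaInfoSketch` dataclass and its field names are hypothetical stand-ins for `AthenaConnectionInfo`, the priority order (OIDC first, then explicit keys, then the default chain) is an assumption, and the STS client is injected so the sketch runs without boto3. The `assume_role_with_web_identity` parameters and response keys follow the STS API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class AthenaInfoSketch:
    """Hypothetical stand-in for AthenaConnectionInfo; field names are assumed."""
    s3_staging_dir: str
    schema_name: Optional[str] = None
    web_identity_token: Optional[str] = None
    role_arn: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    aws_session_token: Optional[str] = None
    kwargs: Dict[str, Any] = field(default_factory=dict)


def build_connect_kwargs(info: AthenaInfoSketch, sts_client=None) -> Dict[str, Any]:
    """Build pyathena-style connect kwargs (assumed priority: OIDC, keys, default)."""
    kwargs: Dict[str, Any] = {"s3_staging_dir": info.s3_staging_dir}
    if info.schema_name:  # only set when truthy
        kwargs["schema_name"] = info.schema_name

    if info.web_identity_token and info.role_arn:
        # 1. OIDC: exchange the web identity token for temporary credentials.
        resp = sts_client.assume_role_with_web_identity(
            RoleArn=info.role_arn,
            RoleSessionName="athena-sketch",  # arbitrary name for this sketch
            WebIdentityToken=info.web_identity_token,
        )
        creds = resp["Credentials"]
        kwargs["aws_access_key_id"] = creds["AccessKeyId"]
        kwargs["aws_secret_access_key"] = creds["SecretAccessKey"]
        kwargs["aws_session_token"] = creds["SessionToken"]
    elif info.aws_access_key_id and info.aws_secret_access_key:
        # 2. Explicit access keys, with an optional session token.
        kwargs["aws_access_key_id"] = info.aws_access_key_id
        kwargs["aws_secret_access_key"] = info.aws_secret_access_key
        if info.aws_session_token:
            kwargs["aws_session_token"] = info.aws_session_token
    # 3. Otherwise add no credentials and let the default boto3 chain resolve them.

    kwargs["kill_on_interrupt"] = True  # default; caller kwargs below may override
    kwargs.update(info.kwargs)
    return kwargs
```

Injecting the STS client keeps the builder easy to test with a stub, which matches the mocked-boto3 approach the test list describes.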

Test plan

  • just install-dev
  • just lint -> all checks passed
  • uv run pytest tests/unit/test_athena_connector.py -v -> 25 passed
  • just test-unit -> 179 passed in this branch (one unrelated pre-existing failure on test_context_cli.py::test_validate_strict_warns, also fails on main).
  • CI green

Summary by CodeRabbit

Release Notes

  • Refactor
    • Athena connector implementation improved with better type handling, support for nested data structures, and flexible credential resolution.

The athena connector now uses pyathena directly with a Trino-style type
lexer to materialise cursor results into PyArrow tables, removing the
ibis-framework[athena] dependency from the athena extra.

Highlights
- connector/athena.py: native pyathena cursor; type strings parsed via
  sqlglot (varchar, decimal(p,s), array<T>, row(...), map<K,V>, etc.).
- model/data_source.py::get_athena_connection: preserves the
  Web-Identity-Token (OIDC -> AssumeRoleWithWebIdentity) and access-key
  auth flows; returns a pyathena.connection.Connection.
- pyproject.toml: athena extra -> pyathena[pandas]>=3.

Tests
- tests/unit/test_athena_connector.py mocks pyathena cursor + boto3 STS
  to verify the type lexer, cursor->Arrow materialisation, error
  mapping, and all three credential resolution paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: be762c0b-6204-4dc1-b8ed-0e1f8512be49

📥 Commits

Reviewing files that changed from the base of the PR and between 6aa3ff9 and 3410017.

📒 Files selected for processing (2)
  • core/wren/src/wren/model/data_source.py
  • core/wren/tests/unit/test_athena_connector.py

Walkthrough

This PR replaces the Ibis-based Athena connector with a native PyAthena implementation. It adds a new connector module that parses Athena/Trino types to PyArrow, builds tables from DB-API cursors, handles AWS credential resolution (OIDC and explicit keys), and provides query/dry-run/close operations with proper error handling and integration into the connector factory.

Changes

Native Athena Connector

| Layer / File(s) | Summary |
| --- | --- |
| Type parsing and value coercion (core/wren/src/wren/connector/athena.py) | Athena/Trino type strings are parsed using sqlglot and mapped to PyArrow types with support for primitives, decimals (precision/scale bounds), arrays, maps, and row/struct types. Cursor values are coerced into Arrow arrays using JSON serialization for structured inputs, binary conversion, decimal coercion, and ISO parsing for temporal types. |
| Connection setup with credential resolution (core/wren/src/wren/connector/athena.py, core/wren/src/wren/model/data_source.py) | Connection kwargs builder translates AthenaConnectionInfo into PyAthena parameters, supporting OIDC/web-identity STS assume-role, explicit access keys with optional session tokens, and default AWS credential fallback. get_athena_connection is refactored to use pyathena.connect and documents credential resolution priority. |
| AthenaConnector class and factory wiring (core/wren/src/wren/connector/athena.py, core/wren/src/wren/connector/factory.py) | AthenaConnector implements query (SQL execution with Arrow conversion and optional limit), dry_run (EXPLAIN queries), and close (connection cleanup); both query and dry_run translate non-WrenError exceptions into WrenError(INVALID_SQL) with phase metadata. The factory registry maps DataSource.athena to the new wren.connector.athena module, and DataSource.athena is removed from _NEEDS_DATA_SOURCE to use connection_info-only instantiation. |
| Dependency declaration (core/wren/pyproject.toml) | Optional dependency for athena is updated from ibis-framework[athena] to pyathena[pandas]>=3. |
| Test suite (core/wren/tests/unit/test_athena_connector.py) | Comprehensive unit tests validate type parsing for all type families, Arrow table construction from cursor metadata and rows, AthenaConnector behavior (query execution with limit, error wrapping, dry-run, idempotent close), credential resolution paths (explicit keys, OIDC with STS assume-role, default chain), and integration with get_athena_connection. Tests stub PyAthena and boto3 to run without AWS credentials. |
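
The error-wrapping behaviour described for query and dry_run can be sketched as follows. This is an illustration under stated assumptions, not the connector's code: `WrenError`, `ErrorCode` and `ErrorPhase` here are hypothetical stand-ins whose constructor and member names are guessed, the result is returned as plain rows rather than a PyArrow table, and the standard-library `sqlite3` driver stands in for pyathena as the DB-API connection.

```python
import contextlib
import sqlite3
from enum import Enum


class ErrorCode(Enum):  # hypothetical stand-in
    INVALID_SQL = "INVALID_SQL"


class ErrorPhase(Enum):  # hypothetical member names
    SQL_EXECUTION = "SQL_EXECUTION"
    SQL_DRY_RUN = "SQL_DRY_RUN"


class WrenError(Exception):  # hypothetical constructor signature
    def __init__(self, code, message, phase=None):
        super().__init__(message)
        self.code = code
        self.phase = phase


class ConnectorSketch:
    def __init__(self, connection):
        self.connection = connection

    def _run(self, sql, phase):
        try:
            with contextlib.closing(self.connection.cursor()) as cursor:
                cursor.execute(sql)
                return cursor.fetchall()
        except WrenError:
            raise  # already in the expected shape; pass through untouched
        except Exception as e:
            # Any driver error is reported as INVALID_SQL, tagged with the phase.
            raise WrenError(ErrorCode.INVALID_SQL, str(e), phase=phase) from e

    def query(self, sql):
        return self._run(sql, ErrorPhase.SQL_EXECUTION)

    def dry_run(self, sql):
        # Validate the statement by asking the engine to plan it.
        self._run(f"EXPLAIN {sql}", ErrorPhase.SQL_DRY_RUN)


connector = ConnectorSketch(sqlite3.connect(":memory:"))
rows = connector.query("SELECT 1")
```

Re-raising `WrenError` before the generic handler keeps errors that are already classified from being wrapped a second time.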

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • Canner/WrenAI#2047: Adds UI and data-model support for OIDC web-identity token and role ARN configuration in Athena credentials, which pairs with this PR's backend credential resolution implementation.

Suggested reviewers

  • onlyjackfrost
  • fredalai

Poem

🐰 A rabbit hops through types with glee,
PyArrow tables dance and twirl so free,
From Ibis chains now liberated fast,
PyAthena's native speed shall be the last!
OIDC tokens and STS flows align,
This migration's tale is oh so fine. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.51% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: refactoring the Athena connector to use the pyathena native driver while removing the ibis dependency. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions github-actions Bot added the dependencies (Pull requests that update a dependency file), python (Pull requests that update Python code), and core labels May 14, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/wren/src/wren/model/data_source.py (1)

250-294: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical type mismatch: get_athena_connection returns pyathena.Connection but get_connection is annotated → BaseBackend.

The dispatcher in get_connection (line 233) calls getattr(self, f"get_{self.name}_connection")(info) and expects a BaseBackend. Athena callers will receive a pyathena.Connection instead — breaking any code that uses ibis methods like .sql(), .table(), or .list_tables().

Additionally, this method duplicates _build_connect_kwargs with critical divergences already in place:

| Aspect | get_athena_connection | _build_connect_kwargs |
| --- | --- | --- |
| schema_name | Unconditionally set (even if None) | Only set when truthy |
| kill_on_interrupt | Not set | Defaulted to True |
| info.kwargs | Ignored | Merged into kwargs |

Refactor by reusing the shared helper to avoid future divergence:

♻️ Suggested refactor
     @staticmethod
     def get_athena_connection(info: AthenaConnectionInfo):
         """Open a pyathena DB-API connection.
         ...
         """
         from pyathena import connect  # noqa: PLC0415
+        from wren.connector.athena import _build_connect_kwargs  # noqa: PLC0415
 
-        kwargs: dict[str, Any] = {
-            "s3_staging_dir": info.s3_staging_dir.get_secret_value(),
-            "schema_name": info.schema_name,
-        }
-        ...
-        return connect(**kwargs)
+        return connect(**_build_connect_kwargs(info))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/model/data_source.py` around lines 250 - 294,
get_athena_connection currently returns a raw pyathena.Connection (breaking
get_connection's expectation of a BaseBackend) and reimplements logic already in
_build_connect_kwargs with subtle differences; modify get_athena_connection to
call the shared _build_connect_kwargs(info) to produce kwargs (so schema_name,
kill_on_interrupt and info.kwargs are handled identically), then pass those
kwargs into pyathena connect and wrap/return the resulting connection as the
expected BaseBackend (the same backend type other branches return) so the
dispatcher in get_connection and callers using .sql(), .table(), .list_tables()
get a consistent ibis backend object; keep references to get_athena_connection,
get_connection and _build_connect_kwargs to locate the change.
🧹 Nitpick comments (3)
core/wren/src/wren/connector/athena.py (3)

294-310: ⚡ Quick win

Push limit into the cursor fetch instead of slicing after fetchall.

_build_athena_arrow_table calls cursor.fetchall(), materialising every row in memory, after which table.slice(0, limit) discards the surplus. For a query that returns millions of rows but limit=100, this is a large amount of wasted memory and network transfer from Athena's result store.

Consider threading limit through to the table builder and using cursor.fetchmany(limit):

♻️ Suggested change
-def _build_athena_arrow_table(cursor) -> pa.Table:
+def _build_athena_arrow_table(cursor, limit: int | None = None) -> pa.Table:
     """Materialise a pyathena DB-API cursor into a PyArrow table."""
     if cursor.description is None:
         return pa.table({})
 
-    rows = cursor.fetchall()
+    rows = cursor.fetchmany(limit) if limit is not None else cursor.fetchall()
     def query(self, sql: str, limit: int | None = None) -> pa.Table:
         try:
             with contextlib.closing(self.connection.cursor()) as cursor:
                 cursor.execute(sql)
-                table = _build_athena_arrow_table(cursor)
-            if limit is not None:
-                table = table.slice(0, limit)
+                table = _build_athena_arrow_table(cursor, limit=limit)
             return table
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/athena.py` around lines 294 - 310, The query
method currently calls _build_athena_arrow_table which uses cursor.fetchall(),
causing full materialization before applying limit; modify query to pass the
limit into _build_athena_arrow_table (or create a new
_build_athena_arrow_table_with_limit) and change the builder to use
cursor.fetchmany(limit) (or iterate fetchmany in batches) instead of fetchall(),
so that when query(sql, limit=...) is called the cursor only fetches up to the
requested rows and no post-slice is needed.
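
As a runnable illustration of the fetchmany suggestion, the sketch below applies the limit at fetch time against a DB-API cursor. It uses the standard-library `sqlite3` driver as a stand-in for pyathena, and `fetch_rows` is a hypothetical helper that returns column names and rows instead of a PyArrow table. How much this saves with pyathena depends on how the driver pages results from S3, which the sketch does not model.

```python
import contextlib
import sqlite3


def fetch_rows(connection, sql, limit=None):
    """Fetch at most `limit` rows, or every row when limit is None."""
    with contextlib.closing(connection.cursor()) as cursor:
        cursor.execute(sql)
        if cursor.description is None:
            return [], []  # statement produced no result set
        columns = [col[0] for col in cursor.description]
        # fetchmany stops after `limit` rows instead of materialising everything.
        rows = cursor.fetchmany(limit) if limit is not None else cursor.fetchall()
        return columns, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
columns, rows = fetch_rows(conn, "SELECT n FROM t ORDER BY n", limit=3)
```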

329-332: 💤 Low value

Consider logging the swallowed exception on close.

Silently dropping close-time errors makes it impossible to diagnose connection-pool/socket issues. A debug-level log preserves the cleanup semantics while keeping a breadcrumb.

-        try:
-            self.connection.close()
-        except Exception:
-            pass
+        try:
+            self.connection.close()
+        except Exception as e:  # noqa: BLE001
+            logger.debug("Failed to close Athena connection: %s", e)
         finally:
             self.connection = None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/athena.py` around lines 329 - 332, The close
block currently swallows exceptions from self.connection.close(); replace the
bare except/pass with a debug-level log that records the exception (e.g., use
the class/module logger or logging.getLogger(__name__) to call
logger.debug("Error closing Athena connection", exc_info=True)) so close-time
errors are preserved for diagnosis while keeping cleanup behavior.
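
A self-contained sketch of the logged, idempotent close described above. `ClosableSketch` is a hypothetical class used only for illustration; the connection object can be anything with a `close()` method.

```python
import logging

logger = logging.getLogger(__name__)


class ClosableSketch:
    def __init__(self, connection):
        self.connection = connection

    def close(self) -> None:
        """Close once; later calls are no-ops. Close-time errors are logged, not raised."""
        if self.connection is None:
            return
        try:
            self.connection.close()
        except Exception:  # noqa: BLE001
            # exc_info=True attaches the traceback to the debug record.
            logger.debug("Error closing connection", exc_info=True)
        finally:
            self.connection = None
```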

257-272: 💤 Low value

STS web-identity credentials expire; consider refresh strategy for long-lived connectors.

assume_role_with_web_identity returns credentials with a fixed TTL (default 1 hour). The credentials are extracted and passed as static values to pyathena.connect() at AthenaConnector.__init__() (lines 270–272), so any AthenaConnector instance reused beyond that window will fail on subsequent queries with an expired-credentials error.

If AthenaConnector instances are short-lived (per-request), this is fine. If Engine instances are cached or pooled for long-running sessions, consider either: (a) constructing a botocore RefreshableCredentials provider and passing it via the botocore_session kwarg (if pyathena supports it), or (b) recreating the connection on credential expiry.
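
Option (b) can be sketched without any AWS dependency. `ExpiringConnection` is a hypothetical wrapper, not part of the PR: the injected `factory` is assumed to perform the STS call, open the connection, and return it together with the credentials' expiry time (STS reports this as `Expiration` in the `Credentials` payload). The five-minute safety margin is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional, Tuple


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


class ExpiringConnection:
    """Recreate a connection shortly before its temporary credentials expire."""

    def __init__(
        self,
        factory: Callable[[], Tuple[Any, datetime]],
        margin: timedelta = timedelta(minutes=5),
        now: Callable[[], datetime] = _utc_now,
    ):
        self._factory = factory  # returns (connection, expires_at)
        self._margin = margin
        self._now = now
        self._conn: Optional[Any] = None
        self._expires_at: Optional[datetime] = None

    def get(self) -> Any:
        """Return a live connection, rebuilding it when expiry is within the margin."""
        stale = self._conn is None or self._now() >= self._expires_at - self._margin
        if stale:
            if self._conn is not None:
                self._conn.close()
            self._conn, self._expires_at = self._factory()
        return self._conn
```

Checking the clock before each use avoids parsing driver error messages to detect expired credentials, at the cost of occasionally reconnecting slightly early.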

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/wren/src/wren/connector/athena.py` around lines 257 - 272, The code in
AthenaConnector.__init__ calls sts.assume_role_with_web_identity and injects the
returned static creds into kwargs passed to pyathena.connect, which will expire;
replace this static injection with a refreshable credential strategy: either
create a botocore.session.Session with
botocore.credentials.RefreshableCredentials (or use boto3's get_credentials
refresh mechanism) that calls assume_role_with_web_identity when expired and
pass that session via the botocore_session kwarg to pyathena.connect, or
implement lazy/transparent reconnect logic in AthenaConnector (e.g., detect
expired-credentials errors on query execution and re-run
assume_role_with_web_identity to recreate the pyathena connection); locate the
logic around assume_role_with_web_identity, the creds assignment to
kwargs["aws_access_key_id"/"aws_secret_access_key"/"aws_session_token"], and
pyathena.connect usage to apply this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/wren/tests/unit/test_athena_connector.py`:
- Around line 307-309: The test
test_data_source_get_athena_connection_returns_pyathena_connection currently
only asserts the side-effect _pyathena_connect_calls; update it to also assert
the returned value from DataSourceExtension.get_athena_connection(_info()) is
the expected pyathena connection object (for example, compare to the mocked
connection object or assert it's truthy/has expected attributes). Locate the
call to DataSourceExtension.get_athena_connection in the test and add a concise
assertion on the returned_connection variable (e.g., returned_connection is
mock_conn or returned_connection is not None / has expected type) while keeping
the existing _pyathena_connect_calls assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 984d7cb3-c615-4abb-94a3-50ac7f11131c

📥 Commits

Reviewing files that changed from the base of the PR and between 544ecab and 6aa3ff9.

⛔ Files ignored due to path filters (1)
  • core/wren/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • core/wren/pyproject.toml
  • core/wren/src/wren/connector/athena.py
  • core/wren/src/wren/connector/factory.py
  • core/wren/src/wren/model/data_source.py
  • core/wren/tests/unit/test_athena_connector.py

Comment thread core/wren/tests/unit/test_athena_connector.py
CodeRabbit flagged that data_source.get_athena_connection re-implemented
the pyathena connect-kwargs logic that also lives in connector/athena.py,
letting them drift on schema_name / kill_on_interrupt / info.kwargs, and
the get_connection signature claimed BaseBackend even though the Athena
path returned a raw pyathena Connection.

Route data_source.get_athena_connection through the connector's shared
_build_connect_kwargs builder, and widen the return-type annotation to a
BackendOrConnection Union so the type matches reality without forcing a
runtime pyathena import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@goldmedal
Collaborator Author

Replies to CodeRabbit review-body findings (no inline IDs available):

  • data_source.py:250-294 (dup connection-helper + BaseBackend annotation): Addressed in 3410017: get_athena_connection now delegates to _build_connect_kwargs and the return type is a BackendOrConnection union.
  • athena.py:294-310 (fetchall then slice): Default LIMIT values are small enough that switching to fetchmany would add complexity for marginal benefit; pyathena's underlying S3 paging already buffers results internally.
  • athena.py:329-332 (log close exception): Matches teardown pattern in sibling connectors — close errors are intentionally swallowed to avoid cascading failures during cleanup.
  • athena.py:257-272 (STS refresh credentials): Out of scope for this PR. Long-running session credential refresh is a feature, not a bug fix; will track as a separate issue.

@goldmedal
Collaborator Author

Superseded by #2313 — these seven native-driver refactors were consolidated into a single feature branch to resolve shared-file conflicts (data_source.py, pyproject.toml, uv.lock, factory.py, etc.) once instead of seven times.

@goldmedal goldmedal closed this May 21, 2026

Labels

core, dependencies (Pull requests that update a dependency file), python (Pull requests that update Python code)
