
chore: added async mysql db driver dependency#243

Open
rohansen856 wants to merge 13 commits into openml:main from rohansen856:async_db

Conversation

@rohansen856

Solves: #229
Replaces the current mysqlclient implementation with aiomysql for async DB connection support. This requires replacing the DB drivers under database/ and routes/ with async implementations and converting the functions to async def.

Signed-off-by: rohansen856 <rohansen856@gmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

The PR converts the codebase to async DB usage: adds aiomysql and pytest-asyncio/pytest-xdist to project config, switches engine creation to AsyncEngine using mysql+aiomysql, replaces sqlalchemy Connection with AsyncConnection across database modules and routers, converts numerous functions and FastAPI route handlers to async/await, updates dependency helpers and tests to async fixtures, and adds an async application lifespan that closes databases on shutdown.

Possibly related PRs

  • openml/server-api PR 218 — touches the same router modules (notably src/routers/openml/flows.py and src/routers/openml/tasks.py), affecting flow/subflow resolution and task template handling.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 12.04%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The PR title accurately summarizes the main change (adding the async MySQL DB driver dependency, aiomysql) and clearly identifies the primary modification from the developer's perspective.
  • Description check: ✅ Passed. The description is directly related to the changeset, explaining the motivation (issue #229) and the scope of work required (replacing drivers, converting to async def), and provides sufficient context.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@rohansen856 rohansen856 marked this pull request as ready for review February 26, 2026 11:02
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • Several dependencies are annotated as AsyncConnection | None and then immediately asserted non-None (e.g. in datasets, study, tasktype, mldcat_ap routers); consider making the dependency providers return a non-optional AsyncConnection and updating the type hints so the assert statements are unnecessary.
  • In places where user.get_groups() is awaited multiple times within a single request (e.g. datasets.update_dataset_status, study._get_study_raise_otherwise, study.attach_to_study), consider awaiting once and reusing the result to avoid repeated coroutine calls and keep the code clearer.
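The second point is easy to demonstrate with a small stdlib sketch; the User class below is a hypothetical stand-in for the project's own type, with each get_groups() call simulating one DB round-trip:

```python
import asyncio

class User:
    """Hypothetical stand-in: each get_groups() call hits the database."""
    def __init__(self) -> None:
        self.db_calls = 0

    async def get_groups(self) -> list[int]:
        self.db_calls += 1          # simulates one DB round-trip
        return [1, 2]

async def handler_naive(user: User) -> bool:
    # Awaiting the coroutine twice runs the query twice.
    return 1 in await user.get_groups() and 2 in await user.get_groups()

async def handler_reused(user: User) -> bool:
    # Await once, reuse the result.
    groups = await user.get_groups()
    return 1 in groups and 2 in groups

naive, reused = User(), User()
asyncio.run(handler_naive(naive))
asyncio.run(handler_reused(reused))
print(naive.db_calls, reused.db_calls)  # 2 1
```

If the real User caches results internally the extra awaits are cheaper, but awaiting once still keeps the handler's data flow explicit.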
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Several dependencies are annotated as `AsyncConnection | None` and then immediately asserted non-`None` (e.g. in `datasets`, `study`, `tasktype`, `mldcat_ap` routers); consider making the dependency providers return a non-optional `AsyncConnection` and updating the type hints so the `assert` statements are unnecessary.
- In places where `user.get_groups()` is awaited multiple times within a single request (e.g. `datasets.update_dataset_status`, `study._get_study_raise_otherwise`, `study.attach_to_study`), consider awaiting once and reusing the result to avoid repeated coroutine calls and keep the code clearer.

## Individual Comments

### Comment 1
<location path="src/database/users.py" line_range="34-38" />
<code_context>

-def get_user_groups_for(*, user_id: int, connection: Connection) -> list[UserGroup]:
-    row = connection.execute(
+async def get_user_groups_for(*, user_id: int, connection: AsyncConnection) -> list[UserGroup]:
+    row = await connection.execute(
         text(
</code_context>
<issue_to_address>
**suggestion:** Avoid re-wrapping UserGroup values and keep return type consistently primitive or enum.

`get_user_groups_for` currently builds `[UserGroup(group) for (group,) in rows]`, and `User.get_groups` then wraps those again via `[UserGroup(group_id) for group_id in groups]`. This double conversion is redundant and hides the actual return type of `get_user_groups_for`. Either return `list[int]` (raw IDs) and convert only in `User.get_groups`, or return `list[UserGroup]` here and remove the extra wrapping in `User.get_groups`.

Suggested implementation:

```python
async def get_user_groups_for(*, user_id: int, connection: AsyncConnection) -> list[int]:

```

```python
    rows = result.fetchall()
    return [group for (group,) in rows]

```

Changing `get_user_groups_for` to return `list[int]` means any call sites (other than `User.get_groups`, which already wraps raw group IDs) that expect `list[UserGroup]` will need to be updated to either:
1. Adjust their type hints to `list[int]`, or
2. Wrap the returned IDs via `UserGroup(group_id)` where enum instances are required.

Please search for all usages of `get_user_groups_for` and update their expectations and type hints accordingly.
</issue_to_address>

### Comment 2
<location path="tests/routers/openml/users_test.py" line_range="21" />
<code_context>
-def test_fetch_user(api_key: str, user: User, user_test: Connection) -> None:
-    db_user = fetch_user(api_key, user_data=user_test)
+async def test_fetch_user(api_key: str, user: User, user_test: AsyncConnection) -> None:
+    db_user = await fetch_user(api_key, user_data=user_test)
     assert db_user is not None
     assert user.user_id == db_user.user_id
-    assert user.groups == db_user.groups
+    assert user._groups == db_user._groups  # noqa: SLF001
</code_context>
<issue_to_address>
**suggestion (testing):** Prefer exercising the public async `get_groups()` API instead of comparing the private `_groups` attribute.

Because the test now inspects `_groups` directly, it no longer validates the async group-fetching behavior and is more brittle to internal changes. Instead, await `get_groups()` on `db_user` and compare that result against the expected groups, e.g. by also using `get_groups()` on your reference user or on the parametrized `user`. This keeps the test aligned with the public interface and the new async flow.
</issue_to_address>



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/routers/openml/datasets_test.py (1)

79-91: ⚠️ Potential issue | 🟠 Major

Test may fail with AssertionError instead of expected HTTPException.

The test passes user_db=None, but looking at get_dataset in src/routers/openml/datasets.py (from relevant snippets), it has assert user_db is not None at line 397. This assertion will fire before any access check occurs, causing an AssertionError rather than the expected HTTPException with status FORBIDDEN.

If the intent is to test the "no access" path, user_db should be provided (possibly via the user_test fixture), so the access check logic is actually exercised.

Proposed fix
 async def test_private_dataset_no_access(
     user: User | None,
     expdb_test: AsyncConnection,
+    user_test: AsyncConnection,
 ) -> None:
     with pytest.raises(HTTPException) as e:
         await get_dataset(
             dataset_id=130,
             user=user,
-            user_db=None,
+            user_db=user_test,
             expdb_db=expdb_test,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/datasets_test.py` around lines 79 - 91, The test
test_private_dataset_no_access is passing user_db=None which triggers the assert
user_db is not None in get_dataset before access logic runs; update the test to
supply a valid user_db (e.g., use the user_test fixture or the existing user
fixture) so that get_dataset's access checks execute and the HTTPException
FORBIDDEN is raised—locate test_private_dataset_no_access and change the call to
get_dataset to pass a populated user_db instead of None.
tests/routers/openml/flows_test.py (1)

27-33: ⚠️ Potential issue | 🔴 Critical

All three test functions in this file require AsyncMock to properly mock the async get_by_name function.

database.flows.get_by_name is async, but all three tests mock it with mocker.patch() which creates a regular MagicMock. When flow_exists awaits the mocked function, it will fail with TypeError: object ... can't be used in 'await' expression.

Apply this fix to all three test functions:

Proposed fixes
-    mocked_db = mocker.patch("database.flows.get_by_name")
+    mocked_db = mocker.patch("database.flows.get_by_name", new_callable=mocker.AsyncMock)
     await flow_exists(name, external_version, expdb_test)
     fake_flow = mocker.MagicMock(id=flow_id)
-    mocker.patch(
+    mocked_db = mocker.patch(
         "database.flows.get_by_name",
+        new_callable=mocker.AsyncMock,
         return_value=fake_flow,
     )
-    mocker.patch("database.flows.get_by_name", return_value=None)
+    mocker.patch("database.flows.get_by_name", new_callable=mocker.AsyncMock, return_value=None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/flows_test.py` around lines 27 - 33, The tests call the
async function flow_exists which awaits database.flows.get_by_name, but the
tests currently patch get_by_name with a regular MagicMock; change the patches
in all three test functions to use an AsyncMock so the awaited call works:
replace mocker.patch("database.flows.get_by_name") with a patch that supplies an
AsyncMock (e.g., mocker.patch("database.flows.get_by_name", new=AsyncMock()) or
mocker.patch("database.flows.get_by_name", new_callable=AsyncMock)) and keep the
same assertions against the mocked_db returned by the patch in each test that
references mocked_db, flow_exists, and database.flows.get_by_name.
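The failure mode and the fix can be reproduced with the stdlib alone (no pytest-mock needed; the patching mechanics are the same, and flow_exists here is a toy stand-in for the router helper):

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

async def flow_exists(get_by_name) -> bool:
    # Stand-in for the router helper: it awaits the (possibly mocked) lookup.
    return await get_by_name("weka.J48") is not None

# A plain MagicMock's call returns a plain value, which is not awaitable.
try:
    asyncio.run(flow_exists(MagicMock(return_value=None)))
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# AsyncMock's call returns a coroutine, so the await works.
result = asyncio.run(flow_exists(AsyncMock(return_value=None)))
print(result)  # False
```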
src/routers/mldcat_ap/dataset.py (1)

170-183: ⚠️ Potential issue | 🟠 Major

Validate feature_no bounds before indexing.

Line 182 can raise IndexError, and negative indices currently allow unintended reverse indexing. Return a controlled client error for out-of-range values.

Proposed fix
 async def get_distribution_feature(
@@
-    feature = features[feature_no]
+    if feature_no < 0 or feature_no >= len(features):
+        raise HTTPException(status_code=404, detail="Feature not found.")
+    feature = features[feature_no]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/mldcat_ap/dataset.py` around lines 170 - 183, In
get_distribution_feature, validate feature_no against the features list before
indexing: check that feature_no is an integer >= 0 and < len(features) (use the
local features variable returned by get_dataset_features) and if it is out of
range raise a controlled client error (e.g., raise fastapi.HTTPException with a
400/404 and a clear message) so negative values cannot trigger reverse indexing
and IndexError is avoided; perform this bounds check immediately before using
features[feature_no] in get_distribution_feature.
🧹 Nitpick comments (7)
pyproject.toml (1)

20-21: Remove the mysqlclient dependency from pyproject.toml.

The codebase has been fully migrated to async operations using sqlalchemy.ext.asyncio and aiomysql. The sync MySQL driver mysqlclient is not imported or used anywhere. It can be safely removed to reduce unnecessary dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 20 - 21, Remove the unused synchronous MySQL
driver by deleting the "mysqlclient" entry from the dependencies list in
pyproject.toml; since the code uses async DB access via aiomysql and
sqlalchemy.ext.asyncio, ensure only "aiomysql" remains, then update the project
lock (e.g., poetry lock / poetry update) to reflect the removed dependency.
tests/routers/openml/study_test.py (2)

547-576: Same redundant pattern in other tests.

Lines 551 and 567 also have redundant UPDATE statements since _attach_tasks_to_study already performs this update. Consider removing these duplicate calls for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/study_test.py` around lines 547 - 576, Remove the
redundant explicit DB update calls before calling _attach_tasks_to_study in both
test_attach_task_to_study_already_linked_raises and
test_attach_task_to_study_but_task_not_exist_raises: since
_attach_tasks_to_study already performs the "UPDATE study SET status =
'in_preparation' WHERE id = 1" step, delete the await expdb_test.execute(...)
lines in those two tests so the setup only happens inside _attach_tasks_to_study
and avoid duplicate state changes.

533-544: Redundant database update.

Line 536 executes the same UPDATE statement that _attach_tasks_to_study executes on line 514. The helper already sets the study status to 'in_preparation', so this pre-update is unnecessary.

♻️ Suggested fix
 async def test_attach_task_to_study_needs_owner(
     py_api: TestClient, expdb_test: AsyncConnection
 ) -> None:
-    await expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
     response = await _attach_tasks_to_study(
         study_id=1,
         task_ids=[2, 3, 4],
         api_key="00000000000000000000000000000000",
         py_api=py_api,
         expdb_test=expdb_test,
     )
     assert response.status_code == HTTPStatus.FORBIDDEN
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/study_test.py` around lines 533 - 544, The test
test_attach_task_to_study_needs_owner contains a redundant manual UPDATE via
expdb_test.execute that duplicates the state setup already performed by the
helper _attach_tasks_to_study; remove the explicit
expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id =
1")) call so the test relies on the helper to set the study status to
'in_preparation' before asserting the forbidden response, leaving the rest of
the call to _attach_tasks_to_study and the HTTPStatus.FORBIDDEN assertion
unchanged.
src/routers/openml/tasktype.py (1)

29-40: Consider making expdb non-optional to match get_task_type.

The expdb parameter is typed as AsyncConnection | None with a runtime assertion, while get_task_type (line 46) uses non-optional AsyncConnection. Since FastAPI's dependency injection will always provide a connection, the | None and assertion pattern adds unnecessary complexity.

This is consistent with other endpoints in this PR, so it may be an intentional pattern, but worth noting for consistency within this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/tasktype.py` around lines 29 - 40, The list_task_types
endpoint declares expdb as Annotated[AsyncConnection | None,
Depends(expdb_connection)] and then asserts it's not None; change the parameter
to Annotated[AsyncConnection, Depends(expdb_connection)] (make expdb
non-optional to match get_task_type and get_task_types) and remove the runtime
assert; confirm _normalize_task_type and await get_task_types(expdb) still use
the non-optional AsyncConnection signature and update any related type hints if
needed.
src/database/users.py (1)

61-65: Avoid redundant UserGroup re-wrapping in User.get_groups.

Line 64 re-casts values that are already returned as UserGroup by get_user_groups_for, which is unnecessary.

Proposed simplification
     async def get_groups(self) -> list[UserGroup]:
         if self._groups is None:
-            groups = await get_user_groups_for(user_id=self.user_id, connection=self._database)
-            self._groups = [UserGroup(group_id) for group_id in groups]
+            self._groups = await get_user_groups_for(
+                user_id=self.user_id,
+                connection=self._database,
+            )
         return self._groups
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/users.py` around lines 61 - 65, The get_groups method is
re-wrapping values returned by get_user_groups_for into new UserGroup instances
unnecessarily; update User.get_groups to assign the returned list directly to
self._groups when get_user_groups_for already returns UserGroup objects (or
otherwise only wrap when items are plain IDs). Locate the method get_groups and
replace the unconditional list comprehension [UserGroup(group_id) for group_id
in groups] with logic that detects/assumes the returned items' type (e.g., if
isinstance(items[0], UserGroup) then self._groups = groups else map to
UserGroup), preserving the existing caching behavior using self._groups,
self.user_id, and self._database.
src/routers/mldcat_ap/dataset.py (1)

43-47: Make DB dependencies required instead of Optional + assert.

Lines 46-47, Line 149, and Line 176 use assert for non-None checks. assert can be stripped in optimized runs and is brittle for request-path guarantees; prefer required dependency types directly.

Proposed pattern (apply similarly to all affected handlers)
 async def get_mldcat_ap_distribution(
     distribution_id: int,
     user: Annotated[User | None, Depends(fetch_user)] = None,
-    user_db: Annotated[AsyncConnection | None, Depends(userdb_connection)] = None,
-    expdb: Annotated[AsyncConnection | None, Depends(expdb_connection)] = None,
+    user_db: Annotated[AsyncConnection, Depends(userdb_connection)],
+    expdb: Annotated[AsyncConnection, Depends(expdb_connection)],
 ) -> JsonLDGraph:
-    assert user_db is not None  # noqa: S101
-    assert expdb is not None  # noqa: S101

Also applies to: 147-150, 174-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/mldcat_ap/dataset.py` around lines 43 - 47, The handlers
currently declare dependencies as optional (Annotated[AsyncConnection | None,
Depends(...)] = None) and then use assert to check non-None; change the
dependency parameters to be required by removing the "| None" and the "= None"
default (e.g. use Annotated[AsyncConnection, Depends(userdb_connection)] and
Annotated[AsyncConnection, Depends(expdb_connection)]), and remove the
corresponding assert statements (e.g. the assert user_db is not None / assert
expdb is not None checks). Apply the same change to the other affected handlers
referenced (the ones using userdb_connection and expdb_connection at the other
locations) so all DB deps are required and no asserts are used for request
validation.
src/routers/openml/datasets.py (1)

37-40: Prefer required DB dependency types over Optional + assert guards.

Line 39, Line 107, Line 298, Line 399, and Line 400 all rely on assert to narrow optional dependencies. For request handlers, required dependency annotations are safer and clearer.

Proposed pattern (apply to all affected handlers)
 async def tag_dataset(
@@
-    expdb_db: Annotated[AsyncConnection | None, Depends(expdb_connection)] = None,
+    expdb_db: Annotated[AsyncConnection, Depends(expdb_connection)],
 ) -> dict[str, dict[str, Any]]:
-    assert expdb_db is not None  # noqa: S101

Also applies to: 105-108, 296-299, 396-401

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 37 - 40, The handlers currently
annotate the DB dependency as optional (e.g., expdb_db:
Annotated[AsyncConnection | None, Depends(expdb_connection)] = None) and then
use assert to narrow it; change those parameters to required dependency types by
removing the None union and default (e.g., use Annotated[AsyncConnection,
Depends(expdb_connection)]) and remove the subsequent assert checks; apply this
to all affected handlers that reference expdb_db (and similar params) so callers
receive a guaranteed AsyncConnection and the code can call database methods like
database.datasets.get_tags_for without runtime asserts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/database/setup.py`:
- Around line 10-22: _create_engine mutates the shared configuration by calling
database_configuration[database_name].pop("echo", False) and setting
database_configuration[database_name]["drivername"], which can break subsequent
calls; fix by copying the per-database config (e.g., cfg =
dict(load_database_configuration()[database_name])) then read echo using
cfg.pop("echo", False) or cfg.get("echo") on the copy, set cfg["drivername"] =
"mysql+aiomysql" on the copy, and pass that copy into URL.create(...) so
load_database_configuration and its returned dict are never mutated.
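The mutation hazard is easy to show with plain dicts; the loader and config keys below are hypothetical reductions of the ones named in the comment:

```python
_CONFIG = {"expdb": {"echo": True, "host": "localhost"}}

def load_database_configuration() -> dict[str, dict]:
    # Hypothetical stand-in for a cached, module-level config loader.
    return _CONFIG

def create_engine_buggy(name: str) -> None:
    cfg = load_database_configuration()[name]
    cfg.pop("echo", False)          # mutates the shared dict!
    cfg["drivername"] = "mysql+aiomysql"

def create_engine_fixed(name: str) -> None:
    cfg = dict(load_database_configuration()[name])  # shallow copy
    cfg.pop("echo", False)          # only the copy is touched
    cfg["drivername"] = "mysql+aiomysql"

create_engine_buggy("expdb")
buggy_kept_echo = "echo" in _CONFIG["expdb"]       # False: key was popped

_CONFIG["expdb"] = {"echo": True, "host": "localhost"}
create_engine_fixed("expdb")
fixed_kept_echo = "echo" in _CONFIG["expdb"]       # True: shared config intact
print(buggy_kept_echo, fixed_kept_echo)
```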

In `@src/database/studies.py`:
- Around line 171-177: attach_runs currently raises NotImplementedError so
calling database.studies.attach_runs(study_id, run_ids, user, connection) will
crash; implement the function to actually attach runs (or change the caller, but
prefer implementing). In the attach_runs(study_id: int, run_ids: list[int],
user: User, connection: AsyncConnection) implementation: validate the study is
RUN-type and user has permission, begin a transaction on the provided
connection, upsert entries into the study_runs linking table (or equivalent) for
each run_id while avoiding duplicates, update study metadata (e.g., modified_at)
as needed, and commit/rollback on error; remove the NotImplementedError so the
router call for RUN studies succeeds.

In `@src/database/users.py`:
- Around line 56-59: User.fetch currently treats falsy user_id values (e.g., 0)
as missing by using a truthiness check; change the guard to an explicit None
check: call get_user_id_for(api_key=api_key, connection=user_db), assign to
user_id, and return cls(user_id, _database=user_db) only if user_id is not None
(otherwise return None). This affects the async classmethod User.fetch and
references get_user_id_for, APIKey, AsyncConnection and the cls constructor.

In `@src/routers/dependencies.py`:
- Around line 12-23: The current expdb_connection and userdb_connection
generators always call await connection.commit() after yield, which will commit
even if the downstream endpoint raised an exception; change each generator to
use SQLAlchemy's transactional context (e.g., use engine.begin() / async with
engine.begin() as connection) so that the transaction is committed only on
successful exit and automatically rolled back on exceptions, or alternatively
wrap the yield in try/except/finally and call await connection.rollback() when
an exception is caught; update expdb_connection and userdb_connection
accordingly to ensure rollback on errors instead of unconditional commit.
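The commit-on-exception problem can be shown with a fake connection; the try/except/else shape below mirrors what an async-generator dependency would need (names are hypothetical, and no SQLAlchemy is required to run it):

```python
import asyncio

class FakeConnection:
    def __init__(self) -> None:
        self.committed = False
        self.rolled_back = False
    async def commit(self) -> None:
        self.committed = True
    async def rollback(self) -> None:
        self.rolled_back = True

async def expdb_connection(conn: FakeConnection):
    # Dependency shape: commit only if the endpoint body succeeded.
    try:
        yield conn
    except Exception:
        await conn.rollback()
        raise
    else:
        await conn.commit()

async def run_request(fail: bool) -> FakeConnection:
    conn = FakeConnection()
    gen = expdb_connection(conn)
    await gen.__anext__()               # enter: hand the connection out
    try:
        if fail:
            # Simulate the endpoint raising mid-request.
            await gen.athrow(RuntimeError("boom"))
        else:
            await gen.__anext__()       # normal exit -> StopAsyncIteration
    except (RuntimeError, StopAsyncIteration):
        pass
    return conn

ok = asyncio.run(run_request(fail=False))
bad = asyncio.run(run_request(fail=True))
print(ok.committed, ok.rolled_back)     # True False
print(bad.committed, bad.rolled_back)   # False True
```

`async with engine.begin() as connection:` gives the same commit-on-success, rollback-on-error behavior without hand-written exception handling.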

In `@src/routers/mldcat_ap/dataset.py`:
- Around line 143-152: get_distribution_quality currently calls quality = next(q
for q in qualities if q.name == quality_name) which will raise StopIteration for
unknown quality_name and produce a 500; change this to use a default (e.g.
next((q for q in qualities if q.name == quality_name), None)) and if the result
is None raise a controlled API error (e.g. raise HTTPException(status_code=404,
detail=f"Quality '{quality_name}' not found for distribution
{distribution_id}")) so callers get a proper 4xx response; update references to
get_qualities, quality_name, and the quality variable accordingly.

In `@src/routers/openml/flows.py`:
- Around line 61-67: The list comprehension for subflows uses an await which is
invalid; replace it with an explicit async loop that iterates over subflow_rows,
awaits get_flow for each subflow.child_id, constructs a
Subflow(identifier=subflow.identifier, flow=flow) and appends it to subflows (or
builds the list incrementally) so all awaits occur inside the async loop; refer
to the symbols subflows, Subflow, get_flow, and subflow_rows to locate and
update the code.
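The explicit-loop shape the comment asks for can be sketched standalone; Subflow, Row, and get_flow below are faked so the example runs without the project:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Subflow:
    identifier: str
    flow: dict

@dataclass
class Row:
    identifier: str
    child_id: int

async def get_flow(flow_id: int) -> dict:
    # Stand-in for the real async lookup.
    return {"id": flow_id}

async def resolve_subflows(subflow_rows: list[Row]) -> list[Subflow]:
    subflows: list[Subflow] = []
    for subflow in subflow_rows:
        flow = await get_flow(subflow.child_id)   # await inside the loop
        subflows.append(Subflow(identifier=subflow.identifier, flow=flow))
    return subflows

rows = [Row("a", 1), Row("b", 2)]
result = asyncio.run(resolve_subflows(rows))
print([(s.identifier, s.flow["id"]) for s in result])  # [('a', 1), ('b', 2)]
```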

In `@src/routers/openml/tasks.py`:
- Around line 142-144: The code calling next(rows) on result.mappings() can
raise StopIteration when the query returns no rows (e.g., task_inputs[table] ID
missing); update the block around rows = result.mappings() in the function
handling task inputs to safely handle empty results by checking for a row before
using it (for example use next(rows, None) or iterate and test for None/empty)
and only set fetched_data[f"{table}.{column}"] when a row exists; ensure you
keep the variable names rows, fetched_data, task_inputs and table intact so the
change is localized and clear.
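The StopIteration guard in isolation, with a plain iterator standing in for result.mappings():

```python
rows = iter([])                      # query returned no rows

# next(rows) would raise StopIteration here; a default avoids that.
row = next(rows, None)

fetched_data: dict[str, str] = {}
if row is not None:
    fetched_data["task.source_data"] = row["value"]

print(fetched_data)  # {}

rows = iter([{"value": "42"}])       # query returned one row
row = next(rows, None)
if row is not None:
    fetched_data["task.source_data"] = row["value"]
print(fetched_data)  # {'task.source_data': '42'}
```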

In `@tests/conftest.py`:
- Around line 51-64: The py_api fixture should yield a context-managed
TestClient so the FastAPI lifespan handler runs deterministically: instead of
returning TestClient(app) directly, create the app with create_api(), set
app.dependency_overrides for expdb_connection and userdb_connection using the
async override_expdb and override_userdb functions, then instantiate TestClient
inside a with block (with TestClient(app) as client:) and yield client so
startup/shutdown (including create_api()'s lifespan that calls
close_databases()) is executed; update the py_api fixture to use this
context-managed pattern around TestClient.

In `@tests/routers/openml/flows_test.py`:
- Around line 45-51: The test patches database.flows.get_by_name with a regular
MagicMock which doesn't support awaiting; update the patch in the test for
flow_exists to use an async-compatible mock (e.g.,
mocker.AsyncMock(return_value=fake_flow)) or otherwise ensure the mocked
get_by_name is awaitable so awaiting inside flow_exists works; keep the
fake_flow = mocker.MagicMock(id=flow_id) and replace the mocker.patch call that
targets "database.flows.get_by_name" with an AsyncMock variant.
- Around line 54-61: The test is patching database.flows.get_by_name
synchronously but flow_exists awaits it; change the patch to provide an async
mock (e.g., use new_callable=AsyncMock or new=AsyncMock(return_value=None)) so
get_by_name is awaitable and returns None; update the mocker.patch call that
targets "database.flows.get_by_name" in test_flow_exists_handles_flow_not_found
to use AsyncMock, then run the test to assert HTTPException with
HTTPStatus.NOT_FOUND and detail "Flow not found." as before.

In `@tests/routers/openml/users_test.py`:
- Around line 17-21: The assertion in test_fetch_user is a false positive
because user._groups and db_user._groups can both be None; update the test to
await the async get_groups() on both user and db_user (or compare the awaited
results) before asserting equality so group membership is actually populated and
checked—modify test_fetch_user to call await user.get_groups() and await
db_user.get_groups() (or compare await user.get_groups() to await
db_user.get_groups()) prior to asserting on _groups.
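Why the original assertion is a false positive: two lazily populated caches that are both still None compare equal, so the test passes without fetching anything. A toy User illustrates (names and caching behavior here are hypothetical):

```python
import asyncio

class User:
    def __init__(self, groups: list[int]) -> None:
        self._groups: list[int] | None = None   # lazy cache, starts empty
        self._db_groups = groups

    async def get_groups(self) -> list[int]:
        if self._groups is None:
            self._groups = list(self._db_groups)  # simulated DB fetch
        return self._groups

user = User([1, 2])
db_user = User([1, 3])                 # actually has different groups!

# False positive: both caches are None, so this passes vacuously.
print(user._groups == db_user._groups)                    # True

# Exercising the public API exposes the difference.
groups_equal = asyncio.run(user.get_groups()) == asyncio.run(db_user.get_groups())
print(groups_equal)                                       # False
```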

---

Outside diff comments:
In `@src/routers/mldcat_ap/dataset.py`:
- Around line 170-183: In get_distribution_feature, validate feature_no against
the features list before indexing: check that feature_no is an integer >= 0 and
< len(features) (use the local features variable returned by
get_dataset_features) and if it is out of range raise a controlled client error
(e.g., raise fastapi.HTTPException with a 400/404 and a clear message) so
negative values cannot trigger reverse indexing and IndexError is avoided;
perform this bounds check immediately before using features[feature_no] in
get_distribution_feature.

In `@tests/routers/openml/datasets_test.py`:
- Around line 79-91: The test test_private_dataset_no_access is passing
user_db=None which triggers the assert user_db is not None in get_dataset before
access logic runs; update the test to supply a valid user_db (e.g., use the
user_test fixture or the existing user fixture) so that get_dataset's access
checks execute and the HTTPException FORBIDDEN is raised—locate
test_private_dataset_no_access and change the call to get_dataset to pass a
populated user_db instead of None.

In `@tests/routers/openml/flows_test.py`:
- Around line 27-33: The tests call the async function flow_exists which awaits
database.flows.get_by_name, but the tests currently patch get_by_name with a
regular MagicMock; change the patches in all three test functions to use an
AsyncMock so the awaited call works: replace
mocker.patch("database.flows.get_by_name") with a patch that supplies an
AsyncMock (e.g., mocker.patch("database.flows.get_by_name", new=AsyncMock()) or
mocker.patch("database.flows.get_by_name", new_callable=AsyncMock)) and keep the
same assertions against the mocked_db returned by the patch in each test that
references mocked_db, flow_exists, and database.flows.get_by_name.

---

Nitpick comments:
In `@pyproject.toml`:
- Around line 20-21: Remove the unused synchronous MySQL driver by deleting the
"mysqlclient" entry from the dependencies list in pyproject.toml; since the code
uses async DB access via aiomysql and sqlalchemy.ext.asyncio, ensure only
"aiomysql" remains, then update the project lock (e.g., poetry lock / poetry
update) to reflect the removed dependency.

In `@src/database/users.py`:
- Around line 61-65: The get_groups method is re-wrapping values returned by
get_user_groups_for into new UserGroup instances unnecessarily; update
User.get_groups to assign the returned list directly to self._groups when
get_user_groups_for already returns UserGroup objects (or otherwise only wrap
when items are plain IDs). Locate the method get_groups and replace the
unconditional list comprehension [UserGroup(group_id) for group_id in groups]
with logic that detects/assumes the returned items' type (e.g., if
isinstance(items[0], UserGroup) then self._groups = groups else map to
UserGroup), preserving the existing caching behavior using self._groups,
self.user_id, and self._database.
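As a hedged sketch of the suggested type check: `UserGroup` below is a simplified stand-in for the project's class, and `normalize_groups` is a hypothetical helper extracted from `get_groups` for illustration (the real method would assign the result to `self._groups` for caching):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserGroup:
    group_id: int

def normalize_groups(groups: list) -> list[UserGroup]:
    """Wrap plain ids into UserGroup, but pass through existing UserGroup items."""
    if groups and isinstance(groups[0], UserGroup):
        return list(groups)
    return [UserGroup(group_id) for group_id in groups]

print(normalize_groups([1, 2]))          # wraps plain ids
print(normalize_groups([UserGroup(3)]))  # passes UserGroup items through unchanged
```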

In `@src/routers/mldcat_ap/dataset.py`:
- Around line 43-47: The handlers currently declare dependencies as optional
(Annotated[AsyncConnection | None, Depends(...)] = None) and then use assert to
check non-None; change the dependency parameters to be required by removing the
"| None" and the "= None" default (e.g. use Annotated[AsyncConnection,
Depends(userdb_connection)] and Annotated[AsyncConnection,
Depends(expdb_connection)]), and remove the corresponding assert statements
(e.g. the assert user_db is not None / assert expdb is not None checks). Apply
the same change to the other affected handlers referenced (the ones using
userdb_connection and expdb_connection at the other locations) so all DB deps
are required and no asserts are used for request validation.

In `@src/routers/openml/datasets.py`:
- Around line 37-40: The handlers currently annotate the DB dependency as
optional (e.g., expdb_db: Annotated[AsyncConnection | None,
Depends(expdb_connection)] = None) and then use assert to narrow it; change
those parameters to required dependency types by removing the None union and
default (e.g., use Annotated[AsyncConnection, Depends(expdb_connection)]) and
remove the subsequent assert checks; apply this to all affected handlers that
reference expdb_db (and similar params) so callers receive a guaranteed
AsyncConnection and the code can call database methods like
database.datasets.get_tags_for without runtime asserts.

In `@src/routers/openml/tasktype.py`:
- Around line 29-40: The list_task_types endpoint declares expdb as
Annotated[AsyncConnection | None, Depends(expdb_connection)] and then asserts
it's not None; change the parameter to Annotated[AsyncConnection,
Depends(expdb_connection)] (make expdb non-optional to match get_task_type and
get_task_types) and remove the runtime assert; confirm _normalize_task_type and
await get_task_types(expdb) still use the non-optional AsyncConnection signature
and update any related type hints if needed.

In `@tests/routers/openml/study_test.py`:
- Around line 547-576: Remove the redundant explicit DB update calls before
calling _attach_tasks_to_study in both
test_attach_task_to_study_already_linked_raises and
test_attach_task_to_study_but_task_not_exist_raises: since
_attach_tasks_to_study already performs the "UPDATE study SET status =
'in_preparation' WHERE id = 1" step, delete the await expdb_test.execute(...)
lines in those two tests so the setup only happens inside _attach_tasks_to_study
and avoid duplicate state changes.
- Around line 533-544: The test test_attach_task_to_study_needs_owner contains a
redundant manual UPDATE via expdb_test.execute that duplicates the state setup
already performed by the helper _attach_tasks_to_study; remove the explicit
expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id =
1")) call so the test relies on the helper to set the study status to
'in_preparation' before asserting the forbidden response, leaving the rest of
the call to _attach_tasks_to_study and the HTTPStatus.FORBIDDEN assertion
unchanged.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97557d2 and f0c6580.

📒 Files selected for processing (29)
  • pyproject.toml
  • src/core/access.py
  • src/database/datasets.py
  • src/database/evaluations.py
  • src/database/flows.py
  • src/database/qualities.py
  • src/database/setup.py
  • src/database/studies.py
  • src/database/tasks.py
  • src/database/users.py
  • src/main.py
  • src/routers/dependencies.py
  • src/routers/mldcat_ap/dataset.py
  • src/routers/openml/datasets.py
  • src/routers/openml/estimation_procedure.py
  • src/routers/openml/evaluations.py
  • src/routers/openml/flows.py
  • src/routers/openml/qualities.py
  • src/routers/openml/study.py
  • src/routers/openml/tasks.py
  • src/routers/openml/tasktype.py
  • tests/conftest.py
  • tests/database/flows_test.py
  • tests/routers/openml/dataset_tag_test.py
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/flows_test.py
  • tests/routers/openml/qualities_test.py
  • tests/routers/openml/study_test.py
  • tests/routers/openml/users_test.py

Signed-off-by: rohansen856 <rohansen856@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/routers/mldcat_ap/dataset.py (1)

182-187: ⚠️ Potential issue | 🟡 Minor

Handle out-of-bounds feature_no without raising IndexError.

Line 187 directly indexes features[feature_no] which will raise IndexError and return a 500 if the index is invalid. This should return a controlled 404 response, similar to how get_distribution_quality handles missing qualities.

Proposed fix
     features = await get_dataset_features(
         dataset_id=distribution_id,
         user=user,
         expdb=expdb,
     )
-    feature = features[feature_no]
+    if feature_no < 0 or feature_no >= len(features):
+        raise HTTPException(
+            status_code=404,
+            detail=f"Feature {feature_no} not found for distribution {distribution_id}.",
+        )
+    feature = features[feature_no]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/mldcat_ap/dataset.py` around lines 182 - 187, The code directly
indexes features[feature_no] which can raise IndexError; update the handler
after calling get_dataset_features to check that 0 <= feature_no < len(features)
and if not raise an HTTPException with status_code=404 (or return the same
controlled 404 response pattern used by get_distribution_quality) including a
clear detail message, otherwise proceed to assign feature =
features[feature_no]; reference get_dataset_features, features, feature_no, and
mirror the error/response shape used in get_distribution_quality.
src/database/studies.py (1)

157-159: ⚠️ Potential issue | 🟠 Major

Harden DB error parsing; single-value unpack of e.args is brittle.

Lines 158 and 189 assume exactly one arg via (msg,) = e.args. With SQLAlchemy-wrapped DB errors from aiomysql, this assumption breaks: SQLAlchemy's IntegrityError.args is not standardized and may not be a 1-tuple, causing a secondary ValueError that masks your intended error handling for duplicate entries and foreign key violations.

Catch IntegrityError specifically and extract the message from the wrapped driver exception:

Proposed fix
+from sqlalchemy.exc import IntegrityError
@@
-    except Exception as e:
-        (msg,) = e.args
+    except IntegrityError as e:
+        msg = str(getattr(e, "orig", e))
@@
-    except Exception as e:
-        (msg,) = e.args
+    except IntegrityError as e:
+        msg = str(getattr(e, "orig", e))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/studies.py` around lines 157 - 159, The except block that
currently does "except Exception as e" and unpacks "(msg,) = e.args" is brittle;
change to catch SQLAlchemy's IntegrityError (from sqlalchemy.exc import
IntegrityError), then extract the DB message robustly (e.g., inspect e.orig or
fallback to str(e)) instead of assuming a single-element e.args; use that
extracted msg when applying the re.search(r"Duplicate entry '(\d+)-(\d+)' for
key 'task_study.PRIMARY'") check and for the foreign-key violation handling so
the duplicate-entry and FK branches hit reliably without causing a secondary
ValueError.
🧹 Nitpick comments (2)
src/routers/openml/study.py (1)

57-59: Make expdb dependency non-null in the signature and drop assert guards.

Line 57, Line 99, and Line 142 already type expdb as AsyncConnection, but default to None and then assert non-null (Line 59, Line 101, Line 144). It is clearer and safer to require the dependency directly in the function signature.

♻️ Suggested cleanup
 async def attach_to_study(
@@
-    expdb: Annotated[AsyncConnection, Depends(expdb_connection)] = None,
+    expdb: Annotated[AsyncConnection, Depends(expdb_connection)],
 ) -> AttachDetachResponse:
-    assert expdb is not None  # noqa: S101

 async def create_study(
@@
-    expdb: Annotated[AsyncConnection, Depends(expdb_connection)] = None,
+    expdb: Annotated[AsyncConnection, Depends(expdb_connection)],
 ) -> dict[Literal["study_id"], int]:
-    assert expdb is not None  # noqa: S101

 async def get_study(
@@
-    expdb: Annotated[AsyncConnection, Depends(expdb_connection)] = None,
+    expdb: Annotated[AsyncConnection, Depends(expdb_connection)],
 ) -> Study:
-    assert expdb is not None  # noqa: S101

Also applies to: 99-101, 142-144

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/study.py` around lines 57 - 59, The expdb dependency
parameter should be made required in the function signatures instead of
defaulting to None and asserting non-null; change the parameter declaration from
"expdb: Annotated[AsyncConnection, Depends(expdb_connection)] = None" to "expdb:
Annotated[AsyncConnection, Depends(expdb_connection)]" in each affected function
and remove the subsequent "assert expdb is not None" guard; apply this change
for the three occurrences that declare the expdb parameter and assert it (the
parameter name expdb, the type Annotated[AsyncConnection,
Depends(expdb_connection)], and the assert lines).
src/routers/openml/datasets.py (1)

37-40: Remove non-optional dependency defaults and runtime asserts.

Lines 37–40, 105–108, 296–299, and 396–400 use = None defaults with assert ... is not None checks for AsyncConnection dependencies. Since assert statements are stripped when Python runs with the -O flag, this pattern should be replaced by making the dependency non-optional in the signature (remove = None), which eliminates the need for the runtime assert.

The pattern is already correctly applied elsewhere (e.g., line 336), demonstrating the preferred approach.

Pattern to apply
 async def tag_dataset(
     data_id: Annotated[int, Body()],
     tag: Annotated[str, SystemString64],
     user: Annotated[User | None, Depends(fetch_user)] = None,
-    expdb_db: Annotated[AsyncConnection, Depends(expdb_connection)] = None,
+    expdb_db: Annotated[AsyncConnection, Depends(expdb_connection)],
 ) -> dict[str, dict[str, Any]]:
-    assert expdb_db is not None  # noqa: S101
     tags = await database.datasets.get_tags_for(data_id, expdb_db)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 37 - 40, The code uses
Optional-style dependency defaults (expdb_db: Annotated[AsyncConnection,
Depends(expdb_connection)] = None) paired with runtime asserts (assert expdb_db
is not None); remove the "= None" default from the AsyncConnection parameters
(e.g., expdb_db) in each affected function signature and delete the
corresponding "assert ... is not None" checks so the dependency is required at
call time (match the non-optional pattern already used elsewhere in this file);
ensure calls like database.datasets.get_tags_for(data_id, expdb_db) continue to
use the parameter directly.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0c6580 and cf5fc3f.

📒 Files selected for processing (13)
  • src/database/setup.py
  • src/database/studies.py
  • src/database/users.py
  • src/routers/dependencies.py
  • src/routers/mldcat_ap/dataset.py
  • src/routers/openml/datasets.py
  • src/routers/openml/flows.py
  • src/routers/openml/study.py
  • src/routers/openml/tasks.py
  • src/routers/openml/tasktype.py
  • tests/conftest.py
  • tests/routers/openml/flows_test.py
  • tests/routers/openml/users_test.py

Comment on lines +117 to 123
if study.alias and await database.studies.get_by_alias(study.alias, expdb):
raise HTTPException(
status_code=HTTPStatus.CONFLICT,
detail="Study alias already exists.",
)
study_id = database.studies.create(study, user, expdb)
study_id = await database.studies.create(study, user, expdb)
if study.main_entity_type == StudyType.TASK:

⚠️ Potential issue | 🟠 Major

Alias uniqueness check is vulnerable to a race condition.

At Line 117 you do a pre-check, then create at Line 122. Two concurrent requests can pass the pre-check and race. Enforce uniqueness atomically at create time and map duplicate-alias failures to 409 CONFLICT.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/study.py` around lines 117 - 123, The pre-check using
study.alias and database.studies.get_by_alias is racy; remove reliance on it and
make create() enforce uniqueness atomically at the persistence layer, then catch
duplicate-alias errors from database.studies.create and raise
HTTPException(status_code=HTTPStatus.CONFLICT, detail="Study alias already
exists.") when the DB reports a unique-key/duplicate constraint violation; you
can still keep an optional pre-check for fast-fail, but ensure
database.studies.create is the authoritative enforcement and map its
unique-constraint error to the 409 response.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/routers/openml/datasets_test.py (1)

79-89: ⚠️ Potential issue | 🟠 Major

test_private_dataset_no_access passes user_db=None, which violates get_dataset preconditions.

get_dataset asserts user_db is not None (src/routers/openml/datasets.py, Line [396]), so this test can fail with AssertionError before the expected HTTPException.

💡 Proposed fix
 async def test_private_dataset_no_access(
     user: User | None,
     expdb_test: AsyncConnection,
+    user_test: AsyncConnection,
 ) -> None:
     with pytest.raises(HTTPException) as e:
         await get_dataset(
             dataset_id=130,
             user=user,
-            user_db=None,
+            user_db=user_test,
             expdb_db=expdb_test,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/datasets_test.py` around lines 79 - 89, The test
test_private_dataset_no_access currently passes user_db=None which violates the
precondition in get_dataset (it asserts user_db is not None), so update the test
to supply a valid/mocked AsyncConnection for user_db (or a fixture) instead of
None so the function reaches its HTTPException path; locate the call to
get_dataset in the test and replace user_db=None with a proper user_db fixture
or mock AsyncConnection that simulates no access.
🧹 Nitpick comments (2)
src/database/users.py (1)

30-42: Consider renaming row to result for clarity.

The variable row holds a CursorResult from execute(), not an actual row. This naming pattern is repeated in get_user_groups_for as well. Using result would better communicate the type.

♻️ Proposed refactor for clarity
 async def get_user_id_for(*, api_key: APIKey, connection: AsyncConnection) -> int | None:
-    row = await connection.execute(
+    result = await connection.execute(
         text(
             """
     SELECT *
     FROM users
     WHERE session_hash = :api_key
     """,
         ),
         parameters={"api_key": api_key},
     )
-    user = row.one_or_none()
+    user = result.one_or_none()
     return user.id if user else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/users.py` around lines 30 - 42, The variable named `row` in
get_user_id_for (and the same pattern in get_user_groups_for) is misleading
because it holds a CursorResult from connection.execute(); rename `row` to
`result` in both functions, update any subsequent uses (e.g., change
`row.one_or_none()` to `result.one_or_none()` and `row.fetchall()` to
`result.fetchall()` as appropriate) so the name reflects the type and avoids
confusion when reading or modifying get_user_id_for and get_user_groups_for.
tests/routers/openml/study_test.py (1)

534-569: Remove duplicated study-status setup in attach tests.

Lines [537], [552], and [568] repeat the same DB update already done in _attach_tasks_to_study (Line [515]). Keep this setup in one place to reduce noise.

♻️ Proposed simplification
 async def test_attach_task_to_study_needs_owner(
     py_api: TestClient, expdb_test: AsyncConnection
 ) -> None:
-    await expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
     response = await _attach_tasks_to_study(
         study_id=1,
         task_ids=[2, 3, 4],
         api_key=ApiKey.OWNER_USER,
         py_api=py_api,
         expdb_test=expdb_test,
@@
 async def test_attach_task_to_study_already_linked_raises(
     py_api: TestClient,
     expdb_test: AsyncConnection,
 ) -> None:
-    await expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
     response = await _attach_tasks_to_study(
         study_id=1,
         task_ids=[1, 3, 4],
         api_key=ApiKey.ADMIN,
         py_api=py_api,
         expdb_test=expdb_test,
@@
 async def test_attach_task_to_study_but_task_not_exist_raises(
     py_api: TestClient,
     expdb_test: AsyncConnection,
 ) -> None:
-    await expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
     response = await _attach_tasks_to_study(
         study_id=1,
         task_ids=[80123, 78914],
         api_key=ApiKey.ADMIN,
         py_api=py_api,
         expdb_test=expdb_test,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/study_test.py` around lines 534 - 569, The tests
test_attach_task_to_study_needs_owner,
test_attach_task_to_study_already_linked_raises, and
test_attach_task_to_study_but_task_not_exist_raises each repeat the UPDATE study
SET status = 'in_preparation' setup that is already performed inside the helper
_attach_tasks_to_study; remove those duplicate expdb_test.execute(...) calls
from those test functions so the DB setup remains only in
_attach_tasks_to_study, leaving the tests to call _attach_tasks_to_study with
the same parameters and assertions unchanged.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf5fc3f and c5dd79a.

📒 Files selected for processing (3)
  • src/database/users.py
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/study_test.py

Comment on lines +506 to 519
async def _attach_tasks_to_study(
study_id: int,
task_ids: list[int],
api_key: str,
py_api: TestClient,
expdb_test: Connection,
expdb_test: AsyncConnection,
) -> httpx.Response:
# Adding requires the study to be in preparation,
# but the current snapshot has no in-preparation studies.
expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
await expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
return py_api.post(
f"/studies/attach?api_key={api_key}",
json={"study_id": study_id, "entity_ids": task_ids},
)

⚠️ Potential issue | 🟡 Minor

Use study_id in the setup update instead of hardcoded id = 1.

Line [515] ignores the helper argument and always updates study 1, which makes the helper incorrect for any other input.

💡 Proposed fix
 async def _attach_tasks_to_study(
     study_id: int,
     task_ids: list[int],
     api_key: str,
     py_api: TestClient,
     expdb_test: AsyncConnection,
 ) -> httpx.Response:
     # Adding requires the study to be in preparation,
     # but the current snapshot has no in-preparation studies.
-    await expdb_test.execute(text("UPDATE study SET status = 'in_preparation' WHERE id = 1"))
+    await expdb_test.execute(
+        text("UPDATE study SET status = 'in_preparation' WHERE id = :study_id"),
+        {"study_id": study_id},
+    )
     return py_api.post(
         f"/studies/attach?api_key={api_key}",
         json={"study_id": study_id, "entity_ids": task_ids},
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/study_test.py` around lines 506 - 519, The helper
_attach_tasks_to_study currently hardcodes the DB update to WHERE id = 1; change
the UPDATE to use the study_id argument by executing a parameterized query (e.g.
text("UPDATE study SET status = 'in_preparation' WHERE id = :study_id")) and
pass {"study_id": study_id} to expdb_test.execute so the setup updates the
correct study id; update the call site in _attach_tasks_to_study that invokes
expdb_test.execute (and keep using the existing study_id parameter) to avoid SQL
injection and ensure the helper works for any input.
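The parameterized style suggested above can be tried out with stdlib sqlite3, which accepts the same `:name` placeholder syntax as SQLAlchemy's `text()`; the table contents below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE study (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO study (id, status) VALUES (:id, :status)",
    [{"id": 1, "status": "active"}, {"id": 2, "status": "active"}],
)

study_id = 2
# The bound parameter replaces the hardcoded "WHERE id = 1",
# so the helper updates whichever study the caller names.
conn.execute(
    "UPDATE study SET status = 'in_preparation' WHERE id = :study_id",
    {"study_id": study_id},
)

print(conn.execute("SELECT status FROM study ORDER BY id").fetchall())
# [('active',), ('in_preparation',)]
```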

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/routers/openml/datasets_test.py`:
- Around line 5-6: The test references the undefined type Connection and is
synchronous while calling async fixtures: replace the type annotation Connection
with the imported AsyncConnection, change the test function signature to async
def (making it an async test), and await any asynchronous fixture calls inside
the test (e.g., calls to async client/DB fixtures or methods) so all async
operations are properly awaited; update any function/method annotations and
usages that mention Connection to AsyncConnection to avoid the NameError.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5dd79a and 04b4b22.

📒 Files selected for processing (3)
  • src/database/datasets.py
  • src/main.py
  • tests/routers/openml/datasets_test.py

Signed-off-by: rohansen856 <rohansen856@gmail.com>
@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/routers/openml/datasets_test.py (1)

278-293: ⚠️ Potential issue | 🟡 Minor

Add explicit commit before the py_api request to ensure inserted rows are visible.

Both py_api and expdb_test use the same AsyncConnection object (the py_api fixture depends on expdb_test and yields it via dependency_overrides). However, the connection is wrapped in automatic_rollback, which starts a transaction but does not commit. While uncommitted changes within a single transaction are visible to subsequent queries on the same connection, the test should explicitly call await expdb_test.commit() after the inserts and before py_api.get(...) to match the pattern used elsewhere in the codebase (e.g., persisted_flow fixture) and ensure the regression fix is robustly tested:

Suggested fix
async def test_dataset_no_500_with_multiple_processing_entries(
    py_api: TestClient,
    expdb_test: AsyncConnection,
) -> None:
    """Regression test for issue `#145`: multiple processing entries caused 500."""
    await expdb_test.execute(
        text("INSERT INTO evaluation_engine(id, name, description) VALUES (99, 'test_engine', '')"),
    )
    await expdb_test.execute(
        text(
            "INSERT INTO data_processed(did, evaluation_engine_id, user_id, processing_date) "
            "VALUES (1, 99, 2, '2020-01-01 00:00:00')",
        ),
    )
    await expdb_test.commit()
    response = py_api.get("/datasets/1")
    assert response.status_code == HTTPStatus.OK
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/datasets_test.py` around lines 278 - 293, The test
function test_dataset_no_500_with_multiple_processing_entries inserts rows using
the AsyncConnection fixture expdb_test but does not commit them before calling
py_api.get("/datasets/1"); add an explicit await expdb_test.commit() after the
two await expdb_test.execute(...) calls and before the py_api.get(...) call
(mirroring the persisted_flow pattern) so the inserted rows are durably visible
to the request.
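The visibility concern matters most when the writer and reader are different connections; two stdlib sqlite3 connections stand in here for the test fixture and a hypothetical separate request-handling connection (MySQL's default isolation behaves similarly across connections):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: rows inserted on one
# connection become visible to the other only after commit().
path = os.path.join(tempfile.mkdtemp(), "test.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)
writer.execute("CREATE TABLE evaluation_engine (id INTEGER, name TEXT)")
writer.commit()

writer.execute("INSERT INTO evaluation_engine VALUES (99, 'test_engine')")
before = reader.execute("SELECT COUNT(*) FROM evaluation_engine").fetchone()[0]
writer.commit()
after = reader.execute("SELECT COUNT(*) FROM evaluation_engine").fetchone()[0]
print(before, after)  # 0 1
```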

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04b4b22 and d4f33a6.

📒 Files selected for processing (1)
  • tests/routers/openml/datasets_test.py
