**docs/contributing/tests.md** (119 additions, 93 deletions)
# Writing Tests

tl;dr:
- Setting up the `py_api` fixture to test directly against a REST API endpoint is very slow; use it only for migration/integration tests.
- Getting a database fixture and making a database call is slow; consider mocking where appropriate.

## Overhead from Fixtures
Sometimes, you want to interact with the REST API through the `py_api` fixture,
or want access to a database with `user_test` or `expdb_test` fixtures.
Be warned that these come with considerable relative overhead, which adds up when running thousands of tests.

```python
@pytest.mark.parametrize('execution_number', range(5000))
def test_private_dataset_owner_access(
    execution_number,
    expdb_test: Connection,
    user_test: Connection,
    py_api: TestClient,
) -> None:
    fetch_user(ApiKey.REGULAR_USER, user_test)  # accesses only the user db
    get_estimation_procedures(expdb_test)  # accesses only the experiment db
    py_api.get("/does/not/exist")  # only queries the api
```
The rest of this page documents the current testing strategy in this repository.
It is intentionally descriptive: it explains which test layers exist today and when each layer is used.

## Quick summary

- Use the lightest test layer that verifies the behavior you are changing.
- `py_api` (`fastapi.testclient.TestClient`) is intentionally used for integration and migration checks.
- Direct database tests verify SQL/database behavior.
- Direct function tests verify application logic with minimal fixture overhead.
- Mocking is used selectively to keep tests fast, while still validating real database behavior in dedicated tests.

## Test infrastructure in this repository

The core fixtures are defined in `tests/conftest.py`:

- `expdb_test` and `user_test` provide transactional database connections.
- `py_api` creates a FastAPI `TestClient` and overrides dependencies to use those transactional connections.
- `php_api` provides an HTTP client to the legacy PHP API for migration comparisons.

The transactional fixtures use rollback semantics, so most tests can mutate data without persisting changes.
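
The rollback idea can be sketched as follows, using an in-memory SQLite database and illustrative names; the real fixtures in `tests/conftest.py` connect to the actual test databases:

```python
import sqlite3

def run_with_rollback(test_body):
    """Run test_body against a fresh connection, then roll its changes back.

    Minimal stand-in for the transactional fixtures described above.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()  # the committed baseline survives the rollback
    try:
        test_body(conn)
    finally:
        conn.rollback()  # mutations made by the test are discarded
    return conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]

rows = run_with_rollback(
    lambda conn: conn.execute("INSERT INTO dataset (name) VALUES ('iris')")
)
print(rows)  # 0: the inserted row was rolled back
```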

## Test categories

### 1) Migration tests

Migration tests compare Python API responses against the legacy PHP API for equivalent endpoints.
These tests live under `tests/routers/openml/migration/`.

Characteristics:

- Use both `py_api` and `php_api` fixtures.
- Compare response status and response body (with explicit normalization where old/new formats differ).
- Focus on compatibility guarantees during migration.

Typical examples include dataset, flow, task, study, and evaluation migration checks.

### 2) Integration tests (FastAPI TestClient)

Integration tests call Python API endpoints through `py_api` and assert end-to-end behavior from routing to serialization.
Most endpoint-focused tests under `tests/routers/openml/` use this style.

Characteristics:

- Exercise request/response handling via HTTP calls to the in-process FastAPI app.
- Use real dependency wiring (with test database connections injected via fixture overrides).
- Validate returned status codes and payloads as clients see them.

This layer is broader than direct function/database tests, but also has higher execution cost.

### 3) Direct database tests

Direct database tests call functions in `src/database/*` with `expdb_test`/`user_test` connections.
Examples are in `tests/database/`.

Characteristics:

- Focus on query behavior and returned records.
- Avoid HTTP/TestClient overhead.
- Validate persistence-layer behavior directly against the test database.

Use this layer when the change is primarily in SQL access or data retrieval logic.

### 4) Direct function tests

Direct function tests call router or dependency functions directly (without HTTP requests), often with lightweight fixtures and selective mocks.
Examples include tests that call functions such as `flow_exists(...)` or `get_dataset(...)` directly.

Characteristics:

- Validate function-level control flow and error handling.
- Can mock lower-level calls where appropriate.
- Keep runtime low compared with full TestClient tests.

These tests are useful for fast feedback on logic that does not require full HTTP-level verification.
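
A sketch of a direct function test that mocks the persistence layer; `flow_exists` here is a simplified stand-in, not the repository's actual implementation:

```python
from unittest.mock import MagicMock

def flow_exists(name: str, expdb) -> bool:
    # Simplified stand-in for a router-level helper.
    row = expdb.execute("SELECT id FROM flow WHERE name = ?", (name,)).fetchone()
    return row is not None

def test_flow_exists_without_database() -> None:
    expdb = MagicMock()
    expdb.execute.return_value.fetchone.return_value = (42,)  # pretend a row exists
    assert flow_exists("weka.J48", expdb) is True
    expdb.execute.return_value.fetchone.return_value = None  # pretend no row
    assert flow_exists("missing", expdb) is False
```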

## Performance tradeoffs

Fixture setup has measurable cost.
In existing measurements, creating `py_api` is significantly more expensive than direct function/database-level testing, and database fixtures also add overhead.

Practical implications:

- Prefer direct function or direct database tests when they can validate the behavior sufficiently.
- Reserve `py_api` usage for cases where endpoint-level integration behavior is the target.
- Keep migration tests focused, because they combine multiple expensive dependencies.

This keeps local feedback cycles fast while preserving endpoint and compatibility coverage where required.

## Design philosophy: limited mocking

Mocking is used to reduce runtime and isolate logic when full database interaction is not required.
At the same time, this repository keeps mocking limited by pairing it with real database coverage for the same entities/paths.

Why this balance is used:

- Mock-based tests are fast and targeted.
- Database-backed tests verify actual query/schema behavior.
- Together they reduce risk that mocked behavior diverges from real database behavior.

In short: mock for speed and focus, but keep real database tests for behavioral truth.

## Running tests

Run all tests (from the Python API container):

```bash
python -m pytest tests
```

## Fixture overhead measurements

When individually adding or removing components, we measure (for 5000 repeats, n=1):

| expdb | user | api | exp call | user call | api get | time (s) |
|-------|------|-----|----------|-----------|---------|----------:|
| ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 1.78 |
| ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | 3.45 |
| ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | 3.22 |
| ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | 298.48 |
| ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | 4.44 |
| ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | 285.69 |
| ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | 4.91 |
| ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | 5.81 |
| ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 307.91 |

Adding a fixture that just returns some value adds only minimal overhead (1.91s),
so the burden comes from establishing the database connection itself.

We make the following observations:

- Adding a database fixture adds about the same overhead as instantiating an entirely new test.
- The overhead of multiple database fixtures is less than additive, but still not free.
- The `py_api` fixture adds two orders of magnitude more overhead.

We want our tests to be fast, so we avoid these fixtures when we reasonably can.
We restrict the `py_api` fixture to integration/migration tests, since it is very slow; these tests only run on CI before merges.
For database fixtures, we will write lightweight alternatives that, for example, provide a `User` without accessing the database.
The validity of these users is then verified against the database in only a single test.

### Mocking
Mocking can help us reduce the reliance on database connections in tests.
A mocked function can prevent accessing the database and return a predefined value instead.

It has a few upsides:
- It's faster than using a database fixture (see below).
- The test is not dependent on the database: you can run the test without a database.

But it also has downsides:
- Behavior changes in the database, such as schema changes, are not automatically reflected in the tests.
- The database layer (e.g., the queries) is not actually tested.

In short, mocked behavior may drift from the real behavior when executed against a database.
For this reason, each mocked entity should be paired with a test that invokes the real database layer and verifies it returns the output the mock assumes.
This is additional overhead during development, but it should pay off in more granular test feedback and faster tests.
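
One way to keep a mock honest is to define the expected record once and assert, in a single database-backed test, that the real query produces it. A sketch (where `Dataset` and `get` are simplified stand-ins for accessors like `database.datasets.get`):

```python
import sqlite3
from typing import NamedTuple

class Dataset(NamedTuple):
    uploader: int
    visibility: str

# The value fast tests would use as mock.return_value.
MOCKED_DATASET = Dataset(uploader=1, visibility="private")

def get(dataset_id: int, expdb) -> Dataset:
    # Simplified stand-in for the real database-layer accessor.
    row = expdb.execute(
        "SELECT uploader, visibility FROM dataset WHERE id = ?", (dataset_id,)
    ).fetchone()
    return Dataset(*row)

def test_mock_matches_database() -> None:
    expdb = sqlite3.connect(":memory:")
    expdb.execute(
        "CREATE TABLE dataset (id INTEGER PRIMARY KEY, uploader INTEGER, visibility TEXT)"
    )
    expdb.execute("INSERT INTO dataset VALUES (1, 1, 'private')")
    # The one database-backed check that keeps the mocked value honest.
    assert get(1, expdb) == MOCKED_DATASET
```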

On the speed of mocks, consider these two tests:

```diff
 @pytest.mark.parametrize('execution_number', range(5000))
 def test_private_dataset_owner_access(
     execution_number,
     admin,
+    mocker,
-    expdb_test: Connection,
 ) -> None:
+    mock = mocker.patch('database.datasets.get')
+
+    class Dataset(NamedTuple):
+        uploader: int
+        visibility: Visibility
+
+    mock.return_value = Dataset(uploader=1, visibility=Visibility.PRIVATE)

     _get_dataset_raise_otherwise(
         dataset_id=1,
         user=admin,
-        expdb=expdb_test,
+        expdb=None,
     )
```

There is only a single database call in this test: it fetches a record on an indexed field and requires no joins.
Despite that call being very light, the database-backed test is roughly 45% slower than the mocked version (5.04s vs 3.50s).

Run a focused test module:

```bash
python -m pytest tests/routers/openml/datasets_test.py
```

Run by marker expression (example):

```bash
python -m pytest -m "not slow"
```

See `pyproject.toml` for current marker definitions (including `slow` and `mut`).
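
The marker registration has roughly this shape (illustrative only; check `pyproject.toml` for the authoritative definitions and descriptions):

```toml
[tool.pytest.ini_options]
markers = [
    "slow: expensive tests, deselect with -m 'not slow'",
    "mut: mutation tests",
]
```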
**mkdocs.yml** (3 additions, 1 deletion)

The accompanying `mkdocs.yml` change updates the navigation:

```yaml
nav:
  - OpenML Server: index.md
  - Getting Started: installation.md

  - Contributing:
      - contributing/index.md
      - Development: contributing/contributing.md
      - Tests: contributing/tests.md
      - Documentation: contributing/documentation.md
      - Project Overview: contributing/project_overview.md

  - Changes: migration.md
```