
Release v0.2.0 — API Infrastructure & Suggestion Sessions #1

Merged
JohnRDOrazio merged 118 commits into main from release/v0.2.0
Mar 9, 2026

Conversation

Contributor

@damienriehl damienriehl commented Mar 4, 2026

Summary

  • API infrastructure — lifespan handlers, JWT role extraction, full-text search, rate limiting, security headers, custom exceptions, input validation
  • Suggestion sessions — full backend for suggester workflow: session CRUD, beacon auth, auto-submit worker, Alembic migration
  • Bug fixes — HttpUrl crash fix in class tree sort, default role changed to suggester
  • 85 passing tests — auth, config, search, linter, ontology service, project routes, end-to-end workflow

Stats

| Metric | Value |
| --- | --- |
| Commits | 5 |
| Files changed | 31 |
| Insertions | +3,070 |
| Deletions | −86 |
| Tests added | 85 |

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Project analytics, semantic search/similarity, embedding management, suggestion sessions (create/save/submit/discard/beacon), and quality tools (cross-references, consistency checks, duplicate detection).
  • Improvements

    • Safer SPARQL handling and search, request IDs, security headers, rate limiting, role support with new "suggester" role, improved member ordering, and auto-embed on saves.
  • Infrastructure

    • pgvector support and DB migrations for analytics/embeddings/suggestions.
  • Version

    • Release bumped to 0.3.0
  • Tests

    • Extensive new unit and integration tests added.

damienriehl and others added 9 commits February 22, 2026 12:42
…leware, exceptions, input validation, and comprehensive test suite (85 tests)

- Implement startup/shutdown lifespan: DB verification, Redis pool, MinIO bucket init
- Extract Zitadel project roles from JWT claims with 1-hour JWKS cache TTL
- Implement full-text search via PostgreSQL tsvector/tsquery with pagination
- Implement SPARQL execution via RDFLib (SELECT/ASK/CONSTRUCT)
- Add slowapi rate limiting (100 req/min default)
- Add request ID, access logging, and security headers middleware
- Add custom exception types (NotFoundError, ValidationError, ConflictError, ForbiddenError)
  with global exception handlers returning consistent error schema
- Add IRI format validation and string length limits on schema fields
- Add 85 passing tests covering auth, config, search, linter, ontology service, routes,
  and integration workflows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
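The "1-hour JWKS cache TTL" mentioned above can be sketched as a simple time-based cache. This is a minimal illustration, not the PR's actual code: the `fetch` callable stands in for the real HTTP request to the Zitadel JWKS endpoint, and all names here are hypothetical.

```python
import time

JWKS_TTL_SECONDS = 3600  # 1-hour cache TTL, as described in the commit message

_jwks_cache: dict = {"keys": None, "fetched_at": 0.0}

def get_jwks(fetch=lambda: {"keys": ["k1"]}):
    """Return the cached JWKS, refetching once the TTL has elapsed.

    `fetch` is a stand-in for the real call to the identity provider's
    JWKS endpoint (illustrative only, not from the actual codebase).
    """
    now = time.monotonic()
    if _jwks_cache["keys"] is None or now - _jwks_cache["fetched_at"] > JWKS_TTL_SECONDS:
        _jwks_cache["keys"] = fetch()
        _jwks_cache["fetched_at"] = now
    return _jwks_cache["keys"]
```

Within the TTL window, repeated calls return the cached key set without hitting the endpoint again.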
When the search query is "*", skip the substring match filter and return
all entities of the requested type(s). This enables the frontend Properties
and Individuals tabs to list all entities without a dedicated listing endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
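The wildcard behavior described in this commit can be sketched as follows. Function and parameter names are illustrative, not taken from the repository: `"*"` bypasses the substring filter entirely, while any other query matches labels case-insensitively.

```python
def filter_entities(entities, query, types=None):
    """Sketch of the wildcard search behavior (assumed names).

    entities: iterable of (label, entity_type) pairs.
    query "*" skips substring matching and returns all entities of the
    requested type(s); other queries match labels case-insensitively.
    """
    results = []
    for label, etype in entities:
        if types and etype not in types:
            continue
        if query != "*" and query.lower() not in label.lower():
            continue
        results.append((label, etype))
    return results
```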
…h, and auto-submit

Implements Phase C of the suggestion workflow — the FastAPI backend that
powers the frontend suggestion UI. Suggesters get a dedicated branch,
save edits as commits, and submit as PRs. Stale sessions auto-submit
via a cron job every 10 minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pydantic v2 HttpUrl is not a plain string, so calling .lower() on
cls.iri fails with AttributeError. Wrap in str() for both
get_root_classes and get_class_children sort_key functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
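The failure mode and fix can be demonstrated without Pydantic. `FakeHttpUrl` below mimics the relevant property of Pydantic v2's `HttpUrl`: it is `str()`-able but has no `.lower()` method, so a sort key must convert it first.

```python
class FakeHttpUrl:
    """Stand-in for pydantic v2 HttpUrl: str()-able but lacks .lower()."""
    def __init__(self, url: str):
        self._url = url
    def __str__(self) -> str:
        return self._url

def sort_key(cls_iri) -> str:
    # Wrapping in str() is the fix; calling cls_iri.lower() directly
    # would raise AttributeError on an HttpUrl-like object.
    return str(cls_iri).lower()

iris = [FakeHttpUrl("https://EX.org/B"), FakeHttpUrl("https://ex.org/a")]
sorted_iris = sorted(iris, key=sort_key)
```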
Join request approval now assigns suggester role instead of editor.
MemberBase schema default changed from viewer to suggester.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Release v0.2.0 — lifespan handlers, JWT roles, search service, rate
limiting, security middleware, suggestion sessions, beacon auth,
auto-submit worker, HttpUrl fix, default suggester role, and 85 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase A (Quality): Cross-reference service, 12-rule consistency checker,
duplicate detection via label similarity. 4 new endpoints with Redis caching.

Phase B (Analytics): EntityChangeEvent model + migration, graph-diff change
tracking hooked into save_source_content, entity history, project activity,
hot entities, and contributor stats. 4 new endpoints.

Phase C (Vector): pgvector-backed embedding tables + migration, pluggable
embedding providers (local/OpenAI/Voyage), semantic search, similar entities,
rank suggestions, auto-embed on save. 8 new endpoints, 2 ARQ worker tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
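The "duplicate detection via label similarity" mentioned in Phase A can be sketched with a normalized string ratio. `difflib` stands in for whatever metric the real service uses, and the 0.85 threshold is an illustrative value, not taken from the PR.

```python
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two labels (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicate_pairs(labels, threshold=0.85):
    """Return label pairs whose similarity meets the threshold."""
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if label_similarity(labels[i], labels[j]) >= threshold:
                pairs.append((labels[i], labels[j]))
    return pairs
```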
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ZITADEL_EXTERNALPORT: 8080 so issuer matches backend config
- Add port 53000 to CORS_ORIGINS for frontend dev server
- Use pgvector/pgvector:pg17 image instead of postgres:17-alpine
- Create pgvector extension in init-db.sh as superuser

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JohnRDOrazio
Member

@CodeRabbit review

@coderabbitai

coderabbitai bot commented Mar 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Mar 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds analytics, quality, embeddings, semantic search, and suggestion-session features including DB migrations and ORM models, embedding providers/services, new API routes and middleware, background worker tasks, auth + beacon token changes, extensive tests, and bumps version to 0.3.0.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Database Migrations**<br>`alembic/versions/l0g1h2i3j4k5_add_suggestion_sessions_table.py`, `alembic/versions/m1n2o3p4q5r6_add_entity_change_events.py`, `alembic/versions/n2o3p4q5r6s7_add_embedding_tables.py` | New migrations: create suggestion_sessions, entity_change_events, and embedding tables; enable pgvector; add indexes and constraints (including a partial unique active-session index); include downgrade steps. |
| **ORM Models & Exports**<br>`ontokit/models/change_event.py`, `ontokit/models/embedding.py`, `ontokit/models/suggestion_session.py`, `ontokit/models/project.py`, `ontokit/models/__init__.py` | Add EntityChangeEvent, embedding models, and SuggestionSession + status enum; wire the Project.suggestion_sessions relationship; update public exports. |
| **API Routes**<br>`ontokit/api/routes/__init__.py`, `analytics.py`, `embeddings.py`, `quality.py`, `semantic_search.py`, `suggestions.py`, `projects.py`, `search.py`, `lint.py`, `normalization.py` | Introduce routers/endpoints for analytics, embeddings (config/status/generate), quality checks, semantic search/similarity/ranking, and suggestions (session lifecycle + beacon); wire the DB session into search/projects; projects.save now records change events and may enqueue embedding jobs; lint/normalization use the centralized ARQ pool. |
| **Services (core & new)**<br>`ontokit/services/change_event_service.py`, `embedding_service.py`, `embedding_text_builder.py`, `suggestion_service.py`, `consistency_service.py`, `cross_reference_service.py`, `duplicate_detection_service.py`, `search.py`, `rdf_utils.py` | Add ChangeEventService, EmbeddingService (generation, semantic search, ranking), SuggestionService (session lifecycle), consistency/cross-reference/duplicate-detection services, SearchService (FTS + SPARQL), and RDF utilities. |
| **Embedding Providers & Factory**<br>`ontokit/services/embedding_providers/base.py`, `local_provider.py`, `openai_provider.py`, `voyage_provider.py`, `__init__.py` | Add the EmbeddingProvider interface and implementations: local (sentence-transformers), OpenAI, and Voyage; provide a get_embedding_provider factory. |
| **Schemas**<br>`ontokit/schemas/analytics.py`, `embeddings.py`, `quality.py`, `suggestion.py`, `ontology.py`, `project.py`, `__init__.py` | New analytics, quality, suggestion, and embeddings schemas; tighten IRI and length validators across ontology/project schemas; add ProjectCreate and the "suggester" role. |
| **Core & Auth**<br>`ontokit/core/auth.py`, `beacon_token.py`, `exceptions.py`, `middleware.py` | Extract Zitadel roles into TokenPayload, add a JWKS TTL cache + refresh, add HMAC beacon token creation/verification, define the OntoKitError hierarchy, and add RequestID/AccessLog/SecurityHeaders middleware. |
| **App, Worker & Ops**<br>`ontokit/main.py`, `ontokit/worker.py`, `compose.yaml`, `scripts/init-db.sh`, `pyproject.toml`, `ontokit/version.py` | Wire Redis, the rate limiter, lifespan startup checks, error handlers, and middlewares; add ARQ worker tasks (embedding, auto-submit of stale suggestions) and a cron; switch compose to the pgvector image and init-db to create the pgvector extension; add deps and bump VERSION to 0.3.0. |
| **API Utils (Redis)**<br>`ontokit/api/utils/redis.py` | Add shared get_arq_pool() and close_arq_pool() to centralize ARQ Redis pool usage across routes. |
| **Minor Service Adjustments**<br>`ontokit/services/join_request_service.py`, `ontology.py`, `project_service.py` | Join approvals now grant "suggester"; sorting fixes and wildcard support in ontology search; member ordering updated to include "suggester". |
| **Tests**<br>`tests/conftest.py`, many unit/integration tests | Extensive new/updated tests covering auth, beacon tokens, config, linter, ontology utils, search/SPARQL, quality, cross-references, duplicate detection, embedding flows, suggestion sessions, and integration routes. |
| **Packaging & CI**<br>`pyproject.toml`, `.github/workflows/release.yml`, `scripts/init-db.sh` | Add dependencies (slowapi, sentence-transformers, pgvector, numpy), adjust CI invocations, and ensure DB extension init for pgvector. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant API as FastAPI (suggestions router)
    participant Service as SuggestionService
    participant DB as Database
    participant Git as Git backend
    participant PR as PullRequest service

    Client->>API: POST /projects/{id}/suggestions/sessions
    API->>Service: create_session(project_id, user)
    Service->>DB: INSERT suggestion_session
    Service->>Git: create branch for session
    DB-->>Service: created session
    Service-->>API: 201 Created (session_id, branch, beacon_token)

    Client->>API: PUT /projects/{id}/suggestions/sessions/{sid}/save
    API->>Service: save(session_id, content, user)
    Service->>Git: commit to branch
    Service->>DB: UPDATE changes_count, entities_modified, last_activity
    Service-->>API: 200 OK (commit_hash, branch, changes_count)

    Client->>API: POST /projects/{id}/suggestions/sessions/{sid}/submit
    API->>Service: submit(session_id, summary, user)
    Service->>DB: verify session & changes
    Service->>Git: prepare PR branch/metadata
    Service->>PR: create pull request
    Service->>DB: UPDATE session status and pr reference
    Service-->>API: 200/201 (pr_number, pr_url, status)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Poem

I nibble vectors, hop through code, 🐇
Branches sprout where sessions grow,
Analytics count each little hop,
Quality fences keep bugs from crop,
Suggestions bloom — a carrot glow! 🥕



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
ontokit/services/join_request_service.py (1)

205-205: ⚠️ Potential issue | 🟡 Minor

Docstring inconsistency.

The implementation correctly assigns the suggester role, but the docstring at Line 205 still references "editor":

"Approve a join request and add the user as an editor."

This should be updated to reflect the new role assignment.

📝 Proposed fix
     async def approve_request(
         self,
         project_id: UUID,
         request_id: UUID,
         action: JoinRequestAction,
         user: CurrentUser,
     ) -> JoinRequestResponse:
-        """Approve a join request and add the user as an editor."""
+        """Approve a join request and add the user as a suggester."""
         await self._check_admin_access(project_id, user)

Also applies to: 233-237

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/join_request_service.py` at line 205, Update the misleading
docstrings that say "add the user as an editor" to accurately state that the
user is added/assigned the "suggester" role: specifically edit the docstring on
the approve_join_request method (and the other docstring block around the same
function area) so they describe approving a join request and assigning the
suggester role (mentioning the method name approve_join_request and any helper
like assign_role or add_user_to_org if present) and ensure the wording matches
the actual implementation that assigns "suggester".
ontokit/core/auth.py (1)

130-169: ⚠️ Potential issue | 🔴 Critical

Add audience (aud/azp) validation before extracting and applying token roles.

The validate_token() function disables JWT audience verification with options={"verify_aud": False} but does not perform manual audience validation afterwards. This means a token from the same Zitadel issuer but issued for a different client will still be accepted and its roles will be extracted and used for authorization decisions. Add validation to ensure the token's aud claim contains or azp claim matches settings.zitadel_client_id before accepting the token.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/core/auth.py` around lines 130 - 169, validate_token currently
disables JWT audience verification and extracts roles without checking that the
token was issued for our client; after jwt.decode and before calling
_extract_roles and constructing TokenPayload, validate that the payload's aud
contains settings.zitadel_client_id (or, if aud is a string, equals it) or that
payload.get("azp") == settings.zitadel_client_id, and if not raise
HTTPException(status_code=401, detail="Invalid audience"); keep the existing
issuer/alg checks and only perform this manual audience/azp check prior to using
roles from _extract_roles.
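The manual aud/azp check the reviewer recommends can be sketched as a standalone function over the decoded payload. This is illustrative only: the real code would raise an HTTP 401 rather than `ValueError`, and the payload/field handling assumes standard OIDC claim shapes.

```python
def check_audience(payload: dict, client_id: str) -> None:
    """Reject tokens not issued for our client (sketch of the suggested fix).

    Per OIDC, `aud` may be a string or a list; `azp` identifies the
    authorized party. The real handler would raise HTTPException(401).
    """
    aud = payload.get("aud")
    if isinstance(aud, str):
        aud_ok = aud == client_id
    elif isinstance(aud, list):
        aud_ok = client_id in aud
    else:
        aud_ok = False
    if not (aud_ok or payload.get("azp") == client_id):
        raise ValueError("Invalid audience")
```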
🟠 Major comments (22)
ontokit/core/auth.py-229-239 (1)

229-239: ⚠️ Potential issue | 🟠 Major

Widen the userinfo fallback to cover missing roles too.

Right now both helpers only call fetch_userinfo() when name or email is missing. Tokens that already include profile fields but omit the Zitadel roles claim will therefore come back with roles=[] and fail every require_roles(...) check even if userinfo could supply the roles.

Minimal fix
-    if not name or not email:
+    if not name or not email or not roles:
         userinfo = await fetch_userinfo(credentials.credentials)
         if userinfo:
             name = name or userinfo.get("name") or userinfo.get("preferred_username")
             email = email or userinfo.get("email")
             username = username or userinfo.get("preferred_username")
-            # Also extract roles from userinfo if not already present in token
-            if not roles:
-                roles = _extract_roles(userinfo)
+            roles = roles or _extract_roles(userinfo)

Apply the same change in both helper paths.

Also applies to: 289-299

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/core/auth.py` around lines 229 - 239, The token parsing currently
only calls fetch_userinfo() when name or email is missing, so tokens with roles
absent in token_payload.roles remain [] and fail role checks; update both helper
paths (the block referencing token_payload.roles and the analogous block around
lines 289-299) to also call fetch_userinfo() when roles is empty/falsey, then if
userinfo is present set roles = roles or _extract_roles(userinfo); ensure you
reference token_payload.roles, fetch_userinfo, and _extract_roles in the fix so
roles are populated from userinfo when missing from the token.
ontokit/services/search.py-191-206 (1)

191-206: ⚠️ Potential issue | 🟠 Major

Parse SPARQL queries properly to determine query type instead of guessing from the first token.

When queries include PREFIX, BASE, or comment preambles, the string-based detection fails: queries with these preambles that should be classified as ASK, CONSTRUCT, or DESCRIBE incorrectly default to SELECT. This causes two problems:

  1. Valid queries with preambles return the wrong response shape (e.g., ASK returns SELECT structure).
  2. When parsing fails, the exception handler (lines 208–218) returns an error response with the wrong query_type, masking the underlying issue instead of exposing a clear error to clients.

Use RDFLib's parseQuery() to properly parse SPARQL and extract the correct query type, which handles all valid SPARQL preambles.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/search.py` around lines 191 - 206, Replace the fragile
string-based detection (query_text/query_upper/query_type) with RDFLib's parser:
call rdflib.plugins.sparql.parser.parseQuery(query.query) up front, extract the
actual SPARQL operation type (map the parsed result to
"SELECT"/"ASK"/"CONSTRUCT"/"DESCRIBE"), and set query_type from that parsed
value (instead of guessing from the first token); if parseQuery raises, do not
silently default to "SELECT"—let the parse exception propagate (or set
query_type to a clear error state like "PARSE_ERROR") so the existing error
handler can return an accurate error response rather than masking the parse
failure. Ensure you update the code paths that reference query_type (the
variables query_text, query_upper, and query_type) to use the parsed type.
ontokit/services/duplicate_detection_service.py-85-91 (1)

85-91: ⚠️ Potential issue | 🟠 Major

Potential logic issue in cluster membership check.

The condition if root in parent or iri in parent may incorrectly exclude entities that were merged into clusters. An entity that was the root of a union won't be in parent as a key if it was never the child in a union operation. This could result in some cluster members being missed.

🐛 Proposed fix for cluster membership logic
     # Build clusters
     clusters_map: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
     entity_lookup = {iri: (label, etype) for iri, label, etype in entities}
+    # Track which entities are part of any cluster (appeared in at least one similar pair)
+    clustered_entities = set()
+    for (iri_a, iri_b) in pair_sim.keys():
+        clustered_entities.add(iri_a)
+        clustered_entities.add(iri_b)
+
     for iri, (label, etype) in entity_lookup.items():
-        root = find(iri)
-        if root in parent or iri in parent:
+        if iri in clustered_entities:
+            root = find(iri)
             clusters_map[find(iri)].append((iri, label, etype))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/duplicate_detection_service.py` around lines 85 - 91, The
current membership check using "if root in parent or iri in parent" can skip
entities whose representative root isn't a key in parent; instead, compute root
= find(iri) and always add the tuple to clusters_map[root] (i.e., remove the
conditional filter) so every entity is grouped under its representative; you can
then filter out singleton clusters later if needed—update the block that builds
clusters_map (referencing clusters_map, entity_lookup, find, parent, iri, root)
to always append to clusters_map[root].
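The corrected membership logic can be sketched with a small union-find: only entities that appear in at least one similar pair join a cluster, and every such entity is grouped under its representative root. Names are illustrative, not from the repository.

```python
def build_clusters(entities, similar_pairs):
    """Union-find clustering per the proposed fix (illustrative names).

    Entities absent from every similar pair form no cluster; everything
    else is appended under its representative root.
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    clustered: set[str] = set()
    for a, b in similar_pairs:
        union(a, b)
        clustered.update((a, b))

    clusters: dict[str, list[str]] = {}
    for iri in entities:
        if iri in clustered:
            clusters.setdefault(find(iri), []).append(iri)
    return clusters
```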
ontokit/services/consistency_service.py-285-300 (1)

285-300: ⚠️ Potential issue | 🟠 Major

Don't treat external vocabulary terms as dangling references.

Valid references like rdfs:range xsd:string or rdfs:subClassOf skos:Concept will be flagged unless those IRIs also appear as local subjects. This rule should only report unresolved in-ontology references, not standard or imported vocabulary terms.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/consistency_service.py` around lines 285 - 300, The check in
consistency_service.py flags external vocabulary IRIs as dangling because
`known` is only _all_declared_uris(graph) plus local subjects; update the logic
so external/imported vocabularies are treated as resolved: extend `known` to
include terms from the graph's namespace manager and common vocabularies before
the predicate loop (use graph.namespace_manager.namespaces() to collect
namespace URI prefixes and add any URIRefs whose namespace matches those
prefixes, and also include well-known constants like RDFS, RDF, XSD, SKOS, OWL),
or alternatively skip reporting when the object's namespace is one of the graph
namespaces or a standard vocab namespace; modify the code around
`_all_declared_uris`, `all_subjects`, `known`, and the predicate loop to apply
this namespace-based filter so only true in-ontology unresolved references are
reported.
ontokit/services/consistency_service.py-75-98 (1)

75-98: ⚠️ Potential issue | 🟠 Major

Walk every subClassOf edge when detecting cycles.

This loop only follows parents[0]. In a multi-inheritance graph, a cycle reachable through any other parent is missed, so the checker returns false negatives.

💡 Suggested fix
 def _check_cycle_detect(graph: Graph) -> list[ConsistencyIssue]:
     """Detect cycles in rdfs:subClassOf chains."""
     issues = []
     reported: set[str] = set()
 
     for cls in graph.subjects(RDF.type, OWL.Class):
         if not isinstance(cls, URIRef):
             continue
-        visited: set[URIRef] = set()
-        current = cls
-        while True:
-            if current in visited:
+        stack: list[tuple[URIRef, set[URIRef]]] = [(cls, set())]
+        while stack:
+            current, path = stack.pop()
+            if current in path:
                 iri = str(cls)
                 if iri not in reported:
                     issues.append(ConsistencyIssue(
                         rule_id="cycle_detect",
                         severity="error",
@@
-                break
-            visited.add(current)
+                break
+
+            next_path = path | {current}
             parents = [p for p in graph.objects(current, RDFS.subClassOf) if isinstance(p, URIRef)]
-            if not parents:
-                break
-            current = parents[0]
+            stack.extend((parent, next_path) for parent in parents)
     return issues
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/consistency_service.py` around lines 75 - 98, The cycle
check only follows the first parent (parents[0]) and misses cycles via other
parents; update the traversal starting at each class (the loop over
graph.subjects(RDF.type, OWL.Class)) to explore all RDFS.subClassOf edges (e.g.,
use a stack or queue to perform DFS/BFS over parents) instead of assigning
current = parents[0]; keep using the per-start visited/path set (visited) to
detect a back-edge and report via ConsistencyIssue/reported when a cycle is
found, and ensure you only push/visit URIRef parents (from
graph.objects(current, RDFS.subClassOf)) so all inheritance branches are
checked.
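The stack-based traversal over all parents can be shown on a plain adjacency dict (the real code walks an rdflib Graph; the dict here is a stand-in for illustration).

```python
def find_subclass_cycles(parents: dict[str, list[str]]) -> set[str]:
    """Detect classes on a subClassOf cycle, following *all* parent edges.

    `parents` maps a class IRI to all of its direct superclasses.
    A DFS with a per-path visited set catches cycles reachable through
    any parent, not just parents[0].
    """
    cyclic: set[str] = set()
    for cls in parents:
        stack = [(cls, frozenset())]
        while stack:
            current, path = stack.pop()
            if current in path:
                cyclic.add(cls)
                break
            next_path = path | {current}
            for p in parents.get(current, []):
                stack.append((p, next_path))
    return cyclic
```

Here the A/B cycle is reachable only through A's second parent, which the single-parent walk would miss.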
ontokit/api/routes/embeddings.py-99-104 (1)

99-104: ⚠️ Potential issue | 🟠 Major

Don't hardcode "main" as the implicit branch.

Projects can use master or any custom default branch. Using "main" here makes generation/status hit the wrong revision unless the caller already knows to override it.

Also applies to: 120-125

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/api/routes/embeddings.py` around lines 99 - 104, The endpoints (e.g.,
generate_embeddings) currently hardcode branch=Query(default="main"); instead
make branch optional (no hardcoded "main") and resolve the effective branch
inside the handler using the project's configured default branch fetched via
project_id (e.g., query Project.default_branch or project settings) so requests
without branch use the project's actual default; apply the same change to the
other similar endpoint(s) referenced around lines 120-125 so all embedding
generation/status handlers derive the branch from project metadata when branch
is not provided.
ontokit/services/cross_reference_service.py-61-63 (1)

61-63: ⚠️ Potential issue | 🟠 Major

Don't drop blank-node references here.

Common OWL patterns like subClassOf [ owl:someValuesFrom target ] attach the target through a blank node. continueing on non-URI subjects means those references never get attributed back to the owning entity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/cross_reference_service.py` around lines 61 - 63, The loop
currently ignores blank-node subjects (s) which drops important references;
instead of continuing on non-URI s, detect rdflib.term.BNode and resolve its
owning entity by finding subjects that point to that blank node (e.g., iterate
graph.subjects(None, s) or graph.triples((owner, None, s))) and attribute the
reference to those owning subjects (only treat owners that are URIRef); keep the
existing handling for direct URIRef subjects (s) and skip only when no URIRef
owner can be resolved.
ontokit/services/suggestion_service.py-353-359 (1)

353-359: ⚠️ Potential issue | 🟠 Major

max(pr_number) + 1 will race under concurrent submits.

Manual submit and auto-submit can both read the same max and allocate the same pr_number. This needs a DB-enforced allocator, or at least a retry path on unique-constraint failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/suggestion_service.py` around lines 353 - 359, The current
pattern (select max(PullRequest.pr_number) then +1) races under concurrent
submits; replace it with a DB-enforced allocator or a retry-on-unique-error
flow: either (preferred) create/use a DB sequence (e.g.,
nextval('pull_request_pr_number_seq')) and obtain pr_number from the sequence
via self.db.execute(nextval...) or by making pr_number default from that
sequence and performing an INSERT ... RETURNING pr_number, or (alternative) keep
the current logic but wrap the INSERT that uses pr_number in a retry loop that
catches the DB unique-constraint/IntegrityError and retries reading
max/allocating a new pr_number; update the code around max_number_result,
pr_number, and the insert path to use the chosen approach and ensure the unique
constraint on PullRequest.pr_number remains enforced.
ontokit/services/suggestion_service.py-219-227 (1)

219-227: ⚠️ Potential issue | 🟠 Major

Don't write personal emails into git authorship by default.

These values become part of commit history and can leak PII to repo viewers or external mirrors. Prefer a noreply/app alias unless the user explicitly opted into exposing their email.

Also applies to: 473-480

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/suggestion_service.py` around lines 219 - 227, The commit
author default currently uses session.user_email and session.user_name which can
leak PII; update the call to commit_to_branch (and the other similar call later)
to only use session.user_email/session.user_name when explicitly provided by the
user, otherwise provide a noreply/app alias (e.g., "noreply@ontokit.dev" or an
application-specific noreply) for author_email and a neutral author_name like
"OntoKit Bot" or "Suggester" to avoid embedding personal emails in git history;
ensure you change both the current commit_to_branch usage and the similar
invocation referenced in the comment so the fallback logic is centralized and
consistent.
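The centralized fallback the prompt describes could look like this. The alias, the default name "Suggester", and the `expose_email` opt-in flag are all hypothetical choices, not from the PR.

```python
NOREPLY = "noreply@example.invalid"  # placeholder; the real alias is a deploy choice

def commit_author(user_name, user_email, expose_email=False):
    """Pick git authorship: personal email only on explicit opt-in,
    otherwise a noreply alias and a neutral fallback name."""
    name = user_name or "Suggester"
    if expose_email and user_email:
        return name, user_email
    return name, NOREPLY
```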
ontokit/services/suggestion_service.py-497-521 (1)

497-521: ⚠️ Potential issue | 🟠 Major

Claim the stale session before creating its PR.

This loop reads active stale sessions without locking or an atomic status transition, so a manual submit—or another worker—can process the same session in parallel and create duplicate PRs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/suggestion_service.py` around lines 497 - 521, Before
calling _create_pr_for_session, perform an atomic claim on each session to avoid
races: for each session from stale_sessions, run an UPDATE via self.db that sets
a temporary "claimed" or "in-progress" status (use SuggestionSession.status and
SuggestionSessionStatus) with a WHERE that includes SuggestionSession.id ==
session.id, SuggestionSession.status == SuggestionSessionStatus.ACTIVE.value,
SuggestionSession.changes_count > 0 and SuggestionSession.last_activity <
cutoff; execute and only proceed to call _create_pr_for_session when the update
affected exactly one row (otherwise skip because another worker claimed it).
Ensure you check the DB execute result (rows affected) and only change to the
final AUTO_SUBMITTED status inside _create_pr_for_session or after success.
ontokit/main.py-45-50 (1)

45-50: ⚠️ Potential issue | 🟠 Major

Fail fast on a broken database dependency.

Swallowing the DB probe exception leaves the app serving requests while its primary dependency is unavailable. That's especially misleading because the health endpoint still reports healthy.

Proposed fix
     try:
         async with engine.connect() as conn:
             await conn.execute(text("SELECT 1"))
         logger.info("Database connection verified")
-    except Exception:
-        logger.exception("Failed to connect to the database — continuing startup")
+    except Exception:
+        logger.exception("Failed to connect to the database")
+        raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/main.py` around lines 45 - 50, The DB probe currently swallows all
exceptions (in the try around engine.connect / text("SELECT 1")) and only logs
via logger.exception, allowing startup to continue; change this to fail fast by
either re-raising the exception or terminating startup after logging so the
process does not become healthy with a broken DB. Specifically, in the
try/except that wraps async with engine.connect() ... await
conn.execute(text("SELECT 1")), keep the logger.exception call for context but
then raise the caught exception (or call sys.exit(1)) so the exception
propagates and prevents continuation of startup instead of silently continuing.
ontokit/services/suggestion_service.py-507-530 (1)

507-530: ⚠️ Potential issue | 🟠 Major

Rollback before continuing after a failed auto-submit.

If _create_pr_for_session() fails during flush/commit, AsyncSession stays in a failed transaction state. Without a rollback, the rest of this loop can cascade into PendingRollbackError.

Proposed fix
         for session in stale_sessions:
             try:
                 # Create a mock user for the auto-submit
                 mock_user = CurrentUser(
                     id=session.user_id,
                     email=session.user_email,
                     name=session.user_name,
                 )
                 await self._create_pr_for_session(
                     session.project_id,
                     session,
                     mock_user,
                     summary="Auto-submitted: session inactive for 30+ minutes.",
                     new_status=SuggestionSessionStatus.AUTO_SUBMITTED.value,
                 )
                 count += 1
                 logger.info(
                     f"Auto-submitted suggestion session {session.session_id} "
                     f"for project {session.project_id}"
                 )
             except Exception as e:
+                await self.db.rollback()
                 logger.error(
                     f"Failed to auto-submit session {session.session_id}: {e}"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/suggestion_service.py` around lines 507 - 530, The loop that
calls _create_pr_for_session can leave the SQLAlchemy AsyncSession in a failed
transaction state if that call raises; update the except block to rollback the
AsyncSession before continuing—e.g., in the except Exception as e: await
self._db.rollback() (or await self._db_session.rollback() / whichever
AsyncSession instance the service uses) then log the error and continue—this
ensures the AsyncSession is reset and avoids PendingRollbackError cascading for
subsequent iterations.
ontokit/services/cross_reference_service.py-67-72 (1)

67-72: ⚠️ Potential issue | 🟠 Major

The fallback misclassifies every unknown URI predicate as annotation_value.

isinstance(p, URIRef) is true for ordinary structural predicates too, so unrecognized relationships get bucketed under the wrong context instead of being skipped or mapped correctly.

Proposed fix
         if context is None:
             # Check if it's an annotation property with a URI value
-            if isinstance(p, URIRef):
+            if isinstance(p, URIRef) and (p, RDF.type, OWL.AnnotationProperty) in graph:
                 context = "annotation_value"
             else:
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/cross_reference_service.py` around lines 67 - 72, The
current fallback treats any unknown URI predicate p as an "annotation_value"
which misclassifies structural predicates; instead, in the block that sets
context when context is None, check whether the triple's object (o) is a Literal
or the predicate p is known to be an annotation property (e.g., by consulting
your annotation property set or mapper) before assigning "annotation_value";
otherwise skip/continue. Update the logic around variables context, p and o in
cross_reference_service.py so only literal-valued or explicitly-annotative
predicates become "annotation_value" and other unknown URI predicates are not
mis-bucketed.
ontokit/api/routes/embeddings.py-32-45 (1)

32-45: ⚠️ Potential issue | 🟠 Major

Use RedisSettings.from_dsn() to parse the full Redis URL instead of manual URL decomposition.

The current code drops credentials and TLS settings when initializing the ARQ pool. urlparse() only extracts host/port/database, but RedisSettings supports username, password, ssl, ssl_keyfile, ssl_certfile, and related TLS fields. URLs like redis://:password@host:6379 and rediss://host:6379 will fail to connect.

Replace the manual parsing with:

redis_settings = RedisSettings.from_dsn(redis_url)

The from_dsn() method parses the complete URL, automatically sets ssl=True for rediss:// schemes, and extracts authentication and database settings.
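For reference, a stdlib-only sketch (not the arq API itself; field names are illustrative) of what a full DSN parse has to capture, and which fields a host/port-only `urlparse()` decomposition drops:

```python
from urllib.parse import urlparse


def parse_redis_dsn(dsn: str) -> dict:
    """Parse a Redis DSN into the fields a RedisSettings-style config needs."""
    parts = urlparse(dsn)
    return {
        "host": parts.hostname or "localhost",
        "port": parts.port or 6379,
        "database": int(parts.path.lstrip("/") or 0),
        # Everything below is silently lost by a host/port-only parse:
        "username": parts.username,
        "password": parts.password,
        "ssl": parts.scheme == "rediss",
    }
```

With `rediss://user:s3cret@cache.internal:6380/2`, the naive parse keeps only the host, port, and database; the credentials and the TLS flag implied by the `rediss://` scheme never reach the pool, which is why such URLs fail to connect.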

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/api/routes/embeddings.py` around lines 32 - 45, The get_arq_pool
function currently reconstructs RedisSettings by manually parsing
settings.redis_url which drops credentials and TLS options; replace the manual
urlparse block with RedisSettings.from_dsn(str(settings.redis_url)) (assign to
redis_settings) so username/password and rediss:// SSL settings are preserved,
remove the urlparse import, and then pass that redis_settings into await
create_pool(...) to initialize _arq_pool as before.
ontokit/main.py-163-165 (1)

163-165: ⚠️ Potential issue | 🟠 Major

Use SlowAPI's _inject_headers() to properly preserve rate-limit headers.

The current implementation discards rate-limit headers (Retry-After, quota metadata). While exc.headers can carry headers, SlowAPI's default handler injects rate-limit headers via limiter._inject_headers(), which must be called to populate them. Simply copying exc.headers won't capture the dynamically-generated rate-limit metadata.

Proposed fix
 @app.exception_handler(RateLimitExceeded)
 async def rate_limit_handler(_request: Request, exc: RateLimitExceeded) -> JSONResponse:
-    return _error_response(429, "rate_limit_exceeded", str(exc.detail))
+    response = _error_response(429, "rate_limit_exceeded", str(exc.detail))
+    return _request.app.state.limiter._inject_headers(response, _request.state.view_rate_limit)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/main.py` around lines 163 - 165, The rate_limit_handler currently
returns _error_response(429, "rate_limit_exceeded", str(exc.detail)) but drops
SlowAPI rate-limit headers; update rate_limit_handler to create the JSONResponse
via _error_response and then call the Limiter instance's
_inject_headers(response, exc) (i.e., limiter._inject_headers) so the
Retry-After and quota metadata are populated from the RateLimitExceeded
exception before returning; locate the rate_limit_handler function and the
limiter object used for SlowAPI to perform this injection.
ontokit/services/embedding_providers/voyage_provider.py-41-57 (1)

41-57: ⚠️ Potential issue | 🟠 Major

Validate the batch response cardinality before extending all_embeddings.

The Voyage API docs require clients to validate that the response count matches the request count. Without this check, a partial or reordered upstream response silently breaks text→embedding alignment downstream. The API also provides an explicit index field in each response item for handling out-of-order results if needed.

Proposed fix
     async def embed_batch(self, texts: list[str]) -> list[list[float]]:
         async with httpx.AsyncClient(timeout=60.0) as client:
             all_embeddings: list[list[float]] = []
             for i in range(0, len(texts), 128):
                 batch = texts[i : i + 128]
                 resp = await client.post(
                     "https://api.voyageai.com/v1/embeddings",
                     headers={
                         "Authorization": f"Bearer {self._api_key}",
                         "Content-Type": "application/json",
                     },
                     json={"input": batch, "model": self._model_name},
                 )
                 resp.raise_for_status()
                 data = resp.json()
-                all_embeddings.extend([item["embedding"] for item in data["data"]])
+                items = data.get("data", [])
+                if len(items) != len(batch):
+                    raise ValueError(
+                        f"Expected {len(batch)} embeddings, got {len(items)}"
+                    )
+                all_embeddings.extend(item["embedding"] for item in items)
             return all_embeddings
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_providers/voyage_provider.py` around lines 41 -
57, In embed_batch, ensure the Voyage API response cardinality matches the
request and preserve ordering: after resp.json() (variable data) validate that
len(data["data"]) == len(batch) and raise or retry if not; then build a result
list of length len(batch) (e.g., placeholders) and populate it using each
item["index"] to place item["embedding"] in the correct slot before
extending/merging into all_embeddings, so all_embeddings alignment with input
texts is preserved (refer to embed_batch, batch, data, all_embeddings, and
item["index"]).
ontokit/services/embedding_service.py-88-96 (1)

88-96: ⚠️ Potential issue | 🟠 Major

Invalidate existing embeddings when the provider/model changes.

This mutation flips _get_provider() for future queries, but every stored EntityEmbedding row stays in the previous vector space. Until a full re-embed happens, semantic search can compare incompatible embeddings or fail on dimension mismatches. Clear or mark rows stale here and reset last_full_embed_at so callers cannot treat old vectors as current.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 88 - 96, When
config.provider or config.model_name is changed in the update block (where
update.provider/update.model_name are applied and get_embedding_provider is
called), also invalidate existing embeddings and reset the full-embed marker:
mark all EntityEmbedding rows for this config as stale (e.g., set a `stale` flag
or clear the stored vector) and persist that change, and set the config's
`last_full_embed_at` to None (or zero) before saving so callers cannot treat old
vectors as current; use the EntityEmbedding model and the same config identifier
used in embedding_service.py to locate and update rows immediately after
updating config.provider/config.model_name and config.dimensions.
ontokit/services/embedding_service.py-294-325 (1)

294-325: ⚠️ Potential issue | 🟠 Major

Full re-embed needs to prune removed entities.

This only upserts entities that still exist in the graph. If an ontology term was deleted, its old EntityEmbedding row survives and stays searchable forever, so a “full” embed never actually syncs the branch snapshot. Please delete rows whose entity_iri is no longer present in the current entity set as part of this path.
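The prune reduces to a plain set difference; a minimal sketch (helper name hypothetical, assuming the stored and current IRI sets are available at this point in the path):

```python
def plan_full_reembed(
    stored_iris: set[str], current_iris: set[str]
) -> tuple[set[str], set[str]]:
    """Split a full re-embed into upserts and deletions.

    A truly "full" embed must do both: upsert every entity still in
    the graph and delete rows whose entity_iri has disappeared.
    """
    to_upsert = set(current_iris)            # refresh everything still present
    to_delete = stored_iris - current_iris   # prune what the graph dropped
    return to_upsert, to_delete
```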

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 294 - 325, When doing the
"full" upsert in this embedding path, you must also delete EntityEmbedding rows
for the given project_id and branch whose entity_iri is not in the current
entities set; compute the current IRIs from the entities list (e.g., [iri for
iri in entities] or a set), then run a delete using the EntityEmbedding model to
remove rows where EntityEmbedding.project_id == project_id and
EntityEmbedding.branch == branch and EntityEmbedding.entity_iri not in that IRI
set (use SQLAlchemy delete(...) with .where(...) and .notin_(...) or
~EntityEmbedding.entity_iri.in_(...)), and commit with the existing
self._db.commit() so stale embeddings are pruned when job.embedded_entities is
updated.
ontokit/services/embedding_service.py-97-101 (1)

97-101: ⚠️ Potential issue | 🟠 Major

Do not store provider secrets as base64.

Base64 is reversible encoding, not encryption. Anyone with database access can recover the raw API key from api_key_encrypted, so this materially weakens the security posture of the feature. Use real encryption/KMS or a dedicated secret store before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 97 - 101, The code
currently stores provider secrets by base64-encoding update.api_key into
config.api_key_encrypted (see the update.api_key branch and
config.api_key_encrypted assignment); replace this reversible encoding with real
secret storage: remove the base64.b64encode(...) usage and instead call a secure
storage/encryption helper (e.g., encrypt_with_kms(update.api_key) or
secret_manager.save_secret/update_secret) and persist the returned secret
identifier or ciphertext, not the raw key; ensure the helper performs
authenticated encryption or uses your cloud KMS/secret-store API and update any
callers of config.api_key_encrypted to expect the encrypted value or secret
reference.
ontokit/services/embedding_service.py-532-548 (1)

532-548: ⚠️ Potential issue | 🟠 Major

Add branch filtering to rank_suggestions queries.

The rank_suggestions method queries EntityEmbedding without filtering by branch, unlike semantic_search and find_similar. Since EntityEmbedding has a unique constraint on (project_id, branch, entity_iri), the same IRI can exist on multiple branches. This causes two problems:

  1. The context lookup on line 537 using scalar_one_or_none() will raise an exception if the same context IRI exists on multiple branches.
  2. The candidate embeddings (lines 542–548) can be mixed from different branches when ranking.

Add branch to RankSuggestionRequest (or the method signature and API endpoint) and filter both queries with it:

.where(
    EntityEmbedding.project_id == project_id,
    EntityEmbedding.branch == branch,
    EntityEmbedding.entity_iri == body.context_iri,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 532 - 548, The
rank_suggestions flow is missing branch filtering which can cause
scalar_one_or_none() on the context query in rank_suggestions and the candidate
query in candidates_q to return multiple rows from different branches; update
RankSuggestionRequest (or the rank_suggestions method signature/API) to accept a
branch and add EntityEmbedding.branch == branch to both the context lookup
(where EntityEmbedding.entity_iri == body.context_iri) and the candidates query
(where EntityEmbedding.entity_iri.in_(body.candidates)), ensuring
rank_suggestions, EntityEmbedding queries, and downstream logic mirror the
branch filtering used by semantic_search and find_similar.
ontokit/services/embedding_service.py-343-347 (1)

343-347: ⚠️ Potential issue | 🟠 Major

Rollback before recording the failed job.

If any DB write inside the try block raises, the AsyncSession enters a failed transaction state. In that case, this commit() will not persist status="failed", leaving the job stuck as running.

🩹 Proposed fix
         except Exception as e:
+            await self._db.rollback()
             job.status = "failed"
             job.error_message = str(e)
             job.completed_at = datetime.now(UTC)
             await self._db.commit()
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 343 - 347, The except
block handling failures must rollback the failed AsyncSession transaction before
updating the Job record and committing; modify the except handler in
embedding_service.py so it first calls await self._db.rollback() (or the session
rollback method used in your codebase) to clear the failed transaction state,
then set job.status = "failed", job.error_message = str(e), job.completed_at =
datetime.now(UTC), and finally await self._db.commit() to persist the changes;
reference the existing symbols job, self._db.rollback(), and self._db.commit()
when making the change.
ontokit/services/embedding_service.py-264-271 (1)

264-271: ⚠️ Potential issue | 🟠 Major

Deduplicate typed subjects to prevent embedding duplicates and inaccurate progress tracking.

graph.subjects(RDF.type, None) yields each subject once per matching rdf:type triple, not once per unique subject. Entities with multiple type assertions will be processed multiple times: duplicated in the entities list, inflating job.total_entities, making progress inaccurate, and wasting provider quota on larger ontologies.

♻️ Proposed fix
             entities: list[tuple[URIRef, str, str]] = []  # (uri, type, text)
+            seen: set[URIRef] = set()
             for s in graph.subjects(RDF.type, None):
-                if not isinstance(s, URIRef) or s == OWL.Thing:
+                if (
+                    not isinstance(s, URIRef)
+                    or s == OWL.Thing
+                    or s in seen
+                ):
                     continue
+                seen.add(s)

Alternatively, use graph.subjects(RDF.type, None, unique=True) directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 264 - 271, The loop over
graph.subjects(RDF.type, None) can yield the same subject multiple times if it
has multiple rdf:type triples, causing duplicates in entities and inflating
job.total_entities; change the iteration to deduplicate subjects (e.g., use
graph.subjects(RDF.type, None, unique=True) or collect into a set of subjects
first) before calling _get_entity_type and build_embedding_text and appending to
entities so each subject is processed only once and progress/provider usage is
accurate.
🟡 Minor comments (1)
ontokit/services/embedding_providers/openai_provider.py-47-52 (1)

47-52: ⚠️ Potential issue | 🟡 Minor

API key may leak in exception tracebacks.

If raise_for_status() throws, the httpx exception may include request headers containing the Bearer token. Consider wrapping with a try/except that redacts sensitive details or using httpx's event hooks.

🛡️ Proposed mitigation
-            resp.raise_for_status()
+            try:
+                resp.raise_for_status()
+            except httpx.HTTPStatusError as e:
+                # Re-raise without exposing auth header
+                raise httpx.HTTPStatusError(
+                    message=f"OpenAI API error: {e.response.status_code}",
+                    request=e.request,
+                    response=e.response,
+                ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_providers/openai_provider.py` around lines 47 -
52, The code calls client.post(...) and then resp.raise_for_status(), which can
cause httpx exceptions to include the Authorization header (Bearer token) in
tracebacks; wrap the POST and raise_for_status() in a try/except that catches
httpx.HTTPError (or broader Exception), and on error redact or remove the
Authorization value before re-raising or logging (e.g., build
headers_without_auth = {k:v for k,v in headers.items() if k.lower() !=
"authorization"} and use that for any log/exception message), or re-raise a
sanitized error with only non-sensitive request/response details; apply this
change around the client.post/resp.raise_for_status call in the embedding
provider method (referenced by client.post, resp, and resp.raise_for_status).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b1d4469-bea9-4ec3-a02e-89852fbc820e

📥 Commits

Reviewing files that changed from the base of the PR and between 339631b and f4f7a12.

📒 Files selected for processing (57)
  • alembic/versions/l0g1h2i3j4k5_add_suggestion_sessions_table.py
  • alembic/versions/m1n2o3p4q5r6_add_entity_change_events.py
  • alembic/versions/n2o3p4q5r6s7_add_embedding_tables.py
  • compose.yaml
  • ontokit/api/routes/__init__.py
  • ontokit/api/routes/analytics.py
  • ontokit/api/routes/embeddings.py
  • ontokit/api/routes/projects.py
  • ontokit/api/routes/quality.py
  • ontokit/api/routes/search.py
  • ontokit/api/routes/semantic_search.py
  • ontokit/api/routes/suggestions.py
  • ontokit/core/auth.py
  • ontokit/core/beacon_token.py
  • ontokit/core/exceptions.py
  • ontokit/core/middleware.py
  • ontokit/main.py
  • ontokit/models/__init__.py
  • ontokit/models/change_event.py
  • ontokit/models/embedding.py
  • ontokit/models/project.py
  • ontokit/models/suggestion_session.py
  • ontokit/schemas/__init__.py
  • ontokit/schemas/analytics.py
  • ontokit/schemas/embeddings.py
  • ontokit/schemas/ontology.py
  • ontokit/schemas/project.py
  • ontokit/schemas/quality.py
  • ontokit/schemas/suggestion.py
  • ontokit/services/change_event_service.py
  • ontokit/services/consistency_service.py
  • ontokit/services/cross_reference_service.py
  • ontokit/services/duplicate_detection_service.py
  • ontokit/services/embedding_providers/__init__.py
  • ontokit/services/embedding_providers/base.py
  • ontokit/services/embedding_providers/local_provider.py
  • ontokit/services/embedding_providers/openai_provider.py
  • ontokit/services/embedding_providers/voyage_provider.py
  • ontokit/services/embedding_service.py
  • ontokit/services/embedding_text_builder.py
  • ontokit/services/join_request_service.py
  • ontokit/services/ontology.py
  • ontokit/services/project_service.py
  • ontokit/services/search.py
  • ontokit/services/suggestion_service.py
  • ontokit/version.py
  • ontokit/worker.py
  • pyproject.toml
  • scripts/init-db.sh
  • tests/conftest.py
  • tests/integration/test_project_workflow.py
  • tests/unit/test_auth.py
  • tests/unit/test_config.py
  • tests/unit/test_linter.py
  • tests/unit/test_ontology_service.py
  • tests/unit/test_projects_routes.py
  • tests/unit/test_search.py

Comment on lines +188 to +189
if graph is None:
graph = Graph()

⚠️ Potential issue | 🔴 Critical

Don't silently turn a missing ontology graph into Graph().

ontokit/api/routes/search.py currently calls this method without a graph, so this fallback makes the shipped /sparql endpoint execute every request against a fresh empty graph and return empty data. This should either load the requested ontology first or fail fast until that wiring exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/search.py` around lines 188 - 189, The code currently does
"if graph is None: graph = Graph()", which silently substitutes an empty graph
and causes the SPARQL endpoint to return empty results; instead either require a
real graph and fail fast or explicitly load the requested ontology: remove the
silent Graph() fallback and (a) raise a clear exception (e.g. raise
ValueError("graph is required")) so callers must pass a populated graph, or (b)
call the project’s ontology-loading helper (e.g. load_ontology_graph(...) or
OntologyStore.get_graph(...)) to populate graph when an ontology id is
available; update the caller(s) that invoked this function without a graph to
either pass the loaded graph or provide the ontology id used for loading.

JohnRDOrazio and others added 17 commits March 8, 2026 23:01
The docstring on approve_request incorrectly stated "add the user as an
editor" when the implementation actually assigns the "suggester" role.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
verify_aud was disabled in jwt.decode with no follow-up check, allowing
tokens issued for other clients to be accepted. Now validates that aud
contains or azp matches our zitadel_client_id before extracting roles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both get_current_user and get_current_user_with_token only called
fetch_userinfo() when name or email was missing. If the token had both
but no roles claim, roles would stay [] and role checks would fail.
Now also triggers the userinfo fallback when roles is empty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fragile string-based detection (startswith on uppercased query)
with rdflib.plugins.sparql.parser.parseQuery. This correctly handles
queries with PREFIX declarations before the operation keyword. Parse
errors now raise ValueError instead of silently defaulting to SELECT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The conditional `if root in parent or iri in parent` could skip entities
whose representative wasn't a key in the parent dict, causing incomplete
clusters. Now unconditionally groups every entity under its root.
Singleton clusters are already filtered out downstream (len < 2).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
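The corrected grouping can be sketched stand-alone (the real parent map and downstream filtering live in duplicate_detection_service.py; names here are illustrative):

```python
from collections import defaultdict


def group_clusters(parent: dict[str, str]) -> list[list[str]]:
    """Group every entity under its union-find root, then drop singletons."""

    def find(iri: str) -> str:
        # Entities absent from `parent` are their own representative.
        while parent.get(iri, iri) != iri:
            iri = parent[iri]
        return iri

    clusters: defaultdict[str, list[str]] = defaultdict(list)
    for iri in set(parent) | set(parent.values()):
        clusters[find(iri)].append(iri)  # unconditional: no membership check
    return [sorted(c) for c in clusters.values() if len(c) >= 2]
```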
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dangling_ref rule flagged terms from standard vocabularies (RDFS,
OWL, XSD, SKOS, DCTERMS) and graph-registered namespaces as undeclared.
Now collects namespace prefixes from the graph's namespace manager and
well-known vocabs, and skips URIs matching those namespaces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous loop only followed parents[0], missing cycles reachable
through other inheritance branches. Now uses DFS with a stack to visit
all rdfs:subClassOf edges from each class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
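A stand-alone sketch of that stack-based DFS (the service walks rdfs:subClassOf triples in an rdflib graph; here the hierarchy is a plain dict for illustration):

```python
def has_inheritance_cycle(start: str, parents: dict[str, list[str]]) -> bool:
    """Return True if `start` can reach itself via any superclass chain."""
    stack = list(parents.get(start, []))
    visited: set[str] = set()
    while stack:
        cls = stack.pop()
        if cls == start:
            return True  # walked back to the starting class: cycle
        if cls in visited:
            continue
        visited.add(cls)
        stack.extend(parents.get(cls, []))  # follow *all* parents, not just [0]
    return False
```

A parents[0]-only walk would miss the cycle below, because the loop back to A runs through A's second parent.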
generate_embeddings and get_embedding_status hardcoded branch="main".
Now branch is optional and resolved via git.get_default_branch(project_id)
when not provided, matching the pattern used in the projects routes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Blank-node subjects (e.g. OWL restrictions) were silently skipped,
dropping references like owl:someValuesFrom from restriction nodes.
Now traces blank nodes back to their owning named entities (URIRefs)
via _resolve_bnode_owners, which walks up through nested blank nodes
with cycle detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SELECT MAX(pr_number) + 1 pattern can produce duplicate numbers
under concurrent inserts. Now retries up to 3 times on IntegrityError,
re-reading the max pr_number after each conflict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
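The retry pattern can be sketched with SQLite standing in for Postgres (table name and schema are illustrative; the real code uses SQLAlchemy):

```python
import sqlite3


def next_pr_number(conn: sqlite3.Connection, project_id: int, retries: int = 3) -> int:
    """Allocate MAX(pr_number)+1, re-reading when a concurrent insert wins."""
    for _ in range(retries):
        (candidate,) = conn.execute(
            "SELECT COALESCE(MAX(pr_number), 0) + 1 FROM prs WHERE project_id = ?",
            (project_id,),
        ).fetchone()
        try:
            conn.execute(
                "INSERT INTO prs (project_id, pr_number) VALUES (?, ?)",
                (project_id, candidate),
            )
            return candidate
        except sqlite3.IntegrityError:
            continue  # another writer took this number; retry with a fresh max
    raise RuntimeError("could not allocate a PR number")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prs (project_id INT, pr_number INT, UNIQUE (project_id, pr_number))"
)
```

The unique constraint is what converts a lost race into an `IntegrityError` instead of a silent duplicate, so the retry loop is only safe if that constraint exists.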
…haining

Remove unused imports (update, ProjectMember, PullRequestService) and
add missing 'from e' to exception re-raises (B904).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiple workers running auto_submit_stale_sessions could process the
same session concurrently. Now uses an atomic UPDATE with WHERE
conditions to claim each session before creating a PR. Only proceeds
when exactly one row is affected; reverts the claim on failure so the
session can be retried.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
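The claim step reduces to checking the affected-row count of a conditional UPDATE; a SQLite sketch (schema and the claiming status name are illustrative):

```python
import sqlite3


def claim_session(conn: sqlite3.Connection, session_id: str) -> bool:
    """Atomically claim an ACTIVE session; exactly one worker can win."""
    cur = conn.execute(
        "UPDATE sessions SET status = 'SUBMITTING' "
        "WHERE session_id = ? AND status = 'ACTIVE'",
        (session_id,),
    )
    return cur.rowcount == 1  # proceed only if this worker flipped the row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (session_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO sessions VALUES ('s1', 'ACTIVE')")
```

Because the status check lives inside the UPDATE's WHERE clause, two workers racing on the same session cannot both see `rowcount == 1`; reverting the status to ACTIVE on failure makes the session claimable again.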
The DB probe was swallowing exceptions and continuing startup, leaving
the API running with a broken database. Now re-raises after logging
so the process fails to start.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If _create_pr_for_session fails with a DB error, the AsyncSession is
left in a failed transaction state. The subsequent commit to revert
the claim would raise PendingRollbackError. Now rollbacks first to
reset the session before reverting status to ACTIVE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fallback treated any unknown URI predicate as annotation_value,
misclassifying structural predicates like owl:unionOf. Now checks
that the predicate is declared as owl:AnnotationProperty in the graph
before assigning annotation_value context; other unknown predicates
are skipped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
JohnRDOrazio and others added 3 commits March 9, 2026 02:56
…and cross-references

49 new tests covering five low-coverage modules using in-memory RDFLib
graphs and monkeypatch — no infrastructure dependencies required.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enables automatic linting, formatting, and type checking on commit.
Developers run `pre-commit install` after cloning to activate hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

`make setup` installs dev dependencies and pre-commit hooks in one step.
Also adds lint, format, typecheck, and test convenience targets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (4)
ontokit/services/suggestion_service.py (2)

255-269: ⚠️ Potential issue | 🟠 Major

Re-check session state inside the branch lock and use the same lock in discard().

save() and beacon_save() validate ACTIVE before acquiring _branch_locks, while discard() deletes the branch without that lock. A concurrent discard can slip between those steps and turn a valid save into a 500 or race a beacon write against branch deletion. Revalidate session.status after the lock is acquired, and take the same per-branch lock around discard.

Suggested fix
-        if session.status != SuggestionSessionStatus.ACTIVE.value:
-            raise HTTPException(
-                status_code=status.HTTP_400_BAD_REQUEST,
-                detail=f"Session is {session.status}, cannot save",
-            )
-
         project = await self._get_project(project_id)
         filename = self._get_git_ontology_path(project)
 
         # Serialize git writes per branch to prevent lost commits
         async with _branch_locks[session.branch]:
+            await self.db.refresh(session)
+            if session.status != SuggestionSessionStatus.ACTIVE.value:
+                raise HTTPException(
+                    status_code=status.HTTP_400_BAD_REQUEST,
+                    detail=f"Session is {session.status}, cannot save",
+                )
             # Commit to the suggestion branch
             commit_message = f"Update {data.entity_label}"
-        # Delete the git branch
-        try:
-            self.git_service.delete_branch(project_id, session.branch, force=True)
-        except Exception as e:
-            logger.warning(f"Failed to delete suggestion branch {session.branch}: {e}")
-
-        session.status = SuggestionSessionStatus.DISCARDED.value
-        session.last_activity = datetime.now(UTC)
-        await self.db.commit()
+        async with _branch_locks[session.branch]:
+            await self.db.refresh(session)
+            if session.status != SuggestionSessionStatus.ACTIVE.value:
+                raise HTTPException(
+                    status_code=status.HTTP_400_BAD_REQUEST,
+                    detail=f"Session is {session.status}, cannot discard",
+                )
+            try:
+                self.git_service.delete_branch(project_id, session.branch, force=True)
+            except Exception as e:
+                logger.warning(f"Failed to delete suggestion branch {session.branch}: {e}")
+
+            session.status = SuggestionSessionStatus.DISCARDED.value
+            session.last_activity = datetime.now(UTC)
+            await self.db.commit()

Apply the same post-lock refresh/status check in beacon_save() as well.

Also applies to: 520-540, 563-578

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/suggestion_service.py` around lines 255 - 269, The
save()/beacon_save() flow validates session.status before acquiring the
per-branch lock (_branch_locks[session.branch]) which allows a concurrent
discard() to delete the branch; after acquiring the branch lock in save() and
beacon_save() re-fetch or refresh the session (via _get_session(project_id,
session_id) or reload the session object) and re-check that session.status ==
SuggestionSessionStatus.ACTIVE.value and raise the same HTTPException if not;
additionally wrap discard() with the same per-branch lock (async with
_branch_locks[session.branch]) so branch deletion serializes with
saves/beacon_saves to prevent races.

93-100: ⚠️ Potential issue | 🟠 Major

Reject absolute ontology paths before stripping separators.

Line 94 turns an absolute path like /ontologies/core.ttl into ontologies/core.ttl, so invalid absolute paths are silently accepted and writes can land on the wrong repo-relative file. Check os.path.isabs(...) before normalizing/stripping.

Suggested fix
     if project.source_file_path:
-        path = os.path.normpath(project.source_file_path).lstrip("/\\")
+        if os.path.isabs(project.source_file_path):
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail="Invalid ontology path",
+            )
+        path = os.path.normpath(project.source_file_path)
         if path.startswith(".."):
             raise HTTPException(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 detail="Invalid ontology path",
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/suggestion_service.py` around lines 93 - 100, The code
currently normalizes and strips leading separators from project.source_file_path
which lets absolute paths like "/ontologies/core.ttl" become
"ontologies/core.ttl" and bypass validation; update the logic around
project.source_file_path (the block referencing project.source_file_path) to
first call os.path.isabs(project.source_file_path) and if true raise the same
HTTPException (status.HTTP_400_BAD_REQUEST, detail="Invalid ontology path")
before doing os.path.normpath() and lstrip, then keep the existing check for
path.startswith("..") and return the cleaned path.
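As a standalone sketch of that ordering, with `ValueError` standing in for the route's `HTTPException`:

```python
import os

def clean_ontology_path(source_file_path: str) -> str:
    # Reject absolute paths FIRST: normalizing or lstrip()-ing separators
    # would silently turn "/ontologies/core.ttl" into a repo-relative path.
    if os.path.isabs(source_file_path):
        raise ValueError("Invalid ontology path")
    path = os.path.normpath(source_file_path)
    # Reject traversal out of the repository root.
    if path.startswith(".."):
        raise ValueError("Invalid ontology path")
    return path
```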
ontokit/services/search.py (1)

81-89: ⚠️ Potential issue | 🟠 Major

Fail fast when _db was never injected.

Returning 200 with results=[] still makes a dependency-wiring bug indistinguishable from a legitimate “no matches” response. Keeping db=None for SPARQL-only use is fine, but search() should raise once it reaches the database path.

Suggested fix
         if self._db is None:
-            # No database session available -- return empty results.
-            elapsed = (time.perf_counter() - start_time) * 1000
-            return SearchResponse(
-                results=[],
-                total=0,
-                query=query.query,
-                took_ms=elapsed,
-            )
+            raise RuntimeError("SearchService.search requires a database session")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/search.py` around lines 81 - 89, The search() method
currently returns an empty SearchResponse when self._db is None, which hides
dependency wiring bugs; instead, in ontokit.services.search.py inside the
search(self, query, ...) flow, detect if self._db is None before entering the
database query path and raise a clear exception (e.g., RuntimeError or a custom
DependencyError) with a descriptive message that the DB was not injected, rather
than returning SearchResponse(results=[]). Update the branch that references
self._db to raise the error so SPARQL-only use can still construct the service
but any code that reaches the DB path in search() fails fast.
ontokit/services/embedding_service.py (1)

160-182: ⚠️ Potential issue | 🟠 Major

Coverage can still read as 100% when it isn't.

Line 172 looks for finished jobs with status "complete", but Line 370 writes "completed", so get_status() never picks up successful full embeds as the denominator source. When that query misses, Line 160 falls back to embedded_count, which still overreports coverage after incremental auto-embeds.

Also applies to: 370-372

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/embedding_service.py` around lines 160 - 182, The coverage
calculation falls back to the last completed job by querying EmbeddingJob.status
== "complete" but elsewhere the code writes status "completed", so successful
jobs are missed and coverage can be reported as 100%; update the status check to
use the canonical status string or a single shared constant/enum (e.g., replace
the literal "complete" with the same value used when writing status, or
reference a shared constant/enum) in the query in embedding_service.py (the
block using last_job_q and last_total) and make the identical fix where statuses
are written (the other place referenced around the later job update), ensuring
both use the same identifier (e.g., "completed" or a STATUS_COMPLETE constant)
so the denominator picks up finished jobs correctly.
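One way to make that drift impossible, sketched with a hypothetical EmbeddingJobStatus enum (the actual service stores plain strings):

```python
from enum import Enum

class EmbeddingJobStatus(str, Enum):
    # Single source of truth for job states: both the writer (job
    # completion) and the reader (coverage denominator query) reference
    # the same member, so "complete" vs "completed" cannot drift apart.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

def mark_completed(job: dict) -> None:
    job["status"] = EmbeddingJobStatus.COMPLETED.value

def is_completed(job: dict) -> bool:
    return job["status"] == EmbeddingJobStatus.COMPLETED.value
```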
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ontokit/api/routes/embeddings.py`:
- Around line 132-139: The code enqueues an ARQ job after committing a pending
row so failures in get_arq_pool() or pool.enqueue_job(...) leave a stuck active
job; wrap the ARQ calls (get_arq_pool and pool.enqueue_job with args
str(project_id), resolved_branch, str(job_id)) in a try/except, and on any
exception delete or mark the pending job as failed (use the same job identifier
job_id or the repo/job model you used to create the pending row) before
re-raising or returning an error; ensure cleanup uses the same transactional
context or a safe retry so no pending row remains if enqueueing fails and keep
the successful return of EmbeddingGenerateResponse(job_id=str(job_id)) only
after enqueue succeeds.

In `@ontokit/api/routes/quality.py`:
- Around line 121-135: The endpoint runs run_consistency_check(...)
synchronously and currently returns a job_id even if caching to Redis fails,
which yields dead job IDs; change the flow so you only generate/return a job_id
after a successful cache write (or if you generate it earlier, treat Redis
failures as a hard failure). Specifically, call run_consistency_check(graph,
str(project_id), branch) as before, then attempt to get the Redis client via
_get_redis() and persist the result_json under cache_key and job_key; if any
await redis.set(...) call throws, propagate an error response (e.g., raise an
HTTPException / return a 500) instead of swallowing the exception and returning
ConsistencyCheckTriggerResponse(job_id=...). Keep references:
run_consistency_check, _get_redis, cache_key ("quality:{project_id}:{branch}"),
job_key ("quality_job:{project_id}:{job_id}"), and
ConsistencyCheckTriggerResponse.
- Around line 119-129: Resolve and reuse the actual branch value returned/used
by _load_graph instead of the original query param: call _load_graph(project_id,
branch, db), extract the resolved branch from the returned graph (e.g.,
graph.branch or graph.repo.default_branch depending on the graph shape), then
pass that resolved_branch into run_consistency_check(project_id) (i.e.,
run_consistency_check(graph, str(project_id), resolved_branch), use
resolved_branch when forming cache_key/job_key, and set result.branch =
resolved_branch before serializing with result.model_dump_json(); apply the same
fix to the other handlers mentioned (around the blocks that use
run_consistency_check, _get_redis, cache_key/job_key at the later locations).

In `@ontokit/api/routes/semantic_search.py`:
- Around line 45-47: The handlers semantic_search() and find_similar_entities()
are defaulting the branch Query param to "main" instead of resolving the
project's actual default branch like rank_suggestions() does; update both
handlers to, when branch is omitted/None, call the same repo/default-branch
resolution used by rank_suggestions() (e.g., the helper or method that returns
the project's default branch) and pass that resolved branch to EmbeddingService
instead of the literal "main"; apply the same change to the other parameterized
endpoints referenced around lines 65-67 so all calls into EmbeddingService use
the repo-resolved default branch.

In `@ontokit/core/auth.py`:
- Around line 25-36: The current code uses bare dict annotations causing mypy
failures; add from typing import Any and define a JsonObject = dict[str, Any]
(or use Mapping[str, Any] for read-only) near the top of the file, then change
_extract_roles(payload: dict) -> list[str] to _extract_roles(payload:
Mapping[str, Any]) -> list[str], annotate _jwks_cache as JsonObject | None,
change get_jwks return type to JsonObject, and update fetch_userinfo's dict |
None to JsonObject | None so all helpers use the new typed alias.

In `@ontokit/main.py`:
- Around line 42-43: Annotate the loose types: change the function parameter
`body` to the precise mapping type `dict[str, dict[str, str] | object]`; in
ontokit/version.py add an explicit type to `VERSION_BASE` (e.g. `VERSION_BASE:
str = VERSION.replace(...)`) so `__version__` is clearly a str; and make the
Redis typing consistent by matching the declared `redis_client` type to whatever
`aioredis.from_url(..., decode_responses=True)` actually returns (either change
the `redis_client` annotation to the concrete return type from
`aioredis.from_url` or use `typing.cast(...)` when assigning the result) so mypy
strict mode no longer reports a mismatch.

In `@ontokit/services/consistency_service.py`:
- Around line 18-25: The function _iter_classes currently lacks a return type
annotation causing mypy failures; add a return type of Iterator[URIRef] to its
signature (def _iter_classes(graph: Graph) -> Iterator[URIRef]:) and ensure
Iterator is imported from typing (or collections.abc) alongside existing
RDF/URIRef/Graph imports so the file type-checks under strict mypy.

In `@ontokit/services/embedding_service.py`:
- Around line 604-618: The queries use body.branch directly which can be None
and causes unintended IS NULL filters; resolve the branch to the project's
default before the embedding lookups by fetching the project's default branch
(e.g., via the Project record or an existing helper like
get_default_branch(project_id)) and assign it to a local resolved_branch (fall
back to body.branch if already set), then use resolved_branch in the
EntityEmbedding queries (both the context emb_q and candidates_q) instead of
body.branch so lookups for omitted-branch requests use the project's default
branch.

In `@ontokit/services/search.py`:
- Line 9: Replace the internal SPARQLResult import with RDFLib's public Result
type: change the import from rdflib.plugins.sparql.processor.SPARQLResult to
rdflib.query.Result and update any type annotations that reference SPARQLResult
(e.g., the variable or return type used around the Graph.query() handling at the
site flagged) to use Result instead so the annotation matches Graph.query()'s
public API.

In `@ontokit/services/suggestion_service.py`:
- Around line 201-238: The code currently deletes the git branch in the generic
except that also catches failures from await self.db.refresh(db_session) after a
successful commit; change flow so branch cleanup only happens when add/commit
fails: wrap self.db.add(...) and await self.db.commit() in their own try/except
(handling IntegrityError as before and running
self.git_service.delete_branch(...) on those failures), then perform await
self.db.refresh(db_session) in a separate try/except that does NOT call
self.git_service.delete_branch (instead log the refresh failure and raise an
appropriate HTTPException); keep the existing logic that returns an existing
SuggestionSessionResponse when IntegrityError finds a concurrent active session,
and continue to re-raise unexpected exceptions from the commit block after
cleanup.

---

Duplicate comments:
In `@ontokit/services/embedding_service.py`:
- Around line 160-182: The coverage calculation falls back to the last completed
job by querying EmbeddingJob.status == "complete" but elsewhere the code writes
status "completed", so successful jobs are missed and coverage can be reported
as 100%; update the status check to use the canonical status string or a single
shared constant/enum (e.g., replace the literal "complete" with the same value
used when writing status, or reference a shared constant/enum) in the query in
embedding_service.py (the block using last_job_q and last_total) and make the
identical fix where statuses are written (the other place referenced around the
later job update), ensuring both use the same identifier (e.g., "completed" or a
STATUS_COMPLETE constant) so the denominator picks up finished jobs correctly.

In `@ontokit/services/search.py`:
- Around line 81-89: The search() method currently returns an empty
SearchResponse when self._db is None, which hides dependency wiring bugs;
instead, in ontokit.services.search.py inside the search(self, query, ...) flow,
detect if self._db is None before entering the database query path and raise a
clear exception (e.g., RuntimeError or a custom DependencyError) with a
descriptive message that the DB was not injected, rather than returning
SearchResponse(results=[]). Update the branch that references self._db to raise
the error so SPARQL-only use can still construct the service but any code that
reaches the DB path in search() fails fast.

In `@ontokit/services/suggestion_service.py`:
- Around line 255-269: The save()/beacon_save() flow validates session.status
before acquiring the per-branch lock (_branch_locks[session.branch]) which
allows a concurrent discard() to delete the branch; after acquiring the branch
lock in save() and beacon_save() re-fetch or refresh the session (via
_get_session(project_id, session_id) or reload the session object) and re-check
that session.status == SuggestionSessionStatus.ACTIVE.value and raise the same
HTTPException if not; additionally wrap discard() with the same per-branch lock
(async with _branch_locks[session.branch]) so branch deletion serializes with
saves/beacon_saves to prevent races.
- Around line 93-100: The code currently normalizes and strips leading
separators from project.source_file_path which lets absolute paths like
"/ontologies/core.ttl" become "ontologies/core.ttl" and bypass validation;
update the logic around project.source_file_path (the block referencing
project.source_file_path) to first call os.path.isabs(project.source_file_path)
and if true raise the same HTTPException (status.HTTP_400_BAD_REQUEST,
detail="Invalid ontology path") before doing os.path.normpath() and lstrip, then
keep the existing check for path.startswith("..") and return the cleaned path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95f0f74c-68bd-4697-a95b-2af5e268b559

📥 Commits

Reviewing files that changed from the base of the PR and between 0642da3 and b2d5d4d.

📒 Files selected for processing (20)
  • alembic/versions/l0g1h2i3j4k5_add_suggestion_sessions_table.py
  • ontokit/api/routes/embeddings.py
  • ontokit/api/routes/projects.py
  • ontokit/api/routes/quality.py
  • ontokit/api/routes/semantic_search.py
  • ontokit/api/utils/redis.py
  • ontokit/core/auth.py
  • ontokit/core/beacon_token.py
  • ontokit/main.py
  • ontokit/schemas/embeddings.py
  • ontokit/services/consistency_service.py
  • ontokit/services/embedding_service.py
  • ontokit/services/rdf_utils.py
  • ontokit/services/search.py
  • ontokit/services/suggestion_service.py
  • tests/unit/test_beacon_token.py
  • tests/unit/test_consistency_service.py
  • tests/unit/test_cross_reference.py
  • tests/unit/test_duplicate_detection.py
  • tests/unit/test_rdf_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ontokit/api/utils/redis.py

Comment on lines +25 to +36
def _extract_roles(payload: dict) -> list[str]:
"""Extract role names from the Zitadel roles claim.

The Zitadel roles claim format is:
{"urn:zitadel:iam:org:project:roles": {"role_name": {"org_id": "org_name"}, ...}}

This function extracts just the role names (top-level keys of the roles dict).
"""
roles_claim = payload.get(ZITADEL_ROLES_CLAIM)
if not roles_claim or not isinstance(roles_claim, dict):
return []
return list(roles_claim.keys())

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n ontokit/core/auth.py | head -100

Repository: CatholicOS/ontokit-api

Length of output: 3775


🏁 Script executed:

cat -n ontokit/core/auth.py | tail -50

Repository: CatholicOS/ontokit-api

Length of output: 2114


🏁 Script executed:

sed -n '100,150p' ontokit/core/auth.py

Repository: CatholicOS/ontokit-api

Length of output: 1971


🏁 Script executed:

# Check for mypy configuration and CI files
find . -name "pyproject.toml" -o -name "setup.cfg" -o -name ".mypy.ini" -o -name "mypy.ini" | head -5
ls -la .github/workflows/ 2>/dev/null || echo "No .github/workflows found"
find . -name "*.yml" -o -name "*.yaml" | grep -E "(workflow|ci|github)" | head -10

Repository: CatholicOS/ontokit-api

Length of output: 286


🏁 Script executed:

cat pyproject.toml

Repository: CatholicOS/ontokit-api

Length of output: 2717


🏁 Script executed:

# Search for all bare dict usages in the auth.py file
rg '\bdict\b' ontokit/core/auth.py -n

Repository: CatholicOS/ontokit-api

Length of output: 428


🏁 Script executed:

# Check the imports section more carefully
sed -n '1,20p' ontokit/core/auth.py

Repository: CatholicOS/ontokit-api

Length of output: 581


🏁 Script executed:

# Check the fetch_userinfo function
sed -n '215,240p' ontokit/core/auth.py

Repository: CatholicOS/ontokit-api

Length of output: 899


🏁 Script executed:

# Check if there are any CI test runs or documentation about type checking
find . -type f -name "*.yml" -o -name "*.yaml" | xargs grep -l "mypy" 2>/dev/null || echo "No mypy in CI found"
cat .github/workflows/release.yml

Repository: CatholicOS/ontokit-api

Length of output: 2994


🏁 Script executed:

# Get complete view of all functions with bare dict annotations
sed -n '215,242p' ontokit/core/auth.py

Repository: CatholicOS/ontokit-api

Length of output: 916


🏁 Script executed:

# Double-check all function signatures that return dict or accept dict
rg 'def.*->.*dict|def.*\(.*:\s*dict' ontokit/core/auth.py -n

Repository: CatholicOS/ontokit-api

Length of output: 243


🏁 Script executed:

# Verify the full extent of bare dict usage - check lines 70-90 for _jwks_cache
sed -n '70,90p' ontokit/core/auth.py

Repository: CatholicOS/ontokit-api

Length of output: 821


🏁 Script executed:

# Verify mypy would actually fail on these bare dict annotations by checking the strict mode config
grep -A 5 "\[tool.mypy\]" pyproject.toml

Repository: CatholicOS/ontokit-api

Length of output: 166


Add type parameters to bare dict annotations in new auth helpers.

Lines 25, 71, and 76 use bare dict, which fails strict mypy checks enabled in CI. Introduce a JsonObject = dict[str, Any] type alias and apply it consistently:

  • Line 25: _extract_roles parameter should be Mapping[str, Any] (read-only interface)
  • Line 71: _jwks_cache should be JsonObject | None
  • Line 76: get_jwks return type should be JsonObject

Add Any to the imports from typing and define the alias near the top of the file.

Note: fetch_userinfo at line 215 has the same bare dict | None pattern and will also need the same fix for complete type coverage.

🧰 Tools
🪛 GitHub Actions: Distribution

[error] 25-25: Mypy: Missing type parameters for generic type "dict" [type-arg]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/core/auth.py` around lines 25 - 36, The current code uses bare dict
annotations causing mypy failures; add from typing import Any and define a
JsonObject = dict[str, Any] (or use Mapping[str, Any] for read-only) near the
top of the file, then change _extract_roles(payload: dict) -> list[str] to
_extract_roles(payload: Mapping[str, Any]) -> list[str], annotate _jwks_cache as
JsonObject | None, change get_jwks return type to JsonObject, and update
fetch_userinfo's dict | None to JsonObject | None so all helpers use the new
typed alias.

Comment on lines +42 to +43
logger.info("Starting OntoKit API v%s (env=%s)", __version__, settings.app_env)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "main.py" | head -20

Repository: CatholicOS/ontokit-api

Length of output: 81


🏁 Script executed:

cat -n ontokit/main.py | head -60

Repository: CatholicOS/ontokit-api

Length of output: 2532


🏁 Script executed:

cat -n ontokit/main.py | sed -n '40,50p'

Repository: CatholicOS/ontokit-api

Length of output: 585


🏁 Script executed:

cat -n ontokit/main.py | sed -n '50,65p'

Repository: CatholicOS/ontokit-api

Length of output: 749


🏁 Script executed:

cat -n ontokit/main.py | sed -n '140,155p'

Repository: CatholicOS/ontokit-api

Length of output: 755


🏁 Script executed:

rg "__version__" -B 3 -A 3 ontokit/

Repository: CatholicOS/ontokit-api

Length of output: 1639


🏁 Script executed:

rg "aioredis" ontokit/main.py -B 2 -A 2

Repository: CatholicOS/ontokit-api

Length of output: 566


🏁 Script executed:

rg "_error_response" ontokit/main.py -B 5 -A 5

Repository: CatholicOS/ontokit-api

Length of output: 1709


🏁 Script executed:

find ontokit -name "version.py" -o -name "py.typed"

Repository: CatholicOS/ontokit-api

Length of output: 84


🏁 Script executed:

cat ontokit/__init__.py

Repository: CatholicOS/ontokit-api

Length of output: 202


🏁 Script executed:

cat ontokit/version.py 2>/dev/null || echo "File not found"

Repository: CatholicOS/ontokit-api

Length of output: 256


🏁 Script executed:

rg "VERSION_BASE|__version__" ontokit/

Repository: CatholicOS/ontokit-api

Length of output: 529


🏁 Script executed:

cat pyproject.toml | grep -A 20 "\[tool.mypy\]"

Repository: CatholicOS/ontokit-api

Length of output: 291


🏁 Script executed:

rg "dict\[" ontokit/main.py

Repository: CatholicOS/ontokit-api

Length of output: 145


🏁 Script executed:

ls -la ontokit/py.typed 2>/dev/null || echo "No py.typed marker found"

Repository: CatholicOS/ontokit-api

Length of output: 90


🏁 Script executed:

cat -n ontokit/main.py | sed -n '1,30p'

Repository: CatholicOS/ontokit-api

Length of output: 1282


🏁 Script executed:

rg "from typing import|from __future__ import" ontokit/main.py

Repository: CatholicOS/ontokit-api

Length of output: 48


🏁 Script executed:

rg "redis.asyncio" ontokit/ -B 2 -A 2

Repository: CatholicOS/ontokit-api

Length of output: 310


🏁 Script executed:

python3 << 'EOF'
# Check the structure of the imports and potential type issues
import sys

# Analyze the issues:
issues = {
    "line_42_version": {
        "code": "logger.info(\"Starting OntoKit API v%s (env=%s)\", __version__, settings.app_env)",
        "import": "from ontokit import __version__",
        "source": "ontokit/__init__.py: from ontokit.version import VERSION_BASE as __version__",
        "analysis": "__version__ is properly imported from version module, no obvious attr-defined issue"
    },
    "lines_55_58_aioredis": {
        "code": "redis_pool = aioredis.from_url(str(settings.redis_url), decode_responses=True)",
        "import": "import redis.asyncio as aioredis",
        "annotation": "redis_pool: aioredis.Redis | None = None",
        "issue": "from_url() may return untyped Redis object or type mismatch with decode_responses"
    },
    "line_146_dict": {
        "code": "body: dict = {\"error\": {...}}",
        "issue": "Bare dict without type arguments violates strict mypy - should be dict[str, Any] or dict[str, object]"
    }
}

for issue, details in issues.items():
    print(f"\n{issue}:")
    for key, val in details.items():
        print(f"  {key}: {val}")
EOF

Repository: CatholicOS/ontokit-api

Length of output: 860


🏁 Script executed:

head -30 ontokit/main.py

Repository: CatholicOS/ontokit-api

Length of output: 1072


🏁 Script executed:

cat ontokit/version.py

Repository: CatholicOS/ontokit-api

Length of output: 256


🏁 Script executed:

rg "from typing" ontokit/version.py

Repository: CatholicOS/ontokit-api

Length of output: 48


🏁 Script executed:

find . -name ".github" -type d | head -5

Repository: CatholicOS/ontokit-api

Length of output: 75


🏁 Script executed:

fd "\.yml$|\.yaml$" .github/workflows/ 2>/dev/null | head -10

Repository: CatholicOS/ontokit-api

Length of output: 95


🏁 Script executed:

cat .github/workflows/*.yml 2>/dev/null | grep -A 10 -B 2 "mypy"

Repository: CatholicOS/ontokit-api

Length of output: 399


🏁 Script executed:

find .github -name "*.yml" -o -name "*.yaml" | xargs cat

Repository: CatholicOS/ontokit-api

Length of output: 2936


🏁 Script executed:

cat -n ontokit/version.py

Repository: CatholicOS/ontokit-api

Length of output: 298


🏁 Script executed:

rg "mypy|type" pyproject.toml -B 2 -A 5

Repository: CatholicOS/ontokit-api

Length of output: 357


🏁 Script executed:

cd ontokit && python3 -c "from version import VERSION_BASE; print(type(VERSION_BASE).__name__, repr(VERSION_BASE))"

Repository: CatholicOS/ontokit-api

Length of output: 77


🏁 Script executed:

python3 << 'EOF'
import sys
try:
    import redis.asyncio as aioredis
    # Check the return type of from_url
    import inspect
    sig = inspect.signature(aioredis.from_url)
    print(f"from_url signature: {sig}")
    print(f"from_url return annotation: {sig.return_annotation}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: CatholicOS/ontokit-api

Length of output: 96


🏁 Script executed:

cat -n ontokit/main.py | sed -n '140,160p'

Repository: CatholicOS/ontokit-api

Length of output: 1002


🏁 Script executed:

grep -n "from typing import\|from __future__ import" ontokit/main.py

Repository: CatholicOS/ontokit-api

Length of output: 48


🏁 Script executed:

find . -name "*.log" -o -name "*mypy*" 2>/dev/null | head -20

Repository: CatholicOS/ontokit-api

Length of output: 48


🏁 Script executed:

git log --oneline -20 2>/dev/null | head -20

Repository: CatholicOS/ontokit-api

Length of output: 147


🏁 Script executed:

cat ontokit/__init__.py

Repository: CatholicOS/ontokit-api

Length of output: 202


🏁 Script executed:

rg "Any|Dict" ontokit/main.py

Repository: CatholicOS/ontokit-api

Length of output: 48


🏁 Script executed:

python3 << 'EOF'
# Check what type annotations are needed
import sys

# Issue 1: body: dict without type arguments
issue_1 = """
def _error_response(...) -> JSONResponse:
    body: dict = {"error": {"code": code, "message": message}}
    # In strict mypy mode, this needs: dict[str, dict[str, ...] | ...]
"""

# Issue 2: Check if redis.asyncio has proper typing
try:
    from redis.asyncio import from_url
    print("redis.asyncio.from_url is available")
except ImportError as e:
    print(f"redis.asyncio import error: {e}")

# Issue 3: __version__ from version module
# VERSION_BASE is untyped, so it may be seen as Any
print("\nIssue Analysis:")
print("1. body: dict - STRICT MYPY VIOLATION (needs type args)")
print("2. __version__ - Depends on whether VERSION_BASE has type hints (currently NO)")
print("3. aioredis.from_url() - Should be typed, but may have return type issues")
EOF

Repository: CatholicOS/ontokit-api

Length of output: 343


Fix mypy strict mode violations before release.

Three type annotation issues are blocking the distribution pipeline:

  1. Line 146: `body: dict` lacks type arguments. Use `body: dict[str, dict[str, str] | object]`.

  2. Line 42: `__version__` requires a type annotation on `VERSION_BASE` in ontokit/version.py. Add `VERSION_BASE: str = VERSION.replace(...)`.

  3. Lines 55–58: the `aioredis.from_url()` return type must match the `aioredis.Redis | None` annotation on line 31. Verify the return type of `from_url()` with `decode_responses=True` is compatible with the declared type.

These must be resolved before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/main.py` around lines 42 - 43, Annotate the loose types: change the
function parameter `body` to the precise mapping type `dict[str, dict[str, str]
| object]`; in ontokit/version.py add an explicit type to `VERSION_BASE` (e.g.
`VERSION_BASE: str = VERSION.replace(...)`) so `__version__` is clearly a str;
and make the Redis typing consistent by matching the declared `redis_client`
type to whatever `aioredis.from_url(..., decode_responses=True)` actually
returns (either change the `redis_client` annotation to the concrete return type
from `aioredis.from_url` or use `typing.cast(...)` when assigning the result) so
mypy strict mode no longer reports a mismatch.

from ontokit.schemas.search import SearchQuery, SearchResponse, SPARQLQuery, SPARQLResponse
from rdflib import BNode, Graph, Literal, URIRef
from rdflib.plugins.sparql.parser import parseQuery
from rdflib.plugins.sparql.processor import SPARQLResult

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Repository dependency pins =="
fd 'pyproject.toml|poetry.lock|uv.lock|pdm.lock|requirements*.txt|setup.py' -x sh -c '
  echo "--- {} ---"
  rg -n "rdflib" "{}" || true
'

echo
echo "== Sandbox RDFLib signature (if installed) =="
python - <<'PY'
try:
    import inspect
    from typing import get_type_hints
    from rdflib import Graph

    print("Graph.query signature:", inspect.signature(Graph.query))
    try:
        print("Graph.query type hints:", get_type_hints(Graph.query))
    except Exception as exc:
        print("Could not resolve type hints:", exc)
except Exception as exc:
    print("rdflib not importable in sandbox:", exc)
PY

Repository: CatholicOS/ontokit-api

Length of output: 253


🏁 Script executed:

cat -n ontokit/services/search.py | head -30

Repository: CatholicOS/ontokit-api

Length of output: 1167


🏁 Script executed:

cat -n ontokit/services/search.py | sed -n '220,235p'

Repository: CatholicOS/ontokit-api

Length of output: 841


🏁 Script executed:

rg -n "from rdflib" ontokit/services/search.py

Repository: CatholicOS/ontokit-api

Length of output: 229


🏁 Script executed:

fd -e '.mypy.ini' -o -name 'pyproject.toml' -o -name 'setup.cfg' | xargs rg -l "mypy|type" 2>/dev/null | head -5

Repository: CatholicOS/ontokit-api

Length of output: 420


🏁 Script executed:

rg -n "mypy" pyproject.toml

Repository: CatholicOS/ontokit-api

Length of output: 104


🏁 Script executed:

web_search
RDFLib Graph.query() return type rdflib.query.Result vs SPARQLResult version 7.1

Repository: CatholicOS/ontokit-api

Length of output: 282


🌐 Web query:

RDFLib 7.1.0 Graph.query return type rdflib.query.Result

💡 Result:

In RDFLib 7.1.0, rdflib.graph.Graph.query(...) is annotated to return rdflib.query.Result (i.e., query.Result). [1]

That Result object behaves differently depending on SPARQL query type: SELECT → iterates ResultRows, ASK → yields a single bool / bool(result), CONSTRUCT/DESCRIBE → iterates triples (and may also expose a .graph). [2][3]


🌐 Web query:

RDFLib SPARQLResult vs Result typing

💡 Result:

In RDFLib, Result is the public, generic query-result type, while SPARQLResult is the SPARQL plugin’s concrete implementation of that type.

  • rdflib.query.Result: common container for all SPARQL query forms (SELECT, ASK, CONSTRUCT, DESCRIBE). It “shape-shifts” at runtime:

    • SELECT: iterates ResultRow rows
    • ASK: bool(result) (and iteration) yields a single boolean
    • CONSTRUCT/DESCRIBE: iterates triples
      It also exposes .type, .vars, .bindings, .graph, .askAnswer. [1]
  • rdflib.plugins.sparql.processor.SPARQLResult: subclass of Result created by the SPARQL processor. Its constructor simply takes the processor’s internal mapping (res) and populates the base Result fields (notably calling Result.__init__(res["type_"]) and assigning .vars, .bindings, .askAnswer, .graph). [2]

Typing guidance (practical):

  • For user code, annotate results as rdflib.query.Result (that’s the stable API surface). You can branch on result.type ("SELECT", "ASK", …) to narrow behavior. [1]
  • Only type as SPARQLResult if you explicitly depend on the SPARQL plugin implementation (usually unnecessary, since the fields you typically use are already on Result). [1][2]
  • For value types inside bindings/rows, RDFLib defines type aliases like QueryResultValueType = Union[IdentifiedNode, Literal] (useful for static typing when you process term values). [3]
  • Parsing/serializing SPARQL result formats (XML/JSON/CSV/TSV/TXT) is standardized around the Result type via the query-results plugins. [4]

Sources: [1] (rdflib.readthedocs.io), [2] (rdflib.readthedocs.io), [3] (rdflib.readthedocs.io), [4] (rdflib.readthedocs.io)

Use RDFLib's public query result type instead of the SPARQL plugin's internal implementation.

In RDFLib 7.1.0+, Graph.query() is typed to return rdflib.query.Result, not rdflib.plugins.sparql.processor.SPARQLResult. Although SPARQLResult is a subclass of Result, the stable public API for user code is Result. This mismatch will trigger mypy errors at line 227.

Change the import from rdflib.plugins.sparql.processor.SPARQLResult to rdflib.query.Result and update the type annotation accordingly.

Suggested fix
-from rdflib.plugins.sparql.processor import SPARQLResult
+from rdflib.query import Result
@@
-            result: SPARQLResult = graph.query(query_text)
+            result: Result = graph.query(query_text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ontokit/services/search.py` at line 9, replace the internal SPARQLResult
import with RDFLib's public Result type: change the import from
rdflib.plugins.sparql.processor.SPARQLResult to rdflib.query.Result, and update
any type annotations that reference SPARQLResult (e.g., the variable or return
type used around the Graph.query() handling at the flagged site) to use Result,
so the annotation matches Graph.query()'s public API.

JohnRDOrazio and others added 6 commits March 9, 2026 03:46
Add explicit generic type parameters, cast() at DB-schema boundaries
for Literal types, missing function annotations, pygit2/GitPython stub
workarounds, and pydantic mypy plugin. Add types-python-jose to dev deps.
Update pre-commit mypy hook with missing additional_dependencies and
remove types-redis (conflicts with redis 5.x inline stubs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the ARQ enqueue_job() call in try/except so that if the pool
or enqueue fails, the pending EmbeddingJob row is deleted rather
than leaving a stuck active job that blocks future requests with 409.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_load_graph() now returns (graph, resolved_branch) so callers use the
actual branch for Redis cache keys and ConsistencyCheckResult metadata
instead of the raw (possibly None) query parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the project's actual default branch instead of hardcoding "main"
when the branch query parameter is omitted, matching the behavior
already used in rank_suggestions().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard against None branch in rank_suggestions() by resolving to the
project's default branch, preventing IS NULL filters that would
silently return no results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move db.refresh() outside the try/except that triggers branch deletion
so that a refresh failure after a successful commit does not incorrectly
delete the git branch, leaving an active session pointing at nothing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JohnRDOrazio
Member

CodeRabbit fixes — batch 4

mypy strict-mode compliance

  1. Fix all 178 mypy strict-mode errors across 39 files (464ba3b) — Add explicit generic type parameters (dict[str, Any] instead of bare dict), cast() at DB→schema boundaries for Literal types, missing function return/param annotations, pygit2/GitPython stub workarounds (cast() and targeted type: ignore), pydantic mypy plugin config. Add types-python-jose to dev deps. Update pre-commit mypy hook with missing additional_dependencies and remove types-redis (conflicts with redis 5.x inline stubs).
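The "cast() at DB→schema boundaries for Literal types" pattern can be sketched like this; the role names other than "suggester" are assumptions for illustration:

```python
from typing import Literal, cast

Role = Literal["owner", "admin", "member", "suggester"]

def role_from_db(value: str) -> Role:
    # The DB column is a plain str; validate at runtime, then cast() so
    # mypy --strict accepts narrowing to the Literal type at the boundary.
    if value not in ("owner", "admin", "member", "suggester"):
        raise ValueError(f"unknown role: {value}")
    return cast(Role, value)

print(role_from_db("suggester"))  # suggester
```

The runtime check keeps `cast()` honest: it never lets an unexpected DB value masquerade as a valid `Role`.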

Architectural fixes from CodeRabbit review (batch 5)

  1. Clean up pending embedding job row when ARQ enqueueing fails (embeddings.py) — Wrap pool.enqueue_job() in try/except; on failure, delete the pending EmbeddingJob row so it doesn't block future requests with 409.

  2. Resolve branch before cache keys and result metadata in quality routes (quality.py) — _load_graph() now returns (graph, resolved_branch) so callers use the actual branch for Redis cache keys and ConsistencyCheckResult metadata instead of the raw (possibly None) query parameter.

  3. Resolve project default branch in semantic search routes (semantic_search.py) — semantic_search() and find_similar_entities() now resolve the project's actual default branch instead of hardcoding "main" when the branch query parameter is omitted, matching the behavior already used in rank_suggestions().

  4. Resolve nullable branch in embedding service rank_suggestions (embedding_service.py) — Defensively resolve body.branch to the project's default branch when None, preventing IS NULL filters that would silently return no results.

  5. Scope branch cleanup to commit failures in suggestion service (suggestion_service.py) — Move db.refresh() outside the try/except that triggers branch deletion, so a refresh failure after a successful commit does not incorrectly delete the git branch (leaving an active session pointing at nothing).
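Fix 1's cleanup-on-failed-enqueue pattern can be sketched with toy stand-ins for the DB session and ARQ pool (all names here are hypothetical, chosen only to show the control flow):

```python
import asyncio

class FakeDB:
    """Stand-in for an async SQLAlchemy session."""
    def __init__(self) -> None:
        self.rows: set[str] = set()
    def add(self, row: str) -> None:
        self.rows.add(row)
    async def commit(self) -> None:
        pass
    async def delete(self, row: str) -> None:
        self.rows.discard(row)

class FailingPool:
    """Stand-in for an ARQ pool whose enqueue always fails."""
    async def enqueue_job(self, name: str, *args: str) -> None:
        raise ConnectionError("queue unavailable")

async def enqueue_embedding_job(db: FakeDB, pool: FailingPool, job_id: str) -> None:
    db.add(job_id)          # pending EmbeddingJob row
    await db.commit()
    try:
        await pool.enqueue_job("generate_embeddings", job_id)
    except Exception:
        # Remove the stuck pending row so later requests don't hit 409.
        await db.delete(job_id)
        await db.commit()
        raise

db = FakeDB()
try:
    asyncio.run(enqueue_embedding_job(db, FailingPool(), "job-1"))
except ConnectionError:
    pass
print(sorted(db.rows))  # []
```

The key point is that the compensating delete runs inside the `except` block and the original error is re-raised, so callers still see the failure.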

JohnRDOrazio and others added 11 commits March 9, 2026 03:58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use WEB_PORT env var (default 3000) instead of hardcoded port in
OIDC redirect URIs, so the setup script works with non-standard ports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add descriptions for all 16 endpoint groups and a rich markdown API
description covering key capabilities, authentication, and rate
limiting for the Scalar-powered API Reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a per-user notification model, service, and API endpoints
(list, mark-read, mark-all-read) for project events. Integrates
notification dispatch into suggestion, pull request, and join request
services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
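The list / mark-read / mark-all-read surface described in this commit can be sketched as an in-memory toy (class and method names are assumptions; the real service is backed by the database and FastAPI routes):

```python
from dataclasses import dataclass

@dataclass
class Notification:
    id: int
    user_id: str
    message: str
    read: bool = False

class NotificationService:
    """Toy in-memory sketch of the per-user notification operations."""
    def __init__(self) -> None:
        self._items: list[Notification] = []

    def dispatch(self, user_id: str, message: str) -> None:
        self._items.append(Notification(len(self._items) + 1, user_id, message))

    def list_for(self, user_id: str) -> list[Notification]:
        return [n for n in self._items if n.user_id == user_id]

    def mark_read(self, user_id: str, notif_id: int) -> None:
        for n in self.list_for(user_id):
            if n.id == notif_id:
                n.read = True

    def mark_all_read(self, user_id: str) -> None:
        for n in self.list_for(user_id):
            n.read = True

svc = NotificationService()
svc.dispatch("alice", "Your suggestion session was merged")
svc.dispatch("alice", "New pull request on your project")
svc.mark_read("alice", 1)
print([n.read for n in svc.list_for("alice")])  # [True, False]
```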
… and worker task

Implements the full backend for tracking upstream GitHub repository files,
matching the existing frontend API client. Includes config CRUD, background
check job via ARQ, job status polling, and sync event history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…anges, resubmit

Implements the missing review/admin API for suggestion sessions that the
frontend already expects. Adds new session statuses (merged, rejected,
changes-requested), review fields on the model, and five new routes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
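The new session statuses mentioned above suggest an enum along these lines; "merged", "rejected", and "changes-requested" come from the commit message, while the pre-existing status names here are assumptions:

```python
from enum import Enum

class SessionStatus(str, Enum):
    DRAFT = "draft"              # assumed pre-existing status
    SUBMITTED = "submitted"      # assumed pre-existing status
    MERGED = "merged"
    REJECTED = "rejected"
    CHANGES_REQUESTED = "changes-requested"

print(SessionStatus("changes-requested").name)  # CHANGES_REQUESTED
```

Subclassing `str` keeps the values JSON-serializable and comparable to the raw strings stored in the DB column.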
Eliminates noisy browser console errors on every page load for projects
without upstream sync configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… (PEP 735)

Dev tools are development-only and should not be exposed as installable
extras in the published package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add api_base_url config setting for webhook callback URLs
- Add github_hook_id column to track webhook state on GitHub
- Add scope helpers and webhook management methods to GitHub service
- Add POST webhook-setup endpoint for auto-detecting/creating webhooks
- Guard against localhost URLs when auto-creating webhooks
- Add webhook frequency option to upstream sync types
- Rearrange GitHub integration info boxes in project settings
- Enhance WebhookConfigPanel with auto-setup status and fallback to manual
- Add admin:repo_hook scope guidance in user settings page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
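The "guard against localhost URLs when auto-creating webhooks" item can be sketched as a small check on the callback host (function name and host list are illustrative assumptions):

```python
from urllib.parse import urlsplit

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}

def is_public_callback(url: str) -> bool:
    # GitHub cannot deliver webhooks to a loopback address, so refuse
    # auto-setup when the configured api_base_url points at a local host.
    host = (urlsplit(url).hostname or "").lower()
    return host not in LOCAL_HOSTS

print(is_public_callback("http://localhost:8000/webhooks/github"))    # False
print(is_public_callback("https://api.example.com/webhooks/github"))  # True
```

A guard like this lets the endpoint fall back to manual webhook configuration instead of registering a hook GitHub can never reach.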
When enabling webhooks on a GitHub integration, automatically create
an UpstreamSyncConfig with frequency="webhook" using the integration's
repo data. When disabling webhooks, downgrade to frequency="manual".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JohnRDOrazio
Member

@damienriehl I added CodeRabbit, which I have been using on a few other projects; it does a wonderful job of reviewing and nit-picking. I then feed CodeRabbit's review issues back into Claude Code.

I find it best to address each issue with a commit but specifically instruct Claude not to push until all CodeRabbit issues have been addressed: CodeRabbit triggers a new review every time you push, but it also implements rate-limiting, so with many pushes it'll just quit. So it's best to make all the commits first, then push them all at once when they're ready.

I either copy the "Prompt for AI Agents" section for each issue, one at a time; or, at the risk of perhaps consuming a few more tokens but speeding up the process, I tell Claude to directly read all the issues raised by CodeRabbit's review on the PR (I have the GitHub CLI installed, which Claude uses to read the PR), again instructing it specifically to address each issue with one commit per fix but not push until all have been addressed.

So now, linting and tests are passing!

@JohnRDOrazio
Member

I added a webhooks UI to the GitHub integration, which should be a valid alternative to "Upstream Source Tracking". The idea is that as soon as webhooks are enabled and installed, "Upstream Source Tracking" will be readonly; if webhooks are not enabled, "Upstream Source Tracking" can be set but will default to the same GitHub repo when GitHub integration is active.

