Release v0.2.0 — API Infrastructure & Suggestion Sessions (#1)
JohnRDOrazio merged 118 commits into main from
…leware, exceptions, input validation, and comprehensive test suite (85 tests)
- Implement startup/shutdown lifespan: DB verification, Redis pool, MinIO bucket init
- Extract Zitadel project roles from JWT claims with 1-hour JWKS cache TTL
- Implement full-text search via PostgreSQL tsvector/tsquery with pagination
- Implement SPARQL execution via RDFLib (SELECT/ASK/CONSTRUCT)
- Add slowapi rate limiting (100 req/min default)
- Add request ID, access logging, and security headers middleware
- Add custom exception types (NotFoundError, ValidationError, ConflictError, ForbiddenError) with global exception handlers returning consistent error schema
- Add IRI format validation and string length limits on schema fields
- Add 85 passing tests covering auth, config, search, linter, ontology service, routes, and integration workflows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the search query is "*", skip the substring match filter and return all entities of the requested type(s). This enables the frontend Properties and Individuals tabs to list all entities without a dedicated listing endpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
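A minimal sketch of that wildcard behavior (names here are illustrative, not the actual service code):

```python
# Sketch: "*" skips the substring filter and returns all entities of the
# requested type(s); any other query narrows by case-insensitive substring.
def filter_entities(entities, query, types):
    results = [e for e in entities if e["type"] in types]
    if query != "*":  # "*" means "list everything of these types"
        q = query.lower()
        results = [e for e in results if q in e["label"].lower()]
    return results

entities = [
    {"label": "hasPart", "type": "property"},
    {"label": "Person", "type": "class"},
]
print(filter_entities(entities, "*", {"property"}))  # all properties
print(filter_entities(entities, "son", {"class"}))   # substring match
```

This lets the Properties and Individuals tabs reuse the search endpoint as a listing endpoint.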
…h, and auto-submit Implements Phase C of the suggestion workflow — the FastAPI backend that powers the frontend suggestion UI. Suggesters get a dedicated branch, save edits as commits, and submit as PRs. Stale sessions auto-submit via a cron job every 10 minutes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pydantic v2 HttpUrl is not a plain string, so calling .lower() on cls.iri fails with AttributeError. Wrap in str() for both get_root_classes and get_class_children sort_key functions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
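The str() wrap can be sketched as follows (OntologyClass and sort_key are illustrative stand-ins for the real Pydantic models; the real iri field is an HttpUrl, which has no .lower() method):

```python
# Hypothetical stand-in for the real schema model; in the actual code the
# iri attribute is a Pydantic v2 HttpUrl object, not a str.
from dataclasses import dataclass

@dataclass
class OntologyClass:
    iri: object  # HttpUrl in the real code

def sort_key(cls: OntologyClass) -> str:
    # str() makes the key robust whether iri is a str or an HttpUrl
    return str(cls.iri).lower()

classes = [OntologyClass("http://example.org/B"), OntologyClass("http://example.org/a")]
print([str(c.iri) for c in sorted(classes, key=sort_key)])
```

The same wrap applies to both get_root_classes and get_class_children.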
Join request approval now assigns suggester role instead of editor. MemberBase schema default changed from viewer to suggester. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Release v0.2.0 — lifespan handlers, JWT roles, search service, rate limiting, security middleware, suggestion sessions, beacon auth, auto-submit worker, HttpUrl fix, default suggester role, and 85 tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase A (Quality): Cross-reference service, 12-rule consistency checker, duplicate detection via label similarity. 4 new endpoints with Redis caching.
Phase B (Analytics): EntityChangeEvent model + migration, graph-diff change tracking hooked into save_source_content, entity history, project activity, hot entities, and contributor stats. 4 new endpoints.
Phase C (Vector): pgvector-backed embedding tables + migration, pluggable embedding providers (local/OpenAI/Voyage), semantic search, similar entities, rank suggestions, auto-embed on save. 8 new endpoints, 2 ARQ worker tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ZITADEL_EXTERNALPORT: 8080 so issuer matches backend config
- Add port 53000 to CORS_ORIGINS for frontend dev server
- Use pgvector/pgvector:pg17 image instead of postgres:17-alpine
- Create pgvector extension in init-db.sh as superuser

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CodeRabbit review
✅ Actions performed: Review triggered.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
Adds analytics, quality, embeddings, semantic search, and suggestion-session features including DB migrations and ORM models, embedding providers/services, new API routes and middleware, background worker tasks, auth + beacon token changes, extensive tests, and bumps version to 0.3.0.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant API as FastAPI (suggestions router)
participant Service as SuggestionService
participant DB as Database
participant Git as Git backend
participant PR as PullRequest service
Client->>API: POST /projects/{id}/suggestions/sessions
API->>Service: create_session(project_id, user)
Service->>DB: INSERT suggestion_session
Service->>Git: create branch for session
DB-->>Service: created session
Service-->>API: 201 Created (session_id, branch, beacon_token)
Client->>API: PUT /projects/{id}/suggestions/sessions/{sid}/save
API->>Service: save(session_id, content, user)
Service->>Git: commit to branch
Service->>DB: UPDATE changes_count, entities_modified, last_activity
Service-->>API: 200 OK (commit_hash, branch, changes_count)
Client->>API: POST /projects/{id}/suggestions/sessions/{sid}/submit
API->>Service: submit(session_id, summary, user)
Service->>DB: verify session & changes
Service->>Git: prepare PR branch/metadata
Service->>PR: create pull request
Service->>DB: UPDATE session status and pr reference
Service-->>API: 200/201 (pr_number, pr_url, status)
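The lifecycle in the diagram can be sketched as a small state machine (a hedged sketch: only ACTIVE and AUTO_SUBMITTED appear in this PR's description; the SUBMITTED status name and method shapes are assumptions):

```python
from enum import Enum

class SessionStatus(Enum):
    ACTIVE = "active"
    SUBMITTED = "submitted"            # manual submit -> PR created
    AUTO_SUBMITTED = "auto_submitted"  # cron submit after 30+ min idle

class SuggestionSession:
    """Illustrative model of the session flow, not the real ORM class."""

    def __init__(self, session_id: str, branch: str):
        self.session_id = session_id
        self.branch = branch              # dedicated branch per suggester
        self.status = SessionStatus.ACTIVE
        self.changes_count = 0

    def save(self) -> None:
        # each save corresponds to one commit on the session branch
        assert self.status is SessionStatus.ACTIVE
        self.changes_count += 1

    def submit(self, auto: bool = False) -> None:
        # a PR is only opened when the session actually has changes
        assert self.changes_count > 0
        self.status = SessionStatus.AUTO_SUBMITTED if auto else SessionStatus.SUBMITTED

s = SuggestionSession("abc123", "suggest/abc123")
s.save()
s.submit()
print(s.status)
```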
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 3
Note: Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ontokit/services/join_request_service.py (1)
205-205: ⚠️ Potential issue | 🟡 Minor: Docstring inconsistency.
The implementation correctly assigns the suggester role, but the docstring at Line 205 still references "editor": "Approve a join request and add the user as an editor."
This should be updated to reflect the new role assignment.
📝 Proposed fix
 async def approve_request(
     self,
     project_id: UUID,
     request_id: UUID,
     action: JoinRequestAction,
     user: CurrentUser,
 ) -> JoinRequestResponse:
-    """Approve a join request and add the user as an editor."""
+    """Approve a join request and add the user as a suggester."""
     await self._check_admin_access(project_id, user)

Also applies to: 233-237
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/join_request_service.py` at line 205, Update the misleading docstrings that say "add the user as an editor" to accurately state that the user is added/assigned the "suggester" role: specifically edit the docstring on the approve_join_request method (and the other docstring block around the same function area) so they describe approving a join request and assigning the suggester role (mentioning the method name approve_join_request and any helper like assign_role or add_user_to_org if present) and ensure the wording matches the actual implementation that assigns "suggester".
ontokit/core/auth.py (1)
130-169: ⚠️ Potential issue | 🔴 Critical: Add audience (aud/azp) validation before extracting and applying token roles.
The validate_token() function disables JWT audience verification with options={"verify_aud": False} but does not perform manual audience validation afterwards. This means a token from the same Zitadel issuer but issued for a different client will still be accepted and its roles will be extracted and used for authorization decisions. Add validation to ensure the token's aud claim contains, or azp claim matches, settings.zitadel_client_id before accepting the token.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/core/auth.py` around lines 130 - 169, validate_token currently disables JWT audience verification and extracts roles without checking that the token was issued for our client; after jwt.decode and before calling _extract_roles and constructing TokenPayload, validate that the payload's aud contains settings.zitadel_client_id (or, if aud is a string, equals it) or that payload.get("azp") == settings.zitadel_client_id, and if not raise HTTPException(status_code=401, detail="Invalid audience"); keep the existing issuer/alg checks and only perform this manual audience/azp check prior to using roles from _extract_roles.
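The manual check this prompt asks for can be sketched as follows (check_audience is a hypothetical helper; payload is the already-decoded claims dict and client_id stands in for settings.zitadel_client_id):

```python
# Hedged sketch of an aud/azp check after jwt.decode(..., verify_aud=False).
def check_audience(payload: dict, client_id: str) -> bool:
    aud = payload.get("aud", [])
    if isinstance(aud, str):  # RFC 7519 allows aud as a single string
        aud = [aud]
    # Accept when we are in the audience list, or the authorized party matches
    return client_id in aud or payload.get("azp") == client_id

print(check_audience({"aud": "other-client", "azp": "my-client"}, "my-client"))
```

In the route this would raise HTTPException(401, "Invalid audience") when the check fails, before any roles are extracted.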
🟠 Major comments (22)
ontokit/core/auth.py-229-239 (1)
229-239: ⚠️ Potential issue | 🟠 Major: Widen the userinfo fallback to cover missing roles too.
Right now both helpers only call fetch_userinfo() when name or email is missing, so a token that arrives with roles=[] fails every require_roles(...) check even if userinfo could supply the roles.
Minimal fix
-    if not name or not email:
+    if not name or not email or not roles:
         userinfo = await fetch_userinfo(credentials.credentials)
         if userinfo:
             name = name or userinfo.get("name") or userinfo.get("preferred_username")
             email = email or userinfo.get("email")
             username = username or userinfo.get("preferred_username")
-            # Also extract roles from userinfo if not already present in token
-            if not roles:
-                roles = _extract_roles(userinfo)
+            roles = roles or _extract_roles(userinfo)

Apply the same change in both helper paths.
Also applies to: 289-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/core/auth.py` around lines 229 - 239, The token parsing currently only calls fetch_userinfo() when name or email is missing, so tokens with roles absent in token_payload.roles remain [] and fail role checks; update both helper paths (the block referencing token_payload.roles and the analogous block around lines 289-299) to also call fetch_userinfo() when roles is empty/falsey, then if userinfo is present set roles = roles or _extract_roles(userinfo); ensure you reference token_payload.roles, fetch_userinfo, and _extract_roles in the fix so roles are populated from userinfo when missing from the token.
ontokit/services/search.py-191-206 (1)
191-206: ⚠️ Potential issue | 🟠 Major: Parse SPARQL queries properly to determine query type instead of guessing from the first token.
When queries include PREFIX, BASE, or comment preambles, the string-based detection fails: queries with these preambles that should be classified as ASK, CONSTRUCT, or DESCRIBE incorrectly default to SELECT. This causes two problems:
- Valid queries with preambles return the wrong response shape (e.g., ASK returns SELECT structure).
- When parsing fails, the exception handler (lines 208–218) returns an error response with the wrong query_type, masking the underlying issue instead of exposing a clear error to clients.

Use RDFLib's parseQuery() to properly parse SPARQL and extract the correct query type, which handles all valid SPARQL preambles.
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/search.py` around lines 191 - 206, Replace the fragile string-based detection (query_text/query_upper/query_type) with RDFLib's parser: call rdflib.plugins.sparql.parser.parseQuery(query.query) up front, extract the actual SPARQL operation type (map the parsed result to "SELECT"/"ASK"/"CONSTRUCT"/"DESCRIBE"), and set query_type from that parsed value (instead of guessing from the first token); if parseQuery raises, do not silently default to "SELECT"—let the parse exception propagate (or set query_type to a clear error state like "PARSE_ERROR") so the existing error handler can return an accurate error response rather than masking the parse failure. Ensure you update the code paths that reference query_type (the variables query_text, query_upper, and query_type) to use the parsed type.
ontokit/services/duplicate_detection_service.py-85-91 (1)
85-91: ⚠️ Potential issue | 🟠 Major: Potential logic issue in cluster membership check.
The condition if root in parent or iri in parent may incorrectly exclude entities that were merged into clusters. An entity that was the root of a union won't be in parent as a key if it was never the child in a union operation. This could result in some cluster members being missed.
🐛 Proposed fix for cluster membership logic
 # Build clusters
 clusters_map: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
 entity_lookup = {iri: (label, etype) for iri, label, etype in entities}
+# Track which entities are part of any cluster (appeared in at least one similar pair)
+clustered_entities = set()
+for (iri_a, iri_b) in pair_sim.keys():
+    clustered_entities.add(iri_a)
+    clustered_entities.add(iri_b)
+
 for iri, (label, etype) in entity_lookup.items():
-    root = find(iri)
-    if root in parent or iri in parent:
+    if iri in clustered_entities:
+        root = find(iri)
         clusters_map[find(iri)].append((iri, label, etype))
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/duplicate_detection_service.py` around lines 85 - 91, The current membership check using "if root in parent or iri in parent" can skip entities whose representative root isn't a key in parent; instead, compute root = find(iri) and always add the tuple to clusters_map[root] (i.e., remove the conditional filter) so every entity is grouped under its representative; you can then filter out singleton clusters later if needed—update the block that builds clusters_map (referencing clusters_map, entity_lookup, find, parent, iri, root) to always append to clusters_map[root].
ontokit/services/consistency_service.py-285-300 (1)
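The group-under-representative behavior the prompt describes can be sketched with a self-contained union-find (illustrative names, not the service's code):

```python
# Sketch: every entity that appears in a similar pair ends up in exactly one
# cluster, keyed by its union-find representative.
from collections import defaultdict

def cluster(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union

    clusters = defaultdict(set)
    for node in parent:  # only nodes seen in at least one pair
        clusters[find(node)].add(node)
    return [sorted(c) for c in clusters.values()]

print(cluster([("A", "B"), ("B", "C"), ("X", "Y")]))
```

Because membership is derived from find() rather than key-presence checks, union roots that were never children are not dropped.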
285-300: ⚠️ Potential issue | 🟠 Major: Don't treat external vocabulary terms as dangling references.
Valid references like rdfs:range xsd:string or rdfs:subClassOf skos:Concept will be flagged unless those IRIs also appear as local subjects. This rule should only report unresolved in-ontology references, not standard or imported vocabulary terms.
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/consistency_service.py` around lines 285 - 300, The check in consistency_service.py flags external vocabulary IRIs as dangling because `known` is only _all_declared_uris(graph) plus local subjects; update the logic so external/imported vocabularies are treated as resolved: extend `known` to include terms from the graph's namespace manager and common vocabularies before the predicate loop (use graph.namespace_manager.namespaces() to collect namespace URI prefixes and add any URIRefs whose namespace matches those prefixes, and also include well-known constants like RDFS, RDF, XSD, SKOS, OWL), or alternatively skip reporting when the object's namespace is one of the graph namespaces or a standard vocab namespace; modify the code around `_all_declared_uris`, `all_subjects`, `known`, and the predicate loop to apply this namespace-based filter so only true in-ontology unresolved references are reported.
ontokit/services/consistency_service.py-75-98 (1)
75-98: ⚠️ Potential issue | 🟠 Major: Walk every subClassOf edge when detecting cycles.
This loop only follows parents[0]. In a multi-inheritance graph, a cycle reachable through any other parent is missed, so the checker returns false negatives.
💡 Suggested fix
 def _check_cycle_detect(graph: Graph) -> list[ConsistencyIssue]:
     """Detect cycles in rdfs:subClassOf chains."""
     issues = []
     reported: set[str] = set()
     for cls in graph.subjects(RDF.type, OWL.Class):
         if not isinstance(cls, URIRef):
             continue
-        visited: set[URIRef] = set()
-        current = cls
-        while True:
-            if current in visited:
+        stack: list[tuple[URIRef, set[URIRef]]] = [(cls, set())]
+        while stack:
+            current, path = stack.pop()
+            if current in path:
                 iri = str(cls)
                 if iri not in reported:
                     issues.append(ConsistencyIssue(
                         rule_id="cycle_detect",
                         severity="error",
@@
-                break
-            visited.add(current)
+                break
+
+            next_path = path | {current}
             parents = [p for p in graph.objects(current, RDFS.subClassOf) if isinstance(p, URIRef)]
-            if not parents:
-                break
-            current = parents[0]
+            stack.extend((parent, next_path) for parent in parents)
     return issues
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/consistency_service.py` around lines 75 - 98, The cycle check only follows the first parent (parents[0]) and misses cycles via other parents; update the traversal starting at each class (the loop over graph.subjects(RDF.type, OWL.Class)) to explore all RDFS.subClassOf edges (e.g., use a stack or queue to perform DFS/BFS over parents) instead of assigning current = parents[0]; keep using the per-start visited/path set (visited) to detect a back-edge and report via ConsistencyIssue/reported when a cycle is found, and ensure you only push/visit URIRef parents (from graph.objects(current, RDFS.subClassOf)) so all inheritance branches are checked.
ontokit/api/routes/embeddings.py-99-104 (1)
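The multi-parent traversal can be sketched without RDFLib, using a plain dict as the hierarchy (illustrative, not the checker's actual code):

```python
# Sketch: DFS over ALL subClassOf edges; a node reappearing on its own path
# means a cycle. The naive walk that only follows the first parent would
# miss the A -> C -> A cycle below.
def has_cycle_from(cls, parents_of):
    stack = [(cls, frozenset())]
    while stack:
        node, path = stack.pop()
        if node in path:
            return True
        for parent in parents_of.get(node, []):
            stack.append((parent, path | {node}))
    return False

hierarchy = {"A": ["B", "C"], "B": [], "C": ["A"]}  # cycle via the second parent
print(has_cycle_from("A", hierarchy))
```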
99-104: ⚠️ Potential issue | 🟠 Major: Don't hardcode "main" as the implicit branch.
Projects can use master or any custom default branch. Using "main" here makes generation/status hit the wrong revision unless the caller already knows to override it.
Also applies to: 120-125
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/api/routes/embeddings.py` around lines 99 - 104, The endpoints (e.g., generate_embeddings) currently hardcode branch=Query(default="main"); instead make branch optional (no hardcoded "main") and resolve the effective branch inside the handler using the project's configured default branch fetched via project_id (e.g., query Project.default_branch or project settings) so requests without branch use the project's actual default; apply the same change to the other similar endpoint(s) referenced around lines 120-125 so all embedding generation/status handlers derive the branch from project metadata when branch is not provided.
ontokit/services/cross_reference_service.py-61-63 (1)
61-63: ⚠️ Potential issue | 🟠 Major: Don't drop blank-node references here.
Common OWL patterns like subClassOf [ owl:someValuesFrom target ] attach the target through a blank node. continue-ing on non-URI subjects means those references never get attributed back to the owning entity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/cross_reference_service.py` around lines 61 - 63, The loop currently ignores blank-node subjects (s) which drops important references; instead of continuing on non-URI s, detect rdflib.term.BNode and resolve its owning entity by finding subjects that point to that blank node (e.g., iterate graph.subjects(None, s) or graph.triples((owner, None, s))) and attribute the reference to those owning subjects (only treat owners that are URIRef); keep the existing handling for direct URIRef subjects (s) and skip only when no URIRef owner can be resolved.
ontokit/services/suggestion_service.py-353-359 (1)
353-359: ⚠️ Potential issue | 🟠 Major: max(pr_number) + 1 will race under concurrent submits.
Manual submit and auto-submit can both read the same max and allocate the same pr_number. This needs a DB-enforced allocator, or at least a retry path on unique-constraint failure.
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/suggestion_service.py` around lines 353 - 359, The current pattern (select max(PullRequest.pr_number) then +1) races under concurrent submits; replace it with a DB-enforced allocator or a retry-on-unique-error flow: either (preferred) create/use a DB sequence (e.g., nextval('pull_request_pr_number_seq')) and obtain pr_number from the sequence via self.db.execute(nextval...) or by making pr_number default from that sequence and performing an INSERT ... RETURNING pr_number, or (alternative) keep the current logic but wrap the INSERT that uses pr_number in a retry loop that catches the DB unique-constraint/IntegrityError and retries reading max/allocating a new pr_number; update the code around max_number_result, pr_number, and the insert path to use the chosen approach and ensure the unique constraint on PullRequest.pr_number remains enforced.
ontokit/services/suggestion_service.py-219-227 (1)
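The retry-on-unique-violation alternative can be sketched with stdlib sqlite3 standing in for the real PullRequest table (names are illustrative; the real code would catch the SQLAlchemy IntegrityError instead):

```python
# Sketch: read max, try to insert max+1, and retry when a concurrent writer
# already took that number. The UNIQUE/PRIMARY KEY constraint is what makes
# the allocation safe.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pull_requests (pr_number INTEGER PRIMARY KEY)")

def allocate_pr_number(db, attempts=5):
    for _ in range(attempts):
        (current,) = db.execute(
            "SELECT COALESCE(MAX(pr_number), 0) FROM pull_requests"
        ).fetchone()
        try:
            db.execute(
                "INSERT INTO pull_requests (pr_number) VALUES (?)",
                (current + 1,),
            )
            db.commit()
            return current + 1
        except sqlite3.IntegrityError:
            db.rollback()  # another writer won the race; re-read and retry
    raise RuntimeError("could not allocate pr_number")

print(allocate_pr_number(db), allocate_pr_number(db))
```

A Postgres sequence with INSERT ... RETURNING avoids the loop entirely, which is why the prompt prefers it.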
219-227: ⚠️ Potential issue | 🟠 Major: Don't write personal emails into git authorship by default.
These values become part of commit history and can leak PII to repo viewers or external mirrors. Prefer a noreply/app alias unless the user explicitly opted into exposing their email.
Also applies to: 473-480
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/suggestion_service.py` around lines 219 - 227, The commit author default currently uses session.user_email and session.user_name which can leak PII; update the call to commit_to_branch (and the other similar call later) to only use session.user_email/session.user_name when explicitly provided by the user, otherwise provide a noreply/app alias (e.g., "noreply@ontokit.dev" or an application-specific noreply) for author_email and a neutral author_name like "OntoKit Bot" or "Suggester" to avoid embedding personal emails in git history; ensure you change both the current commit_to_branch usage and the similar invocation referenced in the comment so the fallback logic is centralized and consistent.ontokit/services/suggestion_service.py-497-521 (1)
497-521: ⚠️ Potential issue | 🟠 Major: Claim the stale session before creating its PR.
This loop reads active stale sessions without locking or an atomic status transition, so a manual submit—or another worker—can process the same session in parallel and create duplicate PRs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/suggestion_service.py` around lines 497 - 521, Before calling _create_pr_for_session, perform an atomic claim on each session to avoid races: for each session from stale_sessions, run an UPDATE via self.db that sets a temporary "claimed" or "in-progress" status (use SuggestionSession.status and SuggestionSessionStatus) with a WHERE that includes SuggestionSession.id == session.id, SuggestionSession.status == SuggestionSessionStatus.ACTIVE.value, SuggestionSession.changes_count > 0 and SuggestionSession.last_activity < cutoff; execute and only proceed to call _create_pr_for_session when the update affected exactly one row (otherwise skip because another worker claimed it). Ensure you check the DB execute result (rows affected) and only change to the final AUTO_SUBMITTED status inside _create_pr_for_session or after success.ontokit/main.py-45-50 (1)
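The atomic-claim pattern the prompt describes can be sketched with stdlib sqlite3 (illustrative schema and status strings; the real code would issue the equivalent SQLAlchemy UPDATE and check the rows affected):

```python
# Sketch: the UPDATE's WHERE clause includes the expected current status, so
# only one worker's UPDATE can match; rowcount tells us who won.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO sessions VALUES (1, 'active')")

def claim(db, session_id):
    cur = db.execute(
        "UPDATE sessions SET status = 'submitting' "
        "WHERE id = ? AND status = 'active'",
        (session_id,),
    )
    db.commit()
    return cur.rowcount == 1  # exactly one row updated -> we own the session

print(claim(db, 1))  # first worker wins
print(claim(db, 1))  # second claim fails: status is no longer 'active'
```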
45-50: ⚠️ Potential issue | 🟠 Major: Fail fast on a broken database dependency.
Swallowing the DB probe exception leaves the app serving requests while its primary dependency is unavailable. That's especially misleading because the health endpoint still reports healthy.
Proposed fix
 try:
     async with engine.connect() as conn:
         await conn.execute(text("SELECT 1"))
     logger.info("Database connection verified")
-except Exception:
-    logger.exception("Failed to connect to the database — continuing startup")
+except Exception:
+    logger.exception("Failed to connect to the database")
+    raise
Verify each finding against the current code and only fix it if needed. In `@ontokit/main.py` around lines 45 - 50, The DB probe currently swallows all exceptions (in the try around engine.connect / text("SELECT 1")) and only logs via logger.exception, allowing startup to continue; change this to fail fast by either re-raising the exception or terminating startup after logging so the process does not become healthy with a broken DB. Specifically, in the try/except that wraps async with engine.connect() ... await conn.execute(text("SELECT 1")), keep the logger.exception call for context but then raise the caught exception (or call sys.exit(1)) so the exception propagates and prevents continuation of startup instead of silently continuing.ontokit/services/suggestion_service.py-507-530 (1)
507-530: ⚠️ Potential issue | 🟠 Major: Rollback before continuing after a failed auto-submit.
If _create_pr_for_session() fails during flush/commit, the AsyncSession stays in a failed transaction state. Without a rollback, the rest of this loop can cascade into PendingRollbackError.
Proposed fix
 for session in stale_sessions:
     try:
         # Create a mock user for the auto-submit
         mock_user = CurrentUser(
             id=session.user_id,
             email=session.user_email,
             name=session.user_name,
         )
         await self._create_pr_for_session(
             session.project_id,
             session,
             mock_user,
             summary="Auto-submitted: session inactive for 30+ minutes.",
             new_status=SuggestionSessionStatus.AUTO_SUBMITTED.value,
         )
         count += 1
         logger.info(
             f"Auto-submitted suggestion session {session.session_id} "
             f"for project {session.project_id}"
         )
     except Exception as e:
+        await self.db.rollback()
         logger.error(
             f"Failed to auto-submit session {session.session_id}: {e}"
         )
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/suggestion_service.py` around lines 507 - 530, The loop that calls _create_pr_for_session can leave the SQLAlchemy AsyncSession in a failed transaction state if that call raises; update the except block to rollback the AsyncSession before continuing—e.g., in the except Exception as e: await self._db.rollback() (or await self._db_session.rollback() / whichever AsyncSession instance the service uses) then log the error and continue—this ensures the AsyncSession is reset and avoids PendingRollbackError cascading for subsequent iterations.ontokit/services/cross_reference_service.py-67-72 (1)
67-72: ⚠️ Potential issue | 🟠 Major: The fallback misclassifies every unknown URI predicate as annotation_value.
isinstance(p, URIRef) is true for ordinary structural predicates too, so unrecognized relationships get bucketed under the wrong context instead of being skipped or mapped correctly.
Proposed fix
 if context is None:
     # Check if it's an annotation property with a URI value
-    if isinstance(p, URIRef):
+    if isinstance(p, URIRef) and (p, RDF.type, OWL.AnnotationProperty) in graph:
         context = "annotation_value"
     else:
         continue
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/cross_reference_service.py` around lines 67 - 72, The current fallback treats any unknown URI predicate p as an "annotation_value" which misclassifies structural predicates; instead, in the block that sets context when context is None, check whether the triple's object (o) is a Literal or the predicate p is known to be an annotation property (e.g., by consulting your annotation property set or mapper) before assigning "annotation_value"; otherwise skip/continue. Update the logic around variables context, p and o in cross_reference_service.py so only literal-valued or explicitly-annotative predicates become "annotation_value" and other unknown URI predicates are not mis-bucketed.
ontokit/api/routes/embeddings.py-32-45 (1)
32-45: ⚠️ Potential issue | 🟠 Major: Use
RedisSettings.from_dsn() to parse the full Redis URL instead of manual URL decomposition. The current code drops credentials and TLS settings when initializing the ARQ pool.
urlparse() only extracts host/port/database, but RedisSettings supports username, password, ssl, ssl_keyfile, ssl_certfile, and related TLS fields. URLs like redis://:password@host:6379 and rediss://host:6379 will fail to connect. Replace the manual parsing with:
redis_settings = RedisSettings.from_dsn(redis_url)The
from_dsn()method parses the complete URL, automatically setsssl=Trueforrediss://schemes, and extracts authentication and database settings.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/api/routes/embeddings.py` around lines 32 - 45, The get_arq_pool function currently reconstructs RedisSettings by manually parsing settings.redis_url which drops credentials and TLS options; replace the manual urlparse block with RedisSettings.from_dsn(str(settings.redis_url)) (assign to redis_settings) so username/password and rediss:// SSL settings are preserved, remove the urlparse import, and then pass that redis_settings into await create_pool(...) to initialize _arq_pool as before.ontokit/main.py-163-165 (1)
163-165: ⚠️ Potential issue | 🟠 Major: Use SlowAPI's
_inject_headers() to properly preserve rate-limit headers. The current implementation discards rate-limit headers (
Retry-After, quota metadata). While exc.headers can carry headers, SlowAPI's default handler injects rate-limit headers via limiter._inject_headers(), which must be called to populate them. Simply copying exc.headers won't capture the dynamically-generated rate-limit metadata.
Proposed fix
 `@app.exception_handler`(RateLimitExceeded)
 async def rate_limit_handler(_request: Request, exc: RateLimitExceeded) -> JSONResponse:
-    return _error_response(429, "rate_limit_exceeded", str(exc.detail))
+    response = _error_response(429, "rate_limit_exceeded", str(exc.detail))
+    return _request.app.state.limiter._inject_headers(response, _request.state.view_rate_limit)
Verify each finding against the current code and only fix it if needed. In `@ontokit/main.py` around lines 163 - 165, The rate_limit_handler currently returns _error_response(429, "rate_limit_exceeded", str(exc.detail)) but drops SlowAPI rate-limit headers; update rate_limit_handler to create the JSONResponse via _error_response and then call the Limiter instance's _inject_headers(response, exc) (i.e., limiter._inject_headers) so the Retry-After and quota metadata are populated from the RateLimitExceeded exception before returning; locate the rate_limit_handler function and the limiter object used for SlowAPI to perform this injection.ontokit/services/embedding_providers/voyage_provider.py-41-57 (1)
41-57:⚠️ Potential issue | 🟠 MajorValidate the batch response cardinality before extending
all_embeddings.The Voyage API docs require clients to validate that the response count matches the request count. Without this check, a partial or reordered upstream response silently breaks text→embedding alignment downstream. The API also provides an explicit
indexfield in each response item for handling out-of-order results if needed.Proposed fix
 async def embed_batch(self, texts: list[str]) -> list[list[float]]:
     async with httpx.AsyncClient(timeout=60.0) as client:
         all_embeddings: list[list[float]] = []
         for i in range(0, len(texts), 128):
             batch = texts[i : i + 128]
             resp = await client.post(
                 "https://api.voyageai.com/v1/embeddings",
                 headers={
                     "Authorization": f"Bearer {self._api_key}",
                     "Content-Type": "application/json",
                 },
                 json={"input": batch, "model": self._model_name},
             )
             resp.raise_for_status()
             data = resp.json()
-            all_embeddings.extend([item["embedding"] for item in data["data"]])
+            items = data.get("data", [])
+            if len(items) != len(batch):
+                raise ValueError(
+                    f"Expected {len(batch)} embeddings, got {len(items)}"
+                )
+            all_embeddings.extend(item["embedding"] for item in items)
     return all_embeddings
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/embedding_providers/voyage_provider.py` around lines 41 - 57, In embed_batch, ensure the Voyage API response cardinality matches the request and preserve ordering: after resp.json() (variable data) validate that len(data["data"]) == len(batch) and raise or retry if not; then build a result list of length len(batch) (e.g., placeholders) and populate it using each item["index"] to place item["embedding"] in the correct slot before extending/merging into all_embeddings, so all_embeddings alignment with input texts is preserved (refer to embed_batch, batch, data, all_embeddings, and item["index"]).ontokit/services/embedding_service.py-88-96 (1)
88-96: ⚠️ Potential issue | 🟠 Major: Invalidate existing embeddings when the provider/model changes.
This mutation flips
_get_provider() for future queries, but every stored EntityEmbedding row stays in the previous vector space. Until a full re-embed happens, semantic search can compare incompatible embeddings or fail on dimension mismatches. Clear or mark rows stale here and reset last_full_embed_at so callers cannot treat old vectors as current.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/embedding_service.py` around lines 88-96: when `config.provider` or `config.model_name` is changed in the update block (where `update.provider`/`update.model_name` are applied and `get_embedding_provider` is called), also invalidate existing embeddings and reset the full-embed marker. Mark all `EntityEmbedding` rows for this config as stale (e.g., set a `stale` flag or clear the stored vector) and persist that change, and set the config's `last_full_embed_at` to `None` (or zero) before saving so callers cannot treat old vectors as current. Use the `EntityEmbedding` model and the same config identifier used in `embedding_service.py` to locate and update rows immediately after updating `config.provider`/`config.model_name` and `config.dimensions`.
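The invalidation logic above can be sketched independently of the ORM. Plain dicts stand in for the real `EmbeddingConfig` and `EntityEmbedding` models here; the field names (`provider`, `model_name`, `stale`, `last_full_embed_at`) mirror the review comment, not verified code.

```python
def apply_config_update(config: dict, update: dict, rows: list[dict]) -> None:
    """Apply a provider-config update; if provider or model changed, mark every
    stored embedding row stale and clear last_full_embed_at so semantic search
    never compares vectors from two different embedding spaces."""
    changed = any(
        update.get(field) is not None and update[field] != config.get(field)
        for field in ("provider", "model_name")
    )
    for field, value in update.items():
        if value is not None:
            config[field] = value
    if changed:
        for row in rows:
            row["stale"] = True  # or: clear the stored vector entirely
        config["last_full_embed_at"] = None
```

The key point is ordering: detect the change before mutating `config`, then invalidate in the same transaction as the config write.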
ontokit/services/embedding_service.py (1)

294-325: ⚠️ Potential issue | 🟠 Major

Full re-embed needs to prune removed entities.

This only upserts entities that still exist in the graph. If an ontology term was deleted, its old `EntityEmbedding` row survives and stays searchable forever, so a "full" embed never actually syncs the branch snapshot. Please delete rows whose `entity_iri` is no longer present in the current entity set as part of this path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/embedding_service.py` around lines 294-325: when doing the "full" upsert in this embedding path, also delete `EntityEmbedding` rows for the given `project_id` and `branch` whose `entity_iri` is not in the current entities set. Compute the current IRIs from the entities list as a set, then run a delete using the `EntityEmbedding` model to remove rows where `EntityEmbedding.project_id == project_id` and `EntityEmbedding.branch == branch` and `EntityEmbedding.entity_iri` is not in that IRI set (use SQLAlchemy `delete(...)` with `.where(...)` and `.notin_(...)` or `~EntityEmbedding.entity_iri.in_(...)`), and commit with the existing `self._db.commit()` so stale embeddings are pruned when `job.embedded_entities` is updated.
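A self-contained sketch of that pruning delete, assuming SQLAlchemy. The model here is a minimal stand-in (the real `EntityEmbedding` has a vector column and a unique constraint on `(project_id, branch, entity_iri)`); only the `delete(...).where(... .notin_(...))` shape is the point.

```python
from sqlalchemy import Column, Integer, String, create_engine, delete
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class EntityEmbedding(Base):
    """Minimal stand-in for the real model (vector column omitted)."""
    __tablename__ = "entity_embeddings"
    id = Column(Integer, primary_key=True)
    project_id = Column(String)
    branch = Column(String)
    entity_iri = Column(String)

def prune_removed_entities(session: Session, project_id: str, branch: str,
                           current_iris: set[str]) -> int:
    """Delete embedding rows for this project/branch whose IRI is no longer in
    the current entity set, so a full re-embed truly syncs the snapshot."""
    result = session.execute(
        delete(EntityEmbedding).where(
            EntityEmbedding.project_id == project_id,
            EntityEmbedding.branch == branch,
            EntityEmbedding.entity_iri.notin_(list(current_iris)),
        )
    )
    session.commit()
    return result.rowcount
```

Returning `rowcount` lets the job record how many stale rows were removed alongside `embedded_entities`.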
ontokit/services/embedding_service.py (1)

97-101: ⚠️ Potential issue | 🟠 Major

Do not store provider secrets as base64.

Base64 is reversible encoding, not encryption. Anyone with database access can recover the raw API key from `api_key_encrypted`, so this materially weakens the security posture of the feature. Use real encryption/KMS or a dedicated secret store before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/embedding_service.py` around lines 97-101: the code currently stores provider secrets by base64-encoding `update.api_key` into `config.api_key_encrypted` (see the `update.api_key` branch and the `config.api_key_encrypted` assignment). Replace this reversible encoding with real secret storage: remove the `base64.b64encode(...)` usage and instead call a secure storage/encryption helper (e.g., `encrypt_with_kms(update.api_key)` or `secret_manager.save_secret`/`update_secret`) and persist the returned secret identifier or ciphertext, not the raw key. Ensure the helper performs authenticated encryption or uses your cloud KMS/secret-store API, and update any callers of `config.api_key_encrypted` to expect the encrypted value or secret reference.
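A toy sketch of the secret-reference pattern the prompt describes: the database row keeps only an opaque reference, never the key. `InMemorySecretStore` and `apply_api_key_update` are hypothetical stand-ins; production code would back this with a KMS, Vault, or authenticated encryption such as `cryptography.fernet.Fernet`.

```python
import uuid

class InMemorySecretStore:
    """Toy stand-in for a real secret manager (Vault, cloud KMS, ...)."""
    def __init__(self) -> None:
        self._secrets: dict[str, str] = {}

    def put(self, raw_key: str) -> str:
        # Return an opaque reference; the raw key never leaves the store.
        ref = f"secret-ref:{uuid.uuid4()}"
        self._secrets[ref] = raw_key
        return ref

    def get(self, ref: str) -> str:
        return self._secrets[ref]

def apply_api_key_update(config: dict, api_key: str,
                         store: InMemorySecretStore) -> None:
    # Persist a reference instead of base64(raw key):
    # base64 is encoding, not encryption.
    config["api_key_ref"] = store.put(api_key)
```

With this shape, a database dump exposes only `secret-ref:...` strings, and key rotation becomes an operation on the store rather than a schema migration.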
ontokit/services/embedding_service.py (1)

532-548: ⚠️ Potential issue | 🟠 Major

Add branch filtering to rank_suggestions queries.

The `rank_suggestions` method queries `EntityEmbedding` without filtering by branch, unlike `semantic_search` and `find_similar`. Since `EntityEmbedding` has a unique constraint on `(project_id, branch, entity_iri)`, the same IRI can exist on multiple branches. This causes two problems:

- The context lookup on line 537 using `scalar_one_or_none()` will raise an exception if the same context IRI exists on multiple branches.
- The candidate embeddings (lines 542-548) can be mixed from different branches when ranking.

Add `branch` to `RankSuggestionRequest` (or the method signature and API endpoint) and filter both queries with it:

```python
.where(
    EntityEmbedding.project_id == project_id,
    EntityEmbedding.branch == branch,
    EntityEmbedding.entity_iri == body.context_iri,
)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/embedding_service.py` around lines 532-548: the `rank_suggestions` flow is missing branch filtering, which can cause `scalar_one_or_none()` on the context query in `rank_suggestions` and the candidate query in `candidates_q` to return multiple rows from different branches. Update `RankSuggestionRequest` (or the `rank_suggestions` method signature/API) to accept a branch and add `EntityEmbedding.branch == branch` to both the context lookup (where `EntityEmbedding.entity_iri == body.context_iri`) and the candidates query (where `EntityEmbedding.entity_iri.in_(body.candidates)`), ensuring `rank_suggestions`, the `EntityEmbedding` queries, and downstream logic mirror the branch filtering used by `semantic_search` and `find_similar`.
ontokit/services/embedding_service.py (1)

343-347: ⚠️ Potential issue | 🟠 Major

Rollback before recording the failed job.

If any DB write inside the `try` block raises, the `AsyncSession` enters a failed transaction state. In that case, this `commit()` will not persist `status="failed"`, leaving the job stuck as running.

🩹 Proposed fix

```diff
     except Exception as e:
+        await self._db.rollback()
         job.status = "failed"
         job.error_message = str(e)
         job.completed_at = datetime.now(UTC)
         await self._db.commit()
         raise
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/embedding_service.py` around lines 343-347: the except block handling failures must roll back the failed `AsyncSession` transaction before updating the job record and committing. Modify the except handler so it first calls `await self._db.rollback()` (or the session rollback method used in your codebase) to clear the failed transaction state, then sets `job.status = "failed"`, `job.error_message = str(e)`, `job.completed_at = datetime.now(UTC)`, and finally `await self._db.commit()` to persist the changes.
ontokit/services/embedding_service.py (1)

264-271: ⚠️ Potential issue | 🟠 Major

Deduplicate typed subjects to prevent embedding duplicates and inaccurate progress tracking.

`graph.subjects(RDF.type, None)` yields each subject once per matching `rdf:type` triple, not once per unique subject. Entities with multiple type assertions will be processed multiple times: duplicated in the `entities` list, inflating `job.total_entities`, making progress inaccurate, and wasting provider quota on larger ontologies.

♻️ Proposed fix

```diff
     entities: list[tuple[URIRef, str, str]] = []  # (uri, type, text)
+    seen: set[URIRef] = set()
     for s in graph.subjects(RDF.type, None):
-        if not isinstance(s, URIRef) or s == OWL.Thing:
+        if (
+            not isinstance(s, URIRef)
+            or s == OWL.Thing
+            or s in seen
+        ):
             continue
+        seen.add(s)
```

Alternatively, use `graph.subjects(RDF.type, None, unique=True)` directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/embedding_service.py` around lines 264 - 271, The loop over graph.subjects(RDF.type, None) can yield the same subject multiple times if it has multiple rdf:type triples, causing duplicates in entities and inflating job.total_entities; change the iteration to deduplicate subjects (e.g., use graph.subjects(RDF.type, None, unique=True) or collect into a set of subjects first) before calling _get_entity_type and build_embedding_text and appending to entities so each subject is processed only once and progress/provider usage is accurate.
🟡 Minor comments (1)
ontokit/services/embedding_providers/openai_provider.py (1)

47-52: ⚠️ Potential issue | 🟡 Minor

API key may leak in exception tracebacks.

If `raise_for_status()` throws, the httpx exception may include request headers containing the Bearer token. Consider wrapping with a try/except that redacts sensitive details or using httpx's event hooks.

🛡️ Proposed mitigation

```diff
+    try:
         resp.raise_for_status()
+    except httpx.HTTPStatusError as e:
+        # Re-raise without exposing auth header
+        raise httpx.HTTPStatusError(
+            message=f"OpenAI API error: {e.response.status_code}",
+            request=e.request,
+            response=e.response,
+        ) from None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/embedding_providers/openai_provider.py` around lines 47 - 52, The code calls client.post(...) and then resp.raise_for_status(), which can cause httpx exceptions to include the Authorization header (Bearer token) in tracebacks; wrap the POST and raise_for_status() in a try/except that catches httpx.HTTPError (or broader Exception), and on error redact or remove the Authorization value before re-raising or logging (e.g., build headers_without_auth = {k:v for k,v in headers.items() if k.lower() != "authorization"} and use that for any log/exception message), or re-raise a sanitized error with only non-sensitive request/response details; apply this change around the client.post/resp.raise_for_status call in the embedding provider method (referenced by client.post, resp, and resp.raise_for_status).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b1d4469-bea9-4ec3-a02e-89852fbc820e
📒 Files selected for processing (57)
alembic/versions/l0g1h2i3j4k5_add_suggestion_sessions_table.py
alembic/versions/m1n2o3p4q5r6_add_entity_change_events.py
alembic/versions/n2o3p4q5r6s7_add_embedding_tables.py
compose.yaml
ontokit/api/routes/__init__.py
ontokit/api/routes/analytics.py
ontokit/api/routes/embeddings.py
ontokit/api/routes/projects.py
ontokit/api/routes/quality.py
ontokit/api/routes/search.py
ontokit/api/routes/semantic_search.py
ontokit/api/routes/suggestions.py
ontokit/core/auth.py
ontokit/core/beacon_token.py
ontokit/core/exceptions.py
ontokit/core/middleware.py
ontokit/main.py
ontokit/models/__init__.py
ontokit/models/change_event.py
ontokit/models/embedding.py
ontokit/models/project.py
ontokit/models/suggestion_session.py
ontokit/schemas/__init__.py
ontokit/schemas/analytics.py
ontokit/schemas/embeddings.py
ontokit/schemas/ontology.py
ontokit/schemas/project.py
ontokit/schemas/quality.py
ontokit/schemas/suggestion.py
ontokit/services/change_event_service.py
ontokit/services/consistency_service.py
ontokit/services/cross_reference_service.py
ontokit/services/duplicate_detection_service.py
ontokit/services/embedding_providers/__init__.py
ontokit/services/embedding_providers/base.py
ontokit/services/embedding_providers/local_provider.py
ontokit/services/embedding_providers/openai_provider.py
ontokit/services/embedding_providers/voyage_provider.py
ontokit/services/embedding_service.py
ontokit/services/embedding_text_builder.py
ontokit/services/join_request_service.py
ontokit/services/ontology.py
ontokit/services/project_service.py
ontokit/services/search.py
ontokit/services/suggestion_service.py
ontokit/version.py
ontokit/worker.py
pyproject.toml
scripts/init-db.sh
tests/conftest.py
tests/integration/test_project_workflow.py
tests/unit/test_auth.py
tests/unit/test_config.py
tests/unit/test_linter.py
tests/unit/test_ontology_service.py
tests/unit/test_projects_routes.py
tests/unit/test_search.py
```python
if graph is None:
    graph = Graph()
```
Don't silently turn a missing ontology graph into Graph().
ontokit/api/routes/search.py currently calls this method without a graph, so this fallback makes the shipped /sparql endpoint execute every request against a fresh empty graph and return empty data. This should either load the requested ontology first or fail fast until that wiring exists.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ontokit/services/search.py` around lines 188 - 189, The code currently does
"if graph is None: graph = Graph()", which silently substitutes an empty graph
and causes the SPARQL endpoint to return empty results; instead either require a
real graph and fail fast or explicitly load the requested ontology: remove the
silent Graph() fallback and (a) raise a clear exception (e.g. raise
ValueError("graph is required")) so callers must pass a populated graph, or (b)
call the project’s ontology-loading helper (e.g. load_ontology_graph(...) or
OntologyStore.get_graph(...)) to populate graph when an ontology id is
available; update the caller(s) that invoked this function without a graph to
either pass the loaded graph or provide the ontology id used for loading.
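The fail-fast option the prompt describes can be sketched as a small guard. This is a hedged sketch: `require_graph` is a hypothetical helper, and `loader` stands in for whatever ontology-loading callable the project exposes (e.g., a `load_ontology_graph(ontology_id)`-style function, which is assumed here, not confirmed).

```python
def require_graph(graph, loader=None, ontology_id=None):
    """Fail fast instead of silently querying an empty graph.

    Either return the graph the caller passed, load one via `loader`
    when an ontology id is available, or raise so the bug surfaces.
    """
    if graph is not None:
        return graph
    if loader is not None and ontology_id is not None:
        return loader(ontology_id)
    raise ValueError(
        "graph is required: refusing to run SPARQL against an empty graph"
    )
```

This keeps the `/sparql` endpoint from returning empty results that are indistinguishable from a real query against real data.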
The docstring on approve_request incorrectly stated "add the user as an editor" when the implementation actually assigns the "suggester" role. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
verify_aud was disabled in jwt.decode with no follow-up check, allowing tokens issued for other clients to be accepted. Now validates that aud contains or azp matches our zitadel_client_id before extracting roles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
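The aud/azp check described in this commit can be sketched as a standalone validator. The claim names follow RFC 7519 (`aud`) and OIDC (`azp`); the function name and exception type are illustrative, not the project's actual code.

```python
def validate_audience(payload: dict, client_id: str) -> None:
    """Accept a token only if our client_id appears in `aud` or matches `azp`.

    `aud` may be a single string or a list per RFC 7519, so normalize first.
    """
    aud = payload.get("aud", [])
    if isinstance(aud, str):
        aud = [aud]
    if client_id not in aud and payload.get("azp") != client_id:
        raise PermissionError("token was not issued for this client")
```

Running this immediately after `jwt.decode` (when `verify_aud` is disabled for multi-audience setups) closes the hole where tokens issued for other clients were accepted.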
Both get_current_user and get_current_user_with_token only called fetch_userinfo() when name or email was missing. If the token had both but no roles claim, roles would stay [] and role checks would fail. Now also triggers the userinfo fallback when roles is empty. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fragile string-based detection (startswith on uppercased query) with rdflib.plugins.sparql.parser.parseQuery. This correctly handles queries with PREFIX declarations before the operation keyword. Parse errors now raise ValueError instead of silently defaulting to SELECT. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The conditional `if root in parent or iri in parent` could skip entities whose representative wasn't a key in the parent dict, causing incomplete clusters. Now unconditionally groups every entity under its root. Singleton clusters are already filtered out downstream (len < 2). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
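The unconditional grouping this commit describes can be sketched with a minimal union-find walk. Names here (`build_clusters`, the `parent` dict mapping an entity to its representative) are illustrative stand-ins for the service's internal structures.

```python
def build_clusters(entities: list[str], parent: dict[str, str]) -> dict[str, list[str]]:
    """Group every entity under its union-find root, unconditionally.

    Entities absent from `parent` are their own root; singleton clusters
    are dropped at the end (len < 2), matching the downstream filter.
    """
    def find(x: str) -> str:
        # Follow parent pointers until reaching a self-rooted representative.
        while parent.get(x, x) != x:
            x = parent[x]
        return x

    clusters: dict[str, list[str]] = {}
    for iri in entities:
        clusters.setdefault(find(iri), []).append(iri)
    return {root: members for root, members in clusters.items() if len(members) >= 2}
```

Because every entity is grouped via `find(...)` rather than a membership test on the dict keys, no entity can be silently dropped when its representative never appeared as a key.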
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dangling_ref rule flagged terms from standard vocabularies (RDFS, OWL, XSD, SKOS, DCTERMS) and graph-registered namespaces as undeclared. Now collects namespace prefixes from the graph's namespace manager and well-known vocabs, and skips URIs matching those namespaces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous loop only followed parents[0], missing cycles reachable through other inheritance branches. Now uses DFS with a stack to visit all rdfs:subClassOf edges from each class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
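The stack-based DFS this commit describes can be sketched over a plain adjacency map, where each key's list represents all of a class's `rdfs:subClassOf` parents (the function name and dict shape are illustrative, not the service's actual signature):

```python
def has_inheritance_cycle(edges: dict[str, list[str]], start: str) -> bool:
    """Detect whether `start` can reach itself through ANY chain of
    rdfs:subClassOf edges, not just the first parent of each class."""
    stack = list(edges.get(start, []))
    seen: set[str] = set()
    while stack:
        node = stack.pop()
        if node == start:
            return True  # walked back to where we began: cycle
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return False
```

The `seen` set bounds the walk on diamond-shaped hierarchies, while `stack.extend(...)` over every parent is what the old `parents[0]`-only loop missed.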
generate_embeddings and get_embedding_status hardcoded branch="main". Now branch is optional and resolved via git.get_default_branch(project_id) when not provided, matching the pattern used in the projects routes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Blank-node subjects (e.g. OWL restrictions) were silently skipped, dropping references like owl:someValuesFrom from restriction nodes. Now traces blank nodes back to their owning named entities (URIRefs) via _resolve_bnode_owners, which walks up through nested blank nodes with cycle detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
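The owner-resolution walk this commit describes can be sketched over plain triples. Blank nodes are modeled as `_:`-prefixed strings and named entities as full IRIs; the recursion shape (walk up through subjects pointing at the node, with a visited set for cycle safety) mirrors the described `_resolve_bnode_owners`, but this is a simplified stand-in, not the real implementation.

```python
def resolve_bnode_owners(triples, node, _seen=None):
    """Walk subject links upward from a blank node until named owners are
    reached, with cycle detection.

    `triples` is a list of (s, p, o); blank nodes are '_:'-prefixed strings.
    """
    if _seen is None:
        _seen = set()
    if node in _seen:
        return set()  # cycle among blank nodes: stop
    _seen.add(node)
    owners = set()
    for s, _p, o in triples:
        if o == node:
            if s.startswith("_:"):
                # Nested blank node (e.g. restriction inside a restriction)
                owners |= resolve_bnode_owners(triples, s, _seen)
            else:
                owners.add(s)
    return owners
```

So a reference hanging off `_:r2` in `ex:A rdfs:subClassOf _:r1 . _:r1 owl:someValuesFrom _:r2` is attributed to `ex:A` rather than dropped.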
The SELECT MAX(pr_number) + 1 pattern can produce duplicate numbers under concurrent inserts. Now retries up to 3 times on IntegrityError, re-reading the max pr_number after each conflict. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
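The retry loop this commit describes can be sketched with the DB calls abstracted away. `read_max` and `insert` are hypothetical callables standing in for the SELECT and INSERT; a `ValueError` here plays the role of SQLAlchemy's `IntegrityError` on the unique constraint.

```python
def next_pr_number(read_max, insert, retries: int = 3) -> int:
    """Allocate MAX(pr_number)+1, retrying on unique-constraint conflicts.

    After each conflict the max is re-read, so a rival writer's number
    is observed before the next attempt.
    """
    for _ in range(retries):
        candidate = read_max() + 1
        try:
            insert(candidate)
            return candidate
        except ValueError:  # stand-in for IntegrityError
            continue
    raise RuntimeError("could not allocate a unique pr_number after retries")
```

Re-reading the max inside the loop, rather than incrementing the failed candidate, is what keeps the scheme correct when several writers race past each other.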
…haining Remove unused imports (update, ProjectMember, PullRequestService) and add missing 'from e' to exception re-raises (B904). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiple workers running auto_submit_stale_sessions could process the same session concurrently. Now uses an atomic UPDATE with WHERE conditions to claim each session before creating a PR. Only proceeds when exactly one row is affected; reverts the claim on failure so the session can be retried. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
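The atomic claim this commit describes can be sketched with SQLAlchemy: a conditional UPDATE whose `rowcount` tells the worker whether it won the race. The model is a minimal stand-in for the real `SuggestionSession`, and the status strings are assumptions based on the surrounding description.

```python
from sqlalchemy import Column, Integer, String, create_engine, update
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class SuggestionSession(Base):
    """Minimal stand-in for the real session model."""
    __tablename__ = "suggestion_sessions"
    id = Column(Integer, primary_key=True)
    status = Column(String, default="ACTIVE")

def claim_session(session: Session, session_id: int) -> bool:
    """Atomically claim a stale session before creating its PR.

    The WHERE clause only matches an ACTIVE row, so when two workers
    race, exactly one UPDATE affects a row and proceeds.
    """
    result = session.execute(
        update(SuggestionSession)
        .where(
            SuggestionSession.id == session_id,
            SuggestionSession.status == "ACTIVE",
        )
        .values(status="SUBMITTING")
    )
    session.commit()
    return result.rowcount == 1
```

On failure during PR creation, the worker would revert the claim (set status back to `ACTIVE`) so the session can be retried, as the commit message notes.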
The DB probe was swallowing exceptions and continuing startup, leaving the API running with a broken database. Now re-raises after logging so the process fails to start. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If _create_pr_for_session fails with a DB error, the AsyncSession is left in a failed transaction state. The subsequent commit to revert the claim would raise PendingRollbackError. Now rollbacks first to reset the session before reverting status to ACTIVE. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fallback treated any unknown URI predicate as annotation_value, misclassifying structural predicates like owl:unionOf. Now checks that the predicate is declared as owl:AnnotationProperty in the graph before assigning annotation_value context; other unknown predicates are skipped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…and cross-references 49 new tests covering five low-coverage modules using in-memory RDFLib graphs and monkeypatch — no infrastructure dependencies required. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enables automatic linting, formatting, and type checking on commit. Developers run `pre-commit install` after cloning to activate hooks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion `make setup` installs dev dependencies and pre-commit hooks in one step. Also adds lint, format, typecheck, and test convenience targets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 10
♻️ Duplicate comments (4)
ontokit/services/suggestion_service.py (2)
255-269: ⚠️ Potential issue | 🟠 Major

Re-check session state inside the branch lock and use the same lock in `discard()`.

`save()` and `beacon_save()` validate `ACTIVE` before acquiring `_branch_locks`, while `discard()` deletes the branch without that lock. A concurrent discard can slip between those steps and turn a valid save into a 500 or race a beacon write against branch deletion. Revalidate `session.status` after the lock is acquired, and take the same per-branch lock around discard.

Suggested fix

```diff
-    if session.status != SuggestionSessionStatus.ACTIVE.value:
-        raise HTTPException(
-            status_code=status.HTTP_400_BAD_REQUEST,
-            detail=f"Session is {session.status}, cannot save",
-        )
-
     project = await self._get_project(project_id)
     filename = self._get_git_ontology_path(project)

     # Serialize git writes per branch to prevent lost commits
     async with _branch_locks[session.branch]:
+        await self.db.refresh(session)
+        if session.status != SuggestionSessionStatus.ACTIVE.value:
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail=f"Session is {session.status}, cannot save",
+            )
         # Commit to the suggestion branch
         commit_message = f"Update {data.entity_label}"
```

```diff
-    # Delete the git branch
-    try:
-        self.git_service.delete_branch(project_id, session.branch, force=True)
-    except Exception as e:
-        logger.warning(f"Failed to delete suggestion branch {session.branch}: {e}")
-
-    session.status = SuggestionSessionStatus.DISCARDED.value
-    session.last_activity = datetime.now(UTC)
-    await self.db.commit()
+    async with _branch_locks[session.branch]:
+        await self.db.refresh(session)
+        if session.status != SuggestionSessionStatus.ACTIVE.value:
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail=f"Session is {session.status}, cannot discard",
+            )
+        try:
+            self.git_service.delete_branch(project_id, session.branch, force=True)
+        except Exception as e:
+            logger.warning(f"Failed to delete suggestion branch {session.branch}: {e}")
+
+        session.status = SuggestionSessionStatus.DISCARDED.value
+        session.last_activity = datetime.now(UTC)
+        await self.db.commit()
```

Apply the same post-lock refresh/status check in `beacon_save()` as well.

Also applies to: 520-540, 563-578
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/suggestion_service.py` around lines 255 - 269, The save()/beacon_save() flow validates session.status before acquiring the per-branch lock (_branch_locks[session.branch]) which allows a concurrent discard() to delete the branch; after acquiring the branch lock in save() and beacon_save() re-fetch or refresh the session (via _get_session(project_id, session_id) or reload the session object) and re-check that session.status == SuggestionSessionStatus.ACTIVE.value and raise the same HTTPException if not; additionally wrap discard() with the same per-branch lock (async with _branch_locks[session.branch]) so branch deletion serializes with saves/beacon_saves to prevent races.
93-100: ⚠️ Potential issue | 🟠 Major

Reject absolute ontology paths before stripping separators.

Line 94 turns an absolute path like `/ontologies/core.ttl` into `ontologies/core.ttl`, so invalid absolute paths are silently accepted and writes can land on the wrong repo-relative file. Check `os.path.isabs(...)` before normalizing/stripping.

Suggested fix

```diff
     if project.source_file_path:
-        path = os.path.normpath(project.source_file_path).lstrip("/\\")
+        if os.path.isabs(project.source_file_path):
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail="Invalid ontology path",
+            )
+        path = os.path.normpath(project.source_file_path)
         if path.startswith(".."):
             raise HTTPException(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 detail="Invalid ontology path",
             )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/suggestion_service.py` around lines 93-100: the code currently normalizes and strips leading separators from `project.source_file_path`, which lets absolute paths like `/ontologies/core.ttl` become `ontologies/core.ttl` and bypass validation. Update the logic to first call `os.path.isabs(project.source_file_path)` and, if true, raise the same `HTTPException` (`status.HTTP_400_BAD_REQUEST`, detail "Invalid ontology path") before doing `os.path.normpath()`, then keep the existing check for `path.startswith("..")` and return the cleaned path.
ontokit/services/search.py (1)

81-89: ⚠️ Potential issue | 🟠 Major

Fail fast when `_db` was never injected.

Returning `200` with `results=[]` still makes a dependency-wiring bug indistinguishable from a legitimate "no matches" response. Keeping `db=None` for SPARQL-only use is fine, but `search()` should raise once it reaches the database path.

Suggested fix

```diff
     if self._db is None:
-        # No database session available -- return empty results.
-        elapsed = (time.perf_counter() - start_time) * 1000
-        return SearchResponse(
-            results=[],
-            total=0,
-            query=query.query,
-            took_ms=elapsed,
-        )
+        raise RuntimeError("SearchService.search requires a database session")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `ontokit/services/search.py` around lines 81-89: the `search()` method currently returns an empty `SearchResponse` when `self._db` is None, which hides dependency-wiring bugs. Instead, detect if `self._db` is None before entering the database query path and raise a clear exception (e.g., `RuntimeError` or a custom `DependencyError`) with a descriptive message that the DB was not injected, rather than returning `SearchResponse(results=[])`. SPARQL-only use can still construct the service, but any code that reaches the DB path in `search()` fails fast.
ontokit/services/embedding_service.py (1)

160-182: ⚠️ Potential issue | 🟠 Major

Coverage can still read as 100% when it isn't.

Line 172 looks for finished jobs with status `"complete"`, but Line 370 writes `"completed"`, so `get_status()` never picks up successful full embeds as the denominator source. When that query misses, Line 160 falls back to `embedded_count`, which still overreports coverage after incremental auto-embeds.

Also applies to: 370-372
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ontokit/services/embedding_service.py` around lines 160 - 182, The coverage calculation falls back to the last completed job by querying EmbeddingJob.status == "complete" but elsewhere the code writes status "completed", so successful jobs are missed and coverage can be reported as 100%; update the status check to use the canonical status string or a single shared constant/enum (e.g., replace the literal "complete" with the same value used when writing status, or reference a shared constant/enum) in the query in embedding_service.py (the block using last_job_q and last_total) and make the identical fix where statuses are written (the other place referenced around the later job update), ensuring both use the same identifier (e.g., "completed" or a STATUS_COMPLETE constant) so the denominator picks up finished jobs correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ontokit/api/routes/embeddings.py`:
- Around line 132-139: The code enqueues an ARQ job after committing a pending
row so failures in get_arq_pool() or pool.enqueue_job(...) leave a stuck active
job; wrap the ARQ calls (get_arq_pool and pool.enqueue_job with args
str(project_id), resolved_branch, str(job_id)) in a try/except, and on any
exception delete or mark the pending job as failed (use the same job identifier
job_id or the repo/job model you used to create the pending row) before
re-raising or returning an error; ensure cleanup uses the same transactional
context or a safe retry so no pending row remains if enqueueing fails and keep
the successful return of EmbeddingGenerateResponse(job_id=str(job_id)) only
after enqueue succeeds.
In `@ontokit/api/routes/quality.py`:
- Around line 121-135: The endpoint runs run_consistency_check(...)
synchronously and currently returns a job_id even if caching to Redis fails,
which yields dead job IDs; change the flow so you only generate/return a job_id
after a successful cache write (or if you generate it earlier, treat Redis
failures as a hard failure). Specifically, call run_consistency_check(graph,
str(project_id), branch) as before, then attempt to get the Redis client via
_get_redis() and persist the result_json under cache_key and job_key; if any
await redis.set(...) call throws, propagate an error response (e.g., raise an
HTTPException / return a 500) instead of swallowing the exception and returning
ConsistencyCheckTriggerResponse(job_id=...). Keep references:
run_consistency_check, _get_redis, cache_key ("quality:{project_id}:{branch}"),
job_key ("quality_job:{project_id}:{job_id}"), and
ConsistencyCheckTriggerResponse.
- Around line 119-129: Resolve and reuse the actual branch value returned/used
by _load_graph instead of the original query param: call _load_graph(project_id,
branch, db), extract the resolved branch from the returned graph (e.g.,
graph.branch or graph.repo.default_branch depending on the graph shape), then
pass that resolved_branch into run_consistency_check(project_id) (i.e.,
run_consistency_check(graph, str(project_id), resolved_branch), use
resolved_branch when forming cache_key/job_key, and set result.branch =
resolved_branch before serializing with result.model_dump_json(); apply the same
fix to the other handlers mentioned (around the blocks that use
run_consistency_check, _get_redis, cache_key/job_key at the later locations).
In `@ontokit/api/routes/semantic_search.py`:
- Around line 45-47: The handlers semantic_search() and find_similar_entities()
are defaulting the branch Query param to "main" instead of resolving the
project's actual default branch like rank_suggestions() does; update both
handlers to, when branch is omitted/None, call the same repo/default-branch
resolution used by rank_suggestions() (e.g., the helper or method that returns
the project's default branch) and pass that resolved branch to EmbeddingService
instead of the literal "main"; apply the same change to the other parameterized
endpoints referenced around lines 65-67 so all calls into EmbeddingService use
the repo-resolved default branch.
In `@ontokit/core/auth.py`:
- Around line 25-36: The current code uses bare dict annotations causing mypy
failures; add from typing import Any and define a JsonObject = dict[str, Any]
(or use Mapping[str, Any] for read-only) near the top of the file, then change
_extract_roles(payload: dict) -> list[str] to _extract_roles(payload:
Mapping[str, Any]) -> list[str], annotate _jwks_cache as JsonObject | None,
change get_jwks return type to JsonObject, and update fetch_userinfo's dict |
None to JsonObject | None so all helpers use the new typed alias.
In `@ontokit/main.py`:
- Around line 42-43: Annotate the loose types: change the function parameter
`body` to the precise mapping type `dict[str, dict[str, str] | object]`; in
ontokit/version.py add an explicit type to `VERSION_BASE` (e.g. `VERSION_BASE:
str = VERSION.replace(...)`) so `__version__` is clearly a str; and make the
Redis typing consistent by matching the declared `redis_client` type to whatever
`aioredis.from_url(..., decode_responses=True)` actually returns (either change
the `redis_client` annotation to the concrete return type from
`aioredis.from_url` or use `typing.cast(...)` when assigning the result) so mypy
strict mode no longer reports a mismatch.
In `@ontokit/services/consistency_service.py`:
- Around line 18-25: The function _iter_classes currently lacks a return type
annotation causing mypy failures; add a return type of Iterator[URIRef] to its
signature (def _iter_classes(graph: Graph) -> Iterator[URIRef]:) and ensure
Iterator is imported from typing (or collections.abc) alongside existing
RDF/URIRef/Graph imports so the file type-checks under strict mypy.
In `@ontokit/services/embedding_service.py`:
- Around line 604-618: The queries use body.branch directly which can be None
and causes unintended IS NULL filters; resolve the branch to the project's
default before the embedding lookups by fetching the project's default branch
(e.g., via the Project record or an existing helper like
get_default_branch(project_id)) and assign it to a local resolved_branch (fall
back to body.branch if already set), then use resolved_branch in the
EntityEmbedding queries (both the context emb_q and candidates_q) instead of
body.branch so lookups for omitted-branch requests use the project's default
branch.
In `@ontokit/services/search.py`:
- Line 9: Replace the internal SPARQLResult import with RDFLib's public Result
type: change the import from rdflib.plugins.sparql.processor.SPARQLResult to
rdflib.query.Result and update any type annotations that reference SPARQLResult
(e.g., the variable or return type used around the Graph.query() handling at the
site flagged) to use Result instead so the annotation matches Graph.query()'s
public API.
In `@ontokit/services/suggestion_service.py`:
- Around line 201-238: The code currently deletes the git branch in the generic
except that also catches failures from await self.db.refresh(db_session) after a
successful commit; change flow so branch cleanup only happens when add/commit
fails: wrap self.db.add(...) and await self.db.commit() in their own try/except
(handling IntegrityError as before and running
self.git_service.delete_branch(...) on those failures), then perform await
self.db.refresh(db_session) in a separate try/except that does NOT call
self.git_service.delete_branch (instead log the refresh failure and raise an
appropriate HTTPException); keep the existing logic that returns an existing
SuggestionSessionResponse when IntegrityError finds a concurrent active session,
and continue to re-raise unexpected exceptions from the commit block after
cleanup.
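The control-flow change above can be illustrated with a toy model. All class and function names here are simplified stand-ins for the real SQLAlchemy session and git service:

```python
# Illustrative sketch (not the real service code): branch cleanup runs only
# when the commit itself fails; a refresh failure after a successful commit
# is surfaced without deleting the branch.

class FakeGit:
    def __init__(self) -> None:
        self.deleted: list[str] = []

    def delete_branch(self, name: str) -> None:
        self.deleted.append(name)

def create_session(git: FakeGit, branch: str, *, commit_ok: bool, refresh_ok: bool) -> str:
    try:
        if not commit_ok:
            raise RuntimeError("commit failed")
    except RuntimeError:
        git.delete_branch(branch)  # commit failed -> branch is orphaned
        raise
    try:
        if not refresh_ok:
            raise RuntimeError("refresh failed")
    except RuntimeError:
        # Commit already succeeded: keep the branch, report the error instead.
        return "refresh-error"
    return "ok"

git = FakeGit()
assert create_session(git, "b1", commit_ok=True, refresh_ok=False) == "refresh-error"
assert git.deleted == []  # branch survives a refresh failure
```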
---
Duplicate comments:
In `@ontokit/services/embedding_service.py`:
- Around line 160-182: The coverage calculation falls back to the last completed
job by querying EmbeddingJob.status == "complete" but elsewhere the code writes
status "completed", so successful jobs are missed and coverage can be reported
as 100%; update the status check to use the canonical status string or a single
shared constant/enum (e.g., replace the literal "complete" with the same value
used when writing status, or reference a shared constant/enum) in the query in
embedding_service.py (the block using last_job_q and last_total) and make the
identical fix where statuses are written (the other place referenced around the
later job update), ensuring both use the same identifier (e.g., "completed" or a
STATUS_COMPLETE constant) so the denominator picks up finished jobs correctly.
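The shared-constant idea can be shown with a minimal sketch (constant and function names are hypothetical; the real code writes to an `EmbeddingJob` row):

```python
# Sketch of the suggested fix: a single shared constant used by both the job
# writer and the coverage query, so "complete" vs "completed" can never drift.
STATUS_COMPLETED = "completed"

def finish_job(job: dict) -> None:
    job["status"] = STATUS_COMPLETED  # writer side

def completed_jobs(jobs: list[dict]) -> list[dict]:
    # Query side uses the same constant, never a re-typed literal.
    return [j for j in jobs if j["status"] == STATUS_COMPLETED]

jobs = [{"id": 1, "status": "pending"}, {"id": 2, "status": "pending"}]
finish_job(jobs[0])
assert [j["id"] for j in completed_jobs(jobs)] == [1]
```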
In `@ontokit/services/search.py`:
- Around line 81-89: The search() method currently returns an empty
SearchResponse when self._db is None, which hides dependency wiring bugs;
instead, in ontokit.services.search.py inside the search(self, query, ...) flow,
detect if self._db is None before entering the database query path and raise a
clear exception (e.g., RuntimeError or a custom DependencyError) with a
descriptive message that the DB was not injected, rather than returning
SearchResponse(results=[]). Update the branch that references self._db to raise
the error so SPARQL-only use can still construct the service but any code that
reaches the DB path in search() fails fast.
In `@ontokit/services/suggestion_service.py`:
- Around line 255-269: The save()/beacon_save() flow validates session.status
before acquiring the per-branch lock (_branch_locks[session.branch]) which
allows a concurrent discard() to delete the branch; after acquiring the branch
lock in save() and beacon_save() re-fetch or refresh the session (via
_get_session(project_id, session_id) or reload the session object) and re-check
that session.status == SuggestionSessionStatus.ACTIVE.value and raise the same
HTTPException if not; additionally wrap discard() with the same per-branch lock
(async with _branch_locks[session.branch]) so branch deletion serializes with
saves/beacon_saves to prevent races.
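The lock-then-recheck pattern described above can be modeled with plain `asyncio` primitives. This is a toy sketch with in-memory dicts standing in for the real SQLAlchemy session objects and git branches:

```python
import asyncio
from collections import defaultdict

# Per-branch locks, as in the review suggestion: save() and discard()
# serialize on the same lock, and save() re-checks status AFTER acquiring it.
_branch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
sessions = {"s1": {"branch": "b1", "status": "active"}}

async def save(session_id: str) -> str:
    branch = sessions[session_id]["branch"]
    async with _branch_locks[branch]:
        # Re-check after acquiring the lock: a concurrent discard()
        # may have deleted the branch in the meantime.
        if sessions[session_id]["status"] != "active":
            raise RuntimeError("session is no longer active")
        return "saved"

async def discard(session_id: str) -> None:
    branch = sessions[session_id]["branch"]
    async with _branch_locks[branch]:
        sessions[session_id]["status"] = "discarded"

async def main() -> None:
    await discard("s1")
    try:
        await save("s1")
    except RuntimeError:
        print("save correctly rejected after discard")

asyncio.run(main())
```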
- Around line 93-100: The code currently normalizes and strips leading
separators from project.source_file_path which lets absolute paths like
"/ontologies/core.ttl" become "ontologies/core.ttl" and bypass validation;
update the logic around project.source_file_path (the block referencing
project.source_file_path) to first call os.path.isabs(project.source_file_path)
and if true raise the same HTTPException (status.HTTP_400_BAD_REQUEST,
detail="Invalid ontology path") before doing os.path.normpath() and lstrip, then
keep the existing check for path.startswith("..") and return the cleaned path.
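The hardened path check reads naturally as a small pure function. A sketch under the stated assumptions, with `ValueError` standing in for the real `HTTPException(status.HTTP_400_BAD_REQUEST, ...)`:

```python
import os

def clean_ontology_path(path: str) -> str:
    """Reject absolute paths up front, then normalize, per the review note."""
    if os.path.isabs(path):  # absolute paths must not sneak past via lstrip
        raise ValueError("Invalid ontology path")
    cleaned = os.path.normpath(path).lstrip("/")
    if cleaned.startswith(".."):  # keep the existing traversal check
        raise ValueError("Invalid ontology path")
    return cleaned

assert clean_ontology_path("ontologies/core.ttl") == "ontologies/core.ttl"
for bad in ("/ontologies/core.ttl", "../secrets.ttl"):
    try:
        clean_ontology_path(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad} should have been rejected")
```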
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 95f0f74c-68bd-4697-a95b-2af5e268b559
📒 Files selected for processing (20)
- alembic/versions/l0g1h2i3j4k5_add_suggestion_sessions_table.py
- ontokit/api/routes/embeddings.py
- ontokit/api/routes/projects.py
- ontokit/api/routes/quality.py
- ontokit/api/routes/semantic_search.py
- ontokit/api/utils/redis.py
- ontokit/core/auth.py
- ontokit/core/beacon_token.py
- ontokit/main.py
- ontokit/schemas/embeddings.py
- ontokit/services/consistency_service.py
- ontokit/services/embedding_service.py
- ontokit/services/rdf_utils.py
- ontokit/services/search.py
- ontokit/services/suggestion_service.py
- tests/unit/test_beacon_token.py
- tests/unit/test_consistency_service.py
- tests/unit/test_cross_reference.py
- tests/unit/test_duplicate_detection.py
- tests/unit/test_rdf_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- ontokit/api/utils/redis.py
ontokit/core/auth.py
Outdated
def _extract_roles(payload: dict) -> list[str]:
    """Extract role names from the Zitadel roles claim.

    The Zitadel roles claim format is:
    {"urn:zitadel:iam:org:project:roles": {"role_name": {"org_id": "org_name"}, ...}}

    This function extracts just the role names (top-level keys of the roles dict).
    """
    roles_claim = payload.get(ZITADEL_ROLES_CLAIM)
    if not roles_claim or not isinstance(roles_claim, dict):
        return []
    return list(roles_claim.keys())
🧩 Analysis chain
🏁 Script executed:
cat -n ontokit/core/auth.py | head -100
Repository: CatholicOS/ontokit-api
Length of output: 3775
🏁 Script executed:
cat -n ontokit/core/auth.py | tail -50
Length of output: 2114
🏁 Script executed:
sed -n '100,150p' ontokit/core/auth.py
Length of output: 1971
🏁 Script executed:
# Check for mypy configuration and CI files
find . -name "pyproject.toml" -o -name "setup.cfg" -o -name ".mypy.ini" -o -name "mypy.ini" | head -5
ls -la .github/workflows/ 2>/dev/null || echo "No .github/workflows found"
find . -name "*.yml" -o -name "*.yaml" | grep -E "(workflow|ci|github)" | head -10
Length of output: 286
🏁 Script executed:
cat pyproject.toml
Length of output: 2717
🏁 Script executed:
# Search for all bare dict usages in the auth.py file
rg '\bdict\b' ontokit/core/auth.py -n
Length of output: 428
🏁 Script executed:
# Check the imports section more carefully
sed -n '1,20p' ontokit/core/auth.py
Length of output: 581
🏁 Script executed:
# Check the fetch_userinfo function
sed -n '215,240p' ontokit/core/auth.py
Length of output: 899
🏁 Script executed:
# Check if there are any CI test runs or documentation about type checking
find . -type f -name "*.yml" -o -name "*.yaml" | xargs grep -l "mypy" 2>/dev/null || echo "No mypy in CI found"
cat .github/workflows/release.yml
Length of output: 2994
🏁 Script executed:
# Get complete view of all functions with bare dict annotations
sed -n '215,242p' ontokit/core/auth.py
Length of output: 916
🏁 Script executed:
# Double-check all function signatures that return dict or accept dict
rg 'def.*->.*dict|def.*\(.*:\s*dict' ontokit/core/auth.py -n
Length of output: 243
🏁 Script executed:
# Verify the full extent of bare dict usage - check lines 70-90 for _jwks_cache
sed -n '70,90p' ontokit/core/auth.py
Length of output: 821
🏁 Script executed:
# Verify mypy would actually fail on these bare dict annotations by checking the strict mode config
grep -A 5 "\[tool.mypy\]" pyproject.toml
Length of output: 166
Add type parameters to bare dict annotations in new auth helpers.
Lines 25, 71, and 76 use bare `dict`, which fails the strict mypy checks enabled in CI. Introduce a `JsonObject = dict[str, Any]` type alias and apply it consistently:
- Line 25: the `_extract_roles` parameter should be `Mapping[str, Any]` (read-only interface)
- Line 71: `_jwks_cache` should be `JsonObject | None`
- Line 76: the `get_jwks` return type should be `JsonObject`
Add `Any` to the imports from `typing` and define the alias near the top of the file.
Note: `fetch_userinfo` at line 215 has the same bare `dict | None` pattern and will also need the same fix for complete type coverage.
🧰 Tools
🪛 GitHub Actions: Distribution
[error] 25-25: Mypy: Missing type parameters for generic type "dict" [type-arg]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ontokit/core/auth.py` around lines 25 - 36, The current code uses bare dict
annotations causing mypy failures; add from typing import Any and define a
JsonObject = dict[str, Any] (or use Mapping[str, Any] for read-only) near the
top of the file, then change _extract_roles(payload: dict) -> list[str] to
_extract_roles(payload: Mapping[str, Any]) -> list[str], annotate _jwks_cache as
JsonObject | None, change get_jwks return type to JsonObject, and update
fetch_userinfo's dict | None to JsonObject | None so all helpers use the new
typed alias.
logger.info("Starting OntoKit API v%s (env=%s)", __version__, settings.app_env)
🧩 Analysis chain
🏁 Script executed:
fd -t f "main.py" | head -20
Length of output: 81
🏁 Script executed:
cat -n ontokit/main.py | head -60
Length of output: 2532
🏁 Script executed:
cat -n ontokit/main.py | sed -n '40,50p'
Length of output: 585
🏁 Script executed:
cat -n ontokit/main.py | sed -n '50,65p'
Length of output: 749
🏁 Script executed:
cat -n ontokit/main.py | sed -n '140,155p'
Length of output: 755
🏁 Script executed:
rg "__version__" -B 3 -A 3 ontokit/
Length of output: 1639
🏁 Script executed:
rg "aioredis" ontokit/main.py -B 2 -A 2
Length of output: 566
🏁 Script executed:
rg "_error_response" ontokit/main.py -B 5 -A 5
Length of output: 1709
🏁 Script executed:
find ontokit -name "version.py" -o -name "py.typed"
Length of output: 84
🏁 Script executed:
cat ontokit/__init__.py
Length of output: 202
🏁 Script executed:
cat ontokit/version.py 2>/dev/null || echo "File not found"
Length of output: 256
🏁 Script executed:
rg "VERSION_BASE|__version__" ontokit/
Length of output: 529
🏁 Script executed:
cat pyproject.toml | grep -A 20 "\[tool.mypy\]"
Length of output: 291
🏁 Script executed:
rg "dict\[" ontokit/main.py
Length of output: 145
🏁 Script executed:
ls -la ontokit/py.typed 2>/dev/null || echo "No py.typed marker found"
Length of output: 90
🏁 Script executed:
cat -n ontokit/main.py | sed -n '1,30p'
Length of output: 1282
🏁 Script executed:
rg "from typing import|from __future__ import" ontokit/main.py
Length of output: 48
🏁 Script executed:
rg "redis.asyncio" ontokit/ -B 2 -A 2
Length of output: 310
🏁 Script executed:
python3 << 'EOF'
# Check the structure of the imports and potential type issues
import sys
# Analyze the issues:
issues = {
"line_42_version": {
"code": "logger.info(\"Starting OntoKit API v%s (env=%s)\", __version__, settings.app_env)",
"import": "from ontokit import __version__",
"source": "ontokit/__init__.py: from ontokit.version import VERSION_BASE as __version__",
"analysis": "__version__ is properly imported from version module, no obvious attr-defined issue"
},
"lines_55_58_aioredis": {
"code": "redis_pool = aioredis.from_url(str(settings.redis_url), decode_responses=True)",
"import": "import redis.asyncio as aioredis",
"annotation": "redis_pool: aioredis.Redis | None = None",
"issue": "from_url() may return untyped Redis object or type mismatch with decode_responses"
},
"line_146_dict": {
"code": "body: dict = {\"error\": {...}}",
"issue": "Bare dict without type arguments violates strict mypy - should be dict[str, Any] or dict[str, object]"
}
}
for issue, details in issues.items():
print(f"\n{issue}:")
for key, val in details.items():
print(f" {key}: {val}")
EOF
Length of output: 860
🏁 Script executed:
head -30 ontokit/main.py
Length of output: 1072
🏁 Script executed:
cat ontokit/version.py
Length of output: 256
🏁 Script executed:
rg "from typing" ontokit/version.py
Length of output: 48
🏁 Script executed:
find . -name ".github" -type d | head -5
Length of output: 75
🏁 Script executed:
fd "\.yml$|\.yaml$" .github/workflows/ 2>/dev/null | head -10
Length of output: 95
🏁 Script executed:
cat .github/workflows/*.yml 2>/dev/null | grep -A 10 -B 2 "mypy"
Length of output: 399
🏁 Script executed:
find .github -name "*.yml" -o -name "*.yaml" | xargs cat
Length of output: 2936
🏁 Script executed:
cat -n ontokit/version.py
Length of output: 298
🏁 Script executed:
rg "mypy|type" pyproject.toml -B 2 -A 5
Length of output: 357
🏁 Script executed:
cd ontokit && python3 -c "from version import VERSION_BASE; print(type(VERSION_BASE).__name__, repr(VERSION_BASE))"
Length of output: 77
🏁 Script executed:
python3 << 'EOF'
import sys
try:
import redis.asyncio as aioredis
# Check the return type of from_url
import inspect
sig = inspect.signature(aioredis.from_url)
print(f"from_url signature: {sig}")
print(f"from_url return annotation: {sig.return_annotation}")
except Exception as e:
print(f"Error: {e}")
EOF
Length of output: 96
🏁 Script executed:
cat -n ontokit/main.py | sed -n '140,160p'
Length of output: 1002
🏁 Script executed:
grep -n "from typing import\|from __future__ import" ontokit/main.py
Length of output: 48
🏁 Script executed:
find . -name "*.log" -o -name "*mypy*" 2>/dev/null | head -20
Length of output: 48
🏁 Script executed:
git log --oneline -20 2>/dev/null | head -20
Length of output: 147
🏁 Script executed:
cat ontokit/__init__.py
Length of output: 202
🏁 Script executed:
rg "Any|Dict" ontokit/main.py
Length of output: 48
🏁 Script executed:
python3 << 'EOF'
# Check what type annotations are needed
import sys
# Issue 1: body: dict without type arguments
issue_1 = """
def _error_response(...) -> JSONResponse:
body: dict = {"error": {"code": code, "message": message}}
# In strict mypy mode, this needs: dict[str, dict[str, ...] | ...]
"""
# Issue 2: Check if redis.asyncio has proper typing
try:
from redis.asyncio import from_url
print("redis.asyncio.from_url is available")
except ImportError as e:
print(f"redis.asyncio import error: {e}")
# Issue 3: __version__ from version module
# VERSION_BASE is untyped, so it may be seen as Any
print("\nIssue Analysis:")
print("1. body: dict - STRICT MYPY VIOLATION (needs type args)")
print("2. __version__ - Depends on whether VERSION_BASE has type hints (currently NO)")
print("3. aioredis.from_url() - Should be typed, but may have return type issues")
EOF
Length of output: 343
Fix mypy strict mode violations before release.
Three type annotation issues are blocking the distribution pipeline:
- Line 146 — `body: dict` lacks type arguments. Use `body: dict[str, dict[str, str] | object]`.
- Line 42 — `__version__` requires a type annotation on `VERSION_BASE` in `ontokit/version.py`. Add `VERSION_BASE: str = VERSION.replace(...)`.
- Lines 55–58 — the `aioredis.from_url()` return type must match the `aioredis.Redis | None` annotation on line 31. Verify that the return type of `from_url()` with `decode_responses=True` is compatible with the declared type.
These must be resolved before release.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ontokit/main.py` around lines 42 - 43, Annotate the loose types: change the
function parameter `body` to the precise mapping type `dict[str, dict[str, str]
| object]`; in ontokit/version.py add an explicit type to `VERSION_BASE` (e.g.
`VERSION_BASE: str = VERSION.replace(...)`) so `__version__` is clearly a str;
and make the Redis typing consistent by matching the declared `redis_client`
type to whatever `aioredis.from_url(..., decode_responses=True)` actually
returns (either change the `redis_client` annotation to the concrete return type
from `aioredis.from_url` or use `typing.cast(...)` when assigning the result) so
mypy strict mode no longer reports a mismatch.
ontokit/services/search.py
Outdated
from ontokit.schemas.search import SearchQuery, SearchResponse, SPARQLQuery, SPARQLResponse
from rdflib import BNode, Graph, Literal, URIRef
from rdflib.plugins.sparql.parser import parseQuery
from rdflib.plugins.sparql.processor import SPARQLResult
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Repository dependency pins =="
fd 'pyproject.toml|poetry.lock|uv.lock|pdm.lock|requirements*.txt|setup.py' -x sh -c '
echo "--- {} ---"
rg -n "rdflib" "{}" || true
'
echo
echo "== Sandbox RDFLib signature (if installed) =="
python - <<'PY'
try:
import inspect
from typing import get_type_hints
from rdflib import Graph
print("Graph.query signature:", inspect.signature(Graph.query))
try:
print("Graph.query type hints:", get_type_hints(Graph.query))
except Exception as exc:
print("Could not resolve type hints:", exc)
except Exception as exc:
print("rdflib not importable in sandbox:", exc)
PY
Length of output: 253
🏁 Script executed:
cat -n ontokit/services/search.py | head -30
Length of output: 1167
🏁 Script executed:
cat -n ontokit/services/search.py | sed -n '220,235p'
Length of output: 841
🏁 Script executed:
rg -n "from rdflib" ontokit/services/search.py
Length of output: 229
🏁 Script executed:
fd -e '.mypy.ini' -o -name 'pyproject.toml' -o -name 'setup.cfg' | xargs rg -l "mypy|type" 2>/dev/null | head -5
Length of output: 420
🏁 Script executed:
rg -n "mypy" pyproject.toml
Length of output: 104
🏁 Script executed:
web_search
RDFLib Graph.query() return type rdflib.query.Result vs SPARQLResult version 7.1
Length of output: 282
🌐 Web query:
RDFLib 7.1.0 Graph.query return type rdflib.query.Result
💡 Result:
In RDFLib 7.1.0, rdflib.graph.Graph.query(...) is annotated to return rdflib.query.Result (i.e., query.Result). [1]
That Result object behaves differently depending on SPARQL query type: SELECT → iterates ResultRows, ASK → yields a single bool / bool(result), CONSTRUCT/DESCRIBE → iterates triples (and may also expose a .graph). [2][3]
🌐 Web query:
RDFLib SPARQLResult vs Result typing
💡 Result:
In RDFLib, Result is the public, generic query-result type, while SPARQLResult is the SPARQL plugin’s concrete implementation of that type.
- `rdflib.query.Result`: common container for all SPARQL query forms (`SELECT`, `ASK`, `CONSTRUCT`, `DESCRIBE`). It "shape-shifts" at runtime: `SELECT` iterates `ResultRow` rows; `ASK` yields a single boolean via `bool(result)` (and iteration); `CONSTRUCT`/`DESCRIBE` iterates triples. It also exposes `.type`, `.vars`, `.bindings`, `.graph`, `.askAnswer`. [1]
- `rdflib.plugins.sparql.processor.SPARQLResult`: subclass of `Result` created by the SPARQL processor. Its constructor simply takes the processor's internal mapping (`res`) and populates the base `Result` fields (notably calling `Result.__init__(res["type_"])` and assigning `.vars`, `.bindings`, `.askAnswer`, `.graph`). [2]
Typing guidance (practical):
- For user code, annotate results as `rdflib.query.Result` (that's the stable API surface). You can branch on `result.type` ("SELECT", "ASK", …) to narrow behavior. [1]
- Only type as `SPARQLResult` if you explicitly depend on the SPARQL plugin implementation (usually unnecessary, since the fields you typically use are already on `Result`). [1][2]
- For value types inside bindings/rows, RDFLib defines type aliases like `QueryResultValueType = Union[IdentifiedNode, Literal]` (useful for static typing when you process term values). [3]
- Parsing/serializing SPARQL result formats (XML/JSON/CSV/TSV/TXT) is standardized around the `Result` type via the query-results plugins. [4]
Sources: [1] (rdflib.readthedocs.io), [2] (rdflib.readthedocs.io), [3] (rdflib.readthedocs.io), [4] (rdflib.readthedocs.io)
Citations:
- 1: https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.query/
- 2: https://rdflib.readthedocs.io/en/latest/apidocs/rdflib.plugins.sparql.processor/?utm_source=openai
- 3: https://rdflib.readthedocs.io/en/latest/apidocs/rdflib.query/
- 4: https://rdflib.readthedocs.io/en/stable/plugins/
Use RDFLib's public query result type instead of the SPARQL plugin's internal implementation.
In RDFLib 7.1.0+, Graph.query() is typed to return rdflib.query.Result, not rdflib.plugins.sparql.processor.SPARQLResult. Although SPARQLResult is a subclass of Result, the stable public API for user code is Result. This mismatch will trigger mypy errors at line 227.
Change the import from rdflib.plugins.sparql.processor.SPARQLResult to rdflib.query.Result and update the type annotation accordingly.
Suggested fix
-from rdflib.plugins.sparql.processor import SPARQLResult
+from rdflib.query import Result
@@
- result: SPARQLResult = graph.query(query_text)
+ result: Result = graph.query(query_text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ontokit/services/search.py` at line 9, Replace the internal SPARQLResult
import with RDFLib's public Result type: change the import from
rdflib.plugins.sparql.processor.SPARQLResult to rdflib.query.Result and update
any type annotations that reference SPARQLResult (e.g., the variable or return
type used around the Graph.query() handling at the site flagged) to use Result
instead so the annotation matches Graph.query()'s public API.
Add explicit generic type parameters, cast() at DB-schema boundaries for Literal types, missing function annotations, pygit2/GitPython stub workarounds, and pydantic mypy plugin. Add types-python-jose to dev deps. Update pre-commit mypy hook with missing additional_dependencies and remove types-redis (conflicts with redis 5.x inline stubs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the ARQ enqueue_job() call in try/except so that if the pool or enqueue fails, the pending EmbeddingJob row is deleted rather than leaving a stuck active job that blocks future requests with 409. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_load_graph() now returns (graph, resolved_branch) so callers use the actual branch for Redis cache keys and ConsistencyCheckResult metadata instead of the raw (possibly None) query parameter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the project's actual default branch instead of hardcoding "main" when the branch query parameter is omitted, matching the behavior already used in rank_suggestions(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard against None branch in rank_suggestions() by resolving to the project's default branch, preventing IS NULL filters that would silently return no results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move db.refresh() outside the try/except that triggers branch deletion so that a refresh failure after a successful commit does not incorrectly delete the git branch, leaving an active session pointing at nothing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeRabbit fixes — batch 4: mypy strict-mode compliance
Architectural fixes from CodeRabbit review (batch 5)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use WEB_PORT env var (default 3000) instead of hardcoded port in OIDC redirect URIs, so the setup script works with non-standard ports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add descriptions for all 16 endpoint groups and a rich markdown API description covering key capabilities, authentication, and rate limiting for the Scalar-powered API Reference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a per-user notification model, service, and API endpoints (list, mark-read, mark-all-read) for project events. Integrates notification dispatch into suggestion, pull request, and join request services. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and worker task Implements the full backend for tracking upstream GitHub repository files, matching the existing frontend API client. Includes config CRUD, background check job via ARQ, job status polling, and sync event history. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…anges, resubmit Implements the missing review/admin API for suggestion sessions that the frontend already expects. Adds new session statuses (merged, rejected, changes-requested), review fields on the model, and five new routes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminates noisy browser console errors on every page load for projects without upstream sync configured. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… (PEP 735) Dev tools are development-only and should not be exposed as installable extras in the published package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add api_base_url config setting for webhook callback URLs - Add github_hook_id column to track webhook state on GitHub - Add scope helpers and webhook management methods to GitHub service - Add POST webhook-setup endpoint for auto-detecting/creating webhooks - Guard against localhost URLs when auto-creating webhooks - Add webhook frequency option to upstream sync types - Rearrange GitHub integration info boxes in project settings - Enhance WebhookConfigPanel with auto-setup status and fallback to manual - Add admin:repo_hook scope guidance in user settings page Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When enabling webhooks on a GitHub integration, automatically create an UpstreamSyncConfig with frequency="webhook" using the integration's repo data. When disabling webhooks, downgrade to frequency="manual". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@damienriehl I added CodeRabbit, which I have been using on a few other projects; it does a wonderful job of reviewing and nit-picking. I then feed CodeRabbit's review issues back into Claude Code. I find it best to address each issue with a commit but specifically instruct Claude not to push until all CodeRabbit issues have been addressed, because CodeRabbit triggers a new review every time you push and it also implements rate-limiting, so with many pushes it'll just quit. So best to make all the commits first, then push them all at once when they're ready. I either copy the "Prompt for AI Agents" section for each issue, one at a time; or, at the risk of perhaps consuming a few more tokens but speeding up the process, I tell Claude to directly read all the issues raised by CodeRabbit's review on the PR (I have the GitHub CLI installed, which Claude uses to read the PR), again instructing it specifically to address each issue with one commit per fix but not push until all have been addressed. So now, linting and tests are passing!
I added a webhooks UI to the GitHub integration, which should be a valid alternative to "Upstream Source Tracking". The idea is that as soon as webhooks are enabled and installed, "Upstream Source Tracking" becomes read-only; if webhooks are not enabled, "Upstream Source Tracking" can be set but will default to the same GitHub repo when GitHub integration is active.
Summary
Stats
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Infrastructure
Version
Tests