diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index b362714..0e9f72d 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -2,16 +2,16 @@ ## Current Implemented Slice -AliceBot now implements the accepted repo slice through Sprint 5F. The shipped backend includes: +AliceBot now implements the accepted repo slice through Sprint 5G. The shipped backend includes: - foundation continuity storage over `users`, `threads`, `sessions`, and append-only `events` - deterministic tracing and context compilation over durable continuity, memory, entity, and entity-edge records - governed memory admission, explicit-preference extraction, memory review labels, review queue reads, evaluation summary reads, explicit embedding config and memory-embedding storage, direct semantic retrieval, and deterministic hybrid compile-path memory merge - deterministic prompt assembly and one no-tools response path that persists assistant replies as immutable continuity events - user-scoped consents, policies, policy evaluation, tool registry, allowlist evaluation, tool routing, approval request persistence, approval resolution, approved-only proxy execution through the in-process `proxy.echo` handler, durable execution review, and execution-budget lifecycle plus enforcement -- durable `tasks`, `task_steps`, `task_workspaces`, `task_artifacts`, and `task_artifact_chunks`, deterministic task-step sequencing, explicit task-step transitions, explicit manual continuation with lineage through `parent_step_id`, `source_approval_id`, and `source_execution_id`, explicit `tool_executions.task_step_id` linkage for execution synchronization, deterministic rooted local task-workspace provisioning, explicit rooted local artifact registration, deterministic local text-artifact ingestion into durable chunk rows, deterministic lexical artifact-chunk retrieval over durable chunk rows, and optional compile-path artifact chunk inclusion as a separate context section +- durable `tasks`, `task_steps`, 
`task_workspaces`, `task_artifacts`, `task_artifact_chunks`, and `task_artifact_chunk_embeddings`, deterministic task-step sequencing, explicit task-step transitions, explicit manual continuation with lineage through `parent_step_id`, `source_approval_id`, and `source_execution_id`, explicit `tool_executions.task_step_id` linkage for execution synchronization, deterministic rooted local task-workspace provisioning, explicit rooted local artifact registration, deterministic local text-artifact ingestion into durable chunk rows, deterministic lexical artifact-chunk retrieval over durable chunk rows, optional compile-path artifact chunk inclusion as a separate context section, and explicit user-scoped artifact-chunk embedding persistence tied to existing embedding configs -The current multi-step boundary is narrow and explicit. Manual continuation is implemented and review-passed. Approval resolution and proxy execution now both use explicit task-step linkage rather than first-step inference. Task workspaces are now implemented only as deterministic rooted local boundaries, and task artifacts are now implemented only as explicit rooted local-file registrations, narrow deterministic text ingestion under those workspaces, lexical retrieval over persisted chunk rows, and optional compile-path inclusion of retrieved artifact chunks in a separate response section. Broader runner-style orchestration, automatic multi-step progression, artifact chunk embeddings and semantic retrieval, rich-document parsing, connectors, and new side-effect surfaces are still planned later and must not be described as live behavior. +The current multi-step boundary is narrow and explicit. Manual continuation is implemented and review-passed. Approval resolution and proxy execution now both use explicit task-step linkage rather than first-step inference. 
Task workspaces are now implemented only as deterministic rooted local boundaries, and task artifacts are now implemented only as explicit rooted local-file registrations, narrow deterministic text ingestion under those workspaces, lexical retrieval over persisted chunk rows, optional compile-path inclusion of retrieved artifact chunks in a separate response section, and explicit artifact-chunk embedding storage tied to existing embedding configs. Broader runner-style orchestration, automatic multi-step progression, artifact-chunk semantic retrieval, rich-document parsing, connectors, and new side-effect surfaces are still planned later and must not be described as live behavior. ## Implemented Now @@ -24,7 +24,7 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i - memory and retrieval: `POST /v0/memories/admit`, `POST /v0/memories/extract-explicit-preferences`, `GET /v0/memories`, `GET /v0/memories/review-queue`, `GET /v0/memories/evaluation-summary`, `POST /v0/memories/semantic-retrieval`, `GET /v0/memories/{memory_id}`, `GET /v0/memories/{memory_id}/revisions`, `POST /v0/memories/{memory_id}/labels`, `GET /v0/memories/{memory_id}/labels` - embeddings and graph seams: `POST /v0/embedding-configs`, `GET /v0/embedding-configs`, `POST /v0/memory-embeddings`, `GET /v0/memories/{memory_id}/embeddings`, `GET /v0/memory-embeddings/{memory_embedding_id}`, `POST /v0/entities`, `GET /v0/entities`, `GET /v0/entities/{entity_id}`, `POST /v0/entity-edges`, `GET /v0/entities/{entity_id}/edges` - governance: `POST /v0/consents`, `GET /v0/consents`, `POST /v0/policies`, `GET /v0/policies`, `GET /v0/policies/{policy_id}`, `POST /v0/policies/evaluate`, `POST /v0/tools`, `GET /v0/tools`, `GET /v0/tools/{tool_id}`, `POST /v0/tools/allowlist/evaluate`, `POST /v0/tools/route`, `POST /v0/approvals/requests`, `GET /v0/approvals`, `GET /v0/approvals/{approval_id}`, `POST /v0/approvals/{approval_id}/approve`, `POST /v0/approvals/{approval_id}/reject`, 
`POST /v0/approvals/{approval_id}/execute` -- task and execution review: `GET /v0/tasks`, `GET /v0/tasks/{task_id}`, `POST /v0/tasks/{task_id}/workspace`, `GET /v0/task-workspaces`, `GET /v0/task-workspaces/{task_workspace_id}`, `POST /v0/task-workspaces/{task_workspace_id}/artifacts`, `GET /v0/task-artifacts`, `GET /v0/task-artifacts/{task_artifact_id}`, `POST /v0/task-artifacts/{task_artifact_id}/ingest`, `GET /v0/task-artifacts/{task_artifact_id}/chunks`, `POST /v0/tasks/{task_id}/artifact-chunks/retrieve`, `POST /v0/task-artifacts/{task_artifact_id}/chunks/retrieve`, `GET /v0/tasks/{task_id}/steps`, `GET /v0/task-steps/{task_step_id}`, `POST /v0/tasks/{task_id}/steps`, `POST /v0/task-steps/{task_step_id}/transition`, `POST /v0/execution-budgets`, `GET /v0/execution-budgets`, `GET /v0/execution-budgets/{execution_budget_id}`, `POST /v0/execution-budgets/{execution_budget_id}/deactivate`, `POST /v0/execution-budgets/{execution_budget_id}/supersede`, `GET /v0/tool-executions`, `GET /v0/tool-executions/{execution_id}` +- task and execution review: `GET /v0/tasks`, `GET /v0/tasks/{task_id}`, `POST /v0/tasks/{task_id}/workspace`, `GET /v0/task-workspaces`, `GET /v0/task-workspaces/{task_workspace_id}`, `POST /v0/task-workspaces/{task_workspace_id}/artifacts`, `GET /v0/task-artifacts`, `GET /v0/task-artifacts/{task_artifact_id}`, `POST /v0/task-artifacts/{task_artifact_id}/ingest`, `GET /v0/task-artifacts/{task_artifact_id}/chunks`, `POST /v0/tasks/{task_id}/artifact-chunks/retrieve`, `POST /v0/task-artifacts/{task_artifact_id}/chunks/retrieve`, `POST /v0/task-artifact-chunk-embeddings`, `GET /v0/task-artifacts/{task_artifact_id}/chunk-embeddings`, `GET /v0/task-artifact-chunks/{task_artifact_chunk_id}/embeddings`, `GET /v0/task-artifact-chunk-embeddings/{task_artifact_chunk_embedding_id}`, `GET /v0/tasks/{task_id}/steps`, `GET /v0/task-steps/{task_step_id}`, `POST /v0/tasks/{task_id}/steps`, `POST /v0/task-steps/{task_step_id}/transition`, `POST 
/v0/execution-budgets`, `GET /v0/execution-budgets`, `GET /v0/execution-budgets/{execution_budget_id}`, `POST /v0/execution-budgets/{execution_budget_id}/deactivate`, `POST /v0/execution-budgets/{execution_budget_id}/supersede`, `GET /v0/tool-executions`, `GET /v0/tool-executions/{execution_id}` - `apps/web` and `workers` remain starter shells only. ### Data Foundation @@ -37,7 +37,7 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i - memory and retrieval tables: `memories`, `memory_revisions`, `memory_review_labels`, `embedding_configs`, `memory_embeddings` - graph tables: `entities`, `entity_edges` - governance tables: `consents`, `policies`, `tools`, `approvals`, `tool_executions`, `execution_budgets` - - task lifecycle tables: `tasks`, `task_steps`, `task_workspaces`, `task_artifacts`, `task_artifact_chunks` + - task lifecycle tables: `tasks`, `task_steps`, `task_workspaces`, `task_artifacts`, `task_artifact_chunks`, `task_artifact_chunk_embeddings` - `events`, `trace_events`, and `memory_revisions` are append-only by application contract and database enforcement. - `memory_review_labels` are append-only by database enforcement. - `tasks` are explicit user-scoped lifecycle records keyed to one thread and one tool, with durable request/tool snapshots, status in `pending_approval | approved | executed | denied | blocked`, and latest approval/execution pointers for the current narrow lifecycle seam. @@ -51,17 +51,18 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i - `task_workspaces` persist one active workspace record per visible task and user, store a deterministic `local_path`, and enforce that active uniqueness through a partial unique index on `(user_id, task_id)`. 
- `task_artifacts` persist explicit user-scoped artifact rows linked to both `tasks` and `task_workspaces`, store `status = registered`, `ingestion_status in ('pending', 'ingested')`, store only a workspace-relative `relative_path` plus optional `media_type_hint`, and enforce deterministic duplicate rejection through a unique index on `(user_id, task_workspace_id, relative_path)`. - `task_artifact_chunks` persist explicit user-scoped durable chunk rows linked to one artifact, store ordered `sequence_no`, zero-based `char_start`, exclusive `char_end_exclusive`, and chunk `text`, and enforce deterministic uniqueness through a unique index on `(user_id, task_artifact_id, sequence_no)`. +- `task_artifact_chunk_embeddings` persist explicit user-scoped durable embedding rows linked to one visible chunk and one visible embedding config, store validated `dimensions` and `vector`, and enforce deterministic uniqueness through a unique index on `(user_id, task_artifact_chunk_id, embedding_config_id)`. - `execution_budgets` enforce at most one active budget per `(user_id, tool_key, domain_hint)` selector scope through a partial unique index. - Per-request user context is set in the database through `app.current_user_id()`. - `TASK_WORKSPACE_ROOT` defines the only allowed base directory for workspace provisioning, and the live path rule is `resolved_root / user_id / task_id`. ### Repo Boundaries In This Slice -- `apps/api`: implemented API, store, contracts, service logic, and migrations for continuity, tracing, memory, embeddings, entities, policies, tools, approvals, proxy execution, execution budgets, tasks, task steps, task workspaces, task artifacts, deterministic lexical artifact chunk retrieval, and narrow compile-path artifact chunk inclusion. 
+- `apps/api`: implemented API, store, contracts, service logic, and migrations for continuity, tracing, memory, embeddings, entities, policies, tools, approvals, proxy execution, execution budgets, tasks, task steps, task workspaces, task artifacts, artifact-chunk embeddings, deterministic lexical artifact chunk retrieval, and narrow compile-path artifact chunk inclusion. - `apps/web`: minimal shell only; no shipped workflow UI. - `workers`: scaffold only; no background jobs or runner logic are implemented. - `infra`: local development bootstrap assets only. -- `tests`: unit and Postgres-backed integration coverage for the shipped seams above, including Sprint 4O task-step lineage/manual continuation, Sprint 4S step-linked execution synchronization, Sprint 5A task-workspace provisioning, Sprint 5C task-artifact registration, Sprint 5D local artifact ingestion plus chunk reads, Sprint 5E lexical artifact-chunk retrieval, and Sprint 5F compile-path artifact chunk integration. +- `tests`: unit and Postgres-backed integration coverage for the shipped seams above, including Sprint 4O task-step lineage/manual continuation, Sprint 4S step-linked execution synchronization, Sprint 5A task-workspace provisioning, Sprint 5C task-artifact registration, Sprint 5D local artifact ingestion plus chunk reads, Sprint 5E lexical artifact-chunk retrieval, Sprint 5F compile-path artifact chunk integration, and Sprint 5G artifact-chunk embedding persistence and reads. ## Core Flows Implemented Now @@ -202,12 +203,23 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i 6. Order matches deterministically by matched query term count desc, first match offset asc, relative path asc, sequence no asc, and id asc. 7. Return stable summary metadata describing query terms, scope, searched artifact count, and ordering. +### Artifact Chunk Embedding Storage + +1. Accept a user-scoped `POST /v0/task-artifact-chunk-embeddings` request. +2. 
Require `task_artifact_chunk_id` to reference one visible persisted chunk. +3. Require `embedding_config_id` to reference one visible persisted embedding config. +4. Normalize the submitted vector as finite numeric values only. +5. Reject writes unless the vector length matches `embedding_config.dimensions`. +6. Persist or update exactly one embedding per visible `(task_artifact_chunk_id, embedding_config_id)` pair. +7. Expose deterministic reads by artifact scope, chunk scope, and embedding id. +8. Order list reads by chunk sequence first, then `created_at ASC`, then `id ASC`. + ## Security Model Implemented Now -- User-owned continuity, trace, memory, embedding, entity, governance, task, task-step, task-workspace, task-artifact, and task-artifact-chunk tables enforce row-level security. +- User-owned continuity, trace, memory, embedding, entity, governance, task, task-step, task-workspace, task-artifact, task-artifact-chunk, and task-artifact-chunk-embedding tables enforce row-level security. - The runtime role is limited to the narrow `SELECT` / `INSERT` / `UPDATE` permissions required by the shipped seams; there is no broad DDL or unrestricted table access at runtime. - Cross-user references are constrained through composite foreign keys on `(id, user_id)` where the schema needs ownership-linked joins. -- Approval, execution, memory, entity, task/task-step, task-workspace, task-artifact, and task-artifact-chunk reads all operate only inside the current user scope. +- Approval, execution, memory, entity, task/task-step, task-workspace, task-artifact, task-artifact-chunk, and task-artifact-chunk-embedding reads all operate only inside the current user scope. 
- Task-step manual continuation adds both schema-level and service-level lineage protection: - schema-level: user-scoped foreign keys and parent-not-self check - service-level: same-task, latest-step, visible-approval, visible-execution, and parent-outcome-match validation diff --git a/BUILD_REPORT.md b/BUILD_REPORT.md index ad24605..fdcf29b 100644 --- a/BUILD_REPORT.md +++ b/BUILD_REPORT.md @@ -2,283 +2,166 @@ ## sprint objective -Implement Sprint 5F: Artifact Chunk Compile Integration V0 by extending `POST /v0/context/compile` so it can optionally retrieve durable artifact chunks through the existing lexical retrieval seam, return them in a separate context-pack section, and trace artifact include/exclude decisions without adding embeddings, semantic retrieval, connectors, runners, or UI work. +Implement Sprint 5G: Artifact Chunk Embedding Substrate by adding durable, user-scoped `task_artifact_chunk_embeddings` records tied to existing `embedding_configs`, with strict vector validation, deterministic reads, and no semantic retrieval, compile-path semantic use, connector, runner, or UI changes. 
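The deterministic read ordering this sprint commits to (chunk sequence first, then creation time, then id) can be sketched in plain Python. This is an illustrative sketch only: the row dicts and `list_order_key` helper below are hypothetical stand-ins, not the shipped store code, which orders rows in SQL.

```python
from datetime import datetime, timezone

# Hypothetical in-memory rows standing in for task_artifact_chunk_embeddings records.
rows = [
    {"id": "b2", "task_artifact_chunk_sequence_no": 2,
     "created_at": datetime(2026, 3, 14, 12, 0, tzinfo=timezone.utc)},
    {"id": "a2", "task_artifact_chunk_sequence_no": 1,
     "created_at": datetime(2026, 3, 14, 12, 5, tzinfo=timezone.utc)},
    {"id": "a1", "task_artifact_chunk_sequence_no": 1,
     "created_at": datetime(2026, 3, 14, 12, 5, tzinfo=timezone.utc)},
]

def list_order_key(row):
    # Mirrors TASK_ARTIFACT_CHUNK_EMBEDDING_LIST_ORDER:
    # task_artifact_chunk_sequence_no_asc, then created_at_asc, then id_asc.
    return (row["task_artifact_chunk_sequence_no"], row["created_at"], row["id"])

ordered = sorted(rows, key=list_order_key)
```

Because the tuple key is total over `(sequence_no, created_at, id)`, two reads over the same rows always return the same order, which is what makes the list responses reproducible in tests.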
## completed work -- Added optional compile-request artifact retrieval input with one explicit scope per request: - - `artifact_retrieval.kind = "task"` with `task_id`, `query`, `limit` - - `artifact_retrieval.kind = "artifact"` with `task_artifact_id`, `query`, `limit` -- Added internal typed compile contracts for artifact retrieval: - - `CompileContextTaskScopedArtifactRetrievalInput` - - `CompileContextArtifactScopedArtifactRetrievalInput` - - `CompileContextArtifactRetrievalInput` -- Added compile response contracts for a separate artifact section: - - `context_pack.artifact_chunks` - - `context_pack.artifact_chunk_summary` - - `ArtifactRetrievalDecisionTracePayload` -- Kept the artifact section response-shape stable even when retrieval is not requested: - - `artifact_chunks` returns `[]` - - `artifact_chunk_summary.requested` is `false` -- Integrated compile-time artifact retrieval into `compile_and_persist_trace()` using only durable `task_artifact_chunks` rows and the shipped lexical retrieval seam. -- Preserved existing continuity, memory, entity, and entity-edge behavior unchanged. -- Recorded compile trace decisions for: - - included artifact chunks - - artifact chunks excluded by the compile limit - - artifacts excluded because `ingestion_status != "ingested"` -- Added summary trace fields for artifact retrieval counts and scope kind. +- Added Alembic revision `20260314_0025_task_artifact_chunk_embeddings` for a new `task_artifact_chunk_embeddings` table. 
+- Added schema for `task_artifact_chunk_embeddings`: + - columns: `id`, `user_id`, `task_artifact_chunk_id`, `embedding_config_id`, `dimensions`, `vector`, `created_at`, `updated_at` + - uniqueness: + - `UNIQUE (id, user_id)` + - `UNIQUE (user_id, task_artifact_chunk_id, embedding_config_id)` + - foreign keys: + - `(task_artifact_chunk_id, user_id) -> task_artifact_chunks(id, user_id)` + - `(embedding_config_id, user_id) -> embedding_configs(id, user_id)` + - checks: + - `dimensions > 0` + - `vector` is a JSON array + - `vector` is non-empty + - `jsonb_array_length(vector) = dimensions` + - index: + - `task_artifact_chunk_embeddings_user_chunk_created_idx (user_id, task_artifact_chunk_id, created_at, id)` + - security/runtime: + - owner-only RLS + - `GRANT SELECT, INSERT, UPDATE ON task_artifact_chunk_embeddings TO alicebot_app` +- Added stable contracts: + - `TaskArtifactChunkEmbeddingUpsertInput` + - `TaskArtifactChunkEmbeddingRecord` + - `TaskArtifactChunkEmbeddingWriteResponse` + - `TaskArtifactChunkEmbeddingDetailResponse` + - `TaskArtifactChunkEmbeddingListScope` + - `TaskArtifactChunkEmbeddingListSummary` + - `TaskArtifactChunkEmbeddingListResponse` + - `TASK_ARTIFACT_CHUNK_EMBEDDING_LIST_ORDER = ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"]` +- Implemented artifact-chunk embedding service behavior: + - validates `task_artifact_chunk_id` against visible `task_artifact_chunks` + - validates `embedding_config_id` against visible `embedding_configs` + - reuses the existing versioned `embedding_configs` seam without a second config/version model + - validates every vector element as finite numeric input + - enforces `len(vector) == embedding_config.dimensions` + - upserts one embedding per `(task_artifact_chunk_id, embedding_config_id)` pair + - exposes deterministic reads by: + - artifact scope + - chunk scope + - embedding id +- Added minimal API surface: + - `POST /v0/task-artifact-chunk-embeddings` + - `GET 
/v0/task-artifacts/{task_artifact_id}/chunk-embeddings` + - `GET /v0/task-artifact-chunks/{task_artifact_chunk_id}/embeddings` + - `GET /v0/task-artifact-chunk-embeddings/{task_artifact_chunk_embedding_id}` - Added unit and integration coverage for: - - artifact compile request routing and validation - - deterministic artifact chunk ordering - - non-ingested artifact exclusion - - included and excluded artifact trace events - - per-user isolation through compile path - - stable compile response shape with the new section + - persistence + - deterministic ordering + - dimension validation + - invalid config and invalid chunk references + - cross-user isolation + - stable response shape + - migration presence, RLS, grants, and downgrade behavior + +## embedding-config reuse rule and dimension-validation rule used + +- Reuse rule: + - every artifact-chunk embedding must reference an existing visible `embedding_config` + - no new embedding versioning model was introduced +- Dimension-validation rule: + - vector normalization accepts only finite numeric values + - write requests fail unless `len(vector) == embedding_config.dimensions` + - database and service validation both enforce the dimensions rule ## incomplete work -- None within Sprint 5F scope. +- None within Sprint 5G scope. 
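The dimension-validation rule above can be sketched as follows. This is a minimal sketch under the stated rules, not the shipped service code: `normalize_vector` and `validate_vector_dimensions` are hypothetical helper names, and the real validation also runs as database check constraints.

```python
import math

def normalize_vector(raw: list) -> list[float]:
    # Accept only finite numeric values; reject bools, NaN, inf, and non-numbers.
    normalized = []
    for value in raw:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("vector elements must be numeric")
        value = float(value)
        if not math.isfinite(value):
            raise ValueError("vector elements must be finite")
        normalized.append(value)
    if not normalized:
        raise ValueError("vector must be non-empty")
    return normalized

def validate_vector_dimensions(vector: list[float], config_dimensions: int) -> None:
    # Write requests fail unless the vector length matches the embedding config.
    if len(vector) != config_dimensions:
        raise ValueError(
            f"vector has {len(vector)} dimensions, config expects {config_dimensions}"
        )
```

Enforcing the same length rule in both the service and the migration's check constraints means a bypassed or buggy caller still cannot persist a vector that disagrees with its `embedding_config.dimensions`.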
## files changed +- `apps/api/alembic/versions/20260314_0025_task_artifact_chunk_embeddings.py` - `apps/api/src/alicebot_api/contracts.py` -- `apps/api/src/alicebot_api/compiler.py` +- `apps/api/src/alicebot_api/embedding.py` - `apps/api/src/alicebot_api/main.py` -- `tests/unit/test_compiler.py` +- `apps/api/src/alicebot_api/store.py` +- `tests/integration/test_migrations.py` +- `tests/integration/test_task_artifact_chunk_embeddings_api.py` +- `tests/unit/test_20260314_0025_task_artifact_chunk_embeddings.py` +- `tests/unit/test_task_artifact_chunk_embedding.py` +- `tests/unit/test_task_artifact_chunk_embedding_store.py` - `tests/unit/test_main.py` -- `tests/integration/test_context_compile.py` - `BUILD_REPORT.md` -## exact compile contract changes introduced - -- `CompileContextRequest` now accepts optional `artifact_retrieval`. -- `artifact_retrieval` is a discriminated union: - - task scope: `{ "kind": "task", "task_id": "", "query": "", "limit": }` - - artifact scope: `{ "kind": "artifact", "task_artifact_id": "", "query": "", "limit": }` -- Artifact retrieval limits: - - default: `5` - - max: `50` -- `CompiledContextPack` now includes: - - `artifact_chunks: list[ContextPackArtifactChunk]` - - `artifact_chunk_summary: ContextPackArtifactChunkSummary` -- `artifact_chunk_summary` fields: - - `requested` - - `scope` - - `query` - - `query_terms` - - `matching_rule` - - `limit` - - `searched_artifact_count` - - `candidate_count` - - `included_count` - - `excluded_uningested_artifact_count` - - `excluded_limit_count` - - `order` - -## artifact retrieval matching and ordering rule used - -- Matching rule id: `casefolded_unicode_word_overlap_unique_query_terms_v1` -- Query normalization: - - casefold query text - - extract unique `\w+` terms in first-occurrence order - - reject queries that contain no lexical terms -- Matching source: - - only persisted `task_artifact_chunks` rows attached to visible artifacts - - no raw file reads in compile path -- Exclusion rule: - 
- artifacts with `ingestion_status != "ingested"` are excluded from artifact chunk results -- Ordering: - - `matched_query_term_count_desc` - - `first_match_char_start_asc` - - `relative_path_asc` - - `sequence_no_asc` - - `id_asc` -- Compile limit behavior: - - ordering is applied first - - the first `limit` chunk matches are included - - remaining matches are traced as `artifact_chunk_limit_exceeded` - -## example compile request +## example artifact-chunk embedding write response ```json { - "user_id": "11111111-1111-1111-1111-111111111111", - "thread_id": "22222222-2222-2222-2222-222222222222", - "artifact_retrieval": { - "kind": "task", - "task_id": "33333333-3333-3333-3333-333333333333", - "query": "Alpha beta", - "limit": 2 - } + "embedding": { + "id": "4d5d0a3b-6a8a-4bf4-bb7c-d1df3d6d84c8", + "task_artifact_id": "6dc8f07d-19f6-4667-b9f3-4573b9cf2b66", + "task_artifact_chunk_id": "fd3dc999-a4d3-4bb0-a287-4f4950dfd7e0", + "task_artifact_chunk_sequence_no": 2, + "embedding_config_id": "42dbab76-1e02-4b5f-a18b-f59c1b19d1d4", + "dimensions": 3, + "vector": [0.9, 0.8, 0.7], + "created_at": "2026-03-14T12:00:00+00:00", + "updated_at": "2026-03-14T12:10:00+00:00" + }, + "write_mode": "updated" } ``` -## example compile response with artifact section +## example artifact-chunk embedding list response ```json { - "trace_id": "44444444-4444-4444-4444-444444444444", - "trace_event_count": 18, - "context_pack": { - "compiler_version": "continuity_v0", - "artifact_chunks": [ - { - "id": "55555555-5555-5555-5555-555555555555", - "task_id": "33333333-3333-3333-3333-333333333333", - "task_artifact_id": "66666666-6666-6666-6666-666666666666", - "relative_path": "docs/a.txt", - "media_type": "text/plain", - "sequence_no": 1, - "char_start": 0, - "char_end_exclusive": 14, - "text": "beta alpha doc", - "match": { - "matched_query_terms": ["alpha", "beta"], - "matched_query_term_count": 2, - "first_match_char_start": 0 - } - }, - { - "id": "77777777-7777-7777-7777-777777777777", 
- "task_id": "33333333-3333-3333-3333-333333333333", - "task_artifact_id": "88888888-8888-8888-8888-888888888888", - "relative_path": "notes/b.md", - "media_type": "text/markdown", - "sequence_no": 1, - "char_start": 0, - "char_end_exclusive": 15, - "text": "alpha beta note", - "match": { - "matched_query_terms": ["alpha", "beta"], - "matched_query_term_count": 2, - "first_match_char_start": 0 - } - } - ], - "artifact_chunk_summary": { - "requested": true, - "scope": { - "kind": "task", - "task_id": "33333333-3333-3333-3333-333333333333" - }, - "query": "Alpha beta", - "query_terms": ["alpha", "beta"], - "matching_rule": "casefolded_unicode_word_overlap_unique_query_terms_v1", - "limit": 2, - "searched_artifact_count": 3, - "candidate_count": 3, - "included_count": 2, - "excluded_uningested_artifact_count": 1, - "excluded_limit_count": 1, - "order": [ - "matched_query_term_count_desc", - "first_match_char_start_asc", - "relative_path_asc", - "sequence_no_asc", - "id_asc" - ] + "items": [ + { + "id": "4d5d0a3b-6a8a-4bf4-bb7c-d1df3d6d84c8", + "task_artifact_id": "6dc8f07d-19f6-4667-b9f3-4573b9cf2b66", + "task_artifact_chunk_id": "fd3dc999-a4d3-4bb0-a287-4f4950dfd7e0", + "task_artifact_chunk_sequence_no": 2, + "embedding_config_id": "42dbab76-1e02-4b5f-a18b-f59c1b19d1d4", + "dimensions": 3, + "vector": [0.9, 0.8, 0.7], + "created_at": "2026-03-14T12:00:00+00:00", + "updated_at": "2026-03-14T12:10:00+00:00" } - } -} -``` - -## example artifact-retrieval trace events inside one compile run - -```json -[ - { - "kind": "context.included", - "payload": { - "entity_type": "artifact_chunk", - "entity_id": "55555555-5555-5555-5555-555555555555", - "reason": "within_artifact_chunk_limit", - "position": 1, - "scope_kind": "task", - "task_id": "33333333-3333-3333-3333-333333333333", - "task_artifact_id": "66666666-6666-6666-6666-666666666666", - "relative_path": "docs/a.txt", - "media_type": "text/plain", - "ingestion_status": "ingested", - "limit": 2, - "matched_query_terms": 
["alpha", "beta"], - "matched_query_term_count": 2, - "first_match_char_start": 0, - "sequence_no": 1, - "char_start": 0, - "char_end_exclusive": 14 - } - }, - { - "kind": "context.excluded", - "payload": { - "entity_type": "artifact_chunk", - "entity_id": "99999999-9999-9999-9999-999999999999", - "reason": "artifact_chunk_limit_exceeded", - "position": 3, - "scope_kind": "task", - "task_id": "33333333-3333-3333-3333-333333333333", - "task_artifact_id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa", - "relative_path": "notes/c.txt", - "media_type": "text/plain", - "ingestion_status": "ingested", - "limit": 2, - "matched_query_terms": ["beta"], - "matched_query_term_count": 1, - "first_match_char_start": 0, - "sequence_no": 1, - "char_start": 0, - "char_end_exclusive": 9 - } - }, - { - "kind": "context.excluded", - "payload": { - "entity_type": "task_artifact", - "entity_id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb", - "reason": "artifact_not_ingested", - "position": 3, - "scope_kind": "task", - "task_id": "33333333-3333-3333-3333-333333333333", - "task_artifact_id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb", - "relative_path": "notes/hidden.txt", - "media_type": "text/plain", - "ingestion_status": "pending", - "limit": 2 + ], + "summary": { + "total_count": 1, + "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"], + "scope": { + "kind": "chunk", + "task_artifact_id": "6dc8f07d-19f6-4667-b9f3-4573b9cf2b66", + "task_artifact_chunk_id": "fd3dc999-a4d3-4bb0-a287-4f4950dfd7e0" } } -] +} ``` ## tests run -- `./.venv/bin/python -m pytest tests/unit/test_compiler.py` - - result: `4 passed in 0.21s` -- `./.venv/bin/python -m pytest tests/unit/test_main.py` - - result: `38 passed in 0.39s` -- `./.venv/bin/python -m pytest tests/integration/test_context_compile.py` - - sandboxed attempt failed to reach local Postgres on `localhost:5432` with `Operation not permitted` -- `./.venv/bin/python -m pytest tests/integration/test_context_compile.py` - - rerun with local 
database access: `7 passed in 2.17s` +- `./.venv/bin/python -m pytest tests/unit/test_20260314_0025_task_artifact_chunk_embeddings.py tests/unit/test_task_artifact_chunk_embedding.py tests/unit/test_task_artifact_chunk_embedding_store.py tests/unit/test_main.py` + - result: `48 passed in 0.41s` +- `./.venv/bin/python -m pytest tests/integration/test_task_artifact_chunk_embeddings_api.py tests/integration/test_migrations.py` + - first sandboxed attempt failed to reach local Postgres on `localhost:5432` due to sandbox restrictions +- `./.venv/bin/python -m pytest tests/integration/test_migrations.py::test_migrations_upgrade_and_downgrade tests/integration/test_task_artifact_chunk_embeddings_api.py` + - result: `4 passed in 1.99s` - `./.venv/bin/python -m pytest tests/unit` - - result: `360 passed in 0.56s` + - result: `370 passed in 0.59s` - `./.venv/bin/python -m pytest tests/integration` - - rerun with local database access: `107 passed in 31.19s` + - result: `111 passed in 34.92s` ## blockers/issues -- No remaining implementation blockers. -- Postgres-backed integration verification required elevated local database access because sandboxed localhost connections were denied. +- No code blockers remained. +- Integration verification required elevated access to the local Postgres instance because sandboxed localhost connections were blocked. -## recommended next step -Use this compile-path artifact section as the only seam for later retrieval upgrades, then add semantic retrieval or richer document handling in a separate sprint without changing this deterministic lexical contract.
+- semantic retrieval over artifact chunks +- lexical plus semantic hybrid artifact retrieval +- compile-path semantic use of artifact embeddings +- embedding generation via model or external API calls +- connectors, runners, orchestration, and UI work -## what remains intentionally deferred to later milestones +## recommended next step -- Artifact chunk embeddings -- Semantic retrieval or reranking for artifact chunks -- Compile-path merging of artifact chunks into memory or entity sections -- PDF, DOCX, OCR, or richer document parsing -- Gmail or Calendar connectors -- Runner or orchestration work -- UI work +Use the new durable `task_artifact_chunk_embeddings` substrate to add a separate, narrowly scoped semantic artifact retrieval sprint that reads these stored vectors without changing compile-path behavior in the same step. diff --git a/REVIEW_REPORT.md b/REVIEW_REPORT.md index 5ee4f13..acff7a4 100644 --- a/REVIEW_REPORT.md +++ b/REVIEW_REPORT.md @@ -6,24 +6,16 @@ PASS ## criteria met -- `POST /v0/context/compile` optionally accepts `artifact_retrieval` input and returns a separate `context_pack.artifact_chunks` section plus `artifact_chunk_summary`. -- Compile-path artifact retrieval uses only durable `task_artifact_chunks` rows through the existing lexical retrieval seam in `apps/api/src/alicebot_api/artifacts.py`. -- Non-ingested artifacts are excluded from compile-path artifact results and produce explicit exclusion trace events. -- Artifact include/exclude decisions are persisted in `trace_events`, and compile summary events expose artifact retrieval counters and scope kind. -- Artifact chunk ordering is deterministic and matches the documented order: - - matched query term count desc - - first match start asc - - relative path asc - - sequence no asc - - id asc -- Current continuity, memory, entity, and entity-edge sections remain intact and separate from artifact chunks. 
-- Task-scoped and artifact-scoped compile retrieval paths are both covered, including artifact-scoped happy-path coverage in `tests/integration/test_context_compile.py`. -- The sprint stayed within scope: no embeddings, semantic retrieval for artifact chunks, connectors, runner work, UI work, or raw-file reads in compile. -- Verification in this review: - - `./.venv/bin/python -m pytest tests/unit` -> `360 passed in 0.59s` - - `./.venv/bin/python -m pytest tests/integration` -> `107 passed in 29.88s` - - `./.venv/bin/python -m pytest tests/integration/test_context_compile.py` -> `8 passed in 2.42s` - - `git diff --check` -> passed +- The sprint remains technically narrow and limited to the artifact-chunk embedding substrate: migration, contracts, store/service logic, and minimal embedding read/write routes are present in [20260314_0025_task_artifact_chunk_embeddings.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/alembic/versions/20260314_0025_task_artifact_chunk_embeddings.py), [embedding.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/src/alicebot_api/embedding.py#L315), [main.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/src/alicebot_api/main.py#L1968), and [store.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/src/alicebot_api/store.py). +- Writes attach one validated vector to one visible `task_artifact_chunk` under one visible `embedding_config`, reject missing refs and dimension mismatches, and preserve user isolation through existing ownership seams. See [embedding.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/src/alicebot_api/embedding.py#L323). +- Reads are deterministic and user-scoped. The migration enforces composite ownership-linked foreign keys and RLS, and list ordering is explicit in both contracts and queries. 
See [contracts.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/src/alicebot_api/contracts.py), [store.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/src/alicebot_api/store.py), and [20260314_0025_task_artifact_chunk_embeddings.py](/Users/samirusani/Desktop/Codex/AliceBot/apps/api/alembic/versions/20260314_0025_task_artifact_chunk_embeddings.py#L15). +- Coverage remains adequate for the sprint packet: persistence, ordering, invalid refs, dimension validation, isolation, route shape, and migration upgrade/downgrade are test-backed. +- Prior runtime verification for this review cycle remains valid: + - `./.venv/bin/python -m pytest tests/unit` -> `370 passed` + - `./.venv/bin/python -m pytest tests/integration` -> `111 passed` +- The follow-up addressed the remaining review findings: + - [ARCHITECTURE.md](/Users/samirusani/Desktop/Codex/AliceBot/ARCHITECTURE.md) now reflects Sprint 5G as implemented, includes the new embedding routes and table, and no longer describes artifact-chunk embeddings as deferred. + - [RULES.md](/Users/samirusani/Desktop/Codex/AliceBot/RULES.md#L6) now makes [.ai/active/SPRINT_PACKET.md](/Users/samirusani/Desktop/Codex/AliceBot/.ai/active/SPRINT_PACKET.md) an immutable control/input artifact during implementation unless Control Tower changes the sprint. ## criteria missed @@ -31,25 +23,26 @@ PASS ## quality issues -- No blocking implementation or coverage issues found after the follow-up fixes. +- No blocking implementation or documentation quality issues remain. ## regression risks -- Low. The change remains additive, narrowly scoped to compile-path artifact chunk inclusion, and is covered by unit plus Postgres-backed integration tests for ordering, exclusion, tracing, validation, and isolation. +- Low. The only follow-up changes in this pass were documentation and rules updates, and they do not affect runtime behavior. ## docs issues -- `BUILD_REPORT.md` is aligned with the implementation and verification. 
-- `ARCHITECTURE.md` now reflects the shipped boundary through Sprint 5F and no longer misstates artifact retrieval as unimplemented. +- None blocking. +- Provenance note: [.ai/active/SPRINT_PACKET.md](/Users/samirusani/Desktop/Codex/AliceBot/.ai/active/SPRINT_PACKET.md) is still modified in the worktree relative to the repo base, but the current contents match the Sprint 5G assignment being reviewed, and [RULES.md](/Users/samirusani/Desktop/Codex/AliceBot/RULES.md#L6) now codifies immutability for future implementation turns. ## should anything be added to RULES.md? -- No. +- No. The needed control-artifact rule has been added. ## should anything update ARCHITECTURE.md? -- No further update is required for sprint acceptance. +- No. The needed Sprint 5G updates are present. ## recommended next action -- Mark Sprint 5F accepted and proceed to the next milestone in a separate sprint. +1. Treat the sprint as review-passed. +2. If desired, keep the new `SPRINT_PACKET.md` immutability rule as the standing process guard for future sprints. diff --git a/RULES.md b/RULES.md index 6c02e05..ad493d8 100644 --- a/RULES.md +++ b/RULES.md @@ -3,6 +3,7 @@ ## Truth And Scope - The active sprint packet is the top scope boundary for implementation work. +- Treat `.ai/active/SPRINT_PACKET.md` as an input/control artifact: do not edit it during implementation unless Control Tower explicitly changes the sprint. - Never describe planned behavior as already implemented. - Keep canonical truth files concise, current, and durable. - Archive stale planning or history material instead of deleting it when traceability still matters. 
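The vector contract the new write path enforces (non-empty vector, finite values only, length equal to the embedding config's declared `dimensions`) can be mirrored as a small standalone check. This is an illustrative sketch only: the free-standing `validate_vector` name and signature are assumptions for this example, not the shipped API, which applies the same rules through `_validate_vector` plus a config lookup in `embedding.py`.

```python
import math


def validate_vector(vector: tuple[float, ...], config_dimensions: int) -> list[float]:
    """Mirror of the write-path checks: non-empty, finite values, length == config dimensions."""
    if not vector:
        raise ValueError("vector must include at least one numeric value")
    normalized: list[float] = []
    for value in vector:
        normalized_value = float(value)
        if not math.isfinite(normalized_value):
            raise ValueError("vector must contain only finite numeric values")
        normalized.append(normalized_value)
    if len(normalized) != config_dimensions:
        raise ValueError(
            "vector length must match embedding config dimensions "
            f"({config_dimensions}): {len(normalized)}"
        )
    return normalized


# Accepted: finite values matching the config's declared dimensions.
print(validate_vector((0.1, 0.2, 0.3), config_dimensions=3))  # -> [0.1, 0.2, 0.3]
```

A write request whose vector fails any of these checks is rejected with a 400 before any row is written, matching the `TaskArtifactChunkEmbeddingValidationError` handling in the new route; the migration's CHECK constraints then re-enforce the same dimension invariants at the database layer.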
diff --git a/apps/api/alembic/versions/20260314_0025_task_artifact_chunk_embeddings.py b/apps/api/alembic/versions/20260314_0025_task_artifact_chunk_embeddings.py new file mode 100644 index 0000000..23cba0e --- /dev/null +++ b/apps/api/alembic/versions/20260314_0025_task_artifact_chunk_embeddings.py @@ -0,0 +1,81 @@ +"""Add user-scoped task artifact chunk embedding records.""" + +from __future__ import annotations + +from alembic import op + + +revision = "20260314_0025" +down_revision = "20260314_0024" +branch_labels = None +depends_on = None + +_RLS_TABLES = ("task_artifact_chunk_embeddings",) + +_UPGRADE_SCHEMA_STATEMENT = """ + CREATE TABLE task_artifact_chunk_embeddings ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE, + task_artifact_chunk_id uuid NOT NULL, + embedding_config_id uuid NOT NULL, + dimensions integer NOT NULL, + vector jsonb NOT NULL, + created_at timestamptz NOT NULL DEFAULT now(), + updated_at timestamptz NOT NULL DEFAULT now(), + UNIQUE (id, user_id), + UNIQUE (user_id, task_artifact_chunk_id, embedding_config_id), + CONSTRAINT task_artifact_chunk_embeddings_chunk_fkey + FOREIGN KEY (task_artifact_chunk_id, user_id) + REFERENCES task_artifact_chunks(id, user_id) ON DELETE CASCADE, + CONSTRAINT task_artifact_chunk_embeddings_embedding_config_fkey + FOREIGN KEY (embedding_config_id, user_id) + REFERENCES embedding_configs(id, user_id) ON DELETE CASCADE, + CONSTRAINT task_artifact_chunk_embeddings_dimensions_check + CHECK (dimensions > 0), + CONSTRAINT task_artifact_chunk_embeddings_vector_array_check + CHECK (jsonb_typeof(vector) = 'array'), + CONSTRAINT task_artifact_chunk_embeddings_vector_nonempty_check + CHECK (jsonb_array_length(vector) > 0), + CONSTRAINT task_artifact_chunk_embeddings_vector_dimensions_match_check + CHECK (jsonb_array_length(vector) = dimensions) + ); + + CREATE INDEX task_artifact_chunk_embeddings_user_chunk_created_idx + ON task_artifact_chunk_embeddings 
(user_id, task_artifact_chunk_id, created_at, id); + """ + +_UPGRADE_GRANT_STATEMENTS = ( + "GRANT SELECT, INSERT, UPDATE ON task_artifact_chunk_embeddings TO alicebot_app", +) + +_UPGRADE_POLICY_STATEMENT = """ + CREATE POLICY task_artifact_chunk_embeddings_is_owner ON task_artifact_chunk_embeddings + USING (user_id = app.current_user_id()) + WITH CHECK (user_id = app.current_user_id()); + """ + +_DOWNGRADE_STATEMENTS = ( + "DROP TABLE IF EXISTS task_artifact_chunk_embeddings", +) + + +def _execute_statements(statements: tuple[str, ...]) -> None: + for statement in statements: + op.execute(statement) + + +def _enable_row_level_security() -> None: + for table_name in _RLS_TABLES: + op.execute(f"ALTER TABLE {table_name} ENABLE ROW LEVEL SECURITY") + op.execute(f"ALTER TABLE {table_name} FORCE ROW LEVEL SECURITY") + + +def upgrade() -> None: + op.execute(_UPGRADE_SCHEMA_STATEMENT) + _execute_statements(_UPGRADE_GRANT_STATEMENTS) + _enable_row_level_security() + op.execute(_UPGRADE_POLICY_STATEMENT) + + +def downgrade() -> None: + _execute_statements(_DOWNGRADE_STATEMENTS) diff --git a/apps/api/src/alicebot_api/contracts.py b/apps/api/src/alicebot_api/contracts.py index 86e7934..c624578 100644 --- a/apps/api/src/alicebot_api/contracts.py +++ b/apps/api/src/alicebot_api/contracts.py @@ -23,6 +23,7 @@ TaskArtifactStatus = Literal["registered"] TaskArtifactIngestionStatus = Literal["pending", "ingested"] TaskArtifactChunkRetrievalScopeKind = Literal["task", "artifact"] +TaskArtifactChunkEmbeddingListScopeKind = Literal["artifact", "chunk"] TaskLifecycleSource = Literal[ "approval_request", "approval_resolution", @@ -136,6 +137,11 @@ TASK_WORKSPACE_LIST_ORDER = ["created_at_asc", "id_asc"] TASK_ARTIFACT_LIST_ORDER = ["created_at_asc", "id_asc"] TASK_ARTIFACT_CHUNK_LIST_ORDER = ["sequence_no_asc", "id_asc"] +TASK_ARTIFACT_CHUNK_EMBEDDING_LIST_ORDER = [ + "task_artifact_chunk_sequence_no_asc", + "created_at_asc", + "id_asc", +] TASK_ARTIFACT_CHUNK_RETRIEVAL_ORDER = [ 
"matched_query_term_count_desc", "first_match_char_start_asc", @@ -747,6 +753,20 @@ def as_payload(self) -> JsonObject: } +@dataclass(frozen=True, slots=True) +class TaskArtifactChunkEmbeddingUpsertInput: + task_artifact_chunk_id: UUID + embedding_config_id: UUID + vector: tuple[float, ...] + + def as_payload(self) -> JsonObject: + return { + "task_artifact_chunk_id": str(self.task_artifact_chunk_id), + "embedding_config_id": str(self.embedding_config_id), + "vector": [float(value) for value in self.vector], + } + + @dataclass(frozen=True, slots=True) class SemanticMemoryRetrievalRequestInput: embedding_config_id: UUID @@ -1770,6 +1790,44 @@ class TaskArtifactChunkListResponse(TypedDict): summary: TaskArtifactChunkListSummary +class TaskArtifactChunkEmbeddingRecord(TypedDict): + id: str + task_artifact_id: str + task_artifact_chunk_id: str + task_artifact_chunk_sequence_no: int + embedding_config_id: str + dimensions: int + vector: list[float] + created_at: str + updated_at: str + + +class TaskArtifactChunkEmbeddingWriteResponse(TypedDict): + embedding: TaskArtifactChunkEmbeddingRecord + write_mode: Literal["created", "updated"] + + +class TaskArtifactChunkEmbeddingDetailResponse(TypedDict): + embedding: TaskArtifactChunkEmbeddingRecord + + +class TaskArtifactChunkEmbeddingListScope(TypedDict): + kind: TaskArtifactChunkEmbeddingListScopeKind + task_artifact_id: str + task_artifact_chunk_id: NotRequired[str] + + +class TaskArtifactChunkEmbeddingListSummary(TypedDict): + total_count: int + order: list[str] + scope: TaskArtifactChunkEmbeddingListScope + + +class TaskArtifactChunkEmbeddingListResponse(TypedDict): + items: list[TaskArtifactChunkEmbeddingRecord] + summary: TaskArtifactChunkEmbeddingListSummary + + class TaskArtifactIngestionResponse(TypedDict): artifact: TaskArtifactRecord summary: TaskArtifactChunkListSummary diff --git a/apps/api/src/alicebot_api/embedding.py b/apps/api/src/alicebot_api/embedding.py index 5248197..320d5fb 100644 --- 
a/apps/api/src/alicebot_api/embedding.py +++ b/apps/api/src/alicebot_api/embedding.py @@ -5,9 +5,11 @@ import psycopg +from alicebot_api.artifacts import TaskArtifactNotFoundError from alicebot_api.contracts import ( EMBEDDING_CONFIG_LIST_ORDER, MEMORY_EMBEDDING_LIST_ORDER, + TASK_ARTIFACT_CHUNK_EMBEDDING_LIST_ORDER, EmbeddingConfigCreateInput, EmbeddingConfigCreateResponse, EmbeddingConfigListResponse, @@ -19,8 +21,21 @@ MemoryEmbeddingRecord, MemoryEmbeddingUpsertInput, MemoryEmbeddingUpsertResponse, + TaskArtifactChunkEmbeddingDetailResponse, + TaskArtifactChunkEmbeddingListResponse, + TaskArtifactChunkEmbeddingListScope, + TaskArtifactChunkEmbeddingListScopeKind, + TaskArtifactChunkEmbeddingListSummary, + TaskArtifactChunkEmbeddingRecord, + TaskArtifactChunkEmbeddingUpsertInput, + TaskArtifactChunkEmbeddingWriteResponse, +) +from alicebot_api.store import ( + ContinuityStore, + EmbeddingConfigRow, + MemoryEmbeddingRow, + TaskArtifactChunkEmbeddingRow, ) -from alicebot_api.store import ContinuityStore, EmbeddingConfigRow, MemoryEmbeddingRow class EmbeddingConfigValidationError(ValueError): @@ -35,6 +50,14 @@ class MemoryEmbeddingNotFoundError(LookupError): """Raised when a requested memory embedding is not visible inside the current user scope.""" +class TaskArtifactChunkEmbeddingValidationError(ValueError): + """Raised when an artifact-chunk embedding request fails explicit validation.""" + + +class TaskArtifactChunkEmbeddingNotFoundError(LookupError): + """Raised when an artifact-chunk embedding read target is not visible inside the current user scope.""" + + def _duplicate_embedding_config_message( *, provider: str, @@ -72,20 +95,67 @@ def _serialize_memory_embedding(embedding: MemoryEmbeddingRow) -> MemoryEmbeddin } -def _validate_vector(vector: tuple[float, ...]) -> list[float]: +def _serialize_task_artifact_chunk_embedding( + embedding: TaskArtifactChunkEmbeddingRow, +) -> TaskArtifactChunkEmbeddingRecord: + return { + "id": str(embedding["id"]), + 
"task_artifact_id": str(embedding["task_artifact_id"]), + "task_artifact_chunk_id": str(embedding["task_artifact_chunk_id"]), + "task_artifact_chunk_sequence_no": embedding["task_artifact_chunk_sequence_no"], + "embedding_config_id": str(embedding["embedding_config_id"]), + "dimensions": embedding["dimensions"], + "vector": [float(value) for value in embedding["vector"]], + "created_at": embedding["created_at"].isoformat(), + "updated_at": embedding["updated_at"].isoformat(), + } + + +def _validate_vector( + vector: tuple[float, ...], + *, + error_type: type[ValueError], +) -> list[float]: if not vector: - raise MemoryEmbeddingValidationError("vector must include at least one numeric value") + raise error_type("vector must include at least one numeric value") normalized: list[float] = [] for value in vector: normalized_value = float(value) if not math.isfinite(normalized_value): - raise MemoryEmbeddingValidationError("vector must contain only finite numeric values") + raise error_type("vector must contain only finite numeric values") normalized.append(normalized_value) return normalized +def _build_task_artifact_chunk_embedding_scope( + *, + kind: TaskArtifactChunkEmbeddingListScopeKind, + task_artifact_id: UUID, + task_artifact_chunk_id: UUID | None = None, +) -> TaskArtifactChunkEmbeddingListScope: + scope: TaskArtifactChunkEmbeddingListScope = { + "kind": kind, + "task_artifact_id": str(task_artifact_id), + } + if task_artifact_chunk_id is not None: + scope["task_artifact_chunk_id"] = str(task_artifact_chunk_id) + return scope + + +def _build_task_artifact_chunk_embedding_summary( + *, + items: list[TaskArtifactChunkEmbeddingRecord], + scope: TaskArtifactChunkEmbeddingListScope, +) -> TaskArtifactChunkEmbeddingListSummary: + return { + "total_count": len(items), + "order": list(TASK_ARTIFACT_CHUNK_EMBEDDING_LIST_ORDER), + "scope": scope, + } + + def create_embedding_config_record( store: ContinuityStore, *, @@ -168,7 +238,7 @@ def upsert_memory_embedding_record( 
f"{request.embedding_config_id}" ) - vector = _validate_vector(request.vector) + vector = _validate_vector(request.vector, error_type=MemoryEmbeddingValidationError) if len(vector) != config["dimensions"]: raise MemoryEmbeddingValidationError( "vector length must match embedding config dimensions " @@ -240,3 +310,131 @@ def list_memory_embedding_records( "items": items, "summary": summary, } + + +def upsert_task_artifact_chunk_embedding_record( + store: ContinuityStore, + *, + user_id: UUID, + request: TaskArtifactChunkEmbeddingUpsertInput, +) -> TaskArtifactChunkEmbeddingWriteResponse: + del user_id + + chunk = store.get_task_artifact_chunk_optional(request.task_artifact_chunk_id) + if chunk is None: + raise TaskArtifactChunkEmbeddingValidationError( + "task_artifact_chunk_id must reference an existing task artifact chunk owned by the " + f"user: {request.task_artifact_chunk_id}" + ) + + config = store.get_embedding_config_optional(request.embedding_config_id) + if config is None: + raise TaskArtifactChunkEmbeddingValidationError( + "embedding_config_id must reference an existing embedding config owned by the user: " + f"{request.embedding_config_id}" + ) + + vector = _validate_vector(request.vector, error_type=TaskArtifactChunkEmbeddingValidationError) + if len(vector) != config["dimensions"]: + raise TaskArtifactChunkEmbeddingValidationError( + "vector length must match embedding config dimensions " + f"({config['dimensions']}): {len(vector)}" + ) + + existing = store.get_task_artifact_chunk_embedding_by_chunk_and_config_optional( + task_artifact_chunk_id=request.task_artifact_chunk_id, + embedding_config_id=request.embedding_config_id, + ) + if existing is None: + created = store.create_task_artifact_chunk_embedding( + task_artifact_chunk_id=request.task_artifact_chunk_id, + embedding_config_id=request.embedding_config_id, + dimensions=config["dimensions"], + vector=vector, + ) + return { + "embedding": _serialize_task_artifact_chunk_embedding(created), + 
"write_mode": "created", + } + + updated = store.update_task_artifact_chunk_embedding( + task_artifact_chunk_embedding_id=existing["id"], + dimensions=config["dimensions"], + vector=vector, + ) + return { + "embedding": _serialize_task_artifact_chunk_embedding(updated), + "write_mode": "updated", + } + + +def get_task_artifact_chunk_embedding_record( + store: ContinuityStore, + *, + user_id: UUID, + task_artifact_chunk_embedding_id: UUID, +) -> TaskArtifactChunkEmbeddingDetailResponse: + del user_id + + embedding = store.get_task_artifact_chunk_embedding_optional(task_artifact_chunk_embedding_id) + if embedding is None: + raise TaskArtifactChunkEmbeddingNotFoundError( + f"task artifact chunk embedding {task_artifact_chunk_embedding_id} was not found" + ) + + return {"embedding": _serialize_task_artifact_chunk_embedding(embedding)} + + +def list_task_artifact_chunk_embedding_records_for_artifact( + store: ContinuityStore, + *, + user_id: UUID, + task_artifact_id: UUID, +) -> TaskArtifactChunkEmbeddingListResponse: + del user_id + + artifact = store.get_task_artifact_optional(task_artifact_id) + if artifact is None: + raise TaskArtifactNotFoundError(f"task artifact {task_artifact_id} was not found") + + items = [ + _serialize_task_artifact_chunk_embedding(embedding) + for embedding in store.list_task_artifact_chunk_embeddings_for_artifact(task_artifact_id) + ] + scope = _build_task_artifact_chunk_embedding_scope( + kind="artifact", + task_artifact_id=task_artifact_id, + ) + return { + "items": items, + "summary": _build_task_artifact_chunk_embedding_summary(items=items, scope=scope), + } + + +def list_task_artifact_chunk_embedding_records_for_chunk( + store: ContinuityStore, + *, + user_id: UUID, + task_artifact_chunk_id: UUID, +) -> TaskArtifactChunkEmbeddingListResponse: + del user_id + + chunk = store.get_task_artifact_chunk_optional(task_artifact_chunk_id) + if chunk is None: + raise TaskArtifactChunkEmbeddingNotFoundError( + f"task artifact chunk 
{task_artifact_chunk_id} was not found" + ) + + items = [ + _serialize_task_artifact_chunk_embedding(embedding) + for embedding in store.list_task_artifact_chunk_embeddings_for_chunk(task_artifact_chunk_id) + ] + scope = _build_task_artifact_chunk_embedding_scope( + kind="chunk", + task_artifact_id=chunk["task_artifact_id"], + task_artifact_chunk_id=task_artifact_chunk_id, + ) + return { + "items": items, + "summary": _build_task_artifact_chunk_embedding_summary(items=items, scope=scope), + } diff --git a/apps/api/src/alicebot_api/main.py b/apps/api/src/alicebot_api/main.py index bab17c2..0106917 100644 --- a/apps/api/src/alicebot_api/main.py +++ b/apps/api/src/alicebot_api/main.py @@ -49,6 +49,7 @@ PolicyEffect, PolicyEvaluationRequestInput, SemanticMemoryRetrievalRequestInput, + TaskArtifactChunkEmbeddingUpsertInput, TOOL_METADATA_VERSION_V0, ApprovalStatus, ArtifactScopedArtifactChunkRetrievalInput, @@ -133,10 +134,16 @@ EmbeddingConfigValidationError, MemoryEmbeddingNotFoundError, MemoryEmbeddingValidationError, + TaskArtifactChunkEmbeddingNotFoundError, + TaskArtifactChunkEmbeddingValidationError, create_embedding_config_record, get_memory_embedding_record, + get_task_artifact_chunk_embedding_record, list_embedding_config_records, list_memory_embedding_records, + list_task_artifact_chunk_embedding_records_for_artifact, + list_task_artifact_chunk_embedding_records_for_chunk, + upsert_task_artifact_chunk_embedding_record, upsert_memory_embedding_record, ) from alicebot_api.entity import ( @@ -355,6 +362,13 @@ class UpsertMemoryEmbeddingRequest(BaseModel): vector: list[float] = Field(min_length=1, max_length=20000) +class UpsertTaskArtifactChunkEmbeddingRequest(BaseModel): + user_id: UUID + task_artifact_chunk_id: UUID + embedding_config_id: UUID + vector: list[float] = Field(min_length=1, max_length=20000) + + class RetrieveSemanticMemoriesRequest(BaseModel): user_id: UUID embedding_config_id: UUID @@ -1951,6 +1965,32 @@ def upsert_memory_embedding(request: 
UpsertMemoryEmbeddingRequest) -> JSONRespon ) +@app.post("/v0/task-artifact-chunk-embeddings") +def upsert_task_artifact_chunk_embedding( + request: UpsertTaskArtifactChunkEmbeddingRequest, +) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, request.user_id) as conn: + payload = upsert_task_artifact_chunk_embedding_record( + ContinuityStore(conn), + user_id=request.user_id, + request=TaskArtifactChunkEmbeddingUpsertInput( + task_artifact_chunk_id=request.task_artifact_chunk_id, + embedding_config_id=request.embedding_config_id, + vector=tuple(request.vector), + ), + ) + except TaskArtifactChunkEmbeddingValidationError as exc: + return JSONResponse(status_code=400, content={"detail": str(exc)}) + + return JSONResponse( + status_code=201, + content=jsonable_encoder(payload), + ) + + @app.get("/v0/memories/{memory_id}/embeddings") def list_memory_embeddings(memory_id: UUID, user_id: UUID) -> JSONResponse: settings = get_settings() @@ -1971,6 +2011,52 @@ def list_memory_embeddings(memory_id: UUID, user_id: UUID) -> JSONResponse: ) +@app.get("/v0/task-artifacts/{task_artifact_id}/chunk-embeddings") +def list_task_artifact_chunk_embeddings_for_artifact( + task_artifact_id: UUID, + user_id: UUID, +) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, user_id) as conn: + payload = list_task_artifact_chunk_embedding_records_for_artifact( + ContinuityStore(conn), + user_id=user_id, + task_artifact_id=task_artifact_id, + ) + except TaskArtifactNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + content=jsonable_encoder(payload), + ) + + +@app.get("/v0/task-artifact-chunks/{task_artifact_chunk_id}/embeddings") +def list_task_artifact_chunk_embeddings( + task_artifact_chunk_id: UUID, + user_id: UUID, +) -> JSONResponse: + settings = get_settings() + + try: + with 
user_connection(settings.database_url, user_id) as conn: + payload = list_task_artifact_chunk_embedding_records_for_chunk( + ContinuityStore(conn), + user_id=user_id, + task_artifact_chunk_id=task_artifact_chunk_id, + ) + except TaskArtifactChunkEmbeddingNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + content=jsonable_encoder(payload), + ) + + @app.get("/v0/memory-embeddings/{memory_embedding_id}") def get_memory_embedding(memory_embedding_id: UUID, user_id: UUID) -> JSONResponse: settings = get_settings() @@ -1991,6 +2077,29 @@ def get_memory_embedding(memory_embedding_id: UUID, user_id: UUID) -> JSONRespon ) +@app.get("/v0/task-artifact-chunk-embeddings/{task_artifact_chunk_embedding_id}") +def get_task_artifact_chunk_embedding( + task_artifact_chunk_embedding_id: UUID, + user_id: UUID, +) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, user_id) as conn: + payload = get_task_artifact_chunk_embedding_record( + ContinuityStore(conn), + user_id=user_id, + task_artifact_chunk_embedding_id=task_artifact_chunk_embedding_id, + ) + except TaskArtifactChunkEmbeddingNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + content=jsonable_encoder(payload), + ) + + @app.post("/v0/entities") def create_entity(request: CreateEntityRequest) -> JSONResponse: settings = get_settings() diff --git a/apps/api/src/alicebot_api/store.py b/apps/api/src/alicebot_api/store.py index d18ced9..206d168 100644 --- a/apps/api/src/alicebot_api/store.py +++ b/apps/api/src/alicebot_api/store.py @@ -270,6 +270,19 @@ class TaskArtifactChunkRow(TypedDict): updated_at: datetime +class TaskArtifactChunkEmbeddingRow(TypedDict): + id: UUID + user_id: UUID + task_artifact_id: UUID + task_artifact_chunk_id: UUID + task_artifact_chunk_sequence_no: int + embedding_config_id: UUID + 
dimensions: int + vector: list[float] + created_at: datetime + updated_at: datetime + + class TaskStepRow(TypedDict): id: UUID user_id: UUID @@ -1597,6 +1610,181 @@ class LabelCountRow(TypedDict): ORDER BY sequence_no ASC, id ASC """ +GET_TASK_ARTIFACT_CHUNK_SQL = """ + SELECT + id, + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + FROM task_artifact_chunks + WHERE id = %s + """ + +INSERT_TASK_ARTIFACT_CHUNK_EMBEDDING_SQL = """ + WITH inserted AS ( + INSERT INTO task_artifact_chunk_embeddings ( + user_id, + task_artifact_chunk_id, + embedding_config_id, + dimensions, + vector, + created_at, + updated_at + ) + VALUES ( + app.current_user_id(), + %s, + %s, + %s, + %s, + clock_timestamp(), + clock_timestamp() + ) + RETURNING + id, + user_id, + task_artifact_chunk_id, + embedding_config_id, + dimensions, + vector, + created_at, + updated_at + ) + SELECT + inserted.id, + inserted.user_id, + chunks.task_artifact_id, + inserted.task_artifact_chunk_id, + chunks.sequence_no AS task_artifact_chunk_sequence_no, + inserted.embedding_config_id, + inserted.dimensions, + inserted.vector, + inserted.created_at, + inserted.updated_at + FROM inserted + JOIN task_artifact_chunks AS chunks + ON chunks.id = inserted.task_artifact_chunk_id + AND chunks.user_id = inserted.user_id + """ + +GET_TASK_ARTIFACT_CHUNK_EMBEDDING_SQL = """ + SELECT + embeddings.id, + embeddings.user_id, + chunks.task_artifact_id, + embeddings.task_artifact_chunk_id, + chunks.sequence_no AS task_artifact_chunk_sequence_no, + embeddings.embedding_config_id, + embeddings.dimensions, + embeddings.vector, + embeddings.created_at, + embeddings.updated_at + FROM task_artifact_chunk_embeddings AS embeddings + JOIN task_artifact_chunks AS chunks + ON chunks.id = embeddings.task_artifact_chunk_id + AND chunks.user_id = embeddings.user_id + WHERE embeddings.id = %s + """ + +GET_TASK_ARTIFACT_CHUNK_EMBEDDING_BY_CHUNK_AND_CONFIG_SQL = """ + SELECT + 
embeddings.id, + embeddings.user_id, + chunks.task_artifact_id, + embeddings.task_artifact_chunk_id, + chunks.sequence_no AS task_artifact_chunk_sequence_no, + embeddings.embedding_config_id, + embeddings.dimensions, + embeddings.vector, + embeddings.created_at, + embeddings.updated_at + FROM task_artifact_chunk_embeddings AS embeddings + JOIN task_artifact_chunks AS chunks + ON chunks.id = embeddings.task_artifact_chunk_id + AND chunks.user_id = embeddings.user_id + WHERE embeddings.task_artifact_chunk_id = %s + AND embeddings.embedding_config_id = %s + """ + +LIST_TASK_ARTIFACT_CHUNK_EMBEDDINGS_FOR_CHUNK_SQL = """ + SELECT + embeddings.id, + embeddings.user_id, + chunks.task_artifact_id, + embeddings.task_artifact_chunk_id, + chunks.sequence_no AS task_artifact_chunk_sequence_no, + embeddings.embedding_config_id, + embeddings.dimensions, + embeddings.vector, + embeddings.created_at, + embeddings.updated_at + FROM task_artifact_chunk_embeddings AS embeddings + JOIN task_artifact_chunks AS chunks + ON chunks.id = embeddings.task_artifact_chunk_id + AND chunks.user_id = embeddings.user_id + WHERE embeddings.task_artifact_chunk_id = %s + ORDER BY chunks.sequence_no ASC, embeddings.created_at ASC, embeddings.id ASC + """ + +LIST_TASK_ARTIFACT_CHUNK_EMBEDDINGS_FOR_ARTIFACT_SQL = """ + SELECT + embeddings.id, + embeddings.user_id, + chunks.task_artifact_id, + embeddings.task_artifact_chunk_id, + chunks.sequence_no AS task_artifact_chunk_sequence_no, + embeddings.embedding_config_id, + embeddings.dimensions, + embeddings.vector, + embeddings.created_at, + embeddings.updated_at + FROM task_artifact_chunk_embeddings AS embeddings + JOIN task_artifact_chunks AS chunks + ON chunks.id = embeddings.task_artifact_chunk_id + AND chunks.user_id = embeddings.user_id + WHERE chunks.task_artifact_id = %s + ORDER BY chunks.sequence_no ASC, embeddings.created_at ASC, embeddings.id ASC + """ + +UPDATE_TASK_ARTIFACT_CHUNK_EMBEDDING_SQL = """ + WITH updated AS ( + UPDATE 
task_artifact_chunk_embeddings + SET dimensions = %s, + vector = %s, + updated_at = clock_timestamp() + WHERE id = %s + RETURNING + id, + user_id, + task_artifact_chunk_id, + embedding_config_id, + dimensions, + vector, + created_at, + updated_at + ) + SELECT + updated.id, + updated.user_id, + chunks.task_artifact_id, + updated.task_artifact_chunk_id, + chunks.sequence_no AS task_artifact_chunk_sequence_no, + updated.embedding_config_id, + updated.dimensions, + updated.vector, + updated.created_at, + updated.updated_at + FROM updated + JOIN task_artifact_chunks AS chunks + ON chunks.id = updated.task_artifact_chunk_id + AND chunks.user_id = updated.user_id + """ + UPDATE_TASK_ARTIFACT_INGESTION_STATUS_SQL = """ UPDATE task_artifacts SET ingestion_status = %s, @@ -2770,9 +2958,77 @@ def create_task_artifact_chunk( (task_artifact_id, sequence_no, char_start, char_end_exclusive, text), ) + def get_task_artifact_chunk_optional(self, task_artifact_chunk_id: UUID) -> TaskArtifactChunkRow | None: + return self._fetch_optional_one(GET_TASK_ARTIFACT_CHUNK_SQL, (task_artifact_chunk_id,)) + def list_task_artifact_chunks(self, task_artifact_id: UUID) -> list[TaskArtifactChunkRow]: return self._fetch_all(LIST_TASK_ARTIFACT_CHUNKS_SQL, (task_artifact_id,)) + def create_task_artifact_chunk_embedding( + self, + *, + task_artifact_chunk_id: UUID, + embedding_config_id: UUID, + dimensions: int, + vector: list[float], + ) -> TaskArtifactChunkEmbeddingRow: + return self._fetch_one( + "create_task_artifact_chunk_embedding", + INSERT_TASK_ARTIFACT_CHUNK_EMBEDDING_SQL, + (task_artifact_chunk_id, embedding_config_id, dimensions, Jsonb(vector)), + ) + + def get_task_artifact_chunk_embedding_optional( + self, + task_artifact_chunk_embedding_id: UUID, + ) -> TaskArtifactChunkEmbeddingRow | None: + return self._fetch_optional_one( + GET_TASK_ARTIFACT_CHUNK_EMBEDDING_SQL, + (task_artifact_chunk_embedding_id,), + ) + + def get_task_artifact_chunk_embedding_by_chunk_and_config_optional( + self, 
+ *, + task_artifact_chunk_id: UUID, + embedding_config_id: UUID, + ) -> TaskArtifactChunkEmbeddingRow | None: + return self._fetch_optional_one( + GET_TASK_ARTIFACT_CHUNK_EMBEDDING_BY_CHUNK_AND_CONFIG_SQL, + (task_artifact_chunk_id, embedding_config_id), + ) + + def list_task_artifact_chunk_embeddings_for_chunk( + self, + task_artifact_chunk_id: UUID, + ) -> list[TaskArtifactChunkEmbeddingRow]: + return self._fetch_all( + LIST_TASK_ARTIFACT_CHUNK_EMBEDDINGS_FOR_CHUNK_SQL, + (task_artifact_chunk_id,), + ) + + def list_task_artifact_chunk_embeddings_for_artifact( + self, + task_artifact_id: UUID, + ) -> list[TaskArtifactChunkEmbeddingRow]: + return self._fetch_all( + LIST_TASK_ARTIFACT_CHUNK_EMBEDDINGS_FOR_ARTIFACT_SQL, + (task_artifact_id,), + ) + + def update_task_artifact_chunk_embedding( + self, + *, + task_artifact_chunk_embedding_id: UUID, + dimensions: int, + vector: list[float], + ) -> TaskArtifactChunkEmbeddingRow: + return self._fetch_one( + "update_task_artifact_chunk_embedding", + UPDATE_TASK_ARTIFACT_CHUNK_EMBEDDING_SQL, + (dimensions, Jsonb(vector), task_artifact_chunk_embedding_id), + ) + def update_task_artifact_ingestion_status( self, *, diff --git a/tests/integration/test_migrations.py b/tests/integration/test_migrations.py index 001f1a0..e8e7011 100644 --- a/tests/integration/test_migrations.py +++ b/tests/integration/test_migrations.py @@ -303,6 +303,8 @@ def test_migrations_upgrade_and_downgrade(database_urls): assert cur.fetchone()[0] == "task_artifacts" cur.execute("SELECT to_regclass('public.task_artifact_chunks')") assert cur.fetchone()[0] == "task_artifact_chunks" + cur.execute("SELECT to_regclass('public.task_artifact_chunk_embeddings')") + assert cur.fetchone()[0] == "task_artifact_chunk_embeddings" cur.execute("SELECT to_regclass('public.task_steps')") assert cur.fetchone()[0] == "task_steps" cur.execute( @@ -386,6 +388,7 @@ def test_migrations_upgrade_and_downgrade(database_urls): 'task_workspaces', 'task_artifacts', 
'task_artifact_chunks', + 'task_artifact_chunk_embeddings', 'task_steps', 'execution_budgets', 'tool_executions' @@ -407,6 +410,7 @@ def test_migrations_upgrade_and_downgrade(database_urls): ("memory_revisions", True, True), ("policies", True, True), ("sessions", True, True), + ("task_artifact_chunk_embeddings", True, True), ("task_artifact_chunks", True, True), ("task_artifacts", True, True), ("task_steps", True, True), @@ -479,6 +483,8 @@ def test_migrations_upgrade_and_downgrade(database_urls): has_table_privilege('alicebot_app', 'task_artifacts', 'DELETE'), has_table_privilege('alicebot_app', 'task_artifact_chunks', 'UPDATE'), has_table_privilege('alicebot_app', 'task_artifact_chunks', 'DELETE'), + has_table_privilege('alicebot_app', 'task_artifact_chunk_embeddings', 'UPDATE'), + has_table_privilege('alicebot_app', 'task_artifact_chunk_embeddings', 'DELETE'), has_table_privilege('alicebot_app', 'task_steps', 'UPDATE'), has_table_privilege('alicebot_app', 'task_steps', 'DELETE'), has_table_privilege('alicebot_app', 'execution_budgets', 'UPDATE'), @@ -524,16 +530,33 @@ def test_migrations_upgrade_and_downgrade(database_urls): False, True, False, + True, + False, False, False, ) + command.downgrade(config, "20260314_0024") + + with psycopg.connect(database_urls["admin"]) as conn: + with conn.cursor() as cur: + cur.execute("SELECT to_regclass('public.task_artifact_chunk_embeddings')") + assert cur.fetchone()[0] is None + cur.execute("SELECT to_regclass('public.task_artifact_chunks')") + assert cur.fetchone()[0] == "task_artifact_chunks" + cur.execute("SELECT to_regclass('public.task_artifacts')") + assert cur.fetchone()[0] == "task_artifacts" + cur.execute("SELECT to_regclass('public.task_workspaces')") + assert cur.fetchone()[0] == "task_workspaces" + command.downgrade(config, "20260313_0021") with psycopg.connect(database_urls["admin"]) as conn: with conn.cursor() as cur: cur.execute("SELECT to_regclass('public.task_artifact_chunks')") assert cur.fetchone()[0] is 
None + cur.execute("SELECT to_regclass('public.task_artifact_chunk_embeddings')") + assert cur.fetchone()[0] is None cur.execute("SELECT to_regclass('public.task_artifacts')") assert cur.fetchone()[0] is None cur.execute("SELECT to_regclass('public.task_workspaces')") diff --git a/tests/integration/test_task_artifact_chunk_embeddings_api.py b/tests/integration/test_task_artifact_chunk_embeddings_api.py new file mode 100644 index 0000000..97559cf --- /dev/null +++ b/tests/integration/test_task_artifact_chunk_embeddings_api.py @@ -0,0 +1,474 @@ +from __future__ import annotations + +import json +from typing import Any +from urllib.parse import urlencode +from uuid import UUID, uuid4 + +import anyio + +import apps.api.src.alicebot_api.main as main_module +from apps.api.src.alicebot_api.config import Settings +from alicebot_api.db import user_connection +from alicebot_api.store import ContinuityStore + + +def invoke_request( + method: str, + path: str, + *, + query_params: dict[str, str] | None = None, + payload: dict[str, Any] | None = None, +) -> tuple[int, dict[str, Any]]: + messages: list[dict[str, object]] = [] + encoded_body = b"" if payload is None else json.dumps(payload).encode() + request_received = False + + async def receive() -> dict[str, object]: + nonlocal request_received + if request_received: + return {"type": "http.disconnect"} + + request_received = True + return {"type": "http.request", "body": encoded_body, "more_body": False} + + async def send(message: dict[str, object]) -> None: + messages.append(message) + + query_string = urlencode(query_params or {}).encode() + scope = { + "type": "http", + "asgi": {"version": "3.0"}, + "http_version": "1.1", + "method": method, + "scheme": "http", + "path": path, + "raw_path": path.encode(), + "query_string": query_string, + "headers": [(b"content-type", b"application/json")], + "client": ("testclient", 50000), + "server": ("testserver", 80), + "root_path": "", + } + + anyio.run(main_module.app, scope, 
receive, send) + + start_message = next(message for message in messages if message["type"] == "http.response.start") + body = b"".join( + message.get("body", b"") + for message in messages + if message["type"] == "http.response.body" + ) + return start_message["status"], json.loads(body) + + +def seed_task_artifact_with_chunks(database_url: str, *, email: str) -> dict[str, UUID]: + user_id = uuid4() + + with user_connection(database_url, user_id) as conn: + store = ContinuityStore(conn) + store.create_user(user_id, email, email.split("@", 1)[0].title()) + thread = store.create_thread("Artifact chunk embedding thread") + tool = store.create_tool( + tool_key="proxy.echo", + name="Proxy Echo", + description="Deterministic proxy handler.", + version="1.0.0", + metadata_version="tool_metadata_v0", + active=True, + tags=["proxy"], + action_hints=["tool.run"], + scope_hints=["workspace"], + domain_hints=[], + risk_hints=[], + metadata={"transport": "proxy"}, + ) + task = store.create_task( + thread_id=thread["id"], + tool_id=tool["id"], + status="approved", + request={ + "thread_id": str(thread["id"]), + "tool_id": str(tool["id"]), + "action": "tool.run", + "scope": "workspace", + "domain_hint": None, + "risk_hint": None, + "attributes": {}, + }, + tool={ + "id": str(tool["id"]), + "tool_key": "proxy.echo", + "name": "Proxy Echo", + "description": "Deterministic proxy handler.", + "version": "1.0.0", + "metadata_version": "tool_metadata_v0", + "active": True, + "tags": ["proxy"], + "action_hints": ["tool.run"], + "scope_hints": ["workspace"], + "domain_hints": [], + "risk_hints": [], + "metadata": {"transport": "proxy"}, + "created_at": tool["created_at"].isoformat(), + }, + latest_approval_id=None, + latest_execution_id=None, + ) + workspace = store.create_task_workspace( + task_id=task["id"], + status="active", + local_path=f"/tmp/task-workspaces/{user_id}/{task['id']}", + ) + artifact = store.create_task_artifact( + task_id=task["id"], + 
task_workspace_id=workspace["id"], + status="registered", + ingestion_status="ingested", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + first_chunk = store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=12, + text="alpha chunk", + ) + second_chunk = store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=2, + char_start=12, + char_end_exclusive=24, + text="beta chunk", + ) + + return { + "user_id": user_id, + "task_id": task["id"], + "task_artifact_id": artifact["id"], + "first_chunk_id": first_chunk["id"], + "second_chunk_id": second_chunk["id"], + } + + +def seed_embedding_config( + database_url: str, + *, + user_id: UUID, + provider: str, + model: str, + version: str, + dimensions: int, +) -> UUID: + with user_connection(database_url, user_id) as conn: + created = ContinuityStore(conn).create_embedding_config( + provider=provider, + model=model, + version=version, + dimensions=dimensions, + status="active", + metadata={"task": "artifact_chunk_retrieval"}, + ) + return created["id"] + + +def test_task_artifact_chunk_embedding_endpoints_persist_and_read_embeddings( + migrated_database_urls, + monkeypatch, +) -> None: + seeded = seed_task_artifact_with_chunks( + migrated_database_urls["app"], + email="owner@example.com", + ) + first_config_id = seed_embedding_config( + migrated_database_urls["app"], + user_id=seeded["user_id"], + provider="openai", + model="text-embedding-3-small", + version="2026-03-14", + dimensions=3, + ) + second_config_id = seed_embedding_config( + migrated_database_urls["app"], + user_id=seeded["user_id"], + provider="openai", + model="text-embedding-3-large", + version="2026-03-15", + dimensions=3, + ) + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings(database_url=migrated_database_urls["app"]), + ) + + second_write_status, second_write_payload = invoke_request( + "POST", + 
"/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(seeded["user_id"]), + "task_artifact_chunk_id": str(seeded["second_chunk_id"]), + "embedding_config_id": str(first_config_id), + "vector": [0.4, 0.5, 0.6], + }, + ) + first_write_status, first_write_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(seeded["user_id"]), + "task_artifact_chunk_id": str(seeded["first_chunk_id"]), + "embedding_config_id": str(second_config_id), + "vector": [0.1, 0.2, 0.3], + }, + ) + update_status, update_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(seeded["user_id"]), + "task_artifact_chunk_id": str(seeded["second_chunk_id"]), + "embedding_config_id": str(first_config_id), + "vector": [0.9, 0.8, 0.7], + }, + ) + artifact_list_status, artifact_list_payload = invoke_request( + "GET", + f"/v0/task-artifacts/{seeded['task_artifact_id']}/chunk-embeddings", + query_params={"user_id": str(seeded["user_id"])}, + ) + chunk_list_status, chunk_list_payload = invoke_request( + "GET", + f"/v0/task-artifact-chunks/{seeded['second_chunk_id']}/embeddings", + query_params={"user_id": str(seeded["user_id"])}, + ) + detail_status, detail_payload = invoke_request( + "GET", + f"/v0/task-artifact-chunk-embeddings/{update_payload['embedding']['id']}", + query_params={"user_id": str(seeded["user_id"])}, + ) + + assert second_write_status == 201 + assert second_write_payload["write_mode"] == "created" + assert first_write_status == 201 + assert first_write_payload["write_mode"] == "created" + assert update_status == 201 + assert update_payload["write_mode"] == "updated" + assert update_payload["embedding"]["vector"] == [0.9, 0.8, 0.7] + assert artifact_list_status == 200 + assert artifact_list_payload["summary"] == { + "total_count": 2, + "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"], + "scope": { + "kind": "artifact", + "task_artifact_id": 
str(seeded["task_artifact_id"]), + }, + } + assert chunk_list_status == 200 + assert chunk_list_payload["summary"] == { + "total_count": 1, + "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"], + "scope": { + "kind": "chunk", + "task_artifact_id": str(seeded["task_artifact_id"]), + "task_artifact_chunk_id": str(seeded["second_chunk_id"]), + }, + } + assert detail_status == 200 + assert detail_payload["embedding"]["id"] == update_payload["embedding"]["id"] + assert detail_payload["embedding"]["task_artifact_chunk_sequence_no"] == 2 + assert set(detail_payload["embedding"]) == { + "id", + "task_artifact_id", + "task_artifact_chunk_id", + "task_artifact_chunk_sequence_no", + "embedding_config_id", + "dimensions", + "vector", + "created_at", + "updated_at", + } + + with user_connection(migrated_database_urls["app"], seeded["user_id"]) as conn: + stored = ContinuityStore(conn).list_task_artifact_chunk_embeddings_for_artifact( + seeded["task_artifact_id"] + ) + + assert [item["id"] for item in artifact_list_payload["items"]] == [ + str(embedding["id"]) for embedding in stored + ] + assert [item["task_artifact_chunk_id"] for item in artifact_list_payload["items"]] == [ + str(seeded["first_chunk_id"]), + str(seeded["second_chunk_id"]), + ] + + +def test_task_artifact_chunk_embedding_writes_reject_invalid_refs_dimension_mismatches_and_cross_user_refs( + migrated_database_urls, + monkeypatch, +) -> None: + owner = seed_task_artifact_with_chunks(migrated_database_urls["app"], email="owner@example.com") + intruder = seed_task_artifact_with_chunks( + migrated_database_urls["app"], + email="intruder@example.com", + ) + owner_config_id = seed_embedding_config( + migrated_database_urls["app"], + user_id=owner["user_id"], + provider="openai", + model="text-embedding-3-large", + version="2026-03-14", + dimensions=3, + ) + intruder_config_id = seed_embedding_config( + migrated_database_urls["app"], + user_id=intruder["user_id"], + provider="openai", + 
model="text-embedding-3-large", + version="2026-03-14", + dimensions=3, + ) + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings(database_url=migrated_database_urls["app"]), + ) + + missing_config_status, missing_config_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(owner["user_id"]), + "task_artifact_chunk_id": str(owner["first_chunk_id"]), + "embedding_config_id": str(uuid4()), + "vector": [0.1, 0.2, 0.3], + }, + ) + missing_chunk_status, missing_chunk_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(owner["user_id"]), + "task_artifact_chunk_id": str(uuid4()), + "embedding_config_id": str(owner_config_id), + "vector": [0.1, 0.2, 0.3], + }, + ) + mismatch_status, mismatch_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(owner["user_id"]), + "task_artifact_chunk_id": str(owner["first_chunk_id"]), + "embedding_config_id": str(owner_config_id), + "vector": [0.1, 0.2], + }, + ) + cross_user_chunk_status, cross_user_chunk_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(intruder["user_id"]), + "task_artifact_chunk_id": str(owner["first_chunk_id"]), + "embedding_config_id": str(intruder_config_id), + "vector": [0.1, 0.2, 0.3], + }, + ) + cross_user_config_status, cross_user_config_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(intruder["user_id"]), + "task_artifact_chunk_id": str(intruder["first_chunk_id"]), + "embedding_config_id": str(owner_config_id), + "vector": [0.1, 0.2, 0.3], + }, + ) + + assert missing_config_status == 400 + assert missing_config_payload["detail"].startswith( + "embedding_config_id must reference an existing embedding config owned by the user" + ) + assert missing_chunk_status == 400 + assert missing_chunk_payload["detail"].startswith( + 
"task_artifact_chunk_id must reference an existing task artifact chunk owned by the user" + ) + assert mismatch_status == 400 + assert mismatch_payload["detail"] == "vector length must match embedding config dimensions (3): 2" + assert cross_user_chunk_status == 400 + assert cross_user_chunk_payload["detail"] == ( + "task_artifact_chunk_id must reference an existing task artifact chunk owned by the " + f"user: {owner['first_chunk_id']}" + ) + assert cross_user_config_status == 400 + assert cross_user_config_payload["detail"] == ( + "embedding_config_id must reference an existing embedding config owned by the user: " + f"{owner_config_id}" + ) + + +def test_task_artifact_chunk_embedding_reads_respect_per_user_isolation( + migrated_database_urls, + monkeypatch, +) -> None: + owner = seed_task_artifact_with_chunks(migrated_database_urls["app"], email="owner@example.com") + intruder = seed_task_artifact_with_chunks( + migrated_database_urls["app"], + email="intruder@example.com", + ) + owner_config_id = seed_embedding_config( + migrated_database_urls["app"], + user_id=owner["user_id"], + provider="openai", + model="text-embedding-3-large", + version="2026-03-14", + dimensions=3, + ) + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings(database_url=migrated_database_urls["app"]), + ) + + write_status, write_payload = invoke_request( + "POST", + "/v0/task-artifact-chunk-embeddings", + payload={ + "user_id": str(owner["user_id"]), + "task_artifact_chunk_id": str(owner["first_chunk_id"]), + "embedding_config_id": str(owner_config_id), + "vector": [0.1, 0.2, 0.3], + }, + ) + artifact_list_status, artifact_list_payload = invoke_request( + "GET", + f"/v0/task-artifacts/{owner['task_artifact_id']}/chunk-embeddings", + query_params={"user_id": str(intruder["user_id"])}, + ) + chunk_list_status, chunk_list_payload = invoke_request( + "GET", + f"/v0/task-artifact-chunks/{owner['first_chunk_id']}/embeddings", + query_params={"user_id": 
str(intruder["user_id"])}, + ) + detail_status, detail_payload = invoke_request( + "GET", + f"/v0/task-artifact-chunk-embeddings/{write_payload['embedding']['id']}", + query_params={"user_id": str(intruder["user_id"])}, + ) + + assert write_status == 201 + assert artifact_list_status == 404 + assert artifact_list_payload == { + "detail": f"task artifact {owner['task_artifact_id']} was not found" + } + assert chunk_list_status == 404 + assert chunk_list_payload == { + "detail": f"task artifact chunk {owner['first_chunk_id']} was not found" + } + assert detail_status == 404 + assert detail_payload == { + "detail": ( + f"task artifact chunk embedding {write_payload['embedding']['id']} was not found" + ) + } diff --git a/tests/unit/test_20260314_0025_task_artifact_chunk_embeddings.py b/tests/unit/test_20260314_0025_task_artifact_chunk_embeddings.py new file mode 100644 index 0000000..0844b8e --- /dev/null +++ b/tests/unit/test_20260314_0025_task_artifact_chunk_embeddings.py @@ -0,0 +1,46 @@ +from __future__ import annotations + +import importlib + + +MODULE_NAME = "apps.api.alembic.versions.20260314_0025_task_artifact_chunk_embeddings" + + +def load_migration_module(): + return importlib.import_module(MODULE_NAME) + + +def test_upgrade_executes_expected_statements_in_order(monkeypatch) -> None: + module = load_migration_module() + executed: list[str] = [] + + monkeypatch.setattr(module.op, "execute", executed.append) + + module.upgrade() + + assert executed == [ + module._UPGRADE_SCHEMA_STATEMENT, + *module._UPGRADE_GRANT_STATEMENTS, + "ALTER TABLE task_artifact_chunk_embeddings ENABLE ROW LEVEL SECURITY", + "ALTER TABLE task_artifact_chunk_embeddings FORCE ROW LEVEL SECURITY", + module._UPGRADE_POLICY_STATEMENT, + ] + + +def test_downgrade_executes_expected_statements_in_order(monkeypatch) -> None: + module = load_migration_module() + executed: list[str] = [] + + monkeypatch.setattr(module.op, "execute", executed.append) + + module.downgrade() + + assert executed == 
list(module._DOWNGRADE_STATEMENTS) + + +def test_task_artifact_chunk_embedding_privileges_stay_narrow() -> None: + module = load_migration_module() + + assert module._UPGRADE_GRANT_STATEMENTS == ( + "GRANT SELECT, INSERT, UPDATE ON task_artifact_chunk_embeddings TO alicebot_app", + ) diff --git a/tests/unit/test_main.py b/tests/unit/test_main.py index 7afeb2c..9c7a926 100644 --- a/tests/unit/test_main.py +++ b/tests/unit/test_main.py @@ -7,12 +7,15 @@ import pytest import apps.api.src.alicebot_api.main as main_module from apps.api.src.alicebot_api.config import Settings +from alicebot_api.artifacts import TaskArtifactNotFoundError from alicebot_api.compiler import CompiledTraceRun from alicebot_api.contracts import AdmissionDecisionOutput from alicebot_api.embedding import ( EmbeddingConfigValidationError, MemoryEmbeddingNotFoundError, MemoryEmbeddingValidationError, + TaskArtifactChunkEmbeddingNotFoundError, + TaskArtifactChunkEmbeddingValidationError, ) from alicebot_api.entity import EntityNotFoundError, EntityValidationError from alicebot_api.entity_edge import EntityEdgeValidationError @@ -112,6 +115,10 @@ def test_healthcheck_route_is_registered() -> None: assert "/v0/memory-embeddings" in route_paths assert "/v0/memories/{memory_id}/embeddings" in route_paths assert "/v0/memory-embeddings/{memory_embedding_id}" in route_paths + assert "/v0/task-artifact-chunk-embeddings" in route_paths + assert "/v0/task-artifacts/{task_artifact_id}/chunk-embeddings" in route_paths + assert "/v0/task-artifact-chunks/{task_artifact_chunk_id}/embeddings" in route_paths + assert "/v0/task-artifact-chunk-embeddings/{task_artifact_chunk_embedding_id}" in route_paths assert "/v0/entities" in route_paths assert "/v0/entity-edges" in route_paths assert "/v0/tools/route" in route_paths @@ -2100,6 +2107,226 @@ def fake_user_connection(_database_url: str, _current_user_id): } +def test_task_artifact_chunk_embedding_routes_success_and_validation_errors(monkeypatch) -> None: + user_id = 
uuid4() + chunk_id = uuid4() + config_id = uuid4() + settings = Settings(database_url="postgresql://app") + captured: dict[str, object] = {} + + @contextmanager + def fake_user_connection(database_url: str, current_user_id): + captured["database_url"] = database_url + captured["current_user_id"] = current_user_id + yield object() + + def fake_upsert_task_artifact_chunk_embedding_record(store, *, user_id, request): + captured["store_type"] = type(store).__name__ + captured["user_id"] = user_id + captured["request"] = request + return { + "embedding": { + "id": "artifact-embedding-123", + "task_artifact_id": "artifact-123", + "task_artifact_chunk_id": str(chunk_id), + "task_artifact_chunk_sequence_no": 2, + "embedding_config_id": str(config_id), + "dimensions": 3, + "vector": [0.1, 0.2, 0.3], + "created_at": "2026-03-14T12:00:00+00:00", + "updated_at": "2026-03-14T12:00:00+00:00", + }, + "write_mode": "created", + } + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "upsert_task_artifact_chunk_embedding_record", + fake_upsert_task_artifact_chunk_embedding_record, + ) + + response = main_module.upsert_task_artifact_chunk_embedding( + main_module.UpsertTaskArtifactChunkEmbeddingRequest( + user_id=user_id, + task_artifact_chunk_id=chunk_id, + embedding_config_id=config_id, + vector=[0.1, 0.2, 0.3], + ) + ) + + assert response.status_code == 201 + assert json.loads(response.body)["write_mode"] == "created" + assert captured["database_url"] == "postgresql://app" + assert captured["current_user_id"] == user_id + assert captured["user_id"] == user_id + assert captured["request"].task_artifact_chunk_id == chunk_id + + monkeypatch.setattr( + main_module, + "upsert_task_artifact_chunk_embedding_record", + lambda *_args, **_kwargs: (_ for _ in ()).throw( + TaskArtifactChunkEmbeddingValidationError( + "task_artifact_chunk_id must reference an 
existing task artifact chunk owned by the user" + ) + ), + ) + + error_response = main_module.upsert_task_artifact_chunk_embedding( + main_module.UpsertTaskArtifactChunkEmbeddingRequest( + user_id=user_id, + task_artifact_chunk_id=chunk_id, + embedding_config_id=config_id, + vector=[0.1, 0.2, 0.3], + ) + ) + + assert error_response.status_code == 400 + assert json.loads(error_response.body) == { + "detail": "task_artifact_chunk_id must reference an existing task artifact chunk owned by the user" + } + + +def test_task_artifact_chunk_embedding_read_routes_return_payload_and_not_found(monkeypatch) -> None: + user_id = uuid4() + artifact_id = uuid4() + chunk_id = uuid4() + embedding_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(_database_url: str, _current_user_id): + yield object() + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "list_task_artifact_chunk_embedding_records_for_artifact", + lambda *_args, **_kwargs: { + "items": [], + "summary": { + "total_count": 0, + "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"], + "scope": { + "kind": "artifact", + "task_artifact_id": str(artifact_id), + }, + }, + }, + ) + monkeypatch.setattr( + main_module, + "list_task_artifact_chunk_embedding_records_for_chunk", + lambda *_args, **_kwargs: { + "items": [], + "summary": { + "total_count": 0, + "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"], + "scope": { + "kind": "chunk", + "task_artifact_id": str(artifact_id), + "task_artifact_chunk_id": str(chunk_id), + }, + }, + }, + ) + monkeypatch.setattr( + main_module, + "get_task_artifact_chunk_embedding_record", + lambda *_args, **_kwargs: { + "embedding": { + "id": str(embedding_id), + "task_artifact_id": str(artifact_id), + "task_artifact_chunk_id": str(chunk_id), + 
"task_artifact_chunk_sequence_no": 2, + "embedding_config_id": "config-123", + "dimensions": 3, + "vector": [0.1, 0.2, 0.3], + "created_at": "2026-03-14T12:00:00+00:00", + "updated_at": "2026-03-14T12:00:00+00:00", + } + }, + ) + + artifact_response = main_module.list_task_artifact_chunk_embeddings_for_artifact( + task_artifact_id=artifact_id, + user_id=user_id, + ) + chunk_response = main_module.list_task_artifact_chunk_embeddings( + task_artifact_chunk_id=chunk_id, + user_id=user_id, + ) + detail_response = main_module.get_task_artifact_chunk_embedding( + task_artifact_chunk_embedding_id=embedding_id, + user_id=user_id, + ) + + assert artifact_response.status_code == 200 + assert json.loads(artifact_response.body)["summary"]["scope"]["task_artifact_id"] == str( + artifact_id + ) + assert chunk_response.status_code == 200 + assert json.loads(chunk_response.body)["summary"]["scope"]["task_artifact_chunk_id"] == str( + chunk_id + ) + assert detail_response.status_code == 200 + assert json.loads(detail_response.body)["embedding"]["id"] == str(embedding_id) + + monkeypatch.setattr( + main_module, + "list_task_artifact_chunk_embedding_records_for_artifact", + lambda *_args, **_kwargs: (_ for _ in ()).throw( + TaskArtifactNotFoundError(f"task artifact {artifact_id} was not found") + ), + ) + monkeypatch.setattr( + main_module, + "list_task_artifact_chunk_embedding_records_for_chunk", + lambda *_args, **_kwargs: (_ for _ in ()).throw( + TaskArtifactChunkEmbeddingNotFoundError( + f"task artifact chunk {chunk_id} was not found" + ) + ), + ) + monkeypatch.setattr( + main_module, + "get_task_artifact_chunk_embedding_record", + lambda *_args, **_kwargs: (_ for _ in ()).throw( + TaskArtifactChunkEmbeddingNotFoundError( + f"task artifact chunk embedding {embedding_id} was not found" + ) + ), + ) + + missing_artifact_response = main_module.list_task_artifact_chunk_embeddings_for_artifact( + task_artifact_id=artifact_id, + user_id=user_id, + ) + missing_chunk_response = 
main_module.list_task_artifact_chunk_embeddings( + task_artifact_chunk_id=chunk_id, + user_id=user_id, + ) + missing_detail_response = main_module.get_task_artifact_chunk_embedding( + task_artifact_chunk_embedding_id=embedding_id, + user_id=user_id, + ) + + assert missing_artifact_response.status_code == 404 + assert json.loads(missing_artifact_response.body) == { + "detail": f"task artifact {artifact_id} was not found" + } + assert missing_chunk_response.status_code == 404 + assert json.loads(missing_chunk_response.body) == { + "detail": f"task artifact chunk {chunk_id} was not found" + } + assert missing_detail_response.status_code == 404 + assert json.loads(missing_detail_response.body) == { + "detail": f"task artifact chunk embedding {embedding_id} was not found" + } + + def test_create_entity_returns_created_payload(monkeypatch) -> None: user_id = uuid4() first_memory_id = uuid4() diff --git a/tests/unit/test_task_artifact_chunk_embedding.py b/tests/unit/test_task_artifact_chunk_embedding.py new file mode 100644 index 0000000..d70366a --- /dev/null +++ b/tests/unit/test_task_artifact_chunk_embedding.py @@ -0,0 +1,344 @@ +from __future__ import annotations + +from datetime import UTC, datetime, timedelta +from uuid import UUID, uuid4 + +import pytest + +from alicebot_api.artifacts import TaskArtifactNotFoundError +from alicebot_api.contracts import TaskArtifactChunkEmbeddingUpsertInput +from alicebot_api.embedding import ( + TaskArtifactChunkEmbeddingNotFoundError, + TaskArtifactChunkEmbeddingValidationError, + get_task_artifact_chunk_embedding_record, + list_task_artifact_chunk_embedding_records_for_artifact, + list_task_artifact_chunk_embedding_records_for_chunk, + upsert_task_artifact_chunk_embedding_record, +) + + +class TaskArtifactChunkEmbeddingStoreStub: + def __init__(self) -> None: + self.base_time = datetime(2026, 3, 14, 12, 0, tzinfo=UTC) + self.artifacts: dict[UUID, dict[str, object]] = {} + self.chunks: dict[UUID, dict[str, object]] = {} + 
self.configs: dict[UUID, dict[str, object]] = {} + self.embeddings: list[dict[str, object]] = [] + self.embedding_by_id: dict[UUID, dict[str, object]] = {} + + def create_artifact(self) -> UUID: + artifact_id = uuid4() + self.artifacts[artifact_id] = { + "id": artifact_id, + "task_id": uuid4(), + "task_workspace_id": uuid4(), + "status": "registered", + "ingestion_status": "ingested", + "relative_path": "docs/spec.txt", + "media_type_hint": "text/plain", + "created_at": self.base_time, + "updated_at": self.base_time, + } + return artifact_id + + def create_chunk(self, *, task_artifact_id: UUID, sequence_no: int) -> UUID: + chunk_id = uuid4() + self.chunks[chunk_id] = { + "id": chunk_id, + "task_artifact_id": task_artifact_id, + "sequence_no": sequence_no, + "char_start": (sequence_no - 1) * 10, + "char_end_exclusive": sequence_no * 10, + "text": f"chunk-{sequence_no}", + "created_at": self.base_time + timedelta(minutes=sequence_no), + "updated_at": self.base_time + timedelta(minutes=sequence_no), + } + return chunk_id + + def create_config(self, *, dimensions: int = 3) -> UUID: + config_id = uuid4() + self.configs[config_id] = { + "id": config_id, + "provider": "openai", + "model": "text-embedding-3-large", + "version": "2026-03-14", + "dimensions": dimensions, + "status": "active", + "metadata": {"task": "artifact_chunk_retrieval"}, + "created_at": self.base_time, + } + return config_id + + def get_task_artifact_optional(self, task_artifact_id: UUID) -> dict[str, object] | None: + return self.artifacts.get(task_artifact_id) + + def get_task_artifact_chunk_optional( + self, + task_artifact_chunk_id: UUID, + ) -> dict[str, object] | None: + return self.chunks.get(task_artifact_chunk_id) + + def get_embedding_config_optional(self, embedding_config_id: UUID) -> dict[str, object] | None: + return self.configs.get(embedding_config_id) + + def get_task_artifact_chunk_embedding_by_chunk_and_config_optional( + self, + *, + task_artifact_chunk_id: UUID, + 
+        embedding_config_id: UUID,
+    ) -> dict[str, object] | None:
+        return next(
+            (
+                embedding
+                for embedding in self.embeddings
+                if embedding["task_artifact_chunk_id"] == task_artifact_chunk_id
+                and embedding["embedding_config_id"] == embedding_config_id
+            ),
+            None,
+        )
+
+    def create_task_artifact_chunk_embedding(
+        self,
+        *,
+        task_artifact_chunk_id: UUID,
+        embedding_config_id: UUID,
+        dimensions: int,
+        vector: list[float],
+    ) -> dict[str, object]:
+        chunk = self.chunks[task_artifact_chunk_id]
+        embedding_id = uuid4()
+        record = {
+            "id": embedding_id,
+            "user_id": uuid4(),
+            "task_artifact_id": chunk["task_artifact_id"],
+            "task_artifact_chunk_id": task_artifact_chunk_id,
+            "task_artifact_chunk_sequence_no": chunk["sequence_no"],
+            "embedding_config_id": embedding_config_id,
+            "dimensions": dimensions,
+            "vector": vector,
+            "created_at": self.base_time + timedelta(seconds=len(self.embeddings)),
+            "updated_at": self.base_time + timedelta(seconds=len(self.embeddings)),
+        }
+        self.embeddings.append(record)
+        self.embedding_by_id[embedding_id] = record
+        return record
+
+    def update_task_artifact_chunk_embedding(
+        self,
+        *,
+        task_artifact_chunk_embedding_id: UUID,
+        dimensions: int,
+        vector: list[float],
+    ) -> dict[str, object]:
+        record = self.embedding_by_id[task_artifact_chunk_embedding_id]
+        updated = {
+            **record,
+            "dimensions": dimensions,
+            "vector": vector,
+            "updated_at": self.base_time + timedelta(minutes=10),
+        }
+        self.embedding_by_id[task_artifact_chunk_embedding_id] = updated
+        for index, existing in enumerate(self.embeddings):
+            if existing["id"] == task_artifact_chunk_embedding_id:
+                self.embeddings[index] = updated
+        return updated
+
+    def get_task_artifact_chunk_embedding_optional(
+        self,
+        task_artifact_chunk_embedding_id: UUID,
+    ) -> dict[str, object] | None:
+        return self.embedding_by_id.get(task_artifact_chunk_embedding_id)
+
+    def list_task_artifact_chunk_embeddings_for_artifact(
+        self,
+        task_artifact_id: UUID,
+    ) -> list[dict[str, object]]:
+        return sorted(
+            (
+                embedding
+                for embedding in self.embeddings
+                if embedding["task_artifact_id"] == task_artifact_id
+            ),
+            key=lambda embedding: (
+                embedding["task_artifact_chunk_sequence_no"],
+                embedding["created_at"],
+                embedding["id"],
+            ),
+        )
+
+    def list_task_artifact_chunk_embeddings_for_chunk(
+        self,
+        task_artifact_chunk_id: UUID,
+    ) -> list[dict[str, object]]:
+        return sorted(
+            (
+                embedding
+                for embedding in self.embeddings
+                if embedding["task_artifact_chunk_id"] == task_artifact_chunk_id
+            ),
+            key=lambda embedding: (
+                embedding["task_artifact_chunk_sequence_no"],
+                embedding["created_at"],
+                embedding["id"],
+            ),
+        )
+
+
+def test_task_artifact_chunk_embedding_writes_and_reads_are_deterministic() -> None:
+    store = TaskArtifactChunkEmbeddingStoreStub()
+    artifact_id = store.create_artifact()
+    first_chunk_id = store.create_chunk(task_artifact_id=artifact_id, sequence_no=1)
+    second_chunk_id = store.create_chunk(task_artifact_id=artifact_id, sequence_no=2)
+    config_id = store.create_config()
+
+    second_write = upsert_task_artifact_chunk_embedding_record(
+        store,  # type: ignore[arg-type]
+        user_id=uuid4(),
+        request=TaskArtifactChunkEmbeddingUpsertInput(
+            task_artifact_chunk_id=second_chunk_id,
+            embedding_config_id=config_id,
+            vector=(0.4, 0.5, 0.6),
+        ),
+    )
+    first_write = upsert_task_artifact_chunk_embedding_record(
+        store,  # type: ignore[arg-type]
+        user_id=uuid4(),
+        request=TaskArtifactChunkEmbeddingUpsertInput(
+            task_artifact_chunk_id=first_chunk_id,
+            embedding_config_id=config_id,
+            vector=(0.1, 0.2, 0.3),
+        ),
+    )
+    updated = upsert_task_artifact_chunk_embedding_record(
+        store,  # type: ignore[arg-type]
+        user_id=uuid4(),
+        request=TaskArtifactChunkEmbeddingUpsertInput(
+            task_artifact_chunk_id=second_chunk_id,
+            embedding_config_id=config_id,
+            vector=(0.9, 0.8, 0.7),
+        ),
+    )
+
+    artifact_payload = list_task_artifact_chunk_embedding_records_for_artifact(
+        store,  # type: ignore[arg-type]
+        user_id=uuid4(),
+        task_artifact_id=artifact_id,
+    )
+    chunk_payload = list_task_artifact_chunk_embedding_records_for_chunk(
+        store,  # type: ignore[arg-type]
+        user_id=uuid4(),
+        task_artifact_chunk_id=second_chunk_id,
+    )
+    detail_payload = get_task_artifact_chunk_embedding_record(
+        store,  # type: ignore[arg-type]
+        user_id=uuid4(),
+        task_artifact_chunk_embedding_id=UUID(updated["embedding"]["id"]),
+    )
+
+    assert second_write["write_mode"] == "created"
+    assert first_write["write_mode"] == "created"
+    assert updated["write_mode"] == "updated"
+    assert updated["embedding"]["vector"] == [0.9, 0.8, 0.7]
+    assert [item["task_artifact_chunk_id"] for item in artifact_payload["items"]] == [
+        str(first_chunk_id),
+        str(second_chunk_id),
+    ]
+    assert artifact_payload["summary"] == {
+        "total_count": 2,
+        "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"],
+        "scope": {
+            "kind": "artifact",
+            "task_artifact_id": str(artifact_id),
+        },
+    }
+    assert chunk_payload["summary"] == {
+        "total_count": 1,
+        "order": ["task_artifact_chunk_sequence_no_asc", "created_at_asc", "id_asc"],
+        "scope": {
+            "kind": "chunk",
+            "task_artifact_id": str(artifact_id),
+            "task_artifact_chunk_id": str(second_chunk_id),
+        },
+    }
+    assert detail_payload["embedding"]["id"] == updated["embedding"]["id"]
+    assert detail_payload["embedding"]["task_artifact_chunk_sequence_no"] == 2
+
+
+def test_task_artifact_chunk_embedding_writes_reject_missing_refs_and_dimension_mismatch() -> None:
+    store = TaskArtifactChunkEmbeddingStoreStub()
+    artifact_id = store.create_artifact()
+    chunk_id = store.create_chunk(task_artifact_id=artifact_id, sequence_no=1)
+    config_id = store.create_config(dimensions=3)
+
+    with pytest.raises(
+        TaskArtifactChunkEmbeddingValidationError,
+        match="task_artifact_chunk_id must reference an existing task artifact chunk owned by the user",
+    ):
+        upsert_task_artifact_chunk_embedding_record(
+            store,  # type: ignore[arg-type]
+            user_id=uuid4(),
+            request=TaskArtifactChunkEmbeddingUpsertInput(
+                task_artifact_chunk_id=uuid4(),
+                embedding_config_id=config_id,
+                vector=(0.1, 0.2, 0.3),
+            ),
+        )
+
+    with pytest.raises(
+        TaskArtifactChunkEmbeddingValidationError,
+        match="embedding_config_id must reference an existing embedding config owned by the user",
+    ):
+        upsert_task_artifact_chunk_embedding_record(
+            store,  # type: ignore[arg-type]
+            user_id=uuid4(),
+            request=TaskArtifactChunkEmbeddingUpsertInput(
+                task_artifact_chunk_id=chunk_id,
+                embedding_config_id=uuid4(),
+                vector=(0.1, 0.2, 0.3),
+            ),
+        )
+
+    with pytest.raises(
+        TaskArtifactChunkEmbeddingValidationError,
+        match=r"vector length must match embedding config dimensions \(3\): 2",
+    ):
+        upsert_task_artifact_chunk_embedding_record(
+            store,  # type: ignore[arg-type]
+            user_id=uuid4(),
+            request=TaskArtifactChunkEmbeddingUpsertInput(
+                task_artifact_chunk_id=chunk_id,
+                embedding_config_id=config_id,
+                vector=(0.1, 0.2),
+            ),
+        )
+
+
+def test_task_artifact_chunk_embedding_reads_raise_not_found_when_scope_is_missing() -> None:
+    store = TaskArtifactChunkEmbeddingStoreStub()
+
+    with pytest.raises(TaskArtifactNotFoundError, match="task artifact .* was not found"):
+        list_task_artifact_chunk_embedding_records_for_artifact(
+            store,  # type: ignore[arg-type]
+            user_id=uuid4(),
+            task_artifact_id=uuid4(),
+        )
+
+    with pytest.raises(
+        TaskArtifactChunkEmbeddingNotFoundError,
+        match="task artifact chunk .* was not found",
+    ):
+        list_task_artifact_chunk_embedding_records_for_chunk(
+            store,  # type: ignore[arg-type]
+            user_id=uuid4(),
+            task_artifact_chunk_id=uuid4(),
+        )
+
+    with pytest.raises(
+        TaskArtifactChunkEmbeddingNotFoundError,
+        match="task artifact chunk embedding .* was not found",
+    ):
+        get_task_artifact_chunk_embedding_record(
+            store,  # type: ignore[arg-type]
+            user_id=uuid4(),
+            task_artifact_chunk_embedding_id=uuid4(),
+        )
diff --git a/tests/unit/test_task_artifact_chunk_embedding_store.py b/tests/unit/test_task_artifact_chunk_embedding_store.py
new file mode 100644
index 0000000..227a191
--- /dev/null
+++ b/tests/unit/test_task_artifact_chunk_embedding_store.py
@@ -0,0 +1,236 @@
+from __future__ import annotations
+
+from datetime import UTC, datetime
+from typing import Any
+from uuid import uuid4
+
+from psycopg.types.json import Jsonb
+
+from alicebot_api.store import ContinuityStore
+
+
+class RecordingCursor:
+    def __init__(
+        self,
+        fetchone_results: list[dict[str, Any]],
+        fetchall_results: list[list[dict[str, Any]]] | None = None,
+    ) -> None:
+        self.executed: list[tuple[str, tuple[object, ...] | None]] = []
+        self.fetchone_results = list(fetchone_results)
+        self.fetchall_results = list(fetchall_results or [])
+
+    def __enter__(self) -> "RecordingCursor":
+        return self
+
+    def __exit__(self, exc_type, exc, tb) -> None:
+        return None
+
+    def execute(self, query: str, params: tuple[object, ...] | None = None) -> None:
+        self.executed.append((query, params))
+
+    def fetchone(self) -> dict[str, Any] | None:
+        if not self.fetchone_results:
+            return None
+        return self.fetchone_results.pop(0)
+
+    def fetchall(self) -> list[dict[str, Any]]:
+        if not self.fetchall_results:
+            return []
+        return self.fetchall_results.pop(0)
+
+
+class RecordingConnection:
+    def __init__(self, cursor: RecordingCursor) -> None:
+        self.cursor_instance = cursor
+
+    def cursor(self) -> RecordingCursor:
+        return self.cursor_instance
+
+
+def test_task_artifact_chunk_embedding_store_methods_use_expected_queries() -> None:
+    task_artifact_id = uuid4()
+    task_artifact_chunk_id = uuid4()
+    embedding_config_id = uuid4()
+    embedding_id = uuid4()
+    created_at = datetime(2026, 3, 14, 12, 0, tzinfo=UTC)
+    updated_at = datetime(2026, 3, 14, 12, 5, tzinfo=UTC)
+    cursor = RecordingCursor(
+        fetchone_results=[
+            {
+                "id": task_artifact_chunk_id,
+                "user_id": uuid4(),
+                "task_artifact_id": task_artifact_id,
+                "sequence_no": 2,
+                "char_start": 10,
+                "char_end_exclusive": 20,
+                "text": "chunk-2",
+                "created_at": created_at,
+                "updated_at": created_at,
+            },
+            {
+                "id": embedding_id,
+                "user_id": uuid4(),
+                "task_artifact_id": task_artifact_id,
+                "task_artifact_chunk_id": task_artifact_chunk_id,
+                "task_artifact_chunk_sequence_no": 2,
+                "embedding_config_id": embedding_config_id,
+                "dimensions": 3,
+                "vector": [0.1, 0.2, 0.3],
+                "created_at": created_at,
+                "updated_at": created_at,
+            },
+            {
+                "id": embedding_id,
+                "user_id": uuid4(),
+                "task_artifact_id": task_artifact_id,
+                "task_artifact_chunk_id": task_artifact_chunk_id,
+                "task_artifact_chunk_sequence_no": 2,
+                "embedding_config_id": embedding_config_id,
+                "dimensions": 3,
+                "vector": [0.3, 0.2, 0.1],
+                "created_at": created_at,
+                "updated_at": updated_at,
+            },
+            {
+                "id": embedding_id,
+                "user_id": uuid4(),
+                "task_artifact_id": task_artifact_id,
+                "task_artifact_chunk_id": task_artifact_chunk_id,
+                "task_artifact_chunk_sequence_no": 2,
+                "embedding_config_id": embedding_config_id,
+                "dimensions": 3,
+                "vector": [0.3, 0.2, 0.1],
+                "created_at": created_at,
+                "updated_at": updated_at,
+            },
+            {
+                "id": embedding_id,
+                "user_id": uuid4(),
+                "task_artifact_id": task_artifact_id,
+                "task_artifact_chunk_id": task_artifact_chunk_id,
+                "task_artifact_chunk_sequence_no": 2,
+                "embedding_config_id": embedding_config_id,
+                "dimensions": 3,
+                "vector": [0.3, 0.2, 0.1],
+                "created_at": created_at,
+                "updated_at": updated_at,
+            },
+        ],
+        fetchall_results=[
+            [
+                {
+                    "id": embedding_id,
+                    "task_artifact_id": task_artifact_id,
+                    "task_artifact_chunk_id": task_artifact_chunk_id,
+                    "task_artifact_chunk_sequence_no": 2,
+                    "embedding_config_id": embedding_config_id,
+                }
+            ],
+            [
+                {
+                    "id": embedding_id,
+                    "task_artifact_id": task_artifact_id,
+                    "task_artifact_chunk_id": task_artifact_chunk_id,
+                    "task_artifact_chunk_sequence_no": 2,
+                    "embedding_config_id": embedding_config_id,
+                }
+            ],
+        ],
+    )
+    store = ContinuityStore(RecordingConnection(cursor))
+
+    fetched_chunk = store.get_task_artifact_chunk_optional(task_artifact_chunk_id)
+    created = store.create_task_artifact_chunk_embedding(
+        task_artifact_chunk_id=task_artifact_chunk_id,
+        embedding_config_id=embedding_config_id,
+        dimensions=3,
+        vector=[0.1, 0.2, 0.3],
+    )
+    updated = store.update_task_artifact_chunk_embedding(
+        task_artifact_chunk_embedding_id=embedding_id,
+        dimensions=3,
+        vector=[0.3, 0.2, 0.1],
+    )
+    fetched_embedding = store.get_task_artifact_chunk_embedding_optional(embedding_id)
+    existing = store.get_task_artifact_chunk_embedding_by_chunk_and_config_optional(
+        task_artifact_chunk_id=task_artifact_chunk_id,
+        embedding_config_id=embedding_config_id,
+    )
+    listed_for_chunk = store.list_task_artifact_chunk_embeddings_for_chunk(task_artifact_chunk_id)
+    listed_for_artifact = store.list_task_artifact_chunk_embeddings_for_artifact(task_artifact_id)
+
+    assert fetched_chunk is not None
+    assert fetched_chunk["id"] == task_artifact_chunk_id
+    assert created["id"] == embedding_id
+    assert updated["updated_at"] == updated_at
+    assert fetched_embedding is not None
+    assert existing is not None
+    assert listed_for_chunk == [
+        {
+            "id": embedding_id,
+            "task_artifact_id": task_artifact_id,
+            "task_artifact_chunk_id": task_artifact_chunk_id,
+            "task_artifact_chunk_sequence_no": 2,
+            "embedding_config_id": embedding_config_id,
+        }
+    ]
+    assert listed_for_artifact == [
+        {
+            "id": embedding_id,
+            "task_artifact_id": task_artifact_id,
+            "task_artifact_chunk_id": task_artifact_chunk_id,
+            "task_artifact_chunk_sequence_no": 2,
+            "embedding_config_id": embedding_config_id,
+        }
+    ]
+
+    get_chunk_query, get_chunk_params = cursor.executed[0]
+    assert "FROM task_artifact_chunks" in get_chunk_query
+    assert get_chunk_params == (task_artifact_chunk_id,)
+
+    create_query, create_params = cursor.executed[1]
+    assert "INSERT INTO task_artifact_chunk_embeddings" in create_query
+    assert "JOIN task_artifact_chunks AS chunks" in create_query
+    assert create_params is not None
+    assert create_params[:3] == (task_artifact_chunk_id, embedding_config_id, 3)
+    assert isinstance(create_params[3], Jsonb)
+    assert create_params[3].obj == [0.1, 0.2, 0.3]
+
+    update_query, update_params = cursor.executed[2]
+    assert "UPDATE task_artifact_chunk_embeddings" in update_query
+    assert update_params is not None
+    assert update_params[0] == 3
+    assert isinstance(update_params[1], Jsonb)
+    assert update_params[1].obj == [0.3, 0.2, 0.1]
+    assert update_params[2] == embedding_id
+
+    get_embedding_query, get_embedding_params = cursor.executed[3]
+    assert "FROM task_artifact_chunk_embeddings AS embeddings" in get_embedding_query
+    assert get_embedding_params == (embedding_id,)
+
+    get_existing_query, get_existing_params = cursor.executed[4]
+    assert "WHERE embeddings.task_artifact_chunk_id = %s" in get_existing_query
+    assert "AND embeddings.embedding_config_id = %s" in get_existing_query
+    assert get_existing_params == (task_artifact_chunk_id, embedding_config_id)
+
+    list_chunk_query, list_chunk_params = cursor.executed[5]
+    assert "WHERE embeddings.task_artifact_chunk_id = %s" in list_chunk_query
+    assert "ORDER BY chunks.sequence_no ASC, embeddings.created_at ASC, embeddings.id ASC" in list_chunk_query
+    assert list_chunk_params == (task_artifact_chunk_id,)
+
+    list_artifact_query, list_artifact_params = cursor.executed[6]
+    assert "WHERE chunks.task_artifact_id = %s" in list_artifact_query
+    assert "ORDER BY chunks.sequence_no ASC, embeddings.created_at ASC, embeddings.id ASC" in list_artifact_query
+    assert list_artifact_params == (task_artifact_id,)
+
+
+def test_task_artifact_chunk_embedding_store_optional_reads_return_none_when_row_is_missing() -> None:
+    cursor = RecordingCursor(fetchone_results=[])
+    store = ContinuityStore(RecordingConnection(cursor))
+
+    assert store.get_task_artifact_chunk_optional(uuid4()) is None
+    assert store.get_task_artifact_chunk_embedding_optional(uuid4()) is None
+    assert store.get_task_artifact_chunk_embedding_by_chunk_and_config_optional(
+        task_artifact_chunk_id=uuid4(),
+        embedding_config_id=uuid4(),
+    ) is None