diff --git a/.ai/active/SPRINT_PACKET.md b/.ai/active/SPRINT_PACKET.md index 3547e32..2fa5e70 100644 --- a/.ai/active/SPRINT_PACKET.md +++ b/.ai/active/SPRINT_PACKET.md @@ -2,7 +2,7 @@ ## Sprint Title -Sprint 5D: Local Artifact Ingestion V0 +Sprint 5E: Artifact Chunk Retrieval V0 ## Sprint Type @@ -10,121 +10,114 @@ feature ## Sprint Reason -Milestone 5 now has deterministic task-workspace boundaries and explicit task-artifact records. The next safe step is to ingest registered local artifacts into durable chunk records, so later document retrieval can operate on explicit ingested data instead of raw filesystem reads. +Milestone 5 now has deterministic workspace boundaries, explicit artifact records, and durable chunk ingestion. The next safe step is to retrieve those ingested chunks through a narrow deterministic read path before adding embeddings, ranking, rich-document parsing, connectors, or UI. ## Sprint Intent -Add a narrow, explicit local artifact-ingestion seam that reads registered text artifacts from rooted task workspaces, chunks them deterministically, and persists durable artifact-chunk records, without yet adding document retrieval, embeddings, connectors, or UI. +Add a narrow retrieval seam over existing `task_artifact_chunks` so clients can request relevant ingested text chunks for one task or artifact using deterministic lexical matching only, without yet adding embeddings, compile-path integration, connectors, or UI. ## Git Instructions -- Branch Name: `codex/sprint-5d-artifact-ingestion-v0` +- Branch Name: `codex/sprint-5e-artifact-chunk-retrieval-v0` - Base Branch: `main` - PR Strategy: one sprint branch, one PR, no stacked PRs unless Control Tower explicitly opens a follow-up sprint - Merge Policy: squash merge only after reviewer `PASS` and explicit Control Tower merge approval ## Why This Sprint -- Sprint 5A shipped task-workspace provisioning. -- Sprint 5C shipped explicit task-artifact registration on top of those workspaces. -- The roadmap says document work should build on the existing artifact/workspace boundary. -- The narrowest next step is ingestion only: turn registered local artifacts into durable chunk records before any retrieval or connector work begins. +- Sprint 5A established rooted task-workspace provisioning. +- Sprint 5C established explicit task-artifact registration. +- Sprint 5D established deterministic local text-artifact ingestion into durable chunk rows. +- The next narrow Milestone 5 seam is retrieval over those persisted chunks only, so later document-aware context work can build on a stable read contract instead of raw file access. 
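+
+As a concrete orientation for the matching rule this intent names, the normalization step can be sketched in a few lines. This is an illustrative sketch only, not the shipped contract; the function name is hypothetical:
+
+```python
+import re
+
+_WORD = re.compile(r"\w+")
+
+
+def unique_query_terms(query: str) -> list[str]:
+    """Casefold the query, extract \\w+ terms, and deduplicate in first-occurrence order."""
+    seen: set[str] = set()
+    terms: list[str] = []
+    for match in _WORD.finditer(query.casefold()):
+        term = match.group(0)
+        if term not in seen:  # keep only the first occurrence of each term
+            seen.add(term)
+            terms.append(term)
+    return terms
+
+
+# unique_query_terms("Alpha beta, alpha") == ["alpha", "beta"]
+```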
## In Scope -- Add schema and migration support for: - - `task_artifact_chunks` - - any narrow `task_artifacts.ingestion_status` expansion required to represent successful ingestion deterministically - Define typed contracts for: - - artifact-ingestion requests - - artifact-ingestion responses - - artifact-chunk list responses - - artifact detail responses updated for ingestion status if needed -- Implement a narrow ingestion seam that: - - accepts one already-registered visible artifact - - resolves its rooted local file path from the persisted workspace boundary plus artifact relative path - - supports only a small explicit text input set, for example `text/plain` and `text/markdown` - - reads file contents deterministically - - normalizes line endings and chunks text deterministically by one explicit rule - - persists ordered chunk rows linked to the artifact - - updates artifact ingestion status deterministically + - artifact-chunk retrieval requests + - artifact-chunk retrieval result items + - retrieval summary metadata +- Implement a narrow retrieval seam that: + - searches only durable `task_artifact_chunks` + - scopes retrieval by the current user plus one explicit task or one explicit artifact + - accepts one explicit text query + - uses deterministic lexical matching only + - returns deterministic ordered chunk results with explicit match metadata + - excludes artifacts that are not yet ingested - Implement the minimal API or service paths needed for: - - ingesting one artifact - - listing chunks for one artifact + - retrieving chunks for one task + - retrieving chunks for one artifact when the caller wants a narrower scope - Add unit and integration tests for: - - supported text artifact ingestion - - deterministic chunk ordering and chunk content boundaries - - rooted path enforcement during ingestion - - unsupported media-type or file-shape rejection + - deterministic retrieval ordering + - scoped retrieval by task and by artifact + - empty-result behavior + - exclusion of non-ingested artifacts - per-user isolation - stable response shape ## Out of Scope -- No compile-path or search retrieval over artifact chunks yet. - No embeddings for artifact chunks. -- No document ranking or chunk selection. -- No PDF, DOCX, OCR, or rich document parsing beyond the narrow supported text set. +- No semantic retrieval or reranking. +- No compile-path integration of artifact chunks yet. +- No PDF, DOCX, OCR, or rich document parsing beyond the already-shipped text ingestion seam. - No Gmail or Calendar connector scope. - No runner-style orchestration. - No UI work. ## Required Deliverables -- Migration for `task_artifact_chunks` and any narrow ingestion-status update. -- Stable artifact-ingestion and artifact-chunk read contracts. -- Minimal deterministic local artifact-ingestion path over registered task artifacts. -- Unit and integration coverage for rooted-path safety, supported-format ingestion, chunk ordering, and isolation. +- Stable chunk-retrieval request and response contracts. +- Minimal deterministic lexical retrieval path over existing `task_artifact_chunks`. +- Unit and integration coverage for ordering, scoping, exclusion rules, and isolation. - Updated `BUILD_REPORT.md` with exact verification results and explicit deferred scope. ## Acceptance Criteria -- A client can ingest one supported registered local artifact into durable ordered chunk records. -- Ingestion reads only files rooted under the persisted task workspace boundary. 
-- Chunking behavior is deterministic and documented. -- Unsupported artifact types are rejected deterministically. -- Artifact chunk reads are deterministic and user-scoped. +- A client can retrieve relevant ingested chunk records for one visible task using one explicit text query. +- A client can retrieve relevant ingested chunk records for one visible artifact using one explicit text query. +- Retrieval uses only durable `task_artifact_chunks` rows already persisted in the repo. +- Retrieval excludes artifacts whose ingestion is not complete. +- Result ordering is deterministic and documented. - `./.venv/bin/python -m pytest tests/unit` passes. - `./.venv/bin/python -m pytest tests/integration` passes. -- No retrieval, embeddings, connector, runner, UI, or broader side-effect scope enters the sprint. +- No embeddings, semantic retrieval, compile integration, connector, runner, UI, or broader side-effect scope enters the sprint. ## Implementation Constraints -- Keep ingestion narrow and boring. -- Reuse existing `task_workspaces` and `task_artifacts` boundaries; do not scan directories implicitly. -- Support only a small explicit text-artifact set in this sprint. -- Keep chunking deterministic and simple enough to test precisely. -- Do not introduce retrieval or embedding behavior in the same sprint. +- Keep retrieval narrow and boring. +- Reuse existing task-artifact and chunk seams; do not read raw files during retrieval. +- Use deterministic lexical matching only in this sprint. +- Keep scope explicit: one task or one artifact per request. +- Do not merge artifact-chunk retrieval into the main context compiler in the same sprint. ## Suggested Work Breakdown -1. Add `task_artifact_chunks` schema and migration. -2. Define ingestion and chunk-read contracts. -3. Implement deterministic rooted file resolution from artifact metadata. -4. Implement narrow supported-format ingestion and deterministic chunk persistence. -5. Implement artifact chunk list reads. -6. Add unit and integration tests. -7. Update `BUILD_REPORT.md` with executed verification. +1. Define chunk-retrieval request and response contracts. +2. Implement deterministic lexical matching over existing chunk rows. +3. Add explicit task-scoped and artifact-scoped retrieval paths. +4. Enforce exclusion of non-ingested artifacts and current-user isolation. +5. Add unit and integration tests. +6. Update `BUILD_REPORT.md` with executed verification. 
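+
+One conventional way to satisfy the deterministic-ordering criterion above is a composite sort key that negates the match count so larger counts sort first. A minimal sketch under this packet's assumptions; the item shape mirrors the planned result fields but is illustrative:
+
+```python
+def retrieval_sort_key(item: dict) -> tuple[int, int, str, int, str]:
+    """Key for (term count desc, first match asc, path asc, sequence asc, id asc)."""
+    match = item["match"]
+    return (
+        -match["matched_query_term_count"],  # negate so higher counts come first
+        match["first_match_char_start"],
+        item["relative_path"],
+        item["sequence_no"],
+        item["id"],
+    )
+
+
+# sorted(items, key=retrieval_sort_key) is stable and fully reproducible.
+```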
## Build Report Requirements `BUILD_REPORT.md` must include: -- the exact chunk schema and ingestion contract changes introduced -- the supported file types and chunking rule used +- the exact retrieval contracts introduced +- the lexical matching rule and ordering rule used - exact commands run - unit and integration test results -- one example artifact-ingestion response -- one example artifact-chunk list response +- one example task-scoped retrieval response +- one example artifact-scoped retrieval response - what remains intentionally deferred to later milestones ## Review Focus `REVIEW_REPORT.md` should verify: -- the sprint stayed limited to local artifact ingestion and chunk persistence -- ingestion reuses explicit task-workspace and artifact records rather than filesystem scanning -- rooted-path safety, chunk determinism, ordering, and isolation are test-backed -- no hidden retrieval, embedding, connector, runner, UI, or broader side-effect scope entered the sprint +- the sprint stayed limited to artifact chunk retrieval over durable chunk rows +- retrieval is deterministic, lexical-only, and scope-limited to one task or one artifact +- ordering, exclusion rules, and isolation are test-backed +- no hidden embeddings, semantic retrieval, compile integration, connector, runner, UI, or broader side-effect scope entered the sprint ## Exit Condition -This sprint is complete when the repo can ingest supported registered local artifacts into deterministic durable chunk records, expose stable chunk reads, and verify the full path with Postgres-backed tests, while still deferring document retrieval, embeddings, and connector work. +This sprint is complete when the repo can retrieve relevant ingested artifact chunks through a deterministic lexical read path scoped to one task or one artifact, verify the full path with Postgres-backed tests, and still defer semantic retrieval, compile integration, and connector work. diff --git a/BUILD_REPORT.md b/BUILD_REPORT.md index 828d9a5..4a4a636 100644 --- a/BUILD_REPORT.md +++ b/BUILD_REPORT.md @@ -2,195 +2,223 @@ ## sprint objective -Implement Sprint 5D: Local Artifact Ingestion V0 by adding a narrow, deterministic ingestion path that reads one registered local text artifact from its persisted task workspace boundary, chunks normalized text into durable ordered records, and exposes stable chunk reads without adding retrieval, embeddings, connectors, runners, or UI scope. +Implement Sprint 5E: Artifact Chunk Retrieval V0 by adding a narrow, deterministic lexical retrieval path over durable `task_artifact_chunks`, scoped to one visible task or one visible artifact, without adding embeddings, semantic ranking, compile-path integration, connectors, runners, or UI work. ## completed work -- Added migration `apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py`. -- Expanded `task_artifacts.ingestion_status` from `pending` to `pending | ingested`. -- Added durable `task_artifact_chunks` storage with user scoping, RLS, and ordered per-artifact uniqueness. 
-- Added artifact-ingestion contracts in `apps/api/src/alicebot_api/contracts.py`: - - `TaskArtifactIngestInput` - - `TaskArtifactChunkRecord` - - `TaskArtifactChunkListSummary` - - `TaskArtifactChunkListResponse` - - `TaskArtifactIngestionResponse` - - `TASK_ARTIFACT_CHUNK_LIST_ORDER = ["sequence_no_asc", "id_asc"]` - - `TaskArtifactIngestionStatus = "pending" | "ingested"` -- Added artifact-ingestion service behavior in `apps/api/src/alicebot_api/artifacts.py`: - - rooted file resolution from persisted workspace `local_path` plus artifact `relative_path` - - explicit supported media types only: `text/plain`, `text/markdown` - - strict UTF-8 text decoding - - line-ending normalization to `\n` - - deterministic fixed-window chunking rule - - durable ordered chunk persistence - - deterministic `ingestion_status` transition to `ingested` +- Added retrieval contracts in `apps/api/src/alicebot_api/contracts.py`: + - `TaskScopedArtifactChunkRetrievalInput(task_id, query)` + - `ArtifactScopedArtifactChunkRetrievalInput(task_artifact_id, query)` + - `TaskArtifactChunkRetrievalMatch` + - `TaskArtifactChunkRetrievalItem` + - `TaskArtifactChunkRetrievalScope` + - `TaskArtifactChunkRetrievalSummary` + - `TaskArtifactChunkRetrievalResponse` + - `TASK_ARTIFACT_CHUNK_RETRIEVAL_ORDER = ["matched_query_term_count_desc", "first_match_char_start_asc", "relative_path_asc", "sequence_no_asc", "id_asc"]` +- Added retrieval behavior in `apps/api/src/alicebot_api/artifacts.py`: + - explicit query validation requiring at least one lexical word + - query normalization via casefolded unique `\w+` terms in first-occurrence order + - chunk matching against persisted chunk text only + - task-scoped retrieval across ingested artifacts for one visible task + - artifact-scoped retrieval for one visible artifact + - exclusion of artifacts whose `ingestion_status != "ingested"`, even if chunk rows exist + - deterministic response ordering with explicit per-item match metadata +- Added the minimal API routes in `apps/api/src/alicebot_api/main.py`: + - `POST /v0/tasks/{task_id}/artifact-chunks/retrieve` + - `POST /v0/task-artifacts/{task_artifact_id}/chunks/retrieve` - Added store support in `apps/api/src/alicebot_api/store.py`: - - `TaskArtifactChunkRow` - - advisory lock for per-artifact ingestion - - create/list chunk methods - - artifact ingestion-status update method -- Added API routes in `apps/api/src/alicebot_api/main.py`: - - `POST /v0/task-artifacts/{task_artifact_id}/ingest` - - `GET /v0/task-artifacts/{task_artifact_id}/chunks` + - `list_task_artifacts_for_task(task_id)` - Added unit and integration coverage for: - - supported text ingestion - - direct `text/markdown` ingestion - - deterministic chunk ordering and boundaries - - rooted-path enforcement during ingestion - - invalid UTF-8 rejection - - idempotent re-ingestion - - unsupported media-type rejection + - deterministic retrieval ordering + - task-scoped retrieval + - artifact-scoped retrieval + - empty-result behavior + - exclusion of non-ingested artifacts - per-user isolation - - stable ingestion and chunk-list response shapes -- Refreshed `ARCHITECTURE.md` and `.ai/handoff/CURRENT_STATE.md` so the documented shipped slice now matches Sprint 5D ingestion behavior and deferred scope. 
- -Exact chunk schema introduced: - -- `id uuid PRIMARY KEY` -- `user_id uuid NOT NULL` -- `task_artifact_id uuid NOT NULL` -- `sequence_no integer NOT NULL CHECK (sequence_no >= 1)` -- `char_start integer NOT NULL CHECK (char_start >= 0)` -- `char_end_exclusive integer NOT NULL CHECK (char_end_exclusive > char_start)` -- `text text NOT NULL CHECK (length(text) > 0)` -- `created_at timestamptz NOT NULL` -- `updated_at timestamptz NOT NULL` -- foreign key to `(task_artifacts.id, user_id)` with `ON DELETE CASCADE` -- unique index on `(user_id, task_artifact_id, sequence_no)` -- user-owned RLS policy -- runtime grants limited to `SELECT, INSERT` on `task_artifact_chunks` -- runtime `UPDATE` added on `task_artifacts` so ingestion can set `ingestion_status` - -Supported file types and chunking rule: - -- Supported media types: `text/plain`, `text/markdown` -- Text decoding: UTF-8 only -- Line-ending normalization: `\r\n` and `\r` become `\n` -- Chunking rule: `normalized_utf8_text_fixed_window_1000_chars_v1` -- Chunk boundary rule: split normalized text into contiguous, non-overlapping 1000-character windows with zero-based `char_start` and exclusive `char_end_exclusive` - -Exact ingestion contract changes introduced: - -- Request input: `TaskArtifactIngestInput(task_artifact_id)` -- Ingestion response: `{"artifact": TaskArtifactRecord, "summary": TaskArtifactChunkListSummary}` -- Chunk list response: `{"items": list[TaskArtifactChunkRecord], "summary": TaskArtifactChunkListSummary}` -- Artifact detail payload remains stable except `ingestion_status` can now be `ingested` - -Example artifact-ingestion response: + - stable response shape + +Exact retrieval contracts introduced: + +- Request inputs: + - `TaskScopedArtifactChunkRetrievalInput(task_id: UUID, query: str)` + - `ArtifactScopedArtifactChunkRetrievalInput(task_artifact_id: UUID, query: str)` +- Result item: + - `id` + - `task_id` + - `task_artifact_id` + - `relative_path` + - `media_type` + - `sequence_no` + - `char_start` + - `char_end_exclusive` + - `text` + - `match = {matched_query_terms, matched_query_term_count, first_match_char_start}` +- Summary metadata: + - `total_count` + - `searched_artifact_count` + - `query` + - `query_terms` + - `matching_rule` + - `order` + - `scope = {kind, task_id, task_artifact_id?}` + +Lexical matching rule used: + +- Rule id: `casefolded_unicode_word_overlap_unique_query_terms_v1` +- Query normalization: + - casefold the query + - extract `\w+` terms + - deduplicate in first-occurrence order + - reject queries that produce zero terms +- Chunk match rule: + - casefold the stored chunk text + - extract `\w+` chunk terms + - a chunk matches when at least one normalized query term is present in the chunk term set + - `matched_query_terms` are returned in normalized query order + - `matched_query_term_count` is the count of distinct matched query terms + - `first_match_char_start` is the earliest start offset in the chunk text of any matched term + +Ordering rule used: + +- `matched_query_term_count` descending +- `first_match_char_start` ascending +- `relative_path` ascending +- `sequence_no` ascending +- `id` ascending + +Example task-scoped retrieval response: ```json { - "artifact": { - "id": "11111111-1111-1111-1111-111111111111", - "task_id": "22222222-2222-2222-2222-222222222222", - "task_workspace_id": "33333333-3333-3333-3333-333333333333", - "status": "registered", - "ingestion_status": "ingested", - "relative_path": "docs/spec.txt", - "media_type_hint": "text/plain", - "created_at": 
"2026-03-14T10:00:00+00:00", - "updated_at": "2026-03-14T10:00:01+00:00" - }, + "items": [ + { + "id": "11111111-1111-1111-1111-111111111111", + "task_id": "22222222-2222-2222-2222-222222222222", + "task_artifact_id": "33333333-3333-3333-3333-333333333333", + "relative_path": "docs/a.txt", + "media_type": "text/plain", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 14, + "text": "beta alpha doc", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0 + } + } + ], "summary": { - "total_count": 2, - "total_characters": 1006, - "media_type": "text/plain", - "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", - "order": ["sequence_no_asc", "id_asc"] + "total_count": 1, + "searched_artifact_count": 1, + "query": "Alpha beta", + "query_terms": ["alpha", "beta"], + "matching_rule": "casefolded_unicode_word_overlap_unique_query_terms_v1", + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc" + ], + "scope": { + "kind": "task", + "task_id": "22222222-2222-2222-2222-222222222222" + } } } ``` -Example artifact-chunk list response: +Example artifact-scoped retrieval response: ```json { "items": [ { "id": "44444444-4444-4444-4444-444444444444", - "task_artifact_id": "11111111-1111-1111-1111-111111111111", + "task_id": "22222222-2222-2222-2222-222222222222", + "task_artifact_id": "55555555-5555-5555-5555-555555555555", + "relative_path": "notes/b.md", + "media_type": "text/markdown", "sequence_no": 1, "char_start": 0, - "char_end_exclusive": 4, - "text": "abc\n", - "created_at": "2026-03-14T10:00:01+00:00", - "updated_at": "2026-03-14T10:00:01+00:00" - }, - { - "id": "55555555-5555-5555-5555-555555555555", - "task_artifact_id": "11111111-1111-1111-1111-111111111111", - "sequence_no": 2, - "char_start": 4, - "char_end_exclusive": 7, - "text": "def", - "created_at": "2026-03-14T10:00:01+00:00", - "updated_at": "2026-03-14T10:00:01+00:00" + "char_end_exclusive": 15, + "text": "alpha beta note", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0 + } } ], "summary": { - "total_count": 2, - "total_characters": 7, - "media_type": "text/plain", - "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", - "order": ["sequence_no_asc", "id_asc"] + "total_count": 1, + "searched_artifact_count": 1, + "query": "Alpha beta", + "query_terms": ["alpha", "beta"], + "matching_rule": "casefolded_unicode_word_overlap_unique_query_terms_v1", + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc" + ], + "scope": { + "kind": "artifact", + "task_id": "22222222-2222-2222-2222-222222222222", + "task_artifact_id": "55555555-5555-5555-5555-555555555555" + } } } ``` ## incomplete work -- None within Sprint 5D scope. +- None within Sprint 5E scope. 
## files changed -- `apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py` - `apps/api/src/alicebot_api/artifacts.py` - `apps/api/src/alicebot_api/contracts.py` - `apps/api/src/alicebot_api/main.py` - `apps/api/src/alicebot_api/store.py` -- `ARCHITECTURE.md` -- `.ai/handoff/CURRENT_STATE.md` -- `tests/integration/test_migrations.py` - `tests/integration/test_task_artifacts_api.py` -- `tests/unit/test_20260314_0024_task_artifact_chunks.py` - `tests/unit/test_artifacts.py` - `tests/unit/test_artifacts_main.py` -- `tests/unit/test_main.py` - `tests/unit/test_task_artifact_store.py` - `BUILD_REPORT.md` ## tests run -- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py tests/unit/test_artifacts_main.py tests/unit/test_task_artifact_store.py tests/unit/test_20260314_0024_task_artifact_chunks.py tests/unit/test_main.py` - - result: `63 passed in 0.77s` -- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py` - - result: `16 passed in 0.11s` +- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py tests/unit/test_artifacts_main.py tests/unit/test_task_artifact_store.py` + - result: `37 passed in 0.44s` +- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py tests/unit/test_artifacts_main.py` + - result: `35 passed in 0.25s` - `./.venv/bin/python -m pytest tests/integration/test_task_artifacts_api.py` - - rerun with local access: `5 passed in 1.72s` + - sandboxed attempt failed to reach local Postgres on `localhost:5432` with `Operation not permitted` - `./.venv/bin/python -m pytest tests/unit` - - result: `347 passed in 0.56s` + - result: `358 passed in 0.56s` - `./.venv/bin/python -m pytest tests/integration` - - first sandboxed attempt failed to reach local Postgres and open a local socket - - rerun with local access: `104 passed in 30.87s` + - rerun with local access: `105 passed in 29.62s` - `git diff --check` - result: passed ## blockers/issues - No remaining implementation blockers. -- Local Postgres-backed integration tests required running outside the default sandbox; after rerun with local access, the full suite passed. +- Postgres-backed integration verification required unsandboxed localhost access. After rerun with local access, the full integration suite passed. ## recommended next step -Build the next milestone on top of these durable chunk records by adding retrieval over ingested chunks only, while still keeping embeddings, ranking, rich-document parsing, connectors, orchestration, and UI changes out of scope until separately sprinted. +Build the next milestone on top of this deterministic read contract by adding richer retrieval quality or compile-path usage in a separate sprint, while keeping those changes explicitly scoped and test-backed. ## intentionally deferred -- Retrieval or search over artifact chunks - Embeddings for artifact chunks -- Ranking or chunk selection -- PDF, DOCX, OCR, or rich document parsing -- Connector ingestion -- Runner/orchestration behavior -- UI changes +- Semantic retrieval or reranking +- Compile-path integration of artifact chunks +- PDF, DOCX, OCR, or richer document parsing +- Connector work +- Runner or orchestration work +- UI work diff --git a/REVIEW_REPORT.md b/REVIEW_REPORT.md index 16ae984..83dbbe0 100644 --- a/REVIEW_REPORT.md +++ b/REVIEW_REPORT.md @@ -6,35 +6,40 @@ PASS ## criteria met -- The sprint stayed within the intended slice: local artifact ingestion and chunk persistence only. I did not find retrieval, embeddings, connector, runner, or UI overreach in the changed code. 
-- The implementation reuses existing `task_workspaces` and `task_artifacts` records instead of scanning the filesystem. -- Ingestion resolves the artifact path from the persisted workspace root plus stored `relative_path`, and rejects rooted-path escapes deterministically. -- Supported text ingestion works for registered local artifacts and persists durable ordered `task_artifact_chunks` rows. -- Chunking is deterministic and documented in code and `BUILD_REPORT.md`: normalized line endings plus fixed 1000-character windows. -- Unsupported media types are rejected deterministically. -- Chunk reads are deterministic and user-scoped. -- The follow-up fixes added direct test coverage for `text/markdown` ingestion, invalid UTF-8 rejection, and idempotent re-ingestion. -- The stale architecture and handoff docs were updated to reflect Sprint 5D behavior and boundaries. -- Verification rerun during review: - - `./.venv/bin/python -m pytest tests/unit` -> `347 passed` - - `./.venv/bin/python -m pytest tests/integration` -> `104 passed` after rerunning with local access to Postgres and a local test socket - - `git diff --check` -> passed +- Retrieval is implemented only over durable `task_artifact_chunks` rows; the new logic in `apps/api/src/alicebot_api/artifacts.py` matches against persisted chunk text and does not read raw files during retrieval. +- Both required scopes are present and tested: + - task-scoped retrieval via `POST /v0/tasks/{task_id}/artifact-chunks/retrieve` + - artifact-scoped retrieval via `POST /v0/task-artifacts/{task_artifact_id}/chunks/retrieve` +- Matching is deterministic and lexical-only: + - query normalization uses casefolded `\w+` extraction with first-occurrence deduplication + - ordering is explicit and stable: matched term count desc, first match start asc, relative path asc, sequence no asc, id asc +- Non-ingested artifacts are excluded even if chunk rows exist. +- Per-user isolation is enforced through the existing user-scoped connection/RLS path and is covered by integration tests. +- Response shape is explicit and stable through the new retrieval contracts in `apps/api/src/alicebot_api/contracts.py`. +- Sprint scope stayed narrow: no embeddings, semantic retrieval, compile-path integration, connectors, runner logic, or UI work entered the implementation. +- `BUILD_REPORT.md` was updated and includes the required contracts, matching/order rules, commands, examples, and deferred scope. +- Acceptance test gates passed in this review: + - `./.venv/bin/python -m pytest tests/unit` -> `358 passed in 0.53s` + - `./.venv/bin/python -m pytest tests/integration` -> `105 passed in 29.84s` ## criteria missed -- No acceptance criteria from `SPRINT_PACKET.md` were missed. +- None. ## quality issues -- None found in the reviewed sprint scope. +- No blocking implementation or test-quality issues found in the sprint code. +- Non-blocking process note: `.ai/active/SPRINT_PACKET.md` is part of the working diff. If that edit came from the Builder, sprint inputs should ideally remain reviewer-controlled so implementation is not changing its own source-of-truth spec. ## regression risks -- No material regression risk beyond the normal risk profile for this slice. +- Low. The change is additive, scoped to artifact retrieval, and covered by unit plus Postgres-backed integration tests. +- Residual risk: retrieval behavior is intentionally simple lexical overlap, so future callers may over-assume ranking quality. 
That is consistent with the sprint packet and documented as deferred scope, not a defect in this sprint. ## docs issues -- None. `ARCHITECTURE.md`, `.ai/handoff/CURRENT_STATE.md`, and `BUILD_REPORT.md` now match the landed implementation and verification state. +- No required docs are missing for this sprint. +- No correction needed in `BUILD_REPORT.md` based on this review. ## should anything be added to RULES.md? @@ -42,8 +47,9 @@ PASS ## should anything update ARCHITECTURE.md? -- No further update required from this review pass. +- No immediate update required for sprint acceptance. The architecture impact is narrow and already understandable from the code plus `BUILD_REPORT.md`. ## recommended next action -- Accept the sprint and proceed with the normal merge path once Control Tower approves. +- Mark Sprint 5E as accepted and move to the next milestone in a separate sprint. +- If desired, tighten process hygiene by keeping `SPRINT_PACKET.md` outside Builder-owned changes unless Control Tower explicitly includes packet editing in scope. diff --git a/apps/api/src/alicebot_api/artifacts.py b/apps/api/src/alicebot_api/artifacts.py index 611dbe1..d3b794f 100644 --- a/apps/api/src/alicebot_api/artifacts.py +++ b/apps/api/src/alicebot_api/artifacts.py @@ -1,5 +1,6 @@ from __future__ import annotations +import re from pathlib import Path from typing import cast from uuid import UUID @@ -9,6 +10,14 @@ from alicebot_api.contracts import ( TASK_ARTIFACT_LIST_ORDER, TASK_ARTIFACT_CHUNK_LIST_ORDER, + TASK_ARTIFACT_CHUNK_RETRIEVAL_ORDER, + ArtifactScopedArtifactChunkRetrievalInput, + TaskArtifactChunkRetrievalItem, + TaskArtifactChunkRetrievalMatch, + TaskArtifactChunkRetrievalResponse, + TaskArtifactChunkRetrievalScope, + TaskArtifactChunkRetrievalScopeKind, + TaskArtifactChunkRetrievalSummary, TaskArtifactCreateResponse, TaskArtifactDetailResponse, TaskArtifactChunkListResponse, @@ -21,8 +30,10 @@ TaskArtifactRegisterInput, TaskArtifactStatus, TaskArtifactIngestionStatus, + TaskScopedArtifactChunkRetrievalInput, ) from alicebot_api.store import ContinuityStore, TaskArtifactChunkRow, TaskArtifactRow +from alicebot_api.tasks import TaskNotFoundError from alicebot_api.workspaces import TaskWorkspaceNotFoundError SUPPORTED_TEXT_ARTIFACT_MEDIA_TYPES = ("text/plain", "text/markdown") @@ -34,6 +45,10 @@ } TASK_ARTIFACT_CHUNK_MAX_CHARS = 1000 TASK_ARTIFACT_CHUNKING_RULE = "normalized_utf8_text_fixed_window_1000_chars_v1" +TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE = ( + "casefolded_unicode_word_overlap_unique_query_terms_v1" +) +_LEXICAL_TERM_PATTERN = re.compile(r"\w+") class TaskArtifactNotFoundError(LookupError): @@ -48,6 +63,10 @@ class TaskArtifactValidationError(ValueError): """Raised when a local artifact path cannot satisfy registration constraints.""" +class TaskArtifactChunkRetrievalValidationError(ValueError): + """Raised when an artifact chunk retrieval request cannot be evaluated safely.""" + + def resolve_artifact_path(local_path: str) -> Path: return Path(local_path).expanduser().resolve() @@ -172,6 +191,150 @@ def build_task_artifact_chunk_list_summary( } +def extract_unique_lexical_terms(text: str) -> list[str]: + terms: list[str] = [] + seen: set[str] = set() + for match in _LEXICAL_TERM_PATTERN.finditer(text.casefold()): + term = match.group(0) + if term in seen: + continue + seen.add(term) + terms.append(term) + return terms + + +def resolve_artifact_chunk_retrieval_query_terms(query: str) -> list[str]: + terms = extract_unique_lexical_terms(query) + if not terms: + raise 
TaskArtifactChunkRetrievalValidationError( + "artifact chunk retrieval query must include at least one word" + ) + return terms + + +def build_task_artifact_chunk_retrieval_scope( + *, + kind: str, + task_id: UUID, + task_artifact_id: UUID | None = None, +) -> TaskArtifactChunkRetrievalScope: + scope: TaskArtifactChunkRetrievalScope = { + "kind": cast(TaskArtifactChunkRetrievalScopeKind, kind), + "task_id": str(task_id), + } + if task_artifact_id is not None: + scope["task_artifact_id"] = str(task_artifact_id) + return scope + + +def build_task_artifact_chunk_retrieval_summary( + *, + total_count: int, + searched_artifact_count: int, + query: str, + query_terms: list[str], + scope: TaskArtifactChunkRetrievalScope, +) -> TaskArtifactChunkRetrievalSummary: + return { + "total_count": total_count, + "searched_artifact_count": searched_artifact_count, + "query": query, + "query_terms": list(query_terms), + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": list(TASK_ARTIFACT_CHUNK_RETRIEVAL_ORDER), + "scope": scope, + } + + +def match_artifact_chunk_text( + *, + query_terms: list[str], + chunk_text: str, +) -> TaskArtifactChunkRetrievalMatch | None: + first_positions: dict[str, int] = {} + for match in _LEXICAL_TERM_PATTERN.finditer(chunk_text.casefold()): + term = match.group(0) + if term not in first_positions: + first_positions[term] = match.start() + + matched_terms = [term for term in query_terms if term in first_positions] + if not matched_terms: + return None + + return { + "matched_query_terms": matched_terms, + "matched_query_term_count": len(matched_terms), + "first_match_char_start": min(first_positions[term] for term in matched_terms), + } + + +def serialize_task_artifact_chunk_retrieval_item( + *, + artifact_row: TaskArtifactRow, + chunk_row: TaskArtifactChunkRow, + match: TaskArtifactChunkRetrievalMatch, +) -> TaskArtifactChunkRetrievalItem: + return { + "id": str(chunk_row["id"]), + "task_id": str(artifact_row["task_id"]), + "task_artifact_id": str(chunk_row["task_artifact_id"]), + "relative_path": artifact_row["relative_path"], + "media_type": infer_task_artifact_media_type(artifact_row) or "unknown", + "sequence_no": chunk_row["sequence_no"], + "char_start": chunk_row["char_start"], + "char_end_exclusive": chunk_row["char_end_exclusive"], + "text": chunk_row["text"], + "match": match, + } + + +def retrieve_matching_task_artifact_chunks( + store: ContinuityStore, + *, + artifact_rows: list[TaskArtifactRow], + query_terms: list[str], +) -> tuple[list[TaskArtifactChunkRetrievalItem], int]: + matched_items_with_keys: list[ + tuple[tuple[int, int, str, int, str], TaskArtifactChunkRetrievalItem] + ] = [] + searched_artifact_count = 0 + + for artifact_row in artifact_rows: + if artifact_row["ingestion_status"] != "ingested": + continue + + searched_artifact_count += 1 + chunk_rows = store.list_task_artifact_chunks(artifact_row["id"]) + for chunk_row in chunk_rows: + match = match_artifact_chunk_text( + query_terms=query_terms, + chunk_text=chunk_row["text"], + ) + if match is None: + continue + + item = serialize_task_artifact_chunk_retrieval_item( + artifact_row=artifact_row, + chunk_row=chunk_row, + match=match, + ) + matched_items_with_keys.append( + ( + ( + -match["matched_query_term_count"], + match["first_match_char_start"], + artifact_row["relative_path"], + chunk_row["sequence_no"], + str(chunk_row["id"]), + ), + item, + ) + ) + + matched_items_with_keys.sort(key=lambda entry: entry[0]) + return [item for _, item in matched_items_with_keys], 
searched_artifact_count + + def register_task_artifact_record( store: ContinuityStore, *, @@ -349,3 +512,73 @@ def list_task_artifact_chunk_records( "items": [serialize_task_artifact_chunk_row(chunk_row) for chunk_row in chunk_rows], "summary": build_task_artifact_chunk_list_summary(chunk_rows, media_type=media_type), } + + +def retrieve_task_scoped_artifact_chunk_records( + store: ContinuityStore, + *, + user_id: UUID, + request: TaskScopedArtifactChunkRetrievalInput, +) -> TaskArtifactChunkRetrievalResponse: + del user_id + + task = store.get_task_optional(request.task_id) + if task is None: + raise TaskNotFoundError(f"task {request.task_id} was not found") + + query_terms = resolve_artifact_chunk_retrieval_query_terms(request.query) + artifact_rows = store.list_task_artifacts_for_task(request.task_id) + items, searched_artifact_count = retrieve_matching_task_artifact_chunks( + store, + artifact_rows=artifact_rows, + query_terms=query_terms, + ) + scope = build_task_artifact_chunk_retrieval_scope( + kind="task", + task_id=request.task_id, + ) + return { + "items": items, + "summary": build_task_artifact_chunk_retrieval_summary( + total_count=len(items), + searched_artifact_count=searched_artifact_count, + query=request.query, + query_terms=query_terms, + scope=scope, + ), + } + + +def retrieve_artifact_scoped_artifact_chunk_records( + store: ContinuityStore, + *, + user_id: UUID, + request: ArtifactScopedArtifactChunkRetrievalInput, +) -> TaskArtifactChunkRetrievalResponse: + del user_id + + artifact_row = store.get_task_artifact_optional(request.task_artifact_id) + if artifact_row is None: + raise TaskArtifactNotFoundError(f"task artifact {request.task_artifact_id} was not found") + + query_terms = resolve_artifact_chunk_retrieval_query_terms(request.query) + items, searched_artifact_count = retrieve_matching_task_artifact_chunks( + store, + artifact_rows=[artifact_row], + query_terms=query_terms, + ) + scope = build_task_artifact_chunk_retrieval_scope( + kind="artifact", + task_id=artifact_row["task_id"], + task_artifact_id=artifact_row["id"], + ) + return { + "items": items, + "summary": build_task_artifact_chunk_retrieval_summary( + total_count=len(items), + searched_artifact_count=searched_artifact_count, + query=request.query, + query_terms=query_terms, + scope=scope, + ), + } diff --git a/apps/api/src/alicebot_api/contracts.py b/apps/api/src/alicebot_api/contracts.py index aa68f2e..c86549c 100644 --- a/apps/api/src/alicebot_api/contracts.py +++ b/apps/api/src/alicebot_api/contracts.py @@ -22,6 +22,7 @@ TaskWorkspaceStatus = Literal["active"] TaskArtifactStatus = Literal["registered"] TaskArtifactIngestionStatus = Literal["pending", "ingested"] +TaskArtifactChunkRetrievalScopeKind = Literal["task", "artifact"] TaskLifecycleSource = Literal[ "approval_request", "approval_resolution", @@ -133,6 +134,13 @@ TASK_WORKSPACE_LIST_ORDER = ["created_at_asc", "id_asc"] TASK_ARTIFACT_LIST_ORDER = ["created_at_asc", "id_asc"] TASK_ARTIFACT_CHUNK_LIST_ORDER = ["sequence_no_asc", "id_asc"] +TASK_ARTIFACT_CHUNK_RETRIEVAL_ORDER = [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", +] TASK_STEP_LIST_ORDER = ["sequence_no_asc", "created_at_asc", "id_asc"] TOOL_EXECUTION_LIST_ORDER = ["executed_at_asc", "id_asc"] EXECUTION_BUDGET_LIST_ORDER = ["created_at_asc", "id_asc"] @@ -1612,6 +1620,18 @@ class TaskArtifactIngestInput: task_artifact_id: UUID +@dataclass(frozen=True, slots=True) +class TaskScopedArtifactChunkRetrievalInput: 
+ task_id: UUID + query: str + + +@dataclass(frozen=True, slots=True) +class ArtifactScopedArtifactChunkRetrievalInput: + task_artifact_id: UUID + query: str + + class TaskArtifactRecord(TypedDict): id: str task_id: str @@ -1671,6 +1691,46 @@ class TaskArtifactIngestionResponse(TypedDict): summary: TaskArtifactChunkListSummary +class TaskArtifactChunkRetrievalMatch(TypedDict): + matched_query_terms: list[str] + matched_query_term_count: int + first_match_char_start: int + + +class TaskArtifactChunkRetrievalItem(TypedDict): + id: str + task_id: str + task_artifact_id: str + relative_path: str + media_type: str + sequence_no: int + char_start: int + char_end_exclusive: int + text: str + match: TaskArtifactChunkRetrievalMatch + + +class TaskArtifactChunkRetrievalScope(TypedDict): + kind: TaskArtifactChunkRetrievalScopeKind + task_id: str + task_artifact_id: NotRequired[str] + + +class TaskArtifactChunkRetrievalSummary(TypedDict): + total_count: int + searched_artifact_count: int + query: str + query_terms: list[str] + matching_rule: str + order: list[str] + scope: TaskArtifactChunkRetrievalScope + + +class TaskArtifactChunkRetrievalResponse(TypedDict): + items: list[TaskArtifactChunkRetrievalItem] + summary: TaskArtifactChunkRetrievalSummary + + class TaskStepTraceLink(TypedDict): trace_id: str trace_kind: str diff --git a/apps/api/src/alicebot_api/main.py b/apps/api/src/alicebot_api/main.py index ebdf2d3..982becb 100644 --- a/apps/api/src/alicebot_api/main.py +++ b/apps/api/src/alicebot_api/main.py @@ -47,11 +47,13 @@ SemanticMemoryRetrievalRequestInput, TOOL_METADATA_VERSION_V0, ApprovalStatus, + ArtifactScopedArtifactChunkRetrievalInput, ProxyExecutionStatus, ToolAllowlistEvaluationRequestInput, ProxyExecutionRequestInput, TaskArtifactIngestInput, TaskArtifactRegisterInput, + TaskScopedArtifactChunkRetrievalInput, TaskStepKind, TaskStepLineageInput, TaskStepNextCreateInput, @@ -64,6 +66,7 @@ ) from alicebot_api.artifacts import ( TaskArtifactAlreadyExistsError, + TaskArtifactChunkRetrievalValidationError, TaskArtifactNotFoundError, TaskArtifactValidationError, get_task_artifact_record, @@ -71,6 +74,8 @@ list_task_artifact_chunk_records, list_task_artifact_records, register_task_artifact_record, + retrieve_artifact_scoped_artifact_chunk_records, + retrieve_task_scoped_artifact_chunk_records, ) from alicebot_api.approvals import ( ApprovalNotFoundError, @@ -420,6 +425,11 @@ class IngestTaskArtifactRequest(BaseModel): user_id: UUID +class RetrieveArtifactChunksRequest(BaseModel): + user_id: UUID + query: str = Field(min_length=1, max_length=1000) + + class TaskStepRequestSnapshot(BaseModel): thread_id: UUID tool_id: UUID @@ -1308,6 +1318,62 @@ def list_task_artifact_chunks(task_artifact_id: UUID, user_id: UUID) -> JSONResp ) +@app.post("/v0/tasks/{task_id}/artifact-chunks/retrieve") +def retrieve_task_artifact_chunks( + task_id: UUID, + request: RetrieveArtifactChunksRequest, +) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, request.user_id) as conn: + payload = retrieve_task_scoped_artifact_chunk_records( + ContinuityStore(conn), + user_id=request.user_id, + request=TaskScopedArtifactChunkRetrievalInput( + task_id=task_id, + query=request.query, + ), + ) + except TaskNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + except TaskArtifactChunkRetrievalValidationError as exc: + return JSONResponse(status_code=400, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + 
content=jsonable_encoder(payload), + ) + + +@app.post("/v0/task-artifacts/{task_artifact_id}/chunks/retrieve") +def retrieve_task_artifact_chunks_for_artifact( + task_artifact_id: UUID, + request: RetrieveArtifactChunksRequest, +) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, request.user_id) as conn: + payload = retrieve_artifact_scoped_artifact_chunk_records( + ContinuityStore(conn), + user_id=request.user_id, + request=ArtifactScopedArtifactChunkRetrievalInput( + task_artifact_id=task_artifact_id, + query=request.query, + ), + ) + except TaskArtifactNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + except TaskArtifactChunkRetrievalValidationError as exc: + return JSONResponse(status_code=400, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + content=jsonable_encoder(payload), + ) + + @app.post("/v0/tasks/{task_id}/steps") def create_next_task_step(task_id: UUID, request: CreateNextTaskStepRequest) -> JSONResponse: settings = get_settings() diff --git a/apps/api/src/alicebot_api/store.py b/apps/api/src/alicebot_api/store.py index 13c7cc3..d18ced9 100644 --- a/apps/api/src/alicebot_api/store.py +++ b/apps/api/src/alicebot_api/store.py @@ -1529,6 +1529,23 @@ class LabelCountRow(TypedDict): ORDER BY created_at ASC, id ASC """ +LIST_TASK_ARTIFACTS_FOR_TASK_SQL = """ + SELECT + id, + user_id, + task_id, + task_workspace_id, + status, + ingestion_status, + relative_path, + media_type_hint, + created_at, + updated_at + FROM task_artifacts + WHERE task_id = %s + ORDER BY created_at ASC, id ASC + """ + LOCK_TASK_ARTIFACT_INGESTION_SQL = "SELECT pg_advisory_xact_lock(hashtextextended(%s::text, 5))" INSERT_TASK_ARTIFACT_CHUNK_SQL = """ @@ -2731,6 +2748,9 @@ def get_task_artifact_by_workspace_relative_path_optional( def list_task_artifacts(self) -> list[TaskArtifactRow]: return self._fetch_all(LIST_TASK_ARTIFACTS_SQL) + def list_task_artifacts_for_task(self, task_id: UUID) -> list[TaskArtifactRow]: + return self._fetch_all(LIST_TASK_ARTIFACTS_FOR_TASK_SQL, (task_id,)) + def lock_task_artifact_ingestion(self, task_artifact_id: UUID) -> None: with self.conn.cursor() as cur: cur.execute(LOCK_TASK_ARTIFACT_INGESTION_SQL, (str(task_artifact_id),)) diff --git a/tests/integration/test_task_artifacts_api.py b/tests/integration/test_task_artifacts_api.py index cf9626b..1aab4e1 100644 --- a/tests/integration/test_task_artifacts_api.py +++ b/tests/integration/test_task_artifacts_api.py @@ -11,6 +11,7 @@ import apps.api.src.alicebot_api.main as main_module from apps.api.src.alicebot_api.config import Settings +from alicebot_api.artifacts import TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE from alicebot_api.db import user_connection from alicebot_api.store import ContinuityStore @@ -663,3 +664,310 @@ def test_task_artifact_ingestion_enforces_rooted_workspace_paths( assert ingest_payload == { "detail": f"artifact path {outside_file.resolve()} escapes workspace root {workspace_path.resolve()}" } + + +def test_task_artifact_chunk_retrieval_endpoints_are_scoped_deterministic_and_isolated( + migrated_database_urls, + monkeypatch, + tmp_path, +) -> None: + owner = seed_task(migrated_database_urls["app"], email="owner@example.com") + intruder = seed_task(migrated_database_urls["app"], email="intruder@example.com") + workspace_root = tmp_path / "task-workspaces" + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings( + database_url=migrated_database_urls["app"], + 
task_workspace_root=str(workspace_root), + ), + ) + + owner_workspace_status, owner_workspace_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/workspace", + payload={"user_id": str(owner["user_id"])}, + ) + assert owner_workspace_status == 201 + owner_workspace_path = Path(owner_workspace_payload["workspace"]["local_path"]) + + docs_file = owner_workspace_path / "docs" / "a.txt" + docs_file.parent.mkdir(parents=True) + docs_file.write_text("beta alpha doc") + notes_file = owner_workspace_path / "notes" / "b.md" + notes_file.parent.mkdir(parents=True) + notes_file.write_text("alpha beta note") + weak_file = owner_workspace_path / "notes" / "c.txt" + weak_file.write_text("beta only") + pending_file = owner_workspace_path / "notes" / "hidden.txt" + pending_file.write_text("alpha beta hidden") + + docs_register_status, docs_register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{owner_workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(docs_file), + "media_type_hint": "text/plain", + }, + ) + notes_register_status, notes_register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{owner_workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(notes_file), + "media_type_hint": "text/markdown", + }, + ) + weak_register_status, weak_register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{owner_workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(weak_file), + "media_type_hint": "text/plain", + }, + ) + pending_register_status, pending_register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{owner_workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(pending_file), + "media_type_hint": "text/plain", + }, + ) + assert docs_register_status == 201 + assert notes_register_status == 201 + assert weak_register_status == 201 + assert pending_register_status == 201 + + docs_ingest_status, _ = invoke_request( + "POST", + f"/v0/task-artifacts/{docs_register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + notes_ingest_status, _ = invoke_request( + "POST", + f"/v0/task-artifacts/{notes_register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + weak_ingest_status, _ = invoke_request( + "POST", + f"/v0/task-artifacts/{weak_register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + assert docs_ingest_status == 200 + assert notes_ingest_status == 200 + assert weak_ingest_status == 200 + + with user_connection(migrated_database_urls["app"], owner["user_id"]) as conn: + store = ContinuityStore(conn) + store.create_task_artifact_chunk( + task_artifact_id=UUID(pending_register_payload["artifact"]["id"]), + sequence_no=1, + char_start=0, + char_end_exclusive=17, + text="alpha beta hidden", + ) + + intruder_workspace_status, intruder_workspace_payload = invoke_request( + "POST", + f"/v0/tasks/{intruder['task_id']}/workspace", + payload={"user_id": str(intruder["user_id"])}, + ) + assert intruder_workspace_status == 201 + intruder_workspace_path = Path(intruder_workspace_payload["workspace"]["local_path"]) + intruder_file = intruder_workspace_path / "docs" / "secret.txt" + intruder_file.parent.mkdir(parents=True) + intruder_file.write_text("alpha beta intruder") + + 
intruder_register_status, intruder_register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{intruder_workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(intruder["user_id"]), + "local_path": str(intruder_file), + "media_type_hint": "text/plain", + }, + ) + assert intruder_register_status == 201 + intruder_ingest_status, _ = invoke_request( + "POST", + f"/v0/task-artifacts/{intruder_register_payload['artifact']['id']}/ingest", + payload={"user_id": str(intruder["user_id"])}, + ) + assert intruder_ingest_status == 200 + + task_retrieve_status, task_retrieve_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/artifact-chunks/retrieve", + payload={"user_id": str(owner["user_id"]), "query": "Alpha beta"}, + ) + artifact_retrieve_status, artifact_retrieve_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{notes_register_payload['artifact']['id']}/chunks/retrieve", + payload={"user_id": str(owner["user_id"]), "query": "Alpha beta"}, + ) + empty_retrieve_status, empty_retrieve_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/artifact-chunks/retrieve", + payload={"user_id": str(owner["user_id"]), "query": "missing"}, + ) + isolated_task_retrieve_status, isolated_task_retrieve_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/artifact-chunks/retrieve", + payload={"user_id": str(intruder["user_id"]), "query": "Alpha beta"}, + ) + isolated_artifact_retrieve_status, isolated_artifact_retrieve_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{notes_register_payload['artifact']['id']}/chunks/retrieve", + payload={"user_id": str(intruder["user_id"]), "query": "Alpha beta"}, + ) + + assert task_retrieve_status == 200 + assert task_retrieve_payload == { + "items": [ + { + "id": task_retrieve_payload["items"][0]["id"], + "task_id": str(owner["task_id"]), + "task_artifact_id": docs_register_payload["artifact"]["id"], + "relative_path": "docs/a.txt", + "media_type": "text/plain", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 14, + "text": "beta alpha doc", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0, + }, + }, + { + "id": task_retrieve_payload["items"][1]["id"], + "task_id": str(owner["task_id"]), + "task_artifact_id": notes_register_payload["artifact"]["id"], + "relative_path": "notes/b.md", + "media_type": "text/markdown", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 15, + "text": "alpha beta note", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0, + }, + }, + { + "id": task_retrieve_payload["items"][2]["id"], + "task_id": str(owner["task_id"]), + "task_artifact_id": weak_register_payload["artifact"]["id"], + "relative_path": "notes/c.txt", + "media_type": "text/plain", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 9, + "text": "beta only", + "match": { + "matched_query_terms": ["beta"], + "matched_query_term_count": 1, + "first_match_char_start": 0, + }, + }, + ], + "summary": { + "total_count": 3, + "searched_artifact_count": 3, + "query": "Alpha beta", + "query_terms": ["alpha", "beta"], + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": { + "kind": "task", + "task_id": str(owner["task_id"]), + }, + }, + } + + assert 
artifact_retrieve_status == 200 + assert artifact_retrieve_payload == { + "items": [ + { + "id": artifact_retrieve_payload["items"][0]["id"], + "task_id": str(owner["task_id"]), + "task_artifact_id": notes_register_payload["artifact"]["id"], + "relative_path": "notes/b.md", + "media_type": "text/markdown", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 15, + "text": "alpha beta note", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0, + }, + } + ], + "summary": { + "total_count": 1, + "searched_artifact_count": 1, + "query": "Alpha beta", + "query_terms": ["alpha", "beta"], + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": { + "kind": "artifact", + "task_id": str(owner["task_id"]), + "task_artifact_id": notes_register_payload["artifact"]["id"], + }, + }, + } + + assert empty_retrieve_status == 200 + assert empty_retrieve_payload == { + "items": [], + "summary": { + "total_count": 0, + "searched_artifact_count": 3, + "query": "missing", + "query_terms": ["missing"], + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": { + "kind": "task", + "task_id": str(owner["task_id"]), + }, + }, + } + + assert isolated_task_retrieve_status == 404 + assert isolated_task_retrieve_payload == { + "detail": f"task {owner['task_id']} was not found" + } + + assert isolated_artifact_retrieve_status == 404 + assert isolated_artifact_retrieve_payload == { + "detail": f"task artifact {notes_register_payload['artifact']['id']} was not found" + } diff --git a/tests/unit/test_artifacts.py b/tests/unit/test_artifacts.py index e6ed44c..07dc3de 100644 --- a/tests/unit/test_artifacts.py +++ b/tests/unit/test_artifacts.py @@ -8,33 +8,66 @@ from alicebot_api.artifacts import ( TASK_ARTIFACT_CHUNKING_RULE, + TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, TaskArtifactAlreadyExistsError, + TaskArtifactChunkRetrievalValidationError, TaskArtifactNotFoundError, TaskArtifactValidationError, build_workspace_relative_artifact_path, chunk_normalized_artifact_text, ensure_artifact_path_is_rooted, + extract_unique_lexical_terms, get_task_artifact_record, ingest_task_artifact_record, list_task_artifact_chunk_records, list_task_artifact_records, + match_artifact_chunk_text, normalize_artifact_text, register_task_artifact_record, + retrieve_artifact_scoped_artifact_chunk_records, + retrieve_task_scoped_artifact_chunk_records, serialize_task_artifact_row, ) -from alicebot_api.contracts import TaskArtifactIngestInput, TaskArtifactRegisterInput +from alicebot_api.contracts import ( + ArtifactScopedArtifactChunkRetrievalInput, + TaskArtifactIngestInput, + TaskArtifactRegisterInput, + TaskScopedArtifactChunkRetrievalInput, +) +from alicebot_api.tasks import TaskNotFoundError from alicebot_api.workspaces import TaskWorkspaceNotFoundError class ArtifactStoreStub: def __init__(self) -> None: self.base_time = datetime(2026, 3, 13, 10, 0, tzinfo=UTC) + self.tasks: list[dict[str, object]] = [] self.workspaces: list[dict[str, object]] = [] self.artifacts: list[dict[str, object]] = [] self.artifact_chunks: list[dict[str, object]] = [] self.locked_workspace_ids: list[UUID] = [] self.locked_artifact_ids: list[UUID] = [] + def create_task(self, *, 
task_id: UUID, user_id: UUID) -> dict[str, object]: + task = { + "id": task_id, + "user_id": user_id, + "thread_id": uuid4(), + "tool_id": uuid4(), + "status": "approved", + "request": {}, + "tool": {}, + "latest_approval_id": None, + "latest_execution_id": None, + "created_at": self.base_time, + "updated_at": self.base_time, + } + self.tasks.append(task) + return task + + def get_task_optional(self, task_id: UUID) -> dict[str, object] | None: + return next((task for task in self.tasks if task["id"] == task_id), None) + def create_task_workspace(self, *, task_workspace_id: UUID, task_id: UUID, user_id: UUID, local_path: str) -> dict[str, object]: workspace = { "id": task_workspace_id, @@ -98,6 +131,12 @@ def create_task_artifact( def list_task_artifacts(self) -> list[dict[str, object]]: return sorted(self.artifacts, key=lambda artifact: (artifact["created_at"], artifact["id"])) + def list_task_artifacts_for_task(self, task_id: UUID) -> list[dict[str, object]]: + return sorted( + (artifact for artifact in self.artifacts if artifact["task_id"] == task_id), + key=lambda artifact: (artifact["created_at"], artifact["id"]), + ) + def get_task_artifact_optional(self, task_artifact_id: UUID) -> dict[str, object] | None: return next((artifact for artifact in self.artifacts if artifact["id"] == task_artifact_id), None) @@ -670,6 +709,343 @@ def test_list_task_artifact_chunk_records_are_deterministic() -> None: } +def test_extract_unique_lexical_terms_preserves_first_occurrence_order() -> None: + assert extract_unique_lexical_terms("Alpha beta, alpha\nbeta gamma") == [ + "alpha", + "beta", + "gamma", + ] + + +def test_match_artifact_chunk_text_returns_explicit_metadata() -> None: + assert match_artifact_chunk_text( + query_terms=["alpha", "beta", "delta"], + chunk_text="beta alpha release", + ) == { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0, + } + + +def test_task_scoped_chunk_retrieval_orders_matches_deterministically_and_skips_pending() -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + store.create_task(task_id=task_id, user_id=user_id) + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path="/tmp/alicebot/task-workspaces/user/task", + ) + docs_artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="docs/a.txt", + media_type_hint="text/plain", + ) + notes_artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="notes/b.md", + media_type_hint="text/markdown", + ) + pending_artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="notes/hidden.txt", + media_type_hint="text/plain", + ) + weak_match_artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="notes/c.txt", + media_type_hint="text/plain", + ) + store.create_task_artifact_chunk( + task_artifact_id=docs_artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=14, + text="beta alpha doc", + ) + store.create_task_artifact_chunk( + task_artifact_id=notes_artifact["id"], + sequence_no=1, + char_start=0, + 
char_end_exclusive=15, + text="alpha beta note", + ) + store.create_task_artifact_chunk( + task_artifact_id=pending_artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=17, + text="alpha beta hidden", + ) + store.create_task_artifact_chunk( + task_artifact_id=weak_match_artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=9, + text="beta only", + ) + + assert retrieve_task_scoped_artifact_chunk_records( + store, + user_id=user_id, + request=TaskScopedArtifactChunkRetrievalInput( + task_id=task_id, + query="Alpha beta", + ), + ) == { + "items": [ + { + "id": str(store.artifact_chunks[0]["id"]), + "task_id": str(task_id), + "task_artifact_id": str(docs_artifact["id"]), + "relative_path": "docs/a.txt", + "media_type": "text/plain", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 14, + "text": "beta alpha doc", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0, + }, + }, + { + "id": str(store.artifact_chunks[1]["id"]), + "task_id": str(task_id), + "task_artifact_id": str(notes_artifact["id"]), + "relative_path": "notes/b.md", + "media_type": "text/markdown", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 15, + "text": "alpha beta note", + "match": { + "matched_query_terms": ["alpha", "beta"], + "matched_query_term_count": 2, + "first_match_char_start": 0, + }, + }, + { + "id": str(store.artifact_chunks[3]["id"]), + "task_id": str(task_id), + "task_artifact_id": str(weak_match_artifact["id"]), + "relative_path": "notes/c.txt", + "media_type": "text/plain", + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 9, + "text": "beta only", + "match": { + "matched_query_terms": ["beta"], + "matched_query_term_count": 1, + "first_match_char_start": 0, + }, + }, + ], + "summary": { + "total_count": 3, + "searched_artifact_count": 3, + "query": "Alpha beta", + "query_terms": ["alpha", "beta"], + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": { + "kind": "task", + "task_id": str(task_id), + }, + }, + } + + +def test_artifact_scoped_chunk_retrieval_returns_empty_for_non_ingested_artifact() -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + store.create_task(task_id=task_id, user_id=user_id) + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path="/tmp/alicebot/task-workspaces/user/task", + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=10, + text="alpha beta", + ) + + assert retrieve_artifact_scoped_artifact_chunk_records( + store, + user_id=user_id, + request=ArtifactScopedArtifactChunkRetrievalInput( + task_artifact_id=artifact["id"], + query="alpha", + ), + ) == { + "items": [], + "summary": { + "total_count": 0, + "searched_artifact_count": 0, + "query": "alpha", + "query_terms": ["alpha"], + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + 
"id_asc", + ], + "scope": { + "kind": "artifact", + "task_id": str(task_id), + "task_artifact_id": str(artifact["id"]), + }, + }, + } + + +def test_task_scoped_chunk_retrieval_returns_empty_when_no_chunks_match() -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + store.create_task(task_id=task_id, user_id=user_id) + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path="/tmp/alicebot/task-workspaces/user/task", + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=11, + text="release plan", + ) + + response = retrieve_task_scoped_artifact_chunk_records( + store, + user_id=user_id, + request=TaskScopedArtifactChunkRetrievalInput( + task_id=task_id, + query="alpha", + ), + ) + + assert response == { + "items": [], + "summary": { + "total_count": 0, + "searched_artifact_count": 1, + "query": "alpha", + "query_terms": ["alpha"], + "matching_rule": TASK_ARTIFACT_CHUNK_RETRIEVAL_MATCHING_RULE, + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": { + "kind": "task", + "task_id": str(task_id), + }, + }, + } + + +def test_task_scoped_chunk_retrieval_raises_when_task_is_missing() -> None: + with pytest.raises(TaskNotFoundError, match="was not found"): + retrieve_task_scoped_artifact_chunk_records( + ArtifactStoreStub(), + user_id=uuid4(), + request=TaskScopedArtifactChunkRetrievalInput( + task_id=uuid4(), + query="alpha", + ), + ) + + +def test_artifact_chunk_retrieval_rejects_query_without_words() -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + store.create_task(task_id=task_id, user_id=user_id) + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path="/tmp/alicebot/task-workspaces/user/task", + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + + with pytest.raises( + TaskArtifactChunkRetrievalValidationError, + match="must include at least one word", + ): + retrieve_artifact_scoped_artifact_chunk_records( + store, + user_id=user_id, + request=ArtifactScopedArtifactChunkRetrievalInput( + task_artifact_id=artifact["id"], + query=" ... 
", + ), + ) + + def test_list_and_get_task_artifact_records_are_deterministic() -> None: store = ArtifactStoreStub() user_id = uuid4() diff --git a/tests/unit/test_artifacts_main.py b/tests/unit/test_artifacts_main.py index 9e6e1b7..a009b60 100644 --- a/tests/unit/test_artifacts_main.py +++ b/tests/unit/test_artifacts_main.py @@ -8,9 +8,11 @@ from apps.api.src.alicebot_api.config import Settings from alicebot_api.artifacts import ( TaskArtifactAlreadyExistsError, + TaskArtifactChunkRetrievalValidationError, TaskArtifactNotFoundError, TaskArtifactValidationError, ) +from alicebot_api.tasks import TaskNotFoundError from alicebot_api.workspaces import TaskWorkspaceNotFoundError @@ -105,6 +107,159 @@ def fake_user_connection(*_args, **_kwargs): } +def test_retrieve_task_artifact_chunks_endpoint_returns_payload(monkeypatch) -> None: + user_id = uuid4() + task_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(*_args, **_kwargs): + yield object() + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "retrieve_task_scoped_artifact_chunk_records", + lambda *_args, **_kwargs: { + "items": [], + "summary": { + "total_count": 0, + "searched_artifact_count": 1, + "query": "alpha", + "query_terms": ["alpha"], + "matching_rule": "casefolded_unicode_word_overlap_unique_query_terms_v1", + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": {"kind": "task", "task_id": str(task_id)}, + }, + }, + ) + + response = main_module.retrieve_task_artifact_chunks( + task_id, + main_module.RetrieveArtifactChunksRequest(user_id=user_id, query="alpha"), + ) + + assert response.status_code == 200 + assert json.loads(response.body) == { + "items": [], + "summary": { + "total_count": 0, + "searched_artifact_count": 1, + "query": "alpha", + "query_terms": ["alpha"], + "matching_rule": "casefolded_unicode_word_overlap_unique_query_terms_v1", + "order": [ + "matched_query_term_count_desc", + "first_match_char_start_asc", + "relative_path_asc", + "sequence_no_asc", + "id_asc", + ], + "scope": {"kind": "task", "task_id": str(task_id)}, + }, + } + + +def test_retrieve_task_artifact_chunks_endpoint_maps_task_not_found_to_404(monkeypatch) -> None: + user_id = uuid4() + task_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(*_args, **_kwargs): + yield object() + + def fake_retrieve_task_scoped_artifact_chunk_records(*_args, **_kwargs): + raise TaskNotFoundError(f"task {task_id} was not found") + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "retrieve_task_scoped_artifact_chunk_records", + fake_retrieve_task_scoped_artifact_chunk_records, + ) + + response = main_module.retrieve_task_artifact_chunks( + task_id, + main_module.RetrieveArtifactChunksRequest(user_id=user_id, query="alpha"), + ) + + assert response.status_code == 404 + assert json.loads(response.body) == {"detail": f"task {task_id} was not found"} + + +def test_retrieve_task_artifact_chunks_endpoint_maps_validation_to_400(monkeypatch) -> None: + user_id = uuid4() + task_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def 
fake_user_connection(*_args, **_kwargs): + yield object() + + def fake_retrieve_task_scoped_artifact_chunk_records(*_args, **_kwargs): + raise TaskArtifactChunkRetrievalValidationError( + "artifact chunk retrieval query must include at least one word" + ) + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "retrieve_task_scoped_artifact_chunk_records", + fake_retrieve_task_scoped_artifact_chunk_records, + ) + + response = main_module.retrieve_task_artifact_chunks( + task_id, + main_module.RetrieveArtifactChunksRequest(user_id=user_id, query="alpha"), + ) + + assert response.status_code == 400 + assert json.loads(response.body) == { + "detail": "artifact chunk retrieval query must include at least one word" + } + + +def test_retrieve_artifact_chunk_endpoint_maps_not_found_to_404(monkeypatch) -> None: + user_id = uuid4() + task_artifact_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(*_args, **_kwargs): + yield object() + + def fake_retrieve_artifact_scoped_artifact_chunk_records(*_args, **_kwargs): + raise TaskArtifactNotFoundError(f"task artifact {task_artifact_id} was not found") + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "retrieve_artifact_scoped_artifact_chunk_records", + fake_retrieve_artifact_scoped_artifact_chunk_records, + ) + + response = main_module.retrieve_task_artifact_chunks_for_artifact( + task_artifact_id, + main_module.RetrieveArtifactChunksRequest(user_id=user_id, query="alpha"), + ) + + assert response.status_code == 404 + assert json.loads(response.body) == { + "detail": f"task artifact {task_artifact_id} was not found" + } + + def test_register_task_artifact_endpoint_maps_workspace_not_found_to_404(monkeypatch) -> None: user_id = uuid4() task_workspace_id = uuid4() diff --git a/tests/unit/test_task_artifact_store.py b/tests/unit/test_task_artifact_store.py index df841c0..938c680 100644 --- a/tests/unit/test_task_artifact_store.py +++ b/tests/unit/test_task_artifact_store.py @@ -112,12 +112,14 @@ def test_task_artifact_store_methods_use_expected_queries() -> None: relative_path="docs/spec.txt", ) listed = store.list_task_artifacts() + listed_for_task = store.list_task_artifacts_for_task(task_id) store.lock_task_artifacts(task_workspace_id) assert created["id"] == task_artifact_id assert fetched is not None assert duplicate is not None assert listed[0]["id"] == task_artifact_id + assert listed_for_task[0]["id"] == task_artifact_id assert cursor.executed == [ ( """ @@ -221,6 +223,25 @@ def test_task_artifact_store_methods_use_expected_queries() -> None: """, None, ), + ( + """ + SELECT + id, + user_id, + task_id, + task_workspace_id, + status, + ingestion_status, + relative_path, + media_type_hint, + created_at, + updated_at + FROM task_artifacts + WHERE task_id = %s + ORDER BY created_at ASC, id ASC + """, + (task_id,), + ), ( "SELECT pg_advisory_xact_lock(hashtextextended(%s::text, 4))", (str(task_workspace_id),),
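
The unit tests above pin an exact lexical contract for `extract_unique_lexical_terms` and `match_artifact_chunk_text` without showing the implementation. The following is a minimal sketch of helpers that satisfy those pinned expectations, assuming the rule name `casefolded_unicode_word_overlap_unique_query_terms_v1` means: casefold the text, split it into Unicode word runs, deduplicate terms in first-occurrence order, and match on word overlap. It illustrates the tested behavior only; it is not the shipped `alicebot_api.artifacts` code, and the `None` no-match default is an assumption the tests never exercise (non-matching chunks are excluded before serialization).

```python
import re

# \w+ against str is Unicode-aware by default, matching the
# "unicode_word" part of the rule name.
_WORD_RE = re.compile(r"\w+")


def extract_unique_lexical_terms(text: str) -> list[str]:
    """Casefold, split into word runs, keep unique terms in first-occurrence order."""
    terms: list[str] = []
    for word in _WORD_RE.findall(text.casefold()):
        if word not in terms:
            terms.append(word)
    return terms


def match_artifact_chunk_text(*, query_terms: list[str], chunk_text: str) -> dict[str, object]:
    """Report which query terms occur as words in the chunk.

    Offsets are measured on the casefolded chunk text (an assumption; casefolding
    can change string length for a few characters such as the German eszett).
    """
    first_offset: dict[str, int] = {}
    for match in _WORD_RE.finditer(chunk_text.casefold()):
        first_offset.setdefault(match.group(0), match.start())
    matched = [term for term in query_terms if term in first_offset]
    return {
        "matched_query_terms": matched,  # query-term order, not chunk order
        "matched_query_term_count": len(matched),
        "first_match_char_start": min(
            (first_offset[term] for term in matched), default=None
        ),
    }


# Spot-check against the expectations pinned in tests/unit/test_artifacts.py:
assert extract_unique_lexical_terms("Alpha beta, alpha\nbeta gamma") == [
    "alpha",
    "beta",
    "gamma",
]
assert match_artifact_chunk_text(
    query_terms=["alpha", "beta", "delta"],
    chunk_text="beta alpha release",
) == {
    "matched_query_terms": ["alpha", "beta"],
    "matched_query_term_count": 2,
    "first_match_char_start": 0,
}
```

Chunks that match would then be sorted by the summary's `order` tuple, i.e. `(-matched_query_term_count, first_match_char_start, relative_path, sequence_no, id)`, which is what makes the retrieval deterministic without any scoring model.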