diff --git a/.ai/active/SPRINT_PACKET.md b/.ai/active/SPRINT_PACKET.md index 9c1979c..3547e32 100644 --- a/.ai/active/SPRINT_PACKET.md +++ b/.ai/active/SPRINT_PACKET.md @@ -2,7 +2,7 @@ ## Sprint Title -Sprint 5C: Task Artifact Records and Registration +Sprint 5D: Local Artifact Ingestion V0 ## Sprint Type @@ -10,116 +10,121 @@ feature ## Sprint Reason -Milestone 5 should continue on top of the shipped task-workspace boundary. Before document ingestion or connectors can safely rely on workspaces, the repo needs explicit, reviewable artifact records tied to those workspaces instead of ad hoc filesystem assumptions. +Milestone 5 now has deterministic task-workspace boundaries and explicit task-artifact records. The next safe step is to ingest registered local artifacts into durable chunk records, so later document retrieval can operate on explicit ingested data instead of raw filesystem reads. ## Sprint Intent -Add durable task-artifact records plus a narrow local artifact registration path on top of existing `task_workspaces`, so later document ingestion and retrieval can consume explicit artifact metadata instead of raw workspace scanning. +Add a narrow, explicit local artifact-ingestion seam that reads registered text artifacts from rooted task workspaces, chunks them deterministically, and persists durable artifact-chunk records, without yet adding document retrieval, embeddings, connectors, or UI. ## Git Instructions -- Branch Name: `codex/sprint-5c-task-artifacts` +- Branch Name: `codex/sprint-5d-artifact-ingestion-v0` - Base Branch: `main` - PR Strategy: one sprint branch, one PR, no stacked PRs unless Control Tower explicitly opens a follow-up sprint - Merge Policy: squash merge only after reviewer `PASS` and explicit Control Tower merge approval ## Why This Sprint -- Sprint 5A shipped deterministic rooted task-workspace provisioning. -- Sprint 5B cleaned and synchronized live project truth, so Milestone 5 planning can proceed from current facts. -- The roadmap says artifact and workspace boundaries should be explicit before document-heavy work lands. -- The narrowest next step is artifact records and registration only, not ingestion, chunking, connectors, or UI. +- Sprint 5A shipped task-workspace provisioning. +- Sprint 5C shipped explicit task-artifact registration on top of those workspaces. +- The roadmap says document work should build on the existing artifact/workspace boundary. +- The narrowest next step is ingestion only: turn registered local artifacts into durable chunk records before any retrieval or connector work begins. 
## In Scope - Add schema and migration support for: - - `task_artifacts` + - `task_artifact_chunks` + - any narrow `task_artifacts.ingestion_status` expansion required to represent successful ingestion deterministically - Define typed contracts for: - - artifact registration requests - - artifact create responses - - artifact list responses - - artifact detail responses -- Implement a narrow artifact seam that: - - registers one local file path under an existing visible task workspace - - persists one user-scoped artifact record linked to that workspace and task - - validates that the artifact path stays rooted under the workspace local path - - stores explicit artifact metadata such as relative path, media type hint if supplied, and status fields needed for later ingestion - - exposes deterministic list and detail reads + - artifact-ingestion requests + - artifact-ingestion responses + - artifact-chunk list responses + - artifact detail responses updated for ingestion status if needed +- Implement a narrow ingestion seam that: + - accepts one already-registered visible artifact + - resolves its rooted local file path from the persisted workspace boundary plus artifact relative path + - supports only a small explicit text input set, for example `text/plain` and `text/markdown` + - reads file contents deterministically + - normalizes line endings and chunks text deterministically by one explicit rule + - persists ordered chunk rows linked to the artifact + - updates artifact ingestion status deterministically - Implement the minimal API or service paths needed for: - - registering one artifact for a task workspace - - listing artifacts - - reading one artifact by id + - ingesting one artifact + - listing chunks for one artifact - Add unit and integration tests for: - - artifact registration - - rooted path validation against the workspace boundary - - duplicate registration behavior for the same workspace-relative path + - supported text artifact ingestion + - deterministic chunk ordering and chunk content boundaries + - rooted path enforcement during ingestion + - unsupported media-type or file-shape rejection - per-user isolation - stable response shape ## Out of Scope -- No document ingestion. -- No chunking, embeddings, or document retrieval. -- No background scanning of workspace directories. +- No compile-path or search retrieval over artifact chunks yet. +- No embeddings for artifact chunks. +- No document ranking or chunk selection. +- No PDF, DOCX, OCR, or rich document parsing beyond the narrow supported text set. - No Gmail or Calendar connector scope. - No runner-style orchestration. -- No new proxy handlers or broader side-effect expansion. - No UI work. ## Required Deliverables -- Migration for `task_artifacts`. -- Stable artifact register/list/detail contracts. -- Minimal deterministic artifact-registration and persistence path over existing task workspaces. -- Unit and integration coverage for rooted-path safety, duplicate handling, ordering, and isolation. +- Migration for `task_artifact_chunks` and any narrow ingestion-status update. +- Stable artifact-ingestion and artifact-chunk read contracts. +- Minimal deterministic local artifact-ingestion path over registered task artifacts. +- Unit and integration coverage for rooted-path safety, supported-format ingestion, chunk ordering, and isolation. - Updated `BUILD_REPORT.md` with exact verification results and explicit deferred scope. 
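The deterministic normalization-and-chunking behavior scoped above is small enough to sketch up front. A minimal illustration in Python of the intended rule (normalize line endings, then split into contiguous fixed-size windows); the function names here are illustrative rather than the shipped API, and the 1000-character window matches the rule documented later in this diff:

```python
def normalize_line_endings(text: str) -> str:
    # Rewrite CRLF and bare CR to LF so character offsets are stable
    # regardless of how the source file was written.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def fixed_window_chunks(text: str, window: int = 1000) -> list[tuple[int, int, str]]:
    # Contiguous, non-overlapping windows: re-running over the same
    # normalized text always yields identical (start, end, text) rows.
    chunks: list[tuple[int, int, str]] = []
    for start in range(0, len(text), window):
        end = min(start + window, len(text))
        chunks.append((start, end, text[start:end]))
    return chunks


# "abc\r\ndef" normalizes to "abc\ndef" (7 chars) -> one chunk (0, 7, "abc\ndef").
assert fixed_window_chunks(normalize_line_endings("abc\r\ndef")) == [(0, 7, "abc\ndef")]
```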
## Acceptance Criteria -- A client can register one artifact under an existing visible task workspace. -- Every artifact record stores a workspace-relative path and remains rooted under the persisted workspace local path. -- Duplicate registration behavior for the same workspace-relative path is deterministic and documented. -- Artifact list and detail reads are deterministic and user-scoped. +- A client can ingest one supported registered local artifact into durable ordered chunk records. +- Ingestion reads only files rooted under the persisted task workspace boundary. +- Chunking behavior is deterministic and documented. +- Unsupported artifact types are rejected deterministically. +- Artifact chunk reads are deterministic and user-scoped. - `./.venv/bin/python -m pytest tests/unit` passes. - `./.venv/bin/python -m pytest tests/integration` passes. -- No document ingestion, connector, runner, handler-expansion, or broader side-effect scope enters the sprint. +- No retrieval, embeddings, connector, runner, UI, or broader side-effect scope enters the sprint. ## Implementation Constraints -- Keep the artifact seam narrow and boring. -- Reuse the existing `task_workspaces` boundary; do not invent a parallel storage contract. -- Prefer explicit artifact registration over implicit directory scanning in this sprint. -- Keep rooted-path validation deterministic and local-filesystem-only. -- Do not parse, chunk, embed, or retrieve artifact contents in the same sprint. +- Keep ingestion narrow and boring. +- Reuse existing `task_workspaces` and `task_artifacts` boundaries; do not scan directories implicitly. +- Support only a small explicit text-artifact set in this sprint. +- Keep chunking deterministic and simple enough to test precisely. +- Do not introduce retrieval or embedding behavior in the same sprint. ## Suggested Work Breakdown -1. Add `task_artifacts` schema and migration. -2. Define artifact register/list/detail contracts. -3. Implement deterministic rooted artifact-path validation against the persisted workspace path. -4. Implement artifact registration, list, and detail behavior. -5. Add unit and integration tests. -6. Update `BUILD_REPORT.md` with executed verification. +1. Add `task_artifact_chunks` schema and migration. +2. Define ingestion and chunk-read contracts. +3. Implement deterministic rooted file resolution from artifact metadata. +4. Implement narrow supported-format ingestion and deterministic chunk persistence. +5. Implement artifact chunk list reads. +6. Add unit and integration tests. +7. Update `BUILD_REPORT.md` with executed verification. 
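Step 3 of the breakdown above is the security-critical piece. A hedged sketch of rooted file resolution using `pathlib`; the shipped seam's helper names may differ:

```python
from pathlib import Path


def resolve_rooted_artifact_path(workspace_root: Path, relative_path: str) -> Path:
    # Resolve symlinks and ".." segments before checking containment,
    # so a path like "docs/../../etc/passwd" cannot escape the
    # workspace boundary.
    root = workspace_root.resolve()
    candidate = (root / relative_path).resolve()
    # relative_to raises ValueError when candidate lies outside root;
    # the caller maps that to a deterministic rejection.
    candidate.relative_to(root)
    return candidate
```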
## Build Report Requirements `BUILD_REPORT.md` must include: -- the exact artifact schema and contract changes introduced -- the artifact-path rooting and duplicate-handling rule used +- the exact chunk schema and ingestion contract changes introduced +- the supported file types and chunking rule used - exact commands run - unit and integration test results -- one example artifact registration response -- one example artifact detail response +- one example artifact-ingestion response +- one example artifact-chunk list response - what remains intentionally deferred to later milestones ## Review Focus `REVIEW_REPORT.md` should verify: -- the sprint stayed limited to task artifact records and registration -- artifact paths are deterministic, rooted safely under existing workspaces, and user-scoped -- duplicate handling, ordering, and isolation are test-backed -- no hidden document ingestion, connector, runner, UI, handler-expansion, or broader side-effect scope entered the sprint +- the sprint stayed limited to local artifact ingestion and chunk persistence +- ingestion reuses explicit task-workspace and artifact records rather than filesystem scanning +- rooted-path safety, chunk determinism, ordering, and isolation are test-backed +- no hidden retrieval, embedding, connector, runner, UI, or broader side-effect scope entered the sprint ## Exit Condition -This sprint is complete when the repo can persist deterministic user-scoped task-artifact records rooted under existing task workspaces, expose stable artifact reads, and verify the full path with Postgres-backed tests, while still deferring document ingestion, retrieval, and connector work. +This sprint is complete when the repo can ingest supported registered local artifacts into deterministic durable chunk records, expose stable chunk reads, and verify the full path with Postgres-backed tests, while still deferring document retrieval, embeddings, and connector work. diff --git a/.ai/handoff/CURRENT_STATE.md b/.ai/handoff/CURRENT_STATE.md index a220a6e..680e8af 100644 --- a/.ai/handoff/CURRENT_STATE.md +++ b/.ai/handoff/CURRENT_STATE.md @@ -2,26 +2,27 @@ ## Canonical Truth -- The accepted repo state is current through Sprint 5A. +- The working repo state is current through Sprint 5D, including post-review follow-up fixes for artifact-ingestion coverage and stale docs. - Use [PRODUCT_BRIEF.md](/Users/samirusani/Desktop/Codex/AliceBot/PRODUCT_BRIEF.md) for product scope, [ARCHITECTURE.md](/Users/samirusani/Desktop/Codex/AliceBot/ARCHITECTURE.md) for implemented technical boundaries, [ROADMAP.md](/Users/samirusani/Desktop/Codex/AliceBot/ROADMAP.md) for forward planning, and [RULES.md](/Users/samirusani/Desktop/Codex/AliceBot/RULES.md) for durable operating rules. - Historical build and review reports have been moved under [docs/archive/sprints](/Users/samirusani/Desktop/Codex/AliceBot/docs/archive/sprints). ## Implemented Repo Slice -- `apps/api` is the only shipped product surface. It implements continuity, tracing, deterministic context compilation, governed memory admission and review, embeddings, semantic retrieval, entities, policy and tool governance, approval persistence and resolution, approved-only `proxy.echo` execution, execution budgets, task/task-step lifecycle reads and mutations, explicit manual continuation lineage, explicit task-step linkage for approval and execution synchronization, and deterministic rooted local task-workspace provisioning. 
-- The live schema includes continuity, trace, memory, embedding, entity, governance, `tasks`, `task_steps`, and `task_workspaces` tables with row-level security on user-owned data. +- `apps/api` is the only shipped product surface. It implements continuity, tracing, deterministic context compilation, governed memory admission and review, embeddings, semantic retrieval, entities, policy and tool governance, approval persistence and resolution, approved-only `proxy.echo` execution, execution budgets, task/task-step lifecycle reads and mutations, explicit manual continuation lineage, explicit task-step linkage for approval and execution synchronization, deterministic rooted local task-workspace provisioning, explicit task-artifact registration, and narrow local text-artifact ingestion into durable chunk rows. +- The live schema includes continuity, trace, memory, embedding, entity, governance, `tasks`, `task_steps`, `task_workspaces`, `task_artifacts`, and `task_artifact_chunks` tables with row-level security on user-owned data. - `apps/web` and `workers` remain starter scaffolds only. ## Current Boundaries - Task workspaces are implemented only as deterministic rooted local directories plus durable `task_workspaces` records. +- Task artifacts are implemented only as explicit rooted local-file registrations under those workspaces plus narrow deterministic ingestion for `text/plain` and `text/markdown`. - The shipped multi-step task path is still explicit and narrow: later steps are appended manually with lineage, while approval and execution synchronization use explicit linked `task_step_id` references. - The only execution handler in the repo is the in-process no-external-I/O `proxy.echo` path. ## Not Implemented -- Artifact storage or indexing beyond the local workspace boundary. -- Document ingestion, chunking, or document retrieval. +- Retrieval, ranking, or embeddings over artifact chunks. +- Rich document parsing beyond the narrow local text ingestion seam. - Read-only Gmail or Calendar connectors. - Runner-style orchestration or automatic multi-step progression. - Auth beyond the current database user-context model. @@ -30,17 +31,17 @@ - Memory extraction and retrieval quality remain the main product risk. - Auth is still incomplete beyond database user context. -- Workspace provisioning is intentionally narrow and local; broader artifact and document flows still need their own accepted seams. +- Workspace provisioning and artifact ingestion are intentionally narrow and local; broader retrieval, embedding, connector, and rich-document flows still need their own accepted seams. -## Latest Accepted Verification +## Latest Local Verification -- Sprint 5A review status: `PASS`. -- Accepted verification on March 13, 2026: - - `./.venv/bin/python -m pytest tests/unit` -> `315 passed` - - `./.venv/bin/python -m pytest tests/integration` -> `99 passed` +- Latest review artifact: `PASS WITH FIXES`. +- Post-review local verification on March 14, 2026: + - `./.venv/bin/python -m pytest tests/unit` -> `347 passed` + - `./.venv/bin/python -m pytest tests/integration` -> `104 passed` ## Planning Guardrails -- Plan from the implemented Sprint 5A repo state, not from older milestone narratives. -- Do not describe Milestone 5 document, artifact, connector, or runner work as shipped. +- Plan from the implemented Sprint 5D repo state, not from older milestone narratives. +- Do not describe retrieval, embeddings, connectors, runner work, or broader rich-document handling as shipped. 
- Keep live truth files compact; archive historical detail instead of re-expanding the active context set. diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index f097b55..9899d45 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -2,16 +2,16 @@ ## Current Implemented Slice -AliceBot now implements the accepted repo slice through Sprint 5C. The shipped backend includes: +AliceBot now implements the accepted repo slice through Sprint 5D. The shipped backend includes: - foundation continuity storage over `users`, `threads`, `sessions`, and append-only `events` - deterministic tracing and context compilation over durable continuity, memory, entity, and entity-edge records - governed memory admission, explicit-preference extraction, memory review labels, review queue reads, evaluation summary reads, explicit embedding config and memory-embedding storage, direct semantic retrieval, and deterministic hybrid compile-path memory merge - deterministic prompt assembly and one no-tools response path that persists assistant replies as immutable continuity events - user-scoped consents, policies, policy evaluation, tool registry, allowlist evaluation, tool routing, approval request persistence, approval resolution, approved-only proxy execution through the in-process `proxy.echo` handler, durable execution review, and execution-budget lifecycle plus enforcement -- durable `tasks`, `task_steps`, `task_workspaces`, and `task_artifacts`, deterministic task-step sequencing, explicit task-step transitions, explicit manual continuation with lineage through `parent_step_id`, `source_approval_id`, and `source_execution_id`, explicit `tool_executions.task_step_id` linkage for execution synchronization, deterministic rooted local task-workspace provisioning, and explicit rooted local artifact registration plus deterministic artifact reads +- durable `tasks`, `task_steps`, `task_workspaces`, `task_artifacts`, and `task_artifact_chunks`, deterministic task-step sequencing, explicit task-step transitions, explicit manual continuation with lineage through `parent_step_id`, `source_approval_id`, and `source_execution_id`, explicit `tool_executions.task_step_id` linkage for execution synchronization, deterministic rooted local task-workspace provisioning, explicit rooted local artifact registration, deterministic local text-artifact ingestion into durable chunk rows, and deterministic artifact plus chunk reads -The current multi-step boundary is narrow and explicit. Manual continuation is implemented and review-passed. Approval resolution and proxy execution now both use explicit task-step linkage rather than first-step inference. Task workspaces are now implemented only as deterministic rooted local boundaries, and task artifacts are now implemented only as explicit rooted local-file registrations under those workspaces. Broader runner-style orchestration, automatic multi-step progression, artifact indexing, document ingestion, connectors, and new side-effect surfaces are still planned later and must not be described as live behavior. +The current multi-step boundary is narrow and explicit. Manual continuation is implemented and review-passed. Approval resolution and proxy execution now both use explicit task-step linkage rather than first-step inference. Task workspaces are now implemented only as deterministic rooted local boundaries, and task artifacts are now implemented only as explicit rooted local-file registrations plus narrow deterministic text ingestion under those workspaces. 
Broader runner-style orchestration, automatic multi-step progression, retrieval over artifact chunks, embeddings, rich-document parsing, connectors, and new side-effect surfaces are still planned later and must not be described as live behavior. ## Implemented Now @@ -24,7 +24,7 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i - memory and retrieval: `POST /v0/memories/admit`, `POST /v0/memories/extract-explicit-preferences`, `GET /v0/memories`, `GET /v0/memories/review-queue`, `GET /v0/memories/evaluation-summary`, `POST /v0/memories/semantic-retrieval`, `GET /v0/memories/{memory_id}`, `GET /v0/memories/{memory_id}/revisions`, `POST /v0/memories/{memory_id}/labels`, `GET /v0/memories/{memory_id}/labels` - embeddings and graph seams: `POST /v0/embedding-configs`, `GET /v0/embedding-configs`, `POST /v0/memory-embeddings`, `GET /v0/memories/{memory_id}/embeddings`, `GET /v0/memory-embeddings/{memory_embedding_id}`, `POST /v0/entities`, `GET /v0/entities`, `GET /v0/entities/{entity_id}`, `POST /v0/entity-edges`, `GET /v0/entities/{entity_id}/edges` - governance: `POST /v0/consents`, `GET /v0/consents`, `POST /v0/policies`, `GET /v0/policies`, `GET /v0/policies/{policy_id}`, `POST /v0/policies/evaluate`, `POST /v0/tools`, `GET /v0/tools`, `GET /v0/tools/{tool_id}`, `POST /v0/tools/allowlist/evaluate`, `POST /v0/tools/route`, `POST /v0/approvals/requests`, `GET /v0/approvals`, `GET /v0/approvals/{approval_id}`, `POST /v0/approvals/{approval_id}/approve`, `POST /v0/approvals/{approval_id}/reject`, `POST /v0/approvals/{approval_id}/execute` - - task and execution review: `GET /v0/tasks`, `GET /v0/tasks/{task_id}`, `POST /v0/tasks/{task_id}/workspace`, `GET /v0/task-workspaces`, `GET /v0/task-workspaces/{task_workspace_id}`, `POST /v0/task-workspaces/{task_workspace_id}/artifacts`, `GET /v0/task-artifacts`, `GET /v0/task-artifacts/{task_artifact_id}`, `GET /v0/tasks/{task_id}/steps`, `GET /v0/task-steps/{task_step_id}`, `POST /v0/tasks/{task_id}/steps`, `POST /v0/task-steps/{task_step_id}/transition`, `POST /v0/execution-budgets`, `GET /v0/execution-budgets`, `GET /v0/execution-budgets/{execution_budget_id}`, `POST /v0/execution-budgets/{execution_budget_id}/deactivate`, `POST /v0/execution-budgets/{execution_budget_id}/supersede`, `GET /v0/tool-executions`, `GET /v0/tool-executions/{execution_id}` +- task and execution review: `GET /v0/tasks`, `GET /v0/tasks/{task_id}`, `POST /v0/tasks/{task_id}/workspace`, `GET /v0/task-workspaces`, `GET /v0/task-workspaces/{task_workspace_id}`, `POST /v0/task-workspaces/{task_workspace_id}/artifacts`, `GET /v0/task-artifacts`, `GET /v0/task-artifacts/{task_artifact_id}`, `POST /v0/task-artifacts/{task_artifact_id}/ingest`, `GET /v0/task-artifacts/{task_artifact_id}/chunks`, `GET /v0/tasks/{task_id}/steps`, `GET /v0/task-steps/{task_step_id}`, `POST /v0/tasks/{task_id}/steps`, `POST /v0/task-steps/{task_step_id}/transition`, `POST /v0/execution-budgets`, `GET /v0/execution-budgets`, `GET /v0/execution-budgets/{execution_budget_id}`, `POST /v0/execution-budgets/{execution_budget_id}/deactivate`, `POST /v0/execution-budgets/{execution_budget_id}/supersede`, `GET /v0/tool-executions`, `GET /v0/tool-executions/{execution_id}` - `apps/web` and `workers` remain starter shells only. ### Data Foundation @@ -37,7 +37,7 @@ The current multi-step boundary is narrow and explicit. 
Manual continuation is i - memory and retrieval tables: `memories`, `memory_revisions`, `memory_review_labels`, `embedding_configs`, `memory_embeddings` - graph tables: `entities`, `entity_edges` - governance tables: `consents`, `policies`, `tools`, `approvals`, `tool_executions`, `execution_budgets` - - task lifecycle tables: `tasks`, `task_steps`, `task_workspaces`, `task_artifacts` + - task lifecycle tables: `tasks`, `task_steps`, `task_workspaces`, `task_artifacts`, `task_artifact_chunks` - `events`, `trace_events`, and `memory_revisions` are append-only by application contract and database enforcement. - `memory_review_labels` are append-only by database enforcement. - `tasks` are explicit user-scoped lifecycle records keyed to one thread and one tool, with durable request/tool snapshots, status in `pending_approval | approved | executed | denied | blocked`, and latest approval/execution pointers for the current narrow lifecycle seam. @@ -49,18 +49,19 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i - Lineage fields are guarded by composite user-scoped foreign keys and a self-reference check so a step cannot cite itself as its parent. - `tool_executions` now persist an explicit `task_step_id` linked by a composite foreign key to `task_steps(id, user_id)`. - `task_workspaces` persist one active workspace record per visible task and user, store a deterministic `local_path`, and enforce that active uniqueness through a partial unique index on `(user_id, task_id)`. -- `task_artifacts` persist explicit user-scoped artifact rows linked to both `tasks` and `task_workspaces`, store `status = registered`, `ingestion_status = pending`, store only a workspace-relative `relative_path` plus optional `media_type_hint`, and enforce deterministic duplicate rejection through a unique index on `(user_id, task_workspace_id, relative_path)`. +- `task_artifacts` persist explicit user-scoped artifact rows linked to both `tasks` and `task_workspaces`, store `status = registered`, `ingestion_status in ('pending', 'ingested')`, store only a workspace-relative `relative_path` plus optional `media_type_hint`, and enforce deterministic duplicate rejection through a unique index on `(user_id, task_workspace_id, relative_path)`. +- `task_artifact_chunks` persist explicit user-scoped durable chunk rows linked to one artifact, store ordered `sequence_no`, zero-based `char_start`, exclusive `char_end_exclusive`, and chunk `text`, and enforce deterministic uniqueness through a unique index on `(user_id, task_artifact_id, sequence_no)`. - `execution_budgets` enforce at most one active budget per `(user_id, tool_key, domain_hint)` selector scope through a partial unique index. - Per-request user context is set in the database through `app.current_user_id()`. - `TASK_WORKSPACE_ROOT` defines the only allowed base directory for workspace provisioning, and the live path rule is `resolved_root / user_id / task_id`. ### Repo Boundaries In This Slice -- `apps/api`: implemented API, store, contracts, service logic, and migrations for continuity, tracing, memory, embeddings, entities, policies, tools, approvals, proxy execution, execution budgets, tasks, task steps, task workspaces, and task artifacts. +- `apps/api`: implemented API, store, contracts, service logic, and migrations for continuity, tracing, memory, embeddings, entities, policies, tools, approvals, proxy execution, execution budgets, tasks, task steps, task workspaces, task artifacts, and narrow local artifact chunk ingestion. 
- `apps/web`: minimal shell only; no shipped workflow UI. - `workers`: scaffold only; no background jobs or runner logic are implemented. - `infra`: local development bootstrap assets only. -- `tests`: unit and Postgres-backed integration coverage for the shipped seams above, including Sprint 4O task-step lineage/manual continuation, Sprint 4S step-linked execution synchronization, Sprint 5A task-workspace provisioning, and Sprint 5C task-artifact registration. +- `tests`: unit and Postgres-backed integration coverage for the shipped seams above, including Sprint 4O task-step lineage/manual continuation, Sprint 4S step-linked execution synchronization, Sprint 5A task-workspace provisioning, Sprint 5C task-artifact registration, and Sprint 5D local artifact ingestion plus chunk reads. ## Core Flows Implemented Now @@ -175,12 +176,26 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i 8. `GET /v0/task-artifacts` lists visible artifact rows in deterministic `created_at ASC, id ASC` order. 9. `GET /v0/task-artifacts/{task_artifact_id}` returns one user-visible artifact detail record. +### Task Artifact Ingestion And Chunk Reads + +1. Accept a user-scoped `POST /v0/task-artifacts/{task_artifact_id}/ingest` request for one visible registered artifact. +2. Lock ingestion for that artifact before deciding whether work is needed. +3. Resolve the persisted workspace `local_path` plus persisted artifact `relative_path`, and reject any rooted-path escape deterministically. +4. Support only the narrow explicit text set: `text/plain` and `text/markdown`. +5. Read file bytes deterministically and require valid UTF-8 text. +6. Normalize line endings by rewriting `\r\n` and `\r` to `\n`. +7. Chunk normalized text deterministically with rule `normalized_utf8_text_fixed_window_1000_chars_v1`. +8. Persist ordered `task_artifact_chunks` rows with `sequence_no`, `char_start`, `char_end_exclusive`, and `text`. +9. Update the parent artifact to `ingestion_status = ingested`. +10. If the artifact is already ingested, return the existing artifact and chunk summary without reinserting chunks. +11. `GET /v0/task-artifacts/{task_artifact_id}/chunks` returns visible chunk rows in deterministic `sequence_no ASC, id ASC` order plus stable summary metadata. + ## Security Model Implemented Now -- User-owned continuity, trace, memory, embedding, entity, governance, task, task-step, task-workspace, and task-artifact tables enforce row-level security. +- User-owned continuity, trace, memory, embedding, entity, governance, task, task-step, task-workspace, task-artifact, and task-artifact-chunk tables enforce row-level security. - The runtime role is limited to the narrow `SELECT` / `INSERT` / `UPDATE` permissions required by the shipped seams; there is no broad DDL or unrestricted table access at runtime. - Cross-user references are constrained through composite foreign keys on `(id, user_id)` where the schema needs ownership-linked joins. -- Approval, execution, memory, entity, task/task-step, task-workspace, and task-artifact reads all operate only inside the current user scope. +- Approval, execution, memory, entity, task/task-step, task-workspace, task-artifact, and task-artifact-chunk reads all operate only inside the current user scope. 
- Task-step manual continuation adds both schema-level and service-level lineage protection: - schema-level: user-scoped foreign keys and parent-not-self check - service-level: same-task, latest-step, visible-approval, visible-execution, and parent-outcome-match validation @@ -205,7 +220,11 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i - artifact register/list/detail response shape - rooted artifact-path enforcement beneath the persisted workspace path - duplicate artifact registration rejection for the same workspace-relative path - - task-artifact per-user isolation + - supported `text/plain` and `text/markdown` ingestion + - deterministic line-ending normalization and fixed-window chunk boundaries + - invalid UTF-8 rejection + - idempotent re-ingestion of already ingested artifacts + - task-artifact and task-artifact-chunk per-user isolation - trace visibility for continuation and transition events - user isolation for task and task-step reads and mutations - adversarial lineage validation for cross-task, cross-user, and parent-step mismatch cases @@ -215,7 +234,8 @@ The current multi-step boundary is narrow and explicit. Manual continuation is i The following areas remain planned later and must not be described as implemented: - runner-style orchestration and automatic multi-step progression beyond the current explicit manual continuation seam -- artifact indexing, artifact content processing, and document ingestion beyond the current explicit rooted local registration boundary +- retrieval over artifact chunks, chunk ranking, and embeddings beyond the current explicit rooted local ingestion boundary +- rich document parsing beyond the current narrow UTF-8 text and markdown ingestion boundary - read-only Gmail and Calendar connectors - broader tool proxying and real-world side effects beyond the current no-I/O `proxy.echo` handler - model-driven extraction, reranking, and broader memory review automation diff --git a/BUILD_REPORT.md b/BUILD_REPORT.md index a8fde01..828d9a5 100644 --- a/BUILD_REPORT.md +++ b/BUILD_REPORT.md @@ -2,55 +2,82 @@ ## sprint objective -Implement Sprint 5C: Task Artifact Records and Registration by adding durable user-scoped `task_artifacts` records on top of existing `task_workspaces`, plus a narrow rooted local-file registration path and deterministic artifact reads. +Implement Sprint 5D: Local Artifact Ingestion V0 by adding a narrow, deterministic ingestion path that reads one registered local text artifact from its persisted task workspace boundary, chunks normalized text into durable ordered records, and exposes stable chunk reads without adding retrieval, embeddings, connectors, runners, or UI scope. ## completed work -- Added `task_artifacts` schema via `apps/api/alembic/versions/20260313_0023_task_artifacts.py`. -- Added artifact contracts in `apps/api/src/alicebot_api/contracts.py`: - - `TaskArtifactRegisterInput` - - `TaskArtifactRecord` - - `TaskArtifactCreateResponse` - - `TaskArtifactListResponse` - - `TaskArtifactDetailResponse` - - `TaskArtifactStatus = "registered"` - - `TaskArtifactIngestionStatus = "pending"` - - `TASK_ARTIFACT_LIST_ORDER = ["created_at_asc", "id_asc"]` -- Added artifact persistence and reads in `apps/api/src/alicebot_api/store.py`: - - `TaskArtifactRow` - - create/get/list/duplicate-lookup methods - - advisory lock for per-workspace artifact registration -- Added narrow artifact registration service in `apps/api/src/alicebot_api/artifacts.py`. 
+- Added migration `apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py`. +- Expanded `task_artifacts.ingestion_status` from `pending` to `pending | ingested`. +- Added durable `task_artifact_chunks` storage with user scoping, RLS, and ordered per-artifact uniqueness. +- Added artifact-ingestion contracts in `apps/api/src/alicebot_api/contracts.py`: + - `TaskArtifactIngestInput` + - `TaskArtifactChunkRecord` + - `TaskArtifactChunkListSummary` + - `TaskArtifactChunkListResponse` + - `TaskArtifactIngestionResponse` + - `TASK_ARTIFACT_CHUNK_LIST_ORDER = ["sequence_no_asc", "id_asc"]` + - `TaskArtifactIngestionStatus = "pending" | "ingested"` +- Added artifact-ingestion service behavior in `apps/api/src/alicebot_api/artifacts.py`: + - rooted file resolution from persisted workspace `local_path` plus artifact `relative_path` + - explicit supported media types only: `text/plain`, `text/markdown` + - strict UTF-8 text decoding + - line-ending normalization to `\n` + - deterministic fixed-window chunking rule + - durable ordered chunk persistence + - deterministic `ingestion_status` transition to `ingested` +- Added store support in `apps/api/src/alicebot_api/store.py`: + - `TaskArtifactChunkRow` + - advisory lock for per-artifact ingestion + - create/list chunk methods + - artifact ingestion-status update method - Added API routes in `apps/api/src/alicebot_api/main.py`: - - `POST /v0/task-workspaces/{task_workspace_id}/artifacts` - - `GET /v0/task-artifacts` - - `GET /v0/task-artifacts/{task_artifact_id}` -- Added unit and integration coverage for rooted path validation, duplicate behavior, deterministic ordering, response shape, migration coverage, and per-user isolation. - -Artifact schema introduced: + - `POST /v0/task-artifacts/{task_artifact_id}/ingest` + - `GET /v0/task-artifacts/{task_artifact_id}/chunks` +- Added unit and integration coverage for: + - supported text ingestion + - direct `text/markdown` ingestion + - deterministic chunk ordering and boundaries + - rooted-path enforcement during ingestion + - invalid UTF-8 rejection + - idempotent re-ingestion + - unsupported media-type rejection + - per-user isolation + - stable ingestion and chunk-list response shapes +- Refreshed `ARCHITECTURE.md` and `.ai/handoff/CURRENT_STATE.md` so the documented shipped slice now matches Sprint 5D ingestion behavior and deferred scope. 
+ +Exact chunk schema introduced: - `id uuid PRIMARY KEY` - `user_id uuid NOT NULL` -- `task_id uuid NOT NULL` -- `task_workspace_id uuid NOT NULL` -- `status text NOT NULL CHECK (status IN ('registered'))` -- `ingestion_status text NOT NULL CHECK (ingestion_status IN ('pending'))` -- `relative_path text NOT NULL` -- `media_type_hint text NULL` +- `task_artifact_id uuid NOT NULL` +- `sequence_no integer NOT NULL CHECK (sequence_no >= 1)` +- `char_start integer NOT NULL CHECK (char_start >= 0)` +- `char_end_exclusive integer NOT NULL CHECK (char_end_exclusive > char_start)` +- `text text NOT NULL CHECK (length(text) > 0)` - `created_at timestamptz NOT NULL` - `updated_at timestamptz NOT NULL` -- foreign keys to `(tasks.id, user_id)` and `(task_workspaces.id, user_id)` -- unique index on `(user_id, task_workspace_id, relative_path)` -- user-owned RLS policy with runtime grants limited to `SELECT, INSERT` +- foreign key to `(task_artifacts.id, user_id)` with `ON DELETE CASCADE` +- unique index on `(user_id, task_artifact_id, sequence_no)` +- user-owned RLS policy +- runtime grants limited to `SELECT, INSERT` on `task_artifact_chunks` +- runtime `UPDATE` added on `task_artifacts` so ingestion can set `ingestion_status` + +Supported file types and chunking rule: + +- Supported media types: `text/plain`, `text/markdown` +- Text decoding: UTF-8 only +- Line-ending normalization: `\r\n` and `\r` become `\n` +- Chunking rule: `normalized_utf8_text_fixed_window_1000_chars_v1` +- Chunk boundary rule: split normalized text into contiguous, non-overlapping 1000-character windows with zero-based `char_start` and exclusive `char_end_exclusive` -Artifact-path rooting and duplicate-handling rule: +Exact ingestion contract changes introduced: -- Registration accepts one existing regular file path. -- The file path is resolved locally and must stay rooted under the persisted workspace `local_path`. -- The persisted record stores only the workspace-relative POSIX path, not an absolute artifact path. -- Duplicate registration for the same `(user_id, task_workspace_id, relative_path)` is rejected with HTTP `409`. 
+- Request input: `TaskArtifactIngestInput(task_artifact_id)`
+- Ingestion response: `{"artifact": TaskArtifactRecord, "summary": TaskArtifactChunkListSummary}`
+- Chunk list response: `{"items": list[TaskArtifactChunkRecord], "summary": TaskArtifactChunkListSummary}`
+- Artifact detail payload remains stable except `ingestion_status` can now be `ingested`

-Example artifact registration response:
+Example artifact-ingestion response, for a registered `docs/spec.txt` whose normalized text is the 7-character string `"abc\ndef"` and therefore fits in a single 1000-character window:

 ```json
 {
@@ -59,47 +86,74 @@ Example artifact registration response:
     "task_id": "22222222-2222-2222-2222-222222222222",
     "task_workspace_id": "33333333-3333-3333-3333-333333333333",
     "status": "registered",
-    "ingestion_status": "pending",
+    "ingestion_status": "ingested",
     "relative_path": "docs/spec.txt",
     "media_type_hint": "text/plain",
-    "created_at": "2026-03-13T10:00:00+00:00",
-    "updated_at": "2026-03-13T10:00:00+00:00"
+    "created_at": "2026-03-14T10:00:00+00:00",
+    "updated_at": "2026-03-14T10:00:01+00:00"
+  },
+  "summary": {
+    "total_count": 1,
+    "total_characters": 7,
+    "media_type": "text/plain",
+    "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1",
+    "order": ["sequence_no_asc", "id_asc"]
+  }
 }
 ```

-Example artifact detail response:
+Example artifact-chunk list response for the same artifact:

 ```json
 {
-  "artifact": {
-    "id": "11111111-1111-1111-1111-111111111111",
-    "task_id": "22222222-2222-2222-2222-222222222222",
-    "task_workspace_id": "33333333-3333-3333-3333-333333333333",
-    "status": "registered",
-    "ingestion_status": "pending",
-    "relative_path": "docs/spec.txt",
-    "media_type_hint": "text/plain",
-    "created_at": "2026-03-13T10:00:00+00:00",
-    "updated_at": "2026-03-13T10:00:00+00:00"
+  "items": [
+    {
+      "id": "44444444-4444-4444-4444-444444444444",
+      "task_artifact_id": "11111111-1111-1111-1111-111111111111",
+      "sequence_no": 1,
+      "char_start": 0,
+      "char_end_exclusive": 7,
+      "text": "abc\ndef",
+      "created_at": "2026-03-14T10:00:01+00:00",
+      "updated_at": "2026-03-14T10:00:01+00:00"
+    }
+  ],
+  "summary": {
+    "total_count": 1,
+    "total_characters": 7,
+    "media_type": "text/plain",
+    "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1",
+    "order": ["sequence_no_asc", "id_asc"]
+  }
 }
 ```

 ## incomplete work

-- None within Sprint 5C scope.
+- None within Sprint 5D scope.
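For reference, the `summary` blocks in the example responses above are pure functions of the persisted chunk rows. A minimal sketch of that derivation, mirroring `build_task_artifact_chunk_list_summary` in `artifacts.py` further down this diff:

```python
def summarize_chunks(chunk_rows: list[dict], media_type: str) -> dict:
    # total_characters counts persisted normalized characters; when
    # chunks are contiguous from offset zero, it equals the final
    # chunk's char_end_exclusive.
    return {
        "total_count": len(chunk_rows),
        "total_characters": sum(len(row["text"]) for row in chunk_rows),
        "media_type": media_type,
        "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1",
        "order": ["sequence_no_asc", "id_asc"],
    }
```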
## files changed -- `apps/api/alembic/versions/20260313_0023_task_artifacts.py` +- `apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py` - `apps/api/src/alicebot_api/artifacts.py` - `apps/api/src/alicebot_api/contracts.py` - `apps/api/src/alicebot_api/main.py` - `apps/api/src/alicebot_api/store.py` +- `ARCHITECTURE.md` +- `.ai/handoff/CURRENT_STATE.md` - `tests/integration/test_migrations.py` - `tests/integration/test_task_artifacts_api.py` -- `tests/unit/test_20260313_0023_task_artifacts.py` +- `tests/unit/test_20260314_0024_task_artifact_chunks.py` - `tests/unit/test_artifacts.py` - `tests/unit/test_artifacts_main.py` - `tests/unit/test_main.py` @@ -108,21 +162,35 @@ Example artifact detail response: ## tests run -- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py tests/unit/test_artifacts_main.py tests/unit/test_task_artifact_store.py tests/unit/test_20260313_0023_task_artifacts.py tests/unit/test_main.py tests/integration/test_task_artifacts_api.py tests/integration/test_migrations.py` - - result: `54 passed, 3 errors` - - note: the 3 errors were sandboxed local-Postgres connection failures before rerunning with elevated access +- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py tests/unit/test_artifacts_main.py tests/unit/test_task_artifact_store.py tests/unit/test_20260314_0024_task_artifact_chunks.py tests/unit/test_main.py` + - result: `63 passed in 0.77s` +- `./.venv/bin/python -m pytest tests/unit/test_artifacts.py` + - result: `16 passed in 0.11s` +- `./.venv/bin/python -m pytest tests/integration/test_task_artifacts_api.py` + - rerun with local access: `5 passed in 1.72s` - `./.venv/bin/python -m pytest tests/unit` - - result: `332 passed` + - result: `347 passed in 0.56s` - `./.venv/bin/python -m pytest tests/integration` - - result: `100 passed in 29.55s` + - first sandboxed attempt failed to reach local Postgres and open a local socket + - rerun with local access: `104 passed in 30.87s` - `git diff --check` - result: passed ## blockers/issues - No remaining implementation blockers. -- Document ingestion, chunking, embeddings over artifacts, retrieval, connectors, scanning, UI work, and broader side effects remain intentionally deferred. +- Local Postgres-backed integration tests required running outside the default sandbox; after rerun with local access, the full suite passed. ## recommended next step -Build the next milestone on top of these explicit artifact records by adding a separate ingestion workflow that consumes `task_artifacts` metadata without weakening the task-workspace boundary or reintroducing implicit directory scanning. +Build the next milestone on top of these durable chunk records by adding retrieval over ingested chunks only, while still keeping embeddings, ranking, rich-document parsing, connectors, orchestration, and UI changes out of scope until separately sprinted. 
+ +## intentionally deferred + +- Retrieval or search over artifact chunks +- Embeddings for artifact chunks +- Ranking or chunk selection +- PDF, DOCX, OCR, or rich document parsing +- Connector ingestion +- Runner/orchestration behavior +- UI changes diff --git a/REVIEW_REPORT.md b/REVIEW_REPORT.md index 1863373..16ae984 100644 --- a/REVIEW_REPORT.md +++ b/REVIEW_REPORT.md @@ -6,53 +6,44 @@ PASS ## criteria met -- Added the required `task_artifacts` migration and kept the schema narrow: user-scoped rows linked to both `tasks` and `task_workspaces`, explicit status fields, a unique `(user_id, task_workspace_id, relative_path)` constraint, and RLS-enabled runtime access limited to `SELECT, INSERT`. -- Added stable typed contracts for artifact registration, create, list, and detail responses in `apps/api/src/alicebot_api/contracts.py`. -- Implemented the required narrow API seam: - - `POST /v0/task-workspaces/{task_workspace_id}/artifacts` - - `GET /v0/task-artifacts` - - `GET /v0/task-artifacts/{task_artifact_id}` -- Registration is rooted safely under the persisted workspace path: the local file path is resolved, required to exist as a regular file, checked against the resolved workspace root, and persisted only as a workspace-relative POSIX path. -- Duplicate registration behavior is deterministic and documented in code/tests: the same workspace-relative path returns HTTP `409`. -- Artifact reads are deterministic and user-scoped: list order is `created_at ASC, id ASC`, detail/list reads run inside the current user DB scope, and isolation is covered by integration tests. -- The sprint stayed within scope. No ingestion, chunking, retrieval, connector, runner, UI, or broader side-effect work entered the diff. -- `BUILD_REPORT.md` includes the schema summary, contract changes, duplicate/rooting rules, exact commands, example responses, test results, and deferred scope. -- Verification passed: - - `./.venv/bin/python -m pytest tests/unit` -> `332 passed` - - `./.venv/bin/python -m pytest tests/integration` -> `100 passed in 27.20s` +- The sprint stayed within the intended slice: local artifact ingestion and chunk persistence only. I did not find retrieval, embeddings, connector, runner, or UI overreach in the changed code. +- The implementation reuses existing `task_workspaces` and `task_artifacts` records instead of scanning the filesystem. +- Ingestion resolves the artifact path from the persisted workspace root plus stored `relative_path`, and rejects rooted-path escapes deterministically. +- Supported text ingestion works for registered local artifacts and persists durable ordered `task_artifact_chunks` rows. +- Chunking is deterministic and documented in code and `BUILD_REPORT.md`: normalized line endings plus fixed 1000-character windows. +- Unsupported media types are rejected deterministically. +- Chunk reads are deterministic and user-scoped. +- The follow-up fixes added direct test coverage for `text/markdown` ingestion, invalid UTF-8 rejection, and idempotent re-ingestion. +- The stale architecture and handoff docs were updated to reflect Sprint 5D behavior and boundaries. +- Verification rerun during review: + - `./.venv/bin/python -m pytest tests/unit` -> `347 passed` + - `./.venv/bin/python -m pytest tests/integration` -> `104 passed` after rerunning with local access to Postgres and a local test socket - `git diff --check` -> passed ## criteria missed -- None on the sprint packet’s functional acceptance criteria. +- No acceptance criteria from `SPRINT_PACKET.md` were missed. 
## quality issues -- None blocking in the implementation reviewed. +- None found in the reviewed sprint scope. ## regression risks -- Low. The change is narrowly scoped to artifact registration/read behavior on top of the existing workspace seam. -- The follow-up change is documentation-only and aligns `ARCHITECTURE.md` with the already reviewed implementation. +- No material regression risk beyond the normal risk profile for this slice. ## docs issues -- None blocking. The previous architecture drift is now fixed. -- `RULES.md` does not appear to need changes for this sprint. +- None. `ARCHITECTURE.md`, `.ai/handoff/CURRENT_STATE.md`, and `BUILD_REPORT.md` now match the landed implementation and verification state. ## should anything be added to RULES.md? -- No. The existing rules already cover the durable guidance this sprint exercised: narrow scope, typed contracts, migration-backed schemas, RLS on user-owned tables, and test-backed delivery. +- No. ## should anything update ARCHITECTURE.md? -- No additional update is required for this sprint. The follow-up doc change now reflects the shipped Sprint 5C boundary: - - `task_artifacts` is listed in the implemented data model - - the artifact register/list/detail endpoints are listed in the live API surface - - the rooted registration and duplicate-rejection rules are documented - - deferred scope is narrowed to indexing/content processing/ingestion beyond the current registration seam +- No further update required from this review pass. ## recommended next action -- Accept Sprint 5C. -- Keep later milestone work focused on artifact indexing and ingestion as a separate seam on top of these explicit artifact records. +- Accept the sprint and proceed with the normal merge path once Control Tower approves. 
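One concurrency detail in the implementation below is worth calling out before the migration: ingestion serializes concurrent requests per artifact with a transaction-scoped Postgres advisory lock (see `LOCK_TASK_ARTIFACT_INGESTION_SQL` in `store.py`). A minimal sketch of the pattern, assuming a psycopg-style connection; the namespace constant `5` matches the shipped SQL:

```python
from uuid import UUID


def lock_artifact_ingestion(conn, task_artifact_id: UUID) -> None:
    # pg_advisory_xact_lock blocks until the lock is available and
    # releases it automatically when the enclosing transaction ends,
    # so two concurrent ingest requests for the same artifact cannot
    # both pass the "already ingested?" check and double-insert chunks.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pg_advisory_xact_lock(hashtextextended(%s::text, 5))",
            (str(task_artifact_id),),
        )
```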
diff --git a/apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py b/apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py new file mode 100644 index 0000000..ee51410 --- /dev/null +++ b/apps/api/alembic/versions/20260314_0024_task_artifact_chunks.py @@ -0,0 +1,97 @@ +"""Add user-scoped task artifact chunk records.""" + +from __future__ import annotations + +from alembic import op + + +revision = "20260314_0024" +down_revision = "20260313_0023" +branch_labels = None +depends_on = None + +_RLS_TABLES = ("task_artifact_chunks",) + +_UPGRADE_SCHEMA_STATEMENT = """ + CREATE TABLE task_artifact_chunks ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE, + task_artifact_id uuid NOT NULL, + sequence_no integer NOT NULL, + char_start integer NOT NULL, + char_end_exclusive integer NOT NULL, + text text NOT NULL, + created_at timestamptz NOT NULL DEFAULT now(), + updated_at timestamptz NOT NULL DEFAULT now(), + UNIQUE (id, user_id), + CONSTRAINT task_artifact_chunks_artifact_user_fk + FOREIGN KEY (task_artifact_id, user_id) + REFERENCES task_artifacts(id, user_id) + ON DELETE CASCADE, + CONSTRAINT task_artifact_chunks_sequence_no_check + CHECK (sequence_no >= 1), + CONSTRAINT task_artifact_chunks_char_start_check + CHECK (char_start >= 0), + CONSTRAINT task_artifact_chunks_char_end_exclusive_check + CHECK (char_end_exclusive > char_start), + CONSTRAINT task_artifact_chunks_text_nonempty_check + CHECK (length(text) > 0) + ); + + CREATE UNIQUE INDEX task_artifact_chunks_artifact_sequence_idx + ON task_artifact_chunks (user_id, task_artifact_id, sequence_no); + """ + +_UPGRADE_GRANT_STATEMENTS = ( + "GRANT UPDATE ON task_artifacts TO alicebot_app", + "GRANT SELECT, INSERT ON task_artifact_chunks TO alicebot_app", +) + +_UPGRADE_POLICY_STATEMENT = """ + CREATE POLICY task_artifact_chunks_is_owner ON task_artifact_chunks + USING (user_id = app.current_user_id()) + WITH CHECK (user_id = app.current_user_id()); + """ + +_UPGRADE_TASK_ARTIFACTS_STATEMENTS = ( + "ALTER TABLE task_artifacts DROP CONSTRAINT task_artifacts_ingestion_status_check", + """ + ALTER TABLE task_artifacts + ADD CONSTRAINT task_artifacts_ingestion_status_check + CHECK (ingestion_status IN ('pending', 'ingested')) + """, +) + +_DOWNGRADE_STATEMENTS = ( + "REVOKE UPDATE ON task_artifacts FROM alicebot_app", + "DROP TABLE IF EXISTS task_artifact_chunks", + "ALTER TABLE task_artifacts DROP CONSTRAINT task_artifacts_ingestion_status_check", + """ + ALTER TABLE task_artifacts + ADD CONSTRAINT task_artifacts_ingestion_status_check + CHECK (ingestion_status IN ('pending')) + """, +) + + +def _execute_statements(statements: tuple[str, ...]) -> None: + for statement in statements: + op.execute(statement) + + +def _enable_row_level_security() -> None: + for table_name in _RLS_TABLES: + op.execute(f"ALTER TABLE {table_name} ENABLE ROW LEVEL SECURITY") + op.execute(f"ALTER TABLE {table_name} FORCE ROW LEVEL SECURITY") + + +def upgrade() -> None: + _execute_statements(_UPGRADE_TASK_ARTIFACTS_STATEMENTS) + op.execute(_UPGRADE_SCHEMA_STATEMENT) + _execute_statements(_UPGRADE_GRANT_STATEMENTS) + _enable_row_level_security() + op.execute(_UPGRADE_POLICY_STATEMENT) + + +def downgrade() -> None: + _execute_statements(_DOWNGRADE_STATEMENTS) diff --git a/apps/api/src/alicebot_api/artifacts.py b/apps/api/src/alicebot_api/artifacts.py index c91f77c..611dbe1 100644 --- a/apps/api/src/alicebot_api/artifacts.py +++ b/apps/api/src/alicebot_api/artifacts.py @@ -8,17 +8,33 @@ from 
alicebot_api.contracts import ( TASK_ARTIFACT_LIST_ORDER, + TASK_ARTIFACT_CHUNK_LIST_ORDER, TaskArtifactCreateResponse, TaskArtifactDetailResponse, + TaskArtifactChunkListResponse, + TaskArtifactChunkListSummary, + TaskArtifactChunkRecord, TaskArtifactListResponse, TaskArtifactRecord, + TaskArtifactIngestInput, + TaskArtifactIngestionResponse, TaskArtifactRegisterInput, TaskArtifactStatus, TaskArtifactIngestionStatus, ) -from alicebot_api.store import ContinuityStore, TaskArtifactRow +from alicebot_api.store import ContinuityStore, TaskArtifactChunkRow, TaskArtifactRow from alicebot_api.workspaces import TaskWorkspaceNotFoundError +SUPPORTED_TEXT_ARTIFACT_MEDIA_TYPES = ("text/plain", "text/markdown") +SUPPORTED_TEXT_ARTIFACT_EXTENSIONS = { + ".txt": "text/plain", + ".text": "text/plain", + ".md": "text/markdown", + ".markdown": "text/markdown", +} +TASK_ARTIFACT_CHUNK_MAX_CHARS = 1000 +TASK_ARTIFACT_CHUNKING_RULE = "normalized_utf8_text_fixed_window_1000_chars_v1" + class TaskArtifactNotFoundError(LookupError): """Raised when a task artifact is not visible inside the current user scope.""" @@ -83,6 +99,79 @@ def serialize_task_artifact_row(row: TaskArtifactRow) -> TaskArtifactRecord: } +def serialize_task_artifact_chunk_row(row: TaskArtifactChunkRow) -> TaskArtifactChunkRecord: + return { + "id": str(row["id"]), + "task_artifact_id": str(row["task_artifact_id"]), + "sequence_no": row["sequence_no"], + "char_start": row["char_start"], + "char_end_exclusive": row["char_end_exclusive"], + "text": row["text"], + "created_at": row["created_at"].isoformat(), + "updated_at": row["updated_at"].isoformat(), + } + + +def infer_task_artifact_media_type(row: TaskArtifactRow) -> str | None: + if row["media_type_hint"] is not None: + return row["media_type_hint"] + + artifact_path = Path(row["relative_path"]) + return SUPPORTED_TEXT_ARTIFACT_EXTENSIONS.get(artifact_path.suffix.lower()) + + +def resolve_supported_task_artifact_media_type(row: TaskArtifactRow) -> str: + media_type = infer_task_artifact_media_type(row) + if media_type in SUPPORTED_TEXT_ARTIFACT_MEDIA_TYPES: + return cast(str, media_type) + + supported_types = ", ".join(SUPPORTED_TEXT_ARTIFACT_MEDIA_TYPES) + raise TaskArtifactValidationError( + f"artifact {row['relative_path']} has unsupported media type " + f"{media_type or 'unknown'}; supported types: {supported_types}" + ) + + +def normalize_artifact_text(text: str) -> str: + return text.replace("\r\n", "\n").replace("\r", "\n") + + +def chunk_normalized_artifact_text( + text: str, + *, + chunk_size: int = TASK_ARTIFACT_CHUNK_MAX_CHARS, +) -> list[tuple[int, int, str]]: + chunks: list[tuple[int, int, str]] = [] + for char_start in range(0, len(text), chunk_size): + char_end_exclusive = min(char_start + chunk_size, len(text)) + chunks.append((char_start, char_end_exclusive, text[char_start:char_end_exclusive])) + return chunks + + +def resolve_registered_artifact_path(*, workspace_path: Path, relative_path: str) -> Path: + artifact_path = (workspace_path / relative_path).resolve() + ensure_artifact_path_is_rooted( + workspace_path=workspace_path, + artifact_path=artifact_path, + ) + return artifact_path + + +def build_task_artifact_chunk_list_summary( + chunk_rows: list[TaskArtifactChunkRow], + *, + media_type: str, +) -> TaskArtifactChunkListSummary: + total_characters = sum(len(row["text"]) for row in chunk_rows) + return { + "total_count": len(chunk_rows), + "total_characters": total_characters, + "media_type": media_type, + "chunking_rule": TASK_ARTIFACT_CHUNKING_RULE, + "order": 
list(TASK_ARTIFACT_CHUNK_LIST_ORDER), + } + + def register_task_artifact_record( store: ContinuityStore, *, @@ -171,3 +260,92 @@ def get_task_artifact_record( if row is None: raise TaskArtifactNotFoundError(f"task artifact {task_artifact_id} was not found") return {"artifact": serialize_task_artifact_row(row)} + + +def ingest_task_artifact_record( + store: ContinuityStore, + *, + user_id: UUID, + request: TaskArtifactIngestInput, +) -> TaskArtifactIngestionResponse: + del user_id + + row = store.get_task_artifact_optional(request.task_artifact_id) + if row is None: + raise TaskArtifactNotFoundError(f"task artifact {request.task_artifact_id} was not found") + + store.lock_task_artifact_ingestion(row["id"]) + row = store.get_task_artifact_optional(request.task_artifact_id) + if row is None: + raise TaskArtifactNotFoundError(f"task artifact {request.task_artifact_id} was not found") + + media_type = resolve_supported_task_artifact_media_type(row) + chunk_rows = store.list_task_artifact_chunks(row["id"]) + if row["ingestion_status"] == "ingested": + return { + "artifact": serialize_task_artifact_row(row), + "summary": build_task_artifact_chunk_list_summary(chunk_rows, media_type=media_type), + } + + workspace = store.get_task_workspace_optional(row["task_workspace_id"]) + if workspace is None: + raise TaskWorkspaceNotFoundError( + f"task workspace {row['task_workspace_id']} was not found" + ) + + workspace_path = Path(workspace["local_path"]).expanduser().resolve() + artifact_path = resolve_registered_artifact_path( + workspace_path=workspace_path, + relative_path=row["relative_path"], + ) + _require_existing_file(artifact_path) + + try: + text = artifact_path.read_bytes().decode("utf-8") + except UnicodeDecodeError as exc: + raise TaskArtifactValidationError( + f"artifact {row['relative_path']} is not valid UTF-8 text" + ) from exc + + normalized_text = normalize_artifact_text(text) + for index, (char_start, char_end_exclusive, chunk_text) in enumerate( + chunk_normalized_artifact_text(normalized_text), + start=1, + ): + store.create_task_artifact_chunk( + task_artifact_id=row["id"], + sequence_no=index, + char_start=char_start, + char_end_exclusive=char_end_exclusive, + text=chunk_text, + ) + + artifact_row = store.update_task_artifact_ingestion_status( + task_artifact_id=row["id"], + ingestion_status="ingested", + ) + chunk_rows = store.list_task_artifact_chunks(row["id"]) + return { + "artifact": serialize_task_artifact_row(artifact_row), + "summary": build_task_artifact_chunk_list_summary(chunk_rows, media_type=media_type), + } + + +def list_task_artifact_chunk_records( + store: ContinuityStore, + *, + user_id: UUID, + task_artifact_id: UUID, +) -> TaskArtifactChunkListResponse: + del user_id + + row = store.get_task_artifact_optional(task_artifact_id) + if row is None: + raise TaskArtifactNotFoundError(f"task artifact {task_artifact_id} was not found") + + chunk_rows = store.list_task_artifact_chunks(task_artifact_id) + media_type = infer_task_artifact_media_type(row) or "unknown" + return { + "items": [serialize_task_artifact_chunk_row(chunk_row) for chunk_row in chunk_rows], + "summary": build_task_artifact_chunk_list_summary(chunk_rows, media_type=media_type), + } diff --git a/apps/api/src/alicebot_api/contracts.py b/apps/api/src/alicebot_api/contracts.py index 06113c4..aa68f2e 100644 --- a/apps/api/src/alicebot_api/contracts.py +++ b/apps/api/src/alicebot_api/contracts.py @@ -21,7 +21,7 @@ TaskStatus = Literal["pending_approval", "approved", "executed", "denied", "blocked"] 
TaskWorkspaceStatus = Literal["active"] TaskArtifactStatus = Literal["registered"] -TaskArtifactIngestionStatus = Literal["pending"] +TaskArtifactIngestionStatus = Literal["pending", "ingested"] TaskLifecycleSource = Literal[ "approval_request", "approval_resolution", @@ -132,6 +132,7 @@ TASK_LIST_ORDER = ["created_at_asc", "id_asc"] TASK_WORKSPACE_LIST_ORDER = ["created_at_asc", "id_asc"] TASK_ARTIFACT_LIST_ORDER = ["created_at_asc", "id_asc"] +TASK_ARTIFACT_CHUNK_LIST_ORDER = ["sequence_no_asc", "id_asc"] TASK_STEP_LIST_ORDER = ["sequence_no_asc", "created_at_asc", "id_asc"] TOOL_EXECUTION_LIST_ORDER = ["executed_at_asc", "id_asc"] EXECUTION_BUDGET_LIST_ORDER = ["created_at_asc", "id_asc"] @@ -140,7 +141,7 @@ TASK_STATUSES = ["pending_approval", "approved", "executed", "denied", "blocked"] TASK_WORKSPACE_STATUSES = ["active"] TASK_ARTIFACT_STATUSES = ["registered"] -TASK_ARTIFACT_INGESTION_STATUSES = ["pending"] +TASK_ARTIFACT_INGESTION_STATUSES = ["pending", "ingested"] TASK_STEP_KINDS = ["governed_request"] TASK_STEP_STATUSES = ["created", "approved", "executed", "blocked", "denied"] APPROVAL_REQUEST_VERSION_V0 = "approval_request_v0" @@ -1606,6 +1607,11 @@ class TaskArtifactRegisterInput: media_type_hint: str | None = None +@dataclass(frozen=True, slots=True) +class TaskArtifactIngestInput: + task_artifact_id: UUID + + class TaskArtifactRecord(TypedDict): id: str task_id: str @@ -1636,6 +1642,35 @@ class TaskArtifactDetailResponse(TypedDict): artifact: TaskArtifactRecord +class TaskArtifactChunkRecord(TypedDict): + id: str + task_artifact_id: str + sequence_no: int + char_start: int + char_end_exclusive: int + text: str + created_at: str + updated_at: str + + +class TaskArtifactChunkListSummary(TypedDict): + total_count: int + total_characters: int + media_type: str + chunking_rule: str + order: list[str] + + +class TaskArtifactChunkListResponse(TypedDict): + items: list[TaskArtifactChunkRecord] + summary: TaskArtifactChunkListSummary + + +class TaskArtifactIngestionResponse(TypedDict): + artifact: TaskArtifactRecord + summary: TaskArtifactChunkListSummary + + class TaskStepTraceLink(TypedDict): trace_id: str trace_kind: str diff --git a/apps/api/src/alicebot_api/main.py b/apps/api/src/alicebot_api/main.py index 5bce9bf..ebdf2d3 100644 --- a/apps/api/src/alicebot_api/main.py +++ b/apps/api/src/alicebot_api/main.py @@ -50,6 +50,7 @@ ProxyExecutionStatus, ToolAllowlistEvaluationRequestInput, ProxyExecutionRequestInput, + TaskArtifactIngestInput, TaskArtifactRegisterInput, TaskStepKind, TaskStepLineageInput, @@ -66,6 +67,8 @@ TaskArtifactNotFoundError, TaskArtifactValidationError, get_task_artifact_record, + ingest_task_artifact_record, + list_task_artifact_chunk_records, list_task_artifact_records, register_task_artifact_record, ) @@ -413,6 +416,10 @@ class RegisterTaskArtifactRequest(BaseModel): media_type_hint: str | None = Field(default=None, min_length=1, max_length=200) +class IngestTaskArtifactRequest(BaseModel): + user_id: UUID + + class TaskStepRequestSnapshot(BaseModel): thread_id: UUID tool_id: UUID @@ -1254,6 +1261,53 @@ def get_task_artifact(task_artifact_id: UUID, user_id: UUID) -> JSONResponse: ) +@app.post("/v0/task-artifacts/{task_artifact_id}/ingest") +def ingest_task_artifact( + task_artifact_id: UUID, + request: IngestTaskArtifactRequest, +) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, request.user_id) as conn: + payload = ingest_task_artifact_record( + ContinuityStore(conn), + user_id=request.user_id, + 
request=TaskArtifactIngestInput(task_artifact_id=task_artifact_id), + ) + except TaskArtifactNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + except TaskWorkspaceNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + except TaskArtifactValidationError as exc: + return JSONResponse(status_code=400, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + content=jsonable_encoder(payload), + ) + + +@app.get("/v0/task-artifacts/{task_artifact_id}/chunks") +def list_task_artifact_chunks(task_artifact_id: UUID, user_id: UUID) -> JSONResponse: + settings = get_settings() + + try: + with user_connection(settings.database_url, user_id) as conn: + payload = list_task_artifact_chunk_records( + ContinuityStore(conn), + user_id=user_id, + task_artifact_id=task_artifact_id, + ) + except TaskArtifactNotFoundError as exc: + return JSONResponse(status_code=404, content={"detail": str(exc)}) + + return JSONResponse( + status_code=200, + content=jsonable_encoder(payload), + ) + + @app.post("/v0/tasks/{task_id}/steps") def create_next_task_step(task_id: UUID, request: CreateNextTaskStepRequest) -> JSONResponse: settings = get_settings() diff --git a/apps/api/src/alicebot_api/store.py b/apps/api/src/alicebot_api/store.py index c849a35..13c7cc3 100644 --- a/apps/api/src/alicebot_api/store.py +++ b/apps/api/src/alicebot_api/store.py @@ -258,6 +258,18 @@ class TaskArtifactRow(TypedDict): updated_at: datetime +class TaskArtifactChunkRow(TypedDict): + id: UUID + user_id: UUID + task_artifact_id: UUID + sequence_no: int + char_start: int + char_end_exclusive: int + text: str + created_at: datetime + updated_at: datetime + + class TaskStepRow(TypedDict): id: UUID user_id: UUID @@ -1517,6 +1529,75 @@ class LabelCountRow(TypedDict): ORDER BY created_at ASC, id ASC """ +LOCK_TASK_ARTIFACT_INGESTION_SQL = "SELECT pg_advisory_xact_lock(hashtextextended(%s::text, 5))" + +INSERT_TASK_ARTIFACT_CHUNK_SQL = """ + INSERT INTO task_artifact_chunks ( + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + ) + VALUES ( + app.current_user_id(), + %s, + %s, + %s, + %s, + %s, + clock_timestamp(), + clock_timestamp() + ) + RETURNING + id, + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + """ + +LIST_TASK_ARTIFACT_CHUNKS_SQL = """ + SELECT + id, + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + FROM task_artifact_chunks + WHERE task_artifact_id = %s + ORDER BY sequence_no ASC, id ASC + """ + +UPDATE_TASK_ARTIFACT_INGESTION_STATUS_SQL = """ + UPDATE task_artifacts + SET ingestion_status = %s, + updated_at = clock_timestamp() + WHERE id = %s + RETURNING + id, + user_id, + task_id, + task_workspace_id, + status, + ingestion_status, + relative_path, + media_type_hint, + created_at, + updated_at + """ + INSERT_TASK_STEP_SQL = """ INSERT INTO task_steps ( user_id, @@ -2650,6 +2731,40 @@ def get_task_artifact_by_workspace_relative_path_optional( def list_task_artifacts(self) -> list[TaskArtifactRow]: return self._fetch_all(LIST_TASK_ARTIFACTS_SQL) + def lock_task_artifact_ingestion(self, task_artifact_id: UUID) -> None: + with self.conn.cursor() as cur: + cur.execute(LOCK_TASK_ARTIFACT_INGESTION_SQL, (str(task_artifact_id),)) + + def create_task_artifact_chunk( + self, + *, + task_artifact_id: UUID, + sequence_no: int, + char_start: int, + 
char_end_exclusive: int, + text: str, + ) -> TaskArtifactChunkRow: + return self._fetch_one( + "create_task_artifact_chunk", + INSERT_TASK_ARTIFACT_CHUNK_SQL, + (task_artifact_id, sequence_no, char_start, char_end_exclusive, text), + ) + + def list_task_artifact_chunks(self, task_artifact_id: UUID) -> list[TaskArtifactChunkRow]: + return self._fetch_all(LIST_TASK_ARTIFACT_CHUNKS_SQL, (task_artifact_id,)) + + def update_task_artifact_ingestion_status( + self, + *, + task_artifact_id: UUID, + ingestion_status: str, + ) -> TaskArtifactRow: + return self._fetch_one( + "update_task_artifact_ingestion_status", + UPDATE_TASK_ARTIFACT_INGESTION_STATUS_SQL, + (ingestion_status, task_artifact_id), + ) + def lock_task_steps(self, task_id: UUID) -> None: with self.conn.cursor() as cur: cur.execute(LOCK_TASK_STEPS_SQL, (str(task_id),)) diff --git a/tests/integration/test_migrations.py b/tests/integration/test_migrations.py index 250c270..001f1a0 100644 --- a/tests/integration/test_migrations.py +++ b/tests/integration/test_migrations.py @@ -301,6 +301,8 @@ def test_migrations_upgrade_and_downgrade(database_urls): assert cur.fetchone()[0] == "task_workspaces" cur.execute("SELECT to_regclass('public.task_artifacts')") assert cur.fetchone()[0] == "task_artifacts" + cur.execute("SELECT to_regclass('public.task_artifact_chunks')") + assert cur.fetchone()[0] == "task_artifact_chunks" cur.execute("SELECT to_regclass('public.task_steps')") assert cur.fetchone()[0] == "task_steps" cur.execute( @@ -383,6 +385,7 @@ def test_migrations_upgrade_and_downgrade(database_urls): 'tasks', 'task_workspaces', 'task_artifacts', + 'task_artifact_chunks', 'task_steps', 'execution_budgets', 'tool_executions' @@ -404,6 +407,7 @@ def test_migrations_upgrade_and_downgrade(database_urls): ("memory_revisions", True, True), ("policies", True, True), ("sessions", True, True), + ("task_artifact_chunks", True, True), ("task_artifacts", True, True), ("task_steps", True, True), ("task_workspaces", True, True), @@ -473,6 +477,8 @@ def test_migrations_upgrade_and_downgrade(database_urls): has_table_privilege('alicebot_app', 'task_workspaces', 'DELETE'), has_table_privilege('alicebot_app', 'task_artifacts', 'UPDATE'), has_table_privilege('alicebot_app', 'task_artifacts', 'DELETE'), + has_table_privilege('alicebot_app', 'task_artifact_chunks', 'UPDATE'), + has_table_privilege('alicebot_app', 'task_artifact_chunks', 'DELETE'), has_table_privilege('alicebot_app', 'task_steps', 'UPDATE'), has_table_privilege('alicebot_app', 'task_steps', 'DELETE'), has_table_privilege('alicebot_app', 'execution_budgets', 'UPDATE'), @@ -510,6 +516,8 @@ def test_migrations_upgrade_and_downgrade(database_urls): False, False, False, + True, + False, False, False, True, @@ -524,6 +532,8 @@ def test_migrations_upgrade_and_downgrade(database_urls): with psycopg.connect(database_urls["admin"]) as conn: with conn.cursor() as cur: + cur.execute("SELECT to_regclass('public.task_artifact_chunks')") + assert cur.fetchone()[0] is None cur.execute("SELECT to_regclass('public.task_artifacts')") assert cur.fetchone()[0] is None cur.execute("SELECT to_regclass('public.task_workspaces')") diff --git a/tests/integration/test_task_artifacts_api.py b/tests/integration/test_task_artifacts_api.py index ff78f47..cf9626b 100644 --- a/tests/integration/test_task_artifacts_api.py +++ b/tests/integration/test_task_artifacts_api.py @@ -7,6 +7,7 @@ from uuid import UUID, uuid4 import anyio +import psycopg import apps.api.src.alicebot_api.main as main_module from 
apps.api.src.alicebot_api.config import Settings @@ -291,3 +292,374 @@ def test_task_artifact_endpoints_register_list_detail_isolate_and_reject_duplica assert isolated_create_payload == { "detail": f"task workspace {workspace_payload['workspace']['id']} was not found" } + + +def test_task_artifact_ingestion_and_chunk_endpoints_are_deterministic_and_isolated( + migrated_database_urls, + monkeypatch, + tmp_path, +) -> None: + owner = seed_task(migrated_database_urls["app"], email="owner@example.com") + intruder = seed_task(migrated_database_urls["app"], email="intruder@example.com") + workspace_root = tmp_path / "task-workspaces" + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings( + database_url=migrated_database_urls["app"], + task_workspace_root=str(workspace_root), + ), + ) + + workspace_status, workspace_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/workspace", + payload={"user_id": str(owner["user_id"])}, + ) + assert workspace_status == 201 + + workspace_path = Path(workspace_payload["workspace"]["local_path"]) + supported_file = workspace_path / "docs" / "spec.txt" + supported_file.parent.mkdir(parents=True) + supported_file.write_text(("A" * 998) + "\r\n" + ("B" * 5) + "\rC") + unsupported_file = workspace_path / "docs" / "manual.pdf" + unsupported_file.write_text("not really a pdf") + + register_status, register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(supported_file), + "media_type_hint": "text/plain", + }, + ) + assert register_status == 201 + + ingest_status, ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + chunk_list_status, chunk_list_payload = invoke_request( + "GET", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/chunks", + query_params={"user_id": str(owner["user_id"])}, + ) + isolated_chunk_list_status, isolated_chunk_list_payload = invoke_request( + "GET", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/chunks", + query_params={"user_id": str(intruder["user_id"])}, + ) + isolated_ingest_status, isolated_ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/ingest", + payload={"user_id": str(intruder["user_id"])}, + ) + + unsupported_register_status, unsupported_register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(unsupported_file), + "media_type_hint": "application/pdf", + }, + ) + assert unsupported_register_status == 201 + unsupported_ingest_status, unsupported_ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{unsupported_register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + + assert ingest_status == 200 + assert ingest_payload == { + "artifact": { + "id": register_payload["artifact"]["id"], + "task_id": str(owner["task_id"]), + "task_workspace_id": workspace_payload["workspace"]["id"], + "status": "registered", + "ingestion_status": "ingested", + "relative_path": "docs/spec.txt", + "media_type_hint": "text/plain", + "created_at": register_payload["artifact"]["created_at"], + "updated_at": ingest_payload["artifact"]["updated_at"], + }, + "summary": { + "total_count": 2, + "total_characters": 1006, + "media_type": 
"text/plain", + "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", + "order": ["sequence_no_asc", "id_asc"], + }, + } + + assert chunk_list_status == 200 + assert chunk_list_payload == { + "items": [ + { + "id": chunk_list_payload["items"][0]["id"], + "task_artifact_id": register_payload["artifact"]["id"], + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 1000, + "text": ("A" * 998) + "\n" + "B", + "created_at": chunk_list_payload["items"][0]["created_at"], + "updated_at": chunk_list_payload["items"][0]["updated_at"], + }, + { + "id": chunk_list_payload["items"][1]["id"], + "task_artifact_id": register_payload["artifact"]["id"], + "sequence_no": 2, + "char_start": 1000, + "char_end_exclusive": 1006, + "text": "BBBB\nC", + "created_at": chunk_list_payload["items"][1]["created_at"], + "updated_at": chunk_list_payload["items"][1]["updated_at"], + }, + ], + "summary": { + "total_count": 2, + "total_characters": 1006, + "media_type": "text/plain", + "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", + "order": ["sequence_no_asc", "id_asc"], + }, + } + + assert isolated_chunk_list_status == 404 + assert isolated_chunk_list_payload == { + "detail": f"task artifact {register_payload['artifact']['id']} was not found" + } + + assert isolated_ingest_status == 404 + assert isolated_ingest_payload == { + "detail": f"task artifact {register_payload['artifact']['id']} was not found" + } + + assert unsupported_ingest_status == 400 + assert unsupported_ingest_payload == { + "detail": ( + "artifact docs/manual.pdf has unsupported media type application/pdf; " + "supported types: text/plain, text/markdown" + ) + } + + +def test_task_artifact_ingestion_supports_markdown_and_reingest_is_idempotent( + migrated_database_urls, + monkeypatch, + tmp_path, +) -> None: + owner = seed_task(migrated_database_urls["app"], email="owner@example.com") + workspace_root = tmp_path / "task-workspaces" + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings( + database_url=migrated_database_urls["app"], + task_workspace_root=str(workspace_root), + ), + ) + + workspace_status, workspace_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/workspace", + payload={"user_id": str(owner["user_id"])}, + ) + assert workspace_status == 201 + + workspace_path = Path(workspace_payload["workspace"]["local_path"]) + markdown_file = workspace_path / "notes" / "plan.md" + markdown_file.parent.mkdir(parents=True) + markdown_file.write_text("# Plan\r\n\r\n- Ship ingestion\n- Keep scope narrow\r") + + register_status, register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(markdown_file), + "media_type_hint": "text/markdown", + }, + ) + assert register_status == 201 + + first_ingest_status, first_ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + second_ingest_status, second_ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + chunk_list_status, chunk_list_payload = invoke_request( + "GET", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/chunks", + query_params={"user_id": str(owner["user_id"])}, + ) + + assert first_ingest_status == 200 + assert first_ingest_payload == { + "artifact": { + "id": 
register_payload["artifact"]["id"], + "task_id": str(owner["task_id"]), + "task_workspace_id": workspace_payload["workspace"]["id"], + "status": "registered", + "ingestion_status": "ingested", + "relative_path": "notes/plan.md", + "media_type_hint": "text/markdown", + "created_at": register_payload["artifact"]["created_at"], + "updated_at": first_ingest_payload["artifact"]["updated_at"], + }, + "summary": { + "total_count": 1, + "total_characters": 45, + "media_type": "text/markdown", + "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", + "order": ["sequence_no_asc", "id_asc"], + }, + } + assert second_ingest_status == 200 + assert second_ingest_payload == first_ingest_payload + assert chunk_list_status == 200 + assert chunk_list_payload == { + "items": [ + { + "id": chunk_list_payload["items"][0]["id"], + "task_artifact_id": register_payload["artifact"]["id"], + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 45, + "text": "# Plan\n\n- Ship ingestion\n- Keep scope narrow\n", + "created_at": chunk_list_payload["items"][0]["created_at"], + "updated_at": chunk_list_payload["items"][0]["updated_at"], + } + ], + "summary": { + "total_count": 1, + "total_characters": 45, + "media_type": "text/markdown", + "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", + "order": ["sequence_no_asc", "id_asc"], + }, + } + + +def test_task_artifact_ingestion_rejects_invalid_utf8_content( + migrated_database_urls, + monkeypatch, + tmp_path, +) -> None: + owner = seed_task(migrated_database_urls["app"], email="owner@example.com") + workspace_root = tmp_path / "task-workspaces" + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings( + database_url=migrated_database_urls["app"], + task_workspace_root=str(workspace_root), + ), + ) + + workspace_status, workspace_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/workspace", + payload={"user_id": str(owner["user_id"])}, + ) + assert workspace_status == 201 + + workspace_path = Path(workspace_payload["workspace"]["local_path"]) + broken_file = workspace_path / "docs" / "broken.txt" + broken_file.parent.mkdir(parents=True) + broken_file.write_bytes(b"\xff\xfe\xfd") + + register_status, register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(broken_file), + "media_type_hint": "text/plain", + }, + ) + assert register_status == 201 + + ingest_status, ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + + assert ingest_status == 400 + assert ingest_payload == { + "detail": "artifact docs/broken.txt is not valid UTF-8 text" + } + + +def test_task_artifact_ingestion_enforces_rooted_workspace_paths( + migrated_database_urls, + monkeypatch, + tmp_path, +) -> None: + owner = seed_task(migrated_database_urls["app"], email="owner@example.com") + workspace_root = tmp_path / "task-workspaces" + monkeypatch.setattr( + main_module, + "get_settings", + lambda: Settings( + database_url=migrated_database_urls["app"], + task_workspace_root=str(workspace_root), + ), + ) + + workspace_status, workspace_payload = invoke_request( + "POST", + f"/v0/tasks/{owner['task_id']}/workspace", + payload={"user_id": str(owner["user_id"])}, + ) + assert workspace_status == 201 + + workspace_path = Path(workspace_payload["workspace"]["local_path"]) + safe_file = workspace_path / 
"docs" / "spec.txt" + safe_file.parent.mkdir(parents=True) + safe_file.write_text("spec") + outside_file = tmp_path / "escape.txt" + outside_file.write_text("escape") + + register_status, register_payload = invoke_request( + "POST", + f"/v0/task-workspaces/{workspace_payload['workspace']['id']}/artifacts", + payload={ + "user_id": str(owner["user_id"]), + "local_path": str(safe_file), + "media_type_hint": "text/plain", + }, + ) + assert register_status == 201 + + with psycopg.connect(migrated_database_urls["admin"]) as conn: + with conn.cursor() as cur: + cur.execute( + """ + UPDATE task_artifacts + SET relative_path = '../../../escape.txt' + WHERE id = %s + """, + (register_payload["artifact"]["id"],), + ) + conn.commit() + + ingest_status, ingest_payload = invoke_request( + "POST", + f"/v0/task-artifacts/{register_payload['artifact']['id']}/ingest", + payload={"user_id": str(owner["user_id"])}, + ) + + assert ingest_status == 400 + assert ingest_payload == { + "detail": f"artifact path {outside_file.resolve()} escapes workspace root {workspace_path.resolve()}" + } diff --git a/tests/unit/test_20260314_0024_task_artifact_chunks.py b/tests/unit/test_20260314_0024_task_artifact_chunks.py new file mode 100644 index 0000000..5e9e6da --- /dev/null +++ b/tests/unit/test_20260314_0024_task_artifact_chunks.py @@ -0,0 +1,48 @@ +from __future__ import annotations + +import importlib + + +MODULE_NAME = "apps.api.alembic.versions.20260314_0024_task_artifact_chunks" + + +def load_migration_module(): + return importlib.import_module(MODULE_NAME) + + +def test_upgrade_executes_expected_statements_in_order(monkeypatch) -> None: + module = load_migration_module() + executed: list[str] = [] + + monkeypatch.setattr(module.op, "execute", executed.append) + + module.upgrade() + + assert executed == [ + *module._UPGRADE_TASK_ARTIFACTS_STATEMENTS, + module._UPGRADE_SCHEMA_STATEMENT, + *module._UPGRADE_GRANT_STATEMENTS, + "ALTER TABLE task_artifact_chunks ENABLE ROW LEVEL SECURITY", + "ALTER TABLE task_artifact_chunks FORCE ROW LEVEL SECURITY", + module._UPGRADE_POLICY_STATEMENT, + ] + + +def test_downgrade_executes_expected_statements_in_order(monkeypatch) -> None: + module = load_migration_module() + executed: list[str] = [] + + monkeypatch.setattr(module.op, "execute", executed.append) + + module.downgrade() + + assert executed == list(module._DOWNGRADE_STATEMENTS) + + +def test_task_artifact_chunk_privileges_allow_only_expected_runtime_writes() -> None: + module = load_migration_module() + + assert module._UPGRADE_GRANT_STATEMENTS == ( + "GRANT UPDATE ON task_artifacts TO alicebot_app", + "GRANT SELECT, INSERT ON task_artifact_chunks TO alicebot_app", + ) diff --git a/tests/unit/test_artifacts.py b/tests/unit/test_artifacts.py index 33d6fb7..e6ed44c 100644 --- a/tests/unit/test_artifacts.py +++ b/tests/unit/test_artifacts.py @@ -7,17 +7,22 @@ import pytest from alicebot_api.artifacts import ( + TASK_ARTIFACT_CHUNKING_RULE, TaskArtifactAlreadyExistsError, TaskArtifactNotFoundError, TaskArtifactValidationError, build_workspace_relative_artifact_path, + chunk_normalized_artifact_text, ensure_artifact_path_is_rooted, get_task_artifact_record, + ingest_task_artifact_record, + list_task_artifact_chunk_records, list_task_artifact_records, + normalize_artifact_text, register_task_artifact_record, serialize_task_artifact_row, ) -from alicebot_api.contracts import TaskArtifactRegisterInput +from alicebot_api.contracts import TaskArtifactIngestInput, TaskArtifactRegisterInput from alicebot_api.workspaces import 
TaskWorkspaceNotFoundError @@ -26,7 +31,9 @@ def __init__(self) -> None: self.base_time = datetime(2026, 3, 13, 10, 0, tzinfo=UTC) self.workspaces: list[dict[str, object]] = [] self.artifacts: list[dict[str, object]] = [] + self.artifact_chunks: list[dict[str, object]] = [] self.locked_workspace_ids: list[UUID] = [] + self.locked_artifact_ids: list[UUID] = [] def create_task_workspace(self, *, task_workspace_id: UUID, task_id: UUID, user_id: UUID, local_path: str) -> dict[str, object]: workspace = { @@ -94,6 +101,54 @@ def list_task_artifacts(self) -> list[dict[str, object]]: def get_task_artifact_optional(self, task_artifact_id: UUID) -> dict[str, object] | None: return next((artifact for artifact in self.artifacts if artifact["id"] == task_artifact_id), None) + def lock_task_artifact_ingestion(self, task_artifact_id: UUID) -> None: + self.locked_artifact_ids.append(task_artifact_id) + + def create_task_artifact_chunk( + self, + *, + task_artifact_id: UUID, + sequence_no: int, + char_start: int, + char_end_exclusive: int, + text: str, + ) -> dict[str, object]: + chunk = { + "id": uuid4(), + "user_id": self.workspaces[0]["user_id"], + "task_artifact_id": task_artifact_id, + "sequence_no": sequence_no, + "char_start": char_start, + "char_end_exclusive": char_end_exclusive, + "text": text, + "created_at": self.base_time + timedelta(seconds=len(self.artifact_chunks)), + "updated_at": self.base_time + timedelta(seconds=len(self.artifact_chunks)), + } + self.artifact_chunks.append(chunk) + return chunk + + def list_task_artifact_chunks(self, task_artifact_id: UUID) -> list[dict[str, object]]: + return sorted( + ( + chunk + for chunk in self.artifact_chunks + if chunk["task_artifact_id"] == task_artifact_id + ), + key=lambda chunk: (chunk["sequence_no"], chunk["id"]), + ) + + def update_task_artifact_ingestion_status( + self, + *, + task_artifact_id: UUID, + ingestion_status: str, + ) -> dict[str, object]: + artifact = self.get_task_artifact_optional(task_artifact_id) + assert artifact is not None + artifact["ingestion_status"] = ingestion_status + artifact["updated_at"] = self.base_time + timedelta(minutes=30) + return artifact + def test_ensure_artifact_path_is_rooted_rejects_escape() -> None: with pytest.raises(TaskArtifactValidationError, match="escapes workspace root"): @@ -241,6 +296,380 @@ def test_register_task_artifact_record_rejects_paths_outside_workspace(tmp_path) ) +def test_normalize_and_chunk_artifact_text_are_deterministic() -> None: + normalized = normalize_artifact_text("ab\r\ncd\ref") + + assert normalized == "ab\ncd\nef" + assert chunk_normalized_artifact_text(normalized, chunk_size=4) == [ + (0, 4, "ab\nc"), + (4, 8, "d\nef"), + ] + + +def test_ingest_task_artifact_record_persists_deterministic_chunks(tmp_path) -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + workspace_path = tmp_path / "workspaces" / str(user_id) / str(task_id) + workspace_path.mkdir(parents=True) + artifact_path = workspace_path / "docs" / "spec.txt" + artifact_path.parent.mkdir(parents=True) + artifact_path.write_text(("A" * 998) + "\r\n" + ("B" * 5) + "\rC") + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path=str(workspace_path), + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + + response = 
ingest_task_artifact_record( + store, + user_id=user_id, + request=TaskArtifactIngestInput(task_artifact_id=artifact["id"]), + ) + + assert response == { + "artifact": { + "id": str(artifact["id"]), + "task_id": str(task_id), + "task_workspace_id": str(task_workspace_id), + "status": "registered", + "ingestion_status": "ingested", + "relative_path": "docs/spec.txt", + "media_type_hint": "text/plain", + "created_at": "2026-03-13T10:00:00+00:00", + "updated_at": "2026-03-13T10:30:00+00:00", + }, + "summary": { + "total_count": 2, + "total_characters": 1006, + "media_type": "text/plain", + "chunking_rule": TASK_ARTIFACT_CHUNKING_RULE, + "order": ["sequence_no_asc", "id_asc"], + }, + } + assert store.locked_artifact_ids == [artifact["id"]] + assert store.list_task_artifact_chunks(artifact["id"]) == [ + { + "id": store.artifact_chunks[0]["id"], + "user_id": user_id, + "task_artifact_id": artifact["id"], + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 1000, + "text": ("A" * 998) + "\n" + "B", + "created_at": datetime(2026, 3, 13, 10, 0, tzinfo=UTC), + "updated_at": datetime(2026, 3, 13, 10, 0, tzinfo=UTC), + }, + { + "id": store.artifact_chunks[1]["id"], + "user_id": user_id, + "task_artifact_id": artifact["id"], + "sequence_no": 2, + "char_start": 1000, + "char_end_exclusive": 1006, + "text": "BBBB\nC", + "created_at": datetime(2026, 3, 13, 10, 0, 1, tzinfo=UTC), + "updated_at": datetime(2026, 3, 13, 10, 0, 1, tzinfo=UTC), + }, + ] + + +def test_ingest_task_artifact_record_supports_markdown(tmp_path) -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + workspace_path = tmp_path / "workspaces" / str(user_id) / str(task_id) + workspace_path.mkdir(parents=True) + artifact_path = workspace_path / "notes" / "plan.md" + artifact_path.parent.mkdir(parents=True) + artifact_path.write_text("# Plan\r\n\r\n- Ship ingestion\n- Keep scope narrow\r") + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path=str(workspace_path), + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="notes/plan.md", + media_type_hint="text/markdown", + ) + + response = ingest_task_artifact_record( + store, + user_id=user_id, + request=TaskArtifactIngestInput(task_artifact_id=artifact["id"]), + ) + + assert response["artifact"]["ingestion_status"] == "ingested" + assert response["summary"] == { + "total_count": 1, + "total_characters": 45, + "media_type": "text/markdown", + "chunking_rule": TASK_ARTIFACT_CHUNKING_RULE, + "order": ["sequence_no_asc", "id_asc"], + } + assert store.list_task_artifact_chunks(artifact["id"]) == [ + { + "id": store.artifact_chunks[0]["id"], + "user_id": user_id, + "task_artifact_id": artifact["id"], + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 45, + "text": "# Plan\n\n- Ship ingestion\n- Keep scope narrow\n", + "created_at": datetime(2026, 3, 13, 10, 0, tzinfo=UTC), + "updated_at": datetime(2026, 3, 13, 10, 0, tzinfo=UTC), + } + ] + + +def test_ingest_task_artifact_record_is_idempotent_for_already_ingested_artifact() -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path="/tmp/alicebot/task-workspaces/user/task", + ) + artifact = store.create_task_artifact( + 
task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=4, + text="spec", + ) + + response = ingest_task_artifact_record( + store, + user_id=user_id, + request=TaskArtifactIngestInput(task_artifact_id=artifact["id"]), + ) + + assert response == { + "artifact": { + "id": str(artifact["id"]), + "task_id": str(task_id), + "task_workspace_id": str(task_workspace_id), + "status": "registered", + "ingestion_status": "ingested", + "relative_path": "docs/spec.txt", + "media_type_hint": "text/plain", + "created_at": "2026-03-13T10:00:00+00:00", + "updated_at": "2026-03-13T10:00:00+00:00", + }, + "summary": { + "total_count": 1, + "total_characters": 4, + "media_type": "text/plain", + "chunking_rule": TASK_ARTIFACT_CHUNKING_RULE, + "order": ["sequence_no_asc", "id_asc"], + }, + } + assert store.locked_artifact_ids == [artifact["id"]] + assert len(store.artifact_chunks) == 1 + + +def test_ingest_task_artifact_record_rejects_unsupported_media_type(tmp_path) -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + workspace_path = tmp_path / "workspaces" / str(user_id) / str(task_id) + workspace_path.mkdir(parents=True) + artifact_path = workspace_path / "docs" / "spec.pdf" + artifact_path.parent.mkdir(parents=True) + artifact_path.write_text("not really a pdf") + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path=str(workspace_path), + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="docs/spec.pdf", + media_type_hint="application/pdf", + ) + + with pytest.raises( + TaskArtifactValidationError, + match="artifact docs/spec.pdf has unsupported media type application/pdf", + ): + ingest_task_artifact_record( + store, + user_id=user_id, + request=TaskArtifactIngestInput(task_artifact_id=artifact["id"]), + ) + + +def test_ingest_task_artifact_record_rejects_invalid_utf8_content(tmp_path) -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + workspace_path = tmp_path / "workspaces" / str(user_id) / str(task_id) + workspace_path.mkdir(parents=True) + artifact_path = workspace_path / "docs" / "broken.txt" + artifact_path.parent.mkdir(parents=True) + artifact_path.write_bytes(b"\xff\xfe\xfd") + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path=str(workspace_path), + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="docs/broken.txt", + media_type_hint="text/plain", + ) + + with pytest.raises( + TaskArtifactValidationError, + match="artifact docs/broken.txt is not valid UTF-8 text", + ): + ingest_task_artifact_record( + store, + user_id=user_id, + request=TaskArtifactIngestInput(task_artifact_id=artifact["id"]), + ) + + +def test_ingest_task_artifact_record_rejects_paths_outside_workspace(tmp_path) -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + workspace_path = tmp_path / "workspaces" / str(user_id) / str(task_id) + 
workspace_path.mkdir(parents=True) + outside_path = tmp_path / "escape.txt" + outside_path.write_text("escape") + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path=str(workspace_path), + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="pending", + relative_path="../escape.txt", + media_type_hint="text/plain", + ) + + with pytest.raises(TaskArtifactValidationError, match="escapes workspace root"): + ingest_task_artifact_record( + store, + user_id=user_id, + request=TaskArtifactIngestInput(task_artifact_id=artifact["id"]), + ) + + +def test_list_task_artifact_chunk_records_are_deterministic() -> None: + store = ArtifactStoreStub() + user_id = uuid4() + task_id = uuid4() + task_workspace_id = uuid4() + store.create_task_workspace( + task_workspace_id=task_workspace_id, + task_id=task_id, + user_id=user_id, + local_path="/tmp/alicebot/task-workspaces/user/task", + ) + artifact = store.create_task_artifact( + task_id=task_id, + task_workspace_id=task_workspace_id, + status="registered", + ingestion_status="ingested", + relative_path="docs/spec.txt", + media_type_hint="text/plain", + ) + store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=1, + char_start=0, + char_end_exclusive=4, + text="spec", + ) + store.create_task_artifact_chunk( + task_artifact_id=artifact["id"], + sequence_no=2, + char_start=4, + char_end_exclusive=8, + text="plan", + ) + + assert list_task_artifact_chunk_records( + store, + user_id=user_id, + task_artifact_id=artifact["id"], + ) == { + "items": [ + { + "id": str(store.artifact_chunks[0]["id"]), + "task_artifact_id": str(artifact["id"]), + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 4, + "text": "spec", + "created_at": "2026-03-13T10:00:00+00:00", + "updated_at": "2026-03-13T10:00:00+00:00", + }, + { + "id": str(store.artifact_chunks[1]["id"]), + "task_artifact_id": str(artifact["id"]), + "sequence_no": 2, + "char_start": 4, + "char_end_exclusive": 8, + "text": "plan", + "created_at": "2026-03-13T10:00:01+00:00", + "updated_at": "2026-03-13T10:00:01+00:00", + }, + ], + "summary": { + "total_count": 2, + "total_characters": 8, + "media_type": "text/plain", + "chunking_rule": TASK_ARTIFACT_CHUNKING_RULE, + "order": ["sequence_no_asc", "id_asc"], + }, + } + + def test_list_and_get_task_artifact_records_are_deterministic() -> None: store = ArtifactStoreStub() user_id = uuid4() diff --git a/tests/unit/test_artifacts_main.py b/tests/unit/test_artifacts_main.py index a791655..9e6e1b7 100644 --- a/tests/unit/test_artifacts_main.py +++ b/tests/unit/test_artifacts_main.py @@ -64,6 +64,47 @@ def fake_get_task_artifact_record(*_args, **_kwargs): assert json.loads(response.body) == {"detail": f"task artifact {task_artifact_id} was not found"} +def test_list_task_artifact_chunks_endpoint_returns_payload(monkeypatch) -> None: + user_id = uuid4() + task_artifact_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(*_args, **_kwargs): + yield object() + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr( + main_module, + "list_task_artifact_chunk_records", + lambda *_args, **_kwargs: { + "items": [], + "summary": { + "total_count": 0, + "total_characters": 0, + "media_type": "text/plain", + 
"chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", + "order": ["sequence_no_asc", "id_asc"], + }, + }, + ) + + response = main_module.list_task_artifact_chunks(task_artifact_id, user_id) + + assert response.status_code == 200 + assert json.loads(response.body) == { + "items": [], + "summary": { + "total_count": 0, + "total_characters": 0, + "media_type": "text/plain", + "chunking_rule": "normalized_utf8_text_fixed_window_1000_chars_v1", + "order": ["sequence_no_asc", "id_asc"], + }, + } + + def test_register_task_artifact_endpoint_maps_workspace_not_found_to_404(monkeypatch) -> None: user_id = uuid4() task_workspace_id = uuid4() @@ -152,3 +193,61 @@ def fake_register_task_artifact_record(*_args, **_kwargs): assert json.loads(response.body) == { "detail": f"artifact docs/spec.txt is already registered for task workspace {task_workspace_id}" } + + +def test_ingest_task_artifact_endpoint_maps_validation_to_400(monkeypatch) -> None: + user_id = uuid4() + task_artifact_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(*_args, **_kwargs): + yield object() + + def fake_ingest_task_artifact_record(*_args, **_kwargs): + raise TaskArtifactValidationError( + "artifact docs/spec.txt has unsupported media type application/pdf; " + "supported types: text/plain, text/markdown" + ) + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr(main_module, "ingest_task_artifact_record", fake_ingest_task_artifact_record) + + response = main_module.ingest_task_artifact( + task_artifact_id, + main_module.IngestTaskArtifactRequest(user_id=user_id), + ) + + assert response.status_code == 400 + assert json.loads(response.body) == { + "detail": ( + "artifact docs/spec.txt has unsupported media type application/pdf; " + "supported types: text/plain, text/markdown" + ) + } + + +def test_ingest_task_artifact_endpoint_maps_not_found_to_404(monkeypatch) -> None: + user_id = uuid4() + task_artifact_id = uuid4() + settings = Settings(database_url="postgresql://app") + + @contextmanager + def fake_user_connection(*_args, **_kwargs): + yield object() + + def fake_ingest_task_artifact_record(*_args, **_kwargs): + raise TaskArtifactNotFoundError(f"task artifact {task_artifact_id} was not found") + + monkeypatch.setattr(main_module, "get_settings", lambda: settings) + monkeypatch.setattr(main_module, "user_connection", fake_user_connection) + monkeypatch.setattr(main_module, "ingest_task_artifact_record", fake_ingest_task_artifact_record) + + response = main_module.ingest_task_artifact( + task_artifact_id, + main_module.IngestTaskArtifactRequest(user_id=user_id), + ) + + assert response.status_code == 404 + assert json.loads(response.body) == {"detail": f"task artifact {task_artifact_id} was not found"} diff --git a/tests/unit/test_main.py b/tests/unit/test_main.py index 20b7a00..446f108 100644 --- a/tests/unit/test_main.py +++ b/tests/unit/test_main.py @@ -129,6 +129,8 @@ def test_healthcheck_route_is_registered() -> None: assert "/v0/task-workspaces/{task_workspace_id}/artifacts" in route_paths assert "/v0/task-artifacts" in route_paths assert "/v0/task-artifacts/{task_artifact_id}" in route_paths + assert "/v0/task-artifacts/{task_artifact_id}/ingest" in route_paths + assert "/v0/task-artifacts/{task_artifact_id}/chunks" in route_paths assert "/v0/task-steps/{task_step_id}" in route_paths assert 
"/v0/task-steps/{task_step_id}/transition" in route_paths assert "/v0/entities/{entity_id}" in route_paths diff --git a/tests/unit/test_task_artifact_store.py b/tests/unit/test_task_artifact_store.py index c6f6277..df841c0 100644 --- a/tests/unit/test_task_artifact_store.py +++ b/tests/unit/test_task_artifact_store.py @@ -226,3 +226,145 @@ def test_task_artifact_store_methods_use_expected_queries() -> None: (str(task_workspace_id),), ), ] + + +def test_task_artifact_chunk_store_methods_use_expected_queries() -> None: + task_artifact_id = uuid4() + cursor = RecordingCursor( + fetchone_results=[ + { + "id": uuid4(), + "user_id": uuid4(), + "task_artifact_id": task_artifact_id, + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 4, + "text": "spec", + "created_at": "2026-03-14T10:00:00+00:00", + "updated_at": "2026-03-14T10:00:00+00:00", + }, + { + "id": task_artifact_id, + "user_id": uuid4(), + "task_id": uuid4(), + "task_workspace_id": uuid4(), + "status": "registered", + "ingestion_status": "ingested", + "relative_path": "docs/spec.txt", + "media_type_hint": "text/plain", + "created_at": "2026-03-14T10:00:00+00:00", + "updated_at": "2026-03-14T10:01:00+00:00", + }, + ], + fetchall_result=[ + { + "id": uuid4(), + "user_id": uuid4(), + "task_artifact_id": task_artifact_id, + "sequence_no": 1, + "char_start": 0, + "char_end_exclusive": 4, + "text": "spec", + "created_at": "2026-03-14T10:00:00+00:00", + "updated_at": "2026-03-14T10:00:00+00:00", + } + ], + ) + store = ContinuityStore(RecordingConnection(cursor)) + + created = store.create_task_artifact_chunk( + task_artifact_id=task_artifact_id, + sequence_no=1, + char_start=0, + char_end_exclusive=4, + text="spec", + ) + updated = store.update_task_artifact_ingestion_status( + task_artifact_id=task_artifact_id, + ingestion_status="ingested", + ) + listed = store.list_task_artifact_chunks(task_artifact_id) + store.lock_task_artifact_ingestion(task_artifact_id) + + assert created["task_artifact_id"] == task_artifact_id + assert updated["ingestion_status"] == "ingested" + assert listed[0]["task_artifact_id"] == task_artifact_id + assert cursor.executed == [ + ( + """ + INSERT INTO task_artifact_chunks ( + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + ) + VALUES ( + app.current_user_id(), + %s, + %s, + %s, + %s, + %s, + clock_timestamp(), + clock_timestamp() + ) + RETURNING + id, + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + """, + (task_artifact_id, 1, 0, 4, "spec"), + ), + ( + """ + UPDATE task_artifacts + SET ingestion_status = %s, + updated_at = clock_timestamp() + WHERE id = %s + RETURNING + id, + user_id, + task_id, + task_workspace_id, + status, + ingestion_status, + relative_path, + media_type_hint, + created_at, + updated_at + """, + ("ingested", task_artifact_id), + ), + ( + """ + SELECT + id, + user_id, + task_artifact_id, + sequence_no, + char_start, + char_end_exclusive, + text, + created_at, + updated_at + FROM task_artifact_chunks + WHERE task_artifact_id = %s + ORDER BY sequence_no ASC, id ASC + """, + (task_artifact_id,), + ), + ( + "SELECT pg_advisory_xact_lock(hashtextextended(%s::text, 5))", + (str(task_artifact_id),), + ), + ]