91 changes: 48 additions & 43 deletions .ai/active/SPRINT_PACKET.md
@@ -2,23 +2,23 @@

## Sprint Title

Sprint 5M: DOCX Artifact Parsing V0
Sprint 5N: RFC822 Email Artifact Parsing V0

## Sprint Type

feature

## Sprint Reason

Sprint 5L proved the richer-document-parsing seam can widen safely without changing the rooted workspace, durable chunk, retrieval, or compile contracts. The next safe slice is DOCX ingestion only, not broader PDF compatibility, OCR, connectors, or UI.
Sprint 5L and Sprint 5M proved the richer-document-parsing seam can widen safely without changing the rooted workspace, durable chunk, retrieval, or compile contracts. The next safe slice is RFC822 email ingestion only, which prepares the path for later read-only Gmail work without opening live connector, auth, or UI scope yet.

## Sprint Intent

Extend the existing artifact-ingestion seam so registered DOCX artifacts can be ingested into the existing durable `task_artifact_chunks` substrate through deterministic local text extraction, without changing retrieval contracts, compile contracts, connectors, or UI.
Extend the existing artifact-ingestion seam so registered RFC822 email artifacts can be ingested into the existing durable `task_artifact_chunks` substrate through deterministic local parsing of message headers and text bodies, without changing retrieval contracts, compile contracts, live connector scope, or UI.

## Git Instructions

- Branch Name: `codex/sprint-5m-docx-artifact-parsing-v0`
- Branch Name: `codex/sprint-5n-rfc822-email-artifact-parsing-v0`
- Base Branch: `main`
- PR Strategy: one sprint branch, one PR, no stacked PRs unless Control Tower explicitly opens a follow-up sprint
- Merge Policy: squash merge only after reviewer `PASS` and explicit Control Tower merge approval
@@ -29,100 +29,105 @@ Extend the existing artifact-ingestion seam so registered DOCX artifacts can be
- Sprint 5C shipped explicit task-artifact registration.
- Sprint 5D shipped deterministic local text-artifact ingestion into durable chunk rows.
- Sprint 5E through 5J shipped lexical retrieval, semantic retrieval, and hybrid compile-path artifact retrieval on top of those persisted chunk rows.
- Sprint 5L extended the same ingestion seam to narrow PDF text extraction without changing retrieval or compile contracts.
- The next narrow richer-document move is a separate DOCX ingestion seam, which increases format coverage without widening into OCR, connector, or UI scope.
- Sprint 5L extended the same ingestion seam to narrow PDF text extraction.
- Sprint 5M extended the same ingestion seam to narrow DOCX text extraction.
- The next narrow richer-document move is RFC822 email parsing, which advances the Gmail-adjacent path while still staying on the existing rooted artifact and chunk substrate instead of opening a live connector.

## In Scope

- Extend schema and contracts only as narrowly needed to support DOCX ingestion metadata, for example:
- Extend schema and contracts only as narrowly needed to support RFC822 ingestion metadata, for example:
- `task_artifacts.ingestion_status` reuse if no new status is required
- optional deterministic extraction metadata on artifact detail or ingestion responses if needed
- Define typed contracts for:
- DOCX artifact-ingestion requests if they differ from the current generic artifact-ingestion path
- artifact-ingestion responses updated for DOCX extraction metadata if needed
- artifact detail or chunk summary metadata updated for DOCX ingestion if needed
- email artifact-ingestion requests if they differ from the current generic artifact-ingestion path
- artifact-ingestion responses updated for email extraction metadata if needed
- artifact detail or chunk summary metadata updated for email ingestion if needed
- Extend the existing ingestion seam so it:
- accepts already-registered visible DOCX artifacts
- accepts already-registered visible RFC822 email artifacts
- resolves rooted local file paths from persisted workspace plus artifact relative path
- supports one explicit DOCX extraction path only
- extracts deterministic text from DOCX package contents without OCR or image extraction
- supports one explicit local email parsing path only
- parses deterministic text from message headers plus plain-text body parts
- handles multipart messages narrowly and predictably
- rejects unsupported body forms when no extractable text body is present
- normalizes extracted text before chunking
- persists ordered chunk rows into the existing `task_artifact_chunks` table
- updates artifact ingestion status deterministically
- Add unit and integration tests for:
- supported DOCX ingestion
- deterministic chunk ordering and chunk boundaries from extracted DOCX text
- rooted path enforcement during DOCX ingestion
- rejection of malformed or textless DOCX files when no extractable text is present
- supported RFC822 ingestion
- deterministic chunk ordering and chunk boundaries from extracted email text
- rooted path enforcement during email ingestion
- rejection of malformed or textless email artifacts when no extractable text is present
- per-user isolation
- stable response shape
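
The header-plus-plain-text parsing and the textless rejection described in this list could look roughly like the sketch below. It uses only Python's standard `email` package. The names `HEADER_ORDER`, `UnsupportedEmailError`, and `extract_email_text` are hypothetical, and the choice of which headers to keep is an assumption for illustration, not the rule this sprint adopts.

```python
from email import message_from_bytes, policy

# Hypothetical: a fixed header set and order keeps the output deterministic.
HEADER_ORDER = ("From", "To", "Cc", "Subject", "Date")


class UnsupportedEmailError(ValueError):
    """Raised when a message has no extractable plain-text body."""


def extract_email_text(raw: bytes) -> str:
    """Return selected headers plus the plain-text body as one string."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Emit headers in a fixed order, skipping any that are absent.
    header_lines = [
        f"{name}: {msg[name]}" for name in HEADER_ORDER if msg[name] is not None
    ]

    # get_body returns None when no text/plain candidate exists
    # (for example, an HTML-only message).
    body_part = msg.get_body(preferencelist=("plain",))
    if body_part is None:
        raise UnsupportedEmailError("no text/plain body part found")

    body = body_part.get_content().strip()
    if not body:
        raise UnsupportedEmailError("text/plain body is empty")

    return "\n".join(header_lines) + "\n\n" + body
```

Raising a dedicated error, rather than returning an empty string, is one way to keep a textless message from silently producing misleading chunks.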

## Out of Scope

- No broader PDF compatibility work.
- No live Gmail API or OAuth work.
- No Calendar connector scope.
- No HTML-to-text rendering beyond a narrow explicit rule if strictly needed.
- No attachment extraction.
- No OCR.
- No image extraction from DOCX.
- No changes to lexical retrieval contracts.
- No changes to semantic retrieval contracts.
- No compile contract changes.
- No Gmail or Calendar connector scope.
- No runner-style orchestration.
- No UI work.

## Required Deliverables

- Narrow ingestion support for visible DOCX artifacts using the existing artifact and chunk seams.
- Stable contract updates only where DOCX extraction metadata is necessary.
- Unit and integration coverage for DOCX extraction, rooted-path safety, deterministic chunk persistence, and isolation.
- Narrow ingestion support for visible RFC822 email artifacts using the existing artifact and chunk seams.
- Stable contract updates only where email extraction metadata is necessary.
- Unit and integration coverage for email extraction, rooted-path safety, deterministic chunk persistence, and isolation.
- Updated `BUILD_REPORT.md` with exact verification results and explicit deferred scope.

## Acceptance Criteria

- A client can ingest one supported visible DOCX artifact into durable ordered chunk rows using the existing artifact-ingestion seam.
- DOCX ingestion reads only files rooted under the persisted task workspace boundary.
- Extracted text is normalized and chunked deterministically into the existing `task_artifact_chunks` contract.
- Malformed or textless DOCX files are rejected deterministically rather than silently producing misleading chunks.
- A client can ingest one supported visible RFC822 email artifact into durable ordered chunk rows using the existing artifact-ingestion seam.
- Email ingestion reads only files rooted under the persisted task workspace boundary.
- Extracted email text is normalized and chunked deterministically into the existing `task_artifact_chunks` contract.
- Malformed or textless email artifacts are rejected deterministically rather than silently producing misleading chunks.
- Existing lexical, semantic, and hybrid artifact retrieval contracts continue to operate over the persisted chunk rows without contract changes.
- `./.venv/bin/python -m pytest tests/unit` passes.
- `./.venv/bin/python -m pytest tests/integration` passes.
- No PDF-compatibility expansion, OCR, connector, runner, compile-contract, or UI scope enters the sprint.
- No live Gmail connector, Calendar connector, OAuth, attachment extraction, compile-contract, runner, or UI scope enters the sprint.
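
The rooted-path criterion above can be checked with a small resolver such as the following sketch. `resolve_rooted_path` and `RootedPathError` are hypothetical names, and the sketch assumes the workspace root and the artifact relative path arrive as plain strings; the repo's persisted representation may differ.

```python
from pathlib import Path


class RootedPathError(ValueError):
    """Raised when an artifact path would escape its workspace root."""


def resolve_rooted_path(workspace_root: str, relative_path: str) -> Path:
    """Resolve relative_path under workspace_root, refusing any escape."""
    root = Path(workspace_root).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check runs against the real target path.
    candidate = (root / relative_path).resolve()
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        raise RootedPathError(f"{relative_path!r} escapes the workspace root")
    return candidate
```

Because joining an absolute path onto `root` discards `root`, an absolute `relative_path` is rejected by the same check as a `../` traversal.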

## Implementation Constraints

- Keep richer parsing narrow and boring.
- Reuse the existing rooted `task_workspaces`, `task_artifacts`, and `task_artifact_chunks` seams rather than creating a parallel document store.
- Support DOCX text extraction only; do not introduce OCR, image extraction, or document-layout reconstruction in the same sprint.
- Reuse the existing rooted `task_workspaces`, `task_artifacts`, and `task_artifact_chunks` seams rather than creating a parallel email store.
- Support deterministic local RFC822 parsing only; do not introduce live connector behavior in the same sprint.
- Prefer plain-text body extraction; if multipart handling is needed, keep the accepted body selection rule explicit and deterministic.
- Preserve existing retrieval and compile contracts by feeding the already-shipped chunk substrate.
- Keep extraction and chunking deterministic and testable from local files alone.
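
One way to keep the multipart body selection rule explicit is to state it as a single walk over the message parts. The sketch below assumes the rule "first non-attachment `text/plain` leaf in document order"; that rule and the name `select_plain_body` are illustrative assumptions, not the rule the sprint specifies.

```python
from email import message_from_bytes, policy
from typing import Optional


def select_plain_body(raw: bytes) -> Optional[str]:
    """Return the first non-attachment text/plain part, or None."""
    msg = message_from_bytes(raw, policy=policy.default)
    # walk() visits parts depth-first in document order, so the result
    # depends only on the bytes of the message.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_type() != "text/plain":
            continue  # HTML and other body forms are not rendered here
        if part.is_attachment():
            continue  # attachment extraction is out of scope
        return part.get_content()
    return None
```

A `None` result gives the caller one clear signal to reject the artifact when no accepted body form is present.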

## Suggested Work Breakdown

1. Define any minimal DOCX-ingestion contract updates needed.
2. Implement deterministic rooted DOCX text extraction in the existing artifact-ingestion seam.
3. Normalize extracted text and persist ordered chunk rows into the existing chunk store.
4. Add deterministic failure behavior for malformed or textless DOCX files.
1. Define any minimal RFC822-ingestion contract updates needed.
2. Implement deterministic rooted email parsing in the existing artifact-ingestion seam.
3. Normalize extracted email text and persist ordered chunk rows into the existing chunk store.
4. Add deterministic failure behavior for malformed or textless email artifacts.
5. Add unit and integration tests.
6. Update `BUILD_REPORT.md` with executed verification.
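
Step 3 might be sketched as below. The normalization rules, the fixed-width chunking, and the `max_chars` default are all assumptions for illustration; the chunking rule already shipped for text artifacts in earlier sprints is not shown in this packet and may differ.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize Unicode form, line endings, and whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Collapse three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def chunk_text(text: str, max_chars: int = 1000) -> list[tuple[int, str]]:
    """Split text into ordered (index, chunk) pairs of at most max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [
        (index, text[start:start + max_chars])
        for index, start in enumerate(range(0, len(text), max_chars))
    ]
```

Both functions are pure, so the same input bytes always yield the same ordered chunks, which is what the determinism tests in step 5 would assert.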

## Build Report Requirements

`BUILD_REPORT.md` must include:
- the exact DOCX-ingestion contract changes introduced, if any
- the DOCX extraction path and chunking rule used
- the exact RFC822-ingestion contract changes introduced, if any
- the email extraction path and chunking rule used
- the header/body selection rule used
- exact commands run
- unit and integration test results
- one example DOCX artifact-ingestion response
- one example chunk list response produced from a DOCX artifact
- one example email artifact-ingestion response
- one example chunk list response produced from an email artifact
- what remains intentionally deferred to later milestones

## Review Focus

`REVIEW_REPORT.md` should verify:
- the sprint stayed limited to DOCX artifact parsing through the existing ingestion seam
- DOCX ingestion reuses the existing rooted workspace, artifact, and chunk contracts
- the sprint stayed limited to RFC822 email artifact parsing through the existing ingestion seam
- email ingestion reuses the existing rooted workspace, artifact, and chunk contracts
- extraction determinism, chunk ordering, rooted-path safety, and isolation are test-backed
- no hidden PDF-compatibility expansion, OCR, connector, runner, compile-contract, or UI scope entered the sprint
- no hidden live Gmail connector, Calendar connector, OAuth, attachment extraction, compile-contract, runner, or UI scope entered the sprint

## Exit Condition

This sprint is complete when the repo can ingest supported visible DOCX artifacts into deterministic durable chunk rows through the existing artifact-ingestion seam, verify the full path with Postgres-backed tests, and still defer broader document parsing, connectors, and UI.
This sprint is complete when the repo can ingest supported visible RFC822 email artifacts into deterministic durable chunk rows through the existing artifact-ingestion seam, verify the full path with Postgres-backed tests, and still defer live connector work, broader email handling, and UI.