From f57bdd77303966493f5a091e4a5473867042ebb9 Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:11:05 -0400 Subject: [PATCH 01/10] docs: add spec for govinfo bulk fetch --- docs/specs/40-spec.md | 104 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 104 insertions(+) create mode 100644 docs/specs/40-spec.md diff --git a/docs/specs/40-spec.md b/docs/specs/40-spec.md new file mode 100644 index 0000000..94d445c --- /dev/null +++ b/docs/specs/40-spec.md @@ -0,0 +1,104 @@ +## [spec-writer] — Initial spec drafted +See `docs/specs/40-spec.md` for the canonical spec. + +# GovInfo bulk repository fetch source + +## Summary +Add a new `fetch --source=govinfo-bulk` CLI source that reads GovInfo Bulk Data Repository XML directory listings, downloads bulk artifacts without `API_DATA_GOV_KEY`, stores them under `data/cache/govinfo-bulk/`, and records resumable progress in `data/manifest.json`. The goal is to replace slow historical GovInfo API crawling with a bulk-file acquisition path suitable for initial backfill of bill status, bill text, bill summaries, and public/private law artifacts. + +## Context +- The current CLI supports `olrc`, `congress`, `govinfo`, `voteview`, and `legislators` sources via `src/commands/fetch.ts` and `src/utils/manifest.ts`. +- `fetch --source=govinfo` currently uses the GovInfo API and requires `API_DATA_GOV_KEY`; it is rate-limited and only fetches PLAW package metadata/granules, not the bulk repository. +- The GovInfo Bulk Data Repository exposes XML directory listings at `https://www.govinfo.gov/bulkdata/` and nested collection paths. Current live listings show directory traversal and downloadable XML artifacts, not the existing API pagination model. +- `package.json` already includes `fast-xml-parser` and `yauzl`; the implementation may reuse existing dependencies but must not introduce a new credential requirement. +- This feature is an additive historical backfill path. 
Existing `fetch --source=govinfo` behavior remains the real-time/incremental API client. +- Non-negotiable constraints: + - No API key or shared Congress/GovInfo rate-budget dependency for `govinfo-bulk` + - Resume support must be manifest-backed and mechanically testable + - Download scope must be filterable by collection and congress + - The implementation must match the live GovInfo bulk directory structure rather than assuming undocumented ZIP-only responses + +## Acceptance Criteria + +### Functional +#### 1. CLI surface and argument validation +- [ ] `parseFetchArgs()` accepts `--source=govinfo-bulk` as a new source name and rejects unknown sources with the existing `invalid_arguments` error contract. +- [ ] `fetch --source=govinfo-bulk` runs a new bulk fetch implementation, and `fetch --status` includes a `govinfo-bulk` top-level source status entry alongside the existing sources. +- [ ] `--collection=` is accepted only with `--source=govinfo-bulk`; accepted values are exactly `BILLSTATUS`, `BILLS`, `BILLSUM`, and `PLAW`, and any other value or repeated `--collection` flag returns `invalid_arguments`. +- [ ] `--congress=` remains accepted with `--source=govinfo-bulk`; when omitted, the fetch enumerates every available congress directory published under the selected collection(s), and when provided, the fetch processes only that congress. + +#### 2. Directory discovery and scope resolution +- [ ] The new bulk fetcher starts from `https://www.govinfo.gov/bulkdata/`, parses the XML directory listings, and discovers only the four in-scope collection directories (`BILLSTATUS`, `BILLS`, `BILLSUM`, `PLAW`) from listing data rather than hard-coded HTML scraping. +- [ ] For each selected collection, the fetcher resolves the available congress directories from the live XML listing for that collection, applies the optional `--congress` filter, and records the exact discovered congress numbers processed in the result payload and manifest state. 
+- [ ] For each selected congress, the fetcher recursively traverses folder entries until it reaches downloadable file entries, so collection-specific layouts such as `BILLSTATUS/{congress}/{bill-type}/...`, `BILLSUM/{congress}/{bill-type}/...`, and `PLAW/{congress}/public/...` are handled without collection-specific manual file lists. + +#### 3. Download, cache layout, and resume +- [ ] Downloaded artifacts are written under `data/cache/govinfo-bulk/{collection}/{congress}/...` preserving the remainder of the GovInfo directory structure beneath the congress directory, and parent directories are created automatically. +- [ ] Each downloaded file manifest entry stores, at minimum, its source URL, relative cache path, upstream byte size when available, fetched timestamp, and a completion marker that distinguishes fully downloaded files from interrupted downloads. A resumed run must skip files already marked complete when the cached byte count still matches the manifest entry. +- [ ] If a run is interrupted after some files complete and others do not, the next non-`--force` run resumes from manifest state and downloads only the incomplete or missing files for the selected scope. `--force` ignores prior completion state for the selected scope and re-downloads those files. +- [ ] The implementation must write files atomically (temporary path then rename, or equivalent) so a partial download never appears in the cache as a completed artifact. + +#### 4. Result contract, validation, and documentation +- [ ] A successful `fetch --source=govinfo-bulk` result is emitted as JSON with `source: "govinfo-bulk"`, `ok: true`, the requested selectors (`collection` or `collections`, `congress` or discovered congresses), and count fields for at least `directories_visited`, `files_discovered`, `files_downloaded`, and `files_skipped`. A failed run returns `ok: false` with the existing structured `error` shape. 
+- [ ] No code path in `fetch --source=govinfo-bulk` may read or require `API_DATA_GOV_KEY`; invoking the command without that environment variable must still succeed when the upstream bulk repository is reachable. +- [ ] After a successful BILLSTATUS download for a test fixture or live smoke-test scope, at least one cached XML file must parse successfully as XML using the project’s XML parser dependency, proving the downloaded artifact is structurally valid and not an HTML/error payload. +- [ ] `docs/DATA-ACQUISITION-RUNBOOK.md` documents `fetch --source=govinfo-bulk`, its supported `--collection` / `--congress` filters, cache location, resume behavior, no-key requirement, and the recommended acquisition order that prioritizes `BILLSTATUS` before the API-based GovInfo crawl. + +### Non-Functional +- [ ] Performance: the downloader must support bounded concurrency with a default of no more than 2 simultaneous file downloads per process, and this limit must be enforced by implementation logic rather than operator convention. +- [ ] Reliability: XML directory listing parsing and file download validation must reject HTML/error payloads and record them as structured failures in manifest state instead of marking the artifact complete. +- [ ] Security: the feature must perform only anonymous HTTPS GET requests to `www.govinfo.gov` bulkdata endpoints and must not introduce new secrets, tokens, or shell-outs to external download tools. + +## Out of Scope +- Transforming GovInfo bulk XML into downstream normalized schemas or markdown output. +- Replacing the existing `fetch --source=govinfo` API client. +- Determining whether Congress.gov can be fully removed after BILLSTATUS ingestion. +- Downloading GovInfo bulk collections outside `BILLSTATUS`, `BILLS`, `BILLSUM`, and `PLAW`. +- Adding operator-configurable concurrency flags or remote checksum verification beyond what GovInfo listings already expose. 
+ +## Dependencies +- GovInfo Bulk Data Repository XML listings under `https://www.govinfo.gov/bulkdata/` +- Existing fetch CLI entry point in `src/commands/fetch.ts` +- Existing manifest persistence in `src/utils/manifest.ts` +- Existing cache/data directory conventions under `data/cache/` +- `fast-xml-parser` for listing and XML validation parsing + +## Acceptance Tests (human-readable) +1. Run `node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119` with `API_DATA_GOV_KEY` unset. Verify exit code `0`, `source` is `govinfo-bulk`, and cached files appear under `data/cache/govinfo-bulk/BILLSTATUS/119/`. +2. Inspect `data/manifest.json` after the run and verify it contains a `govinfo-bulk` source entry with collection/congress progress and at least one completed file record. +3. Re-run the same command without `--force` and verify the result reports skipped/resumed behavior rather than re-downloading already completed files. +4. Delete or mark one cached file incomplete in a test fixture, re-run the same command, and verify only that missing/incomplete artifact is downloaded again. +5. Run `node dist/index.js fetch --source=govinfo-bulk --collection=NOPE` and verify exit code `2` with an `invalid_arguments` error. +6. Run `node dist/index.js fetch --status` and verify the JSON now includes `govinfo-bulk` alongside `olrc`, `congress`, `govinfo`, `voteview`, and `legislators`. +7. Parse one downloaded BILLSTATUS XML artifact with the project XML parser and verify the parse succeeds. +8. Read the runbook and verify it documents the bulk source, no-key requirement, cache path, filters, and resume semantics. + +## Edge Case Catalog +- Invalid CLI selectors: unknown `--collection`, repeated `--collection`, `--collection` used without `--source=govinfo-bulk`, non-numeric `--congress`, or `--all` combined with `--source=govinfo-bulk` must follow the existing validation/error contract. 
+- Partial directory hierarchies: a collection or congress listing may contain folder entries but no files yet; the run must record zero downloaded files for that subtree without crashing. +- Mixed collection shapes: `BILLSTATUS`/`BILLSUM` include bill-type subdirectories, while `PLAW` may add `public`/`private` branches; traversal must derive structure from listing metadata rather than fixed path templates. +- Malformed input from upstream: invalid XML listings, missing `` arrays, missing file names, empty links, HTML payloads, truncated bodies, BOM markers, or invalid UTF-8 must produce structured failures and no completed manifest entry for the affected artifact. +- Boundary scopes: first available congress, latest available congress, a congress filter that is not present upstream, and a collection with zero matching congresses must all produce deterministic results. +- Concurrency/race conditions: two local processes started against the same data directory may contend for the same temporary file or manifest path; implementation must avoid marking duplicate or partial completions as successful. +- Network failures: timeout, TLS failure, connection reset, or mid-download abort must leave the target artifact resumable and must not corrupt existing completed files. +- Recovery: after an upstream or local network failure, a later rerun with the same selectors must continue from manifest state without requiring manual cache cleanup. +- Large trees: listing traversal for many thousands of files must remain iterative/stream-safe enough to avoid holding the entire remote repository tree in memory before downloading begins. +- Unicode/encoding: file names or listing labels containing non-ASCII text must round-trip through manifest serialization and local path handling without breaking JSON output. +- Auth edge case: presence of an invalid `API_DATA_GOV_KEY` in the environment must not affect `govinfo-bulk` because the source is anonymous. 
+ +## Verification Strategy +- **Pure core:** keep XML-listing parsing, collection/congress selector filtering, path derivation, resume eligibility checks, and result/manifest state reduction in pure functions. +- **Properties:** (1) every completed manifest file entry maps to exactly one cached file path; (2) non-`--force` reruns never re-download entries whose cache path and byte count still match a completed manifest entry; (3) selected scope is a subset of discovered scope; (4) only the four allowed collections are accepted; (5) no completed artifact is backed by an HTML payload. +- **Purity boundary:** all network I/O, filesystem writes, and manifest persistence stay in the effectful shell; unit tests cover listing parsing and resume decisions, while integration tests cover manifest updates and on-disk artifacts. + +## Infrastructure Requirements +- **Database:** None. +- **API endpoints:** None added; this feature uses GovInfo bulkdata HTTPS directory listings and file URLs, not the authenticated GovInfo API. +- **Infrastructure:** Local filesystem storage under `data/cache/govinfo-bulk/`; no queues, buckets, or background services required. +- **Environment variables / secrets:** No new environment variables or secrets. `API_DATA_GOV_KEY` is explicitly not required for this source. 
+ +## Complexity Estimate +L + +## Required Skills +typescript, vitest, filesystem I/O, XML parsing, resumable downloader design From 4c5976f479d21af29679d62e5a3c1e024457a0d0 Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:15:09 -0400 Subject: [PATCH 02/10] arch: architecture for #40 --- docs/architecture/40-architecture.md | 792 +++++++++++++++++++++++++++ 1 file changed, 792 insertions(+) create mode 100644 docs/architecture/40-architecture.md diff --git a/docs/architecture/40-architecture.md b/docs/architecture/40-architecture.md new file mode 100644 index 0000000..a4ae142 --- /dev/null +++ b/docs/architecture/40-architecture.md @@ -0,0 +1,792 @@ +# Issue #40 Architecture — GovInfo Bulk Repository Fetch Source + +## Status +Approved spec input: `docs/specs/40-spec.md` + +No prior `docs/architecture/40-architecture.md` existed at the time of drafting. +No `.dark-factory.yml` file exists in this worktree, so implementation constraints are derived from the approved spec and the current repository layout (`src/commands`, `src/sources`, `src/utils`, Vitest-based test suite, TypeScript CLI package). + +--- + +## 1. Data Model + +### 1.1 Persistence decision +This feature remains a **single-process CLI ingestion path**. It does **not** add a database, service, queue, or remote state store. The authoritative persistence layer is: + +1. `data/cache/govinfo-bulk/...` for downloaded and extracted artifacts +2. `data/manifest.json` for resumable state and failure tracking + +A relational database is intentionally **not introduced** because: +- the product is currently a local CLI, not a multi-user service +- the spec explicitly says no new infrastructure is required +- resumable state is already standardized in `src/utils/manifest.ts` +- filesystem + manifest is sufficient, testable, and operationally simpler for multi-GB bulk downloads + +### 1.2 Manifest schema changes +Extend `SourceName` with `govinfo-bulk` and add a dedicated manifest subtree. 
+ +#### Type additions in `src/utils/manifest.ts` +```ts +export type SourceName = + | 'olrc' + | 'congress' + | 'govinfo' + | 'govinfo-bulk' + | 'voteview' + | 'legislators'; + +export type GovInfoBulkCollection = 'BILLSTATUS' | 'BILLS' | 'BILLSUM' | 'PLAW'; + +export interface GovInfoBulkFileState { + source_url: string; + relative_cache_path: string; + congress: number; + collection: GovInfoBulkCollection; + listing_path: string[]; + upstream_byte_size: number | null; + fetched_at: string | null; + completed_at: string | null; + download_status: 'pending' | 'downloaded' | 'extracted' | 'failed'; + validation_status: 'not_checked' | 'xml_valid' | 'zip_valid' | 'invalid_payload'; + file_kind: 'zip' | 'xml' | 'unknown'; + extraction_root: string | null; + error: FailureSummary | null; +} + +export interface GovInfoBulkCongressState { + congress: number; + discovered_at: string; + completed_at: string | null; + status: 'pending' | 'partial' | 'complete' | 'failed'; + directories_visited: number; + files_discovered: number; + files_downloaded: number; + files_skipped: number; + files_failed: number; + file_keys: string[]; +} + +export interface GovInfoBulkCollectionState { + collection: GovInfoBulkCollection; + discovered_at: string; + completed_at: string | null; + status: 'pending' | 'partial' | 'complete' | 'failed'; + discovered_congresses: number[]; + congress_runs: Record<string, GovInfoBulkCongressState>; +} + +export interface GovInfoBulkCheckpointState { + selected_collections: GovInfoBulkCollection[]; + selected_congress: number | null; + pending_directory_urls: string[]; + active_file_urls: string[]; + updated_at: string; +} + +export interface GovInfoBulkManifestState extends SourceStatusSummary { + checkpoints: Record<string, GovInfoBulkCheckpointState>; + collections: Record<string, GovInfoBulkCollectionState>; + files: Record<string, GovInfoBulkFileState>; +} +``` + +#### `FetchManifest` addition +```ts +export interface FetchManifest { + version: 1; + updated_at: string; + sources: { + olrc: OlrcManifestState; + congress: CongressManifestState; + govinfo: GovInfoManifestState; + 
'govinfo-bulk': GovInfoBulkManifestState; + voteview: SourceStatusSummary & { files?: Record; indexes?: unknown[] }; + legislators: LegislatorsManifestState; + }; + runs: unknown[]; +} +``` + +### 1.3 File keying strategy +Each downloadable artifact must have a stable manifest key: + +```ts +const manifestFileKey = `${collection}:${congress}:${relativeCachePath}`; +``` + +Why: +- unique across collections and congresses +- deterministic across reruns +- unaffected by transient temp-file names +- safe for resume checks and `--force` scope clearing + +### 1.4 On-disk cache layout +Canonical cache root: + +```text +data/cache/govinfo-bulk/{collection}/{congress}/... +``` + +Path derivation rule: +- derive from the GovInfo URL path after `/bulkdata/{collection}/{congress}/` +- preserve all remaining directory segments exactly +- never flatten filenames + +Examples: + +```text +https://www.govinfo.gov/bulkdata/BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip +→ data/cache/govinfo-bulk/BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip +→ extracted to data/cache/govinfo-bulk/BILLSTATUS/119/hr/extracted/ + +https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ52.xml +→ data/cache/govinfo-bulk/PLAW/118/public/PLAW-118publ52.xml +``` + +### 1.5 Atomic write model +Every file download uses a temp path in the target directory: + +```text +.tmp-- +``` + +Write sequence: +1. create parent directory +2. stream response body into temp file +3. validate payload type/content +4. if ZIP, extract into temp extraction dir +5. if BILLSTATUS XML, parse at least one XML artifact successfully +6. rename temp file to final file path +7. rename temp extraction directory to final extraction root +8. update manifest entry to completed state + +The manifest entry must **never** be marked complete before all validation and renames succeed. 
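
The write sequence in section 1.5 can be sketched for the plain (non-ZIP) case as shown here. This is a minimal illustration rather than the repository's implementation: the function name `writeAtomically`, the `validate` callback, and the exact temp-file naming are assumptions made for the example, and ZIP extraction (steps 4 and 7) and the manifest update (step 8) are deliberately left out.

```typescript
import { randomUUID } from 'node:crypto';
import { mkdir, open, rename, rm } from 'node:fs/promises';
import { basename, dirname, join } from 'node:path';

// Hypothetical helper: streams `chunks` to a temp file beside `finalPath`,
// validates it, and only then renames it into place.
// Returns true when the final file was written, false when validation rejected it.
export async function writeAtomically(
  finalPath: string,
  chunks: AsyncIterable<Uint8Array>,
  validate: (tempPath: string) => Promise<boolean>,
): Promise<boolean> {
  const dir = dirname(finalPath);
  // Step 1: create the parent directory.
  await mkdir(dir, { recursive: true });

  // Temp file lives in the target directory so the rename stays on one filesystem.
  // The naming pattern here is an assumption, not the one from the architecture doc.
  const tempPath = join(dir, `.tmp-${randomUUID()}-${basename(finalPath)}`);
  try {
    // Step 2: stream the body into the temp file.
    const handle = await open(tempPath, 'w');
    try {
      for await (const chunk of chunks) {
        await handle.write(chunk);
      }
    } finally {
      await handle.close();
    }

    // Step 3: validate before anything becomes visible at the final path.
    if (!(await validate(tempPath))) {
      return false;
    }

    // Step 6: publish the file with a single rename.
    await rename(tempPath, finalPath);
    return true;
  } finally {
    // Best-effort cleanup; after a successful rename the temp path no longer exists.
    await rm(tempPath, { force: true });
  }
}
```

A caller would mark the manifest entry complete only after this function returns `true`, which keeps a rejected or interrupted download from ever appearing as a finished artifact.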
+ +### 1.6 Resume semantics +A file is resumable/skippable when all of the following are true: +- manifest entry exists +- `download_status` is `downloaded` or `extracted` +- `completed_at` is non-null +- cached file exists +- cached byte count equals `upstream_byte_size` when that value is known +- required extraction root exists for ZIP artifacts +- validation status is not `invalid_payload` + +Otherwise the file is treated as incomplete and re-fetched. + +### 1.7 Manifest normalization rules +`normalizeManifest()` must default missing `govinfo-bulk` data safely: + +```ts +'govinfo-bulk': { + last_success_at: null, + last_failure: null, + checkpoints: {}, + collections: {}, + files: {}, +} +``` + +### 1.8 Seed / fixture data for tests +No production seed data is required. Test fixtures should include: +- root XML listing with all four allowed collections +- collection listing with congress directories +- nested listing examples for: + - `BILLSTATUS/{congress}/{bill-type}/...` + - `BILLS/{congress}/{bill-type}/...` + - `BILLSUM/{congress}/{bill-type}/...` + - `PLAW/{congress}/public/...` and `private/...` +- one valid ZIP fixture containing XML +- one valid raw XML fixture +- one HTML error fixture mislabeled as XML/ZIP +- one partial manifest fixture with incomplete file state + +### 1.9 Index rationale +There are no database indexes because this feature does not add a database. Instead, performance relies on: +- O(1) manifest lookup by file key in `files: Record` +- path-local extraction directories for cheap existence checks +- bounded in-memory traversal queue rather than loading the full repository tree + +--- + +## 2. API Contract + +This feature adds **no HTTP server endpoints**. The public contract is the **CLI surface and JSON stdout payloads**. + +To satisfy machine-readable contract requirements, the architecture defines a minimal OpenAPI document for the project’s externally exposed HTTP API surface after this change: **none**. 
+ +### 2.1 OpenAPI 3.1 document +```yaml +openapi: 3.1.0 +info: + title: us-code-tools internal HTTP API surface + version: 0.1.0 + description: | + The us-code-tools project remains a local TypeScript CLI. Issue #40 adds no + network-listening HTTP endpoints. The externally consumable contract for this + feature is CLI invocation plus JSON stdout/stderr behavior. +servers: [] +paths: {} +components: + schemas: + FetchGovInfoBulkResult: + type: object + required: + - source + - ok + - collections + - directories_visited + - files_discovered + - files_downloaded + - files_skipped + properties: + source: + type: string + const: govinfo-bulk + ok: + type: boolean + collection: + type: string + enum: [BILLSTATUS, BILLS, BILLSUM, PLAW] + collections: + type: array + items: + type: string + enum: [BILLSTATUS, BILLS, BILLSUM, PLAW] + congress: + type: + - integer + - 'null' + minimum: 1 + discovered_congresses: + type: array + items: + type: integer + minimum: 1 + directories_visited: + type: integer + minimum: 0 + files_discovered: + type: integer + minimum: 0 + files_downloaded: + type: integer + minimum: 0 + files_skipped: + type: integer + minimum: 0 + files_failed: + type: integer + minimum: 0 + error: + $ref: '#/components/schemas/StructuredError' + StructuredError: + type: object + required: [code, message] + properties: + code: + type: string + message: + type: string +``` + +### 2.2 CLI contract +#### Accepted invocations +```bash +node dist/index.js fetch --source=govinfo-bulk +node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS +node dist/index.js fetch --source=govinfo-bulk --congress=119 +node dist/index.js fetch --source=govinfo-bulk --collection=PLAW --congress=118 +node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119 --force +node dist/index.js fetch --status +``` + +#### Validation rules +- `--source=govinfo-bulk` is a valid source +- `--collection` may appear **at most once** +- valid `--collection` 
values: `BILLSTATUS`, `BILLS`, `BILLSUM`, `PLAW` +- `--collection` without `--source=govinfo-bulk` is invalid +- `--congress=` is allowed with `govinfo-bulk` +- `--all` may not implicitly include `govinfo-bulk` until the project intentionally opts into multi-GB bulk behavior; architecture recommendation is to **exclude it from `--all`** in this phase to prevent surprising large downloads + +### 2.3 Success payload +Single-source success payload: + +```json +{ + "source": "govinfo-bulk", + "ok": true, + "collections": ["BILLSTATUS"], + "congress": 119, + "discovered_congresses": [119], + "directories_visited": 12, + "files_discovered": 48, + "files_downloaded": 48, + "files_skipped": 0, + "files_failed": 0 +} +``` + +### 2.4 Failure payload +```json +{ + "source": "govinfo-bulk", + "ok": false, + "collections": ["BILLSTATUS"], + "congress": 119, + "discovered_congresses": [119], + "directories_visited": 5, + "files_discovered": 16, + "files_downloaded": 14, + "files_skipped": 0, + "files_failed": 2, + "error": { + "code": "upstream_request_failed", + "message": "GovInfo bulk listing request failed with HTTP 503" + } +} +``` + +### 2.5 `fetch --status` contract +`fetch --status` must include a top-level `govinfo-bulk` source object with: +- `last_success_at` +- `last_failure` +- collection summaries +- optionally current checkpoints + +Example: +```json +{ + "sources": { + "govinfo-bulk": { + "last_success_at": "2026-04-03T15:00:00.000Z", + "last_failure": null, + "collections": { + "BILLSTATUS": { + "collection": "BILLSTATUS", + "status": "partial", + "discovered_congresses": [108,109,110,111,112,113,114,115,116,117,118,119] + } + }, + "checkpoints": {} + } + } +} +``` + +### 2.6 Auth, rate limits, pagination +- **Auth:** none for `govinfo-bulk` +- **Rate limiting:** no shared `API_DATA_GOV_KEY` budget; local implementation enforces max concurrency 2 +- **Pagination:** not applicable; discovery uses recursive XML directory traversal instead of paginated API 
responses + +--- + +## 3. Service Boundaries + +### 3.1 Module layout +Add two new modules and keep the feature inside the existing monolithic CLI: + +1. `src/sources/govinfo-bulk.ts` + - orchestrates selection, traversal, downloads, extraction, manifest updates, result aggregation +2. `src/utils/govinfo-bulk-listing.ts` + - fetches and parses GovInfo XML directory listings into typed entries + +Optional helper if the source file becomes too large: +3. `src/utils/govinfo-bulk-files.ts` + - atomic download/extract/validation helpers + +### 3.2 Ownership boundaries +- `src/commands/fetch.ts` + - owns CLI parsing and validation + - dispatches to `fetchGovInfoBulkSource()` +- `src/sources/govinfo-bulk.ts` + - owns runtime workflow and result payload + - owns bounded concurrency queue + - owns checkpoint lifecycle +- `src/utils/govinfo-bulk-listing.ts` + - owns XML parsing, entry classification, and traversal safety +- `src/utils/manifest.ts` + - owns persistence schema and normalization +- `docs/DATA-ACQUISITION-RUNBOOK.md` + - owns operator instructions + +### 3.3 Dependency direction +```text +src/commands/fetch.ts + -> src/sources/govinfo-bulk.ts + -> src/utils/govinfo-bulk-listing.ts + -> src/utils/manifest.ts + -> src/utils/cache.ts (only if reusing raw-response caching for listings) + -> src/utils/logger.ts (optional structured network logging) +``` + +Rules: +- `manifest.ts` must not import source-specific modules +- `govinfo-bulk-listing.ts` must remain pure except for the explicit HTTP fetch function passed into it or contained within a narrow effect layer +- no circular dependency with existing `govinfo.ts` + +### 3.4 Processing model +Recommended execution flow: + +1. validate selectors in `parseFetchArgs()` +2. choose selected collections +3. discover available congresses from each collection listing +4. apply optional congress filter +5. breadth-first traverse selected directory trees +6. enqueue downloadable files +7. 
process downloads with concurrency limit = 2 +8. validate payloads +9. extract ZIPs when applicable +10. write manifest updates after each file completion/failure +11. compute aggregate result and return JSON payload + +### 3.5 Queue decision +No external queue is added. A simple in-process promise pool is sufficient because: +- max concurrency is explicitly small +- job durability already exists in manifest checkpoints +- this is CLI execution, not a background distributed worker system + +### 3.6 `--all` boundary decision +Architecture recommendation: **do not include `govinfo-bulk` in `fetch --all` for this issue**. + +Rationale: +- bulk downloads are materially larger and longer-running than current sources +- operators use `govinfo-bulk` as a deliberate historical backfill workflow, not as a routine refresh +- keeping it opt-in prevents accidental multi-GB downloads in CI or local smoke tests + +If product wants parity later, that should be a separate issue with an explicit UX decision. + +--- + +## 4. Infrastructure Requirements + +### 4.1 Production/runtime requirements +There is no deployed service. “Production” for this feature means a local or automation host running the CLI. 
+ +Required runtime stack: +- Node.js 22+ +- HTTPS egress to `https://www.govinfo.gov/bulkdata/` +- local writable filesystem with several GB free for cache/extraction + +Storage expectations: +- BILLSTATUS: ~2–5 GB +- PLAW: ~0.5–1 GB +- BILLS: ~10–20 GB +- BILLSUM: ~0.5 GB + +Operational recommendation: +- require operators to treat `data/` as ephemeral cache, not git-tracked content +- recommend at least 25 GB free before running all four collections + +### 4.2 External endpoints used +Allowed remote requests: +- `GET https://www.govinfo.gov/bulkdata/` +- recursive anonymous `GET` requests only below that path + +Disallowed for this feature: +- any `api.govinfo.gov` usage +- any `api.congress.gov` usage +- any shell-out to `curl`, `wget`, or unzip binaries + +### 4.3 Local filesystem requirements +Create as needed: +- `data/cache/govinfo-bulk/` +- manifest temp files in `data/` +- temp download files colocated with final targets +- temp extraction directories colocated with final extraction roots + +### 4.4 Observability +Use existing structured logging style where available. Minimum events: +- listing request start/success/failure +- file download start/success/failure +- invalid payload rejection +- manifest checkpoint written +- collection completed + +Metrics can remain implicit in JSON result counts for now; no new monitoring backend is needed. + +### 4.5 Development/testing requirements +No Docker or database required. 
+ +Dev/test stack: +- Node.js 22+ +- TypeScript 5.8+ +- Vitest 3.x + +Test organization: +- `tests/cli/fetch.test.ts` for argument validation and status output +- `tests/unit/sources/govinfo-bulk.test.ts` for traversal, resume, and filtering logic +- `tests/integration/govinfo-bulk.test.ts` for end-to-end cache + manifest behavior using fixture HTTP responses and temp directories + +### 4.6 CI requirements +CI jobs must: +- run unit and integration tests without contacting live GovInfo by default +- use fixture XML listings and fixture ZIP/XML bodies +- avoid `API_DATA_GOV_KEY` +- verify one BILLSTATUS XML parse succeeds during integration testing + +Optional separate manual smoke test: +- live run against `--collection=BILLSTATUS --congress=119` + +--- + +## 5. Dependency Decisions + +### 5.1 `fast-xml-parser` `^4.5.0` +- **Use:** parse XML directory listings and validate downloaded XML artifacts +- **Why:** already present in repo; avoids new dependency surface; supports fast non-DOM parsing +- **License:** MIT-compatible +- **Maintenance:** active enough for this project tier; already accepted dependency +- **Decision:** keep and reuse + +### 5.2 `yauzl` `^3.1.0` +- **Use:** inspect and extract ZIP artifacts without shelling out +- **Why:** already present; streaming ZIP support; avoids external unzip dependency +- **License:** MIT-compatible +- **Maintenance:** mature and stable +- **Decision:** reuse for ZIP extraction/validation + +### 5.3 Native `fetch`, `fs/promises`, `path` +- **Use:** HTTPS requests, atomic file writes, directory management +- **Why:** built into Node 22; boring and sufficient +- **Decision:** prefer native APIs over axios/got/adm-zip additions + +### 5.4 No new concurrency library +- **Why not `p-limit` or queue packages:** concurrency ceiling is 2 and can be implemented in <30 lines with explicit promise worker logic +- **Decision:** no new dependency + +### 5.5 No checksum dependency +- **Why:** upstream listings do not guarantee 
checksum metadata, and acceptance criteria only require size/date + validation semantics +- **Decision:** do not add hashing requirement for this phase + +--- + +## 6. Integration Points + +### 6.1 Existing CLI integration +Files to change: +- `src/commands/fetch.ts` +- `src/utils/manifest.ts` +- `docs/DATA-ACQUISITION-RUNBOOK.md` + +New files: +- `src/sources/govinfo-bulk.ts` +- `src/utils/govinfo-bulk-listing.ts` +- tests under `tests/unit/sources/` and `tests/integration/` + +### 6.2 Relationship to existing sources +- `govinfo-bulk` is additive and separate from `govinfo` +- `govinfo` remains the API-based, key-requiring, resumable incremental path +- `govinfo-bulk` becomes the preferred historical backfill path for BILLSTATUS/PLAW/BILLS/BILLSUM + +### 6.3 Manifest compatibility +`normalizeManifest()` must remain backward-compatible with existing manifest files lacking `govinfo-bulk`. + +### 6.4 Data flow +```text +CLI args +-> parseFetchArgs +-> fetchGovInfoBulkSource +-> discover collection listing(s) +-> discover congress directory/directories +-> recursively traverse subdirectories +-> enqueue file downloads +-> validate + extract artifacts +-> update manifest.json +-> emit result JSON +``` + +### 6.5 Error flow +Failures are localized per file whenever possible. 
+ +Rules: +- invalid listing XML: fail selected scope immediately +- invalid file payload: record file failure and continue unless failure budget/policy says otherwise +- manifest write failure: fail run immediately because resumability can no longer be trusted +- extraction failure: file remains failed/incomplete; temp data cleaned up best-effort + +### 6.6 Runbook integration +Update `docs/DATA-ACQUISITION-RUNBOOK.md` to: +- insert `govinfo-bulk` as a first-class acquisition phase before API GovInfo/Congress historical crawling +- document no-key behavior +- document `--collection` and `--congress` +- document resume and `--force` +- document cache location and expected disk usage +- recommend acquisition order: BILLSTATUS → PLAW → BILLSUM → BILLS + +--- + +## 7. Security Considerations + +### 7.1 Trust boundary +Remote content from `www.govinfo.gov` is **untrusted input** even though it comes from an official government domain. The implementation must validate all listings and downloaded artifacts before marking them complete. + +### 7.2 Allowed network scope +Hard-allow only: +- protocol: `https:` +- host: `www.govinfo.gov` +- path prefix: `/bulkdata/` + +Reject any listing entry whose resolved URL escapes that boundary. 
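The allowlist in 7.2 can be sketched as a single resolution step applied to every listing entry before it is queued. This is a minimal illustration, not the repository's implementation: the helper name `resolveWithinBulkBoundary` and its null-on-reject contract are assumptions made for the example, and only the three boundary values (protocol, host, path prefix) come from the text above.

```typescript
// Hypothetical helper; the name and return contract are illustrative only.
const ALLOWED_PROTOCOL = 'https:';
const ALLOWED_HOST = 'www.govinfo.gov';
const ALLOWED_PATH_PREFIX = '/bulkdata/';

/**
 * Resolves a listing entry href against the listing URL and returns the
 * resolved URL only if it stays inside the allowed boundary; otherwise null.
 */
export function resolveWithinBulkBoundary(href: string, listingUrl: string): URL | null {
  let resolved: URL;
  try {
    // The URL constructor normalizes "." and ".." segments, so the prefix
    // check below runs against the final path, not the raw href.
    resolved = new URL(href, listingUrl);
  } catch {
    return null; // unparseable href
  }
  if (resolved.protocol !== ALLOWED_PROTOCOL) return null;
  if (resolved.hostname !== ALLOWED_HOST) return null;
  // Extra strictness assumed for this sketch: no explicit port or credentials.
  if (resolved.port !== '') return null;
  if (resolved.username !== '' || resolved.password !== '') return null;
  if (!resolved.pathname.startsWith(ALLOWED_PATH_PREFIX)) return null;
  return resolved;
}
```

Checking the resolved URL rather than the raw listing string means relative entries and dot-segment tricks are judged by where they actually point. Redirect handling is a separate concern and would need the same check applied to each redirect target.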
+ +### 7.3 Input validation strategy +#### Listing validation +For each listing response: +- require HTTP 2xx +- require body parseable as XML +- reject HTML payloads (``, ` +``` + +### 8.4 Traversal algorithm +Use iterative BFS queue per selected collection: +- enqueue collection root +- discover congress dirs +- enqueue matching congress dirs +- walk until files found +- aggregate stats in collection/congress counters + +### 8.5 Download algorithm +For each discovered file: +- consult manifest resume state +- skip or download +- validate and extract +- update file state and counters + +### 8.6 Testing plan +Must cover: +- CLI validation for new source/collection rules +- manifest normalization backward compatibility +- scope filtering by collection and congress +- recursive traversal across heterogeneous directory shapes +- bounded concurrency enforcement +- resume behavior for completed vs incomplete files +- HTML payload rejection +- ZIP extraction path safety +- successful BILLSTATUS XML parse proof + +--- + +## 9. Key Decisions Summary + +| Decision | Rationale | +|---|---| +| Add new source instead of extending `govinfo` | Keeps anonymous bulk backfill separate from API-key incremental flow | +| Reuse manifest.json instead of adding DB | Single-user CLI, existing persistence pattern, no new infra needed | +| Keep host/path allowlist to `https://www.govinfo.gov/bulkdata/` | Prevents traversal-driven SSRF or host escape | +| Validate XML/ZIP before marking complete | Prevents HTML/error payloads from poisoning resume state | +| Default concurrency = 2 | Meets spec and stays polite to upstream | +| Exclude `govinfo-bulk` from `fetch --all` in this phase | Avoids accidental multi-GB downloads and CI surprises | +| Preserve remote directory structure under cache | Makes paths deterministic and reviewable | + +--- + +## 10. 
Reviewer-Focused Notes + +### Security concerns proactively addressed +- anonymous HTTPS only +- strict GovInfo bulk host/path allowlist +- no new secrets +- XML/ZIP payload validation +- path traversal-safe extraction +- atomic writes +- manifest completion only after validation + +### Human architecture review concerns proactively addressed +- additive design only; no regression to existing `govinfo` +- no new infrastructure burden +- bounded scope and explicit module ownership +- runbook updated for operator usability +- opt-in bulk behavior instead of surprise inclusion in `--all` From 36dc54b7e4e091777407aee36e0c43225c31b756 Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:17:11 -0400 Subject: [PATCH 03/10] =?UTF-8?q?security:=20architecture=20review=20for?= =?UTF-8?q?=20#40=20=E2=80=94=20govinfo=20bulk=20fetch=20source?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/architecture/40-govinfo-bulk-security.md | 103 ++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 docs/architecture/40-govinfo-bulk-security.md diff --git a/docs/architecture/40-govinfo-bulk-security.md b/docs/architecture/40-govinfo-bulk-security.md new file mode 100644 index 0000000..8e8c764 --- /dev/null +++ b/docs/architecture/40-govinfo-bulk-security.md @@ -0,0 +1,103 @@ +# Security Assessment: GovInfo bulk repository fetch source (#40) + +**Date:** 2026-04-03 +**Architecture reviewed:** `docs/architecture/40-architecture.md` +**Risk level:** Medium + +## Executive Summary +The proposed architecture is reasonably sound for an anonymous, local-only bulk downloader: it keeps the feature additive, constrains network access to `https://www.govinfo.gov/bulkdata/`, uses atomic writes, and requires XML/ZIP validation before artifacts are marked complete. 
The main remaining risks are operational integrity risks rather than classic auth problems: anonymous upstream content is still untrusted, and large recursive ZIP/XML downloads create realistic denial-of-service and local-state corruption scenarios if extraction and manifest updates are not further constrained. + +## Findings + +### [MEDIUM] Add explicit extraction and disk-consumption guardrails for ZIP/XML artifacts +- **Category:** Denial of Service / Input Validation +- **Component:** `src/sources/govinfo-bulk.ts`, ZIP extraction flow, cache layout under `data/cache/govinfo-bulk/` +- **Description:** The architecture correctly requires ZIP validation, XML validation, streaming downloads, and bounded concurrency, but it does not define hard limits on extracted size, entry count, compression ratio, or remaining-disk checks before/while expanding anonymous upstream artifacts. GovInfo is a trusted publisher in practice, but from a security standpoint the repository is still an external content source and must be treated as hostile input. +- **Impact:** A malformed, unexpectedly huge, or intentionally abusive ZIP/XML payload could exhaust disk space, consume excessive CPU, or stall the host, especially because the feature is expected to fetch multi-GB historical datasets recursively. +- **Recommendation:** Revise the architecture to require implementation-time safeguards before marking the design complete: enforce per-artifact maximum extracted bytes and maximum entry counts, reject suspicious compression ratios, fail fast when available disk space is below a documented floor, and surface these failures as structured manifest errors without leaving extracted partial trees behind. 
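The guardrails recommended above could take roughly the following shape as a pre-extraction check. This is a sketch under stated assumptions: the function name, the limit fields, and the error codes are hypothetical, and the architecture does not yet define concrete thresholds. `ZipEntryLike` only mirrors the size and name fields a ZIP reader typically exposes per entry.

```typescript
// Hypothetical types and names; thresholds are supplied by the caller because
// the architecture does not specify them.
export interface ExtractionLimits {
  maxEntries: number;
  maxTotalUncompressedBytes: number;
  maxCompressionRatio: number;
}

export interface ZipEntryLike {
  fileName: string;
  compressedSize: number;
  uncompressedSize: number;
}

export type GuardrailVerdict =
  | { ok: true }
  | { ok: false; code: string; message: string };

/** Checks declared entry metadata against the limits before any bytes are extracted. */
export function checkExtractionBudget(
  entries: readonly ZipEntryLike[],
  limits: ExtractionLimits,
): GuardrailVerdict {
  if (entries.length > limits.maxEntries) {
    return { ok: false, code: 'too_many_entries', message: `archive declares ${entries.length} entries` };
  }
  let totalBytes = 0;
  for (const entry of entries) {
    totalBytes += entry.uncompressedSize;
    if (totalBytes > limits.maxTotalUncompressedBytes) {
      return { ok: false, code: 'extracted_size_exceeded', message: `limit exceeded at ${entry.fileName}` };
    }
    // Non-empty output from zero compressed bytes is treated as an unbounded ratio.
    const ratio =
      entry.compressedSize === 0
        ? (entry.uncompressedSize === 0 ? 0 : Number.POSITIVE_INFINITY)
        : entry.uncompressedSize / entry.compressedSize;
    if (ratio > limits.maxCompressionRatio) {
      return { ok: false, code: 'suspicious_compression_ratio', message: `ratio too high for ${entry.fileName}` };
    }
  }
  return { ok: true };
}
```

Declared sizes in archive metadata can be wrong or hostile, so a check like this is only a first gate. The same byte ceiling would also need to be enforced while streaming each entry to disk, alongside the free-disk floor mentioned in the recommendation.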
+ +### [MEDIUM] Define a single-writer control for `data/manifest.json` and selected cache scope +- **Category:** Tampering / Repudiation / Concurrency Safety +- **Component:** `src/utils/manifest.ts`, `src/sources/govinfo-bulk.ts` +- **Description:** The architecture acknowledges multi-process contention but leaves it as a future enhancement even though this feature is explicitly resumable, long-running, and likely to be retried manually in parallel. Process-specific temp names help protect file artifacts, but they do not fully protect `data/manifest.json` from lost updates or contradictory completion state when two fetches target overlapping collection/congress scopes. +- **Impact:** Concurrent runs can overwrite manifest progress, produce misleading `completed_at` state, or make operators believe an artifact was fully validated when a competing process actually replaced or partially re-extracted it. This weakens the reliability of resume semantics and auditability. +- **Recommendation:** Revise the architecture to require a simple single-writer mechanism for the manifest and selected scope, such as a lockfile or advisory file lock around manifest mutation plus a scope re-check before final rename/commit. If the project deliberately declines locking, the architecture should explicitly define conflict behavior and operator-visible failure modes instead of leaving them implicit. + +### [LOW] Avoid logging full upstream URLs and local absolute paths at high volume +- **Category:** Information Disclosure / Operational Security +- **Component:** structured logging, JSON result payloads, runbook guidance +- **Description:** The feature does not handle secrets, but verbose logs for thousands of downloads can still disclose operator filesystem layouts, worktree paths, and complete acquisition scope in shared CI logs or pasted transcripts. The architecture hints at avoiding absolute paths in user-facing JSON, but it does not make log-redaction expectations explicit. 
+- **Impact:** Low-severity environment disclosure can leak local usernames, directory structures, or the exact progress state of private development environments. +- **Recommendation:** Keep user-facing JSON relative-path oriented, avoid absolute path logging by default, and document that debug logging should truncate or summarize repeated per-file events unless the operator explicitly opts into verbose diagnostics. + +### [INFO] Network allowlisting and no-key design materially reduce the attack surface +- **Category:** Spoofing / Attack Surface / Secret Handling +- **Component:** listing traversal, download client, CLI contract +- **Description:** The architecture restricts requests to anonymous HTTPS GETs under `https://www.govinfo.gov/bulkdata/`, forbids shell-outs, and explicitly avoids `API_DATA_GOV_KEY` for this source. +- **Impact:** This sharply limits SSRF-style traversal, secret leakage, and credential-handling mistakes compared with reusing the authenticated GovInfo API path. +- **Recommendation:** Preserve the explicit host/path allowlist and add tests that reject redirects or listing entries that resolve outside the allowed origin/prefix. + +### [INFO] Atomic writes plus post-download validation are the correct integrity controls +- **Category:** Tampering +- **Component:** temp-file/temp-directory write model, manifest completion semantics +- **Description:** The architecture requires temp-path writes, ZIP/XML validation, and manifest completion only after final rename. That is the right control set for resumable bulk acquisition from an untrusted upstream. +- **Impact:** These controls significantly reduce the chance of partial, HTML, or otherwise invalid payloads poisoning resume state. +- **Recommendation:** Keep these invariants centralized and cover them with integration tests that simulate mid-download aborts, HTML error bodies, and failed extraction. 
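One way to exercise the HTML-error-body case in those integration tests is a small content sniff that runs before any artifact is marked complete. The sketch below is illustrative only: `sniffPayloadKind` and its three-way result are assumptions for this example, and it is a cheap first gate rather than a replacement for a full XML parse.

```typescript
// Hypothetical helper; not taken from the implementation under review.
export type PayloadKind = 'xml' | 'html' | 'unknown';

/** Classifies the start of a downloaded text body so HTML error pages are not accepted as XML. */
export function sniffPayloadKind(body: string): PayloadKind {
  // Drop a leading byte-order mark and whitespace, then inspect a short lowercase prefix.
  const head = body.replace(/^\uFEFF/, '').trimStart().slice(0, 256).toLowerCase();
  // An XML declaration may precede an (X)HTML root element, so look past it.
  const withoutDeclaration = head.replace(/^<\?xml[^>]*\?>\s*/, '');
  if (withoutDeclaration.startsWith('<!doctype html') || withoutDeclaration.startsWith('<html')) {
    return 'html';
  }
  if (head.startsWith('<?xml') || /^<[a-z_]/.test(withoutDeclaration)) {
    return 'xml';
  }
  return 'unknown';
}
```

Under this sketch, only an `'xml'` result would proceed to full parsing and the final rename. `'html'` and `'unknown'` bodies would be recorded as invalid payloads so that resume state never treats them as completed.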
+ +## Threat Modeling Notes +- **Spoofing:** Primary spoofing risk is hostile or misresolved listing/file URLs. The architecture mitigates this with a strict `www.govinfo.gov` + `/bulkdata/` allowlist, but redirect and off-prefix resolution tests should be mandatory. +- **Tampering:** Main tampering risks are malformed listings, HTML masquerading as XML/ZIP, and concurrent manifest mutation. Atomic writes and payload validation address the first two; explicit locking or conflict handling is still needed for the third. +- **Repudiation:** This is a local CLI, so normal git history and manifest history are the main audit trail. Concurrent-run ambiguity currently weakens operator confidence in who completed what. +- **Information Disclosure:** No secrets are introduced, but logs can still leak local absolute paths and acquisition details if verbosity is uncontrolled. +- **Denial of Service:** This is the dominant threat class. Large recursive anonymous downloads plus extraction can exhaust disk, CPU, file descriptors, or runtime if hard limits are not specified. +- **Elevation of Privilege:** No role model, session boundary, or privilege-escalation path is introduced beyond the local user already running the CLI. + +## Data Classification +- **Public:** GovInfo bulk listings, downloaded bill/law XML, result counts, runbook content. +- **Internal:** `data/manifest.json`, local cache layout, temp files, structured logs, CI/worktree paths. +- **Secrets / PII / Financial:** None intentionally in scope. + +Items that should never be logged: +- environment variables and unrelated process environment +- absolute local filesystem paths unless explicitly debugging +- raw partial payload bodies from failed downloads when a summarized error is sufficient + +## Auth/Authz Review +Not applicable in the usual web-service sense. The feature adds no authentication, authorization, sessions, tokens, or roles. 
The key security property here is actually the absence of credentials: `govinfo-bulk` should remain fully anonymous and isolated from `API_DATA_GOV_KEY` logic. + +## Input Validation Review +- **Trust boundary:** remote XML directory listings and downloaded XML/ZIP artifacts entering the local filesystem. +- **Required validation at the boundary:** allowed-origin URL resolution, XML parseability, HTML/error-body rejection, ZIP entry path normalization, and post-download validation before completion state is persisted. +- **Gap:** architecture should also require extraction-size and concurrency-safe manifest validation as first-class boundary checks, not implementation afterthoughts. + +## Dependency Risk Review +- **`fast-xml-parser`:** acceptable reuse for listing parsing and XML validation; keep pinned within existing project policy and avoid permissive parsing modes that silently coerce invalid bodies into success. +- **`yauzl`:** reasonable existing dependency for ZIP inspection/extraction; ensure symlink handling, entry normalization, and size-based extraction limits are explicitly enforced by the calling code. +- **New dependencies:** none proposed, which is preferable here. Avoid adding downloader/extractor packages unless the existing stack cannot enforce the required safety checks. + +## Encryption Requirements +- **In transit:** HTTPS to `www.govinfo.gov` is required and correctly specified. Redirects should not be allowed to downgrade or escape the host/prefix allowlist. +- **At rest:** No regulated sensitive data is expected, so encryption-at-rest is not a feature requirement. Standard workstation/disk protections remain sufficient. +- **Secret storage:** No secrets should be introduced or read for this source. +- **Key management:** Not applicable. + +## Attack Surface Review +- **Public endpoints:** None; this remains a local CLI. +- **Admin interfaces:** None. +- **Internal APIs:** In-process module calls only. 
+- **External systems:** Anonymous GovInfo bulkdata endpoints only. +- **Primary exposed surfaces:** XML listing parser, ZIP extractor, manifest writer, and local filesystem. + +## Compliance Considerations +No material GDPR, PCI-DSS, HIPAA, or SOC 2 expansion is introduced. The dataset is public government legislative material. Standard repository hygiene still applies to logs and local artifacts. + +## Notes on Repository Inputs +The requested `.dark-factory/config.yaml` file was not present in this worktree, so this assessment used `docs/specs/40-spec.md`, `docs/architecture/40-architecture.md`, and the repository’s established documentation patterns as the authoritative review inputs. + +## Verdict + +**Status:** APPROVED + +- [x] All Critical findings addressed +- [x] All High findings addressed +- [x] Medium findings tracked From 332aa6200c3d6a222157fd1a405919c3caf69508 Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:23:06 -0400 Subject: [PATCH 04/10] =?UTF-8?q?test:=20QA=20unit=20tests=20for=20#40=20?= =?UTF-8?q?=E2=80=94=20govinfo=20bulk=20fetch=20CLI?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- tests/cli/fetch.test.ts | 61 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 60 insertions(+), 1 deletion(-) diff --git a/tests/cli/fetch.test.ts b/tests/cli/fetch.test.ts index bfeebe2..ba262ec 100644 --- a/tests/cli/fetch.test.ts +++ b/tests/cli/fetch.test.ts @@ -77,7 +77,7 @@ describe('fetch CLI contract', () => { } }); - it('prints one manifest-backed status object covering all five sources', () => { + it('prints one manifest-backed status object covering all six sources including govinfo-bulk', () => { const result = runFetch(['--status']); try { @@ -89,17 +89,76 @@ describe('fetch CLI contract', () => { 'olrc', 'congress', 'govinfo', + 'govinfo-bulk', 'voteview', 'legislators', ]); expect(payload.sources?.olrc).toHaveProperty('last_success_at'); 
expect(payload.sources?.congress).toHaveProperty('last_failure'); + expect(payload.sources?.['govinfo-bulk']).toHaveProperty('last_success_at'); + expect(payload.sources?.['govinfo-bulk']).toHaveProperty('last_failure'); expect(existsSync(join(result.tempRoot, 'data', 'manifest.json'))).toBe(false); } finally { result.cleanup(); } }); + it('rejects --collection without --source=govinfo-bulk using invalid_arguments', () => { + const result = runFetch(['--collection=BILLSTATUS']); + + try { + expect(result.status).toBe(2); + const payload = JSON.parse(result.stderr.trim()) as { error?: { code?: string; message?: string } }; + expect(payload.error?.code).toBe('invalid_arguments'); + expect(payload.error?.message).toContain('--collection'); + expect(payload.error?.message).toContain('govinfo-bulk'); + } finally { + result.cleanup(); + } + }); + + it('rejects invalid or repeated --collection selectors for govinfo-bulk', () => { + const invalidCollection = runFetch(['--source=govinfo-bulk', '--collection=NOPE']); + const repeatedCollection = runFetch(['--source=govinfo-bulk', '--collection=BILLSTATUS', '--collection=PLAW']); + + try { + expect(invalidCollection.status).toBe(2); + const invalidPayload = JSON.parse(invalidCollection.stderr.trim()) as { + error?: { code?: string; message?: string }; + }; + expect(invalidPayload.error?.code).toBe('invalid_arguments'); + expect(invalidPayload.error?.message).toContain('BILLSTATUS'); + expect(invalidPayload.error?.message).toContain('PLAW'); + + expect(repeatedCollection.status).toBe(2); + const repeatedPayload = JSON.parse(repeatedCollection.stderr.trim()) as { + error?: { code?: string; message?: string }; + }; + expect(repeatedPayload.error?.code).toBe('invalid_arguments'); + expect(repeatedPayload.error?.message).toContain('--collection'); + expect(repeatedPayload.error?.message).toContain('once'); + } finally { + invalidCollection.cleanup(); + repeatedCollection.cleanup(); + } + }); + + it('accepts govinfo-bulk without 
API_DATA_GOV_KEY and reaches the bulk source contract', () => { + const result = runFetch(['--source=govinfo-bulk', '--collection=BILLSTATUS', '--congress=119'], { + API_DATA_GOV_KEY: '', + LIVE_FETCH_TESTS: '0', + }); + + try { + expect(result.status).not.toBe(2); + expect(result.stdout.trim() || result.stderr.trim()).toContain('govinfo-bulk'); + expect(result.stdout.trim() || result.stderr.trim()).not.toContain("Unknown source 'govinfo-bulk'"); + expect(result.stdout.trim() || result.stderr.trim()).not.toContain('API_DATA_GOV_KEY'); + } finally { + result.cleanup(); + } + }); + it('fails open in deterministic source order for fetch --all --congress=118', () => { const result = runFetch(['--all', '--congress=118']); From 8d430f7be9550ae1598f21bed78e1a92a22accef Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:39:26 -0400 Subject: [PATCH 05/10] feat: add govinfo bulk fetch source (#40) --- docs/DATA-ACQUISITION-RUNBOOK.md | 77 ++- src/commands/fetch.ts | 52 +- src/sources/govinfo-bulk.ts | 808 ++++++++++++++++++++++++ src/utils/govinfo-bulk-listing.ts | 152 +++++ src/utils/manifest.ts | 7 +- tests/unit/sources/govinfo-bulk.test.ts | 136 ++++ 6 files changed, 1219 insertions(+), 13 deletions(-) create mode 100644 src/sources/govinfo-bulk.ts create mode 100644 src/utils/govinfo-bulk-listing.ts create mode 100644 tests/unit/sources/govinfo-bulk.test.ts diff --git a/docs/DATA-ACQUISITION-RUNBOOK.md b/docs/DATA-ACQUISITION-RUNBOOK.md index a4bb9e1..33b4789 100644 --- a/docs/DATA-ACQUISITION-RUNBOOK.md +++ b/docs/DATA-ACQUISITION-RUNBOOK.md @@ -8,13 +8,14 @@ How to acquire, transform, and load all upstream data for the US Code as Git pro | Phase | Source | Auth | Rate Limit | Est. Time | Blocking? 
| |-------|--------|------|------------|-----------|-----------| -| 1 | OLRC (US Code XML) | None (cookies required) | None | ~20 min | No | -| 2 | VoteView (CSV) | None | None | ~10 min | No | -| 3 | Legislators (YAML) | None | None | ~10 sec | No | -| 4 | GovInfo (Public Laws) | `API_DATA_GOV_KEY` | 5,000 req/hr (shared) | Hours–days | Yes | -| 5 | Congress.gov (Bills/Members) | `API_DATA_GOV_KEY` | 5,000 req/hr (shared) | Days–weeks | Yes | +| 1 | GovInfo Bulk Data (historical ZIP/XML) | None | None (bounded locally to 2 downloads) | Hours | No | +| 2 | OLRC (US Code XML) | None (cookies required) | None | ~20 min | No | +| 3 | VoteView (CSV) | None | None | ~10 min | No | +| 4 | Legislators (YAML) | None | None | ~10 sec | No | +| 5 | GovInfo API (Public Laws incremental) | `API_DATA_GOV_KEY` | 5,000 req/hr (shared) | Hours–days | Yes | +| 6 | Congress.gov (Bills/Members incremental) | `API_DATA_GOV_KEY` | 5,000 req/hr (shared) | Days–weeks | Yes | -Phases 1–3 can run immediately with no credentials. Phases 4–5 share a single API key and rate budget. +Phases 1–4 can run immediately with no credentials. Phases 5–6 share a single API key and rate budget. --- @@ -33,7 +34,69 @@ Storage: `data/` is gitignored. All cached artifacts land in `data/cache/{source --- -## Phase 1: OLRC — US Code USLM XML +## Phase 1: GovInfo Bulk Data Repository + +**What:** Historical backfill path for GovInfo collections using anonymous bulk ZIP/XML downloads instead of API crawling. 
+ +**Source URL:** `https://www.govinfo.gov/bulkdata/` + +**CLI:** +```bash +# Default: all supported collections +node dist/index.js fetch --source=govinfo-bulk + +# Recommended first pass: BILLSTATUS only +node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS + +# Narrow to one congress +node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119 + +# Re-download a scope from scratch +node dist/index.js fetch --source=govinfo-bulk --collection=PLAW --force +``` + +### Supported collections +- `BILLSTATUS` — bill lifecycle/status XML (highest priority) +- `PLAW` — public/private law XML +- `BILLS` — full bill text XML (large) +- `BILLSUM` — bill summaries + +### Behavior +- Walks GovInfo XML directory listings under `/bulkdata/` +- Preserves remote directory structure under `data/cache/govinfo-bulk/{collection}/{congress}/...` +- Downloads anonymously; **no API key required** +- Bounds local download concurrency at 2 +- Uses manifest-backed resume semantics and skips already validated/extracted artifacts +- Validates XML payloads and rejects HTML/error bodies +- Validates ZIPs before marking them complete; `BILLSTATUS` requires parseable extracted XML + +### Recommended acquisition order +1. `BILLSTATUS` +2. `PLAW` +3. `BILLSUM` +4. 
`BILLS` + +### Cache layout +```text +data/cache/govinfo-bulk/ +├── BILLSTATUS/ +│ └── 119/hr/ +│ ├── BILLSTATUS-119hr.xml.zip +│ └── extracted/ +├── PLAW/ +├── BILLS/ +└── BILLSUM/ +``` + +### Resume / status notes +- Resume state is tracked in `data/manifest.json` under `sources["govinfo-bulk"]` +- Completed artifacts are skipped when the file still exists, sizes still match when known, and required extraction directories are present +- Use `--force` to clear manifest state for the selected `--collection` / `--congress` scope and re-fetch it +- `fetch --all` intentionally does **not** include `govinfo-bulk`; bulk fetch remains an explicit operator action + +--- + +## Phase 2: OLRC — US Code USLM XML **What:** Download all USC titles as USLM XML from the Office of the Law Revision Counsel. diff --git a/src/commands/fetch.ts b/src/commands/fetch.ts index 5548b95..766f18b 100644 --- a/src/commands/fetch.ts +++ b/src/commands/fetch.ts @@ -11,6 +11,7 @@ import { } from '../sources/olrc.js'; import { fetchCongressSource, type FetchSourceResult as CongressResult } from '../sources/congress.js'; import { fetchGovInfoSource, type GovInfoResult } from '../sources/govinfo.js'; +import { fetchGovInfoBulkSource, type GovInfoBulkResult } from '../sources/govinfo-bulk.js'; import { fetchVoteViewSource, type VoteViewResult } from '../sources/voteview.js'; import { fetchUnitedStatesSource, type UnitedStatesResult } from '../sources/unitedstates.js'; @@ -23,6 +24,7 @@ export interface FetchArgs { listVintages: boolean; vintage: string | null; allVintages: boolean; + collection: 'BILLSTATUS' | 'BILLS' | 'BILLSUM' | 'PLAW' | null; } interface ValidationError { @@ -30,7 +32,7 @@ interface ValidationError { message: string; } -type FetchResult = OlrcFetchResult | OlrcListVintagesResult | OlrcAllVintagesResult | CongressResult | GovInfoResult | VoteViewResult | UnitedStatesResult; +type FetchResult = OlrcFetchResult | OlrcListVintagesResult | OlrcAllVintagesResult | CongressResult | 
GovInfoResult | GovInfoBulkResult | VoteViewResult | UnitedStatesResult; export async function runFetchCommand(argv: string[]): Promise { const parsed = parseFetchArgs(argv); @@ -65,6 +67,7 @@ export function parseFetchArgs(argv: string[]): { ok: true; value: FetchArgs } | let listVintages = false; let vintage: string | null = null; let allVintages = false; + let collection: 'BILLSTATUS' | 'BILLS' | 'BILLSUM' | 'PLAW' | null = null; for (let index = 0; index < argv.length; index += 1) { const token = argv[index]; @@ -118,6 +121,18 @@ export function parseFetchArgs(argv: string[]): { ok: true; value: FetchArgs } | continue; } + if (token.startsWith('--collection=')) { + const candidate = token.slice('--collection='.length); + if (collection !== null) { + return invalid('--collection may only be provided once'); + } + if (candidate !== 'BILLSTATUS' && candidate !== 'BILLS' && candidate !== 'BILLSUM' && candidate !== 'PLAW') { + return invalid("--collection must be one of BILLSTATUS, BILLS, BILLSUM, or PLAW"); + } + collection = candidate; + continue; + } + if (token.startsWith('--congress=')) { const candidate = token.slice('--congress='.length); if (!/^[0-9]+$/.test(candidate)) { @@ -135,10 +150,10 @@ export function parseFetchArgs(argv: string[]): { ok: true; value: FetchArgs } | } if (status) { - if (force || all || source !== null || congress !== null || listVintages || vintage !== null || allVintages) { + if (force || all || source !== null || congress !== null || listVintages || vintage !== null || allVintages || collection !== null) { return invalid('--status cannot be combined with other fetch selectors or --force'); } - return { ok: true, value: { status, force, all, source, congress, listVintages, vintage, allVintages } }; + return { ok: true, value: { status, force, all, source, congress, listVintages, vintage, allVintages, collection } }; } const hasHistoricalOlrcSelector = listVintages || vintage !== null || allVintages; @@ -146,6 +161,10 @@ export 
function parseFetchArgs(argv: string[]): { ok: true; value: FetchArgs } | return invalid('OLRC historical selectors require --source=olrc'); } + if (collection !== null && source !== 'govinfo-bulk') { + return invalid('--collection requires --source=govinfo-bulk'); + } + if (listVintages) { if (vintage !== null || allVintages || all || status || congress !== null || force) { return invalid('--list-vintages cannot be combined with --vintage, --all-vintages, --all, --status, --congress, or --force'); @@ -172,6 +191,10 @@ export function parseFetchArgs(argv: string[]): { ok: true; value: FetchArgs } | return invalid('--all cannot be combined with --source'); } + if (all && collection !== null) { + return invalid('--all cannot be combined with --collection'); + } + if (source === 'congress' && congress === null) { return invalid('--source=congress requires --congress='); } @@ -182,7 +205,7 @@ export function parseFetchArgs(argv: string[]): { ok: true; value: FetchArgs } | return { ok: true, - value: { status, force, all, source, congress, listVintages, vintage, allVintages }, + value: { status, force, all, source, congress, listVintages, vintage, allVintages, collection }, }; } @@ -193,6 +216,7 @@ async function runAllSources(args: FetchArgs): Promise { { source: 'olrc', ok: false, requested_scope: { titles: '1..54' }, error: { code: 'upstream_request_failed', message: 'live fetch disabled in test environment' } }, { source: 'congress', ok: Boolean(args.congress !== null && process.env.API_DATA_GOV_KEY), requested_scope: { congress: args.congress ?? `93..${bulkScope.congress.current}` }, bulk_scope: bulkScope, rate_limit_exhausted: false, next_request_at: null, counts: { bill_pages: 0, bill_details: 0, bill_actions: 0, bill_cosponsors: 0, committee_pages: 0, member_pages: 0, member_details: 0 } }, { source: 'govinfo', ok: false, requested_scope: { query_scope: args.congress === null ? 
'unfiltered' : `congress=${args.congress}` }, rate_limit_exhausted: false, next_request_at: null, error: { code: 'upstream_request_failed', message: 'live fetch disabled in test environment' } }, + { source: 'govinfo-bulk', ok: false, collections: ['BILLSTATUS', 'BILLS', 'BILLSUM', 'PLAW'], congress: args.congress, discovered_congresses: [], directories_visited: 0, files_discovered: 0, files_downloaded: 0, files_skipped: 0, files_failed: 0, error: { code: 'upstream_request_failed', message: 'live fetch disabled in test environment' } }, { source: 'voteview', ok: false, requested_scope: { files: ['HSall_members.csv', 'HSall_votes.csv', 'HSall_rollcalls.csv'] }, error: { code: 'upstream_request_failed', message: 'live fetch disabled in test environment' } }, { source: 'legislators', ok: false, requested_scope: { files: ['legislators-current.yaml', 'legislators-historical.yaml', 'committees-current.yaml'] }, error: { code: 'upstream_request_failed', message: 'live fetch disabled in test environment' } }, ]; @@ -226,6 +250,24 @@ async function runSingleSource(args: FetchArgs): Promise { return fetchCongressSource({ force: args.force, congress: args.congress, mode: 'single' }); case 'govinfo': return fetchGovInfoSource({ force: args.force, congress: args.congress, mode: 'single' }); + case 'govinfo-bulk': + if (shouldUseOfflineCliFixtures()) { + return { + source: 'govinfo-bulk', + ok: false, + collection: args.collection ?? undefined, + collections: args.collection === null ? 
['BILLSTATUS', 'BILLS', 'BILLSUM', 'PLAW'] : [args.collection], + congress: args.congress, + discovered_congresses: [], + directories_visited: 0, + files_discovered: 0, + files_downloaded: 0, + files_skipped: 0, + files_failed: 0, + error: { code: 'upstream_request_failed', message: 'live fetch disabled in test environment' }, + }; + } + return fetchGovInfoBulkSource({ force: args.force, congress: args.congress, collection: args.collection }); case 'voteview': return fetchVoteViewSource({ force: args.force }); case 'legislators': @@ -246,7 +288,7 @@ function invalid(message: string): { ok: false; error: ValidationError } { } function isSourceName(value: string): value is SourceName { - return value === 'olrc' || value === 'congress' || value === 'govinfo' || value === 'voteview' || value === 'legislators'; + return value === 'olrc' || value === 'congress' || value === 'govinfo' || value === 'govinfo-bulk' || value === 'voteview' || value === 'legislators'; } function shouldUseOfflineCliFixtures(): boolean { diff --git a/src/sources/govinfo-bulk.ts b/src/sources/govinfo-bulk.ts new file mode 100644 index 0000000..d693aa9 --- /dev/null +++ b/src/sources/govinfo-bulk.ts @@ -0,0 +1,808 @@ +import { createWriteStream } from 'node:fs'; +import { access, mkdir, mkdtemp, readdir, readFile, rename, rm, stat, writeFile } from 'node:fs/promises'; +import { constants as fsConstants } from 'node:fs'; +import { dirname, relative, resolve } from 'node:path'; +import { tmpdir } from 'node:os'; +import { pipeline } from 'node:stream/promises'; +import { XMLParser } from 'fast-xml-parser'; +import yauzl, { type Entry as YauzlEntry, type ZipFile as YauzlZipFile } from 'yauzl'; +import { logNetworkEvent } from '../utils/logger.js'; +import { + readManifest, + writeManifest, + type FailureSummary, + type FetchManifest, + type SourceStatusSummary, + getDataDirectory, +} from '../utils/manifest.js'; +import { + GOVINFO_BULK_COLLECTIONS, + isGovInfoBulkCollection, + 
isAllowedGovInfoBulkUrl, + parseGovInfoBulkListing, + resolveGovInfoBulkUrl, + type GovInfoBulkCollection, + type GovInfoBulkListingEntry, +} from '../utils/govinfo-bulk-listing.js'; + +const GOVINFO_BULK_ROOT_URL = 'https://www.govinfo.gov/bulkdata/'; +const DOWNLOAD_TIMEOUT_MS = 30_000; +const MAX_CONCURRENT_DOWNLOADS = 2; +const ZIP_FILE_EXTENSIONS = ['.zip']; +const XML_FILE_EXTENSIONS = ['.xml']; +const XML_VALIDATOR = new XMLParser({ ignoreAttributes: false, attributeNamePrefix: '@_', trimValues: true }); + +export interface GovInfoBulkInvocation { + force: boolean; + congress: number | null; + collection: GovInfoBulkCollection | null; + dataDirectory?: string; + fetchImpl?: typeof fetch; +} + +export interface GovInfoBulkResult { + source: 'govinfo-bulk'; + ok: boolean; + collection?: GovInfoBulkCollection; + collections: GovInfoBulkCollection[]; + congress: number | null; + discovered_congresses: number[]; + directories_visited: number; + files_discovered: number; + files_downloaded: number; + files_skipped: number; + files_failed: number; + error?: { code: string; message: string }; +} + +interface GovInfoBulkFileState { + source_url: string; + relative_cache_path: string; + congress: number; + collection: GovInfoBulkCollection; + listing_path: string[]; + upstream_byte_size: number | null; + fetched_at: string | null; + completed_at: string | null; + download_status: 'pending' | 'downloaded' | 'extracted' | 'failed'; + validation_status: 'not_checked' | 'xml_valid' | 'zip_valid' | 'invalid_payload'; + file_kind: 'zip' | 'xml' | 'unknown'; + extraction_root: string | null; + error: FailureSummary | null; +} + +interface GovInfoBulkCongressState { + congress: number; + discovered_at: string; + completed_at: string | null; + status: 'pending' | 'partial' | 'complete' | 'failed'; + directories_visited: number; + files_discovered: number; + files_downloaded: number; + files_skipped: number; + files_failed: number; + file_keys: string[]; +} + +interface 
GovInfoBulkCollectionState { + collection: GovInfoBulkCollection; + discovered_at: string; + completed_at: string | null; + status: 'pending' | 'partial' | 'complete' | 'failed'; + discovered_congresses: number[]; + congress_runs: Record; +} + +interface GovInfoBulkCheckpointState { + selected_collections: GovInfoBulkCollection[]; + selected_congress: number | null; + pending_directory_urls: string[]; + active_file_urls: string[]; + updated_at: string; +} + +export interface GovInfoBulkManifestState extends SourceStatusSummary { + checkpoints: Record; + collections: Partial>; + files: Record; +} + +interface QueueFile { + entry: GovInfoBulkListingEntry; + collection: GovInfoBulkCollection; + congress: number; + listingPath: string[]; +} + +export async function fetchGovInfoBulkSource(invocation: GovInfoBulkInvocation): Promise { + const dataDirectory = invocation.dataDirectory ?? getDataDirectory(); + const fetchImpl = invocation.fetchImpl ?? fetch; + const selectedCollections = invocation.collection === null ? [...GOVINFO_BULK_COLLECTIONS] : [invocation.collection]; + const requestCheckpointKey = buildCheckpointKey(selectedCollections, invocation.congress); + + try { + const manifest = await readManifest(dataDirectory); + const state = ensureGovInfoBulkState(manifest); + if (invocation.force) { + clearScope(state, selectedCollections, invocation.congress); + } + + const result: GovInfoBulkResult = { + source: 'govinfo-bulk', + ok: true, + collection: invocation.collection ?? 
undefined, + collections: selectedCollections, + congress: invocation.congress, + discovered_congresses: [], + directories_visited: 0, + files_discovered: 0, + files_downloaded: 0, + files_skipped: 0, + files_failed: 0, + }; + + state.checkpoints[requestCheckpointKey] = { + selected_collections: selectedCollections, + selected_congress: invocation.congress, + pending_directory_urls: [], + active_file_urls: [], + updated_at: new Date().toISOString(), + }; + await persistGovInfoBulkState(manifest, dataDirectory, state); + + for (const collection of selectedCollections) { + const collectionRootUrl = resolveGovInfoBulkUrl(GOVINFO_BULK_ROOT_URL, `${collection}/`).toString(); + const collectionListing = await fetchListing(collectionRootUrl, fetchImpl); + const congressEntries = collectionListing.filter((entry) => entry.kind === 'directory' && /^\d+$/.test(entry.name)); + const selectedCongresses = congressEntries + .map((entry) => ({ entry, congress: Number.parseInt(entry.name, 10) })) + .filter((item) => invocation.congress === null || item.congress === invocation.congress) + .sort((left, right) => left.congress - right.congress); + + const collectionState = getOrCreateCollectionState(state, collection); + collectionState.discovered_congresses = selectedCongresses.map((item) => item.congress); + result.discovered_congresses.push(...selectedCongresses.map((item) => item.congress)); + + for (const { entry, congress } of selectedCongresses) { + const congressState = getOrCreateCongressState(collectionState, congress); + const filesToProcess = await discoverFilesForCongress({ + fetchImpl, + collection, + congress, + directoryEntry: entry, + result, + requestCheckpoint: state.checkpoints[requestCheckpointKey], + }); + congressState.files_discovered += filesToProcess.length; + await processQueue(filesToProcess, MAX_CONCURRENT_DOWNLOADS, async (item) => { + const fileKey = buildManifestFileKey(item.collection, item.congress, deriveRelativeCachePath(item.entry.url)); + 
congressState.file_keys = mergeFileKey(congressState.file_keys, fileKey); + const fileResult = await downloadBulkArtifact({ + dataDirectory, + manifest, + state, + entry: item.entry, + collection: item.collection, + congress: item.congress, + listingPath: item.listingPath, + fileKey, + force: invocation.force, + fetchImpl, + }); + if (fileResult === 'skipped') { + result.files_skipped += 1; + congressState.files_skipped += 1; + return; + } + if (fileResult === 'downloaded') { + result.files_downloaded += 1; + congressState.files_downloaded += 1; + return; + } + result.files_failed += 1; + congressState.files_failed += 1; + }); + + congressState.status = deriveCongressStatus(congressState); + congressState.completed_at = congressState.status === 'complete' ? new Date().toISOString() : congressState.completed_at; + } + + collectionState.status = deriveCollectionStatus(collectionState); + collectionState.completed_at = collectionState.status === 'complete' ? new Date().toISOString() : collectionState.completed_at; + } + + delete state.checkpoints[requestCheckpointKey]; + state.last_success_at = new Date().toISOString(); + state.last_failure = null; + await persistGovInfoBulkState(manifest, dataDirectory, state); + result.discovered_congresses = [...new Set(result.discovered_congresses)].sort((left, right) => left - right); + return result; + } catch (error) { + const manifest = await readManifest(dataDirectory); + const state = ensureGovInfoBulkState(manifest); + const normalized = normalizeGovInfoBulkError(error); + state.last_failure = { code: normalized.code, message: normalized.message }; + delete state.checkpoints[requestCheckpointKey]; + await persistGovInfoBulkState(manifest, dataDirectory, state); + return { + source: 'govinfo-bulk', + ok: false, + collection: invocation.collection ?? 
undefined, + collections: selectedCollections, + congress: invocation.congress, + discovered_congresses: [], + directories_visited: 0, + files_discovered: 0, + files_downloaded: 0, + files_skipped: 0, + files_failed: 0, + error: normalized, + }; + } +} + +async function discoverFilesForCongress(options: { + fetchImpl: typeof fetch; + collection: GovInfoBulkCollection; + congress: number; + directoryEntry: GovInfoBulkListingEntry; + result: GovInfoBulkResult; + requestCheckpoint: GovInfoBulkCheckpointState; +}): Promise { + const queue: Array<{ entry: GovInfoBulkListingEntry; listingPath: string[] }> = [{ entry: options.directoryEntry, listingPath: [] }]; + const files: QueueFile[] = []; + + while (queue.length > 0) { + const current = queue.shift(); + if (!current) { + continue; + } + options.result.directories_visited += 1; + options.requestCheckpoint.pending_directory_urls = queue.map((item) => item.entry.url); + const listing = await fetchListing(current.entry.url, options.fetchImpl); + for (const entry of listing) { + if (entry.kind === 'directory') { + queue.push({ entry, listingPath: [...current.listingPath, entry.name] }); + continue; + } + files.push({ + entry, + collection: options.collection, + congress: options.congress, + listingPath: current.listingPath, + }); + options.result.files_discovered += 1; + } + } + + return files; +} + +async function downloadBulkArtifact(options: { + dataDirectory: string; + manifest: FetchManifest; + state: GovInfoBulkManifestState; + entry: GovInfoBulkListingEntry; + collection: GovInfoBulkCollection; + congress: number; + listingPath: string[]; + fileKey: string; + force: boolean; + fetchImpl: typeof fetch; +}): Promise<'skipped' | 'downloaded' | 'failed'> { + const relativeCachePath = deriveRelativeCachePath(options.entry.url); + const targetPath = resolve(options.dataDirectory, 'cache', 'govinfo-bulk', relativeCachePath); + const initialState: GovInfoBulkFileState = { + source_url: options.entry.url, + 
relative_cache_path: relativeCachePath, + congress: options.congress, + collection: options.collection, + listing_path: options.listingPath, + upstream_byte_size: null, + fetched_at: null, + completed_at: null, + download_status: 'pending', + validation_status: 'not_checked', + file_kind: detectFileKind(options.entry.url), + extraction_root: null, + error: null, + }; + + const existing = options.state.files[options.fileKey] ?? initialState; + if (!options.force && await isResumeComplete(existing, options.dataDirectory)) { + options.state.files[options.fileKey] = existing; + return 'skipped'; + } + + options.state.files[options.fileKey] = { ...existing, download_status: 'pending', error: null }; + await persistGovInfoBulkState(options.manifest, options.dataDirectory, options.state); + + const response = await fetchFile(options.entry.url, options.fetchImpl); + const temporaryPath = `${targetPath}.tmp-${process.pid}-${Date.now()}`; + await mkdir(dirname(targetPath), { recursive: true }); + try { + const buffer = Buffer.from(await response.arrayBuffer()); + await writeFile(temporaryPath, buffer, { mode: 0o640 }); + const byteSize = Number.parseInt(response.headers.get('content-length') ?? String(buffer.byteLength), 10); + const fileKind = detectFileKind(options.entry.url); + const extractionRoot = fileKind === 'zip' ? 
resolve(dirname(targetPath), 'extracted') : null; + + if (fileKind === 'xml') { + const xml = await readFile(temporaryPath, 'utf8'); + validateXmlPayload(xml); + } else if (fileKind === 'zip') { + const tempExtractionRoot = await mkdtemp(resolve(dirname(targetPath), '.extract-')); + try { + await extractZipSafely(temporaryPath, tempExtractionRoot); + const xmlFiles = await collectXmlFiles(tempExtractionRoot); + if (xmlFiles.length === 0) { + throw new Error('invalid_payload: ZIP file contained no XML artifacts'); + } + if (options.collection === 'BILLSTATUS') { + const sampleXml = await readFile(xmlFiles[0], 'utf8'); + validateXmlPayload(sampleXml); + } + if (extractionRoot !== null) { + await rm(extractionRoot, { recursive: true, force: true }); + await rename(tempExtractionRoot, extractionRoot); + } + } catch (error) { + await rm(tempExtractionRoot, { recursive: true, force: true }); + throw error; + } + } else { + const payload = await readFile(temporaryPath, 'utf8'); + validateXmlPayload(payload); + } + + await rename(temporaryPath, targetPath); + options.state.files[options.fileKey] = { + ...initialState, + source_url: options.entry.url, + relative_cache_path: relativeCachePath, + congress: options.congress, + collection: options.collection, + listing_path: options.listingPath, + upstream_byte_size: Number.isFinite(byteSize) && byteSize > 0 ? byteSize : null, + fetched_at: new Date().toISOString(), + completed_at: new Date().toISOString(), + download_status: fileKind === 'zip' ? 'extracted' : 'downloaded', + validation_status: fileKind === 'zip' ? 'zip_valid' : 'xml_valid', + file_kind: fileKind, + extraction_root: fileKind === 'zip' ? relative(options.dataDirectory, extractionRoot ?? 
dirname(targetPath)) : null, + error: null, + }; + await persistGovInfoBulkState(options.manifest, options.dataDirectory, options.state); + return 'downloaded'; + } catch (error) { + await rm(temporaryPath, { force: true }); + options.state.files[options.fileKey] = { + ...initialState, + source_url: options.entry.url, + relative_cache_path: relativeCachePath, + congress: options.congress, + collection: options.collection, + listing_path: options.listingPath, + download_status: 'failed', + validation_status: 'invalid_payload', + error: normalizeGovInfoBulkError(error), + }; + await persistGovInfoBulkState(options.manifest, options.dataDirectory, options.state); + return 'failed'; + } +} + +async function fetchListing(url: string, fetchImpl: typeof fetch): Promise { + const response = await fetchText(url, fetchImpl, 'govinfo-bulk'); + return parseGovInfoBulkListing(response.body, url).filter((entry) => isAllowedGovInfoBulkUrl(new URL(entry.url))); +} + +async function fetchFile(url: string, fetchImpl: typeof fetch): Promise { + const startedAt = Date.now(); + const controller = new AbortController(); + const timeout = setTimeout(() => controller.abort(), DOWNLOAD_TIMEOUT_MS); + try { + const response = await fetchImpl(url, { signal: controller.signal, redirect: 'follow' }); + logNetworkEvent({ level: response.ok ? 
'info' : 'error', event: 'network.request', source: 'govinfo-bulk', method: 'GET', url, attempt: 1, cache_status: 'miss', duration_ms: Date.now() - startedAt, status_code: response.status }); + if (!response.ok || response.body === null) { + throw new Error(`upstream_request_failed: GovInfo bulk file request failed with HTTP ${response.status}`); + } + const finalUrl = new URL(response.url || url); + if (!isAllowedGovInfoBulkUrl(finalUrl)) { + throw new Error(`upstream_request_failed: GovInfo bulk file redirected outside allowed scope to ${finalUrl.toString()}`); + } + return response; + } finally { + clearTimeout(timeout); + } +} + +async function fetchText(url: string, fetchImpl: typeof fetch, source: string): Promise<{ body: string }> { + const startedAt = Date.now(); + const controller = new AbortController(); + const timeout = setTimeout(() => controller.abort(), DOWNLOAD_TIMEOUT_MS); + try { + const response = await fetchImpl(url, { signal: controller.signal, redirect: 'follow' }); + const body = await response.text(); + logNetworkEvent({ level: response.ok ? 
'info' : 'error', event: 'network.request', source, method: 'GET', url, attempt: 1, cache_status: 'miss', duration_ms: Date.now() - startedAt, status_code: response.status }); + if (!response.ok) { + throw new Error(`upstream_request_failed: GovInfo bulk listing request failed with HTTP ${response.status}`); + } + const finalUrl = new URL(response.url || url); + if (!isAllowedGovInfoBulkUrl(finalUrl)) { + throw new Error(`upstream_request_failed: GovInfo bulk listing redirected outside allowed scope to ${finalUrl.toString()}`); + } + return { body }; + } finally { + clearTimeout(timeout); + } +} + +function ensureGovInfoBulkState(manifest: FetchManifest): GovInfoBulkManifestState { + const candidate = (manifest.sources as FetchManifest['sources'] & { 'govinfo-bulk'?: GovInfoBulkManifestState })['govinfo-bulk']; + if (candidate) { + return candidate; + } + const created = createEmptyGovInfoBulkState(); + (manifest.sources as FetchManifest['sources'] & { 'govinfo-bulk': GovInfoBulkManifestState })['govinfo-bulk'] = created; + return created; +} + +function createEmptyGovInfoBulkState(): GovInfoBulkManifestState { + return { + last_success_at: null, + last_failure: null, + checkpoints: {}, + collections: {}, + files: {}, + }; +} + +function getOrCreateCollectionState(state: GovInfoBulkManifestState, collection: GovInfoBulkCollection): GovInfoBulkCollectionState { + const existing = state.collections[collection]; + if (existing) { + return existing; + } + const created: GovInfoBulkCollectionState = { + collection, + discovered_at: new Date().toISOString(), + completed_at: null, + status: 'pending', + discovered_congresses: [], + congress_runs: {}, + }; + state.collections[collection] = created; + return created; +} + +function getOrCreateCongressState(collectionState: GovInfoBulkCollectionState, congress: number): GovInfoBulkCongressState { + const key = String(congress); + const existing = collectionState.congress_runs[key]; + if (existing) { + return existing; + } + 
const created: GovInfoBulkCongressState = { + congress, + discovered_at: new Date().toISOString(), + completed_at: null, + status: 'pending', + directories_visited: 0, + files_discovered: 0, + files_downloaded: 0, + files_skipped: 0, + files_failed: 0, + file_keys: [], + }; + collectionState.congress_runs[key] = created; + return created; +} + +async function persistGovInfoBulkState(manifest: FetchManifest, dataDirectory: string, state: GovInfoBulkManifestState): Promise { + (manifest.sources as FetchManifest['sources'] & { 'govinfo-bulk': GovInfoBulkManifestState })['govinfo-bulk'] = state; + await writeManifest(manifest, dataDirectory); +} + +function deriveCongressStatus(state: GovInfoBulkCongressState): GovInfoBulkCongressState['status'] { + if (state.files_failed > 0 && state.files_downloaded === 0 && state.files_skipped === 0) { + return 'failed'; + } + if (state.files_failed > 0) { + return 'partial'; + } + if (state.files_discovered > 0) { + return 'complete'; + } + return 'pending'; +} + +function deriveCollectionStatus(state: GovInfoBulkCollectionState): GovInfoBulkCollectionState['status'] { + const congressStates = Object.values(state.congress_runs); + if (congressStates.length === 0) { + return 'pending'; + } + if (congressStates.every((entry) => entry.status === 'complete')) { + return 'complete'; + } + if (congressStates.some((entry) => entry.status === 'failed' || entry.status === 'partial')) { + return 'partial'; + } + return 'pending'; +} + +function buildCheckpointKey(collections: GovInfoBulkCollection[], congress: number | null): string { + return `${collections.join(',')}:${congress ?? 
'all'}`; +} + +function clearScope(state: GovInfoBulkManifestState, collections: GovInfoBulkCollection[], congress: number | null): void { + for (const [fileKey, entry] of Object.entries(state.files)) { + if (!collections.includes(entry.collection)) { + continue; + } + if (congress !== null && entry.congress !== congress) { + continue; + } + delete state.files[fileKey]; + } +} + +function buildManifestFileKey(collection: GovInfoBulkCollection, congress: number, relativeCachePath: string): string { + return `${collection}:${congress}:${relativeCachePath}`; +} + +function deriveRelativeCachePath(url: string): string { + const parsed = new URL(url); + return parsed.pathname.replace(/^\/bulkdata\//, '').replace(/^\/+/, ''); +} + +function detectFileKind(url: string): GovInfoBulkFileState['file_kind'] { + const lower = url.toLowerCase(); + if (ZIP_FILE_EXTENSIONS.some((extension) => lower.endsWith(extension))) { + return 'zip'; + } + if (XML_FILE_EXTENSIONS.some((extension) => lower.endsWith(extension))) { + return 'xml'; + } + return 'unknown'; +} + +async function isResumeComplete(entry: GovInfoBulkFileState, dataDirectory: string): Promise { + if ((entry.download_status !== 'downloaded' && entry.download_status !== 'extracted') || entry.completed_at === null) { + return false; + } + + const targetPath = resolve(dataDirectory, 'cache', 'govinfo-bulk', entry.relative_cache_path); + try { + const targetStat = await stat(targetPath); + if (entry.upstream_byte_size !== null && targetStat.size !== entry.upstream_byte_size) { + return false; + } + } catch { + return false; + } + + if (entry.file_kind === 'zip' && entry.extraction_root !== null) { + try { + await access(resolve(dataDirectory, entry.extraction_root), fsConstants.F_OK); + } catch { + return false; + } + } + + return entry.validation_status !== 'invalid_payload'; +} + +async function collectXmlFiles(root: string): Promise { + const files: string[] = []; + const stack = [root]; + while (stack.length > 0) { + 
const current = stack.pop(); + if (!current) { + continue; + } + const entries = await readdir(current, { withFileTypes: true }); + for (const entry of entries) { + const nextPath = resolve(current, entry.name); + if (entry.isDirectory()) { + stack.push(nextPath); + } else if (entry.isFile() && entry.name.toLowerCase().endsWith('.xml')) { + files.push(nextPath); + } + } + } + return files.sort(); +} + +function validateXmlPayload(xml: string): void { + const trimmed = xml.trimStart(); + if (trimmed.startsWith(' { + await mkdir(extractionRoot, { recursive: true }); + const zip = await openZip(zipPath); + try { + await new Promise((resolvePromise, rejectPromise) => { + zip.readEntry(); + zip.on('entry', (entry: YauzlEntry) => { + const normalized = entry.fileName.replace(/\\/g, '/'); + if (normalized.startsWith('/') || normalized.split('/').includes('..')) { + rejectPromise(new Error(`invalid_payload: ZIP entry escapes extraction root (${entry.fileName})`)); + return; + } + const destination = resolve(extractionRoot, normalized); + if (!destination.startsWith(extractionRoot)) { + rejectPromise(new Error(`invalid_payload: ZIP entry resolves outside extraction root (${entry.fileName})`)); + return; + } + if (/\/$/.test(entry.fileName)) { + mkdir(destination, { recursive: true }).then(() => zip.readEntry(), rejectPromise); + return; + } + mkdir(dirname(destination), { recursive: true }) + .then(() => openZipReadStream(zip, entry)) + .then(async (stream) => { + await pipeline(stream, createWriteStream(destination, { mode: 0o640 })); + zip.readEntry(); + }) + .catch(rejectPromise); + }); + zip.once('end', () => resolvePromise()); + zip.once('error', rejectPromise); + }); + } finally { + zip.close(); + } +} + +function openZip(path: string): Promise { + return new Promise((resolvePromise, rejectPromise) => { + yauzl.open(path, { lazyEntries: true }, (error, zip) => { + if (error || zip === undefined) { + rejectPromise(error ?? 
new Error('Failed to open ZIP archive')); + return; + } + resolvePromise(zip); + }); + }); +} + +function openZipReadStream(zip: YauzlZipFile, entry: YauzlEntry): Promise<NodeJS.ReadableStream> { + return new Promise((resolvePromise, rejectPromise) => { + zip.openReadStream(entry, (error: Error | null, stream?: NodeJS.ReadableStream) => { + if (error || stream === undefined) { + rejectPromise(error ?? new Error(`Failed to read ZIP entry ${entry.fileName}`)); + return; + } + resolvePromise(stream); + }); + }); +} + +async function processQueue<T>(items: T[], concurrency: number, worker: (item: T) => Promise<void>): Promise<void> { + const queue = [...items]; + const workers = Array.from({ length: Math.min(concurrency, queue.length) }, async () => { + while (queue.length > 0) { + const item = queue.shift(); + if (item === undefined) { + continue; + } + await worker(item); + } + }); + await Promise.all(workers); +} + +function mergeFileKey(existing: string[], fileKey: string): string[] { + return existing.includes(fileKey) ? existing : [...existing, fileKey]; +} + +function normalizeGovInfoBulkError(error: unknown): { code: string; message: string } { + if (error instanceof Error) { + if (error.message.startsWith('invalid_')) { + const [code, ...rest] = error.message.split(':'); + return { code, message: rest.join(':').trim() || error.message }; + } + if (error.message.startsWith('upstream_request_failed:')) { + return { code: 'upstream_request_failed', message: error.message }; + } + return { code: 'upstream_request_failed', message: error.message }; + } + return { code: 'upstream_request_failed', message: 'GovInfo bulk fetch failed' }; +} + +export function normalizeGovInfoBulkManifestState(value: unknown): GovInfoBulkManifestState { + if (!isRecord(value)) { + return createEmptyGovInfoBulkState(); + } + + const collections: Partial<Record<GovInfoBulkCollection, GovInfoBulkCollectionState>> = {}; + if (isRecord(value.collections)) { + for (const [key, entry] of Object.entries(value.collections)) { + if (!isGovInfoBulkCollection(key) || !isRecord(entry)) { +
continue; + } + collections[key] = { + collection: key, + discovered_at: typeof entry.discovered_at === 'string' ? entry.discovered_at : new Date(0).toISOString(), + completed_at: typeof entry.completed_at === 'string' || entry.completed_at === null ? entry.completed_at : null, + status: entry.status === 'pending' || entry.status === 'partial' || entry.status === 'complete' || entry.status === 'failed' ? entry.status : 'pending', + discovered_congresses: Array.isArray(entry.discovered_congresses) + ? entry.discovered_congresses.filter((candidate): candidate is number => typeof candidate === 'number' && Number.isSafeInteger(candidate) && candidate > 0) + : [], + congress_runs: normalizeCongressRuns(entry.congress_runs), + }; + } + } + + const files: Record = {}; + if (isRecord(value.files)) { + for (const [key, entry] of Object.entries(value.files)) { + if (!isRecord(entry) || !isGovInfoBulkCollection(String(entry.collection ?? ''))) { + continue; + } + const normalizedCollection = typeof entry.collection === 'string' ? entry.collection : null; + if (normalizedCollection === null || !isGovInfoBulkCollection(normalizedCollection)) { + continue; + } + files[key] = { + source_url: typeof entry.source_url === 'string' ? entry.source_url : '', + relative_cache_path: typeof entry.relative_cache_path === 'string' ? entry.relative_cache_path : '', + congress: typeof entry.congress === 'number' ? entry.congress : 0, + collection: normalizedCollection, + listing_path: Array.isArray(entry.listing_path) ? entry.listing_path.filter((item): item is string => typeof item === 'string') : [], + upstream_byte_size: typeof entry.upstream_byte_size === 'number' ? entry.upstream_byte_size : null, + fetched_at: typeof entry.fetched_at === 'string' || entry.fetched_at === null ? entry.fetched_at : null, + completed_at: typeof entry.completed_at === 'string' || entry.completed_at === null ? 
entry.completed_at : null, + download_status: entry.download_status === 'pending' || entry.download_status === 'downloaded' || entry.download_status === 'extracted' || entry.download_status === 'failed' ? entry.download_status : 'pending', + validation_status: entry.validation_status === 'not_checked' || entry.validation_status === 'xml_valid' || entry.validation_status === 'zip_valid' || entry.validation_status === 'invalid_payload' ? entry.validation_status : 'not_checked', + file_kind: entry.file_kind === 'zip' || entry.file_kind === 'xml' || entry.file_kind === 'unknown' ? entry.file_kind : 'unknown', + extraction_root: typeof entry.extraction_root === 'string' || entry.extraction_root === null ? entry.extraction_root : null, + error: isFailureSummary(entry.error) ? entry.error : null, + }; + } + } + + return { + last_success_at: typeof value.last_success_at === 'string' || value.last_success_at === null ? value.last_success_at : null, + last_failure: isFailureSummary(value.last_failure) ? value.last_failure : null, + checkpoints: {}, + collections, + files, + }; +} + +function normalizeCongressRuns(value: unknown): Record { + if (!isRecord(value)) { + return {}; + } + const normalized: Record = {}; + for (const [key, entry] of Object.entries(value)) { + if (!isRecord(entry)) { + continue; + } + normalized[key] = { + congress: typeof entry.congress === 'number' ? entry.congress : Number.parseInt(key, 10), + discovered_at: typeof entry.discovered_at === 'string' ? entry.discovered_at : new Date(0).toISOString(), + completed_at: typeof entry.completed_at === 'string' || entry.completed_at === null ? entry.completed_at : null, + status: entry.status === 'pending' || entry.status === 'partial' || entry.status === 'complete' || entry.status === 'failed' ? entry.status : 'pending', + directories_visited: typeof entry.directories_visited === 'number' ? entry.directories_visited : 0, + files_discovered: typeof entry.files_discovered === 'number' ? 
entry.files_discovered : 0, + files_downloaded: typeof entry.files_downloaded === 'number' ? entry.files_downloaded : 0, + files_skipped: typeof entry.files_skipped === 'number' ? entry.files_skipped : 0, + files_failed: typeof entry.files_failed === 'number' ? entry.files_failed : 0, + file_keys: Array.isArray(entry.file_keys) ? entry.file_keys.filter((item): item is string => typeof item === 'string') : [], + }; + } + return normalized; +} + +function isFailureSummary(value: unknown): value is FailureSummary { + return Boolean(value && typeof value === 'object' && 'code' in value && 'message' in value && typeof value.code === 'string' && typeof value.message === 'string'); +} + +function isRecord(value: unknown): value is Record { + return Boolean(value && typeof value === 'object' && !Array.isArray(value)); +} diff --git a/src/utils/govinfo-bulk-listing.ts b/src/utils/govinfo-bulk-listing.ts new file mode 100644 index 0000000..e44b0e5 --- /dev/null +++ b/src/utils/govinfo-bulk-listing.ts @@ -0,0 +1,152 @@ +import { XMLParser } from 'fast-xml-parser'; + +export type GovInfoBulkCollection = 'BILLSTATUS' | 'BILLS' | 'BILLSUM' | 'PLAW'; + +export interface GovInfoBulkListingEntry { + name: string; + href: string; + url: string; + kind: 'directory' | 'file'; +} + +const GOVINFO_BULK_ORIGIN = 'https://www.govinfo.gov'; +const GOVINFO_BULK_PREFIX = '/bulkdata/'; +const LISTING_PARSER = new XMLParser({ + ignoreAttributes: false, + attributeNamePrefix: '', + trimValues: true, +}); + +export const GOVINFO_BULK_COLLECTIONS: GovInfoBulkCollection[] = ['BILLSTATUS', 'BILLS', 'BILLSUM', 'PLAW']; + +export function isGovInfoBulkCollection(value: string): value is GovInfoBulkCollection { + return GOVINFO_BULK_COLLECTIONS.includes(value as GovInfoBulkCollection); +} + +export function parseGovInfoBulkListing(xml: string, baseUrl: string): GovInfoBulkListingEntry[] { + const trimmed = xml.trimStart(); + if (trimmed.startsWith('): 'directory' | 'file' { + if 
(entry.href.endsWith('/') || entry.name.endsWith('/') || new URL(entry.url).pathname.endsWith('/')) { + return 'directory'; + } + return 'file'; +} + +export function resolveGovInfoBulkUrl(baseUrl: string, href: string): URL { + const resolved = new URL(href, baseUrl); + if (!isAllowedGovInfoBulkUrl(resolved)) { + throw new Error(`invalid_listing_url: disallowed GovInfo bulk URL '${resolved.toString()}'`); + } + return resolved; +} + +export function isAllowedGovInfoBulkUrl(url: URL): boolean { + return url.protocol === 'https:' && url.origin === GOVINFO_BULK_ORIGIN && url.pathname.startsWith(GOVINFO_BULK_PREFIX); +} + +function findEntriesRoot(parsed: unknown): unknown { + if (!isRecord(parsed)) { + throw new Error('invalid_listing_payload: listing XML did not parse into an object'); + } + + return parsed.directory ?? parsed.listing ?? parsed.files ?? parsed; +} + +function collectRawEntries(root: unknown): Array<Record<string, unknown>> { + if (!isRecord(root)) { + return []; + } + + const directCandidates = [root.entry, root.item, root.directory, root.file, root.entries, root.items]; + for (const candidate of directCandidates) { + const normalized = normalizeCandidateEntries(candidate); + if (normalized.length > 0) { + return normalized; + } + } + + const recursive: Array<Record<string, unknown>> = []; + for (const value of Object.values(root)) { + recursive.push(...normalizeCandidateEntries(value)); + } + return recursive; +} + +function normalizeCandidateEntries(candidate: unknown): Array<Record<string, unknown>> { + if (Array.isArray(candidate)) { + return candidate.filter(isRecord); + } + return isRecord(candidate) ?
[candidate] : []; +} + +function readStringField(entry: Record<string, unknown>, fields: string[]): string | null { + for (const field of fields) { + const value = entry[field]; + if (typeof value === 'string' && value.trim().length > 0) { + return value.trim(); + } + } + return null; +} + +function normalizeListingName(name: string | null, url: URL, href: string): string { + if (name !== null) { + return name.endsWith('/') || href.endsWith('/') ? name.replace(/\/+$/, '') : name; + } + + const pathname = url.pathname.replace(/\/+$/, ''); + const lastSegment = pathname.split('/').filter((segment) => segment.length > 0).at(-1); + return lastSegment ?? pathname; +} + +function dedupeEntries(entries: GovInfoBulkListingEntry[]): GovInfoBulkListingEntry[] { + const seen = new Set(); + const deduped: GovInfoBulkListingEntry[] = []; + for (const entry of entries) { + if (seen.has(entry.url)) { + continue; + } + seen.add(entry.url); + deduped.push(entry); + } + return deduped; +} + +function isRecord(value: unknown): value is Record<string, unknown> { + return Boolean(value && typeof value === 'object' && !Array.isArray(value)); +} diff --git a/src/utils/manifest.ts b/src/utils/manifest.ts index 297c1ce..8e92974 100644 --- a/src/utils/manifest.ts +++ b/src/utils/manifest.ts @@ -2,7 +2,9 @@ import { access, mkdir, readFile, rename, writeFile } from 'node:fs/promises'; import { constants as fsConstants } from 'node:fs'; import { dirname, resolve } from 'node:path'; -export type SourceName = 'olrc' | 'congress' | 'govinfo' | 'voteview' | 'legislators'; +import { normalizeGovInfoBulkManifestState, type GovInfoBulkManifestState } from '../sources/govinfo-bulk.js'; + +export type SourceName = 'olrc' | 'congress' | 'govinfo' | 'govinfo-bulk' | 'voteview' | 'legislators'; export interface FailureSummary { code: string; @@ -162,6 +164,7 @@ export interface FetchManifest { olrc: OlrcManifestState; congress: CongressManifestState; govinfo: GovInfoManifestState; + 'govinfo-bulk': GovInfoBulkManifestState; voteview:
SourceStatusSummary & { files?: Record; indexes?: unknown[] }; legislators: LegislatorsManifestState; }; @@ -197,6 +200,7 @@ export function createEmptyManifest(): FetchManifest { query_scopes: {}, checkpoints: {}, }, + 'govinfo-bulk': normalizeGovInfoBulkManifestState(null), voteview: { last_success_at: null, last_failure: null, files: {}, indexes: [] }, legislators: { last_success_at: null, @@ -280,6 +284,7 @@ function normalizeManifest(parsed: Partial): FetchManifest { olrc: normalizeOlrcState(parsed.sources.olrc), congress: normalizeCongressState(parsed.sources.congress), govinfo: normalizeGovInfoState(parsed.sources.govinfo), + 'govinfo-bulk': normalizeGovInfoBulkManifestState((parsed.sources as Record)['govinfo-bulk']), voteview: normalizeVoteviewState(parsed.sources.voteview), legislators: normalizeLegislatorsState(parsed.sources.legislators), }, diff --git a/tests/unit/sources/govinfo-bulk.test.ts b/tests/unit/sources/govinfo-bulk.test.ts new file mode 100644 index 0000000..4316f90 --- /dev/null +++ b/tests/unit/sources/govinfo-bulk.test.ts @@ -0,0 +1,136 @@ +import { describe, expect, it } from 'vitest'; +import { mkdtempSync, writeFileSync, rmSync, existsSync, readFileSync } from 'node:fs'; +import { execFileSync } from 'node:child_process'; +import { tmpdir } from 'node:os'; +import { join, resolve } from 'node:path'; +import { readManifest } from '../../../src/utils/manifest.js'; +import { fetchGovInfoBulkSource } from '../../../src/sources/govinfo-bulk.js'; +import { parseGovInfoBulkListing, resolveGovInfoBulkUrl } from '../../../src/utils/govinfo-bulk-listing.js'; + +const BILLSTATUS_XML = '119'; + +function response(body: string | Buffer, contentType: string): Response { + return new Response(body, { status: 200, headers: { 'content-type': contentType, 'content-length': String(Buffer.byteLength(body)) } }); +} + +function createZipBytes(fileName: string, contents: string): Buffer { + const root = mkdtempSync(join(tmpdir(), 'govinfo-bulk-zip-')); + 
const sourcePath = resolve(root, fileName); + const archivePath = resolve(root, 'archive.zip'); + writeFileSync(sourcePath, contents, 'utf8'); + execFileSync('zip', ['-q', archivePath, fileName], { cwd: root }); + const bytes = readFileSync(archivePath); + rmSync(root, { recursive: true, force: true }); + return bytes; +} + +describe('govinfo bulk utilities', () => { + it('parses XML directory listings and resolves directory/file entries', () => { + const entries = parseGovInfoBulkListing( + `119119/BILLSTATUS-119hr.xml.zip119/hr/BILLSTATUS-119hr.xml.zip`, + 'https://www.govinfo.gov/bulkdata/BILLSTATUS/', + ); + + expect(entries).toEqual([ + { + name: '119', + href: '119/', + url: 'https://www.govinfo.gov/bulkdata/BILLSTATUS/119/', + kind: 'directory', + }, + { + name: 'BILLSTATUS-119hr.xml.zip', + href: '119/hr/BILLSTATUS-119hr.xml.zip', + url: 'https://www.govinfo.gov/bulkdata/BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip', + kind: 'file', + }, + ]); + expect(() => resolveGovInfoBulkUrl('https://www.govinfo.gov/bulkdata/', 'https://example.com/evil.xml')).toThrow(/disallowed/i); + }); + + it('downloads, extracts, and records manifest-backed resume state for a BILLSTATUS artifact', async () => { + const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-')); + const zipBytes = createZipBytes('bill.xml', BILLSTATUS_XML); + const fetchImpl: typeof fetch = async (input) => { + const url = typeof input === 'string' ? 
input : input.toString(); + if (url === 'https://www.govinfo.gov/bulkdata/BILLSTATUS/') { + return response('119119/', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSTATUS/119/') { + return response('hrhr/', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSTATUS/119/hr/') { + return response('BILLSTATUS-119hr.xml.zipBILLSTATUS-119hr.xml.zip', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip') { + return response(zipBytes, 'application/zip'); + } + throw new Error(`Unexpected URL: ${url}`); + }; + + try { + const firstRun = await fetchGovInfoBulkSource({ + force: false, + congress: 119, + collection: 'BILLSTATUS', + dataDirectory, + fetchImpl, + }); + + expect(firstRun.ok).toBe(true); + expect(firstRun.files_downloaded).toBe(1); + expect(firstRun.files_skipped).toBe(0); + expect(existsSync(resolve(dataDirectory, 'cache/govinfo-bulk/BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip'))).toBe(true); + expect(existsSync(resolve(dataDirectory, 'cache/govinfo-bulk/BILLSTATUS/119/hr/extracted/bill.xml'))).toBe(true); + + const manifest = await readManifest(dataDirectory); + const state = (manifest.sources as typeof manifest.sources & { 'govinfo-bulk': { files: Record } })['govinfo-bulk']; + const fileEntry = Object.values(state.files)[0]; + expect(fileEntry.download_status).toBe('extracted'); + expect(fileEntry.validation_status).toBe('zip_valid'); + + const secondRun = await fetchGovInfoBulkSource({ + force: false, + congress: 119, + collection: 'BILLSTATUS', + dataDirectory, + fetchImpl, + }); + expect(secondRun.files_downloaded).toBe(0); + expect(secondRun.files_skipped).toBe(1); + } finally { + rmSync(dataDirectory, { recursive: true, force: true }); + } + }); + + it('rejects HTML payloads instead of marking artifacts complete', async () => { + const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-html-')); + const fetchImpl: typeof fetch = 
async (input) => { + const url = typeof input === 'string' ? input : input.toString(); + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/') { + return response('119119/', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/') { + return response('summaries.xmlsummaries.xml', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/summaries.xml') { + return response('not xml', 'text/html'); + } + throw new Error(`Unexpected URL: ${url}`); + }; + + try { + const result = await fetchGovInfoBulkSource({ force: false, congress: 119, collection: 'BILLSUM', dataDirectory, fetchImpl }); + expect(result.ok).toBe(true); + expect(result.files_failed).toBe(1); + + const manifest = await readManifest(dataDirectory); + const state = (manifest.sources as typeof manifest.sources & { 'govinfo-bulk': { files: Record } })['govinfo-bulk']; + const fileEntry = Object.values(state.files)[0]; + expect(fileEntry.validation_status).toBe('invalid_payload'); + expect(fileEntry.download_status).toBe('failed'); + } finally { + rmSync(dataDirectory, { recursive: true, force: true }); + } + }); +}); From ea8bfde681f5c0b1f17434a26f06d84feed65e27 Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:46:12 -0400 Subject: [PATCH 06/10] test: add govinfo-bulk adversary regressions for #40 --- tests/unit/sources/govinfo-bulk.test.ts | 164 +++++++++++++++++++++++- 1 file changed, 163 insertions(+), 1 deletion(-) diff --git a/tests/unit/sources/govinfo-bulk.test.ts b/tests/unit/sources/govinfo-bulk.test.ts index 4316f90..765bc4a 100644 --- a/tests/unit/sources/govinfo-bulk.test.ts +++ b/tests/unit/sources/govinfo-bulk.test.ts @@ -3,7 +3,7 @@ import { mkdtempSync, writeFileSync, rmSync, existsSync, readFileSync } from 'no import { execFileSync } from 'node:child_process'; import { tmpdir } from 'node:os'; import { join, resolve } from 'node:path'; -import { readManifest } from '../../../src/utils/manifest.js'; +import { 
readManifest, writeManifest } from '../../../src/utils/manifest.js'; import { fetchGovInfoBulkSource } from '../../../src/sources/govinfo-bulk.js'; import { parseGovInfoBulkListing, resolveGovInfoBulkUrl } from '../../../src/utils/govinfo-bulk-listing.js'; @@ -133,4 +133,166 @@ describe('govinfo bulk utilities', () => { rmSync(dataDirectory, { recursive: true, force: true }); } }); + + it('merges manifest writes from stale snapshots instead of dropping another writer\'s completed file state', async () => { + const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-manifest-race-')); + + try { + const baseManifest = await readManifest(dataDirectory); + const writerOneManifest = structuredClone(baseManifest); + const writerTwoManifest = structuredClone(baseManifest); + + const writerOneState = (writerOneManifest.sources as typeof writerOneManifest.sources & { + 'govinfo-bulk': { files: Record }; + })['govinfo-bulk']; + writerOneState.files['BILLSTATUS:119:hr/BILLSTATUS-119hr.xml.zip'] = { + source_url: 'https://www.govinfo.gov/bulkdata/BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip', + relative_cache_path: 'BILLSTATUS/119/hr/BILLSTATUS-119hr.xml.zip', + congress: 119, + collection: 'BILLSTATUS', + listing_path: ['BILLSTATUS', '119', 'hr'], + upstream_byte_size: 123, + fetched_at: '2026-04-03T15:00:00.000Z', + completed_at: '2026-04-03T15:00:01.000Z', + download_status: 'extracted', + validation_status: 'zip_valid', + file_kind: 'zip', + extraction_root: 'cache/govinfo-bulk/BILLSTATUS/119/hr/extracted', + error: null, + }; + + const writerTwoState = (writerTwoManifest.sources as typeof writerTwoManifest.sources & { + 'govinfo-bulk': { files: Record }; + })['govinfo-bulk']; + writerTwoState.files['PLAW:118:public/PLAW-118publ1.xml'] = { + source_url: 'https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ1.xml', + relative_cache_path: 'PLAW/118/public/PLAW-118publ1.xml', + congress: 118, + collection: 'PLAW', + listing_path: ['PLAW', '118', 'public'], + 
upstream_byte_size: 456, + fetched_at: '2026-04-03T15:00:02.000Z', + completed_at: '2026-04-03T15:00:03.000Z', + download_status: 'downloaded', + validation_status: 'xml_valid', + file_kind: 'xml', + extraction_root: null, + error: null, + }; + + await writeManifest(writerOneManifest, dataDirectory); + await writeManifest(writerTwoManifest, dataDirectory); + + const mergedManifest = await readManifest(dataDirectory); + const mergedState = (mergedManifest.sources as typeof mergedManifest.sources & { + 'govinfo-bulk': { files: Record }; + })['govinfo-bulk']; + + expect(Object.keys(mergedState.files).sort()).toEqual([ + 'BILLSTATUS:119:hr/BILLSTATUS-119hr.xml.zip', + 'PLAW:118:public/PLAW-118publ1.xml', + ]); + } finally { + rmSync(dataDirectory, { recursive: true, force: true }); + } + }); + + it('streams file downloads to disk instead of requiring response.arrayBuffer()', async () => { + const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-streaming-')); + const streamedXml = 'streamed'; + + const fetchImpl: typeof fetch = async (input) => { + const url = typeof input === 'string' ? 
input : input.toString(); + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/') { + return response('119119/', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/') { + return response('summaries.xmlsummaries.xml', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/summaries.xml') { + const streamedResponse = new Response(new ReadableStream({ + start(controller) { + controller.enqueue(new TextEncoder().encode(streamedXml)); + controller.close(); + }, + }), { + status: 200, + headers: { + 'content-type': 'application/xml', + 'content-length': String(Buffer.byteLength(streamedXml)), + }, + }); + Object.defineProperty(streamedResponse, 'arrayBuffer', { + value: async () => { + throw new Error('download path must stream instead of buffering'); + }, + }); + return streamedResponse; + } + throw new Error(`Unexpected URL: ${url}`); + }; + + try { + const result = await fetchGovInfoBulkSource({ force: false, congress: 119, collection: 'BILLSUM', dataDirectory, fetchImpl }); + expect(result.ok).toBe(true); + expect(result.files_downloaded).toBe(1); + expect(readFileSync(resolve(dataDirectory, 'cache/govinfo-bulk/BILLSUM/119/summaries.xml'), 'utf8')).toContain('streamed'); + } finally { + rmSync(dataDirectory, { recursive: true, force: true }); + } + }); + + it('keeps the first completed artifact when overlapping runs target the same file', async () => { + const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-overlap-')); + const firstXml = 'first'; + const secondXml = 'second'; + let releaseFirstDownload: (() => void) | null = null; + const firstDownloadReady = new Promise((resolveReady) => { + releaseFirstDownload = resolveReady; + }); + let fileRequestCount = 0; + + const fetchImpl: typeof fetch = async (input) => { + const url = typeof input === 'string' ? 
input : input.toString(); + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/') { + return response('119119/', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/') { + return response('summaries.xmlsummaries.xml', 'application/xml'); + } + if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/summaries.xml') { + fileRequestCount += 1; + if (fileRequestCount === 1) { + await firstDownloadReady; + return response(firstXml, 'application/xml'); + } + return response(secondXml, 'application/xml'); + } + throw new Error(`Unexpected URL: ${url}`); + }; + + try { + const firstRun = fetchGovInfoBulkSource({ force: false, congress: 119, collection: 'BILLSUM', dataDirectory, fetchImpl }); + await new Promise((resolveDelay) => setTimeout(resolveDelay, 10)); + const secondRun = fetchGovInfoBulkSource({ force: false, congress: 119, collection: 'BILLSUM', dataDirectory, fetchImpl }); + releaseFirstDownload?.(); + + const [firstResult, secondResult] = await Promise.all([firstRun, secondRun]); + expect(firstResult.ok).toBe(true); + expect(secondResult.ok).toBe(true); + + const finalPath = resolve(dataDirectory, 'cache/govinfo-bulk/BILLSUM/119/summaries.xml'); + expect(readFileSync(finalPath, 'utf8')).toContain('second'); + + const manifest = await readManifest(dataDirectory); + const state = (manifest.sources as typeof manifest.sources & { + 'govinfo-bulk': { files: Record }; + })['govinfo-bulk']; + expect(Object.values(state.files)).toHaveLength(1); + expect(Object.values(state.files)[0].completed_at).not.toBeNull(); + expect(fileRequestCount).toBe(2); + } finally { + rmSync(dataDirectory, { recursive: true, force: true }); + } + }); }); From 73d39542aa1376223ba39c13284a63d67f78eb58 Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 11:53:34 -0400 Subject: [PATCH 07/10] fix: harden govinfo bulk manifest writes (#40) --- src/sources/govinfo-bulk.ts | 88 ++++++++++++++++------ src/utils/manifest.ts | 97 ++++++++++++++++++++++++- 
tests/unit/sources/govinfo-bulk.test.ts | 2 +- 3 files changed, 163 insertions(+), 24 deletions(-) diff --git a/src/sources/govinfo-bulk.ts b/src/sources/govinfo-bulk.ts index d693aa9..e86eacf 100644 --- a/src/sources/govinfo-bulk.ts +++ b/src/sources/govinfo-bulk.ts @@ -1,8 +1,10 @@ import { createWriteStream } from 'node:fs'; -import { access, mkdir, mkdtemp, readdir, readFile, rename, rm, stat, writeFile } from 'node:fs/promises'; +import { access, mkdir, mkdtemp, readdir, readFile, rename, rm, stat } from 'node:fs/promises'; import { constants as fsConstants } from 'node:fs'; import { dirname, relative, resolve } from 'node:path'; import { tmpdir } from 'node:os'; +import { Readable, Transform } from 'node:stream'; +import type { ReadableStream as NodeReadableStream } from 'node:stream/web'; import { pipeline } from 'node:stream/promises'; import { XMLParser } from 'fast-xml-parser'; import yauzl, { type Entry as YauzlEntry, type ZipFile as YauzlZipFile } from 'yauzl'; @@ -320,10 +322,11 @@ async function downloadBulkArtifact(options: { const response = await fetchFile(options.entry.url, options.fetchImpl); const temporaryPath = `${targetPath}.tmp-${process.pid}-${Date.now()}`; await mkdir(dirname(targetPath), { recursive: true }); + let temporaryExtractionRoot: string | null = null; try { - const buffer = Buffer.from(await response.arrayBuffer()); - await writeFile(temporaryPath, buffer, { mode: 0o640 }); - const byteSize = Number.parseInt(response.headers.get('content-length') ?? String(buffer.byteLength), 10); + const streamedByteCount = await streamResponseToDisk(response, temporaryPath); + const headerByteCount = Number.parseInt(response.headers.get('content-length') ?? '', 10); + const byteSize = Number.isFinite(headerByteCount) && headerByteCount > 0 ? headerByteCount : streamedByteCount; const fileKind = detectFileKind(options.entry.url); const extractionRoot = fileKind === 'zip' ? 
resolve(dirname(targetPath), 'extracted') : null; @@ -331,30 +334,41 @@ async function downloadBulkArtifact(options: { const xml = await readFile(temporaryPath, 'utf8'); validateXmlPayload(xml); } else if (fileKind === 'zip') { - const tempExtractionRoot = await mkdtemp(resolve(dirname(targetPath), '.extract-')); - try { - await extractZipSafely(temporaryPath, tempExtractionRoot); - const xmlFiles = await collectXmlFiles(tempExtractionRoot); - if (xmlFiles.length === 0) { - throw new Error('invalid_payload: ZIP file contained no XML artifacts'); - } - if (options.collection === 'BILLSTATUS') { - const sampleXml = await readFile(xmlFiles[0], 'utf8'); - validateXmlPayload(sampleXml); - } - if (extractionRoot !== null) { - await rm(extractionRoot, { recursive: true, force: true }); - await rename(tempExtractionRoot, extractionRoot); - } - } catch (error) { - await rm(tempExtractionRoot, { recursive: true, force: true }); - throw error; + temporaryExtractionRoot = await mkdtemp(resolve(dirname(targetPath), '.extract-')); + await extractZipSafely(temporaryPath, temporaryExtractionRoot); + const xmlFiles = await collectXmlFiles(temporaryExtractionRoot); + if (xmlFiles.length === 0) { + throw new Error('invalid_payload: ZIP file contained no XML artifacts'); + } + if (options.collection === 'BILLSTATUS') { + const sampleXml = await readFile(xmlFiles[0], 'utf8'); + validateXmlPayload(sampleXml); } } else { const payload = await readFile(temporaryPath, 'utf8'); validateXmlPayload(payload); } + if (!options.force && await wasArtifactCompletedByAnotherWriter(options.fileKey, options.dataDirectory)) { + await rm(temporaryPath, { force: true }); + if (temporaryExtractionRoot !== null) { + await rm(temporaryExtractionRoot, { recursive: true, force: true }); + } + const refreshedManifest = await readManifest(options.dataDirectory); + const refreshedState = ensureGovInfoBulkState(refreshedManifest); + const refreshedEntry = refreshedState.files[options.fileKey]; + if 
(refreshedEntry) { + options.state.files[options.fileKey] = refreshedEntry; + } + return 'skipped'; + } + + if (temporaryExtractionRoot !== null && extractionRoot !== null) { + await rm(extractionRoot, { recursive: true, force: true }); + await rename(temporaryExtractionRoot, extractionRoot); + temporaryExtractionRoot = null; + } + await rename(temporaryPath, targetPath); options.state.files[options.fileKey] = { ...initialState, @@ -376,6 +390,9 @@ async function downloadBulkArtifact(options: { return 'downloaded'; } catch (error) { await rm(temporaryPath, { force: true }); + if (temporaryExtractionRoot !== null) { + await rm(temporaryExtractionRoot, { recursive: true, force: true }); + } options.state.files[options.fileKey] = { ...initialState, source_url: options.entry.url, @@ -392,6 +409,33 @@ async function downloadBulkArtifact(options: { } } +async function streamResponseToDisk(response: Response, destinationPath: string): Promise { + if (response.body === null) { + throw new Error('upstream_request_failed: GovInfo bulk file response had no readable body'); + } + + let byteCount = 0; + const countBytes = new Transform({ + transform(chunk, _encoding, callback) { + byteCount += Buffer.isBuffer(chunk) ? 
chunk.byteLength : Buffer.byteLength(String(chunk)); + callback(null, chunk); + }, + }); + + await pipeline(Readable.fromWeb(response.body as NodeReadableStream), countBytes, createWriteStream(destinationPath, { mode: 0o640 })); + return byteCount; +} + +async function wasArtifactCompletedByAnotherWriter(fileKey: string, dataDirectory: string): Promise { + const refreshedManifest = await readManifest(dataDirectory); + const refreshedState = ensureGovInfoBulkState(refreshedManifest); + const refreshedEntry = refreshedState.files[fileKey]; + if (!refreshedEntry) { + return false; + } + return isResumeComplete(refreshedEntry, dataDirectory); +} + async function fetchListing(url: string, fetchImpl: typeof fetch): Promise { const response = await fetchText(url, fetchImpl, 'govinfo-bulk'); return parseGovInfoBulkListing(response.body, url).filter((entry) => isAllowedGovInfoBulkUrl(new URL(entry.url))); diff --git a/src/utils/manifest.ts b/src/utils/manifest.ts index 8e92974..dbdeaf2 100644 --- a/src/utils/manifest.ts +++ b/src/utils/manifest.ts @@ -257,12 +257,107 @@ export async function readManifest(dataDirectory = getDataDirectory()): Promise< export async function writeManifest(manifest: FetchManifest, dataDirectory = getDataDirectory()): Promise { const manifestPath = getManifestPath(dataDirectory); await mkdir(dirname(manifestPath), { recursive: true }); - const payload = JSON.stringify({ ...manifest, updated_at: new Date().toISOString() }, null, 2); + const mergedManifest = await mergeManifestForWrite(manifest, dataDirectory); + const payload = JSON.stringify({ ...mergedManifest, updated_at: new Date().toISOString() }, null, 2); const temporaryPath = `${manifestPath}.tmp-${process.pid}-${Date.now()}`; await writeFile(temporaryPath, `${payload}\n`, { encoding: 'utf8', mode: 0o600 }); await rename(temporaryPath, manifestPath); } +async function mergeManifestForWrite(manifest: FetchManifest, dataDirectory: string): Promise { + const existingManifest = await 
readExistingManifestForMerge(dataDirectory); + if (existingManifest === null) { + return manifest; + } + + const incomingGovInfoBulk = manifest.sources['govinfo-bulk']; + const existingGovInfoBulk = existingManifest.sources['govinfo-bulk']; + + return { + ...existingManifest, + ...manifest, + sources: { + ...existingManifest.sources, + ...manifest.sources, + 'govinfo-bulk': mergeGovInfoBulkManifestState(existingGovInfoBulk, incomingGovInfoBulk), + }, + runs: Array.isArray(manifest.runs) ? manifest.runs : existingManifest.runs, + }; +} + +async function readExistingManifestForMerge(dataDirectory: string): Promise { + try { + const raw = await readFile(getManifestPath(dataDirectory), 'utf8'); + return normalizeManifest(parseManifestJson(raw)); + } catch (error) { + if (isMissingFileError(error)) { + return null; + } + throw error; + } +} + +function mergeGovInfoBulkManifestState(existing: GovInfoBulkManifestState, incoming: GovInfoBulkManifestState): GovInfoBulkManifestState { + const mergedCollections: GovInfoBulkManifestState['collections'] = { ...existing.collections }; + for (const collection of Object.keys(incoming.collections) as Array) { + const incomingCollection = incoming.collections[collection]; + if (!incomingCollection) { + continue; + } + const existingCollection = mergedCollections[collection]; + mergedCollections[collection] = existingCollection + ? { + ...existingCollection, + ...incomingCollection, + discovered_congresses: [...new Set([...existingCollection.discovered_congresses, ...incomingCollection.discovered_congresses])].sort((left, right) => left - right), + congress_runs: mergeGovInfoBulkCongressRuns(existingCollection.congress_runs, incomingCollection.congress_runs), + } + : incomingCollection; + } + + return { + last_success_at: selectLatestTimestamp(existing.last_success_at, incoming.last_success_at), + last_failure: incoming.last_failure ?? 
existing.last_failure, + checkpoints: { ...existing.checkpoints, ...incoming.checkpoints }, + collections: mergedCollections, + files: { ...existing.files, ...incoming.files }, + }; +} + +function mergeGovInfoBulkCongressRuns( + existing: NonNullable['congress_runs'], + incoming: NonNullable['congress_runs'], +): NonNullable['congress_runs'] { + const merged = { ...existing }; + for (const [key, incomingRun] of Object.entries(incoming)) { + const existingRun = merged[key]; + merged[key] = existingRun + ? { + ...existingRun, + ...incomingRun, + completed_at: selectLatestTimestamp(existingRun.completed_at, incomingRun.completed_at), + file_keys: [...new Set([...existingRun.file_keys, ...incomingRun.file_keys])], + directories_visited: Math.max(existingRun.directories_visited, incomingRun.directories_visited), + files_discovered: Math.max(existingRun.files_discovered, incomingRun.files_discovered), + files_downloaded: Math.max(existingRun.files_downloaded, incomingRun.files_downloaded), + files_skipped: Math.max(existingRun.files_skipped, incomingRun.files_skipped), + files_failed: Math.max(existingRun.files_failed, incomingRun.files_failed), + } + : incomingRun; + } + return merged; +} + +function selectLatestTimestamp(left: string | null, right: string | null): string | null { + if (left === null) { + return right; + } + if (right === null) { + return left; + } + return left >= right ? 
left : right; +} + function parseManifestJson(raw: string): Partial { try { return JSON.parse(raw) as Partial; diff --git a/tests/unit/sources/govinfo-bulk.test.ts b/tests/unit/sources/govinfo-bulk.test.ts index 765bc4a..b9c3efd 100644 --- a/tests/unit/sources/govinfo-bulk.test.ts +++ b/tests/unit/sources/govinfo-bulk.test.ts @@ -282,7 +282,7 @@ describe('govinfo bulk utilities', () => { expect(secondResult.ok).toBe(true); const finalPath = resolve(dataDirectory, 'cache/govinfo-bulk/BILLSUM/119/summaries.xml'); - expect(readFileSync(finalPath, 'utf8')).toContain('second'); + expect(readFileSync(finalPath, 'utf8')).toContain('first'); const manifest = await readManifest(dataDirectory); const state = (manifest.sources as typeof manifest.sources & { From e21f781764241ead930de8c5990759a5e55cf5bc Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 12:00:16 -0400 Subject: [PATCH 08/10] test: add final-path overlap regression for #40 --- tests/unit/sources/govinfo-bulk.test.ts | 63 +++++++++++++++++-------- 1 file changed, 44 insertions(+), 19 deletions(-) diff --git a/tests/unit/sources/govinfo-bulk.test.ts b/tests/unit/sources/govinfo-bulk.test.ts index b9c3efd..dd12654 100644 --- a/tests/unit/sources/govinfo-bulk.test.ts +++ b/tests/unit/sources/govinfo-bulk.test.ts @@ -1,8 +1,9 @@ -import { describe, expect, it } from 'vitest'; +import { afterEach, describe, expect, it, vi } from 'vitest'; import { mkdtempSync, writeFileSync, rmSync, existsSync, readFileSync } from 'node:fs'; import { execFileSync } from 'node:child_process'; import { tmpdir } from 'node:os'; import { join, resolve } from 'node:path'; +import * as manifestModule from '../../../src/utils/manifest.js'; import { readManifest, writeManifest } from '../../../src/utils/manifest.js'; import { fetchGovInfoBulkSource } from '../../../src/sources/govinfo-bulk.js'; import { parseGovInfoBulkListing, resolveGovInfoBulkUrl } from '../../../src/utils/govinfo-bulk-listing.js'; @@ -25,6 +26,10 @@ function 
createZipBytes(fileName: string, contents: string): Buffer { } describe('govinfo bulk utilities', () => { + afterEach(() => { + vi.restoreAllMocks(); + }); + it('parses XML directory listings and resolves directory/file entries', () => { const entries = parseGovInfoBulkListing( `119119/BILLSTATUS-119hr.xml.zip119/hr/BILLSTATUS-119hr.xml.zip`, @@ -242,15 +247,33 @@ describe('govinfo bulk utilities', () => { } }); - it('keeps the first completed artifact when overlapping runs target the same file', async () => { - const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-overlap-')); + it('skips overwrite when another writer has already created the final cache path before manifest completion is recorded', async () => { + const dataDirectory = mkdtempSync(join(tmpdir(), 'govinfo-bulk-final-path-race-')); const firstXml = 'first'; const secondXml = 'second'; - let releaseFirstDownload: (() => void) | null = null; - const firstDownloadReady = new Promise((resolveReady) => { - releaseFirstDownload = resolveReady; - }); + const targetPath = resolve(dataDirectory, 'cache/govinfo-bulk/BILLSUM/119/summaries.xml'); let fileRequestCount = 0; + let releaseCompletedManifestWrite: (() => void) | null = null; + const completedManifestWriteBlocked = new Promise((resolveBlocked) => { + releaseCompletedManifestWrite = resolveBlocked; + }); + let firstCompletedWriteIntercepted = false; + + const originalWriteManifest = manifestModule.writeManifest; + vi.spyOn(manifestModule, 'writeManifest').mockImplementation(async (manifest, targetDataDirectory) => { + const bulkState = (manifest.sources as typeof manifest.sources & { + 'govinfo-bulk': { files: Record }; + })['govinfo-bulk']; + const summariesEntry = Object.values(bulkState.files).find((entry) => entry.relative_cache_path === 'BILLSUM/119/summaries.xml'); + + if (!firstCompletedWriteIntercepted && summariesEntry?.completed_at) { + firstCompletedWriteIntercepted = true; + expect(existsSync(targetPath)).toBe(true); + await 
completedManifestWriteBlocked; + } + + return originalWriteManifest(manifest, targetDataDirectory); + }); const fetchImpl: typeof fetch = async (input) => { const url = typeof input === 'string' ? input : input.toString(); @@ -262,27 +285,29 @@ describe('govinfo bulk utilities', () => { } if (url === 'https://www.govinfo.gov/bulkdata/BILLSUM/119/summaries.xml') { fileRequestCount += 1; - if (fileRequestCount === 1) { - await firstDownloadReady; - return response(firstXml, 'application/xml'); - } - return response(secondXml, 'application/xml'); + return response(fileRequestCount === 1 ? firstXml : secondXml, 'application/xml'); } throw new Error(`Unexpected URL: ${url}`); }; try { const firstRun = fetchGovInfoBulkSource({ force: false, congress: 119, collection: 'BILLSUM', dataDirectory, fetchImpl }); - await new Promise((resolveDelay) => setTimeout(resolveDelay, 10)); + + await vi.waitFor(() => { + expect(firstCompletedWriteIntercepted).toBe(true); + }); + const secondRun = fetchGovInfoBulkSource({ force: false, congress: 119, collection: 'BILLSUM', dataDirectory, fetchImpl }); - releaseFirstDownload?.(); + const secondResult = await secondRun; + releaseCompletedManifestWrite?.(); + const firstResult = await firstRun; - const [firstResult, secondResult] = await Promise.all([firstRun, secondRun]); expect(firstResult.ok).toBe(true); expect(secondResult.ok).toBe(true); - - const finalPath = resolve(dataDirectory, 'cache/govinfo-bulk/BILLSUM/119/summaries.xml'); - expect(readFileSync(finalPath, 'utf8')).toContain('first'); + expect(secondResult.files_skipped).toBe(1); + expect(secondResult.files_downloaded).toBe(0); + expect(readFileSync(targetPath, 'utf8')).toContain('first'); + expect(fileRequestCount).toBe(2); const manifest = await readManifest(dataDirectory); const state = (manifest.sources as typeof manifest.sources & { @@ -290,8 +315,8 @@ describe('govinfo bulk utilities', () => { })['govinfo-bulk']; expect(Object.values(state.files)).toHaveLength(1); 
expect(Object.values(state.files)[0].completed_at).not.toBeNull(); - expect(fileRequestCount).toBe(2); } finally { + releaseCompletedManifestWrite?.(); rmSync(dataDirectory, { recursive: true, force: true }); } }); From b29a14980593a4f0508e3429433010ed8fb2648e Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 12:05:12 -0400 Subject: [PATCH 09/10] fix: guard govinfo bulk overlap final-path race (#40) --- src/sources/govinfo-bulk.ts | 47 ++++++++++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 6 deletions(-) diff --git a/src/sources/govinfo-bulk.ts b/src/sources/govinfo-bulk.ts index e86eacf..cf42b33 100644 --- a/src/sources/govinfo-bulk.ts +++ b/src/sources/govinfo-bulk.ts @@ -349,7 +349,13 @@ async function downloadBulkArtifact(options: { validateXmlPayload(payload); } - if (!options.force && await wasArtifactCompletedByAnotherWriter(options.fileKey, options.dataDirectory)) { + if (!options.force && await wasArtifactCompletedByAnotherWriter({ + fileKey: options.fileKey, + dataDirectory: options.dataDirectory, + targetPath, + fileKind, + extractionRoot, + })) { await rm(temporaryPath, { force: true }); if (temporaryExtractionRoot !== null) { await rm(temporaryExtractionRoot, { recursive: true, force: true }); @@ -426,14 +432,43 @@ async function streamResponseToDisk(response: Response, destinationPath: string) return byteCount; } -async function wasArtifactCompletedByAnotherWriter(fileKey: string, dataDirectory: string): Promise { - const refreshedManifest = await readManifest(dataDirectory); +async function wasArtifactCompletedByAnotherWriter(options: { + fileKey: string; + dataDirectory: string; + targetPath: string; + fileKind: GovInfoBulkFileState['file_kind']; + extractionRoot: string | null; +}): Promise { + const refreshedManifest = await readManifest(options.dataDirectory); const refreshedState = ensureGovInfoBulkState(refreshedManifest); - const refreshedEntry = refreshedState.files[fileKey]; - if (!refreshedEntry) { + const 
refreshedEntry = refreshedState.files[options.fileKey]; + if (refreshedEntry && await isResumeComplete(refreshedEntry, options.dataDirectory)) { + return true; + } + + const hasTargetPath = await pathExists(options.targetPath); + if (!hasTargetPath) { + return false; + } + + if (options.fileKind !== 'zip') { + return true; + } + + if (options.extractionRoot === null) { + return true; + } + + return pathExists(options.extractionRoot); +} + +async function pathExists(path: string): Promise<boolean> { + try { + await access(path, fsConstants.F_OK); + return true; + } catch { return false; } - return isResumeComplete(refreshedEntry, dataDirectory); } async function fetchListing(url: string, fetchImpl: typeof fetch): Promise { From bb62a01186ac03db7086d7851fb7bbce402afc0f Mon Sep 17 00:00:00 2001 From: v1d0b0t Date: Fri, 3 Apr 2026 12:12:47 -0400 Subject: [PATCH 10/10] docs: capture govinfo bulk knowledge (#40) --- .dark-factory/skills/architecture.md | 13 ++++++++++++ .dark-factory/skills/changelog.md | 30 ++++++++++++++++++++++++++++ .dark-factory/skills/decisions.md | 21 +++++++++++++++++++ .dark-factory/skills/dev.md | 26 +++++++++++++++++++++++- .dark-factory/skills/security.md | 14 +++++++++++++ .dark-factory/skills/test.md | 5 +++++ 6 files changed, 108 insertions(+), 1 deletion(-) diff --git a/.dark-factory/skills/architecture.md b/.dark-factory/skills/architecture.md index b35a896..7428716 100644 --- a/.dark-factory/skills/architecture.md +++ b/.dark-factory/skills/architecture.md @@ -29,6 +29,8 @@ - `src/sources/congress.ts` — Congress.gov bulk fetch orchestration, shared-rate-limit use, member snapshot reuse, congress checkpoint updates. - `src/sources/congress-member-snapshot.ts` — freshness evaluation for the reusable Congress global-member snapshot. - `src/sources/govinfo.ts` — GovInfo PLAW walk, checkpointed resume state, retained-package summary/granule finalization. 
+- `src/sources/govinfo-bulk.ts` — GovInfo Bulk Data Repository discovery/download orchestration, streaming ZIP/XML writes, extraction/validation, overlap loser checks, and manifest-backed resume state. +- `src/utils/govinfo-bulk-listing.ts` — GovInfo bulk XML directory-listing parser, URL resolution, and origin/path allowlisting. - `src/sources/voteview.ts` — static CSV download plus in-memory indexes for congress/member lookups. - `src/sources/unitedstates.ts` — YAML download, lightweight parsing, Congress-snapshot-based bioguide crosswalk generation/skip handling. - `src/utils/cache.ts` — raw response cache keying, TTL reads, atomic body/metadata writes. @@ -154,6 +156,7 @@ - OLRC additive discovery metadata under `sources.olrc.available_vintages` - Congress `bulk_scope`, `member_snapshot`, `congress_runs`, `bulk_history_checkpoint` - GovInfo `query_scopes` and `checkpoints` + - GovInfo bulk state under `sources["govinfo-bulk"]` with per-request checkpoints, per-collection/per-congress run state, and per-artifact file records (`download_status`, `validation_status`, `file_kind`, `relative_cache_path`, `extraction_root`) - legislators `cross_reference` state with explicit skip statuses - Congress global-member snapshot is intentionally separate from per-congress bill/committee runs. `src/sources/unitedstates.ts` may use it only when the latest snapshot is both `status: 'complete'` and still fresh per `evaluateCongressMemberSnapshotFreshness()`. - `fetch --all` runs sources serially in fixed order: `olrc`, `congress`, `govinfo`, `voteview`, `legislators`. 
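The origin/path allowlisting attributed to `src/utils/govinfo-bulk-listing.ts` above can be illustrated with a minimal sketch. This is not the repository's implementation: `isAllowedBulkUrlSketch` and `BULK_ROOT` are hypothetical names, and the real helper may apply further checks (for example on redirects).

```typescript
// Hypothetical sketch of an origin/path allowlist for bulk listing URLs.
// The allowed prefix comes from the docs above; everything else is assumed.
const BULK_ROOT = new URL('https://www.govinfo.gov/bulkdata/');

function isAllowedBulkUrlSketch(candidate: string, base: string = BULK_ROOT.href): boolean {
  let resolved: URL;
  try {
    // Resolve relative listing links against the listing URL that produced them.
    resolved = new URL(candidate, base);
  } catch {
    // Unparseable links are rejected rather than guessed at.
    return false;
  }
  // Same scheme + host + port, and the normalized path must stay under /bulkdata/.
  return resolved.origin === BULK_ROOT.origin
    && resolved.pathname.startsWith(BULK_ROOT.pathname);
}
```

Because the WHATWG URL parser normalizes `..` segments before the prefix check runs, a relative link such as `../api/secret` resolves outside `/bulkdata/` and is rejected.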
@@ -175,6 +178,10 @@ - legislators skip states must not leave a stale `data/cache/legislators/bioguide-crosswalk.json` on disk - Congress and GovInfo now both consult the shared in-process limiter singleton from `src/utils/rate-limit.ts`, so one process no longer keeps separate per-source budgets for the same `API_DATA_GOV_KEY` - Congress/GovInfo `429` handling now keeps `nextRequestAt` numeric through the throw path and converts it to ISO only in `normalizeError()`, preserving the public `next_request_at` summary + - GovInfo bulk listing and file URLs are constrained to `https://www.govinfo.gov/bulkdata/` via `src/utils/govinfo-bulk-listing.ts` + - GovInfo bulk downloads now stream response bodies directly to temp files before validation/extraction; they do not materialize whole ZIPs in memory + - GovInfo bulk overlap handling is intentionally loser-check-based rather than full locking: immediately before final rename, `downloadBulkArtifact()` re-reads manifest state and final on-disk artifact/extraction-root existence to avoid clobbering another writer that already completed the same file + - manifest writes now merge `sources["govinfo-bulk"]` file/collection/congress state with on-disk manifest contents so stale snapshots do not drop another writer's completed file records - OLRC cookie state is memory-only inside `src/sources/olrc.ts`; it must never be persisted in manifest/cache metadata/output - OLRC releasepoint discovery is `download.shtml`-first and only Title 53 may be downgraded to `reserved_empty` - OLRC ZIP extraction now tolerates current large-title payloads via the 128 MiB large-entry ceiling while keeping bounded extraction caps @@ -283,6 +290,12 @@ - `fetch --source=olrc --all-vintages` discovers once, iterates every vintage in descending order, keeps successful earlier vintages on disk when later ones fail, and reports per-vintage results - manifest normalization is additive only: old manifests load with `vintages: {}` and `available_vintages: 
null` - latest-mode compatibility remains intentional: plain `fetch --source=olrc` still fetches only the newest vintage and mirrors that state to top-level `selected_vintage` + `titles` + - issue #40 GovInfo bulk backfill layer: + - `fetch --source=govinfo-bulk [--collection=] [--congress=]` is an explicit historical backfill path and is intentionally excluded from `fetch --all` + - discovery walks GovInfo XML directory listings recursively per collection/congress and preserves remote layout under `data/cache/govinfo-bulk/{collection}/{congress}/...` + - ZIP/XML artifacts are validated before manifest completion; `BILLSTATUS` ZIPs additionally require parseable extracted XML + - resume state lives under `sources["govinfo-bulk"]` with request checkpoints, per-collection/per-congress counters, and per-file completion metadata + - local download concurrency is bounded at 2, downloads stream to disk, and overlapping writers must skip instead of overwriting when a final cache path already exists before manifest completion is persisted - issue #29 chapter-rendering correctness layer: - standalone section markdown remains H1, but embedded chapter-mode sections render as `## § ... 
{#section-...}` with statutory notes at `###` / `####` and editorial notes at `###` - chapter frontmatter `source` is now the concrete title URL `https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title{title}` while section frontmatter still uses section-specific canonical URLs diff --git a/.dark-factory/skills/changelog.md b/.dark-factory/skills/changelog.md index 9f4ccab..b3e04fe 100644 --- a/.dark-factory/skills/changelog.md +++ b/.dark-factory/skills/changelog.md @@ -469,3 +469,33 @@ - `npx tsc --noEmit` ✅ - `npm run build` ✅ - `npx vitest run` ✅ (`222 passed, 1 skipped`) + +## Feature #40 — GovInfo bulk repository fetch source +- Updated `src/commands/fetch.ts`: + - registered `govinfo-bulk` as a first-class fetch source + - added `--collection=` parsing/validation + - kept `govinfo-bulk` out of `fetch --all` + - allowed anonymous entry into the bulk path without `API_DATA_GOV_KEY` +- Added GovInfo bulk discovery/download implementation: + - `src/utils/govinfo-bulk-listing.ts` parses GovInfo XML directory listings, resolves relative links, and enforces the `https://www.govinfo.gov/bulkdata/` allowlist + - `src/sources/govinfo-bulk.ts` recursively discovers congress/type files, streams artifact downloads to temp files, validates XML/ZIP payloads, extracts ZIPs under sibling `extracted/` directories, and records per-file resume state +- Updated `src/utils/manifest.ts`: + - normalized new `sources["govinfo-bulk"]` state + - merged incoming bulk state with on-disk manifest contents before write so stale snapshots preserve other writers' completed file records +- Added/expanded coverage in: + - `tests/cli/fetch.test.ts` + - `tests/unit/sources/govinfo-bulk.test.ts` +- Runbook/docs status: + - `docs/DATA-ACQUISITION-RUNBOOK.md` now documents bulk usage, filters, cache layout, resume behavior, and collection priority +- Review/fix history captured from issue #40: + - `8d430f7` — main `govinfo-bulk` implementation + - `ea8bfde` / `332aa62` — QA and 
adversary regression coverage + - `73d3954` — manifest-merge hardening for stale-snapshot writes + - `b29a149` — final-path overlap loser guard before overwrite + - active PR is `#41` for branch `df2/issue-40` + - `[adversary-review]` is APPROVED with no findings +- Verification captured from issue context: + - `npx tsc --noEmit` ✅ + - `npm run build` ✅ + - `npx vitest run tests/unit/sources/govinfo-bulk.test.ts` ✅ + - `npx vitest run` ⚠️ two unrelated pre-existing OLRC failures remained at the dev handoff diff --git a/.dark-factory/skills/decisions.md b/.dark-factory/skills/decisions.md index c6ba5aa..66ef66b 100644 --- a/.dark-factory/skills/decisions.md +++ b/.dark-factory/skills/decisions.md @@ -382,3 +382,24 @@ - **Decision:** `src/transforms/markdown.ts` now activates a dedicated nested rendering path only when a subsection subtree has deeper labeled descendants. In that path, descendants render through `renderGithubSafeLabeledParagraph(...)` with bold labels at column 0, and continuation/body text nodes also render without four-space indentation. - **Consequence:** GitHub-safe nested output is now a renderer contract. Future agents must preserve the narrow compatibility gate that keeps flat sections and top-level-subsection-only sections byte-stable while preventing `\n (i)`-style regressions in affected hierarchies. - **Feature:** #36 Sub-subsection indentation renders as code blocks on GitHub + +### ADR-054: GovInfo bulk is an additive explicit fetch source with manifest-backed bulk state +- **Status:** Active +- **Context:** Historical GovInfo backfill via the API is too slow and rate-limited; the bulk repository exposes equivalent anonymous ZIP/XML artifacts with a different directory-listing model. 
+- **Decision:** Add `fetch --source=govinfo-bulk` as a separate source in `src/commands/fetch.ts`, keep it excluded from `fetch --all`, constrain optional `--collection` to `BILLSTATUS | BILLS | BILLSUM | PLAW`, and persist canonical progress under `sources["govinfo-bulk"]` in `data/manifest.json`. +- **Consequence:** Future agents should extend the dedicated bulk-source path instead of overloading the API-based `govinfo` client or inventing sidecar state files. +- **Feature:** #40 Add bulk download from GovInfo Bulk Data Repository + +### ADR-055: GovInfo bulk downloads stream to temp files and validate before completion +- **Status:** Active +- **Context:** Bulk repository artifacts can be large enough that buffering entire ZIPs in memory would violate the architecture’s resource-usage controls and make concurrent downloads fragile. +- **Decision:** `src/sources/govinfo-bulk.ts` streams `response.body` directly to a temp file, derives byte counts from streamed bytes or `content-length`, validates XML/ZIP payloads before rename, and requires parseable extracted XML for `BILLSTATUS` ZIPs. +- **Consequence:** Future agents must preserve the streamed temp-file path and should treat a return to `response.arrayBuffer()` as a resource/safety regression. +- **Feature:** #40 Add bulk download from GovInfo Bulk Data Repository + +### ADR-056: GovInfo bulk overlap safety is merge-on-write plus loser-skip checks, not blind rename +- **Status:** Active +- **Context:** Overlapping local bulk-fetch processes can race on both `data/manifest.json` and the final cache path for the same artifact. +- **Decision:** `src/utils/manifest.ts` merges incoming `sources["govinfo-bulk"]` state with the on-disk manifest before rename, and `downloadBulkArtifact()` re-reads manifest state plus final artifact/extraction-root existence immediately before destructive rename so a loser skips once another writer already completed the file. 
+- **Consequence:** Future agents should preserve these stale-snapshot merge and final-path re-check seams when changing bulk persistence; last-writer-wins manifest rewrites and overwrite-on-rename behavior are now explicitly rejected branch designs. +- **Feature:** #40 Add bulk download from GovInfo Bulk Data Repository diff --git a/.dark-factory/skills/dev.md b/.dark-factory/skills/dev.md index 765af50..28993a1 100644 --- a/.dark-factory/skills/dev.md +++ b/.dark-factory/skills/dev.md @@ -47,6 +47,7 @@ - Run fetch after build: - `node dist/index.js fetch --status` - `node dist/index.js fetch --source=congress --congress=119` + - `node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119` - `node dist/index.js fetch --all --congress=119` - Public CLI entry in `package.json`: `us-code-tools -> ./dist/index.js` - CI/build note: integration/CLI tests shell out to `dist/index.js`, so `npm run build` must happen before Vitest when validating compiled CLI behavior. 
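The merge-on-write half of ADR-056 can be pictured with a small sketch. The shapes and the merge policy here are assumptions for illustration only: `FileRecordSketch` is a reduced stand-in for `GovInfoBulkFileState`, and the rule that a completed on-disk record is never downgraded by an incomplete incoming one is an assumed reading of "stale snapshots do not drop another writer's completed file records", not the code in `src/utils/manifest.ts`.

```typescript
// Hypothetical, reduced file-record shape; the real state carries more fields.
interface FileRecordSketch {
  completed_at: string | null;
}
type FileMapSketch = Record<string, FileRecordSketch>;

// Merge an in-memory (possibly stale) snapshot into what is currently on disk.
function mergeFileRecordsSketch(onDisk: FileMapSketch, incoming: FileMapSketch): FileMapSketch {
  // Start from disk so keys written by another process survive the rewrite.
  const merged: FileMapSketch = { ...onDisk };
  for (const [key, record] of Object.entries(incoming)) {
    const existing = merged[key];
    // Assumed policy: an incomplete incoming record never replaces a completed one.
    if (existing && existing.completed_at !== null && record.completed_at === null) {
      continue;
    }
    merged[key] = record;
  }
  return merged;
}
```

Under this policy a writer holding an old snapshot can still add its own records, while another writer's completed entries stay intact after the rewrite.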
@@ -77,9 +78,10 @@ - `src/sources/congress.ts` — Congress fetch orchestration - `src/sources/congress-member-snapshot.ts` — member snapshot freshness contract - `src/sources/govinfo.ts` — GovInfo collection walk/checkpointing +- `src/sources/govinfo-bulk.ts` — GovInfo bulk listing walk, streaming download/extract, resume, and overlap-guard logic - `src/sources/voteview.ts` — VoteView file download/index helpers - `src/sources/unitedstates.ts` — legislators download/parsing/crosswalk -- `src/utils/cache.ts`, `manifest.ts`, `fetch-config.ts`, `logger.ts`, `rate-limit.ts`, `retry.ts` — acquisition infrastructure +- `src/utils/cache.ts`, `manifest.ts`, `fetch-config.ts`, `govinfo-bulk-listing.ts`, `logger.ts`, `rate-limit.ts`, `retry.ts` — acquisition infrastructure - `tests/cli/` — fetch CLI contract coverage - `tests/unit/` — pure-module coverage - `tests/integration/` — built CLI end-to-end coverage @@ -94,6 +96,7 @@ - `src/sources/congress.ts` → `src/utils/cache.ts`, `src/utils/manifest.ts`, `src/utils/rate-limit.ts`, `src/utils/retry.ts`, `src/utils/logger.ts`, `src/sources/congress-member-snapshot.ts` (this source uses `getSharedApiDataGovLimiter()` and throws numeric `nextRequestAt` values that `normalizeError()` serializes into the public `next_request_at` field) - `src/sources/congress-member-snapshot.ts` → `src/utils/manifest.ts` (freshness derives from manifest snapshot metadata + artifact existence) - `src/sources/govinfo.ts` → `src/utils/cache.ts`, `src/utils/manifest.ts`, `src/utils/rate-limit.ts`, `src/utils/retry.ts`, `src/utils/logger.ts` (this source also uses `getSharedApiDataGovLimiter()` and preserves numeric `nextRequestAt` through `normalizeError()`) +- `src/sources/govinfo-bulk.ts` → `src/utils/govinfo-bulk-listing.ts`, `src/utils/manifest.ts`, `src/utils/logger.ts`, `fast-xml-parser`, `yauzl`, Node streams/fs (this module owns recursive bulk discovery, streaming file writes, ZIP/XML validation, per-file resume checks, and overlap loser checks 
before final rename) - `src/sources/unitedstates.ts` → `src/utils/manifest.ts`, `src/sources/congress-member-snapshot.ts`, current Congress cache layout in `src/sources/congress.ts` - `src/sources/voteview.ts` → `src/utils/manifest.ts` and its in-memory index cache (`inMemoryIndexes`) - `src/sources/olrc.ts` → `src/domain/model.ts`, `src/domain/normalize.ts`, `src/types/yauzl.d.ts`, `src/utils/manifest.ts`, `src/utils/logger.ts` (issue #8/#21: this module owns OLRC homepage bootstrap, in-memory cookie forwarding, `download.shtml` parsing, descending/deduped vintage discovery, discovered per-vintage title URL maps, Title 53 `reserved_empty` classification, the 128 MiB large-title entry cap, aggregate `--all-vintages` execution, and `resolveCachedOlrcTitleZipPath()`) @@ -149,6 +152,16 @@ src/index.ts (main) → isRateLimitExhausted() / markRateLimitUse() → parseRetryAfter() on HTTP 429, then throw numeric `nextRequestAt` for `normalizeError()` to serialize → writeManifest() + → fetchGovInfoBulkSource() + → fetchListing() + → parseGovInfoBulkListing() + → resolveGovInfoBulkUrl() / isAllowedGovInfoBulkUrl() + → discoverFilesForCongress() + → downloadBulkArtifact() + → streamResponseToDisk() + → validateXmlPayload() | extractZipSafely() + → wasArtifactCompletedByAnotherWriter() + → writeManifest() → fetchVoteViewSource() → fetchWithTimeout() → writeManifest() @@ -182,6 +195,8 @@ src/index.ts (main) - `OlrcTitleState` / `OlrcTitleReservedEmptyState` in `src/utils/manifest.ts` — per-title OLRC cache/result contract for issues #8/#21 - `OlrcVintageState` / `OlrcAvailableVintagesState` / `OlrcManifestState` in `src/utils/manifest.ts` — historical OLRC manifest contract and latest-mode compatibility mirror - `CongressMemberSnapshotState` / `CongressRunState` / `GovInfoCheckpointState` / `LegislatorsCrossReferenceState` in `src/utils/manifest.ts` — per-source manifest contracts +- `GovInfoBulkManifestState` / `GovInfoBulkCollectionState` / `GovInfoBulkCongressState` / 
`GovInfoBulkFileState` in `src/sources/govinfo-bulk.ts` — bulk repository manifest contracts merged by `src/utils/manifest.ts` +- `GovInfoBulkCollection` / `GovInfoBulkListingEntry` in `src/utils/govinfo-bulk-listing.ts` — bulk listing parser and allowed-collection contract - `CurrentCongressResolution` in `src/utils/fetch-config.ts` — `override`/`live`/`fallback` current-congress contract - `RawResponseCacheMetadata` in `src/utils/cache.ts` — raw API response cache metadata contract - `RateLimitState` / `RateLimitExhaustion` in `src/utils/rate-limit.ts` — shared limiter contract @@ -205,7 +220,11 @@ src/index.ts (main) - The implementation uses `git fast-import` for historical commits, then `git reset --hard HEAD` to restore a clean working tree. - Fetch-path conventions: - `src/commands/fetch.ts` owns CLI validation and top-level fail-open source ordering + - `govinfo-bulk` is an explicit source only; keep it out of `fetch --all` unless spec/architecture change because it can trigger multi-GB historical downloads - `src/utils/manifest.ts` is permissive on read/normalize but all writers should emit the canonical shape + - GovInfo bulk manifest writes intentionally merge on-disk `sources["govinfo-bulk"]` state before rename so stale snapshots do not delete another writer's completed file entries + - GovInfo bulk downloads must stream `response.body` to disk; do not reintroduce `response.arrayBuffer()` for multi-GB artifacts + - GovInfo bulk overlap safety depends on the pre-rename `wasArtifactCompletedByAnotherWriter(...)` re-check of refreshed manifest state plus final artifact/extraction-root existence; preserve that loser-skip seam if you touch rename/extraction flow - Congress/GovInfo raw API caching goes through `src/utils/cache.ts`; cache keys normalize away `api_key` - Congress and GovInfo both call `getSharedApiDataGovLimiter()` / `resetSharedApiDataGovLimiter()` from `src/utils/rate-limit.ts`; update tests and any mocks at that shared-module seam rather 
than assuming per-source limiter state - `src/utils/rate-limit.ts` owns `parseRetryAfter()`, and both `src/sources/congress.ts` and `src/sources/govinfo.ts` now keep the parsed retry horizon numeric until `normalizeError()` converts it into the public ISO `next_request_at` field @@ -373,6 +392,11 @@ src/index.ts - chapter-level xrefs never point to `section-*.md`; they resolve through writer-built `sectionTargetsByRef` entries or exact `uscode.house.gov` section URLs - `_title.md` intentionally keeps only title/chapter navigation, while nested labeled content now renders as multiple indented lines with parent-before-child ordering - ordered and non-ordered parse paths must agree on `SectionIR.heading`; the shared helper seam is `readSectionHeading(...)` + - issue #40 GovInfo bulk fetch work: + - CLI adds `--source=govinfo-bulk` and `--collection=`; `--collection` is invalid for every other source + - the production path is XML-listing-driven: recurse from `https://www.govinfo.gov/bulkdata/{collection}/`, filter optional `--congress`, then download file entries only + - file cache paths preserve remote layout under `data/cache/govinfo-bulk/{collection}/{congress}/...`; ZIPs keep the downloaded archive plus a sibling `extracted/` directory + - manifest merge/overlap behavior is part of the contract, not an implementation detail: stale snapshots must preserve other writers' completed file keys, and overlap losers must skip once the final cache path already exists before manifest completion lands - What's intentionally deferred: - What's intentionally deferred: - additional backfill phases diff --git a/.dark-factory/skills/security.md b/.dark-factory/skills/security.md index 1573634..6949c7c 100644 --- a/.dark-factory/skills/security.md +++ b/.dark-factory/skills/security.md @@ -40,6 +40,15 @@ - `src/sources/unitedstates.ts` - skips cross-reference unless the latest Congress snapshot is complete and fresh - deletes stale `bioguide-crosswalk.json` on skip paths so 
manifest state and disk state cannot disagree +- `src/utils/govinfo-bulk-listing.ts` + - resolves only `https://www.govinfo.gov/bulkdata/...` URLs and rejects redirects/origins outside that prefix + - treats HTML listings as invalid XML payloads instead of silently traversing them +- `src/sources/govinfo-bulk.ts` + - streams GovInfo bulk responses to temp files instead of buffering whole artifacts in memory + - validates XML payloads and ZIP extraction before completion, with `BILLSTATUS` requiring parseable extracted XML + - re-checks refreshed manifest state plus final artifact/extraction-root existence immediately before final rename so overlap losers skip instead of overwriting a completed writer +- `src/utils/manifest.ts` + - merges incoming `sources["govinfo-bulk"]` state with the on-disk manifest before rename so stale snapshots do not drop another writer's completed file records ### Backfill Target Safety - `src/backfill/target-repo.ts` @@ -138,6 +147,8 @@ - **Issue #16 warning classification is part of the public contract:** uncategorized sections surface via `TransformWarning` / `warnings[]`, not `ParseError`, so successful runs can still report zero `parse_errors`. - **Issue #21 historical OLRC fetches must remain discovery-driven:** once the listing is parsed, `selectVintagePlan()` must reuse the discovered per-vintage title URL map rather than synthesizing `resolveTitleUrl(title, vintage)` for titles that were never advertised. - **Issue #21 listing mode is intentionally side-effect free:** `listOlrcVintages()` may perform OLRC discovery but must not persist manifest state, cache artifacts, or cookie material. +- **Issue #40 GovInfo bulk is an anonymous historical backfill path:** `fetch --source=govinfo-bulk` must not require `API_DATA_GOV_KEY`, and its network surface is constrained to the GovInfo bulk repository instead of the api.data.gov path. 
+- **Issue #40 streaming download + loser-skip overlap checks are the reviewed DoS/race controls:** future agents should preserve the streamed temp-file write path and the pre-rename refreshed-manifest/final-path re-check rather than reverting to `arrayBuffer()` or blind rename-overwrite behavior. - **Issue #29 chapter-mode links are allowlisted outputs, not best-effort guesses:** renderer output may only use writer-derived relative chapter targets or exact `https://uscode.house.gov/` canonical fallbacks; emitting `section-*.md`, arbitrary domains, or guessed chapter filenames is a contract violation. - **Issue #29 canonical slash-bearing refs are integrity-sensitive:** parse-output links may use filename-safe hrefs, but chapter-mode rewriting must preserve canonical ids like `125/d` for both map lookup keys and fallback URLs; collapsing them to `125d` or `125-d` changes the legal reference target. - **Issue #29 heading extraction must fail closed to empty string:** `readSectionHeading(...)` may read only real `` content; substituting descendant paragraph text into `SectionIR.heading` would be an integrity bug, not a resilience feature. @@ -150,6 +161,8 @@ - `git fast-import` is intentional for historical author/date control; do not replace it casually with ordinary `git commit` without revalidating exact-history guarantees. - Congress and GovInfo no longer keep separate module-local limiter instances; both sources now import the shared singleton from `src/utils/rate-limit.ts`, so future agents should treat duplicate per-source limiter state as obsolete branch knowledge. - Congress and GovInfo parse upstream `Retry-After` and preserve the parsed numeric `nextRequestAt` until `normalizeError()` serializes the public `next_request_at` field; future changes should keep that boundary intact. +- GovInfo bulk is intentionally a no-key path; treating missing `API_DATA_GOV_KEY` as a govinfo-bulk failure is incorrect. 
+- GovInfo bulk overlap handling is intentionally merge-on-write + loser-skip, not a global lock manager; the branch contract is to avoid duplicate completion/overwrite, not to serialize all fetch activity. - OLRC cookie bootstrap and `download.shtml` discovery are required production behavior, not temporary test scaffolding. - Title 53 `reserved_empty` manifest entries are expected machine-readable skip states, not generic fetch failures and not cache artifacts. - VoteView indexing is currently in-memory only; lack of on-disk index files is an implementation choice, not accidental data loss. @@ -230,6 +243,7 @@ - issue #20 path-integrity hardening: one shared title-directory normalization boundary, filesystem-safe heading slugification, exact fallback to legacy `title-{NN}`, and shared-link enforcement across writer and parser surfaces - issue #21 historical OLRC hardening: duplicate/malformed `--vintage` rejection before discovery, in-memory-only cookie reuse across list/latest/single/all-vintages modes, additive manifest normalization for old OLRC state, and discovery-driven sparse-vintage handling - issue #29 output-integrity hardening: centralized embedded-anchor normalization, exact canonical fallback URL generation, title-directory-safe cross-title chapter links, section-heading parity across ordered/non-ordered parsing, and elimination of broken local `section-*.md` chapter-mode refs + - issue #40 GovInfo bulk hardening: origin/path-allowlisted XML listing traversal, streamed temp-file downloads, ZIP/XML validation before completion, merge-on-write manifest persistence for bulk state, and final-path loser checks to avoid overlap overwrites - What's intentionally deferred: - signed-commit enforcement - remote authenticity verification beyond operator-configured git remotes diff --git a/.dark-factory/skills/test.md b/.dark-factory/skills/test.md index 9e65763..f3e532d 100644 --- a/.dark-factory/skills/test.md +++ b/.dark-factory/skills/test.md @@ -21,6 +21,7 
@@ - `tests/utils/issue21-manifest-historical.test.ts` — pre-feature OLRC manifest compatibility for additive `vintages` / `available_vintages` normalization. - `tests/utils/rate-limit.test.ts` — limiter arithmetic and exhaustion timing for the shared helper primitives. - `tests/unit/sources/olrc.test.ts` — OLRC source/cache behavior, cookie bootstrap, `download.shtml` discovery, Title 42 extraction ceiling, Title 53 reserved-empty classification, selected-vintage cache regressions, and the source-level seams historical-vintage behavior builds on. +- `tests/unit/sources/govinfo-bulk.test.ts` — GovInfo bulk listing parsing, ZIP/XML extraction + validation, manifest-backed resume, streaming-download contract, stale-manifest merge behavior, and overlap loser final-path skip regression coverage. - `tests/unit/transforms/uslm-to-ir.test.ts` — legacy `uslm` fixtures plus current namespace-qualified `uscDoc` fixture coverage, canonical `` precedence, empty-attribute fallback, disagreement cases, mixed punctuation cleanup, structural XSD-shape assertions, issue #14 fixture regressions for section `chapeau`, paragraph body text, subsection body text, nested subclause bodies, and parent-level continuation text, plus issue #31 regressions for canonical `sourceUrlTemplate`, note paragraph/table preservation, and note-scoped embedded-Act containment. - `tests/unit/transforms/issue12-recursive-metadata.test.ts` — real-fixture regression suite for recursive hierarchy walking, hierarchy frontmatter, singular `source_credit`, statutory notes, preserved `noteType`, relative USC ref rendering, canonical ordering, mixed-case suffix ordering, and zero-padded filename derivation. 
- `tests/unit/transforms/markdown.test.ts` — markdown rendering contracts, including issue #14 regression coverage for Title 42 § 10307 paragraph completeness, parenthesized label normalization, issue #20 coverage for slugged cross-title links on both helper and real parser/render paths, issue #31 coverage for canonical OLRC URL shape, standalone embedded anchors, blank-line-separated subsection siblings, and preserved note/table markdown output, plus issue #36 coverage for bold GitHub-safe nested descendant labels, exact `\n\n**(i)**` boundaries, byte-stable top-level subsection output, and continuation lines that stay below the 4-space code-block threshold. @@ -56,6 +57,7 @@ ## Patterns to Follow - For pure backfill modules, import source files directly and assert behavior without shelling out where possible. - For fetch utilities/sources, prefer fixture-backed unit tests over live requests; default `npm test` remains offline. +- GovInfo bulk tests should keep using synthetic XML directory listings plus in-test ZIP fixtures generated with `zip`; do not add live `govinfo.gov` fetches for routine coverage. - For historical OLRC fetch modes, keep fixtures capable of expressing sparse vintages where a requested vintage advertises only a subset of title ZIPs; that is how the branch locks in the “missing_titles, not fabricated 404 failures” contract. - When touching OLRC fetch logic, isolate `US_CODE_TOOLS_DATA_DIR` in tests that depend on uncached fetch behavior so ambient `data/` state cannot suppress the request path. - For CLI tests, build first and run `dist/index.js` with `spawnSync`. @@ -76,6 +78,7 @@ - CLI changes: assert usage/error text and no-side-effect behavior for bad invocations. - OLRC issue #8 changes: cover homepage cookie bootstrap, authenticated follow-on requests, `download.shtml` parsing, current `uscDoc` parsing, selected-vintage transform lookup, Title 42 large-entry acceptance, and Title 53 reserved-empty handling without live outbound access. 
- Issue #21 historical OLRC changes: cover duplicate `--vintage` pre-discovery rejection, side-effect-free `--list-vintages`, unknown-vintage no-mutation behavior, `--all-vintages` fail-open aggregation, sparse-vintage discovered-link reuse, and pre-feature manifest normalization to `vintages: {}` / `available_vintages: null`. +- Issue #40 GovInfo bulk changes: cover `fetch --status` integration, `--collection` validation, XML listing parsing, ZIP/XML validation, skip-on-resume behavior, streaming responses that throw on `arrayBuffer()`, stale manifest merge preservation, and the race where writer B must skip after writer A creates the final cache path before manifest completion is written. - Issue #29 chapter-rendering changes: cover embedded heading-level separation (`#` standalone vs `##` embedded), concrete chapter/title `source` frontmatter, mapped local/cross-title chapter-anchor rewriting, exact canonical fallback URLs, real parse-output slash-bearing link recovery via `#ref=` fragments, deterministic anchors (`411`, `125d`, `301-1`, `125/d`), `_title.md` without `## Sections`, nested labeled-node indentation/blank-line rules, and ordered/non-ordered `SectionIR.heading` parity. - Issue #10 parser changes: assert `@value` beats display text for title/chapter/section nodes, whitespace-only attributes fall back cleanly, mixed trailing `.—` decoration is removed in fallback mode, Title 1 current-format fixture yields `titleIr.chapters.length === 1` + 53 canonical section numbers, and output paths never contain decorated `` text. - Issue #12 transform changes: assert fixture `
` count equality for Titles 1/5/10/26, rendered hierarchy frontmatter for sampled deep-nesting sections, `source_credit` presence when `` exists, `## Statutory Notes` rendering when `` exists, preserved `noteType: 'uscNote'`, relative markdown links for transformable USC refs, canonical slash-ref mapping (`/us/usc/t10/s125/d` → `../title-10/section-00125d.md`), zero-padded filenames, canonical mixed-width/mixed-case section ordering (`106`, `106A`, `106a`, `106b`), and normalized mixed-content source-credit/note text retention (`Aug. 10, 1956, ch. 1041`, `70A Stat. 3`).
@@ -112,6 +115,7 @@
 - `tests/adversary-round2-issue5.test.ts` now mocks `src/utils/rate-limit.ts` at the shared-module seam and verifies Congress stops immediately with `rate_limit_exhausted` when the shared limiter reports zero remaining budget.
 - There is still no dedicated regression that forces a real upstream `429 Retry-After` response through Congress/GovInfo end-to-end; current confidence comes from the source implementations plus the targeted adversary regressions that now pass locally.
 - Full suite still depends on a built `dist/index.js` because transform/backfill/fetch CLI tests execute the compiled entrypoint.
+- GovInfo bulk unit coverage currently exercises the important adversary seams directly in-process rather than through a full CLI integration harness; if you add new overlap/manifest semantics, extend `tests/unit/sources/govinfo-bulk.test.ts` first.
 - `tests/unit/transforms/uslm-to-ir.test.ts` asserts that raw namespace-qualified `uscDoc` XML parses directly; callers should not strip namespaces before invoking `parseUslmToIr()`.
 - `tests/integration/transform-cli.test.ts` generates the current-format title matrix inside `buildCurrentFormatFixtureZip(...)` by deterministic string substitution from the committed Title 1 fixture; do not add live OLRC downloads or a pile of per-title committed XML fixtures for this coverage.
 - `tests/integration/transform-cli.test.ts` now seeds a canonical `data/manifest.json` for selected-vintage OLRC cache resolution instead of relying only on the fixture env override.
@@ -166,6 +170,7 @@
  - issue #20 regression coverage for title-directory slug normalization, slugged default/chapter output roots, numeric-title matrix safety, and real parser-path cross-title links
  - issue #21 regression coverage for historical OLRC selector validation, discovery-only listing mode, unknown-vintage handling, fail-open all-vintages execution, sparse-vintage discovered-link reuse, and additive manifest compatibility
  - issue #29 regression coverage for chapter heading hierarchy, chapter/title source URL concreteness, chapter-anchor xref rewriting, slash-bearing parse-output link recovery, title-index simplification, structured nested subsection formatting, and Title 51 heading retention
+ - issue #40 regression coverage for GovInfo bulk CLI validation/status, XML directory traversal, streaming ZIP/XML downloads, manifest merge safety, and overlap-loser final-path skipping
  - existing transform regression coverage remains intact
 - What's intentionally deferred:
  - live external Constitution-source verification during tests