13 changes: 13 additions & 0 deletions .dark-factory/skills/architecture.md
@@ -29,6 +29,8 @@
- `src/sources/congress.ts` — Congress.gov bulk fetch orchestration, shared-rate-limit use, member snapshot reuse, congress checkpoint updates.
- `src/sources/congress-member-snapshot.ts` — freshness evaluation for the reusable Congress global-member snapshot.
- `src/sources/govinfo.ts` — GovInfo PLAW walk, checkpointed resume state, retained-package summary/granule finalization.
- `src/sources/govinfo-bulk.ts` — GovInfo Bulk Data Repository discovery/download orchestration, streaming ZIP/XML writes, extraction/validation, overlap loser checks, and manifest-backed resume state.
- `src/utils/govinfo-bulk-listing.ts` — GovInfo bulk XML directory-listing parser, URL resolution, and origin/path allowlisting.
- `src/sources/voteview.ts` — static CSV download plus in-memory indexes for congress/member lookups.
- `src/sources/unitedstates.ts` — YAML download, lightweight parsing, Congress-snapshot-based bioguide crosswalk generation/skip handling.
- `src/utils/cache.ts` — raw response cache keying, TTL reads, atomic body/metadata writes.
@@ -154,6 +156,7 @@
- OLRC additive discovery metadata under `sources.olrc.available_vintages`
- Congress `bulk_scope`, `member_snapshot`, `congress_runs`, `bulk_history_checkpoint`
- GovInfo `query_scopes` and `checkpoints`
- GovInfo bulk state under `sources["govinfo-bulk"]` with per-request checkpoints, per-collection/per-congress run state, and per-artifact file records (`download_status`, `validation_status`, `file_kind`, `relative_cache_path`, `extraction_root`)
- legislators `cross_reference` state with explicit skip statuses
- Congress global-member snapshot is intentionally separate from per-congress bill/committee runs. `src/sources/unitedstates.ts` may use it only when the latest snapshot is both `status: 'complete'` and still fresh per `evaluateCongressMemberSnapshotFreshness()`.
- `fetch --all` runs sources serially in fixed order: `olrc`, `congress`, `govinfo`, `voteview`, `legislators`.
@@ -175,6 +178,10 @@
- legislators skip states must not leave a stale `data/cache/legislators/bioguide-crosswalk.json` on disk
- Congress and GovInfo now both consult the shared in-process limiter singleton from `src/utils/rate-limit.ts`, so one process no longer keeps separate per-source budgets for the same `API_DATA_GOV_KEY`
- Congress/GovInfo `429` handling now keeps `nextRequestAt` numeric through the throw path and converts it to ISO only in `normalizeError()`, preserving the public `next_request_at` summary
- GovInfo bulk listing and file URLs are constrained to `https://www.govinfo.gov/bulkdata/` via `src/utils/govinfo-bulk-listing.ts`
- GovInfo bulk downloads now stream response bodies directly to temp files before validation/extraction; they do not materialize whole ZIPs in memory
- GovInfo bulk overlap handling is intentionally loser-check-based rather than full locking: immediately before final rename, `downloadBulkArtifact()` re-reads manifest state and final on-disk artifact/extraction-root existence to avoid clobbering another writer that already completed the same file
- manifest writes now merge `sources["govinfo-bulk"]` file/collection/congress state with on-disk manifest contents so stale snapshots do not drop another writer's completed file records
- OLRC cookie state is memory-only inside `src/sources/olrc.ts`; it must never be persisted in manifest/cache metadata/output
- OLRC releasepoint discovery is `download.shtml`-first and only Title 53 may be downgraded to `reserved_empty`
- OLRC ZIP extraction now tolerates current large-title payloads via the 128 MiB large-entry ceiling while keeping bounded extraction caps
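The origin/path allowlist described for GovInfo bulk listing and file URLs could be sketched roughly as follows. This is a minimal illustration, not the code in `src/utils/govinfo-bulk-listing.ts`: the names `isAllowedBulkUrl` and `resolveBulkUrl` are hypothetical, and only the `https://www.govinfo.gov/bulkdata/` root comes from the notes above.

```typescript
// Hypothetical sketch; the real helpers may differ in name, signature, and behavior.
const BULK_ROOT = new URL("https://www.govinfo.gov/bulkdata/");

/** True only for URLs on the bulk origin (scheme + host) whose path stays under /bulkdata/. */
function isAllowedBulkUrl(candidate: string): boolean {
  let url: URL;
  try {
    url = new URL(candidate);
  } catch {
    return false; // not an absolute URL
  }
  // URL parsing collapses "../" segments, so traversal out of /bulkdata/ is rejected here.
  return url.origin === BULK_ROOT.origin && url.pathname.startsWith(BULK_ROOT.pathname);
}

/** Resolves a listing link against its listing URL; returns null when the result leaves the allowlist. */
function resolveBulkUrl(link: string, listingUrl: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(link, listingUrl);
  } catch {
    return null;
  }
  return isAllowedBulkUrl(resolved.href) ? resolved.href : null;
}
```

Checking the parsed `origin` and `pathname` rather than a raw string prefix means that `http:` downgrades, foreign hosts, and `../` traversal all fail the same test.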
@@ -283,6 +290,12 @@
- `fetch --source=olrc --all-vintages` discovers once, iterates every vintage in descending order, keeps successful earlier vintages on disk when later ones fail, and reports per-vintage results
- manifest normalization is additive only: old manifests load with `vintages: {}` and `available_vintages: null`
- latest-mode compatibility remains intentional: plain `fetch --source=olrc` still fetches only the newest vintage and mirrors that state to top-level `selected_vintage` + `titles`
- issue #40 GovInfo bulk backfill layer:
- `fetch --source=govinfo-bulk [--collection=<BILLSTATUS|BILLS|BILLSUM|PLAW>] [--congress=<n>]` is an explicit historical backfill path and is intentionally excluded from `fetch --all`
- discovery walks GovInfo XML directory listings recursively per collection/congress and preserves remote layout under `data/cache/govinfo-bulk/{collection}/{congress}/...`
- ZIP/XML artifacts are validated before manifest completion; `BILLSTATUS` ZIPs additionally require parseable extracted XML
- resume state lives under `sources["govinfo-bulk"]` with request checkpoints, per-collection/per-congress counters, and per-file completion metadata
- local download concurrency is bounded at 2 and downloads stream to disk; an overlapping writer must skip rather than overwrite when the final cache path already exists, even if manifest completion for that file has not yet been persisted
- issue #29 chapter-rendering correctness layer:
- standalone section markdown remains H1, but embedded chapter-mode sections render as `## § ... {#section-...}` with statutory notes at `###` / `####` and editorial notes at `###`
- chapter frontmatter `source` is now the concrete title URL `https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title{title}` while section frontmatter still uses section-specific canonical URLs
30 changes: 30 additions & 0 deletions .dark-factory/skills/changelog.md
Expand Up @@ -469,3 +469,33 @@
- `npx tsc --noEmit` ✅
- `npm run build` ✅
- `npx vitest run` ✅ (`222 passed, 1 skipped`)

## Feature #40 — GovInfo bulk repository fetch source
- Updated `src/commands/fetch.ts`:
- registered `govinfo-bulk` as a first-class fetch source
- added `--collection=<BILLSTATUS|BILLS|BILLSUM|PLAW>` parsing/validation
- kept `govinfo-bulk` out of `fetch --all`
- allowed anonymous entry into the bulk path without `API_DATA_GOV_KEY`
- Added GovInfo bulk discovery/download implementation:
- `src/utils/govinfo-bulk-listing.ts` parses GovInfo XML directory listings, resolves relative links, and enforces the `https://www.govinfo.gov/bulkdata/` allowlist
- `src/sources/govinfo-bulk.ts` recursively discovers congress/type files, streams artifact downloads to temp files, validates XML/ZIP payloads, extracts ZIPs under sibling `extracted/` directories, and records per-file resume state
- Updated `src/utils/manifest.ts`:
- normalized new `sources["govinfo-bulk"]` state
- merged incoming bulk state with on-disk manifest contents before write so stale snapshots preserve other writers' completed file records
- Added/expanded coverage in:
- `tests/cli/fetch.test.ts`
- `tests/unit/sources/govinfo-bulk.test.ts`
- Runbook/docs status:
- `docs/DATA-ACQUISITION-RUNBOOK.md` now documents bulk usage, filters, cache layout, resume behavior, and collection priority
- Review/fix history captured from issue #40:
- `8d430f7` — main `govinfo-bulk` implementation
- `ea8bfde` / `332aa62` — QA and adversary regression coverage
- `73d3954` — manifest-merge hardening for stale-snapshot writes
- `b29a149` — final-path overlap loser guard before overwrite
- active PR is `#41` for branch `df2/issue-40`
- `[adversary-review]` is APPROVED with no findings
- Verification captured from issue context:
- `npx tsc --noEmit` ✅
- `npm run build` ✅
- `npx vitest run tests/unit/sources/govinfo-bulk.test.ts` ✅
- `npx vitest run` ⚠️ two unrelated pre-existing OLRC failures remained at the dev handoff
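
The cache layout recorded in the Feature #40 entry (remote layout preserved under the bulk cache root, ZIPs extracted under a sibling `extracted/` directory) might map URLs to paths roughly like this. It is a sketch under assumptions: `cachePathsFor` is a hypothetical name, the example file name in the usage note is illustrative, and placing `extracted/` next to the archive in the same directory is one reading of "sibling".

```typescript
import { posix } from "node:path";

const BULK_ORIGIN = "https://www.govinfo.gov";
const BULK_PREFIX = "/bulkdata/";
const CACHE_ROOT = "data/cache/govinfo-bulk";

interface BulkCachePaths {
  artifact: string;               // where the downloaded ZIP/XML is kept
  extractionRoot: string | null;  // only set for ZIP artifacts
}

/** Hypothetical helper: mirrors the remote path under the cache root. */
function cachePathsFor(fileUrl: string): BulkCachePaths {
  const url = new URL(fileUrl);
  if (url.origin !== BULK_ORIGIN || !url.pathname.startsWith(BULK_PREFIX)) {
    throw new Error(`outside bulk repository: ${fileUrl}`);
  }
  const remotePath = url.pathname.slice(BULK_PREFIX.length);
  const artifact = posix.join(CACHE_ROOT, remotePath);
  const isZip = artifact.toLowerCase().endsWith(".zip");
  return {
    artifact,
    // Assumption: the extraction directory sits beside the archive.
    extractionRoot: isZip ? posix.join(posix.dirname(artifact), "extracted") : null,
  };
}
```

For example, `cachePathsFor("https://www.govinfo.gov/bulkdata/BILLSTATUS/119/hr/BILLSTATUS-119-hr.zip")` would yield an artifact path under `data/cache/govinfo-bulk/BILLSTATUS/119/hr/` and an extraction root in the same directory.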
21 changes: 21 additions & 0 deletions .dark-factory/skills/decisions.md
Expand Up @@ -382,3 +382,24 @@
- **Decision:** `src/transforms/markdown.ts` now activates a dedicated nested rendering path only when a subsection subtree has deeper labeled descendants. In that path, descendants render through `renderGithubSafeLabeledParagraph(...)` with bold labels at column 0, and continuation/body text nodes also render without four-space indentation.
- **Consequence:** GitHub-safe nested output is now a renderer contract. Future agents must preserve the narrow compatibility gate that keeps flat sections and top-level-subsection-only sections byte-stable while preventing `\n (i)`-style regressions in affected hierarchies.
- **Feature:** #36 Sub-subsection indentation renders as code blocks on GitHub

### ADR-054: GovInfo bulk is an additive explicit fetch source with manifest-backed bulk state
- **Status:** Active
- **Context:** Historical GovInfo backfill via the API is too slow and rate-limited; the bulk repository exposes equivalent anonymous ZIP/XML artifacts with a different directory-listing model.
- **Decision:** Add `fetch --source=govinfo-bulk` as a separate source in `src/commands/fetch.ts`, keep it excluded from `fetch --all`, constrain optional `--collection` to `BILLSTATUS | BILLS | BILLSUM | PLAW`, and persist canonical progress under `sources["govinfo-bulk"]` in `data/manifest.json`.
- **Consequence:** Future agents should extend the dedicated bulk-source path instead of overloading the API-based `govinfo` client or inventing sidecar state files.
- **Feature:** #40 Add bulk download from GovInfo Bulk Data Repository
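
A minimal sketch of the `--collection` constraint from ADR-054 is shown below. The `FetchArgs` shape and `validateCollection` name are hypothetical, and exact-case matching is an assumption; the allowed values and the rule that `--collection` only applies to `govinfo-bulk` come from these notes.

```typescript
const BULK_COLLECTIONS = ["BILLSTATUS", "BILLS", "BILLSUM", "PLAW"] as const;
type BulkCollection = (typeof BULK_COLLECTIONS)[number];

// Hypothetical parsed-argument shape, for illustration only.
interface FetchArgs {
  source: string;
  collection?: string;
}

/** Returns the validated collection, or undefined when the flag was omitted. */
function validateCollection(args: FetchArgs): BulkCollection | undefined {
  if (args.collection === undefined) return undefined;
  if (args.source !== "govinfo-bulk") {
    throw new Error("--collection is only valid with --source=govinfo-bulk");
  }
  // Assumption: values must match exactly, including case.
  const match = BULK_COLLECTIONS.find((name) => name === args.collection);
  if (match === undefined) {
    throw new Error(`invalid --collection: ${args.collection}`);
  }
  return match;
}
```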

### ADR-055: GovInfo bulk downloads stream to temp files and validate before completion
- **Status:** Active
- **Context:** Bulk repository artifacts can be large enough that buffering entire ZIPs in memory would violate the architecture’s resource-usage controls and make concurrent downloads fragile.
- **Decision:** `src/sources/govinfo-bulk.ts` streams `response.body` directly to a temp file, derives byte counts from streamed bytes or `content-length`, validates XML/ZIP payloads before rename, and requires parseable extracted XML for `BILLSTATUS` ZIPs.
- **Consequence:** Future agents must preserve the streamed temp-file path and should treat a return to `response.arrayBuffer()` as a resource/safety regression.
- **Feature:** #40 Add bulk download from GovInfo Bulk Data Repository
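
The stream-to-temp-file, validate, then rename sequence in ADR-055 could look roughly like the sketch below. `streamToFile`, the temp-file naming, and the `validate` callback are hypothetical stand-ins; the sketch counts streamed bytes only and does not reproduce the repository's XML/ZIP validators.

```typescript
import { createWriteStream } from "node:fs";
import { mkdir, rename, rm } from "node:fs/promises";
import { dirname } from "node:path";
import { Readable, Transform } from "node:stream";
import { pipeline } from "node:stream/promises";

/** Streams a fetch Response to disk without buffering the whole body; returns bytes written. */
async function streamToFile(
  response: Response,
  finalPath: string,
  validate: (tempPath: string) => Promise<void>, // hypothetical validation hook
): Promise<number> {
  if (!response.ok || response.body === null) {
    throw new Error(`download failed: HTTP ${response.status}`);
  }
  const tempPath = `${finalPath}.${process.pid}.tmp`; // assumed temp naming
  await mkdir(dirname(finalPath), { recursive: true });

  let bytes = 0;
  const counter = new Transform({
    transform(chunk, _encoding, callback) {
      bytes += chunk.length; // count bytes as they pass through
      callback(null, chunk);
    },
  });

  try {
    // The cast bridges the DOM and node:stream/web ReadableStream typings.
    await pipeline(Readable.fromWeb(response.body as any), counter, createWriteStream(tempPath));
    await validate(tempPath);           // reject bad payloads before they become visible
    await rename(tempPath, finalPath);  // publish only after validation passes
    return bytes;
  } catch (error) {
    await rm(tempPath, { force: true }); // never leave a partial temp file behind
    throw error;
  }
}
```

Because the final path only appears after validation, a reader of the cache never sees a half-written or invalid artifact.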

### ADR-056: GovInfo bulk overlap safety is merge-on-write plus loser-skip checks, not blind rename
- **Status:** Active
- **Context:** Overlapping local bulk-fetch processes can race on both `data/manifest.json` and the final cache path for the same artifact.
- **Decision:** `src/utils/manifest.ts` merges incoming `sources["govinfo-bulk"]` state with the on-disk manifest before rename, and `downloadBulkArtifact()` re-reads manifest state plus final artifact/extraction-root existence immediately before destructive rename so a loser skips once another writer already completed the file.
- **Consequence:** Future agents should preserve these stale-snapshot merge and final-path re-check seams when changing bulk persistence; last-writer-wins manifest rewrites and overwrite-on-rename behavior are now explicitly rejected designs.
- **Feature:** #40 Add bulk download from GovInfo Bulk Data Repository
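
The merge-on-write half of ADR-056 can be illustrated with a small pure function. This is a sketch only: the record shape is trimmed to two fields, the status values `pending` / `complete` / `failed` are assumed rather than taken from the manifest schema, and `mergeBulkFiles` is a hypothetical name.

```typescript
// Assumed status values; the real manifest schema may use different ones.
type DownloadStatus = "pending" | "complete" | "failed";

interface BulkFileRecord {
  download_status: DownloadStatus;
  relative_cache_path: string;
}

type BulkFiles = Record<string, BulkFileRecord>;

/**
 * Merges a writer's (possibly stale) snapshot into the on-disk file records.
 * Keys known only on disk survive, and a completed record is never downgraded.
 */
function mergeBulkFiles(onDisk: BulkFiles, incoming: BulkFiles): BulkFiles {
  const merged: BulkFiles = { ...onDisk };
  for (const [key, record] of Object.entries(incoming)) {
    const existing = onDisk[key];
    if (existing?.download_status === "complete" && record.download_status !== "complete") {
      continue; // another writer already finished this file
    }
    merged[key] = record;
  }
  return merged;
}
```

A last-writer-wins rewrite would instead persist `incoming` as-is, silently dropping any file record the other process completed after the snapshot was taken.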
26 changes: 25 additions & 1 deletion .dark-factory/skills/dev.md
Expand Up @@ -47,6 +47,7 @@
- Run fetch after build:
- `node dist/index.js fetch --status`
- `node dist/index.js fetch --source=congress --congress=119`
- `node dist/index.js fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119`
- `node dist/index.js fetch --all --congress=119`
- Public CLI entry in `package.json`: `us-code-tools -> ./dist/index.js`
- CI/build note: integration/CLI tests shell out to `dist/index.js`, so `npm run build` must happen before Vitest when validating compiled CLI behavior.
@@ -77,9 +78,10 @@
- `src/sources/congress.ts` — Congress fetch orchestration
- `src/sources/congress-member-snapshot.ts` — member snapshot freshness contract
- `src/sources/govinfo.ts` — GovInfo collection walk/checkpointing
- `src/sources/govinfo-bulk.ts` — GovInfo bulk listing walk, streaming download/extract, resume, and overlap-guard logic
- `src/sources/voteview.ts` — VoteView file download/index helpers
- `src/sources/unitedstates.ts` — legislators download/parsing/crosswalk
- `src/utils/cache.ts`, `manifest.ts`, `fetch-config.ts`, `logger.ts`, `rate-limit.ts`, `retry.ts` — acquisition infrastructure
- `src/utils/cache.ts`, `manifest.ts`, `fetch-config.ts`, `govinfo-bulk-listing.ts`, `logger.ts`, `rate-limit.ts`, `retry.ts` — acquisition infrastructure
- `tests/cli/` — fetch CLI contract coverage
- `tests/unit/` — pure-module coverage
- `tests/integration/` — built CLI end-to-end coverage
@@ -94,6 +96,7 @@
- `src/sources/congress.ts` → `src/utils/cache.ts`, `src/utils/manifest.ts`, `src/utils/rate-limit.ts`, `src/utils/retry.ts`, `src/utils/logger.ts`, `src/sources/congress-member-snapshot.ts` (this source uses `getSharedApiDataGovLimiter()` and throws numeric `nextRequestAt` values that `normalizeError()` serializes into the public `next_request_at` field)
- `src/sources/congress-member-snapshot.ts` → `src/utils/manifest.ts` (freshness derives from manifest snapshot metadata + artifact existence)
- `src/sources/govinfo.ts` → `src/utils/cache.ts`, `src/utils/manifest.ts`, `src/utils/rate-limit.ts`, `src/utils/retry.ts`, `src/utils/logger.ts` (this source also uses `getSharedApiDataGovLimiter()` and preserves numeric `nextRequestAt` through `normalizeError()`)
- `src/sources/govinfo-bulk.ts` → `src/utils/govinfo-bulk-listing.ts`, `src/utils/manifest.ts`, `src/utils/logger.ts`, `fast-xml-parser`, `yauzl`, Node streams/fs (this module owns recursive bulk discovery, streaming file writes, ZIP/XML validation, per-file resume checks, and overlap loser checks before final rename)
- `src/sources/unitedstates.ts` → `src/utils/manifest.ts`, `src/sources/congress-member-snapshot.ts`, current Congress cache layout in `src/sources/congress.ts`
- `src/sources/voteview.ts` → `src/utils/manifest.ts` and its in-memory index cache (`inMemoryIndexes`)
- `src/sources/olrc.ts` → `src/domain/model.ts`, `src/domain/normalize.ts`, `src/types/yauzl.d.ts`, `src/utils/manifest.ts`, `src/utils/logger.ts` (issue #8/#21: this module owns OLRC homepage bootstrap, in-memory cookie forwarding, `download.shtml` parsing, descending/deduped vintage discovery, discovered per-vintage title URL maps, Title 53 `reserved_empty` classification, the 128 MiB large-title entry cap, aggregate `--all-vintages` execution, and `resolveCachedOlrcTitleZipPath()`)
@@ -149,6 +152,16 @@ src/index.ts (main)
→ isRateLimitExhausted() / markRateLimitUse()
→ parseRetryAfter() on HTTP 429, then throw numeric `nextRequestAt` for `normalizeError()` to serialize
→ writeManifest()
→ fetchGovInfoBulkSource()
→ fetchListing()
→ parseGovInfoBulkListing()
→ resolveGovInfoBulkUrl() / isAllowedGovInfoBulkUrl()
→ discoverFilesForCongress()
→ downloadBulkArtifact()
→ streamResponseToDisk()
→ validateXmlPayload() | extractZipSafely()
→ wasArtifactCompletedByAnotherWriter()
→ writeManifest()
→ fetchVoteViewSource()
→ fetchWithTimeout()
→ writeManifest()
@@ -182,6 +195,8 @@ src/index.ts (main)
- `OlrcTitleState` / `OlrcTitleReservedEmptyState` in `src/utils/manifest.ts` — per-title OLRC cache/result contract for issues #8/#21
- `OlrcVintageState` / `OlrcAvailableVintagesState` / `OlrcManifestState` in `src/utils/manifest.ts` — historical OLRC manifest contract and latest-mode compatibility mirror
- `CongressMemberSnapshotState` / `CongressRunState` / `GovInfoCheckpointState` / `LegislatorsCrossReferenceState` in `src/utils/manifest.ts` — per-source manifest contracts
- `GovInfoBulkManifestState` / `GovInfoBulkCollectionState` / `GovInfoBulkCongressState` / `GovInfoBulkFileState` in `src/sources/govinfo-bulk.ts` — bulk repository manifest contracts merged by `src/utils/manifest.ts`
- `GovInfoBulkCollection` / `GovInfoBulkListingEntry` in `src/utils/govinfo-bulk-listing.ts` — bulk listing parser and allowed-collection contract
- `CurrentCongressResolution` in `src/utils/fetch-config.ts` — `override`/`live`/`fallback` current-congress contract
- `RawResponseCacheMetadata` in `src/utils/cache.ts` — raw API response cache metadata contract
- `RateLimitState` / `RateLimitExhaustion` in `src/utils/rate-limit.ts` — shared limiter contract
@@ -205,7 +220,11 @@ src/index.ts (main)
- The implementation uses `git fast-import` for historical commits, then `git reset --hard HEAD` to restore a clean working tree.
- Fetch-path conventions:
- `src/commands/fetch.ts` owns CLI validation and top-level fail-open source ordering
- `govinfo-bulk` is an explicit source only; keep it out of `fetch --all` unless spec/architecture change because it can trigger multi-GB historical downloads
- `src/utils/manifest.ts` is permissive on read/normalize but all writers should emit the canonical shape
- GovInfo bulk manifest writes intentionally merge on-disk `sources["govinfo-bulk"]` state before rename so stale snapshots do not delete another writer's completed file entries
- GovInfo bulk downloads must stream `response.body` to disk; do not reintroduce `response.arrayBuffer()` for multi-GB artifacts
- GovInfo bulk overlap safety depends on the pre-rename `wasArtifactCompletedByAnotherWriter(...)` re-check of refreshed manifest state plus final artifact/extraction-root existence; preserve that loser-skip seam if you touch rename/extraction flow
- Congress/GovInfo raw API caching goes through `src/utils/cache.ts`; cache keys normalize away `api_key`
- Congress and GovInfo both call `getSharedApiDataGovLimiter()` / `resetSharedApiDataGovLimiter()` from `src/utils/rate-limit.ts`; update tests and any mocks at that shared-module seam rather than assuming per-source limiter state
- `src/utils/rate-limit.ts` owns `parseRetryAfter()`, and both `src/sources/congress.ts` and `src/sources/govinfo.ts` now keep the parsed retry horizon numeric until `normalizeError()` converts it into the public ISO `next_request_at` field
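
The pre-rename loser-skip seam mentioned in the GovInfo bulk conventions above might reduce to a check like the following. It is a hedged sketch: `shouldSkipAsLoser` is a hypothetical name (the notes refer to `wasArtifactCompletedByAnotherWriter(...)`, whose real signature is not shown here), the `complete` status value is assumed, and treating any one of the three signals as sufficient is an assumption about how they combine.

```typescript
import { existsSync } from "node:fs";

interface FileRecord {
  download_status: string;
}

/**
 * Decides, immediately before the final rename, whether another writer already won.
 * `readFreshRecord` must re-read the manifest from disk rather than use a cached snapshot.
 */
function shouldSkipAsLoser(
  readFreshRecord: () => FileRecord | undefined,
  finalPath: string,
  extractionRoot: string | null,
  exists: (path: string) => boolean = existsSync, // injectable for tests
): boolean {
  if (readFreshRecord()?.download_status === "complete") return true; // manifest already says done
  if (exists(finalPath)) return true; // artifact landed before the manifest caught up
  return extractionRoot !== null && exists(extractionRoot);
}
```

This narrows the race window rather than closing it, which matches the notes' description of the approach as loser-check-based rather than full locking.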
@@ -373,6 +392,11 @@ src/index.ts
- chapter-level xrefs never point to `section-*.md`; they resolve through writer-built `sectionTargetsByRef` entries or exact `uscode.house.gov` section URLs
- `_title.md` intentionally keeps only title/chapter navigation, while nested labeled content now renders as multiple indented lines with parent-before-child ordering
- ordered and non-ordered parse paths must agree on `SectionIR.heading`; the shared helper seam is `readSectionHeading(...)`
- issue #40 GovInfo bulk fetch work:
- CLI adds `--source=govinfo-bulk` and `--collection=<BILLSTATUS|BILLS|BILLSUM|PLAW>`; `--collection` is invalid for every other source
- the production path is XML-listing-driven: recurse from `https://www.govinfo.gov/bulkdata/{collection}/`, filter optional `--congress`, then download file entries only
- file cache paths preserve remote layout under `data/cache/govinfo-bulk/{collection}/{congress}/...`; ZIPs keep the downloaded archive plus a sibling `extracted/` directory
- manifest merge/overlap behavior is part of the contract, not an implementation detail: stale snapshots must preserve other writers' completed file keys, and overlap losers must skip once the final cache path exists, even if manifest completion for that file has not landed yet
- What's intentionally deferred:
- What's intentionally deferred:
- additional backfill phases