Add bulk download from GovInfo Bulk Data Repository #40

@v1d0b0t

Summary

Add a `fetch --source=govinfo-bulk` command that downloads data from the GovInfo Bulk Data Repository instead of crawling the GovInfo API one request at a time. This replaces days to weeks of rate-limited API crawling with a few hours of direct ZIP downloads.

Background

The current `fetch --source=govinfo` client crawls the GovInfo API at 5,000 req/hr (shared with Congress.gov). A full historical crawl takes days to weeks of hourly sessions.

GovInfo publishes the same data as bulk ZIP downloads with no API key and no rate limits:

Available Collections

| Collection | Path | Coverage | What It Contains |
| --- | --- | --- | --- |
| BILLSTATUS | `/bulkdata/BILLSTATUS/{congress}/{type}/` | 108th–present (2003+) | Bill lifecycle, sponsors, actions, committees, cosponsors, related bills |
| BILLS | `/bulkdata/BILLS/{congress}/{type}/` | 113th–present (2013+) | Full bill text as XML |
| BILLSUM | `/bulkdata/BILLSUM/{congress}/` | 113th–present | CRS bill summaries |
| PLAW | `/bulkdata/PLAW/{congress}/` | Public & private laws | Enacted law text (USLM XML) |

Directory structure example:
```
/bulkdata/BILLSTATUS/119/hr/ → ZIP of all House bill statuses, 119th Congress
/bulkdata/BILLSTATUS/119/s/ → ZIP of all Senate bill statuses, 119th Congress
/bulkdata/BILLSTATUS/118/hr/ → 118th Congress House bills
...back to 108th Congress
```
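To make the layout concrete, here is a minimal sketch that lists the BILLSTATUS directories to visit. `BULK_ROOT`, `BILL_TYPES`, and `billStatusDirs` are illustrative names, not existing code; the root URL is an assumption to confirm against the live site, and only the `hr` and `s` types from the example are included.

```typescript
// Assumption: public root of the bulk repository; confirm against the live site.
const BULK_ROOT = "https://www.govinfo.gov/bulkdata";

// Only the two bill types shown in the example above; extend as needed.
const BILL_TYPES = ["hr", "s"];

// List BILLSTATUS directories from the newest congress back to the oldest.
function billStatusDirs(oldest: number, newest: number): string[] {
  const dirs: string[] = [];
  for (let congress = newest; congress >= oldest; congress--) {
    for (const type of BILL_TYPES) {
      dirs.push(`${BULK_ROOT}/BILLSTATUS/${congress}/${type}/`);
    }
  }
  return dirs;
}

const dirs = billStatusDirs(108, 119);
console.log(dirs.length, dirs[0]);
```

With two types and congresses 108 through 119, this yields 24 directories, newest first.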

Proposed Implementation

New CLI command

```bash
# Download all bulk collections
npx us-code-tools fetch --source=govinfo-bulk

# Download specific collection
npx us-code-tools fetch --source=govinfo-bulk --collection=BILLSTATUS

# Download specific congress only
npx us-code-tools fetch --source=govinfo-bulk --congress=119

# Download specific collection + congress
npx us-code-tools fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119
```
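One possible way to validate these flags, sketched as a pure function. `parseFetchArgs` and `BulkOptions` are hypothetical names, and the real CLI may already have its own argument parser; this only shows the checks the two filters would need.

```typescript
interface BulkOptions {
  source: string;
  collection?: string;
  congress?: number;
}

// The four collections named in this issue.
const COLLECTIONS = new Set(["BILLSTATUS", "BILLS", "BILLSUM", "PLAW"]);

// Parse `--name=value` flags; anything else (e.g. the `fetch` subcommand) is ignored.
function parseFetchArgs(argv: string[]): BulkOptions {
  const flags = new Map<string, string>();
  for (const arg of argv) {
    const match = /^--([a-z-]+)=(.+)$/.exec(arg);
    if (match) {
      const [, name = "", value = ""] = match;
      flags.set(name, value);
    }
  }

  const source = flags.get("source");
  if (!source) throw new Error("--source is required");
  const options: BulkOptions = { source };

  const collection = flags.get("collection");
  if (collection !== undefined) {
    if (!COLLECTIONS.has(collection)) throw new Error(`unknown collection: ${collection}`);
    options.collection = collection;
  }

  const congress = flags.get("congress");
  if (congress !== undefined) {
    const n = Number(congress);
    if (!Number.isInteger(n) || n < 1) throw new Error(`invalid congress: ${congress}`);
    options.congress = n;
  }
  return options;
}

console.log(parseFetchArgs(["fetch", "--source=govinfo-bulk", "--congress=119"]));
```

Leaving `collection` or `congress` undefined means "all", which matches the first example command.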

Behavior

  1. Enumerate available congresses by crawling the bulk data directory listing (XML format)
  2. Download ZIPs for each congress/type combination
  3. Extract to `data/cache/govinfo-bulk/{collection}/{congress}/{type}/`
  4. Track progress in manifest (congress + collection granularity)
  5. Support resume — skip already-downloaded ZIPs (check size/date)
  6. No API key required
  7. No rate limiting needed, but stay polite: cap at 1-2 concurrent downloads
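Steps 5 and 7 could look roughly like the sketch below: a resume check that compares size and date, and a small helper that keeps at most N downloads in flight. All names (`RemoteZip`, `ManifestEntry`, `needsDownload`, `runLimited`) are hypothetical, and the actual download is left to the caller's `worker` function.

```typescript
interface RemoteZip {
  url: string;
  size: number;
  lastModified: string;
}

interface ManifestEntry {
  size: number;
  lastModified: string;
}

// Resume check: skip a ZIP when the manifest records the same size and date.
function needsDownload(remote: RemoteZip, done: Map<string, ManifestEntry>): boolean {
  const prev = done.get(remote.url);
  return !prev || prev.size !== remote.size || prev.lastModified !== remote.lastModified;
}

// Run `worker` over `items` with at most `limit` calls in flight at a time.
// Results come back in input order.
async function runLimited<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}
```

A caller would filter the remote list with `needsDownload` first, then pass the remainder to `runLimited` with a limit of 1 or 2.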

Cache structure

```
data/cache/govinfo-bulk/
├── BILLSTATUS/
│   ├── 119/
│   │   ├── hr/    → extracted XML files
│   │   └── s/
│   ├── 118/
│   │   ├── hr/
│   │   └── s/
│   └── ...back to 108
├── BILLS/
│   └── ...
├── BILLSUM/
│   └── ...
└── PLAW/
    └── ...
```
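A small helper can map a download onto this tree. `cacheDir` is a hypothetical name, and treating `type` as optional is an assumption: the BILLSUM and PLAW paths listed earlier have no `{type}` segment, so their cache directories presumably stop at the congress level.

```typescript
const CACHE_ROOT = "data/cache/govinfo-bulk";

// Build the extraction directory for one collection/congress(/type) combination.
function cacheDir(collection: string, congress: number, type?: string): string {
  const parts = [CACHE_ROOT, collection, String(congress)];
  if (type) parts.push(type);
  return parts.join("/") + "/";
}

console.log(cacheDir("BILLSTATUS", 119, "hr")); // data/cache/govinfo-bulk/BILLSTATUS/119/hr/
console.log(cacheDir("PLAW", 119));             // data/cache/govinfo-bulk/PLAW/119/
```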

Collections priority

  1. BILLSTATUS — most important, has bill lifecycle data needed for "bills as PRs"
  2. PLAW — public law text for linking code changes to specific laws
  3. BILLS — full bill text (large, defer if disk space is tight)
  4. BILLSUM — summaries (nice to have, small)
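The ordering above could be applied when planning a run, as in this sketch. `planCollections` and its `deferLarge` option are hypothetical; the option stands in for the "defer if disk space is tight" note on BILLS.

```typescript
// Priority order from the list above, highest first.
const PRIORITY = ["BILLSTATUS", "PLAW", "BILLS", "BILLSUM"];

// Return the requested collections in priority order.
// Hypothetical option: `deferLarge` drops BILLS when disk space is tight.
function planCollections(requested: string[], deferLarge = false): string[] {
  return PRIORITY.filter(
    (name) => requested.indexOf(name) !== -1 && !(deferLarge && name === "BILLS"),
  );
}

console.log(planCollections(["BILLS", "BILLSUM", "BILLSTATUS"]));       // BILLSTATUS, BILLS, BILLSUM
console.log(planCollections(["BILLS", "BILLSUM", "BILLSTATUS"], true)); // BILLSTATUS, BILLSUM
```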

Relationship to Existing Clients

  • `fetch --source=govinfo` (API client) remains for real-time/incremental updates
  • `fetch --source=govinfo-bulk` is for initial historical bulk load
  • `fetch --source=congress` (Congress.gov API) may become unnecessary for bill data if BILLSTATUS covers the same fields — evaluate after bulk download completes

Acceptance Criteria

  • `fetch --source=govinfo-bulk` downloads and extracts ZIPs from GovInfo bulk repository
  • Supports `--collection` and `--congress` filters
  • Progress tracked in manifest with resume support
  • No API key required
  • Downloads BILLSTATUS for all available congresses (108–119)
  • Extracted XML files are valid and parseable
  • Runbook updated with bulk download instructions
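For the manifest criterion, one possible shape is a flat map keyed by collection and congress, which makes "what is left to download" a simple scan. The `Manifest` type, `manifestKey`, and `remaining` are illustrative only; the existing manifest format in the repo may differ.

```typescript
type Status = "pending" | "complete";

// Hypothetical manifest shape: one entry per "collection/congress" key.
type Manifest = Record<string, Status>;

function manifestKey(collection: string, congress: number): string {
  return `${collection}/${congress}`;
}

// Return the congresses in [first, last] that still need work for a collection.
function remaining(manifest: Manifest, collection: string, first: number, last: number): number[] {
  const todo: number[] = [];
  for (let congress = first; congress <= last; congress++) {
    if (manifest[manifestKey(collection, congress)] !== "complete") todo.push(congress);
  }
  return todo;
}

// Example: 118 and 119 are done, so 108 through 117 remain.
const manifest: Manifest = { "BILLSTATUS/118": "complete", "BILLSTATUS/119": "complete" };
console.log(remaining(manifest, "BILLSTATUS", 108, 119));
```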

Estimated Download Size

  • BILLSTATUS: ~2-5 GB across all congresses (rough estimate)
  • PLAW: ~500 MB–1 GB
  • BILLS: ~10-20 GB (full text of every bill version)
  • BILLSUM: ~500 MB

Total for BILLSTATUS + PLAW (priority): roughly 2.5–6 GB by the ranges above, probably under 5 GB.
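As a quick sanity check, the rough ranges above can be summed directly. The figures below are copied from the list (in GB) and are estimates, not measurements; `ESTIMATES_GB` and `totalRange` are illustrative names.

```typescript
// [low, high] estimates in GB, copied from the list above.
const ESTIMATES_GB: Record<string, [number, number]> = {
  BILLSTATUS: [2, 5],
  PLAW: [0.5, 1],
  BILLS: [10, 20],
  BILLSUM: [0.5, 0.5],
};

// Sum the low and high ends of the named collections.
function totalRange(names: string[]): [number, number] {
  let low = 0;
  let high = 0;
  for (const name of names) {
    const range = ESTIMATES_GB[name];
    if (!range) throw new Error(`no estimate for ${name}`);
    low += range[0];
    high += range[1];
  }
  return [low, high];
}

console.log(totalRange(["BILLSTATUS", "PLAW"]));    // [2.5, 6]
console.log(totalRange(Object.keys(ESTIMATES_GB))); // [13, 26.5]
```

The priority pair stays in single-digit GB even at the high end, while BILLS dominates the full total.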
