Summary
Add a `fetch --source=govinfo-bulk` command that downloads data from the GovInfo Bulk Data Repository instead of crawling the GovInfo API one request at a time. This replaces days/weeks of rate-limited API crawling with a few hours of direct ZIP downloads.
Background
The current `fetch --source=govinfo` client crawls the GovInfo API at 5,000 req/hr (shared with Congress.gov). A full historical crawl takes days to weeks of hourly sessions.
GovInfo publishes the same data as bulk ZIP downloads with no API key and no rate limits:
Available Collections
| Collection | Path | Coverage | What It Contains |
|---|---|---|---|
| BILLSTATUS | `/bulkdata/BILLSTATUS/{congress}/{type}/` | 108th–present (2003+) | Bill lifecycle, sponsors, actions, committees, cosponsors, related bills |
| BILLS | `/bulkdata/BILLS/{congress}/{type}/` | 113th–present (2013+) | Full bill text as XML |
| BILLSUM | `/bulkdata/BILLSUM/{congress}/` | 113th–present | CRS bill summaries |
| PLAW | `/bulkdata/PLAW/{congress}/` | Public & private laws | Enacted law text (USLM XML) |
Directory structure example:
```
/bulkdata/BILLSTATUS/119/hr/ → ZIP of all House bill statuses, 119th Congress
/bulkdata/BILLSTATUS/119/s/ → ZIP of all Senate bill statuses, 119th Congress
/bulkdata/BILLSTATUS/118/hr/ → 118th Congress House bills
...back to 108th Congress
```
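A minimal sketch of how the path templates above could map to directory URLs. It assumes the repository is served from `https://www.govinfo.gov`; the `bulkDirUrl` helper and the `Collection` type are hypothetical names, not part of the existing tool.

```typescript
// Assumption: the bulk repository is served from the public GovInfo host.
const GOVINFO_BASE = "https://www.govinfo.gov";

type Collection = "BILLSTATUS" | "BILLS" | "BILLSUM" | "PLAW";

// Collections whose path template includes a {type} segment, per the table above.
const TYPED_COLLECTIONS: Collection[] = ["BILLSTATUS", "BILLS"];

// Build the directory URL for one collection/congress, plus the bill type
// where the collection uses one. Hypothetical helper for illustration.
function bulkDirUrl(collection: Collection, congress: number, billType?: string): string {
  const segments: string[] = ["bulkdata", collection, String(congress)];
  if (TYPED_COLLECTIONS.indexOf(collection) !== -1) {
    if (!billType) {
      throw new Error(collection + ' requires a bill type such as "hr" or "s"');
    }
    segments.push(billType);
  }
  return GOVINFO_BASE + "/" + segments.join("/") + "/";
}
```

For example, `bulkDirUrl("BILLSTATUS", 119, "hr")` yields the first directory in the listing above, prefixed with the host.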
Proposed Implementation
New CLI command
```bash
# Download all bulk collections
npx us-code-tools fetch --source=govinfo-bulk
# Download specific collection
npx us-code-tools fetch --source=govinfo-bulk --collection=BILLSTATUS
# Download specific congress only
npx us-code-tools fetch --source=govinfo-bulk --congress=119
# Download specific collection + congress
npx us-code-tools fetch --source=govinfo-bulk --collection=BILLSTATUS --congress=119
```
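One way the flags above might be parsed, as a rough sketch. The `BulkOptions` shape and `parseBulkArgs` function are hypothetical; the real CLI may already have its own argument handling that this should plug into instead.

```typescript
// Hypothetical option shape; not taken from the existing CLI.
interface BulkOptions {
  collection?: string; // undefined means "all collections"
  congress?: number;   // undefined means "all congresses"
}

const COLLECTIONS = ["BILLSTATUS", "BILLS", "BILLSUM", "PLAW"];

// Parse the flags that follow `fetch`. Returns null when --source is not
// govinfo-bulk so that another client can handle the command.
function parseBulkArgs(argv: string[]): BulkOptions | null {
  const flags: Record<string, string> = {};
  for (const arg of argv) {
    const match = /^--([a-z-]+)=(.*)$/.exec(arg);
    if (match) flags[match[1]] = match[2];
  }
  if (flags["source"] !== "govinfo-bulk") return null;

  const options: BulkOptions = {};
  const collection = flags["collection"];
  if (collection !== undefined) {
    if (COLLECTIONS.indexOf(collection) === -1) {
      throw new Error("Unknown collection: " + collection);
    }
    options.collection = collection;
  }
  const congress = flags["congress"];
  if (congress !== undefined) {
    // Accept positive integers only, e.g. 119.
    if (!/^\d+$/.test(congress) || Number(congress) < 1) {
      throw new Error("Invalid congress: " + congress);
    }
    options.congress = Number(congress);
  }
  return options;
}
```

Rejecting unknown collections and non-numeric congresses up front avoids starting a long download run with a typo in the flags.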
Behavior
- Enumerate available congresses by crawling the bulk data directory listing (XML format)
- Download ZIPs for each congress/type combination
- Extract to `data/cache/govinfo-bulk/{collection}/{congress}/{type}/`
- Track progress in a manifest at congress + collection granularity
- Support resume: skip ZIPs that are already downloaded (compare size and date)
- No API key required
- No rate limiting needed, but be polite: cap at 1-2 concurrent downloads
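The resume check and the concurrency cap from the list above could look roughly like this. It is a sketch under assumptions: `ZipInfo`, `needsDownload`, and `runLimited` are hypothetical names, and how the remote size and date are obtained (directory listing, HTTP headers, or otherwise) is left open.

```typescript
// Metadata for one ZIP, as reported remotely or recorded locally.
// Field names are illustrative, not taken from GovInfo or the existing tool.
interface ZipInfo {
  sizeBytes: number;
  lastModified: Date;
}

// Resume check: download when there is no local copy, the sizes differ,
// or the remote file is newer than the one already on disk.
function needsDownload(remote: ZipInfo, local: ZipInfo | undefined): boolean {
  if (local === undefined) return true;
  if (local.sizeBytes !== remote.sizeBytes) return true;
  return remote.lastModified.getTime() > local.lastModified.getTime();
}

// Run async jobs with at most `limit` in flight; results keep input order.
async function runLimited<T>(
  jobs: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0;
  // Each worker pulls the next unclaimed job until none remain.
  async function worker(): Promise<void> {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await jobs[index]();
    }
  }
  const workers: Array<Promise<void>> = [];
  const workerCount = Math.max(1, Math.min(limit, jobs.length));
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

Calling `runLimited(jobs, 2)` with one job per ZIP that passes `needsDownload` would keep the client at two simultaneous downloads.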
Cache structure
```
data/cache/govinfo-bulk/
├── BILLSTATUS/
│ ├── 119/
│ │ ├── hr/ → extracted XML files
│ │ └── s/
│ ├── 118/
│ │ ├── hr/
│ │ └── s/
│ └── ...back to 108
├── BILLS/
│ └── ...
├── BILLSUM/
│ └── ...
└── PLAW/
└── ...
```
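A small sketch of how the cache layout above could be derived from a download unit, along with a manifest key at congress + collection granularity. The function names and the key format are assumptions for illustration only.

```typescript
const CACHE_ROOT = "data/cache/govinfo-bulk";

// Directory that receives the extracted XML for one download unit.
// `billType` is omitted for collections whose path has no {type} segment.
// Hypothetical helper; forward slashes are used regardless of platform.
function cacheDir(collection: string, congress: number, billType?: string): string {
  const parts: string[] = [CACHE_ROOT, collection, String(congress)];
  if (billType) parts.push(billType);
  return parts.join("/") + "/";
}

// Key for tracking progress per collection + congress.
// The "COLLECTION/congress" format is an assumption, not an existing convention.
function manifestKey(collection: string, congress: number): string {
  return collection + "/" + String(congress);
}
```

With this shape, `hr` and `s` for the same congress share one manifest entry, so a congress only counts as complete once every bill type under it has been extracted.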
Collections priority
- BILLSTATUS — most important, has bill lifecycle data needed for "bills as PRs"
- PLAW — public law text for linking code changes to specific laws
- BILLS — full bill text (large, defer if disk space is tight)
- BILLSUM — summaries (nice to have, small)
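If the command downloads several collections in one run, the priority list above could drive the order. A minimal sketch, with `PRIORITY` and `byPriority` as hypothetical names:

```typescript
// Download order, taken from the priority list above.
const PRIORITY = ["BILLSTATUS", "PLAW", "BILLS", "BILLSUM"];

// Position in the priority list; unknown names sort last.
function rank(name: string): number {
  const index = PRIORITY.indexOf(name);
  return index === -1 ? PRIORITY.length : index;
}

// Return a new array ordered by priority without mutating the input.
function byPriority(collections: string[]): string[] {
  return collections.slice().sort((a, b) => rank(a) - rank(b));
}
```

Fetching in this order means an interrupted run still leaves the most useful data on disk first.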
Relationship to Existing Clients
- `fetch --source=govinfo` (API client) remains for real-time/incremental updates
- `fetch --source=govinfo-bulk` is for initial historical bulk load
- `fetch --source=congress` (Congress.gov API) may become unnecessary for bill data if BILLSTATUS covers the same fields — evaluate after bulk download completes
Acceptance Criteria
Estimated Download Size
- BILLSTATUS: ~2-5 GB across all congresses (rough estimate)
- PLAW: ~500 MB–1 GB
- BILLS: ~10-20 GB (full text of every bill version)
- BILLSUM: ~500 MB
Total for BILLSTATUS + PLAW (priority): roughly 2.5-6 GB from the ranges above, so probably under 5 GB unless both land at the high end.
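The totals can be checked by summing the per-collection ranges. The figures below are the rough estimates listed above, not measured sizes; the names are hypothetical.

```typescript
// [low, high] estimate in GB, copied from the rough figures above.
type SizeRange = [number, number];

const ESTIMATES_GB: Record<string, SizeRange> = {
  BILLSTATUS: [2, 5],
  PLAW: [0.5, 1],
  BILLS: [10, 20],
  BILLSUM: [0.5, 0.5],
};

// Sum the low and high bounds for the named collections.
function totalGb(names: string[]): SizeRange {
  let low = 0;
  let high = 0;
  for (const name of names) {
    const range = ESTIMATES_GB[name];
    low += range[0];
    high += range[1];
  }
  return [low, high];
}

const priorityTotal = totalGb(["BILLSTATUS", "PLAW"]);   // low 2.5, high 6
const fullTotal = totalGb(Object.keys(ESTIMATES_GB));    // low 13, high 26.5
```

So the priority collections need about 2.5-6 GB, and all four together about 13-26.5 GB, before accounting for any extra space used by extraction.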