[codex] Add data quality and search artifacts #24

Conversation

Review Summary by Qodo

Add data quality report, consumer examples, and split search indexes, with algorithm row cleaning.

Walkthrough

- Adds three new API artifacts: data-quality.json (quality metrics and cache statistics), examples.json (consumer examples), and split search indexes under api/indexes/ (vendors, algorithms, statuses, standards)
- Cleans detailed algorithm rows by trimming PDF policy headers, copyright notices, reproduction boilerplate, and standalone table headers from extraction output
- Integrates the new artifacts into API discovery, OpenAPI specifications, Markdown documentation, homepage links, JSON schemas, and strict validation
- Regenerates all certificate data with updated timestamps and consistent field ordering (algorithm_extraction before algorithms)
- Updates README with feature descriptions, endpoint documentation, and curl example commands for the new endpoints
- All 5,256 certificate detail records reused, with 0 HTML failures, 0 PDF failures, and 0 certificate timeouts

Diagram

```mermaid
flowchart LR
    A["Certificate Data"] -->|"Clean algorithm rows"| B["Cleaned Extraction Output"]
    B -->|"Generate"| C["data-quality.json"]
    B -->|"Generate"| D["examples.json"]
    B -->|"Generate"| E["Split Indexes<br/>vendors/algorithms/<br/>statuses/standards"]
    C -->|"Wire into"| F["API Discovery & OpenAPI"]
    D -->|"Wire into"| F
    E -->|"Wire into"| F
    F -->|"Update"| G["README & Docs"]
```
File Changes

1. README.md
Code Review by Qodo

1. Cached provenance hides misses
```diff
 "status": "cached",
 "configured_source": "crawl4ai",
 "source": "none",
 "source_url": null,
-"cached": false,
-"fallback_used": true,
+"cached": true,
+"fallback_used": false,
 "cache_version": "2026-04-15-legacy-v1",
 "algorithm_count": 0,
-"detailed_algorithm_count": 0,
-"attempts": [
-  {
-    "source": "crawl4ai",
-    "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-    "status": "no_algorithms"
-  },
-  {
-    "source": "security_policy_pdf",
-    "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-    "status": "no_algorithms"
-  }
-]
+"detailed_algorithm_count": 0
```
1. Cached provenance hides misses · 🐞 Bug · Correctness
Some regenerated certificate detail records change algorithm_extraction.status from "miss" to "cached" and reset fallback_used/attempts even though algorithm_count remains 0. This prevents build_data_quality_report() from surfacing these certificates as misses/fallback usage and yields misleading api/data-quality.json metrics (e.g., summary.misses: 0).
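The undercount follows directly from how the report is keyed. A minimal sketch of the counting logic described above, using the field names from the JSON shown in the diff (the function name is hypothetical, not the actual `build_data_quality_report()`):

```python
def count_misses_and_fallbacks(extractions):
    """Tally misses and fallback usage from algorithm_extraction records.

    Mirrors the reporting rules described in the review: a certificate
    counts as a miss only when status == "miss", and as fallback usage
    only when fallback_used is true or more than one attempt was made.
    """
    misses = 0
    fallbacks = 0
    for extraction in extractions:
        if extraction.get("status") == "miss":
            misses += 1
        if extraction.get("fallback_used") or len(extraction.get("attempts", [])) > 1:
            fallbacks += 1
    return {"misses": misses, "fallback_used": fallbacks}

# A regenerated record rewritten as status="cached" with fallback_used=False
# and no attempts is invisible to both counters, even though its
# algorithm_count is still 0.
rewritten = {"status": "cached", "cached": True, "fallback_used": False,
             "algorithm_count": 0}
```

Feeding the rewritten record through this logic yields `{"misses": 0, "fallback_used": 0}`, which is exactly the misleading `summary.misses: 0` the comment describes.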
Agent Prompt
## Issue description
When certificate details reuse cached algorithm extraction results, the regenerated `algorithm_extraction` provenance is rebuilt with `status: "cached"` and default `fallback_used: false` and no `attempts`, even when the cached/previous provenance indicated a miss and/or fallback. This causes `api/data-quality.json` to underreport misses/fallbacks because it keys off `status == "miss"` and `fallback_used/attempts`.
## Issue Context
- In the cached-reuse path, `build_algorithm_extraction_provenance(..., "cached", ..., cached=True)` does not carry forward the previous `status`, `fallback_used`, or `attempts`.
- Data-quality reporting counts misses only when `algorithm_extraction.status == "miss"` and counts fallback usage when `fallback_used` is true or attempts length > 1.
## Fix Focus Areas
- scraper.py[1748-1770]
- scraper.py[488-501]
- scraper.py[2737-2756]
- scraper.py[1883-1892]
## Implementation guidance
1. In the `trusted_algorithm_reuse` branch (cached reuse), derive provenance fields from the previous `algorithm_extraction` object when present:
- Keep `cached: true` to indicate reuse.
- Preserve `fallback_used` from previous extraction (or infer from previous attempts length > 1).
- Preserve `attempts` (at least for detail payloads) so fallback/miss reasons remain observable.
- Set `status` to the previous status when valid; otherwise set to `"cached"` only when categories/detailed are non-empty, else `"miss"` (similar to the timeout fallback logic).
2. Consider extending `cached_algorithm_extraction_source()` (or adding a new helper) to return the previous extraction object (or its fallback/attempts) so the reuse path can copy it reliably.
3. Ensure regenerated artifacts still satisfy `validate_api.py` and JSON schema requirements (attempts remain optional).
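The guidance above could be sketched as a small merge helper. This is illustrative only: the function name and exact field handling are assumptions, not the actual scraper.py implementation, but it shows the intended carry-forward of `status`, `fallback_used`, and `attempts`:

```python
def merge_cached_provenance(previous, categories, detailed):
    """Rebuild provenance for the cached-reuse path without losing history.

    previous:   the prior algorithm_extraction dict (may be None/empty)
    categories: extracted algorithm categories for this certificate
    detailed:   detailed algorithm rows for this certificate
    """
    prev = previous or {}
    attempts = prev.get("attempts", [])
    # Preserve fallback_used, or infer it from more than one attempt.
    fallback_used = bool(prev.get("fallback_used")) or len(attempts) > 1

    prev_status = prev.get("status")
    if prev_status in ("hit", "miss", "cached"):
        status = prev_status  # keep the previous status when valid
    elif categories or detailed:
        status = "cached"     # non-empty reuse without a usable prior status
    else:
        status = "miss"       # empty payload: report it as a miss

    provenance = {
        "status": status,
        "cached": True,  # still marks this as a cache reuse
        "fallback_used": fallback_used,
    }
    if attempts:
        provenance["attempts"] = attempts  # attempts stay optional
    return provenance
```

With this shape, a cached reuse of a prior miss keeps `status: "miss"` and its attempts, so the data-quality counters remain accurate while `cached: true` still records the reuse.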
What changed

- api/data-quality.json with misses, refreshed records, fallback usage, changed certificates, cache reuse checks, and the next scheduled weekly run.
- api/examples.json plus split search indexes under api/indexes/ for vendor, algorithm, status, and standard.

Validation
- .venv-codex/bin/python -m py_compile scraper.py validate_api.py test_scraper.py
- .venv-codex/bin/python test_scraper.py
- .venv-codex/bin/python validate_api.py --require-current-schema --forbid-firecrawl-run-source
- git diff --check

Notes
The regenerated run reused all 5,256 certificate detail records and all cached algorithm payloads, with 0 HTML failures, 0 PDF failures, and 0 certificate timeouts. Certificate 5267 was spot-checked after regeneration: its detailed algorithm rows no longer include the policy title, copyright footer, reproduction notice, or standalone `Cert` fragment.
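The row cleaning spot-checked above might look like the following sketch: filter extracted detail rows against a small set of boilerplate patterns. The patterns here are illustrative assumptions based on the artifacts named in this PR (policy title, copyright footer, reproduction notice, standalone table header), not the actual rules in scraper.py:

```python
import re

# Hypothetical patterns for boilerplate that leaks from security-policy
# PDFs into extracted algorithm rows.
BOILERPLATE_PATTERNS = [
    re.compile(r"non-proprietary security policy", re.IGNORECASE),  # policy title
    re.compile(r"copyright|©", re.IGNORECASE),                      # copyright footer
    re.compile(r"may be reproduced", re.IGNORECASE),                # reproduction notice
    re.compile(r"^\s*cert\.?\s*#?\s*$", re.IGNORECASE),             # standalone "Cert" header cell
]

def clean_algorithm_rows(rows):
    """Drop rows matching any boilerplate pattern, keep real algorithm rows."""
    return [row for row in rows
            if not any(p.search(row) for p in BOILERPLATE_PATTERNS)]
```

Note the last pattern is anchored so that a bare `Cert` header cell is removed while a real row like `AES-CBC (Cert. #1234)` survives.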