[codex] Add data quality and search artifacts #24

Conversation

Review Summary by Qodo

Add data quality report, consumer examples, and split search indexes, with algorithm row cleaning.

Walkthrough

- Adds three new API artifacts: data-quality.json (quality metrics and cache statistics), examples.json (consumer examples), and split search indexes under api/indexes/ (vendors, algorithms, statuses, standards)
- Cleans detailed algorithm rows by trimming PDF policy headers, copyright notices, reproduction boilerplate, and standalone table headers from extraction output
- Integrates the new artifacts into API discovery, OpenAPI specifications, Markdown documentation, homepage links, JSON schemas, and strict validation
- Regenerates all certificate data with updated timestamps and consistent field ordering (algorithm_extraction before algorithms)
- Updates README with feature descriptions, endpoint documentation, and curl example commands for the new endpoints
- All 5,256 certificate detail records reused, with 0 HTML failures, 0 PDF failures, and 0 certificate timeouts

Diagram

```mermaid
flowchart LR
    A["Certificate Data"] -->|"Clean algorithm rows"| B["Cleaned Extraction Output"]
    B -->|"Generate"| C["data-quality.json"]
    B -->|"Generate"| D["examples.json"]
    B -->|"Generate"| E["Split Indexes<br/>vendors/algorithms/<br/>statuses/standards"]
    C -->|"Wire into"| F["API Discovery & OpenAPI"]
    D -->|"Wire into"| F
    E -->|"Wire into"| F
    F -->|"Update"| G["README & Docs"]
```
File Changes

1. README.md
Code Review by Qodo

1. Cached provenance hides misses
```diff
 "status": "cached",
 "configured_source": "crawl4ai",
 "source": "none",
 "source_url": null,
-"cached": false,
-"fallback_used": true,
+"cached": true,
+"fallback_used": false,
 "cache_version": "2026-04-15-legacy-v1",
 "algorithm_count": 0,
-"detailed_algorithm_count": 0,
-"attempts": [
-  {
-    "source": "crawl4ai",
-    "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-    "status": "no_algorithms"
-  },
-  {
-    "source": "security_policy_pdf",
-    "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-    "status": "no_algorithms"
-  }
-]
+"detailed_algorithm_count": 0
```
1. Cached provenance hides misses · 🐞 Bug · Correctness
Some regenerated certificate detail records change algorithm_extraction.status from "miss" to "cached" and reset fallback_used/attempts even though algorithm_count remains 0. This prevents build_data_quality_report() from surfacing these certificates as misses/fallback usage and yields misleading api/data-quality.json metrics (e.g., summary.misses: 0).
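The undercount follows directly from how the report is keyed. A minimal sketch of the counting logic described above, using the field names from the JSON shown in the diff (the function name is hypothetical, not the actual `build_data_quality_report()`):

```python
def count_misses_and_fallbacks(extractions):
    """Tally misses and fallback usage from algorithm_extraction records.

    Mirrors the reporting rules described in the review: a certificate
    counts as a miss only when status == "miss", and as fallback usage
    only when fallback_used is true or more than one attempt was made.
    """
    misses = 0
    fallbacks = 0
    for extraction in extractions:
        if extraction.get("status") == "miss":
            misses += 1
        if extraction.get("fallback_used") or len(extraction.get("attempts", [])) > 1:
            fallbacks += 1
    return {"misses": misses, "fallback_used": fallbacks}

# A regenerated record rewritten as status="cached" with fallback_used=False
# and no attempts is invisible to both counters, even though its
# algorithm_count is still 0.
rewritten = {"status": "cached", "cached": True, "fallback_used": False,
             "algorithm_count": 0}
```

Feeding the rewritten record through this logic yields `{"misses": 0, "fallback_used": 0}`, which is exactly the misleading `summary.misses: 0` the comment describes.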
Agent Prompt
## Issue description
When certificate details reuse cached algorithm extraction results, the regenerated `algorithm_extraction` provenance is rebuilt with `status: "cached"` and default `fallback_used: false` and no `attempts`, even when the cached/previous provenance indicated a miss and/or fallback. This causes `api/data-quality.json` to underreport misses/fallbacks because it keys off `status == "miss"` and `fallback_used/attempts`.
## Issue Context
- In the cached-reuse path, `build_algorithm_extraction_provenance(..., "cached", ..., cached=True)` does not carry forward the previous `status`, `fallback_used`, or `attempts`.
- Data-quality reporting counts misses only when `algorithm_extraction.status == "miss"` and counts fallback usage when `fallback_used` is true or attempts length > 1.
## Fix Focus Areas
- scraper.py[1748-1770]
- scraper.py[488-501]
- scraper.py[2737-2756]
- scraper.py[1883-1892]
## Implementation guidance
1. In the `trusted_algorithm_reuse` branch (cached reuse), derive provenance fields from the previous `algorithm_extraction` object when present:
- Keep `cached: true` to indicate reuse.
- Preserve `fallback_used` from previous extraction (or infer from previous attempts length > 1).
- Preserve `attempts` (at least for detail payloads) so fallback/miss reasons remain observable.
- Set `status` to the previous status when valid; otherwise set to `"cached"` only when categories/detailed are non-empty, else `"miss"` (similar to the timeout fallback logic).
2. Consider extending `cached_algorithm_extraction_source()` (or adding a new helper) to return the previous extraction object (or its fallback/attempts) so the reuse path can copy it reliably.
3. Ensure regenerated artifacts still satisfy `validate_api.py` and JSON schema requirements (attempts remain optional).
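The guidance above could be sketched as a small merge helper. This is illustrative only: the function name and exact field handling are assumptions, not the actual scraper.py implementation, but it shows the intended carry-forward of `status`, `fallback_used`, and `attempts`:

```python
def merge_cached_provenance(previous, categories, detailed):
    """Rebuild provenance for the cached-reuse path without losing history.

    previous:   the prior algorithm_extraction dict (may be None/empty)
    categories: extracted algorithm categories for this certificate
    detailed:   detailed algorithm rows for this certificate
    """
    prev = previous or {}
    attempts = prev.get("attempts", [])
    # Preserve fallback_used, or infer it from more than one attempt.
    fallback_used = bool(prev.get("fallback_used")) or len(attempts) > 1

    prev_status = prev.get("status")
    if prev_status in ("hit", "miss", "cached"):
        status = prev_status  # keep the previous status when valid
    elif categories or detailed:
        status = "cached"     # non-empty reuse without a usable prior status
    else:
        status = "miss"       # empty payload: report it as a miss

    provenance = {
        "status": status,
        "cached": True,  # still marks this as a cache reuse
        "fallback_used": fallback_used,
    }
    if attempts:
        provenance["attempts"] = attempts  # attempts stay optional
    return provenance
```

With this shape, a cached reuse of a prior miss keeps `status: "miss"` and its attempts, so the data-quality counters remain accurate while `cached: true` still records the reuse.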
What changed

- api/data-quality.json with misses, refreshed records, fallback usage, changed certificates, cache reuse checks, and the next scheduled weekly run.
- api/examples.json plus split search indexes under api/indexes/ for vendor, algorithm, status, and standard.

Validation
- .venv-codex/bin/python -m py_compile scraper.py validate_api.py test_scraper.py
- .venv-codex/bin/python test_scraper.py
- .venv-codex/bin/python validate_api.py --require-current-schema --forbid-firecrawl-run-source
- git diff --check

Notes
The regenerated run reused all 5,256 certificate detail records and all cached algorithm payloads, with 0 HTML failures, 0 PDF failures, and 0 certificate timeouts. Certificate 5267 was spot-checked after regeneration: its detailed algorithm rows no longer include the policy title, copyright footer, reproduction notice, or standalone `Cert` fragment.
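The row cleaning spot-checked above might look like the following sketch: filter extracted detail rows against a small set of boilerplate patterns. The patterns here are illustrative assumptions based on the artifacts named in this PR (policy title, copyright footer, reproduction notice, standalone table header), not the actual rules in scraper.py:

```python
import re

# Hypothetical patterns for boilerplate that leaks from security-policy
# PDFs into extracted algorithm rows.
BOILERPLATE_PATTERNS = [
    re.compile(r"non-proprietary security policy", re.IGNORECASE),  # policy title
    re.compile(r"copyright|©", re.IGNORECASE),                      # copyright footer
    re.compile(r"may be reproduced", re.IGNORECASE),                # reproduction notice
    re.compile(r"^\s*cert\.?\s*#?\s*$", re.IGNORECASE),             # standalone "Cert" header cell
]

def clean_algorithm_rows(rows):
    """Drop rows matching any boilerplate pattern, keep real algorithm rows."""
    return [row for row in rows
            if not any(p.search(row) for p in BOILERPLATE_PATTERNS)]
```

Note the last pattern is anchored so that a bare `Cert` header cell is removed while a real row like `AES-CBC (Cert. #1234)` survives.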