[codex] Add data quality and search artifacts #24

Merged
ethanolivertroy merged 1 commit into main from codex/data-quality-and-search-artifacts
May 14, 2026

@ethanolivertroy
Member

What changed

  • Cleans detailed algorithm rows by trimming PDF policy headers, copyright notices, reproduction boilerplate, and standalone table headers from fresh and cached extraction output.
  • Adds api/data-quality.json with misses, refreshed records, fallback usage, changed certificates, cache reuse checks, and the next scheduled weekly run.
  • Adds consumer examples in api/examples.json plus split search indexes under api/indexes/ for vendor, algorithm, status, and standard.
  • Wires the new artifacts into API discovery, OpenAPI, Markdown docs, homepage links, JSON Schemas, README, and strict validation.
  • Regenerates the static API artifacts from the current cached dataset.
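The row cleanup described in the first bullet could look roughly like the sketch below. The patterns, the `clean_algorithm_rows` and `is_boilerplate` names, and the plain-string row shape are assumptions for illustration only; the actual rules live in scraper.py and are not shown in this description.

```python
import re

# Hypothetical patterns for non-algorithm lines that leak in from PDF extraction.
BOILERPLATE_PATTERNS = [
    re.compile(r"non-proprietary security policy", re.IGNORECASE),  # policy title or header
    re.compile(r"copyright|©", re.IGNORECASE),                      # copyright footer
    re.compile(r"may be (freely )?reproduced", re.IGNORECASE),      # reproduction notice
]

# Hypothetical standalone table-header fragments (the whole row is only a header cell).
STANDALONE_HEADERS = {"cert", "cert #", "algorithm", "standard"}


def is_boilerplate(row: str) -> bool:
    """Return True when a row is empty, a bare table header, or matches a boilerplate pattern."""
    text = row.strip()
    if not text:
        return True
    if text.lower() in STANDALONE_HEADERS:
        return True
    return any(pattern.search(text) for pattern in BOILERPLATE_PATTERNS)


def clean_algorithm_rows(rows: list[str]) -> list[str]:
    """Drop boilerplate rows and strip surrounding whitespace from the rest."""
    return [row.strip() for row in rows if not is_boilerplate(row)]
```

Applying the same function to both fresh and cached extraction output keeps the two paths consistent, which matches the intent stated in the bullet.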

Validation

  • .venv-codex/bin/python -m py_compile scraper.py validate_api.py test_scraper.py
  • .venv-codex/bin/python test_scraper.py
  • .venv-codex/bin/python validate_api.py --require-current-schema --forbid-firecrawl-run-source
  • git diff --check

Notes

The regenerated run reused all 5,256 certificate detail records and all cached algorithm payloads, with 0 HTML failures, 0 PDF failures, and 0 certificate timeouts. Certificate 5267 was spot-checked after regeneration: its detailed algorithm rows no longer include the policy title, copyright footer, reproduction notice, or standalone Cert fragment.

@ethanolivertroy ethanolivertroy marked this pull request as ready for review May 14, 2026 13:34
@ethanolivertroy ethanolivertroy merged commit e028487 into main May 14, 2026
1 check passed
@ethanolivertroy ethanolivertroy deleted the codex/data-quality-and-search-artifacts branch May 14, 2026 13:35
@qodo-code-review

Review Summary by Qodo

Add data quality report, consumer examples, and split search indexes with algorithm row cleaning

✨ Enhancement 📝 Documentation


Walkthroughs

Description
• Adds three new API artifacts: data-quality.json (quality metrics and cache statistics),
  examples.json (consumer examples), and split search indexes under api/indexes/ (vendors,
  algorithms, statuses, standards)
• Cleans detailed algorithm rows by trimming PDF policy headers, copyright notices, reproduction
  boilerplate, and standalone table headers from extraction output
• Integrates new artifacts into API discovery, OpenAPI specifications, Markdown documentation,
  homepage links, JSON schemas, and strict validation
• Regenerates all certificate data with updated timestamps and consistent field ordering
  (algorithm_extraction before algorithms)
• Updates README with feature descriptions, endpoint documentation, and curl example commands for
  new endpoints
• All 5,256 certificate detail records were reused, with 0 HTML failures, 0 PDF failures, and 0
  certificate timeouts
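As a rough illustration of the split indexes named in the first bullet, the sketch below groups certificate numbers by a single field. The record shape (`certificate_number`, `vendor`, `algorithms`) and the `build_index` helper are hypothetical; the actual layout of the files under api/indexes/ is produced by the scraper and may differ.

```python
from collections import defaultdict


def build_index(records: list[dict], field: str) -> dict[str, list[int]]:
    """Map each distinct value of `field` to the sorted certificate numbers that carry it."""
    index: dict[str, set[int]] = defaultdict(set)
    for record in records:
        value = record.get(field)
        # A field may hold one value or a list of values (e.g. several algorithms).
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item:  # skip missing or empty values
                index[str(item)].add(record["certificate_number"])
    # Sort keys and numbers so regenerated output is stable between runs.
    return {key: sorted(numbers) for key, numbers in sorted(index.items())}
```

Writing one such mapping per field (vendor, algorithm, status, standard) lets a consumer fetch only the index it needs instead of the full dataset.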
Diagram
flowchart LR
  A["Certificate Data"] -->|"Clean algorithm rows"| B["Cleaned Extraction Output"]
  B -->|"Generate"| C["data-quality.json"]
  B -->|"Generate"| D["examples.json"]
  B -->|"Generate"| E["Split Indexes<br/>vendors/algorithms/<br/>statuses/standards"]
  C -->|"Wire into"| F["API Discovery & OpenAPI"]
  D -->|"Wire into"| F
  E -->|"Wire into"| F
  F -->|"Update"| G["README & Docs"]


File Changes

1. README.md 📝 Documentation +24/-0

Documentation updates for new API artifacts and endpoints

• Added three new feature descriptions: Search Indexes, Data Quality Report, and Consumer Examples
• Added new endpoint documentation for data-quality.json, examples.json, and split search
 indexes (indexes/vendors.json, indexes/algorithms.json, indexes/statuses.json,
 indexes/standards.json)
• Added reference link to api/examples.json in the "For Agents" section
• Added four new curl example commands demonstrating split index lookups, data quality checks, and
 consumer examples

README.md


2. api/certificates/2988.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2988.json
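
One way to confirm that a regenerated certificate file differs only in `generated_at` and key order is to compare the parsed JSON with the timestamp removed. This is a sketch rather than the project's own validation: `same_apart_from_timestamp` is a hypothetical helper, and it assumes `generated_at` is a top-level key.

```python
import json


def same_apart_from_timestamp(old_text: str, new_text: str) -> bool:
    """True when two JSON documents match once the top-level generated_at is ignored.

    Python dict equality ignores key order, so moving algorithm_extraction
    ahead of algorithms does not count as a difference.
    """
    old, new = json.loads(old_text), json.loads(new_text)
    old.pop("generated_at", None)
    new.pop("generated_at", None)
    return old == new
```

Running a check like this over each regenerated file would back up the "no functional changes" claim repeated in the entries that follow.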


3. api/certificates/1996.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/1996.json


4. api/certificates/2087.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2087.json


5. api/certificates/2094.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2094.json


6. api/certificates/2145.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2145.json


7. api/certificates/2146.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2146.json


8. api/certificates/2152.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2152.json


9. api/certificates/2160.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2160.json


10. api/certificates/2241.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2241.json


11. api/certificates/2242.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2242.json


12. api/certificates/2244.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2244.json


13. api/certificates/2265.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2265.json


14. api/certificates/2268.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2268.json


15. api/certificates/2269.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2269.json


16. api/certificates/2286.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2286.json


17. api/certificates/2409.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2409.json


18. api/certificates/2496.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2496.json


19. api/certificates/2505.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2505.json


20. api/certificates/2511.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2511.json


21. api/certificates/2516.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2516.json


22. api/certificates/2580.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2580.json


23. api/certificates/2581.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2581.json


24. api/certificates/2618.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2618.json


25. api/certificates/2648.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2648.json


26. api/certificates/2650.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2650.json


27. api/certificates/2675.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2675.json


28. api/certificates/2682.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2682.json


29. api/certificates/2683.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2683.json


30. api/certificates/2684.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2684.json


31. api/certificates/2704.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2704.json


32. api/certificates/2748.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2748.json


33. api/certificates/2770.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2770.json


34. api/certificates/2771.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2771.json


35. api/certificates/2818.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2818.json


36. api/certificates/2820.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2820.json


37. api/certificates/2856.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2856.json


38. api/certificates/2898.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2898.json


39. api/certificates/2904.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2904.json


40. api/certificates/2919.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2919.json


41. api/certificates/2921.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2921.json


42. api/certificates/2923.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2923.json


43. api/certificates/2926.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2926.json


44. api/certificates/2979.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2979.json


45. api/certificates/2984.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/2984.json


46. api/certificates/3000.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/3000.json


47. api/certificates/3025.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/3025.json


48. api/certificates/3026.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/3026.json


49. api/certificates/3027.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/3027.json


50. api/certificates/3029.json Miscellaneous +15/-15

Regenerated certificate with updated timestamp and field reordering

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent field ordering
• No functional changes to extraction metadata or algorithm data

api/certificates/3029.json


51. api/certificates/3037.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3037.json


52. api/certificates/3038.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3038.json


53. api/certificates/3080.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3080.json


54. api/certificates/3100.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3100.json


55. api/certificates/3101.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3101.json


56. api/certificates/3105.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3105.json


57. api/certificates/3109.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3109.json


58. api/certificates/3113.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3113.json


59. api/certificates/3119.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3119.json


60. api/certificates/3120.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3120.json


61. api/certificates/3121.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3121.json


62. api/certificates/3126.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3126.json


63. api/certificates/3127.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3127.json


64. api/certificates/3132.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3132.json


65. api/certificates/3136.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3136.json


66. api/certificates/3137.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3137.json


67. api/certificates/3138.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3138.json


68. api/certificates/3140.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3140.json


69. api/certificates/3174.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3174.json


70. api/certificates/3182.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3182.json


71. api/certificates/3188.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3188.json


72. api/certificates/3189.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3189.json


73. api/certificates/3200.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3200.json


74. api/certificates/3205.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3205.json


75. api/certificates/3225.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3225.json


76. api/certificates/3229.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3229.json


77. api/certificates/3232.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3232.json


78. api/certificates/3236.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3236.json


79. api/certificates/3246.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3246.json


80. api/certificates/3247.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3247.json


81. api/certificates/3249.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3249.json


82. api/certificates/3266.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3266.json


83. api/certificates/3286.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3286.json


84. api/certificates/3287.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3287.json


85. api/certificates/3306.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3306.json


86. api/certificates/3315.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z
• Moved algorithm_extraction object before algorithms array for consistent JSON structure
• No changes to actual data content, only reordering and timestamp update

api/certificates/3315.json


87. api/certificates/3319.json Miscellaneous +15/-15

Regenerated certificate data with timestamp update

• Updated generated_at timestamp from 2026-05-14T09:06:04.998714Z to
 2026-05-14T13:31:01.314185Z


qodo-code-review Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)



Action required

1. Cached provenance hides misses 🐞 Bug ≡ Correctness
Description
Some regenerated certificate detail records change algorithm_extraction.status from "miss" to
"cached" and reset fallback_used/attempts even though algorithm_count remains 0. This prevents
build_data_quality_report() from surfacing these certificates as misses/fallback usage and yields
misleading api/data-quality.json metrics (e.g., summary.misses: 0).
Code

api/certificates/116.json[R71-79]

+      "status": "cached",
      "configured_source": "crawl4ai",
      "source": "none",
      "source_url": null,
-      "cached": false,
-      "fallback_used": true,
+      "cached": true,
+      "fallback_used": false,
      "cache_version": "2026-04-15-legacy-v1",
      "algorithm_count": 0,
-      "detailed_algorithm_count": 0,
-      "attempts": [
-        {
-          "source": "crawl4ai",
-          "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-          "status": "no_algorithms"
-        },
-        {
-          "source": "security_policy_pdf",
-          "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-          "status": "no_algorithms"
-        }
-      ]
+      "detailed_algorithm_count": 0
Evidence
The PR diff shows certificate 116’s algorithm_extraction switched from a miss with fallback and
attempts to a cached result with fallback disabled and attempts removed; the scraper’s data-quality
logic only reports misses when status == "miss" and only reports fallback usage when
fallback_used or multiple attempts exist, so these regenerated records will no longer be reported.
The generated api/data-quality.json in the PR branch reflects this by reporting `summary.misses: 0`
even though many certificates have no extracted algorithms.

pr_files_diffs/api_certificates_116_json.patch[25-56]
scraper.py[1748-1770]
scraper.py[2737-2756]
scraper.py[1883-1892]
api/data-quality.json[2-82]
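
The counting rule described in the evidence above can be sketched as follows. This is a minimal illustration, not the code in scraper.py: the function name `summarize_quality` and the output key `fallback_usage` are hypothetical, and only the two conditions named in the review (`status == "miss"`, and `fallback_used` or more than one attempt) are modelled.

```python
from typing import Any


def summarize_quality(records: list[dict[str, Any]]) -> dict[str, int]:
    """Count misses and fallback usage (hypothetical helper, not from scraper.py)."""
    misses = 0
    fallback_usage = 0
    for record in records:
        extraction = record.get("algorithm_extraction") or {}
        # A record counts as a miss only when its status is literally "miss".
        if extraction.get("status") == "miss":
            misses += 1
        # Fallback usage is counted from the flag or from multiple attempts.
        attempts = extraction.get("attempts") or []
        if extraction.get("fallback_used") or len(attempts) > 1:
            fallback_usage += 1
    return {"misses": misses, "fallback_usage": fallback_usage}


# Certificate 116 before regeneration: a miss with two recorded attempts.
before = {
    "algorithm_extraction": {
        "status": "miss",
        "fallback_used": True,
        "algorithm_count": 0,
        "attempts": [
            {"source": "crawl4ai", "status": "no_algorithms"},
            {"source": "security_policy_pdf", "status": "no_algorithms"},
        ],
    }
}

# The same certificate after regeneration: still zero algorithms, now "cached".
after = {
    "algorithm_extraction": {
        "status": "cached",
        "cached": True,
        "fallback_used": False,
        "algorithm_count": 0,
    }
}
```

Under these assumptions the `before` record is counted as one miss and one fallback, while the `after` record contributes to neither, although `algorithm_count` is 0 in both.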

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

## Issue description
When certificate details reuse cached algorithm extraction results, the regenerated `algorithm_extraction` provenance is rebuilt with `status: "cached"` and default `fallback_used: false` and no `attempts`, even when the cached/previous provenance indicated a miss and/or fallback. This causes `api/data-quality.json` to underreport misses/fallbacks because it keys off `status == "miss"` and `fallback_used/attempts`.

## Issue Context
- In the cached-reuse path, `build_algorithm_extraction_provenance(..., "cached", ..., cached=True)` does not carry forward the previous `status`, `fallback_used`, or `attempts`.
- Data-quality reporting counts misses only when `algorithm_extraction.status == "miss"` and counts fallback usage when `fallback_used` is true or attempts length > 1.

## Fix Focus Areas
- scraper.py[1748-1770]
- scraper.py[488-501]
- scraper.py[2737-2756]
- scraper.py[1883-1892]

## Implementation guidance
1. In the `trusted_algorithm_reuse` branch (cached reuse), derive provenance fields from the previous `algorithm_extraction` object when present:
  - Keep `cached: true` to indicate reuse.
  - Preserve `fallback_used` from previous extraction (or infer from previous attempts length > 1).
  - Preserve `attempts` (at least for detail payloads) so fallback/miss reasons remain observable.
  - Set `status` to the previous status when valid; otherwise set to `"cached"` only when categories/detailed are non-empty, else `"miss"` (similar to the timeout fallback logic).
2. Consider extending `cached_algorithm_extraction_source()` (or adding a new helper) to return the previous extraction object (or its fallback/attempts) so the reuse path can copy it reliably.
3. Ensure regenerated artifacts still satisfy `validate_api.py` and JSON schema requirements (attempts remain optional).
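
A rough sketch of step 1, assuming the previous `algorithm_extraction` object is available as a plain dict. The helper name `reuse_provenance` and the `valid_statuses` default are hypothetical; the real status vocabulary and the signature of `build_algorithm_extraction_provenance` are not shown in this review, so this only illustrates the carry-forward logic.

```python
from typing import Any, Optional


def reuse_provenance(
    previous: Optional[dict[str, Any]],
    algorithm_count: int,
    detailed_algorithm_count: int,
    valid_statuses: frozenset = frozenset({"miss", "cached"}),  # assumed set
) -> dict[str, Any]:
    """Build provenance for a cache-reuse run without hiding an earlier miss."""
    previous = previous or {}
    attempts = list(previous.get("attempts") or [])

    # Preserve fallback usage, or infer it from more than one recorded attempt.
    fallback_used = bool(previous.get("fallback_used")) or len(attempts) > 1

    # Keep the previous status when recognised; otherwise "cached" only if
    # something was actually extracted, else "miss".
    previous_status = previous.get("status")
    if previous_status in valid_statuses:
        status = previous_status
    elif algorithm_count or detailed_algorithm_count:
        status = "cached"
    else:
        status = "miss"

    provenance: dict[str, Any] = {
        "status": status,
        "cached": True,  # always marks reuse
        "fallback_used": fallback_used,
        "algorithm_count": algorithm_count,
        "detailed_algorithm_count": detailed_algorithm_count,
    }
    if attempts:
        # attempts stay optional, so the key is only written when present.
        provenance["attempts"] = attempts
    return provenance
```

With this shape, a record such as certificate 116 would keep `status: "miss"`, `fallback_used: true`, and its two attempts while still reporting `cached: true`.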


Comment thread api/certificates/116.json
Comment on lines +71 to +79
+      "status": "cached",
      "configured_source": "crawl4ai",
      "source": "none",
      "source_url": null,
-      "cached": false,
-      "fallback_used": true,
+      "cached": true,
+      "fallback_used": false,
      "cache_version": "2026-04-15-legacy-v1",
      "algorithm_count": 0,
-      "detailed_algorithm_count": 0,
-      "attempts": [
-        {
-          "source": "crawl4ai",
-          "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-          "status": "no_algorithms"
-        },
-        {
-          "source": "security_policy_pdf",
-          "url": "https://csrc.nist.gov/CSRC/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp116.pdf",
-          "status": "no_algorithms"
-        }
-      ]
+      "detailed_algorithm_count": 0
Action required

1. Cached provenance hides misses 🐞 Bug ≡ Correctness

Some regenerated certificate detail records change algorithm_extraction.status from "miss" to
"cached" and reset fallback_used/attempts even though algorithm_count remains 0. This prevents
build_data_quality_report() from surfacing these certificates as misses/fallback usage and yields
misleading api/data-quality.json metrics (e.g., summary.misses: 0).
