
[codex] Bound certificate processing time #20

Merged

ethanolivertroy merged 1 commit into main from codex/bound-certificate-processing on May 14, 2026

Conversation

@ethanolivertroy (Member)

Summary

Adds a configurable CERT_PROCESS_TIMEOUT for per-certificate processing and records certificate_timeouts in extraction metrics. If an individual certificate times out, the scraper preserves previously cached detail and algorithm data when available and records a timeout attempt in provenance.

The certificate artifact builder now collects tasks with asyncio.as_completed, so progress reporting is based on completed work instead of input order.
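For illustration, a minimal sketch of the asyncio.as_completed pattern, with a hypothetical process_one() coroutine standing in for the real per-certificate worker:

import asyncio

async def process_one(index: int) -> int:
    await asyncio.sleep(0.01 * index)  # stand-in for per-certificate work
    return index

async def run_all(total: int) -> None:
    tasks = [asyncio.create_task(process_one(i)) for i in range(total)]
    done = 0
    # as_completed yields awaitables in completion order, so the progress
    # counter tracks finished work rather than input order.
    for fut in asyncio.as_completed(tasks):
        index = await fut
        done += 1
        print(f"processed {done}/{total} (certificate {index})")

asyncio.run(run_all(5))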

Validation

  • venv/bin/python -m py_compile scraper.py test_scraper.py validate_api.py
  • venv/bin/python test_scraper.py
  • venv/bin/python validate_api.py
  • git diff --check

@ethanolivertroy marked this pull request as ready for review May 14, 2026 06:36
@ethanolivertroy merged commit 4cb5c28 into main May 14, 2026
1 check passed
@ethanolivertroy deleted the codex/bound-certificate-processing branch May 14, 2026 06:36
@qodo-code-review

Review Summary by Qodo

Add configurable certificate processing timeout with fallback to cached data

✨ Enhancement


Walkthroughs

Description
• Add a configurable per-certificate processing timeout via the CERT_PROCESS_TIMEOUT environment variable (default 900s)
• Implement timeout handling that preserves cached detail and algorithm data when available
• Record a certificate_timeouts metric in extraction statistics
• Switch the certificate artifact builder to asyncio.as_completed() so progress is based on completed work
Diagram
flowchart LR
  A["Certificate Processing"] -->|"timeout"| B["Timeout Handler"]
  B -->|"has cached data"| C["Preserve Cached Detail & Algorithms"]
  B -->|"no cached data"| D["Record Timeout Attempt"]
  C --> E["Return with Metrics"]
  D --> E
  F["asyncio.as_completed"] -->|"track progress"| G["Completion-based Reporting"]


File Changes

1. scraper.py ✨ Enhancement +155/-3

Implement certificate processing timeout with cached data fallback

• Add CERT_PROCESS_TIMEOUT environment variable configuration (default 900 seconds)
• Implement build_certificate_timeout_result() to handle the timeout fallback with cached data preservation
• Add a process_certificate_record_with_timeout() wrapper that applies the timeout and handles asyncio.TimeoutError
• Update build_certificate_artifacts() to use asyncio.as_completed() for progress tracking and to call the timeout wrapper
• Add certificate_timeouts to the extraction metrics schema and include the timeout in concurrency metrics
• Update the module docstring to document the new CERT_PROCESS_TIMEOUT environment variable
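The timeout plumbing can be sketched roughly as follows; this is an approximation of the behavior listed above, not the repo's exact code (per the diff below, the real process_certificate_record() takes many more arguments):

import asyncio
import os

# Configurable per-certificate budget, defaulting to 900 seconds.
CERT_PROCESS_TIMEOUT = float(os.environ.get("CERT_PROCESS_TIMEOUT", "900"))

async def process_with_timeout(coro, build_fallback):
    try:
        # wait_for cancels the underlying coroutine when the budget expires.
        return await asyncio.wait_for(coro, timeout=CERT_PROCESS_TIMEOUT)
    except asyncio.TimeoutError:
        # Fall back to cached data instead of failing the whole run.
        return build_fallback()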



2. test_scraper.py 🧪 Tests +83/-1

Add timeout handling tests and metric validation

• Add an import of scraper_module to enable runtime patching in tests
• Update test_algorithm_extraction_provenance_and_metrics() to verify the certificate_timeouts metric
• Add a new test, test_process_certificate_record_timeout_preserves_cached_data(), that validates timeout handling preserves cached algorithm and detail data
• Verify the timeout wrapper returns the correct index and records a timeout attempt in provenance
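The shape of the new timeout test can be pictured roughly like this; slow_record is a hypothetical stand-in for a patched process_certificate_record that never finishes within the budget:

import asyncio
import unittest

async def slow_record() -> str:
    await asyncio.sleep(10)  # far longer than the test's timeout
    return "real result"

async def with_timeout(timeout: float) -> str:
    try:
        return await asyncio.wait_for(slow_record(), timeout=timeout)
    except asyncio.TimeoutError:
        return "cached fallback"

class TimeoutFallbackTest(unittest.TestCase):
    def test_timeout_returns_fallback(self) -> None:
        self.assertEqual(asyncio.run(with_timeout(0.01)), "cached fallback")

if __name__ == "__main__":
    unittest.main()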



3. README.md 📝 Documentation +1/-0

Document certificate processing timeout configuration

• Document the new CERT_PROCESS_TIMEOUT environment variable in the configuration table, with its default value of 900 seconds




@qodo-code-review

qodo-code-review Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)



Action required

1. Cached algorithms dropped 🐞 Bug ≡ Correctness
Description
build_certificate_timeout_result() strips algorithm fields and returns empty categories when
previous_detail is missing, even if previous_module contains cached algorithms. This causes
timed-out certificates to lose previously extracted algorithm data loaded from modules.json.
Code

scraper.py[R1763-1830]

+    categories, detailed = cached_algorithm_fields(previous_module, previous_detail)
+    attempt = {
+        "source": algorithm_source,
+        "url": str(source_url or ""),
+        "status": "timeout",
+    }
+
+    detail_payload: Optional[Dict] = None
+    if cert_number is not None and previous_detail:
+        detail_payload = prepare_reused_detail_payload(
+            previous_detail,
+            module,
+            cert_number,
+            dataset,
+            generated_at,
+        )
+        stats["html_reused"] += 1
+        for key in MODULE_DETAIL_FIELDS:
+            value = detail_payload.get(key)
+            if value not in (None, [], "", {}):
+                module_out[key] = value
+        module_out["security_policy_url"] = detail_payload.get("security_policy_url") or module_out.get("security_policy_url")
+        cached_source, cached_source_url = cached_algorithm_extraction_source(
+            previous_module,
+            previous_detail,
+            previous_metadata,
+        )
+        provenance = build_algorithm_extraction_provenance(
+            algorithm_source,
+            "cached" if categories or detailed else "miss",
+            cached_source if categories or detailed else "timeout",
+            cached_source_url or source_url,
+            categories,
+            detailed,
+            cached=bool(categories or detailed),
+            attempts=[attempt],
+        )
+        if categories or detailed:
+            stats["pdf_reused"] += 1
+            stats["algorithm_cache_hits"] += 1
+            stats["algorithm_successes"] += 1
+        else:
+            stats["algorithm_misses"] += 1
+        apply_algorithm_fields(detail_payload, categories, detailed)
+        apply_algorithm_extraction_provenance(detail_payload, provenance, include_attempts=True)
+        apply_algorithm_fields(module_out, categories, detailed)
+        apply_algorithm_extraction_provenance(module_out, provenance)
+        module_out["detail_available"] = True
+        return module_out, detail_payload, categories, stats
+
+    stats["html_failed"] += 1
+    if algorithm_source in CACHEABLE_ALGORITHM_SOURCES:
+        stats["pdf_failed"] += 1
+    if algorithm_source != "none":
+        stats["algorithm_misses"] += 1
+    strip_algorithm_fields(module_out)
+    provenance = build_algorithm_extraction_provenance(
+        algorithm_source,
+        "miss",
+        "timeout",
+        source_url,
+        [],
+        [],
+        attempts=[attempt],
+    )
+    apply_algorithm_extraction_provenance(module_out, provenance)
+    module_out["detail_available"] = False
+    return module_out, None, [], stats
Evidence
The repo can have cached module rows without cached detail payloads because they are loaded from
different locations. The timeout fallback computes cached algorithms from
previous_module/previous_detail but discards them when previous_detail is missing, stripping
algorithm fields and returning an empty list.

scraper.py[830-860]
scraper.py[1742-1830]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
In `build_certificate_timeout_result()`, cached algorithm fields are computed from `(previous_module, previous_detail)` but are only applied/returned when `previous_detail` exists. If the detail JSON is missing but the cached module row contains algorithms, the timeout fallback currently calls `strip_algorithm_fields(module_out)` and returns `[]`, discarding available cached algorithms.

## Issue Context
`load_previous_outputs()` loads cached module rows from `api/modules.json` and certificate detail payloads separately from `DETAIL_DIR`. It is therefore valid for `previous_module` to exist while `previous_detail` is absent; the timeout fallback should still preserve cached algorithm fields from `previous_module` in that case.

## Fix Focus Areas
- scraper.py[830-860]
- scraper.py[1742-1830]

### Suggested implementation direction
- On the `previous_detail is None` timeout path, if `categories`/`detailed` from `cached_algorithm_fields(previous_module, None)` are non-empty:
 - apply them to `module_out` via `apply_algorithm_fields(...)`
 - build provenance with a status like `cached` (or a clear timeout-specific cached status) and include the timeout attempt
 - increment the same cache-hit counters used in the `previous_detail` branch (`pdf_reused`, `algorithm_cache_hits`, `algorithm_successes`) as appropriate
 - return those categories instead of `[]`
- Keep `detail_available` as `False` and `detail_payload` as `None` when there is no cached detail payload file.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
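A minimal sketch of that direction for the previous_detail-is-missing path, reusing the helper names quoted in the diff above (their exact signatures are inferred from the excerpt and may differ in the repo):

categories, detailed = cached_algorithm_fields(previous_module, None)
if categories or detailed:
    # The cached module row still carries algorithms: keep them on timeout.
    apply_algorithm_fields(module_out, categories, detailed)
    provenance = build_algorithm_extraction_provenance(
        algorithm_source, "cached", "timeout", source_url,
        categories, detailed, cached=True, attempts=[attempt],
    )
    stats["pdf_reused"] += 1
    stats["algorithm_cache_hits"] += 1
    stats["algorithm_successes"] += 1
else:
    strip_algorithm_fields(module_out)
    stats["algorithm_misses"] += 1
    provenance = build_algorithm_extraction_provenance(
        algorithm_source, "miss", "timeout", source_url,
        [], [], attempts=[attempt],
    )
apply_algorithm_extraction_provenance(module_out, provenance)
module_out["detail_available"] = False  # no cached detail payload on disk
return module_out, None, categories, stats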



Remediation recommended

2. Orphaned PDF fetch tasks 🐞 Bug ☼ Reliability
Description
When process_certificate_record_with_timeout() times out, it cancels the certificate coroutine but
any PDF download task created in fetch_policy_pdf_bytes() can keep running in the background. This
can exceed the intended PDF_FETCH_CONCURRENCY (semaphore is released on cancellation while the
background task continues) and can also lead to unhandled task exceptions/log spam if no other
waiter ever awaits that cached task.
Code

scraper.py[R1851-1868]

+        module_out, detail_payload, categories, stats = await asyncio.wait_for(
+            process_certificate_record(
+                module,
+                dataset,
+                generated_at,
+                algorithm_source,
+                previous_module,
+                previous_detail,
+                previous_metadata,
+                client,
+                cert_semaphore,
+                pdf_semaphore,
+                pdf_cache,
+                pdf_cache_lock,
+                database_algorithms_map,
+            ),
+            timeout=CERT_PROCESS_TIMEOUT,
+        )
Evidence
The timeout wrapper introduces cancellation (asyncio.wait_for) around the entire certificate
coroutine. PDF downloads are started as background tasks in fetch_policy_pdf_bytes() and awaited
under pdf_semaphore in fetch_certificate_algorithms(), so cancellation releases the semaphore
while leaving the background task running and potentially never awaited again.

scraper.py[1833-1869]
scraper.py[1333-1348]
scraper.py[1483-1490]
scraper.py[1889-1961]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`process_certificate_record_with_timeout()` uses `asyncio.wait_for()` to time out and cancel per-certificate work. During algorithm extraction, `fetch_policy_pdf_bytes()` spawns a background task (`asyncio.create_task(...)`) to fetch a PDF and stores it in `pdf_cache`. If the certificate coroutine is cancelled due to timeout while awaiting that task, the `pdf_semaphore` context in `fetch_certificate_algorithms()` exits (releasing the semaphore) but the background PDF-fetch task continues running without semaphore backpressure and may never be awaited again (leading to noisy “Task exception was never retrieved” and wasted I/O).

## Issue Context
- The semaphore limiting happens in the *caller* (`fetch_certificate_algorithms`) not inside the cached task itself.
- Timeouts introduce cancellation in the middle of this flow, which is what makes the cached-task behavior problematic.

## Fix Focus Areas
- scraper.py[1833-1887]
- scraper.py[1333-1348]
- scraper.py[1426-1490]
- scraper.py[1889-1961]

### Suggested implementation direction
- Ensure PDF-fetch tasks are either:
 - cancelled/awaited (with `return_exceptions=True`) when no longer needed (e.g., at the end of `build_certificate_artifacts()`), **and/or**
 - created in a way that preserves concurrency limits even if the waiter is cancelled (e.g., the cached task itself acquires `pdf_semaphore`, and the outer semaphore acquisition is removed), **and/or**
 - has a `done_callback` that consumes exceptions and evicts failed tasks from `pdf_cache` so failures don’t become “never retrieved” warnings.
- Consider evicting completed tasks from `pdf_cache` to prevent growth across long runs.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
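One way to combine the second and third options, sketched with assumed shapes (pdf_cache as a plain dict of tasks, an httpx-style client); the repo's actual structures may differ:

import asyncio

async def _fetch_with_limit(url, client, pdf_semaphore):
    # The cached task itself holds the semaphore, so the concurrency limit
    # survives even if every waiter is cancelled by a timeout.
    async with pdf_semaphore:
        response = await client.get(url)
        return response.content

def get_pdf_task(url, client, pdf_cache, pdf_semaphore):
    task = pdf_cache.get(url)
    if task is None:
        task = asyncio.create_task(_fetch_with_limit(url, client, pdf_semaphore))

        def _reap(finished, url=url):
            # Consume the exception so it is never "never retrieved", and
            # evict failed fetches so a later record can retry the URL.
            if not finished.cancelled() and finished.exception() is not None:
                pdf_cache.pop(url, None)

        task.add_done_callback(_reap)
        pdf_cache[url] = task
    return task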




@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 937eb8ab2c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scraper.py
Comment on lines +1763 to +1764
categories, detailed = cached_algorithm_fields(previous_module, previous_detail)
attempt = {


P1: Honor ALGORITHM_SOURCE=none in timeout fallback

When a certificate times out and cached detail exists, this branch always pulls cached algorithms and later reapplies them, regardless of the configured algorithm_source. In ALGORITHM_SOURCE=none mode, a timeout will therefore repopulate algorithms/algorithms_detailed and mark extraction as cached instead of skipped, which contradicts the explicit skip setting and yields inconsistent API payloads/metrics for timed-out records.
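A minimal guard for this, assuming the variable names from the quoted diff:

# Skip cached algorithm reuse entirely when extraction is explicitly disabled,
# so timed-out records stay consistent with ALGORITHM_SOURCE=none.
if algorithm_source == "none":
    categories, detailed = [], []
else:
    categories, detailed = cached_algorithm_fields(previous_module, previous_detail)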


Comment thread scraper.py
Comment on lines +1851 to +1855
module_out, detail_payload, categories, stats = await asyncio.wait_for(
process_certificate_record(
module,
dataset,
generated_at,


P2: Handle cancellation from timed-out certificate tasks

Using asyncio.wait_for here cancels process_certificate_record on timeout; if that coroutine is awaiting a shared PDF-cache task, cancellation propagates to the shared task. Other certificate workers awaiting the same cached task can then receive CancelledError (not TimeoutError), and this wrapper does not catch it, so build_certificate_artifacts can fail instead of emitting the timeout fallback. This is reproducible when multiple records share a policy URL and one reaches the timeout boundary first.
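One possible mitigation, sketched under the assumption that pdf_cache maps URLs to asyncio.Task objects as in the quoted diff:

import asyncio

async def await_cached_pdf(url: str, pdf_cache: dict) -> bytes:
    # shield: cancelling this await (e.g. via a wait_for timeout) only
    # cancels the wrapper, not the shared task other workers still await.
    return await asyncio.shield(pdf_cache[url])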



