
[codex] Bound certificate processing time #20

Merged

ethanolivertroy merged 1 commit into main from codex/bound-certificate-processing on May 14, 2026

Conversation

@ethanolivertroy (Member)

Summary

Adds a configurable CERT_PROCESS_TIMEOUT for per-certificate processing and records certificate_timeouts in extraction metrics. If an individual certificate times out, the scraper preserves previously cached detail and algorithm data when available and records a timeout attempt in provenance.

The certificate artifact builder now collects tasks with asyncio.as_completed, so progress reporting is based on completed work instead of input order.
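For illustration, a minimal sketch of the asyncio.as_completed pattern, with a hypothetical process_one() coroutine standing in for the real per-certificate worker:

import asyncio

async def process_one(index: int) -> int:
    await asyncio.sleep(0.01 * index)  # stand-in for per-certificate work
    return index

async def run_all(total: int) -> None:
    tasks = [asyncio.create_task(process_one(i)) for i in range(total)]
    done = 0
    # as_completed yields awaitables in completion order, so the progress
    # counter tracks finished work rather than input order.
    for fut in asyncio.as_completed(tasks):
        index = await fut
        done += 1
        print(f"processed {done}/{total} (certificate {index})")

asyncio.run(run_all(5))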

Validation

  • venv/bin/python -m py_compile scraper.py test_scraper.py validate_api.py
  • venv/bin/python test_scraper.py
  • venv/bin/python validate_api.py
  • git diff --check

@ethanolivertroy marked this pull request as ready for review May 14, 2026 06:36
@ethanolivertroy merged commit 4cb5c28 into main May 14, 2026
1 check passed
@ethanolivertroy deleted the codex/bound-certificate-processing branch May 14, 2026 06:36
@qodo-code-review

Review Summary by Qodo

Add configurable certificate processing timeout with fallback to cached data

✨ Enhancement


Walkthroughs

Description
• Add a configurable per-certificate processing timeout via the CERT_PROCESS_TIMEOUT environment variable (default 900s)
• Implement timeout handling that preserves cached detail and algorithm data when available
• Record a certificate_timeouts metric in extraction statistics
• Switch the certificate artifact builder to asyncio.as_completed() so progress is based on completed work
Diagram
flowchart LR
  A["Certificate Processing"] -->|"timeout"| B["Timeout Handler"]
  B -->|"has cached data"| C["Preserve Cached Detail & Algorithms"]
  B -->|"no cached data"| D["Record Timeout Attempt"]
  C --> E["Return with Metrics"]
  D --> E
  F["asyncio.as_completed"] -->|"track progress"| G["Completion-based Reporting"]


File Changes

1. scraper.py ✨ Enhancement +155/-3

Implement certificate processing timeout with cached data fallback

• Add CERT_PROCESS_TIMEOUT environment variable configuration (default 900 seconds)
• Implement build_certificate_timeout_result() to handle the timeout fallback with cached data preservation
• Add a process_certificate_record_with_timeout() wrapper that applies the timeout and handles asyncio.TimeoutError
• Update build_certificate_artifacts() to use asyncio.as_completed() for progress tracking and to call the timeout wrapper
• Add certificate_timeouts to the extraction metrics schema and include the timeout in concurrency metrics
• Update the module docstring to document the new CERT_PROCESS_TIMEOUT environment variable
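The timeout plumbing can be sketched roughly as follows; this is an approximation of the behavior listed above, not the repo's exact code (per the diff below, the real process_certificate_record() takes many more arguments):

import asyncio
import os

# Configurable per-certificate budget, defaulting to 900 seconds.
CERT_PROCESS_TIMEOUT = float(os.environ.get("CERT_PROCESS_TIMEOUT", "900"))

async def process_with_timeout(coro, build_fallback):
    try:
        # wait_for cancels the underlying coroutine when the budget expires.
        return await asyncio.wait_for(coro, timeout=CERT_PROCESS_TIMEOUT)
    except asyncio.TimeoutError:
        # Fall back to cached data instead of failing the whole run.
        return build_fallback()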



2. test_scraper.py 🧪 Tests +83/-1

Add timeout handling tests and metric validation

• Add an import of scraper_module to enable runtime patching in tests
• Update test_algorithm_extraction_provenance_and_metrics() to verify the certificate_timeouts metric
• Add a new test, test_process_certificate_record_timeout_preserves_cached_data(), that validates timeout handling preserves cached algorithm and detail data
• Verify the timeout wrapper returns the correct index and records a timeout attempt in provenance
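The shape of the new timeout test can be pictured roughly like this; slow_record is a hypothetical stand-in for a patched process_certificate_record that never finishes within the budget:

import asyncio
import unittest

async def slow_record() -> str:
    await asyncio.sleep(10)  # far longer than the test's timeout
    return "real result"

async def with_timeout(timeout: float) -> str:
    try:
        return await asyncio.wait_for(slow_record(), timeout=timeout)
    except asyncio.TimeoutError:
        return "cached fallback"

class TimeoutFallbackTest(unittest.TestCase):
    def test_timeout_returns_fallback(self) -> None:
        self.assertEqual(asyncio.run(with_timeout(0.01)), "cached fallback")

if __name__ == "__main__":
    unittest.main()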



3. README.md 📝 Documentation +1/-0

Document certificate processing timeout configuration

• Document the new CERT_PROCESS_TIMEOUT environment variable in the configuration table, with its default value of 900 seconds




@qodo-code-review

qodo-code-review Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)



Action required

1. Cached algorithms dropped 🐞 Bug ≡ Correctness
Description
build_certificate_timeout_result() strips algorithm fields and returns empty categories when
previous_detail is missing, even if previous_module contains cached algorithms. This causes
timed-out certificates to lose previously extracted algorithm data loaded from modules.json.
Code

scraper.py[R1763-1830]

+    categories, detailed = cached_algorithm_fields(previous_module, previous_detail)
+    attempt = {
+        "source": algorithm_source,
+        "url": str(source_url or ""),
+        "status": "timeout",
+    }
+
+    detail_payload: Optional[Dict] = None
+    if cert_number is not None and previous_detail:
+        detail_payload = prepare_reused_detail_payload(
+            previous_detail,
+            module,
+            cert_number,
+            dataset,
+            generated_at,
+        )
+        stats["html_reused"] += 1
+        for key in MODULE_DETAIL_FIELDS:
+            value = detail_payload.get(key)
+            if value not in (None, [], "", {}):
+                module_out[key] = value
+        module_out["security_policy_url"] = detail_payload.get("security_policy_url") or module_out.get("security_policy_url")
+        cached_source, cached_source_url = cached_algorithm_extraction_source(
+            previous_module,
+            previous_detail,
+            previous_metadata,
+        )
+        provenance = build_algorithm_extraction_provenance(
+            algorithm_source,
+            "cached" if categories or detailed else "miss",
+            cached_source if categories or detailed else "timeout",
+            cached_source_url or source_url,
+            categories,
+            detailed,
+            cached=bool(categories or detailed),
+            attempts=[attempt],
+        )
+        if categories or detailed:
+            stats["pdf_reused"] += 1
+            stats["algorithm_cache_hits"] += 1
+            stats["algorithm_successes"] += 1
+        else:
+            stats["algorithm_misses"] += 1
+        apply_algorithm_fields(detail_payload, categories, detailed)
+        apply_algorithm_extraction_provenance(detail_payload, provenance, include_attempts=True)
+        apply_algorithm_fields(module_out, categories, detailed)
+        apply_algorithm_extraction_provenance(module_out, provenance)
+        module_out["detail_available"] = True
+        return module_out, detail_payload, categories, stats
+
+    stats["html_failed"] += 1
+    if algorithm_source in CACHEABLE_ALGORITHM_SOURCES:
+        stats["pdf_failed"] += 1
+    if algorithm_source != "none":
+        stats["algorithm_misses"] += 1
+    strip_algorithm_fields(module_out)
+    provenance = build_algorithm_extraction_provenance(
+        algorithm_source,
+        "miss",
+        "timeout",
+        source_url,
+        [],
+        [],
+        attempts=[attempt],
+    )
+    apply_algorithm_extraction_provenance(module_out, provenance)
+    module_out["detail_available"] = False
+    return module_out, None, [], stats
Evidence
The repo can have cached module rows without cached detail payloads because they are loaded from
different locations. The timeout fallback computes cached algorithms from
previous_module/previous_detail but discards them when previous_detail is missing, stripping
algorithm fields and returning an empty list.

scraper.py[830-860]
scraper.py[1742-1830]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
In `build_certificate_timeout_result()`, cached algorithm fields are computed from `(previous_module, previous_detail)` but are only applied/returned when `previous_detail` exists. If the detail JSON is missing but the cached module row contains algorithms, the timeout fallback currently calls `strip_algorithm_fields(module_out)` and returns `[]`, discarding available cached algorithms.

## Issue Context
`load_previous_outputs()` loads cached module rows from `api/modules.json` and certificate detail payloads separately from `DETAIL_DIR`. It is therefore valid for `previous_module` to exist while `previous_detail` is absent; the timeout fallback should still preserve cached algorithm fields from `previous_module` in that case.

## Fix Focus Areas
- scraper.py[830-860]
- scraper.py[1742-1830]

### Suggested implementation direction
- On the `previous_detail is None` timeout path, if `categories`/`detailed` from `cached_algorithm_fields(previous_module, None)` are non-empty:
 - apply them to `module_out` via `apply_algorithm_fields(...)`
 - build provenance with a status like `cached` (or a clear timeout-specific cached status) and include the timeout attempt
 - increment the same cache-hit counters used in the `previous_detail` branch (`pdf_reused`, `algorithm_cache_hits`, `algorithm_successes`) as appropriate
 - return those categories instead of `[]`
- Keep `detail_available` as `False` and `detail_payload` as `None` when there is no cached detail payload file.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
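A minimal sketch of that direction for the previous_detail-is-missing path, reusing the helper names quoted in the diff above (their exact signatures are inferred from the excerpt and may differ in the repo):

categories, detailed = cached_algorithm_fields(previous_module, None)
if categories or detailed:
    # The cached module row still carries algorithms: keep them on timeout.
    apply_algorithm_fields(module_out, categories, detailed)
    provenance = build_algorithm_extraction_provenance(
        algorithm_source, "cached", "timeout", source_url,
        categories, detailed, cached=True, attempts=[attempt],
    )
    stats["pdf_reused"] += 1
    stats["algorithm_cache_hits"] += 1
    stats["algorithm_successes"] += 1
else:
    strip_algorithm_fields(module_out)
    stats["algorithm_misses"] += 1
    provenance = build_algorithm_extraction_provenance(
        algorithm_source, "miss", "timeout", source_url,
        [], [], attempts=[attempt],
    )
apply_algorithm_extraction_provenance(module_out, provenance)
module_out["detail_available"] = False  # no cached detail payload on disk
return module_out, None, categories, stats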



Remediation recommended

2. Orphaned PDF fetch tasks 🐞 Bug ☼ Reliability
Description
When process_certificate_record_with_timeout() times out, it cancels the certificate coroutine but
any PDF download task created in fetch_policy_pdf_bytes() can keep running in the background. This
can exceed the intended PDF_FETCH_CONCURRENCY (semaphore is released on cancellation while the
background task continues) and can also lead to unhandled task exceptions/log spam if no other
waiter ever awaits that cached task.
Code

scraper.py[R1851-1868]

+        module_out, detail_payload, categories, stats = await asyncio.wait_for(
+            process_certificate_record(
+                module,
+                dataset,
+                generated_at,
+                algorithm_source,
+                previous_module,
+                previous_detail,
+                previous_metadata,
+                client,
+                cert_semaphore,
+                pdf_semaphore,
+                pdf_cache,
+                pdf_cache_lock,
+                database_algorithms_map,
+            ),
+            timeout=CERT_PROCESS_TIMEOUT,
+        )
Evidence
The timeout wrapper introduces cancellation (asyncio.wait_for) around the entire certificate
coroutine. PDF downloads are started as background tasks in fetch_policy_pdf_bytes() and awaited
under pdf_semaphore in fetch_certificate_algorithms(), so cancellation releases the semaphore
while leaving the background task running and potentially never awaited again.

scraper.py[1833-1869]
scraper.py[1333-1348]
scraper.py[1483-1490]
scraper.py[1889-1961]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`process_certificate_record_with_timeout()` uses `asyncio.wait_for()` to time out and cancel per-certificate work. During algorithm extraction, `fetch_policy_pdf_bytes()` spawns a background task (`asyncio.create_task(...)`) to fetch a PDF and stores it in `pdf_cache`. If the certificate coroutine is cancelled due to timeout while awaiting that task, the `pdf_semaphore` context in `fetch_certificate_algorithms()` exits (releasing the semaphore) but the background PDF-fetch task continues running without semaphore backpressure and may never be awaited again (leading to noisy “Task exception was never retrieved” and wasted I/O).

## Issue Context
- The semaphore limiting happens in the *caller* (`fetch_certificate_algorithms`) not inside the cached task itself.
- Timeouts introduce cancellation in the middle of this flow, which is what makes the cached-task behavior problematic.

## Fix Focus Areas
- scraper.py[1833-1887]
- scraper.py[1333-1348]
- scraper.py[1426-1490]
- scraper.py[1889-1961]

### Suggested implementation direction
- Ensure PDF-fetch tasks are either:
 - cancelled/awaited (with `return_exceptions=True`) when no longer needed (e.g., at the end of `build_certificate_artifacts()`), **and/or**
 - created in a way that preserves concurrency limits even if the waiter is cancelled (e.g., the cached task itself acquires `pdf_semaphore`, and the outer semaphore acquisition is removed), **and/or**
 - has a `done_callback` that consumes exceptions and evicts failed tasks from `pdf_cache` so failures don’t become “never retrieved” warnings.
- Consider evicting completed tasks from `pdf_cache` to prevent growth across long runs.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
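One way to combine the second and third options, sketched with assumed shapes (pdf_cache as a plain dict of tasks, an httpx-style client); the repo's actual structures may differ:

import asyncio

async def _fetch_with_limit(url, client, pdf_semaphore):
    # The cached task itself holds the semaphore, so the concurrency limit
    # survives even if every waiter is cancelled by a timeout.
    async with pdf_semaphore:
        response = await client.get(url)
        return response.content

def get_pdf_task(url, client, pdf_cache, pdf_semaphore):
    task = pdf_cache.get(url)
    if task is None:
        task = asyncio.create_task(_fetch_with_limit(url, client, pdf_semaphore))

        def _reap(finished, url=url):
            # Consume the exception so it is never "never retrieved", and
            # evict failed fetches so a later record can retry the URL.
            if not finished.cancelled() and finished.exception() is not None:
                pdf_cache.pop(url, None)

        task.add_done_callback(_reap)
        pdf_cache[url] = task
    return task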




@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 937eb8ab2c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scraper.py
Comment on lines +1763 to +1764
categories, detailed = cached_algorithm_fields(previous_module, previous_detail)
attempt = {


P1: Honor ALGORITHM_SOURCE=none in timeout fallback

When a certificate times out and cached detail exists, this branch always pulls cached algorithms and later reapplies them, regardless of the configured algorithm_source. In ALGORITHM_SOURCE=none mode, a timeout will therefore repopulate algorithms/algorithms_detailed and mark extraction as cached instead of skipped, which contradicts the explicit skip setting and yields inconsistent API payloads/metrics for timed-out records.
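A minimal guard for this, assuming the variable names from the quoted diff:

# Skip cached algorithm reuse entirely when extraction is explicitly disabled,
# so timed-out records stay consistent with ALGORITHM_SOURCE=none.
if algorithm_source == "none":
    categories, detailed = [], []
else:
    categories, detailed = cached_algorithm_fields(previous_module, previous_detail)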


Comment thread scraper.py
Comment on lines +1851 to +1855
module_out, detail_payload, categories, stats = await asyncio.wait_for(
process_certificate_record(
module,
dataset,
generated_at,


P2: Handle cancellation from timed-out certificate tasks

Using asyncio.wait_for here cancels process_certificate_record on timeout; if that coroutine is awaiting a shared PDF-cache task, cancellation propagates to the shared task. Other certificate workers awaiting the same cached task can then receive CancelledError (not TimeoutError), and this wrapper does not catch it, so build_certificate_artifacts can fail instead of emitting the timeout fallback. This is reproducible when multiple records share a policy URL and one reaches the timeout boundary first.
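One possible mitigation, sketched under the assumption that pdf_cache maps URLs to asyncio.Task objects as in the quoted diff:

import asyncio

async def await_cached_pdf(url: str, pdf_cache: dict) -> bytes:
    # shield: cancelling this await (e.g. via a wait_for timeout) only
    # cancels the wrapper, not the shared task other workers still await.
    return await asyncio.shield(pdf_cache[url])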



