[codex] Bound certificate processing time #20
Conversation
Review Summary by Qodo

Add configurable certificate processing timeout with fallback to cached data.

Description

- Add configurable per-certificate processing timeout with CERT_PROCESS_TIMEOUT environment variable (default 900s)
- Implement timeout handling that preserves cached detail and algorithm data when available
- Record certificate_timeouts metric in extraction statistics
- Switch certificate artifact builder to use asyncio.as_completed() for progress based on completed work

Diagram

```mermaid
flowchart LR
    A["Certificate Processing"] -->|"timeout"| B["Timeout Handler"]
    B -->|"has cached data"| C["Preserve Cached Detail & Algorithms"]
    B -->|"no cached data"| D["Record Timeout Attempt"]
    C --> E["Return with Metrics"]
    D --> E
    F["asyncio.as_completed"] -->|"track progress"| G["Completion-based Reporting"]
```

File Changes

1. scraper.py
Code Review by Qodo

1. Cached algorithms dropped
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 937eb8ab2c
```python
categories, detailed = cached_algorithm_fields(previous_module, previous_detail)
attempt = {
```
Honor ALGORITHM_SOURCE=none in timeout fallback
When a certificate times out and cached detail exists, this branch always pulls cached algorithms and later reapplies them, regardless of the configured algorithm_source. In ALGORITHM_SOURCE=none mode, a timeout will therefore repopulate algorithms/algorithms_detailed and mark extraction as cached instead of skipped, which contradicts the explicit skip setting and yields inconsistent API payloads and metrics for timed-out records.
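One way to honor the skip setting is to gate the cached-algorithm reuse on the configured source before touching the cache. The sketch below is illustrative only: the helper name `timeout_fallback_algorithms` and the `algorithms` / `algorithms_detailed` field names are assumptions modeled on the diff, not the repository's actual API.

```python
def timeout_fallback_algorithms(algorithm_source, previous_module, previous_detail):
    """Return (categories, detailed, status) for a timed-out record,
    honoring an explicit ALGORITHM_SOURCE=none skip."""
    if algorithm_source == "none":
        # Explicit skip: never repopulate algorithms from cache on timeout.
        return [], [], "skipped"
    # Otherwise fall back to whatever the cache holds.
    categories = list((previous_module or {}).get("algorithms") or [])
    detailed = list((previous_detail or {}).get("algorithms_detailed") or [])
    return categories, detailed, ("cached" if categories or detailed else "timeout")

print(timeout_fallback_algorithms("none", {"algorithms": ["AES"]}, None))
print(timeout_fallback_algorithms("pdf", {"algorithms": ["AES"]}, None))
```

With this guard, a timed-out record under `none` reports a `skipped` status with empty fields instead of masquerading as a cache hit.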
```python
module_out, detail_payload, categories, stats = await asyncio.wait_for(
    process_certificate_record(
        module,
        dataset,
        generated_at,
```
Handle cancellation from timed out certificate tasks
Using asyncio.wait_for here cancels process_certificate_record on timeout; if that coroutine is awaiting a shared PDF-cache task, cancellation propagates to the shared task. Other certificate workers awaiting the same cached task can then receive CancelledError (not TimeoutError), and this wrapper does not catch it, so build_certificate_artifacts can fail instead of emitting the timeout fallback. This is reproducible when multiple records share a policy URL and one reaches the timeout boundary first.
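The failure mode and one possible fix can be shown in a self-contained demo (this is not the project's code; `shared_pdf_fetch` and the timings are invented). Wrapping the await on a shared task in `asyncio.shield()` means `wait_for` cancels only the shielded wait, so the shared task keeps running for the other workers:

```python
import asyncio

async def shared_pdf_fetch() -> str:
    # Stands in for a shared PDF-cache task awaited by several workers.
    await asyncio.sleep(0.1)
    return "pdf-bytes"

async def worker(shared: "asyncio.Task[str]", timeout: float) -> str:
    try:
        # shield() absorbs the cancellation from wait_for: only this
        # worker's wait is cancelled, not the shared task itself.
        return await asyncio.wait_for(asyncio.shield(shared), timeout)
    except asyncio.TimeoutError:
        return "timeout-fallback"

async def main() -> list:
    shared = asyncio.create_task(shared_pdf_fetch())
    return await asyncio.gather(
        worker(shared, timeout=1.0),   # completes normally
        worker(shared, timeout=0.01),  # hits the timeout boundary first
    )

print(asyncio.run(main()))  # ['pdf-bytes', 'timeout-fallback']
```

Without the `shield()`, the 0.01s worker's timeout would cancel `shared` itself, and the other worker would see `CancelledError` rather than its result.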
```python
categories, detailed = cached_algorithm_fields(previous_module, previous_detail)
attempt = {
    "source": algorithm_source,
    "url": str(source_url or ""),
    "status": "timeout",
}

detail_payload: Optional[Dict] = None
if cert_number is not None and previous_detail:
    detail_payload = prepare_reused_detail_payload(
        previous_detail,
        module,
        cert_number,
        dataset,
        generated_at,
    )
    stats["html_reused"] += 1
    for key in MODULE_DETAIL_FIELDS:
        value = detail_payload.get(key)
        if value not in (None, [], "", {}):
            module_out[key] = value
    module_out["security_policy_url"] = detail_payload.get("security_policy_url") or module_out.get("security_policy_url")
    cached_source, cached_source_url = cached_algorithm_extraction_source(
        previous_module,
        previous_detail,
        previous_metadata,
    )
    provenance = build_algorithm_extraction_provenance(
        algorithm_source,
        "cached" if categories or detailed else "miss",
        cached_source if categories or detailed else "timeout",
        cached_source_url or source_url,
        categories,
        detailed,
        cached=bool(categories or detailed),
        attempts=[attempt],
    )
    if categories or detailed:
        stats["pdf_reused"] += 1
        stats["algorithm_cache_hits"] += 1
        stats["algorithm_successes"] += 1
    else:
        stats["algorithm_misses"] += 1
    apply_algorithm_fields(detail_payload, categories, detailed)
    apply_algorithm_extraction_provenance(detail_payload, provenance, include_attempts=True)
    apply_algorithm_fields(module_out, categories, detailed)
    apply_algorithm_extraction_provenance(module_out, provenance)
    module_out["detail_available"] = True
    return module_out, detail_payload, categories, stats

stats["html_failed"] += 1
if algorithm_source in CACHEABLE_ALGORITHM_SOURCES:
    stats["pdf_failed"] += 1
if algorithm_source != "none":
    stats["algorithm_misses"] += 1
strip_algorithm_fields(module_out)
provenance = build_algorithm_extraction_provenance(
    algorithm_source,
    "miss",
    "timeout",
    source_url,
    [],
    [],
    attempts=[attempt],
)
apply_algorithm_extraction_provenance(module_out, provenance)
module_out["detail_available"] = False
return module_out, None, [], stats
```
1. Cached algorithms dropped 🐞 Bug ≡ Correctness
build_certificate_timeout_result() strips algorithm fields and returns empty categories when previous_detail is missing, even if previous_module contains cached algorithms. This causes timed-out certificates to lose previously extracted algorithm data loaded from modules.json.
Agent Prompt
## Issue description
In `build_certificate_timeout_result()`, cached algorithm fields are computed from `(previous_module, previous_detail)` but are only applied/returned when `previous_detail` exists. If the detail JSON is missing but the cached module row contains algorithms, the timeout fallback currently calls `strip_algorithm_fields(module_out)` and returns `[]`, discarding available cached algorithms.
## Issue Context
`load_previous_outputs()` loads cached module rows from `api/modules.json` and certificate detail payloads separately from `DETAIL_DIR`. It is therefore valid for `previous_module` to exist while `previous_detail` is absent; the timeout fallback should still preserve cached algorithm fields from `previous_module` in that case.
## Fix Focus Areas
- scraper.py[830-860]
- scraper.py[1742-1830]
### Suggested implementation direction
- On the `previous_detail is None` timeout path, if `categories`/`detailed` from `cached_algorithm_fields(previous_module, None)` are non-empty:
- apply them to `module_out` via `apply_algorithm_fields(...)`
- build provenance with a status like `cached` (or a clear timeout-specific cached status) and include the timeout attempt
- increment the same cache-hit counters used in the `previous_detail` branch (`pdf_reused`, `algorithm_cache_hits`, `algorithm_successes`) as appropriate
- return those categories instead of `[]`
- Keep `detail_available` as `False` and `detail_payload` as `None` when there is no cached detail payload file.
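The suggested direction can be sketched standalone. The helper below is a stand-in: `cached_algorithm_fields` is stubbed with an assumed field layout, and `build_certificate_timeout_result_no_detail` is a hypothetical name for the `previous_detail is None` branch, not the repository's function.

```python
def cached_algorithm_fields(previous_module, previous_detail):
    # Stub for the repo helper: pull algorithms cached on the module row
    # and, when present, on the detail payload.
    cats = list((previous_module or {}).get("algorithms") or [])
    det = list((previous_detail or {}).get("algorithms_detailed") or [])
    return cats, det

def build_certificate_timeout_result_no_detail(previous_module, module_out, stats):
    # previous_detail is None: still preserve module-row algorithms, but
    # keep detail_payload None and detail_available False.
    categories, detailed = cached_algorithm_fields(previous_module, None)
    if categories or detailed:
        module_out["algorithms"] = categories
        module_out["algorithms_detailed"] = detailed
        stats["pdf_reused"] += 1
        stats["algorithm_cache_hits"] += 1
        stats["algorithm_successes"] += 1
    else:
        stats["algorithm_misses"] += 1
    module_out["detail_available"] = False
    return module_out, None, categories, stats

stats = dict(pdf_reused=0, algorithm_cache_hits=0,
             algorithm_successes=0, algorithm_misses=0)
out, detail, cats, stats = build_certificate_timeout_result_no_detail(
    {"algorithms": ["AES", "SHA-256"]}, {}, stats)
print(cats, detail, out["detail_available"])  # ['AES', 'SHA-256'] None False
```

The key difference from the buggy path is that the cached categories flow through to the return value instead of being replaced by `[]`.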
Summary

Adds a configurable `CERT_PROCESS_TIMEOUT` for per-certificate processing and records `certificate_timeouts` in extraction metrics. If one certificate times out, the scraper preserves previously cached detail and algorithm data when available and records a timeout attempt in provenance. The certificate artifact builder now collects tasks with `asyncio.as_completed`, so progress reporting is based on completed work instead of input order.

Validation

- `venv/bin/python -m py_compile scraper.py test_scraper.py validate_api.py`
- `venv/bin/python test_scraper.py`
- `venv/bin/python validate_api.py`
- `git diff --check`
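The `asyncio.as_completed` change described in the summary can be illustrated with a minimal standalone sketch; the record names and delays are invented for the demo and do not come from scraper.py:

```python
import asyncio

async def process(record: str, delay: float) -> str:
    # Stand-in for per-certificate processing work.
    await asyncio.sleep(delay)
    return record

async def main() -> list:
    tasks = [
        asyncio.create_task(process("cert-slow", 0.05)),
        asyncio.create_task(process("cert-fast", 0.01)),
    ]
    finished = []
    for i, fut in enumerate(asyncio.as_completed(tasks), start=1):
        result = await fut
        finished.append(result)
        # Progress is reported per completed task, not per input position.
        print(f"{i}/{len(tasks)} processed: {result}")
    return finished

print(asyncio.run(main()))  # ['cert-fast', 'cert-slow']
```

Because `as_completed` yields futures in completion order, the fast record is reported first even though it was submitted second, which is exactly the completion-based progress behavior the PR adopts.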