[codex] Shield shared PDF cache fetches #22
Merged
Review Summary by Qodo
Shield shared PDF cache fetches from waiter cancellation
Walkthroughs
Description
• Shield shared PDF cache fetch tasks from cancellation by individual waiters
• Prevent poisoned cache entries when certificate processing times out
• Add regression test verifying cache task survives waiter cancellation
Diagram
flowchart LR
A["Waiter cancels task"] -->|without shield| B["Cache task cancelled"]
A -->|with shield| C["Cache task survives"]
B --> D["Poisoned cache entry"]
C --> E["Other waiters reuse task"]
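The two branches of the diagram can be reproduced in a few lines. The sketch below is illustrative only: `slow_fetch` and `waiter_cancels` are hypothetical stand-ins, not functions from `scraper.py`, and the timings are arbitrary. It times out one waiter on a shared task, once awaiting the task directly and once through `asyncio.shield`, and records whether the shared task ended up cancelled.

```python
import asyncio


async def slow_fetch() -> bytes:
    # Hypothetical stand-in for the real PDF download.
    await asyncio.sleep(0.2)
    return b"%PDF-demo"


async def waiter_cancels(use_shield: bool) -> bool:
    """Time out one waiter and report whether the shared task was cancelled."""
    task = asyncio.create_task(slow_fetch())

    async def wait_on_shared() -> bytes:
        if use_shield:
            # Cancelling the waiter cancels only the shield's outer future.
            return await asyncio.shield(task)
        # Cancelling the waiter propagates into the shared task.
        return await task

    try:
        await asyncio.wait_for(wait_on_shared(), timeout=0.01)
    except asyncio.TimeoutError:
        pass

    # Let the shared task settle in either case before inspecting it.
    await asyncio.gather(task, return_exceptions=True)
    return task.cancelled()


async def main() -> tuple[bool, bool]:
    return await waiter_cancels(False), await waiter_cancels(True)


unshielded_cancelled, shielded_cancelled = asyncio.run(main())
```

In the unshielded case a cancelled task left in a cache would raise `CancelledError` for every later waiter, which is the "poisoned cache entry" in the diagram.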
File Changes
1. scraper.py
Code Review by Qodo
1. Shield breaks semaphore bound
      pdf_cache[url] = task
    - result = await task
    + result = await asyncio.shield(task)
1. Shield breaks semaphore bound 🐞 Bug ☼ Reliability
Because fetch_policy_pdf_bytes now awaits asyncio.shield(task), a per-certificate timeout (asyncio.wait_for) cancels the waiter and releases pdf_semaphore while the underlying PDF fetch task continues running. This can exceed PDF_FETCH_CONCURRENCY under rate limiting/slow fetches and degrade stability via many concurrent/sleeping background fetch tasks.
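The effect described above can be shown with a toy model. This is a sketch under stated assumptions, not the scraper's code: `fake_fetch` and `waiter` are hypothetical, the limit of 1 is an illustrative value rather than the project's `PDF_FETCH_CONCURRENCY`, and the timings are arbitrary. The semaphore is held by the waiter, so each timeout releases it while the shielded fetch keeps running.

```python
import asyncio

PDF_FETCH_CONCURRENCY = 1  # illustrative value, not the project's setting
running = 0
peak = 0


async def fake_fetch() -> bytes:
    # Hypothetical fetch that tracks how many copies run at once.
    global running, peak
    running += 1
    peak = max(peak, running)
    try:
        await asyncio.sleep(0.2)
        return b"%PDF-demo"
    finally:
        running -= 1


async def waiter(sem: asyncio.Semaphore, tasks: list) -> bytes:
    async with sem:  # held by the waiter, not by the fetch itself
        task = asyncio.create_task(fake_fetch())
        tasks.append(task)
        return await asyncio.shield(task)


async def main() -> int:
    sem = asyncio.Semaphore(PDF_FETCH_CONCURRENCY)
    tasks: list = []
    for _ in range(3):
        try:
            await asyncio.wait_for(waiter(sem, tasks), timeout=0.01)
        except asyncio.TimeoutError:
            pass  # waiter cancelled, semaphore released, fetch still running
    await asyncio.gather(*tasks)
    return peak


peak_concurrency = asyncio.run(main())
```

Here `peak_concurrency` ends up above the configured limit because each new waiter can acquire the semaphore while earlier shielded fetches are still in flight.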
Agent Prompt
## Issue description
`asyncio.shield(task)` prevents cancellation of the shared fetch task, but the current concurrency control (`pdf_semaphore`) is held by the *waiter* coroutine, not by the underlying fetch task. When a waiter is cancelled by `asyncio.wait_for(..., timeout=CERT_PROCESS_TIMEOUT)`, it releases `pdf_semaphore` while the shielded fetch keeps running, allowing more fetches to start and exceeding the intended PDF concurrency bound.
## Issue Context
- `fetch_certificate_algorithms()` uses `async with pdf_semaphore:` around `await fetch_policy_pdf_bytes(...)`.
- `process_certificate_record_with_timeout()` cancels work on timeout via `asyncio.wait_for`.
- `fetch_with_retry()` can sleep for long `Retry-After` values on 429, making it plausible for the waiter to time out while the fetch task continues.
## Fix Focus Areas
- scraper.py[1333-1348]
- scraper.py[1426-1492]
## Suggested fix approach
- Move semaphore acquisition into the cached fetch task itself, so the semaphore is held for the *lifetime of the real fetch*, independent of waiter cancellation.
- Option A (preferred): pass `pdf_semaphore` into `fetch_policy_pdf_bytes()` and create the cached task as a wrapper coroutine:
  - `async def _run(): async with pdf_semaphore: return await fetch_with_retry(...)`
  - Cache `asyncio.create_task(_run())`
- Remove the outer `async with pdf_semaphore:` around `fetch_policy_pdf_bytes()` to avoid double-limiting.
- Ensure the cache/lock still guarantees one task per URL.
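A minimal sketch of the suggested approach might look like the following. The signatures, the module-level `pdf_cache`, the stubbed `fetch_with_retry`, and the `example.invalid` URLs are all assumptions made for illustration; the real `scraper.py` may differ. The sketch also omits the cache lock: with no `await` between the cache lookup and the insert, a single event loop cannot interleave two creators for the same URL.

```python
import asyncio

# Hypothetical module-level state; real names and signatures may differ.
pdf_cache: dict[str, asyncio.Task] = {}
active = 0
peak = 0


async def fetch_with_retry(url: str) -> bytes:
    # Stand-in for the real network call; tracks concurrent fetches.
    global active, peak
    active += 1
    peak = max(peak, active)
    try:
        await asyncio.sleep(0.05)
        return b"%PDF " + url.encode()
    finally:
        active -= 1


async def fetch_policy_pdf_bytes(url: str, pdf_semaphore: asyncio.Semaphore) -> bytes:
    task = pdf_cache.get(url)
    if task is None:
        async def _run() -> bytes:
            # The semaphore is held for the lifetime of the real fetch.
            async with pdf_semaphore:
                return await fetch_with_retry(url)

        task = asyncio.create_task(_run())
        pdf_cache[url] = task
    # Waiter cancellation does not reach the cached task.
    return await asyncio.shield(task)


async def main():
    sem = asyncio.Semaphore(2)
    urls = [f"https://example.invalid/{i}.pdf" for i in range(4)]
    for url in urls:
        try:
            await asyncio.wait_for(fetch_policy_pdf_bytes(url, sem), timeout=0.01)
        except asyncio.TimeoutError:
            pass  # waiter gave up; the cached task keeps its semaphore slot
    results = await asyncio.gather(*(fetch_policy_pdf_bytes(u, sem) for u in urls))
    return peak, len(pdf_cache), results


peak_seen, cached, results = asyncio.run(main())
```

Because the slot is acquired inside `_run`, timed-out waiters no longer free capacity for extra fetches, and the second round of calls reuses the cached tasks instead of starting new ones.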
Summary
Why
Certificate-level processing is now bounded by `CERT_PROCESS_TIMEOUT`. Without shielding, a timeout while waiting on a shared cached PDF fetch can cancel the cache task itself, causing other certificate records that need the same PDF to see a poisoned cache entry or trigger avoidable failures.
Validation
- `git diff --check`
- `venv/bin/python -m py_compile scraper.py test_scraper.py validate_api.py`
- `venv/bin/python test_scraper.py`
- `venv/bin/python validate_api.py`