[codex] Shield shared PDF cache fetches #22

Merged

ethanolivertroy merged 1 commit into main from codex/shield-pdf-cache-tasks on May 14, 2026

Conversation

@ethanolivertroy (Member)

Summary

  • Shield shared in-run Security Policy PDF fetch tasks from cancellation by individual waiters.
  • Add a regression test that cancels one waiter and verifies a later waiter reuses the original cached fetch task.

Why

Certificate-level processing is now bounded by CERT_PROCESS_TIMEOUT. Without shielding, a timeout while waiting on a shared cached PDF fetch can cancel the cache task itself, causing other certificate records that need the same PDF to see a poisoned cache entry or trigger avoidable failures.
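For context, here is a minimal sketch of the shielded pattern, reconstructed from the one-line diff shown in the review below; the cache-lock name and the exact surrounding code in scraper.py are assumptions, not the merged source:

```python
import asyncio

pdf_cache: dict[str, asyncio.Task] = {}  # one in-flight fetch task per URL
pdf_cache_lock = asyncio.Lock()          # assumed name for the cache guard

async def fetch_policy_pdf_bytes(client, url):
    """Return (pdf_bytes_or_none, cache_hit); sketch only."""
    async with pdf_cache_lock:
        task = pdf_cache.get(url)
        cache_hit = task is not None
        if task is None:
            task = asyncio.create_task(
                fetch_with_retry(client, url, response_type="bytes")
            )
            pdf_cache[url] = task
    # shield(): if *this* waiter is cancelled (e.g. by a CERT_PROCESS_TIMEOUT
    # wait_for), the CancelledError still reaches the waiter, but the shared
    # task keeps running for every other certificate record awaiting it.
    result = await asyncio.shield(task)
    return (result if isinstance(result, bytes) else None), cache_hit
```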

Validation

  • git diff --check
  • venv/bin/python -m py_compile scraper.py test_scraper.py validate_api.py
  • venv/bin/python test_scraper.py
  • venv/bin/python validate_api.py

@ethanolivertroy ethanolivertroy marked this pull request as ready for review May 14, 2026 06:53
@ethanolivertroy ethanolivertroy merged commit 9e38ae0 into main May 14, 2026
1 check passed
@ethanolivertroy ethanolivertroy deleted the codex/shield-pdf-cache-tasks branch May 14, 2026 06:54
@qodo-code-review

Review Summary by Qodo

Shield shared PDF cache fetches from waiter cancellation

🐞 Bug fix 🧪 Tests


Walkthroughs

Description
• Shield shared PDF cache fetch tasks from cancellation by individual waiters
• Prevent poisoned cache entries when certificate processing times out
• Add regression test verifying cache task survives waiter cancellation
Diagram
flowchart LR
  A["Waiter cancels task"] -->|without shield| B["Cache task cancelled"]
  A -->|with shield| C["Cache task survives"]
  B --> D["Poisoned cache entry"]
  C --> E["Other waiters reuse task"]


File Changes

1. scraper.py 🐞 Bug fix +1/-1

Shield PDF cache task from cancellation

• Wrap await task with asyncio.shield() to protect shared cache task from cancellation
• Prevents individual waiter timeouts from poisoning the PDF cache for other certificate records



2. test_scraper.py 🧪 Tests +60/-0

Add regression test for cache task cancellation shielding

• Add a new test, test_fetch_policy_pdf_bytes_shields_shared_cache_task_from_cancellation(), that verifies the cache task survives when one waiter is cancelled (rough shape sketched below)
• The test creates two waiters, cancels the first, and confirms the second waiter successfully reuses the original cached task
• Register the new test in the main() test runner
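A rough shape of that test, inferred from the description and the diff excerpt later in this review; the client fixture, the slow fake fetch, and the module-level pdf_cache import are assumed:

```python
async def test_fetch_policy_pdf_bytes_shields_shared_cache_task_from_cancellation():
    url = "https://csrc.nist.gov/slow.pdf"
    first_waiter = asyncio.create_task(fetch_policy_pdf_bytes(client, url))
    await asyncio.sleep(0)              # let the waiter insert the cache entry
    first_waiter.cancel()
    try:
        await first_waiter
    except asyncio.CancelledError:
        pass
    cached_task = pdf_cache[url]
    assert not cached_task.cancelled()  # shield kept the shared task alive
    # A later waiter must reuse the original task rather than refetch.
    result, cache_hit = await fetch_policy_pdf_bytes(client, url)
    assert cache_hit and result is not None
    assert pdf_cache[url] is cached_task
```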





qodo-code-review Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)



Action required

1. Shield breaks semaphore bound 🐞 Bug ☼ Reliability
Description
Because fetch_policy_pdf_bytes now awaits asyncio.shield(task), a per-certificate timeout (asyncio.wait_for) cancels the waiter and releases pdf_semaphore while the underlying PDF fetch task continues running. This can exceed PDF_FETCH_CONCURRENCY under rate limiting or slow fetches and degrade stability via many concurrent or sleeping background fetch tasks.
Code

scraper.py[1347]

+    result = await asyncio.shield(task)
Evidence
The shielded task can outlive the caller's cancellation, but the semaphore is held only in the caller's scope; timeouts cancel the caller and release the semaphore while the fetch continues, defeating the concurrency limit.

scraper.py[1333-1348]
scraper.py[1426-1492]
scraper.py[1833-1887]
scraper.py[1299-1315]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`asyncio.shield(task)` prevents cancellation of the shared fetch task, but the current concurrency control (`pdf_semaphore`) is held by the *waiter* coroutine, not by the underlying fetch task. When a waiter is cancelled by `asyncio.wait_for(..., timeout=CERT_PROCESS_TIMEOUT)`, it releases `pdf_semaphore` while the shielded fetch keeps running, allowing more fetches to start and exceeding the intended PDF concurrency bound.

## Issue Context
- `fetch_certificate_algorithms()` uses `async with pdf_semaphore:` around `await fetch_policy_pdf_bytes(...)`.
- `process_certificate_record_with_timeout()` cancels work on timeout via `asyncio.wait_for`.
- `fetch_with_retry()` can sleep for long `Retry-After` values on 429, making it plausible for the waiter to time out while the fetch task continues.

## Fix Focus Areas
- scraper.py[1333-1348]
- scraper.py[1426-1492]

## Suggested fix approach
- Move semaphore acquisition into the cached fetch task itself, so the semaphore is held for the *lifetime of the real fetch*, independent of waiter cancellation (see the sketch below).
  - Option A (preferred): pass `pdf_semaphore` into `fetch_policy_pdf_bytes()` and create the cached task as a wrapper coroutine:
    - `async def _run(): async with pdf_semaphore: return await fetch_with_retry(...)`
    - Cache `asyncio.create_task(_run())`
    - Remove the outer `async with pdf_semaphore:` around `fetch_policy_pdf_bytes()` to avoid double-limiting.
  - Ensure the cache/lock still guarantees one task per URL.
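A sketch of Option A under the assumptions in this prompt (the lock name and function signature are assumptions; fetch_with_retry's call shape is taken from the diff above):

```python
import asyncio

async def fetch_policy_pdf_bytes(client, url, pdf_semaphore):
    async with pdf_cache_lock:
        task = pdf_cache.get(url)
        cache_hit = task is not None
        if task is None:
            async def _run():
                # The semaphore is now held by the cached task itself, so a
                # cancelled waiter can no longer release it while the real
                # fetch is still running; the concurrency bound holds.
                async with pdf_semaphore:
                    return await fetch_with_retry(client, url, response_type="bytes")
            task = asyncio.create_task(_run())
            pdf_cache[url] = task
    result = await asyncio.shield(task)
    return (result if isinstance(result, bytes) else None), cache_hit
```

Callers such as fetch_certificate_algorithms() would then drop their outer `async with pdf_semaphore:` block, per the double-limiting note above.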




Remediation recommended

2. Failed PDF tasks stay cached 🐞 Bug ≡ Correctness
Description
fetch_policy_pdf_bytes caches the created task and never evicts it, even when the fetch returns None (or the task errors/cancels), so later callers get cache_hit=True but cannot retry in-run. This can cause persistent per-run algorithm extraction failures after transient HTTP/network errors.
Code

scraper.py[R1344-1348]

            task = asyncio.create_task(fetch_with_retry(client, url, response_type="bytes"))
            pdf_cache[url] = task

-    result = await task
+    result = await asyncio.shield(task)
    return result if isinstance(result, bytes) else None, cache_hit
Evidence
The code stores a task in pdf_cache[url] and returns None on non-bytes results without removing the entry, while fetch_with_retry can return None on common failure paths; this makes the cached task a permanent in-run failure for that URL.

scraper.py[1333-1348]
scraper.py[1299-1330]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The PDF cache stores an `asyncio.Task` per URL and never removes it. If `fetch_with_retry(...)` returns `None` (e.g., after retries) or if the task completes exceptionally, the cache entry remains and subsequent calls will await the already-failed task and immediately get `None`/exception again, preventing any in-run retry.

## Issue Context
`fetch_with_retry()` returns `None` on HTTPStatusError/RequestError after retries, and `fetch_policy_pdf_bytes()` converts any non-`bytes` result to `(None, cache_hit)` without clearing the cache.

## Fix Focus Areas
- scraper.py[1299-1330]
- scraper.py[1333-1348]

## Suggested fix approach
- After awaiting the task, if the result is not `bytes`, remove the cache entry *iff* it still points to the same task (guard with the lock) so future calls can recreate a fresh fetch task (see the sketch below).
- Also consider removing the entry if the task ends in a cancelled/exception state, and optionally attach a done-callback that consumes/logs exceptions, to avoid "Task exception was never retrieved" warnings when all waiters are cancelled.
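A sketch of the eviction guard, written as the tail of fetch_policy_pdf_bytes() after the cache lookup; the pdf_cache_lock name is an assumption:

```python
    try:
        result = await asyncio.shield(task)
    except asyncio.CancelledError:
        raise          # this waiter timed out; the shared task may still be running
    except Exception:
        result = None  # the task itself failed; fall through to eviction
    if not isinstance(result, bytes):
        # Evict only while the cache still points at this exact task, so a
        # newer replacement task is never clobbered by a stale waiter.
        async with pdf_cache_lock:
            if pdf_cache.get(url) is task:
                del pdf_cache[url]
        return None, cache_hit
    return result, cache_hit
```

Re-raising on CancelledError matters: evicting there would drop a still-running shared task and defeat the shielding this PR adds.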




Advisory comments

3. Test relies on single tick 🐞 Bug ☼ Reliability
Description
The new cancellation-shielding test cancels the first waiter after a single asyncio.sleep(0) and then directly indexes pdf_cache[url], which is theoretically a race if the waiter hasn't yet inserted the cache entry. This is unlikely in practice here, but it is easy to make deterministic and more robust.
Code

test_scraper.py[R844-852]

+        await asyncio.sleep(0)
+        first_waiter.cancel()
+        try:
+            await first_waiter
+        except asyncio.CancelledError:
+            pass
+
+        cached_task = pdf_cache["https://csrc.nist.gov/slow.pdf"]
+        was_cancelled = cached_task.cancelled()
Evidence
The test uses sleep(0), then cancels and immediately indexes pdf_cache[URL]; cache insertion happens inside the concurrently scheduled coroutine, so the ordering is assumed rather than enforced.

test_scraper.py[812-867]
scraper.py[1339-1346]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The test assumes the first waiter has populated `pdf_cache[url]` after a single event-loop tick. If scheduling changes, this could become a KeyError.

## Issue Context
The test cancels the waiter almost immediately and then indexes the cache directly.

## Fix Focus Areas
- test_scraper.py[832-859]

## Suggested fix approach
- Wait until the URL key is present before cancelling (see the sketch below), e.g.:
  - `while url not in pdf_cache: await asyncio.sleep(0)`
  - or use an `asyncio.Event` set right after cache insertion.
- Alternatively use `cached_task = pdf_cache.get(url)` with an assert.
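For example, the first variant drops into the test excerpt above like this (a sketch, not the shipped test):

```python
        url = "https://csrc.nist.gov/slow.pdf"
        # Deterministic ordering: don't assume one event-loop tick suffices;
        # poll until the waiter has actually inserted the shared cache task.
        while url not in pdf_cache:
            await asyncio.sleep(0)
        first_waiter.cancel()
        try:
            await first_waiter
        except asyncio.CancelledError:
            pass
        cached_task = pdf_cache[url]        # safe: key presence was enforced
        assert not cached_task.cancelled()
```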




Comment thread on scraper.py

            pdf_cache[url] = task

-    result = await task
+    result = await asyncio.shield(task)
Action required

1. Shield breaks semaphore bound 🐞 Bug ☼ Reliability — see the identically titled item under "Action required" in the review above.
