Skip to content

[codex] Bound certificate task scheduling#21

Merged
ethanolivertroy merged 1 commit into
mainfrom
codex/fix-timeout-window
May 14, 2026
Merged

[codex] Bound certificate task scheduling#21
ethanolivertroy merged 1 commit into
mainfrom
codex/fix-timeout-window

Conversation

@ethanolivertroy
Copy link
Copy Markdown
Member

Summary

Fixes the per-certificate timeout scheduler introduced in the previous PR so timeout accounting starts only for a bounded active task window instead of for every certificate at once.

The artifact builder now keeps at most max(CERT_FETCH_CONCURRENCY, PDF_FETCH_CONCURRENCY) certificate tasks active, schedules another task as each completes, and still preserves the original output order.

Validation

  • venv/bin/python -m py_compile scraper.py test_scraper.py validate_api.py
  • venv/bin/python test_scraper.py
  • venv/bin/python validate_api.py
  • git diff --check

@ethanolivertroy ethanolivertroy marked this pull request as ready for review May 14, 2026 06:42
@ethanolivertroy ethanolivertroy merged commit ddafb26 into main May 14, 2026
1 check passed
@ethanolivertroy ethanolivertroy deleted the codex/fix-timeout-window branch May 14, 2026 06:42
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Bound certificate task scheduling with active window limits

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Implement bounded task scheduling for certificate processing
  - Limits concurrent active tasks to max(CERT_FETCH_CONCURRENCY, PDF_FETCH_CONCURRENCY)
  - Schedules new tasks as each completes instead of starting all at once
  - Preserves original output order despite concurrent execution
• Add comprehensive test for bounded task scheduling behavior
  - Validates maximum concurrent tasks never exceeds window size
  - Verifies all modules are processed exactly once
  - Confirms output order matches input order
  - Ensures stats accumulate correctly from bounded tasks
Diagram
flowchart LR
  A["All modules queued"] --> B["Schedule up to task_window tasks"]
  B --> C["Process tasks concurrently"]
  C --> D["Task completes"]
  D --> E["Schedule next pending module"]
  E --> F["Maintain output order"]
  F --> G["Return results"]
Loading

Grey Divider

File Changes

1. scraper.py ✨ Enhancement +37/-19

Implement bounded certificate task scheduling

• Replace eager task creation with bounded scheduling using asyncio.wait()
• Introduce pending_tasks set and schedule_next_certificate() function
• Calculate task_window as max(CERT_FETCH_CONCURRENCY, PDF_FETCH_CONCURRENCY)
• Schedule new task after each completion to maintain bounded concurrency
• Preserve output order by storing results in index-keyed dictionary

scraper.py


2. test_scraper.py 🧪 Tests +96/-0

Add bounded task scheduling validation test

• Add import for build_certificate_artifacts function
• Create test_build_certificate_artifacts_bounds_active_tasks() test function
• Mock process_certificate_record_with_timeout to track concurrent execution
• Verify max concurrent tasks respects window size of 3
• Validate all modules scheduled, output order preserved, and stats accumulated
• Register new test in main test suite

test_scraper.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)

Grey Divider


Remediation recommended

1. Orphan tasks on exception 🐞 Bug ☼ Reliability
Description
In build_certificate_artifacts, tasks are spawned with asyncio.create_task but there is no
try/finally to cancel+await remaining pending_tasks if an unexpected exception propagates while
awaiting a completed task. This can leave background tasks running while the surrounding
httpx.AsyncClient context is exiting, causing secondary failures and resource leakage (dangling
tasks).
Code

scraper.py[R1957-1964]

+        while pending_tasks:
+            done, pending_tasks = await asyncio.wait(
+                pending_tasks,
+                return_when=asyncio.FIRST_COMPLETED,
+            )
+            for task in done:
+                index, module_out, detail_payload, categories, task_stats = await task
+                completed += 1
Evidence
The function spawns background tasks and awaits them in a loop, but there is no try/finally
cleanup to cancel remaining tasks if the loop aborts early; those tasks use the shared
httpx.AsyncClient for HTTP calls, so they may continue attempting requests after the client
context is exited.

scraper.py[1914-1979]
scraper.py[1299-1348]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`build_certificate_artifacts()` creates background tasks via `asyncio.create_task()` and drives them via `asyncio.wait()`. If an unexpected exception propagates while awaiting/processing a completed task, the function can exit the `async with httpx.AsyncClient(...)` block with other tasks still pending, leaving them running against a closing/closed client.

### Issue Context
This is an error-path reliability problem (not the happy path). A robust pattern is to wrap the scheduling/collection loop in `try/finally`, cancel any remaining tasks in `finally`, and `await asyncio.gather(*pending, return_exceptions=True)` to avoid orphaned tasks and “Task exception was never retrieved” warnings.

### Fix Focus Areas
- scraper.py[1914-1978]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant