## Summary
Add a retry mechanism with exponential backoff for transient GitHub API failures (HTTP 500/502/503) to prevent webhook processing failures during temporary GitHub outages.
## Problem / Motivation
The webhook server currently has no retry logic for transient GitHub API failures. When GitHub's API experiences temporary issues, all API calls fail as soon as urllib3's built-in retries are exhausted, which happens quickly: only ~3 retries with no meaningful backoff.
This was observed in production on the RedHat webhook server for the mtv-api-tests repository where multiple check_run and pull_request webhooks failed with:
```
HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/RedHatQE/mtv-api-tests/pulls/420/requested_reviewers (Caused by ResponseError('too many 500 error responses'))
```
Multiple operations failed in a cascade: label management, assignee assignment, reviewer requests, and Compare API calls.
## Requirements
- Add the `tenacity` library as a dependency for retry logic
- Create a utility wrapper function (e.g., `github_api_call()`) that wraps `asyncio.to_thread()` calls with retry + exponential backoff
- Retry ONLY on transient errors: HTTP 500, 502, 503, `ConnectionError`, `MaxRetryError`, `ResponseError`
- Do NOT retry on permanent errors: 401 (Unauthorized), 403 (Forbidden), 404 (Not Found), 422 (Validation)
- Use exponential backoff: e.g., 2s → 4s → 8s → 16s (max ~4 retries, ~30s total)
- Log each retry attempt at warning level
- Replace raw `asyncio.to_thread()` calls across handlers with the new retry wrapper
## Deliverables
- Add `tenacity` to `pyproject.toml` dependencies
- Create `webhook_server/utils/github_retry.py`
- Replace `asyncio.to_thread()` calls in `webhook_server/libs/github_api.py` with retry wrapper
- Replace `asyncio.to_thread()` calls in handler files (`labels_handler.py`, `pull_request_handler.py`, `issue_comment_handler.py`, `owners_files_handler.py`, `check_run_handler.py`, `pull_request_review_handler.py`, `runner_handler.py`) with retry wrapper
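The call-site migration in the handler files is then mechanical; a minimal sketch, where `github_api_call` stands in for the retry wrapper described above and `FakePullRequest` is a test double rather than the real PyGithub class:

```python
import asyncio


async def github_api_call(func, *args, **kwargs):
    # Stand-in for the tenacity-based retry wrapper; the real helper
    # would add exponential backoff and transient-error filtering here.
    return await asyncio.to_thread(func, *args, **kwargs)


class FakePullRequest:
    """Minimal stand-in for a PyGithub PullRequest object."""

    def get_labels(self):
        return ["bug", "needs-review"]


async def handle_labels(pr):
    # Before: labels = await asyncio.to_thread(pr.get_labels)
    # After: the same blocking call routed through the retry wrapper.
    return await github_api_call(pr.get_labels)


print(asyncio.run(handle_labels(FakePullRequest())))
```

Because the wrapper keeps the `await ...(callable, *args)` shape of `asyncio.to_thread()`, each replacement is a one-line change with no restructuring of the surrounding handler logic.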