feat(int-evolution-go): exponential backoff + jitter on api_request (PR-2) #79
mt-alarcon wants to merge 3 commits into
Conversation
…ps 1+2)

Step 1 — Migration (models.py + app.py):
- TriggerExecution gains 3 nullable columns: idempotency_key, error_category, last_replay_at
- to_dict() exposes the 3 new fields
- Idempotent auto-migrate at startup: ALTER TABLE + IF NOT EXISTS in each block
- Partial unique index uq_trigger_idem (trigger_id, idempotency_key) WHERE NOT NULL
- Basic index ix_trigger_executions_idem_key for lookups by key
- SQLite 3.51 confirmed — native partial index support; EXPLAIN QUERY PLAN confirms index usage

Step 2 — Silent dedup (triggers.py):
- webhook_receiver extracts idem_key from idempotency_key / messageId / data.messageId
- If the key already exists: idempotent_replay log + silent 200 OK (pattern F6)
- Race condition: IntegrityError on db.commit() → rollback + silent 200 OK
- Legacy (GitHub, Stripe, Linear): no key → idem_key=None → normal flow unchanged
- Cleanup: current_app moved to a top-level import; inline imports removed

Tests passed: idempotent migration up/down, partial index uniqueness, NULLs unconstrained, key extraction (6 cases), race condition via IntegrityError, EXPLAIN QUERY PLAN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
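The partial unique index and the race behavior described above can be sketched with plain sqlite3. Column list and index name follow the commit message; the abbreviated table shape and the rest of the flow are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trigger_executions (
    id INTEGER PRIMARY KEY,
    trigger_id INTEGER NOT NULL,
    idempotency_key TEXT,
    error_category TEXT,
    last_replay_at TEXT)""")

# Partial unique index: uniqueness only applies when a key is present,
# so legacy rows with idempotency_key = NULL stay unconstrained.
conn.execute("""CREATE UNIQUE INDEX IF NOT EXISTS uq_trigger_idem
    ON trigger_executions (trigger_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL""")

conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, 'msg-123')")
conn.execute("INSERT INTO trigger_executions (trigger_id) VALUES (1)")  # NULL key: allowed
conn.execute("INSERT INTO trigger_executions (trigger_id) VALUES (1)")  # NULL key again: still allowed

# A duplicate (trigger_id, idempotency_key) pair raises IntegrityError --
# the webhook handler treats this as an idempotent replay and answers 200 OK.
try:
    conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, 'msg-123')")
    is_replay = False
except sqlite3.IntegrityError:
    is_replay = True
print(is_replay)  # True
```

Handling the IntegrityError at commit time (rather than doing a read-then-write check alone) is what closes the race window between two concurrent deliveries of the same event.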
…PR-2)

Adds resilient retry logic to the Evolution Go API client. Transient HTTP errors (5xx, URLError, socket.timeout) now retry up to 3 times with exponential backoff + jitter; HTTP 4xx errors raise immediately (no retry, deterministic client errors).

## Changes

**`.claude/skills/int-evolution-go/scripts/evolution_go_client.py`:**
- New helper `_retry_http_call_client(do_call, max_attempts=3, base_delay=2.0, max_delay=8.0)` — generic retry wrapper with exponential backoff + jitter. Logs structured JSON events (`api_request_retry` / `api_request_failed`).
- Refactor `api_request` to wrap the actual HTTP call in `_do_call` and delegate to the retry helper. Removes inline try/except/sys.exit so library callers can handle errors; the CLI `main()` catches and exits as before.
- Refactor `main()` to wrap handler invocation in try/except for `urllib.error.HTTPError` / `URLError` / `socket.timeout` — preserves the exact CLI exit-1-on-failure behavior while letting library users propagate.

**`tests/whatsapp/test_retry_backoff.py` (new):** 4 synthetic tests covering:
1. HTTP 500 x3 → retries then raises (TestApiRequestRetry)
2. HTTP 400 → raises immediately, no retry
3. URLError x3 → retries then raises
4. Success on third attempt → returns result, call_count=3

All 4 tests pass against the modified `api_request`.

## Worst-case latency

3 attempts with all 5xx: sleep ≤ 2^0 + 0.5 + 2^1 + 0.5 = 4.0s total (each attempt's delay capped at max_delay=8s). Acceptable for an API call retry.

## Series

- PR-1 (evolution-foundation#78 — merged or pending): idempotency_key migration + silent dedup
- **PR-2 (this):** exponential backoff + jitter
- PR-3 (next): DLQ classification + UI Replay + instrumentation
Reviewer's Guide

Adds resilient exponential backoff + jitter retry logic around Evolution Go API HTTP calls, refactors CLI error handling to raise instead of exiting for library callers while preserving CLI behavior, and wires in idempotent trigger execution support and schema/index migration plus tests for the new retry behavior.

Sequence diagram for Evolution Go API request retry and CLI error handling

sequenceDiagram
actor User
participant CLI as main
participant Handler
participant Client as api_request
participant Retry as _retry_http_call_client
participant HTTP as urllib_request_urlopen
User->>CLI: invoke command
CLI->>Handler: handler(args)
Handler->>Client: api_request(method, path, data)
Client->>Retry: _retry_http_call_client(_do_call)
loop attempts
Retry->>HTTP: _do_call() / urlopen(req)
alt transient_error
HTTP-->>Retry: raise HTTPError 5xx or URLError or timeout
Retry->>Retry: time.sleep(delay)
else success
HTTP-->>Retry: response
Retry-->>Client: parsed JSON
Client-->>Handler: result
Handler-->>CLI: success
CLI-->>User: success output
end
end
alt persistent_failure
Retry-->>Client: raise last exception
Client-->>Handler: propagate exception
Handler-->>CLI: propagate exception
CLI->>CLI: catch HTTPError/URLError/timeout
CLI-->>User: print JSON error
CLI->>CLI: sys.exit(1)
end
Entity relationship diagram for Trigger and TriggerExecution idempotent execution support

erDiagram
Trigger {
int id PK
}
TriggerExecution {
int id PK
int trigger_id FK
text event_data
string status
string idempotency_key
string error_category
datetime last_replay_at
}
Trigger ||--o{ TriggerExecution : has_executions
Hey - I've found 2 issues, and left some high level feedback:
- In `_retry_http_call_client`, consider validating `max_attempts >= 1` (and/or initializing a sensible default exception) so you never reach `raise last_exc` with `last_exc` still `None` if the helper is called with `max_attempts=0`.
- The retry branches for 5xx `HTTPError` and `(URLError, socket.timeout)` duplicate the delay calculation and logging; factoring this into a small internal helper would reduce repetition and make it easier to tweak the retry behavior or log schema in one place.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_retry_http_call_client`, consider validating `max_attempts >= 1` (and/or initializing a sensible default exception) so you never reach `raise last_exc` with `last_exc` still `None` if the helper is called with `max_attempts=0`.
- The retry branches for 5xx `HTTPError` and `(URLError, socket.timeout)` duplicate the delay calculation and logging; factoring this into a small internal helper would reduce repetition and make it easier to tweak the retry behavior or log schema in one place.
## Individual Comments
### Comment 1
<location path=".claude/skills/int-evolution-go/scripts/evolution_go_client.py" line_range="50-59" />
<code_context>
+def _retry_http_call_client(do_call, max_attempts=3, base_delay=2.0, max_delay=8.0):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Guard against max_attempts <= 0 and make base_delay influence the first retry delay
Two robustness points in `_retry_http_call_client`:
1) For `max_attempts <= 0`, the loop never executes and `last_exc` stays `None`, so `raise last_exc` will fail with a `TypeError`. A guard like `if max_attempts < 1: raise ValueError("max_attempts must be >= 1")` makes this case explicit.
2) The current backoff uses `base_delay ** attempt`, so the first retry delay is always `1` (`attempt == 0`). To let `base_delay` control the first delay and follow common exponential backoff patterns, consider `delay = min(base_delay * (2 ** attempt) + jitter, max_delay)` (or start the exponent at 1).
Suggested implementation:
```python
def _retry_http_call_client(do_call, max_attempts=3, base_delay=2.0, max_delay=8.0):
"""Exponential backoff + jitter for Evolution Go API calls.
Retries on HTTP 5xx, urllib.error.URLError, and socket.timeout (transient).
NEVER retries on HTTP 4xx (deterministic client errors).
Returns the result of do_call() on success.
Raises the last exception after max_attempts are exhausted.
Raises immediately on HTTP 4xx (no retry).
"""
if max_attempts < 1:
raise ValueError("max_attempts must be >= 1")
last_exc = None
```
I can’t see the body of the retry loop, but to make `base_delay` influence the first retry delay, you should update the backoff calculation wherever it’s computed. For example, if you currently have something like:
```python
delay = min(base_delay ** attempt + jitter, max_delay)
```
or:
```python
delay = min(base_delay ** attempt, max_delay)
```
change it to a standard exponential-backoff-with-jitter pattern:
```python
delay = min(base_delay * (2 ** attempt) + jitter, max_delay)
```
This ensures:
- On the first retry (`attempt == 0`), the delay is approximately `base_delay` (plus jitter).
- Subsequent retries double the delay (capped by `max_delay`), maintaining the intended exponential backoff while making `base_delay` meaningful from the first retry.
Make sure this change is applied consistently wherever the retry delay is calculated in `_retry_http_call_client`.
</issue_to_address>
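To see the practical difference the reviewer is pointing at, the two formulas' delay schedules (jitter omitted for determinism) can be compared directly for the PR's defaults:

```python
# Compare the original and suggested backoff schedules for base_delay=2.0,
# max_delay=8.0, illustrating why `base_delay ** attempt` ignores base_delay
# on the first retry.
base_delay, max_delay = 2.0, 8.0

old = [min(base_delay ** attempt, max_delay) for attempt in range(3)]
new = [min(base_delay * (2 ** attempt), max_delay) for attempt in range(3)]

print(old)  # [1.0, 2.0, 4.0] -- first retry waits 1s no matter what base_delay is
print(new)  # [2.0, 4.0, 8.0] -- first retry waits ~base_delay, then doubles (capped)
```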
### Comment 2
<location path="tests/whatsapp/test_retry_backoff.py" line_range="125-85" />
<code_context>
+ def test_success_on_third_attempt_returns_result(self):
</code_context>
<issue_to_address>
**suggestion (testing):** Consider asserting sleep is called exactly twice on the failing attempts
Since time.sleep is patched, you can assert it’s called exactly twice in this success-on-third-attempt case (after the first two failures, but not after the third). This more directly verifies the backoff behavior when the call eventually succeeds.
Suggested implementation:
```python
with patch("urllib.request.urlopen", side_effect=_mock_urlopen):
with patch.object(_client.time, "sleep") as mock_sleep:
with self.assertRaises(urllib.error.URLError):
_client.api_request("GET", "/instance/status")
self.assertEqual(call_count, 3)
self.assertEqual(mock_sleep.call_count, 2)
```
From the provided snippet I can’t see the full body of `test_success_on_third_attempt_returns_result`. To fully implement your suggestion for that specific test, you should:
1. Locate the `with patch.object(_client.time, "sleep"):` context inside `test_success_on_third_attempt_returns_result`.
2. Change it to `with patch.object(_client.time, "sleep") as mock_sleep:`.
3. After the call to `_client.api_request(...)` and any assertions on the returned result / `call_count`, add:
```python
self.assertEqual(mock_sleep.call_count, 2)
```
This ensures you assert `time.sleep` is called exactly twice in the “success on third attempt” case (after the first two failures, but not after the third).
</issue_to_address>
Addresses the Sourcery review on PR evolution-foundation#79:
- Adds a `ValueError` if `max_attempts < 1`. Previously, with max_attempts=0 the loop never executed and `raise last_exc` broke with a TypeError because last_exc stayed None.
- Replaces `base_delay ** attempt` with `base_delay * (2 ** attempt)`. Previously, with base_delay=2.0, the first retry always had a delay of ~1s (2^0=1) regardless of the configured base_delay — the parameter was ignored on the first retry. Now it follows the classic exponential backoff pattern: first retry ≈ base_delay, doubling on each attempt.

Worst-case latency (3 attempts, all 5xx) with base_delay=2.0, max_delay=8.0: attempt 0: ~2-2.5s, attempt 1: ~4-4.5s → total ~6.5s, within the 8s budget defined in the acceptance criteria.

Existing tests in tests/whatsapp/test_retry_backoff.py pass unchanged (they assert call_count and exception type; sleep is mocked and no specific delays were asserted).
…e sleep

Polish from the Sourcery review on PR evolution-foundation#79 (code-quality suggestions):
1. Extracts `_backoff_retry_or_log_final` to eliminate the duplication between the 5xx (HTTPError) and URLError/socket.timeout branches of the retry loop. Previously, 18 identical lines (compute delay + print retry log + sleep OR print failed log) appeared twice — any tweak to the log schema or formula had to be made in both places. Now there is a single point; the variant log keys go in as an `extras` dict. Behavior is identical: existing tests pass unchanged and the emitted JSON logs match the previous version (verified via the tests' stdout).
2. In `test_success_on_third_attempt_returns_result`, captures `mock_sleep` and asserts `call_count == 2`. Previously sleep was mocked but never verified, so there was no guarantee the backoff wasn't also firing after success. Now the test verifies sleep only runs on retried failures, never after success.

No functional change — just refactoring and stricter test coverage.
… index + logs

Consolidation of the fixes requested by Sourcery in the PR evolution-foundation#78 review:

1. `except IntegrityError` restricted to the idempotency violation (dashboard/backend/routes/triggers.py)
Previously, any IntegrityError in `webhook_receiver` became a silent 200 OK — treated as an "idempotent replay". Other problems (NOT NULL, broken FK, a unique constraint added later) would be masked as success and the client would never know the event wasn't processed. Now the catch only absorbs the error if:
a) `idem_key` is set (without an idempotency key there is no way it is a uq_trigger_idem violation)
b) the driver's error message mentions `uq_trigger_idem` or `idempotency_key` (the specific constraint)
For all other IntegrityErrors, it logs with full context (trigger_id, key, original message) and re-raises — the webhook returns 500 and the error stays visible in logs instead of becoming a silent 200.

2. Removes the redundant index (models.py)
`TriggerExecution.idempotency_key` was creating 3 indexes on the same column:
- `ix_trigger_executions_idempotency_key` (auto-generated via index=True on the model)
- `ix_trigger_executions_idem_key` (raw SQL at startup)
- `uq_trigger_idem` (partial unique, raw SQL)
We keep only the last two — explicitly created by the startup migration, with versioned names in the rollback plan. `index=True` removed from the model definition.

3. Logs CREATE INDEX failures instead of swallowing them (app.py)
Replaces `except Exception: pass` with a structured log around both `CREATE INDEX` statements (basic and partial unique). Previously, any failure (e.g. a SQLite version without partial index support) was silently swallowed — the operator would only find out at the first race. Now it logs `sqlite_lib`, `sqlite_runtime`, and the original exception.

Syntax validated with `python3 -m py_compile`. No dedicated test for this path on this branch (WPP tests only land in PRs evolution-foundation#79/evolution-foundation#80).
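The narrowed catch can be sketched with plain sqlite3. The helper name `is_idempotency_violation` is illustrative (the commit describes the conditions, not a function); the string check relies on SQLite naming the violated columns/constraint in its error message:

```python
import sqlite3


def is_idempotency_violation(exc, idem_key):
    # Absorb only a uq_trigger_idem / idempotency_key violation, and only
    # when the request actually carried a key; anything else should re-raise.
    if idem_key is None:
        return False
    msg = str(exc).lower()
    return "uq_trigger_idem" in msg or "idempotency_key" in msg


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigger_executions (trigger_id INT, idempotency_key TEXT)")
conn.execute("""CREATE UNIQUE INDEX uq_trigger_idem
    ON trigger_executions (trigger_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL""")
conn.execute("INSERT INTO trigger_executions VALUES (1, 'msg-1')")

# Duplicate key: message reads "UNIQUE constraint failed: ...idempotency_key"
try:
    conn.execute("INSERT INTO trigger_executions VALUES (1, 'msg-1')")
except sqlite3.IntegrityError as exc:
    replay = is_idempotency_violation(exc, "msg-1")

# Unrelated integrity error (NOT NULL) must NOT be absorbed as a replay.
conn.execute("CREATE TABLE other (a INT NOT NULL)")
try:
    conn.execute("INSERT INTO other VALUES (NULL)")
except sqlite3.IntegrityError as exc:
    masked = is_idempotency_violation(exc, "msg-1")

print(replay, masked)  # True False
```

In the real route the `False` branch would log the full context and re-raise so the webhook returns 500.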
… refactor

Consolidation of the fixes requested by Sourcery in the PR evolution-foundation#79 review:

1. Guard `max_attempts >= 1` (`.claude/skills/int-evolution-go/scripts/evolution_go_client.py`)
Previously, with `max_attempts=0` the loop never executed and `raise last_exc` broke with a TypeError because last_exc stayed None. Added `ValueError("max_attempts must be >= 1")` at the top to fail explicitly.

2. Fixes the backoff formula
Before: `min(base_delay ** attempt + jitter, max_delay)`
With `base_delay=2.0`, the first retry always had a delay of ~1s (2^0=1) regardless of the configured `base_delay` — the parameter was ignored on the first retry.
Now: `min(base_delay * (2 ** attempt) + jitter, max_delay)`
This follows the classic exponential backoff pattern: first retry ≈ base_delay, doubling on each attempt.
Worst-case latency (3 attempts, all 5xx) with base_delay=2.0, max_delay=8.0: attempt 0 ~2-2.5s, attempt 1 ~4-4.5s → total ~6.5s, within the 8s budget defined in the acceptance criteria.

3. Refactor — extracts `_backoff_retry_or_log_final`
The 5xx (HTTPError) and URLError/socket.timeout branches had 18 identical lines (compute delay + print retry log + sleep OR print failed log). Extracted into a single helper; the variant log keys (`http_status` vs `error`) go in as an `extras` dict. Identical behavior — the JSON logs match the previous version.

4. Strengthens asserts in `test_success_on_third_attempt_returns_result` (tests/whatsapp/test_retry_backoff.py)
Captures `mock_sleep` and asserts `call_count == 2`. Previously, sleep was mocked but never verified — there was no guarantee the backoff wasn't also being called after success. Now it guarantees sleep only runs on retried failures, never after success.

All 4 tests in tests/whatsapp/test_retry_backoff.py pass.
Force-pushed from 829aea6 to b93aa65
Context
Second of a 3-PR series implementing the WhatsApp retry pattern. Builds on #78 (PR-1: idempotency_key migration + silent dedup) by adding resilient retry logic to the Evolution Go API client.
Problem
`api_request()` in `evolution_go_client.py` issues a single HTTP call and calls `sys.exit(1)` on any error.

Solution
New `_retry_http_call_client(do_call, max_attempts=3, base_delay=2.0, max_delay=8.0)` helper:
- Retries on HTTP 5xx, `urllib.error.URLError`, `socket.timeout` (transient categories)
- Delay `min(2^attempt + uniform(0, 0.5), max_delay)` per attempt
- Structured JSON logs (`evt: api_request_retry` / `api_request_failed`)
- Returns `do_call()` on success; raises the last exception after `max_attempts`

`api_request()` refactored to wrap the actual HTTP call in `_do_call` and delegate to the retry helper. Removes inline `sys.exit(1)` so library callers can handle errors. `main()` wraps handler invocation in try/except to preserve the exact CLI behavior (exit 1 on failure) while letting library users propagate.

Tests

`tests/whatsapp/test_retry_backoff.py` (153 lines, 4 tests). All 4 pass.
Worst-case latency
3 attempts, all 5xx:
`sleep(1.0 + j) + sleep(2.0 + j)` with j ∈ [0, 0.5) ≈ 3.0–4.0s total. Each attempt's delay is capped at `max_delay=8s`. Acceptable for an API call retry on a user-initiated command.
- `python tests/whatsapp/test_retry_backoff.py` → 4/4 pass
- `ast.parse(evolution_go_client.py)` → OK
- `main()` exits 1 on HTTPError/URLError as before

Series
Summary by Sourcery
Introduce resilient, retryable Evolution Go API client behavior with exponential backoff, while wiring idempotent trigger execution support and database schema needed for WhatsApp-style retries.
New Features:
Bug Fixes:
Enhancements:
Tests: