Switch outbound HTTP to urllib3; expose per-stage timing#58
Conversation
Switch the storage upload and webhook calls from urllib.request to a shared urllib3.PoolManager. Behaviour is preserved (no retries, same timeout values applied to both connect and read), with two incidental benefits: - Connect and read timeouts can be configured separately if desired (urllib3.Timeout(connect=..., read=...)). The current call sites pass the same value to both, matching prior behaviour. - The PoolManager amortises socket setup across requests, instead of opening a fresh connection for every call. Non-2xx responses no longer raise; they are handled by checking response.status. Exception handling is consolidated to urllib3's ConnectTimeoutError, ReadTimeoutError, NewConnectionError, ProtocolError, SSLError, and HTTPError, with the same network-error classification (timeout / connection_refused / connection_reset / network_error) the previous code produced.
Two fixes for resource lifecycle around the new shared PoolManager: 1. Drop pooled connections on SIGHUP. Hot-reloading the config can change webhook_url or storage_url, but pooled connections to the old endpoint would keep their DNS pinning until they aged out. Calling PoolManager.clear() in the reload handler restores the property that each call starts from a clean slate after a config change. 2. Retry once on connection-level errors. A pooled connection that the server has closed surfaces as an error on the next reuse; without a retry the milter would see a spurious network failure where the prior urllib.request implementation would have transparently re-resolved DNS for a fresh socket. The retry is constrained to connect errors only (read=False, status_forcelist=(), other=0) so application-level responses still flow through unchanged: the milter remains responsible for translating non-2xx into REJECT vs TEMPFAIL on a single attempt.
Three related fixes in the same outbound-HTTP-lifecycle area: 1. Reset the cached boto3 S3 client on SIGHUP. Pre-existing parity bug with the new urllib3 pool: a storage_url change going from HTTP to S3 (or back) would leave one cache stale and the other warm. The reload now drops both via _reset_outbound_http_clients(), and the docstring on _get_s3_client() documents the lifecycle. 2. Move storage_upload_threshold into ReloadableConfig. It was the only "reloadable in spirit" value still wired as a startup-only global, so changing it via the config file silently had no effect until restart. Now read off the per-message snapshot like the rest of the storage knobs. 3. Bump PoolManager maxsize from 10 to 50 with explanatory comment. The earlier value left handshake savings on the table at the load level the milter actually sees; 50 keeps most concurrent calls on warm connections while preserving block=False (bursts past the pool spawn transient connections rather than blocking, since SMTP timeouts are worse than redundant TCP for the milter's role).
Splits the previously-opaque webhook latency measurement into three stages so we can localise where time goes under concurrency: - milter_outbound_connect_seconds — TCP+TLS handshake. Recorded only when urllib3 opens a fresh socket (i.e. pool miss). Reused pooled connections skip connect() entirely, so this histogram represents exactly the cost the milter pays for cold-pool calls. - milter_outbound_ttfb_seconds — request start to first response byte. Includes connect on cold calls; on warm calls it is request transit + remote processing. Comparing against remote-reported processing time isolates transit and middlebox queuing from real server work. - milter_outbound_body_read_seconds — response body read. Sub-ms for small JSON in steady state; growth points at egress problems. Plus two counters: - milter_outbound_new_connections_total — fresh-socket count, for deriving pool reuse rate - milter_outbound_requests_total — request count regardless of reuse The connect histogram is recorded by subclassing urllib3's connection class (HTTPSConnection / HTTPConnection) so the timing happens in the exact place urllib3 opens a socket. The pool manager is wired to use the timing-aware classes via pool_classes_by_scheme. Both call sites (webhook + non-S3 storage upload) issue requests with preload_content=False so the headers and body phases can be observed separately at the call site. Labels are (host, scheme), keyed off the parsed URL — bounded cardinality on whatever webhook/storage targets are configured, with port stripped so accidental port variance doesn't multiply series.
OUTBOUND_HTTP_POOL_MAXSIZE env var controls the per-host warm-connection cap on the shared PoolManager. Defaults to 50, which is comfortable for most deployments; operators with high concurrent outbound HTTP fan-out can raise it without rebuilding the image. Invalid values (non-integer, < 1) log a warning and fall back to the default rather than failing startup.
This comment has been minimized.
This comment has been minimized.
Greptile SummaryThis PR replaces Confidence Score: 5/5Safe to merge — all previously flagged P1 concerns have been resolved and no new issues were found. The storage_upload_threshold SIGHUP safety issue from the prior review is fixed via _coerce_positive_int. The urllib3 migration is structurally sound: MaxRetryError is caught via HTTPError inheritance and correctly unwrapped, the preload_content=False timing split is guarded by finally blocks to prevent metric loss on body-read failures, and per-call auth is never cached in the pool. No P0/P1 findings remain. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant M as PrimitiveMailMilter
participant HC as _call_webhook_for_recipient / _upload_to_http
participant PM as _HTTP (urllib3.PoolManager)
participant TC as _TimingHTTPS/HTTPConnection
participant R as Remote Endpoint
M->>HC: eom() triggers webhook/storage path
HC->>PM: request(POST, url, preload_content=False)
alt pool miss (fresh socket)
PM->>TC: connect()
Note over TC: t0 = time.monotonic()
TC->>R: TCP + TLS handshake
Note over TC: record HTTP_CONNECT_DURATION, increment HTTP_CONNECTIONS_NEW
else pool hit (reused socket)
PM-->>R: (reuse existing socket)
end
Note over HC: t_send = time.monotonic()
PM->>R: send request headers + body
R-->>PM: response headers
PM-->>HC: HTTPResponse (headers only)
Note over HC: t_headers = time.monotonic()
HC->>HC: try: response.data
R-->>HC: response body bytes
Note over HC: t_done = time.monotonic()
HC->>PM: release_conn()
Note over HC: [finally] record TTFB, body, requests_total
alt 2xx
HC->>M: success result
else non-2xx
HC->>M: failure result
else urllib3 network exception
HC->>M: network_error result
end
Note over M: SIGHUP: _HTTP.clear() + _s3_client = None
Reviews (5): Last reviewed commit: "Tolerate non-UTF-8 webhook response bodi..." | Re-trigger Greptile |
Threshold parsing moved from a startup-only path to the SIGHUP-reload path in this branch, but kept a bare int() cast. A bad value in the config file would now propagate ValueError out of _build_reloadable_config and into the signal handler, terminating the milter — a regression relative to the pre-PR behaviour where the same input was harmless. Generalise the existing pool-maxsize coercion into _coerce_positive_int and use it for storage_upload_threshold (default 3_000_000), matching the resilience the other reloadable values have. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous tests patched urllib.request.urlopen, but this branch moved the outbound HTTP path to a shared urllib3 PoolManager (pm._HTTP). Every webhook test in the suite was therefore making real DNS lookups for test.example.com and failing. - MockResponse now mirrors urllib3.HTTPResponse with preload_content=False (.status, .data, .release_conn). - New patch_http(...) helper patches pm._HTTP.request, with a small adapter so existing capture(request, timeout=...) callbacks keep working without per-call edits. - Connection-error paths use urllib3.exceptions.ProtocolError (caught by the milter's existing exception handler). - Non-2xx HTTP status tests now return MockResponse(code, body) instead of raising urllib.error.HTTPError, matching urllib3's raise_on_status=False behaviour. - TestStorageUpload / TestSighupReload save+restore reaches into _rcfg.storage_upload_threshold instead of the removed module global. Full pytest suite passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the configured one-shot retry also times out, urllib3 raises MaxRetryError(reason=ConnectTimeoutError). The previous classification checked isinstance against ConnectTimeoutError / ReadTimeoutError on the caught exception directly, missing the wrapper case and labelling the structured log field as error_type=network_error instead of timeout. That field was a 1:1 carry-over from the urllib.request baseline this PR replaced, so the regression silently widened the 'network_error' bucket. Unwrap MaxRetryError.reason before classifying. Adds a regression test asserting error_type=timeout in the log when MaxRetryError wraps a ConnectTimeoutError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two observability fixes pointed out in review of the urllib3 swap: 1. Mid-body read failures dropped HTTP-stage metrics. With preload_content=False, the body read happens on response.data; if it raises (read timeout, RST), the exception propagated past the record_metrics call below the try/finally — so HTTP_BODY_READ_DURATION / HTTP_TTFB_DURATION / HTTP_REQUESTS_TOTAL silently undercounted exactly the tail those histograms exist to measure. Move the record_metrics call inside the finally so observed timings are captured even when the read fails. Same shape applied to the storage upload path. 2. storage_upload_threshold=0 silently fell back to 3MB. _coerce_positive_int defaulted to minimum=1, but 0 is a legitimate pre-PR config value meaning 'always upload to storage'. Pass minimum=0 for the threshold so that semantics is preserved; -1 / non-integer still fall back to the default safely. Adds regression tests: - test_body_read_failure_still_records_http_metrics — patches record_metrics to capture lambda source lines, asserts the HTTP-stage block near the body-read site fires even when response.data raises. - test_storage_upload_threshold_zero_preserved — round-trips 0 through _apply_config and asserts it stays 0 (not the 3MB default). - test_storage_upload_threshold_negative_falls_back / _garbage_falls_back — pin the safe-fallback behaviour the SIGHUP path depends on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replacing urllib.request with urllib3 unified the success and error
decode paths through response.data.decode('utf-8'). The old
urllib.error.HTTPError branch had absorbed bad-decode cases, so a
non-UTF-8 error body (e.g. text/html with a wrong charset declared)
used to flow through the normal HTTP-status fallback. Without that
guard, a UnicodeDecodeError now escapes the urllib3 exception block
and lands on the catchall, surfacing as WEBHOOK_UNEXPECTED_ERROR
instead of WEBHOOK_HTTP_ERROR — same TEMPFAIL outcome, but a noisier
and less-classifiable log signature for ops.
Decode with errors='replace' so the body still reaches
_interpret_webhook_response. JSON parsing inside that helper already
tolerates malformed bodies (silently falls through to status code).
Adds a regression test asserting the structured log signature, since
the status-code outcome alone wouldn't catch this regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the milter's outbound HTTP layer (
urllib.request) with a sharedurllib3.PoolManager, and uses the swap to add per-stage timinghistograms — connect / TTFB / body — so operators can see where outbound
webhook and storage-upload latency is actually spent.
Why
urllib.requestis a black box: a single stopwatch aroundurlopen()isall you can measure, which makes it impossible to tell whether time is
being spent in TLS handshakes, in remote processing, or in body transit.
This is the only HTTP path the milter takes for the webhook and the
HTTP-mode storage upload, so the visibility gap matters disproportionately.
What changes
module-level
urllib3.PoolManager. Previously each call opened a freshTCP+TLS connection; the pool reuses sockets across calls.
(host, scheme):milter_outbound_connect_seconds— TCP + TLS handshake time.Recorded only when urllib3 opens a fresh socket; reused pooled
connections never enter
connect()so the count of this metric isalso the pool-miss rate.
milter_outbound_ttfb_seconds— request start → first response byte.milter_outbound_body_read_seconds— first byte → response complete.milter_outbound_new_connections_total,milter_outbound_requests_total.urllib3.connection.HTTPSConnection/HTTPConnectionto timeconnect()in place, and by issuing requests withpreload_content=Falseat the call site so the headers and bodyphases can be observed separately.
_HTTP.clear()and a reset of the cachedboto3S3 clientare wired into the existing
SIGHUPreload path so awebhook_urlorstorage_urlchange drops idle connections instead of leaving thempinned to the old endpoint.
storage_upload_thresholdis now part of the reloadable config (itwas a startup-only global, which was inconsistent with
storage_urlbeing reloadable).
outbound_http_pool_maxsizeconfig-file key or the
OUTBOUND_HTTP_POOL_MAXSIZEenv var (default50, config file wins over env var, matching the rest of the milter's
resolution pattern). Invalid values fall back to the default rather
than failing startup, and the resolved value + its source are logged
at startup.
Behaviour preserved
built fresh per request — pooled connections carry no auth state.
existing
_interpret_webhook_responsepath; network errors are stillclassified into
timeout/connection_refused/connection_reset/
network_error.urllib3.Retryis configuredwith
read=False,status_forcelist=(), andtotal=1, connect=1.The one retry that exists fires only on connect-level errors — the
failure mode where the server has closed our pooled keep-alive
between calls. This preserves the original "retry would only happen
on a fresh connection anyway" behaviour of
urllib.request, whilepreventing the spurious-network-error class that would otherwise be
visible from pooling.
urllib.request.urlopenis now passed to
urllib3.Timeout(connect=N, read=N), so theper-call budget is identical.
is dropping the cached HTTP/S3 clients in the same handler.
New dependency
urllib3>=2.0,<3is added explicitly inDockerfile. It is alreadypulled in transitively as a dependency of
boto3, so the runtimefootprint is unchanged.
Testing
HTTPSConnection/HTTPConnectionsubclasses registercorrectly via
PoolManager.pool_classes_by_scheme— confirmed by alive HTTPS round-trip in a Python REPL using the same wiring.
:9901/metricsendpoint with the documented labels.
milter_webhook_duration_seconds/milter_storage_upload_duration_seconds/milter_*_calls_total/milter_errors_totalcontinue to be emittedwith unchanged label sets.
OUTBOUND_HTTP_POOL_MAXSIZEparsing covers integer / non-integer /zero / negative / empty / unset cases — invalid inputs degrade to
the documented default.
Out of scope
connect/readtimeouts are still passed the same value topreserve baseline behaviour. Splitting them (e.g. a short connect
deadline + a longer read deadline) would be a follow-up.