Skip to content

Switch outbound HTTP to urllib3; expose per-stage timing#58

Merged
prim-8 merged 10 commits into
mainfrom
urllib3-http-client
Apr 26, 2026
Merged

Switch outbound HTTP to urllib3; expose per-stage timing#58
prim-8 merged 10 commits into
mainfrom
urllib3-http-client

Conversation

@prim-8
Copy link
Copy Markdown
Contributor

@prim-8 prim-8 commented Apr 26, 2026

Summary

Replaces the milter's outbound HTTP layer (urllib.request) with a shared
urllib3.PoolManager, and uses the swap to add per-stage timing
histograms — connect / TTFB / body — so operators can see where outbound
webhook and storage-upload latency is actually spent.

Why

urllib.request is a black box: a single stopwatch around urlopen() is
all you can measure, which makes it impossible to tell whether time is
being spent in TLS handshakes, in remote processing, or in body transit.
This is the only HTTP path the milter takes for the webhook and the
HTTP-mode storage upload, so the visibility gap matters disproportionately.

What changes

  • The webhook call and the non-S3 storage upload now go through one
    module-level urllib3.PoolManager. Previously each call opened a fresh
    TCP+TLS connection; the pool reuses sockets across calls.
  • New Prometheus histograms, labelled by (host, scheme):
    • milter_outbound_connect_seconds — TCP + TLS handshake time.
      Recorded only when urllib3 opens a fresh socket; reused pooled
      connections never enter connect() so the count of this metric is
      also the pool-miss rate.
    • milter_outbound_ttfb_seconds — request start → first response byte.
    • milter_outbound_body_read_seconds — first byte → response complete.
    • Plus counters: milter_outbound_new_connections_total,
      milter_outbound_requests_total.
  • Per-stage timing is implemented by subclassing
    urllib3.connection.HTTPSConnection / HTTPConnection to time
    connect() in place, and by issuing requests with
    preload_content=False at the call site so the headers and body
    phases can be observed separately.
  • Lifecycle: _HTTP.clear() and a reset of the cached boto3 S3 client
    are wired into the existing SIGHUP reload path so a webhook_url or
    storage_url change drops idle connections instead of leaving them
    pinned to the old endpoint.
  • storage_upload_threshold is now part of the reloadable config (it
    was a startup-only global, which was inconsistent with storage_url
    being reloadable).
  • Pool size is configurable via the outbound_http_pool_maxsize
    config-file key or the OUTBOUND_HTTP_POOL_MAXSIZE env var (default
    50, config file wins over env var, matching the rest of the milter's
    resolution pattern). Invalid values fall back to the default rather
    than failing startup, and the resolved value + its source are logged
    at startup.

Behaviour preserved

  • Per-call auth: the bearer secret and any extra webhook headers are
    built fresh per request — pooled connections carry no auth state.
  • Error handling: non-2xx responses are still surfaced through the
    existing _interpret_webhook_response path; network errors are still
    classified into timeout / connection_refused / connection_reset
    / network_error.
  • Single-attempt application semantics: urllib3.Retry is configured
    with read=False, status_forcelist=(), and total=1, connect=1.
    The one retry that exists fires only on connect-level errors — the
    failure mode where the server has closed our pooled keep-alive
    between calls. This preserves the original "retry would only happen
    on a fresh connection anyway" behaviour of urllib.request, while
    preventing the spurious-network-error class that would otherwise be
    visible from pooling.
  • Timeouts: the same numeric budget passed to urllib.request.urlopen
    is now passed to urllib3.Timeout(connect=N, read=N), so the
    per-call budget is identical.
  • SIGHUP: continues to swap reloadable config atomically; the new step
    is dropping the cached HTTP/S3 clients in the same handler.

New dependency

urllib3>=2.0,<3 is added explicitly in Dockerfile. It is already
pulled in transitively as a dependency of boto3, so the runtime
footprint is unchanged.

Testing

  • Module compiles and imports cleanly.
  • The custom HTTPSConnection / HTTPConnection subclasses register
    correctly via PoolManager.pool_classes_by_scheme — confirmed by a
    live HTTPS round-trip in a Python REPL using the same wiring.
  • New histograms appear in the milter's existing :9901/metrics
    endpoint with the documented labels.
  • Existing milter_webhook_duration_seconds /
    milter_storage_upload_duration_seconds /
    milter_*_calls_total / milter_errors_total continue to be emitted
    with unchanged label sets.
  • OUTBOUND_HTTP_POOL_MAXSIZE parsing covers integer / non-integer /
    zero / negative / empty / unset cases — invalid inputs degrade to
    the documented default.

Out of scope

  • connect / read timeouts are still passed the same value to
    preserve baseline behaviour. Splitting them (e.g. a short connect
    deadline + a longer read deadline) would be a follow-up.

prim-8 added 5 commits April 24, 2026 18:48
Switch the storage upload and webhook calls from urllib.request to a
shared urllib3.PoolManager. Behaviour is preserved (no retries, same
timeout values applied to both connect and read), with two incidental
benefits:

- Connect and read timeouts can be configured separately if desired
  (urllib3.Timeout(connect=..., read=...)). The current call sites pass
  the same value to both, matching prior behaviour.
- The PoolManager amortises socket setup across requests, instead of
  opening a fresh connection for every call.

Non-2xx responses no longer raise; they are handled by checking
response.status. Exception handling is consolidated to urllib3's
ConnectTimeoutError, ReadTimeoutError, NewConnectionError,
ProtocolError, SSLError, and HTTPError, with the same network-error
classification (timeout / connection_refused / connection_reset /
network_error) the previous code produced.
Two fixes for resource lifecycle around the new shared PoolManager:

1. Drop pooled connections on SIGHUP. Hot-reloading the config can change
   webhook_url or storage_url, but pooled connections to the old endpoint
   would keep their DNS pinning until they aged out. Calling
   PoolManager.clear() in the reload handler restores the property that
   each call starts from a clean slate after a config change.

2. Retry once on connection-level errors. A pooled connection that the
   server has closed surfaces as an error on the next reuse; without a
   retry the milter would see a spurious network failure where the prior
   urllib.request implementation would have transparently re-resolved DNS
   for a fresh socket. The retry is constrained to connect errors only
   (read=False, status_forcelist=(), other=0) so application-level
   responses still flow through unchanged: the milter remains responsible
   for translating non-2xx into REJECT vs TEMPFAIL on a single attempt.
Three related fixes in the same outbound-HTTP-lifecycle area:

1. Reset the cached boto3 S3 client on SIGHUP. Pre-existing parity bug
   with the new urllib3 pool: a storage_url change going from HTTP to
   S3 (or back) would leave one cache stale and the other warm. The
   reload now drops both via _reset_outbound_http_clients(), and the
   docstring on _get_s3_client() documents the lifecycle.

2. Move storage_upload_threshold into ReloadableConfig. It was the
   only "reloadable in spirit" value still wired as a startup-only
   global, so changing it via the config file silently had no effect
   until restart. Now read off the per-message snapshot like the rest
   of the storage knobs.

3. Bump PoolManager maxsize from 10 to 50 with explanatory comment.
   The earlier value left handshake savings on the table at the load
   level the milter actually sees; 50 keeps most concurrent calls on
   warm connections while preserving block=False (bursts past the
   pool spawn transient connections rather than blocking, since SMTP
   timeouts are worse than redundant TCP for the milter's role).
Splits the previously-opaque webhook latency measurement into three
stages so we can localise where time goes under concurrency:

- milter_outbound_connect_seconds — TCP+TLS handshake. Recorded only
  when urllib3 opens a fresh socket (i.e. pool miss). Reused pooled
  connections skip connect() entirely, so this histogram represents
  exactly the cost the milter pays for cold-pool calls.

- milter_outbound_ttfb_seconds — request start to first response byte.
  Includes connect on cold calls; on warm calls it is request transit
  + remote processing. Comparing against remote-reported processing
  time isolates transit and middlebox queuing from real server work.

- milter_outbound_body_read_seconds — response body read. Sub-ms for
  small JSON in steady state; growth points at egress problems.

Plus two counters:
- milter_outbound_new_connections_total — fresh-socket count, for
  deriving pool reuse rate
- milter_outbound_requests_total — request count regardless of reuse

The connect histogram is recorded by subclassing urllib3's connection
class (HTTPSConnection / HTTPConnection) so the timing happens in the
exact place urllib3 opens a socket. The pool manager is wired to use
the timing-aware classes via pool_classes_by_scheme. Both call sites
(webhook + non-S3 storage upload) issue requests with
preload_content=False so the headers and body phases can be observed
separately at the call site.

Labels are (host, scheme), keyed off the parsed URL — bounded
cardinality on whatever webhook/storage targets are configured, with
port stripped so accidental port variance doesn't multiply series.
OUTBOUND_HTTP_POOL_MAXSIZE env var controls the per-host warm-connection
cap on the shared PoolManager. Defaults to 50, which is comfortable for
most deployments; operators with high concurrent outbound HTTP fan-out
can raise it without rebuilding the image.

Invalid values (non-integer, < 1) log a warning and fall back to the
default rather than failing startup.
@blacksmith-sh

This comment has been minimized.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

This PR replaces urllib.request with a shared urllib3.PoolManager and introduces per-stage outbound HTTP timing histograms (connect, TTFB, body). The previously-flagged SIGHUP safety gap for storage_upload_threshold has been addressed: int() is now guarded by _coerce_positive_int, consistent with how outbound_http_pool_maxsize is handled. The timing-aware connection subclasses, MaxRetryError unwrapping, preload_content=False split, and SIGHUP reset path are all implemented correctly.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 concerns have been resolved and no new issues were found.

The storage_upload_threshold SIGHUP safety issue from the prior review is fixed via _coerce_positive_int. The urllib3 migration is structurally sound: MaxRetryError is caught via HTTPError inheritance and correctly unwrapped, the preload_content=False timing split is guarded by finally blocks to prevent metric loss on body-read failures, and per-call auth is never cached in the pool. No P0/P1 findings remain.

No files require special attention.

Important Files Changed

Filename Overview
milter/primitivemail_milter.py Core change: replaces urllib.request with a shared urllib3.PoolManager backed by timing-aware connection subclasses; adds TTFB/connect/body histograms; promotes storage_upload_threshold to reloadable config with safe coercion; MaxRetryError correctly unwrapped for timeout classification; 2xx/non-2xx handling unified cleanly.
test_milter.py Test harness updated: MockResponse reshaped to the urllib3 HTTPResponse interface; patch_http() helper added to adapt old capture callbacks; new tests cover MaxRetryError timeout unwrapping, mid-body failure metric recording, and non-UTF-8 error bodies.
Dockerfile Pins urllib3>=2.0,<3 explicitly; previously only a transitive dep of boto3 — explicit pin matches the PR's stated intent and prevents a future boto3 upgrade from silently pulling in urllib3 v3.

Sequence Diagram

sequenceDiagram
    participant M as PrimitiveMailMilter
    participant HC as _call_webhook_for_recipient / _upload_to_http
    participant PM as _HTTP (urllib3.PoolManager)
    participant TC as _TimingHTTPS/HTTPConnection
    participant R as Remote Endpoint

    M->>HC: eom() triggers webhook/storage path

    HC->>PM: request(POST, url, preload_content=False)
    alt pool miss (fresh socket)
        PM->>TC: connect()
        Note over TC: t0 = time.monotonic()
        TC->>R: TCP + TLS handshake
        Note over TC: record HTTP_CONNECT_DURATION, increment HTTP_CONNECTIONS_NEW
    else pool hit (reused socket)
        PM-->>R: (reuse existing socket)
    end

    Note over HC: t_send = time.monotonic()
    PM->>R: send request headers + body
    R-->>PM: response headers
    PM-->>HC: HTTPResponse (headers only)
    Note over HC: t_headers = time.monotonic()

    HC->>HC: try: response.data
    R-->>HC: response body bytes
    Note over HC: t_done = time.monotonic()
    HC->>PM: release_conn()
    Note over HC: [finally] record TTFB, body, requests_total

    alt 2xx
        HC->>M: success result
    else non-2xx
        HC->>M: failure result
    else urllib3 network exception
        HC->>M: network_error result
    end

    Note over M: SIGHUP: _HTTP.clear() + _s3_client = None
Loading

Reviews (5): Last reviewed commit: "Tolerate non-UTF-8 webhook response bodi..." | Re-trigger Greptile

Comment thread milter/primitivemail_milter.py Outdated
prim-8 and others added 5 commits April 25, 2026 21:49
Threshold parsing moved from a startup-only path to the SIGHUP-reload
path in this branch, but kept a bare int() cast. A bad value in the
config file would now propagate ValueError out of _build_reloadable_config
and into the signal handler, terminating the milter — a regression
relative to the pre-PR behaviour where the same input was harmless.

Generalise the existing pool-maxsize coercion into _coerce_positive_int
and use it for storage_upload_threshold (default 3_000_000), matching
the resilience the other reloadable values have.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous tests patched urllib.request.urlopen, but this branch
moved the outbound HTTP path to a shared urllib3 PoolManager
(pm._HTTP). Every webhook test in the suite was therefore making
real DNS lookups for test.example.com and failing.

- MockResponse now mirrors urllib3.HTTPResponse with preload_content=False
  (.status, .data, .release_conn).
- New patch_http(...) helper patches pm._HTTP.request, with a small
  adapter so existing capture(request, timeout=...) callbacks keep
  working without per-call edits.
- Connection-error paths use urllib3.exceptions.ProtocolError (caught by
  the milter's existing exception handler).
- Non-2xx HTTP status tests now return MockResponse(code, body) instead
  of raising urllib.error.HTTPError, matching urllib3's
  raise_on_status=False behaviour.
- TestStorageUpload / TestSighupReload save+restore reaches into
  _rcfg.storage_upload_threshold instead of the removed module global.

Full pytest suite passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the configured one-shot retry also times out, urllib3 raises
MaxRetryError(reason=ConnectTimeoutError). The previous classification
checked isinstance against ConnectTimeoutError / ReadTimeoutError on the
caught exception directly, missing the wrapper case and labelling the
structured log field as error_type=network_error instead of timeout.
That field was a 1:1 carry-over from the urllib.request baseline this PR
replaced, so the regression silently widened the 'network_error' bucket.

Unwrap MaxRetryError.reason before classifying. Adds a regression test
asserting error_type=timeout in the log when MaxRetryError wraps a
ConnectTimeoutError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two observability fixes pointed out in review of the urllib3 swap:

1. Mid-body read failures dropped HTTP-stage metrics. With
   preload_content=False, the body read happens on response.data; if it
   raises (read timeout, RST), the exception propagated past the
   record_metrics call below the try/finally — so HTTP_BODY_READ_DURATION /
   HTTP_TTFB_DURATION / HTTP_REQUESTS_TOTAL silently undercounted exactly
   the tail those histograms exist to measure. Move the record_metrics
   call inside the finally so observed timings are captured even when
   the read fails. Same shape applied to the storage upload path.

2. storage_upload_threshold=0 silently fell back to 3MB. _coerce_positive_int
   defaulted to minimum=1, but 0 is a legitimate pre-PR config value
   meaning 'always upload to storage'. Pass minimum=0 for the threshold
   so that semantics is preserved; -1 / non-integer still fall back to
   the default safely.

Adds regression tests:
- test_body_read_failure_still_records_http_metrics — patches
  record_metrics to capture lambda source lines, asserts the HTTP-stage
  block near the body-read site fires even when response.data raises.
- test_storage_upload_threshold_zero_preserved — round-trips 0 through
  _apply_config and asserts it stays 0 (not the 3MB default).
- test_storage_upload_threshold_negative_falls_back / _garbage_falls_back
  — pin the safe-fallback behaviour the SIGHUP path depends on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replacing urllib.request with urllib3 unified the success and error
decode paths through response.data.decode('utf-8'). The old
urllib.error.HTTPError branch had absorbed bad-decode cases, so a
non-UTF-8 error body (e.g. text/html with a wrong charset declared)
used to flow through the normal HTTP-status fallback. Without that
guard, a UnicodeDecodeError now escapes the urllib3 exception block
and lands on the catchall, surfacing as WEBHOOK_UNEXPECTED_ERROR
instead of WEBHOOK_HTTP_ERROR — same TEMPFAIL outcome, but a noisier
and less-classifiable log signature for ops.

Decode with errors='replace' so the body still reaches
_interpret_webhook_response. JSON parsing inside that helper already
tolerates malformed bodies (silently falls through to status code).

Adds a regression test asserting the structured log signature, since
the status-code outcome alone wouldn't catch this regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prim-8 prim-8 merged commit 7c2b7b8 into main Apr 26, 2026
8 checks passed
@prim-8 prim-8 deleted the urllib3-http-client branch April 26, 2026 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant