udp: batch macOS receive with recvmsg_x #89
The Linux recv path drains up to 32 datagrams per kernel wakeup via recvmmsg(2), amortising the per-syscall + per-task-wakeup cost across the burst (commit 253ddda). macOS still fell through to single-packet recv_from, so the same overhead capped inbound rate on Apple builds.

Add an equivalent batch path for Darwin using recvmsg_x(2). It is an xnu-private syscall (not in the public SDK) but is the canonical amortisation primitive on macOS, the same shape used by quinn-udp for the same reason. The ABI is the public msghdr layout plus a trailing msg_datalen (per-datagram bytes-received output), declared via `unsafe extern "C"` against a local repr(C) `msghdr_x`.

Same `(count, kernel_drops)` contract as the Linux `recv_batch`. macOS has no SO_RXQ_OVFL equivalent, so `kernel_drops` is always 0: the 1Hz `sample_transport_congestion()` detector simply sees no kernel drop signal on Apple hosts (it already tolerates that, since the field has been 0 there pre-batching too).

The cmsg buffer is intentionally null: we never consume ancillary data on this path, and quinn-udp documents that `recvmsg_x` does not overwrite `msg_controllen` on macOS 10.15+ (zeroed init is the only safe state).

The udp_receive_loop dispatch widens from `cfg(linux)` to `cfg(any(linux, macos))`; the per-packet recv_from path is now used only on the remaining unix targets (BSDs etc.) and Windows.

Adds `test_burst_recv_batch`, which exercises 10 in-flight datagrams to verify per-datagram boundaries and arrival order across the batch. All 25 UDP tests pass on aarch64-apple-darwin.
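To make the ABI description above concrete, here is a minimal sketch of what a local `msghdr_x` mirror and the `recvmsg_x` declaration could look like. It is not the code from this PR: the names `IoVec`, `MsghdrX`, `zeroed_headers`, and `recv_burst` are hypothetical, and the field order and function signature are assumptions based on the xnu header layout described in the commit message (public `msghdr` fields followed by `msg_datalen`).

```rust
use std::ffi::{c_int, c_uint, c_void};
use std::mem;

/// Local mirror of the C `struct iovec` (buffer pointer + length).
#[repr(C)]
pub struct IoVec {
    pub iov_base: *mut c_void,
    pub iov_len: usize,
}

/// Assumed layout of xnu's `struct msghdr_x`: the public `msghdr` fields
/// followed by `msg_datalen`, the per-datagram bytes-received output.
#[repr(C)]
pub struct MsghdrX {
    pub msg_name: *mut c_void,
    pub msg_namelen: u32, // socklen_t
    pub msg_iov: *mut IoVec,
    pub msg_iovlen: c_int,
    pub msg_control: *mut c_void,
    pub msg_controllen: u32, // socklen_t
    pub msg_flags: c_int,
    pub msg_datalen: usize,
}

// Declared only on macOS; the symbol is not part of the public SDK headers.
// The signature is an assumption, not taken from this PR.
#[cfg(target_os = "macos")]
unsafe extern "C" {
    fn recvmsg_x(s: c_int, msgp: *const MsghdrX, cnt: c_uint, flags: c_int) -> isize;
}

/// Builds one zero-initialised header per iovec. Control pointer and length
/// stay null/0 because this path never consumes ancillary data.
pub fn zeroed_headers(iovs: &mut [IoVec]) -> Vec<MsghdrX> {
    iovs.iter_mut()
        .map(|iov| {
            // SAFETY: all-zero bytes are valid for raw pointers and integers.
            let mut hdr: MsghdrX = unsafe { mem::zeroed() };
            hdr.msg_iov = iov;
            hdr.msg_iovlen = 1;
            hdr
        })
        .collect()
}

/// Hypothetical thin wrapper: returns how many headers the kernel filled.
///
/// # Safety
/// Every header must point at a live iovec whose buffer outlives the call.
#[cfg(target_os = "macos")]
pub unsafe fn recv_burst(fd: c_int, hdrs: &mut [MsghdrX]) -> std::io::Result<usize> {
    let n = unsafe { recvmsg_x(fd, hdrs.as_mut_ptr(), hdrs.len() as c_uint, 0) };
    if n < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(n as usize)
    }
}
```

After a successful call, each filled header's `msg_datalen` would give the length of that datagram, which is how per-datagram boundaries are recovered from a single syscall.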
Reproducible bench for the recv-side syscall amortization win added in
the previous commit. Sender(s) run on dedicated blocking std threads so
the kernel rx queue stays saturated regardless of how the tokio receiver
schedules — that's the scenario where recvmmsg / recvmsg_x is meant to
help. Receiver runs both `recv_from` (single recvmsg per syscall) and
`recv_batch` (recvmsg_x with up to 32 datagrams) for fixed wall-clock
windows, sweeping over 1/2/4/8 sender threads to vary queue depth.
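
The shape of that harness can be sketched with the standard library alone. This is an illustration under assumptions, not the bench in this PR: the real receiver runs on tokio and compares `recv_from` against `recv_batch`, whereas this hypothetical `measure_pps` helper only measures a blocking `recv_from` loop fed by dedicated sender threads over a fixed wall-clock window.

```rust
use std::net::UdpSocket;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Counts loopback datagrams received during `window` while `senders`
/// blocking threads flood the receiver, and returns packets per second.
pub fn measure_pps(senders: usize, window: Duration) -> std::io::Result<f64> {
    let rx = UdpSocket::bind("127.0.0.1:0")?;
    // Short timeout so the loop re-checks the deadline even when idle.
    rx.set_read_timeout(Some(Duration::from_millis(10)))?;
    let target = rx.local_addr()?;
    let stop = Arc::new(AtomicBool::new(false));

    // Dedicated std threads keep the kernel rx queue saturated.
    let handles: Vec<_> = (0..senders)
        .map(|_| {
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                let tx = UdpSocket::bind("127.0.0.1:0").expect("bind sender");
                let payload = [0u8; 100]; // 100-byte payloads, as in the bench
                while !stop.load(Ordering::Relaxed) {
                    // Send errors (for example a full buffer) are ignored.
                    let _ = tx.send_to(&payload, target);
                }
            })
        })
        .collect();

    let mut buf = [0u8; 2048];
    let mut received = 0u64;
    let start = Instant::now();
    while start.elapsed() < window {
        if rx.recv_from(&mut buf).is_ok() {
            received += 1;
        }
    }
    let elapsed = start.elapsed().as_secs_f64();

    stop.store(true, Ordering::Relaxed);
    for handle in handles {
        let _ = handle.join();
    }
    Ok(received as f64 / elapsed)
}
```

Sweeping `senders` over 1, 2, 4, and 8 with the same window would reproduce the queue-depth sweep described above for the single-packet path.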
Numbers on aarch64-apple-darwin (M-series, 100B payloads, 3s windows,
multi_thread(2) tokio), recv-side throughput at the receiver:
    senders=1:  recv_from 398k pps   recv_batch 432k pps   1.09x   batch≈5.5
    senders=2:  recv_from 353k pps   recv_batch 608k pps   1.72x   batch≈11.9
    senders=4:  recv_from 322k pps   recv_batch 503k pps   1.56x   batch≈17.2
    senders=8:  recv_from 353k pps   recv_batch 515k pps   1.46x   batch≈22.7
The win scales with kernel-queue depth: under single-stream load the
kernel queue is mostly empty between wakeups so each batched syscall
reaps just ~5 packets and throughput rises only ~9%; once two or more
sender threads deepen the queue, batches grow to 12-23 packets per syscall
and the receiver sustains roughly 46-72% more pps. (The Linux PR's 3.5x is larger
because Linux per-syscall + scheduler-hop overhead is heavier than
macOS kqueue's; the Apple ceiling is lower but still meaningful.)
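
As a worked check of those figures, the sketch below recomputes the speedup column from the reported throughputs and derives an estimated syscall rate for the batched path (packets per second divided by mean batch size). The row values are copied from the table; the helper names are hypothetical, and the syscall-rate estimate is a derived illustration that was not measured in the bench.

```rust
/// (senders, recv_from kpps, recv_batch kpps, mean batch size), from the table.
pub const ROWS: [(u32, f64, f64, f64); 4] = [
    (1, 398.0, 432.0, 5.5),
    (2, 353.0, 608.0, 11.9),
    (4, 322.0, 503.0, 17.2),
    (8, 353.0, 515.0, 22.7),
];

/// Throughput ratio of the batched path over the single-packet path.
pub fn speedup(single_kpps: f64, batch_kpps: f64) -> f64 {
    batch_kpps / single_kpps
}

/// Estimated batched syscalls per second, in thousands:
/// packets per second divided by packets reaped per syscall.
pub fn batch_ksyscalls(batch_kpps: f64, mean_batch: f64) -> f64 {
    batch_kpps / mean_batch
}

/// One formatted line per table row.
pub fn report() -> Vec<String> {
    ROWS.iter()
        .map(|&(senders, single, batch, mean)| {
            format!(
                "senders={}: {:.2}x, ~{:.0}k batched syscalls/s vs {:.0}k recv_from syscalls/s",
                senders,
                speedup(single, batch),
                batch_ksyscalls(batch, mean),
                single
            )
        })
        .collect()
}
```

The single-packet path issues one syscall per packet, so its syscall rate equals its pps; the batched path moves more packets with far fewer syscalls, which is the amortisation the paragraph above describes.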
Marked #[ignore] so it doesn't run in default `cargo test`. Run with:
    cargo test --release -p fips --lib \
        transport::udp::socket::tests::bench_udp_recv_amortization \
        -- --ignored --nocapture
Landed on The four commits were rebased onto the post-#87/#88 master tip ( CI ran on the squashed commit before merge: 42/42 green, including Closing manually since the squash rewrote the head SHAs and GitHub's patch-equivalence auto-close didn't fire. Thanks!
Summary
- `recvmsg_x(2)` UDP receive batch path matching the existing Linux `recvmmsg(2)` contract.
- `udp_receive_loop` batch drain so one readiness wake can reap up to 32 datagrams.
- `IpAddr` import in `control::listening` so macOS test builds are warning-clean.

Benchmark
aarch64-apple-darwin, 100-byte UDP loopback receive throughput over 3 s windows:
The win appears once the kernel receive queue is deep enough for `recvmsg_x` to amortize wakeup + syscall cost across a batch.

Validation
`cargo fmt --check`, `git diff --check`, full lib suite, transport::udp suite (25 passed, 1 ignored), and the burst-recv regression test all clean. The ignored amortization bench above is reproducible via `cargo test --release --lib transport::udp::socket::tests::bench_udp_recv_amortization -- --ignored --nocapture`.