udp: batch macOS receive with recvmsg_x #89

Closed
mmalmi wants to merge 4 commits into jmcorgan:master from mmalmi:codex/darwin-recvmsg-x

Conversation

@mmalmi
Contributor

@mmalmi mmalmi commented May 14, 2026

Summary

  • Add a macOS recvmsg_x(2) UDP receive batch path matching the existing Linux recvmmsg(2) contract.
  • Route macOS through the existing udp_receive_loop batch drain so one readiness wake can reap up to 32 datagrams.
  • Add a burst-receive regression test and an ignored loopback amortization benchmark.
  • Gate the Linux-only IpAddr import in control::listening so macOS test builds are warning-clean.
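
A minimal sketch of the "drain up to 32 datagrams per readiness wake" idea from the summary. This is a portable stand-in rather than the PR's code: it issues one non-blocking `recv_from` per datagram using only the standard library, whereas the PR obtains the whole batch from a single `recvmsg_x` call. The names `MAX_BATCH` and `drain_ready` are hypothetical.

```rust
use std::io::ErrorKind;
use std::net::{SocketAddr, UdpSocket};

/// Upper bound on datagrams reaped per readiness wake (32 in the PR).
const MAX_BATCH: usize = 32;

/// Drain up to `MAX_BATCH` datagrams that are already queued on `socket`.
///
/// The socket must be in non-blocking mode. Each datagram is returned as
/// its own buffer, in arrival order, together with the sender address.
fn drain_ready(socket: &UdpSocket) -> std::io::Result<Vec<(Vec<u8>, SocketAddr)>> {
    let mut out = Vec::with_capacity(MAX_BATCH);
    let mut buf = [0u8; 2048];
    while out.len() < MAX_BATCH {
        match socket.recv_from(&mut buf) {
            // One datagram per call: copy out exactly the bytes received.
            Ok((len, peer)) => out.push((buf[..len].to_vec(), peer)),
            // Queue is empty: stop and wait for the next readiness event.
            Err(e) if e.kind() == ErrorKind::WouldBlock => break,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

The loop shape is the same in the batched version; the difference is that a real batch call replaces up to 32 iterations of the `match` with one syscall.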

Benchmark

aarch64-apple-darwin, 100-byte UDP loopback receive throughput over 3 s windows:

senders=1: baseline=348308 pps  batched=382982 pps  speedup=1.10×
senders=2: baseline=322229 pps  batched=521407 pps  speedup=1.62×
senders=4: baseline=333902 pps  batched=525601 pps  speedup=1.57×
senders=8: baseline=347476 pps  batched=445765 pps  speedup=1.28×

The win appears once the kernel receive queue is deep enough for recvmsg_x to amortize wakeup + syscall cost across a batch.
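
To make the amortization argument concrete, here is a toy cost model, not something taken from the PR. It assumes each receive pays a fixed wakeup + syscall cost that a batch shares, plus a per-datagram cost that batching cannot remove. The function names and the cost units are hypothetical.

```rust
/// Modelled cost of receiving one datagram when `batch` datagrams share a
/// single wakeup + syscall. Costs are in arbitrary units.
fn cost_per_datagram(fixed: f64, per_datagram: f64, batch: f64) -> f64 {
    fixed / batch + per_datagram
}

/// Modelled throughput gain of batching relative to one datagram per syscall.
fn modelled_speedup(fixed: f64, per_datagram: f64, batch: f64) -> f64 {
    cost_per_datagram(fixed, per_datagram, 1.0) / cost_per_datagram(fixed, per_datagram, batch)
}
```

Under this model the gain is 1.0 at a batch size of 1 and approaches `(fixed + per_datagram) / per_datagram` as batches grow, which is why a shallow queue yields little benefit. The measured numbers above will not follow the model exactly, since the baseline also moves with sender count.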

Validation

cargo fmt --check, git diff --check, full lib suite, transport::udp suite (25 passed, 1 ignored), and the burst-recv regression test all clean. The ignored amortization bench above is reproducible via cargo test --release --lib transport::udp::socket::tests::bench_udp_recv_amortization -- --ignored --nocapture.

mmalmi added 4 commits May 15, 2026 01:05
The Linux recv path drains up to 32 datagrams per kernel wakeup via
recvmmsg(2), amortising the per-syscall + per-task-wakeup cost across
the burst (commit 253ddda). macOS still fell through to single-packet
recv_from, so the same overhead capped inbound rate on Apple builds.

Add an equivalent batch path for Darwin using recvmsg_x(2). It is an
xnu-private syscall (not in the public SDK) but is the canonical
amortisation primitive on macOS — same shape used by quinn-udp for
the same reason. ABI is the public msghdr layout plus a trailing
msg_datalen (per-datagram bytes-received output), declared via
`unsafe extern "C"` against a local repr(C) `msghdr_x`.
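
A sketch of what the local declaration described above might look like, using only the standard library. The field order follows the public `msghdr` layout plus the trailing `msg_datalen`, as the commit message states; the Rust type names `IoVec` and `MsghdrX` and the helper `empty_header` are hypothetical, and the `recvmsg_x` signature is an assumption based on the xnu prototype rather than the PR's exact code.

```rust
use std::os::raw::{c_int, c_void};
use std::ptr;

/// Local mirror of `struct iovec`.
#[repr(C)]
pub struct IoVec {
    pub iov_base: *mut c_void,
    pub iov_len: usize,
}

/// Local mirror of xnu's private `struct msghdr_x`: the public `msghdr`
/// fields followed by `msg_datalen`, which the kernel fills with the number
/// of bytes received for this datagram.
#[repr(C)]
pub struct MsghdrX {
    pub msg_name: *mut c_void,
    pub msg_namelen: u32, // socklen_t
    pub msg_iov: *mut IoVec,
    pub msg_iovlen: c_int,
    pub msg_control: *mut c_void,
    pub msg_controllen: u32, // socklen_t
    pub msg_flags: c_int,
    pub msg_datalen: usize, // size_t
}

// Only declared on macOS, so other targets never try to link the symbol.
#[cfg(target_os = "macos")]
unsafe extern "C" {
    fn recvmsg_x(
        s: c_int,
        msgp: *const MsghdrX,
        cnt: std::os::raw::c_uint,
        flags: c_int,
    ) -> isize;
}

/// A header with every field zeroed or null. The caller still has to point
/// `msg_name` and `msg_iov` at real storage before passing it to the kernel;
/// the control buffer stays null because ancillary data is never consumed.
pub fn empty_header() -> MsghdrX {
    MsghdrX {
        msg_name: ptr::null_mut(),
        msg_namelen: 0,
        msg_iov: ptr::null_mut(),
        msg_iovlen: 0,
        msg_control: ptr::null_mut(),
        msg_controllen: 0,
        msg_flags: 0,
        msg_datalen: 0,
    }
}
```

Declaring the struct locally rather than relying on a binding crate keeps the private ABI visible in one place, at the cost of having to track any layout change by hand.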

Same `(count, kernel_drops)` contract as the Linux `recv_batch`. macOS
has no SO_RXQ_OVFL equivalent, so `kernel_drops` is always 0 — the
1Hz `sample_transport_congestion()` detector simply sees no kernel
drop signal on Apple hosts (it already tolerates that, since the
field has been 0 there pre-batching too).

cmsg buffer is intentionally null: we never consume ancillary data on
this path, and quinn-udp documents that `recvmsg_x` does not overwrite
`msg_controllen` on macOS 10.15+ (zeroed init is the only safe state).

The udp_receive_loop dispatch widens from `cfg(linux)` to
`cfg(any(linux, macos))`; the per-packet recv_from path is now used
only on the remaining unix targets (BSDs etc.) and Windows.
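
The dispatch widening can be pictured with a small sketch. The commit message writes `cfg(linux)` as shorthand; in actual attributes this is spelled `target_os = "..."`. The `RecvPath` enum and `recv_path` function are hypothetical and only illustrate which branch a given target would compile.

```rust
/// Which receive strategy this build selects.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RecvPath {
    /// Batch drain via recvmmsg (Linux) or recvmsg_x (macOS).
    Batch,
    /// One recv_from per datagram (other unix targets and Windows).
    PerPacket,
}

#[cfg(any(target_os = "linux", target_os = "macos"))]
fn recv_path() -> RecvPath {
    RecvPath::Batch
}

#[cfg(not(any(target_os = "linux", target_os = "macos")))]
fn recv_path() -> RecvPath {
    RecvPath::PerPacket
}
```

Pairing `cfg(any(...))` with its exact `cfg(not(any(...)))` complement guarantees that exactly one definition exists on every target.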

Adds `test_burst_recv_batch` exercising 10 in-flight datagrams to
verify per-datagram boundaries and arrival order across the batch.
All 25 UDP tests pass on aarch64-apple-darwin.

Reproducible bench for the recv-side syscall amortization win added in
the previous commit. Sender(s) run on dedicated blocking std threads so
the kernel rx queue stays saturated regardless of how the tokio receiver
schedules — that's the scenario where recvmmsg / recvmsg_x is meant to
help. Receiver runs both `recv_from` (single recvmsg per syscall) and
`recv_batch` (recvmsg_x with up to 32 datagrams) for fixed wall-clock
windows, sweeping over 1/2/4/8 sender threads to vary queue depth.

Numbers on aarch64-apple-darwin (M-series, 100B payloads, 3s windows,
multi_thread(2) tokio), recv-side throughput at the receiver:

  senders=1:  recv_from 398k pps   recv_batch 432k pps   1.09x   batch≈5.5
  senders=2:  recv_from 353k pps   recv_batch 608k pps   1.72x   batch≈11.9
  senders=4:  recv_from 322k pps   recv_batch 503k pps   1.56x   batch≈17.2
  senders=8:  recv_from 353k pps   recv_batch 515k pps   1.46x   batch≈22.7

The win scales with kernel-queue depth: under single-stream load the
kernel queue is mostly empty between wakeups so each batched syscall
reaps just ~5 packets and we save ~9%; once two or more sender threads
deepen the queue, batches grow to 12-23 packets per syscall and the
receiver sustains 50-70% more pps. (The Linux PR's 3.5x is larger
because Linux per-syscall + scheduler-hop overhead is heavier than
macOS kqueue's; the Apple ceiling is lower but still meaningful.)
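
The speedup column in the table above can be re-derived from the two throughput columns. The helper below is a hypothetical illustration, not part of the benchmark itself.

```rust
/// Ratio of batched to baseline throughput, rounded to two decimals.
/// Both arguments must use the same unit (kpps here).
fn speedup(baseline_kpps: f64, batched_kpps: f64) -> f64 {
    (batched_kpps / baseline_kpps * 100.0).round() / 100.0
}
```

For example, `speedup(353.0, 608.0)` reproduces the 1.72x reported for two senders.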

Marked #[ignore] so it doesn't run in default `cargo test`. Run with:
  cargo test --release -p fips --lib \
    transport::udp::socket::tests::bench_udp_recv_amortization \
    -- --ignored --nocapture
@jmcorgan
Owner

Landed on master as 59225cc. Authorship preserved.

The four commits were rebased onto the post-#87/#88 master tip (b05c80e) and collapsed into one via soft-reset. Pre- and post-squash trees are byte-identical across src/control/listening.rs, src/transport/udp/mod.rs, and src/transport/udp/socket.rs, so nothing was rewritten in content. The two fallout commits (control: gate linux-only IpAddr import and udp: fix recv benchmark clippy lint) folded cleanly since both were mechanical consequences of the cfg widening in the substance commits.

CI ran on the squashed commit before merge: 42/42 green, including Unit tests (macOS) exercising test_burst_recv_batch on the macos-latest Apple Silicon runner through the new recvmsg_x path.

Closing manually since the squash rewrote the head SHAs and GitHub's patch-equivalence auto-close didn't fire. Thanks!
