Data-plane perf overhaul (+~2× single-stream TCP) by mmalmi · Pull Request #91 · jmcorgan/fips

mmalmi · 2026-05-15T14:52:36Z

Data-plane perf overhaul: off-task encrypt + decrypt, GSO, connected UDP

TL;DR

Moves both AEAD (Authenticated Encryption with Associated Data — ChaCha20-Poly1305 in FIPS, one round per layer per packet) layers plus the sendmsg syscall off the rx_loop task onto a per-shard worker pool, adds per-peer connect(2)-ed UDP with SO_REUSEPORT, and uses Linux UDP GSO (Generic Segmentation Offload — the kernel splits one large super-skb into N on-the-wire datagrams in a single trip through the TX stack) when packets in a batch are uniform-size. GSO is the same kernel primitive WireGuard's in-kernel module and Cloudflare's BoringTun userspace tunnel use to hit 2.5–3.2 Gbps single-stream.

Single TCP stream: ~1.9× throughput (1.87–1.96×) on every measured static-peer path.

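| Path | Before this PR (Mbps) | With this PR (Mbps) | Speedup | RTT Δ |
|------|-----------------------|---------------------|---------|-------|
| A→D | 1379 | 2708 | 1.96× | +0.12 ms |
| A→E | 1394 | 2663 | 1.91× | +0.11 ms |
| E→A | 1406 | 2624 | 1.87× | +0.19 ms |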

testing/static/scripts/bench-multirun.sh, 5 × 15 s × 1 stream medians on a Linux x86_64 docker-bridge mesh (8 vCPU / 24 GiB, kernel 7.0). All CoV < 3 %, 0 outliers (Δ > 20 % from median), 0 % ICMP loss on both branches. The +100–200 µs RTT is the worker queue handoff added by moving AEAD off rx_loop.
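
The summary columns quoted above (median, CoV, outlier count) and the speedup column reduce to a few lines of arithmetic. The following is a minimal Rust sketch, not the logic of bench-multirun.sh itself: the function and type names are hypothetical, and using the sample standard deviation (n - 1 denominator) for CoV is an assumption, since the script's exact formula is not shown here.

```rust
/// Summary of repeated throughput runs for one path (hypothetical type).
#[derive(Debug)]
pub struct RunSummary {
    pub median: f64,
    pub cov_percent: f64,
    pub outliers: usize,
}

/// Median of a non-empty slice (mean of the two middle values when even).
pub fn median(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("no NaN in samples"));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// Coefficient of variation in percent. Assumes at least two samples and
/// the sample standard deviation (n - 1 denominator).
pub fn cov_percent(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    100.0 * var.sqrt() / mean
}

/// Median, CoV, and the number of runs more than 20 % away from the median.
pub fn summarize(samples: &[f64]) -> RunSummary {
    let med = median(samples);
    let outliers = samples
        .iter()
        .filter(|&&x| ((x - med) / med).abs() > 0.20)
        .count();
    RunSummary { median: med, cov_percent: cov_percent(samples), outliers }
}

/// Speedup as reported in the table: throughput after / throughput before.
pub fn speedup(before_mbps: f64, after_mbps: f64) -> f64 {
    after_mbps / before_mbps
}
```

Applied to the table, `speedup(1379.0, 2708.0)` is about 1.96, which matches the A→D row.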

Topology

5 nodes on one docker-bridge subnet (testing/static/configs/topologies/mesh.yaml). Static peers:

A: D, E       B: C        C: B, D, E
D: A, C, E    E: A, C, D
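
As a sanity check on the listing above, the adjacency can be encoded directly and tested for symmetry and for whether a benchmarked path is a static-peer pair. This is only an illustrative sketch: the helper names are hypothetical, and the real topology lives in mesh.yaml, whose schema is not reproduced here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Static-peer adjacency, copied from the topology listing.
pub fn mesh() -> BTreeMap<char, BTreeSet<char>> {
    [('A', "DE"), ('B', "C"), ('C', "BDE"), ('D', "ACE"), ('E', "ACD")]
        .iter()
        .map(|(node, peers)| (*node, peers.chars().collect::<BTreeSet<char>>()))
        .collect()
}

/// True when `b` is listed as a static peer of `a`.
pub fn is_static_pair(m: &BTreeMap<char, BTreeSet<char>>, a: char, b: char) -> bool {
    m.get(&a).map_or(false, |peers| peers.contains(&b))
}

/// True when every listed link appears in both directions.
pub fn is_symmetric(m: &BTreeMap<char, BTreeSet<char>>) -> bool {
    m.iter()
        .all(|(a, peers)| peers.iter().all(|b| is_static_pair(m, *b, *a)))
}
```

All three benchmarked paths (A→D, A→E, E→A) are static-peer pairs under this encoding, while a path such as A→C would need an intermediate hop.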

Diff shape

30 files, +6173 / -101. ~85 % of new code is in six new files that drop in alongside the existing transport stack:

| File | Lines | Role |
|------|-------|------|
| `src/node/encrypt_worker.rs` | +2238 | FMP+FSP AEAD-seal + sendmmsg / UDP_GSO worker pool |
| `src/node/decrypt_worker.rs` | +693 | shard-owned receive AEAD-open + replay window |
| `src/transport/udp/peer_drain.rs` | +457 | per-peer recv drain thread |
| `src/transport/udp/connected_peer.rs` | +436 | per-peer `connect(2)`-ed UDP socket |
| `src/perf_profile.rs` | +405 | optional per-stage timing reporter |
| `src/transport/udp/darwin_sockopts.rs` | +197 | macOS UDP tuning |

The rest is integration glue. All #[cfg(unix)]-gated; Windows continues on the existing tokio-based send/recv.

What's in scope

Off-task encrypt (encrypt_worker) — std::thread + crossbeam_channel; hash-by-destination dispatch pins a TCP flow to one worker so wire ordering is preserved; per-worker sendmmsg(2) batching up to 32; Linux uses sendmsg(2)+UDP_SEGMENT when packets in a group are uniform-size.
Off-task decrypt (decrypt_worker) — receive-side mirror. Each shard owns its session's recv cipher + replay window in a thread-local HashMap (no shared RwLock/Mutex). Sessions are handed off at promote_connection and re-registered on K-bit flip / rekey cutover.
FSP+FMP pipelined send — both AEAD layers seal in-place in the worker on a single wire-buffer allocation; no intermediate inner_plaintext / fsp_payload Vecs.
Per-peer connected UDP (Linux + macOS) — SO_REUSEPORT so the per-peer connected socket can bind to the same wildcard port the listen socket holds; the worker sends with msg_name = NULL and the kernel uses its cached 5-tuple (skips per-packet route + neighbor lookup). Tick-driven activation, idempotent.
Receive zero-copy — mem::replace the recvmmsg backing buffer instead of to_vec() per packet; SessionDatagramRef zero-copy view for local delivery; TransportAddr::from_socket_addr collapses two allocs to one.
rx_loop ordering — fallback drain promoted ahead of packet_rx in the select!; interleaved fallback drain every 32 packets inside the rx burst loop so TCP ACKs don't pile up behind a 256-packet inbound burst.
Worker pool sizing — both default to num_cpus, overridable via FIPS_ENCRYPT_WORKERS=N / FIPS_DECRYPT_WORKERS=N.
FIPS_PERF=1 — optional per-stage timing reporter. Off by default, zero overhead when disabled.
Bench harness (testing/static/scripts/bench-multirun.sh) — N reruns (default 5), median / min / max / CoV % / per-run outlier flag, avg ping RTT, ICMP loss %, TCP retransmit total. Pre-bench peer-convergence check + per-path route verification via stats.bytes_sent deltas — fails fast if traffic exits via a non-static-peer link.
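
The hash-by-destination dispatch and the pool-sizing override from the list above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the code in encrypt_worker.rs: it uses `std::sync::mpsc` in place of crossbeam_channel, `std::thread::available_parallelism` in place of num_cpus, and the names `shard_for`, `worker_count`, and `dispatch` are hypothetical. How an invalid or zero override is treated is also an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;
use std::sync::mpsc;
use std::thread;

/// Pick a worker index from the destination address. The same destination
/// always maps to the same worker, so one flow is never split across workers.
pub fn shard_for(dest: &SocketAddr, workers: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    dest.hash(&mut hasher);
    (hasher.finish() % workers as u64) as usize
}

/// Resolve the pool size: a positive integer override wins, otherwise fall
/// back to the detected parallelism (assumed stand-in for num_cpus).
pub fn worker_count(override_val: Option<&str>) -> usize {
    override_val
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|n| *n > 0)
        .unwrap_or_else(|| thread::available_parallelism().map(|n| n.get()).unwrap_or(1))
}

/// Send (destination, sequence) jobs to `workers` threads and return what
/// each worker saw, in the order it saw it.
pub fn dispatch(jobs: Vec<(SocketAddr, u64)>, workers: usize) -> Vec<Vec<(SocketAddr, u64)>> {
    let mut senders = Vec::with_capacity(workers);
    let mut handles = Vec::with_capacity(workers);
    for _ in 0..workers {
        let (tx, rx) = mpsc::channel::<(SocketAddr, u64)>();
        senders.push(tx);
        // A real worker would seal and transmit here; this one only records.
        handles.push(thread::spawn(move || rx.iter().collect::<Vec<_>>()));
    }
    for job in jobs {
        let idx = shard_for(&job.0, workers);
        senders[idx].send(job).expect("worker is alive");
    }
    drop(senders); // closing the channels lets the workers finish
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

A caller could pass `std::env::var("FIPS_ENCRYPT_WORKERS").ok().as_deref()` to `worker_count`. Because each channel is FIFO and each destination maps to exactly one channel, per-destination ordering survives the handoff even though workers run concurrently.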

Test plan

cargo fmt --check clean
cargo clippy --all-targets --all-features -- -D warnings clean on Linux (what CI runs)
cargo nextest run --all --profile ci — 1228 passed / 4 skipped / 0 failed
Session-specific tests 127 / 127, rekey-specific 9 / 9
bash -n on both modified shell scripts
CI to confirm macOS + Windows builds

jmcorgan · 2026-05-17T17:17:30Z

This PR has gone DIRTY against current master. Three landings arrived after it opened: #89 macOS recvmsg_x at 59225cc, #90 rx zero-copy at b1af151, and the coord cache surgical-invalidation merge wave at f51dde6. A rebase onto current master is needed before this can land.

Once you've rebased, I'll do a detailed pass and follow up. Thanks.

Moves both AEAD layers (ChaCha20-Poly1305, one round per layer per packet) plus the sendmsg syscall off the rx_loop task onto a per-shard worker pool, adds per-peer connect(2)-ed UDP with SO_REUSEPORT, and uses Linux UDP GSO (sendmsg+UDP_SEGMENT: kernel splits one super-skb into N on-the-wire datagrams in a single TX-stack walk) when packets in a batch are uniform-size. Same kernel primitive WireGuard's in-kernel module and BoringTun use to hit 2.5–3.2 Gbps single-stream.

Single TCP stream on a 5-node docker-bridge mesh, 5 x 15 s x P=1:

A→D: 1379 → 2708 Mbps (1.96x, RTT +0.12 ms)
A→E: 1394 → 2663 Mbps (1.91x, RTT +0.11 ms)
E→A: 1406 → 2624 Mbps (1.87x, RTT +0.19 ms)

Static-peer pairs only; every CoV under 3%, 0 outliers, 0% ICMP loss. The ~+100 µs RTT is the worker queue handoff cost; AEAD + sendmmsg now run on a separate core in exchange.

What lands:

- src/node/encrypt_worker.rs: std::thread + crossbeam_channel workers; hash-by-destination dispatch pins a TCP flow to one worker so wire ordering is preserved; per-worker sendmmsg(2) batching up to 32; Linux uses sendmsg(2)+UDP_SEGMENT when packets in a group are uniform-size.
- src/node/decrypt_worker.rs: receive-side mirror. Each shard owns its session's recv cipher + replay window in a thread-local HashMap (no shared RwLock/Mutex). Sessions are handed off at promote_connection and re-registered on K-bit flip / rekey cutover.
- src/node/handlers/session.rs try_send_session_data_pipelined: FSP+FMP both seal in-place in the worker on one wire-buffer alloc; no intermediate inner_plaintext / fsp_payload Vecs.
- src/transport/udp/connected_peer.rs + peer_drain.rs: per-peer connect(2)-ed UDP socket with SO_REUSEPORT (set on the listen socket too; without that, EADDRINUSE on activation and every packet falls back to the wildcard path); the worker sends with msg_name=NULL and the kernel uses its cached 5-tuple. Tick-driven activation in handlers/connected_udp.rs, idempotent.
- src/transport/udp/mod.rs: mem::replace the recvmmsg backing buffer instead of buf.to_vec() per packet (single pointer swap, no MTU-sized memcpy).
- src/protocol/link.rs SessionDatagramRef: zero-copy borrowed view used by handle_session_datagram for the bulk local-delivery path; handle_session_payload takes the borrowed payload directly (no payload[35..].to_vec()).
- src/transport/mod.rs TransportAddr::from_socket_addr: collapses the two-alloc from_string(addr.to_string()) pattern to one.
- src/node/handlers/rx_loop.rs: decrypt-fallback drain promoted ahead of packet_rx in the select! (TCP ACK starvation fix); interleaved fallback drain every 32 packets inside the rx burst loop.
- noise::Session: send_cipher_clone / recv_cipher_clone / recv_replay_snapshot_owned / take_send_counter / accept_replay so off-task workers can hold a cloned cipher + reserved counter while the dispatcher keeps replay/counter sequencing serial. CipherState::cipher_clone returns a refcount-bumped LessSafeKey. AsyncUdpSocket: AsRawFd so workers issue raw sendmmsg / sendmsg without going through the tokio reactor.
- Worker pool sizing: both default to num_cpus, overridable via FIPS_ENCRYPT_WORKERS=N / FIPS_DECRYPT_WORKERS=N.
- src/perf_profile.rs: optional per-stage timing reporter under FIPS_PERF=1. Off by default; zero overhead when disabled.
- All cfg(unix)-gated. Windows continues on the existing tokio-based send/recv.

Testing:

- testing/static/scripts/bench-multirun.sh: multi-run iperf3 + ping bench. N reruns (default 5), median / min / max / CoV % / per-run outlier flag, avg ping RTT, ICMP loss %, TCP retransmit total. Plain client→dest labels + topology header. Pre-bench peer-convergence check (FIPS_BENCH_CONVERGE_SECS, default 15); per-path route verification via stats.bytes_sent deltas, which fails fast if traffic exits via a non-static-peer link.
- testing/static/docker-compose.yml: passes FIPS_ENCRYPT_WORKERS / FIPS_DECRYPT_WORKERS / FIPS_PERF through to containers for A/B benchmarking without rebuilds.
- testing/static/scripts/iperf-test.sh: same plain client→dest labels + topology header (was multihop/direct/N hop, which conflated topology distance with on-wire path).
- .config/nextest.toml: synthetic UDP node tests serialized through a max-threads=1 test group. Localhost handshakes drop on shared CI runners under parallel load; one-at-a-time keeps assertions reliable.
- src/node/tests/spanning_tree.rs: repair_missing_edge_handshakes retries up to 5 times for synthetic edges whose msg1 was dropped, with a drain after each edge retry instead of after each attempt's full burst.

Cherry-picks from mmalmi/master (paths translated from crates/fips-core/src/ to src/): 9b7c723, 0deb5cb, 13f7339, e036c0e, 3740a68, 3792f83, 8510193, 4910b07, e53f545, e4e2896, 5fe4af5, 1d01ada, 8c37008, e12469e, 6eb2860.
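
The "uniform-size" condition for UDP_SEGMENT can be pictured as a pass over a batch that groups consecutive equal-length packets. The sketch below only plans the sends and makes no syscalls; `SendPlan` and `plan_batch` are hypothetical names, the batch is assumed to be bound for a single peer, and kernel limits on segment count and total size are ignored. The kernel also accepts a shorter final segment, which this sketch does not exploit.

```rust
/// How one stretch of a per-peer batch would be sent (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SendPlan {
    /// `count` equal-size packets starting at `start`: a candidate for a
    /// single sendmsg(2) with UDP_SEGMENT set to `segment_size`.
    Gso { start: usize, count: usize, segment_size: usize },
    /// A lone packet that takes the ordinary per-datagram path.
    Single { index: usize },
}

/// Split a batch of packet lengths into runs of consecutive equal sizes.
pub fn plan_batch(lens: &[usize]) -> Vec<SendPlan> {
    let mut plans = Vec::new();
    let mut start = 0;
    while start < lens.len() {
        // Extend the run while the next packet has the same length.
        let mut end = start + 1;
        while end < lens.len() && lens[end] == lens[start] {
            end += 1;
        }
        if end - start >= 2 {
            plans.push(SendPlan::Gso {
                start,
                count: end - start,
                segment_size: lens[start],
            });
        } else {
            plans.push(SendPlan::Single { index: start });
        }
        start = end;
    }
    plans
}
```

Grouping only consecutive packets keeps the on-wire order identical to the queue order, which matters for the ordering guarantee described for the encrypt worker.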

mmalmi · 2026-05-19T07:15:13Z

Rebased onto current master (2b6d402). Status is now MERGEABLE / CLEAN.

Conflicts resolved:

src/protocol/link.rs: master's #90 (rx: avoid copies in receive hot paths) already added SessionDatagramRef (b1af151); dropped the PR's redundant duplicate definition and kept master's slightly cleaner SessionDatagram::decode → SessionDatagramRef::decode + into_owned() flow.
src/node/handlers/forwarding.rs — same situation; kept master's ref-based handle_session_datagram (uses into_owned(); both branches independently arrived at the same zero-copy ref pattern).
src/transport/mod.rs, src/transport/udp/mod.rs — comment-only conflicts; kept master's comment in TransportAddr::from_socket_addr, kept the PR's mem::replace explanation in the recv loop.
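
For readers unfamiliar with the mem::replace handoff mentioned in the last item, here is a minimal sketch of the idea. It is not the code in src/transport/udp/mod.rs: `RecvSlot` and its methods are hypothetical, and `fill` merely stands in for the kernel writing a datagram into the buffer.

```rust
use std::mem;

/// One receive slot that owns its backing buffer (hypothetical type).
pub struct RecvSlot {
    pub buf: Vec<u8>,
    pub cap: usize,
}

impl RecvSlot {
    pub fn new(cap: usize) -> Self {
        Self { buf: vec![0u8; cap], cap }
    }

    /// Stand-in for the kernel filling the buffer; returns the byte count.
    pub fn fill(&mut self, data: &[u8]) -> usize {
        let n = data.len().min(self.cap);
        self.buf[..n].copy_from_slice(&data[..n]);
        n
    }

    /// Hand the filled buffer to the caller and install a fresh one.
    /// The packet bytes are moved by swapping the Vec, not copied.
    pub fn take(&mut self, len: usize) -> Vec<u8> {
        let mut full = mem::replace(&mut self.buf, vec![0u8; self.cap]);
        full.truncate(len);
        full
    }
}
```

Compared with `buf[..len].to_vec()`, this still allocates a replacement buffer, but the received bytes stay where the kernel wrote them.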

Verified on Linux x86_64 (matches CI target):

cargo fmt --check clean
cargo clippy --all-targets --all-features -- -D warnings clean
cargo nextest run --all --profile ci: 1247 passed / 6 skipped / 0 failed (222s)

Also built on Windows — succeeds, same pre-existing Windows-only dead-code warnings as before the rebase (process_authentic_fmp_plaintext only referenced from cfg(unix) worker code; ESTABLISHED_HEADER_SIZE import unused on non-unix). Not regressions from the rebase, but happy to gate them on the next pass if you'd like.

mmalmi force-pushed the pr/sender-path-overhaul branch 2 times, most recently from 0852b4f to c3d7652 (May 15, 2026 16:23)

mmalmi force-pushed the pr/sender-path-overhaul branch from c3d7652 to 2b6d402 (May 19, 2026 07:14)

mmalmi wants to merge 1 commit into jmcorgan:master from mmalmi:pr/sender-path-overhaul
