Data-plane perf overhaul (+~2× single-stream TCP) #91

Open
mmalmi wants to merge 1 commit into jmcorgan:master from mmalmi:pr/sender-path-overhaul

Contributor

mmalmi commented May 15, 2026

Data-plane perf overhaul: off-task encrypt + decrypt, GSO, connected UDP

TL;DR

Moves both AEAD (Authenticated Encryption with Associated Data — ChaCha20-Poly1305 in FIPS, one round per layer per packet) layers plus the sendmsg syscall off the rx_loop task onto a per-shard worker pool, adds per-peer connect(2)-ed UDP with SO_REUSEPORT, and uses Linux UDP GSO (Generic Segmentation Offload — the kernel splits one large super-skb into N on-the-wire datagrams in a single trip through the TX stack) when packets in a batch are uniform-size. GSO is the same kernel primitive WireGuard's in-kernel module and Cloudflare's BoringTun userspace tunnel use to hit 2.5–3.2 Gbps single-stream.

Single TCP stream: ~+1.9× across every static-peer path.

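| Path | Before this PR (Mbps) | With this PR (Mbps) | Speedup | RTT Δ |
|------|-----------------------|---------------------|---------|-------|
| A→D | 1379 | 2708 | 1.96× | +0.12 ms |
| A→E | 1394 | 2663 | 1.91× | +0.11 ms |
| E→A | 1406 | 2624 | 1.87× | +0.19 ms |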

testing/static/scripts/bench-multirun.sh, 5 × 15 s × 1 stream medians on a Linux x86_64 docker-bridge mesh (8 vCPU / 24 GiB, kernel 7.0). All CoV < 3 %, 0 outliers (Δ > 20 % from median), 0 % ICMP loss on both branches. The +100–200 µs RTT is the worker queue handoff added by moving AEAD off rx_loop.
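The per-path summary described above (median, min, max, CoV %, and the "Δ > 20 % from median" outlier flag) can be sketched as follows. This is a minimal illustration, not the script's actual code: the `Summary` struct, the `summarize` function, the use of population standard deviation, and the sample values are all assumptions.

```rust
/// Summary statistics for one benchmark path (throughput samples in Mbps).
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug)]
struct Summary {
    median: f64,
    min: f64,
    max: f64,
    cov_pct: f64,
    /// Indices of runs that differ from the median by more than 20 %.
    outliers: Vec<usize>,
}

fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean = samples.iter().sum::<f64>() / n as f64;
    // Assumption: population standard deviation (divide by n).
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n as f64;
    let cov_pct = if mean != 0.0 { var.sqrt() / mean * 100.0 } else { 0.0 };
    let outliers = samples
        .iter()
        .enumerate()
        .filter(|(_, x)| ((**x - median) / median).abs() > 0.20)
        .map(|(i, _)| i)
        .collect();
    Some(Summary { median, min: sorted[0], max: sorted[n - 1], cov_pct, outliers })
}

fn main() {
    // Hypothetical per-run samples, not the PR's raw data.
    let runs = [2690.0, 2708.0, 2731.0, 2655.0, 2712.0];
    let s = summarize(&runs).expect("non-empty sample set");
    println!(
        "median={} min={} max={} CoV={:.2}% outliers={:?}",
        s.median, s.min, s.max, s.cov_pct, s.outliers
    );
}
```

Reporting the median rather than the mean keeps a single slow run from dragging the headline number, which is presumably why the harness also flags outliers separately.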

Topology

5 nodes on one docker-bridge subnet (testing/static/configs/topologies/mesh.yaml). Static peers:

A: D, E       B: C        C: B, D, E
D: A, C, E    E: A, C, D
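As a sanity check on the adjacency listed above, the sketch below encodes the static-peer table and confirms that every link is listed from both ends and that each benchmarked path (A→D, A→E, E→A) is a direct static-peer pair. The names `STATIC_PEERS`, `mesh`, `is_static_pair`, and `asymmetric_edges` are hypothetical and do not come from the repository.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Static-peer table copied from the topology listing (node, peers).
const STATIC_PEERS: &[(char, &str)] = &[
    ('A', "DE"),
    ('B', "C"),
    ('C', "BDE"),
    ('D', "ACE"),
    ('E', "ACD"),
];

fn mesh() -> BTreeMap<char, BTreeSet<char>> {
    STATIC_PEERS
        .iter()
        .map(|(node, peers)| (*node, peers.chars().collect()))
        .collect()
}

/// True when `b` is listed as a static peer of `a`.
fn is_static_pair(m: &BTreeMap<char, BTreeSet<char>>, a: char, b: char) -> bool {
    m.get(&a).map_or(false, |peers| peers.contains(&b))
}

/// Edges listed in one direction only; empty for a consistent table.
fn asymmetric_edges(m: &BTreeMap<char, BTreeSet<char>>) -> Vec<(char, char)> {
    let mut out = Vec::new();
    for (a, peers) in m {
        for b in peers {
            if !is_static_pair(m, *b, *a) {
                out.push((*a, *b));
            }
        }
    }
    out
}

fn main() {
    let m = mesh();
    println!("asymmetric edges: {:?}", asymmetric_edges(&m));
    for (src, dst) in [('A', 'D'), ('A', 'E'), ('E', 'A')] {
        println!("{src}→{dst} is a static-peer pair: {}", is_static_pair(&m, src, dst));
    }
}
```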

Diff shape

30 files, +6173 / -101. ~85 % of new code is in six new files that drop in alongside the existing transport stack:

| File | Lines | Role |
|------|-------|------|
| src/node/encrypt_worker.rs | +2238 | FMP+FSP AEAD-seal + sendmmsg / UDP_GSO worker pool |
| src/node/decrypt_worker.rs | +693 | shard-owned receive AEAD-open + replay window |
| src/transport/udp/peer_drain.rs | +457 | per-peer recv drain thread |
| src/transport/udp/connected_peer.rs | +436 | per-peer connect(2)-ed UDP socket |
| src/perf_profile.rs | +405 | optional per-stage timing reporter |
| src/transport/udp/darwin_sockopts.rs | +197 | macOS UDP tuning |

The rest is integration glue. All #[cfg(unix)]-gated; Windows continues on the existing tokio-based send/recv.

What's in scope

  • Off-task encrypt (encrypt_worker) — std::thread + crossbeam_channel; hash-by-destination dispatch pins a TCP flow to one worker so wire ordering is preserved; per-worker sendmmsg(2) batching up to 32; Linux uses sendmsg(2)+UDP_SEGMENT when packets in a group are uniform-size.
  • Off-task decrypt (decrypt_worker) — receive-side mirror. Each shard owns its session's recv cipher + replay window in a thread-local HashMap (no shared RwLock/Mutex). Sessions are handed off at promote_connection and re-registered on K-bit flip / rekey cutover.
  • FSP+FMP pipelined send — both AEAD layers seal in-place in the worker on a single wire-buffer allocation; no intermediate inner_plaintext / fsp_payload Vecs.
  • Per-peer connected UDP (Linux + macOS) — SO_REUSEPORT so the per-peer connected socket can bind to the same wildcard port the listen socket holds; the worker sends with msg_name = NULL and the kernel uses its cached 5-tuple (skips per-packet route + neighbor lookup). Tick-driven activation, idempotent.
  • Receive zero-copy: mem::replace the recvmmsg backing buffer instead of to_vec() per packet; SessionDatagramRef zero-copy view for local delivery; TransportAddr::from_socket_addr collapses two allocs to one.
  • rx_loop ordering — fallback drain promoted ahead of packet_rx in the select!; interleaved fallback drain every 32 packets inside the rx burst loop so TCP ACKs don't pile up behind a 256-packet inbound burst.
  • Worker pool sizing — both default to num_cpus, overridable via FIPS_ENCRYPT_WORKERS=N / FIPS_DECRYPT_WORKERS=N.
  • FIPS_PERF=1 — optional per-stage timing reporter. Off by default, zero overhead when disabled.
  • Bench harness (testing/static/scripts/bench-multirun.sh) — N reruns (default 5), median / min / max / CoV % / per-run outlier flag, avg ping RTT, ICMP loss %, TCP retransmit total. Pre-bench peer-convergence check + per-path route verification via stats.bytes_sent deltas — fails fast if traffic exits via a non-static-peer link.
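To illustrate the hash-by-destination dispatch from the first bullet, here is a minimal sketch of why pinning a destination to one worker preserves per-flow ordering. It is not the PR's `encrypt_worker` code: `std::sync::mpsc` stands in for `crossbeam_channel`, the choice of `DefaultHasher` is an assumption, and `worker_for` and `run` are hypothetical names. The workers here only record what they receive, where the real ones would seal and send.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;
use std::sync::mpsc;
use std::thread;

/// Pick a worker index for a destination: same destination, same worker.
fn worker_for(dest: &SocketAddr, workers: usize) -> usize {
    assert!(workers > 0, "need at least one worker");
    let mut h = DefaultHasher::new();
    dest.hash(&mut h);
    (h.finish() % workers as u64) as usize
}

/// Dispatch (dest, seq) jobs across `workers` threads and return, per worker,
/// the jobs in the order that worker processed them.
fn run(jobs: &[(SocketAddr, u32)], workers: usize) -> Vec<Vec<(SocketAddr, u32)>> {
    let mut txs = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..workers {
        let (tx, rx) = mpsc::channel::<(SocketAddr, u32)>();
        txs.push(tx);
        // Each worker drains its own FIFO queue until the sender is dropped.
        handles.push(thread::spawn(move || rx.into_iter().collect::<Vec<_>>()));
    }
    for job in jobs {
        txs[worker_for(&job.0, workers)]
            .send(*job)
            .expect("worker thread is alive");
    }
    drop(txs); // close the channels so the workers finish
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let a: SocketAddr = "10.0.0.1:2121".parse().unwrap();
    let b: SocketAddr = "10.0.0.2:2121".parse().unwrap();
    // Interleave two flows; each flow must come out in sequence order.
    let jobs: Vec<_> = (0..10u32)
        .map(|i| (if i % 2 == 0 { a } else { b }, i))
        .collect();
    for (w, seen) in run(&jobs, 4).iter().enumerate() {
        println!("worker {w}: {seen:?}");
    }
}
```

Because each worker consumes a single FIFO queue, packets for one destination cannot overtake each other. The cost is that one heavy flow is limited to one worker, which fits the PR's single-stream focus.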

Test plan

  • cargo fmt --check clean
  • cargo clippy --all-targets --all-features -- -D warnings clean on Linux (what CI runs)
  • cargo nextest run --all --profile ci — 1228 passed / 4 skipped / 0 failed
  • Session-specific tests 127 / 127, rekey-specific 9 / 9
  • bash -n on both modified shell scripts
  • CI to confirm macOS + Windows builds

mmalmi force-pushed the pr/sender-path-overhaul branch 2 times, most recently from 0852b4f to c3d7652 on May 15, 2026 16:23
@jmcorgan
Owner

This PR has gone DIRTY against current master. Three landings arrived after it opened: #89 macOS recvmsg_x at 59225cc, #90 rx zero-copy at b1af151, and the coord cache surgical-invalidation merge wave at f51dde6. A rebase onto current master is needed before this can land.

Once you've rebased, I'll do a detailed pass and follow up. Thanks.

Moves both AEAD layers (ChaCha20-Poly1305, one round per layer per
packet) plus the sendmsg syscall off the rx_loop task onto a per-shard
worker pool, adds per-peer connect(2)-ed UDP with SO_REUSEPORT, and
uses Linux UDP GSO (sendmsg+UDP_SEGMENT — kernel splits one super-skb
into N on-the-wire datagrams in a single TX-stack walk) when packets
in a batch are uniform-size. Same kernel primitive WireGuard's
in-kernel module and BoringTun use to hit 2.5–3.2 Gbps single-stream.
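The "uniform-size" condition can be pictured as splitting a pending batch into runs of consecutive equal-length packets. A run of two or more is a candidate for a single GSO send, and anything else falls back to the ordinary path. The sketch below shows only that grouping step, under assumptions: `uniform_runs` is a hypothetical helper, the cap of 32 in the example is borrowed from the PR's stated sendmmsg batch size rather than from any kernel limit, and no syscalls are issued.

```rust
/// Split a batch (given as packet lengths) into runs of consecutive
/// equal-length packets, each at most `max_segments` long.
/// Returns (start_index, packet_count, segment_len) per run.
fn uniform_runs(lens: &[usize], max_segments: usize) -> Vec<(usize, usize, usize)> {
    assert!(max_segments > 0, "max_segments must be positive");
    let mut runs = Vec::new();
    let mut i = 0;
    while i < lens.len() {
        let seg = lens[i];
        let mut n = 1;
        // Extend the run while the next packet has the same length.
        while i + n < lens.len() && lens[i + n] == seg && n < max_segments {
            n += 1;
        }
        runs.push((i, n, seg));
        i += n;
    }
    runs
}

fn main() {
    // Hypothetical batch: three full-size packets, one short one, two more full-size.
    let lens = [1400, 1400, 1400, 200, 1400, 1400];
    for (start, count, seg) in uniform_runs(&lens, 32) {
        let path = if count > 1 { "one GSO send" } else { "plain send" };
        println!("packets {start}..{}: {count} x {seg} bytes -> {path}", start + count);
    }
}
```

Linux's UDP_SEGMENT also accepts a shorter final segment, so a production grouping could be slightly more permissive than this strict equal-length rule. The sketch follows the stricter wording used in the PR.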

Single TCP stream on a 5-node docker-bridge mesh, 5 x 15 s x P=1:

  A→D:  1379 → 2708 Mbps  (1.96x, RTT +0.12 ms)
  A→E:  1394 → 2663 Mbps  (1.91x, RTT +0.11 ms)
  E→A:  1406 → 2624 Mbps  (1.87x, RTT +0.19 ms)

Static-peer pairs only — every CoV under 3%, 0 outliers, 0% ICMP
loss. The +100–200 µs RTT is the worker queue handoff cost; AEAD +
sendmmsg now run on a separate core in exchange.

What lands:

- src/node/encrypt_worker.rs: std::thread + crossbeam_channel
  workers; hash-by-destination dispatch pins a TCP flow to one
  worker so wire ordering is preserved; per-worker sendmmsg(2)
  batching up to 32; Linux uses sendmsg(2)+UDP_SEGMENT when
  packets in a group are uniform-size.

- src/node/decrypt_worker.rs: receive-side mirror. Each shard owns
  its session's recv cipher + replay window in a thread-local
  HashMap (no shared RwLock/Mutex). Sessions are handed off at
  promote_connection and re-registered on K-bit flip / rekey
  cutover.

- src/node/handlers/session.rs try_send_session_data_pipelined:
  FSP+FMP both seal in-place in the worker on one wire-buffer
  alloc; no intermediate inner_plaintext / fsp_payload Vecs.

- src/transport/udp/connected_peer.rs + peer_drain.rs: per-peer
  connect(2)-ed UDP socket with SO_REUSEPORT (set on the listen
  socket too — without that, EADDRINUSE on activation and every
  packet falls back to the wildcard path); the worker sends with
  msg_name=NULL and the kernel uses its cached 5-tuple. Tick-
  driven activation in handlers/connected_udp.rs, idempotent.

- src/transport/udp/mod.rs: mem::replace the recvmmsg backing buffer
  instead of buf.to_vec() per packet — single pointer swap, no
  MTU-sized memcpy.

- src/protocol/link.rs SessionDatagramRef: zero-copy borrowed view
  used by handle_session_datagram for the bulk local-delivery
  path; handle_session_payload takes the borrowed payload
  directly (no payload[35..].to_vec()).

- src/transport/mod.rs TransportAddr::from_socket_addr: collapses
  the two-alloc from_string(addr.to_string()) pattern to one.

- src/node/handlers/rx_loop.rs: decrypt-fallback drain promoted
  ahead of packet_rx in the select! (TCP ACK starvation fix);
  interleaved fallback drain every 32 packets inside the rx burst
  loop.

- noise::Session: send_cipher_clone / recv_cipher_clone /
  recv_replay_snapshot_owned / take_send_counter / accept_replay
  so off-task workers can hold a cloned cipher + reserved counter
  while the dispatcher keeps replay/counter sequencing serial.
  CipherState::cipher_clone returns a refcount-bumped LessSafeKey.
  AsyncUdpSocket: AsRawFd so workers issue raw sendmmsg / sendmsg
  without going through the tokio reactor.

- Worker pool sizing: both default to num_cpus, overridable via
  FIPS_ENCRYPT_WORKERS=N / FIPS_DECRYPT_WORKERS=N.

- src/perf_profile.rs: optional per-stage timing reporter under
  FIPS_PERF=1. Off by default; zero overhead when disabled.

- All cfg(unix)-gated. Windows continues on the existing tokio-
  based send/recv.
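The mem::replace item above trades a per-packet copy for a buffer swap. A minimal sketch of that pattern, assuming a hypothetical `RecvSlot` type and a hypothetical 2048-byte slot size (neither is taken from src/transport/udp/mod.rs):

```rust
use std::mem;

/// Hypothetical receive-slot size, chosen for illustration only.
const BUF_SIZE: usize = 2048;

/// One receive slot that the socket read fills in place.
struct RecvSlot {
    buf: Vec<u8>,
}

impl RecvSlot {
    fn new() -> Self {
        Self { buf: vec![0u8; BUF_SIZE] }
    }

    /// Hand the filled buffer to the caller and leave a fresh one in the slot.
    /// The packet bytes are moved, not copied: the returned Vec reuses the
    /// allocation that the receive call wrote into.
    fn take(&mut self, len: usize) -> Vec<u8> {
        let mut full = mem::replace(&mut self.buf, vec![0u8; BUF_SIZE]);
        full.truncate(len);
        full
    }
}

fn main() {
    let mut slot = RecvSlot::new();
    slot.buf[..5].copy_from_slice(b"hello"); // stand-in for a 5-byte datagram
    let before = slot.buf.as_ptr();
    let pkt = slot.take(5);
    // Same allocation as before the swap, so no payload copy took place.
    assert_eq!(pkt.as_ptr(), before);
    assert_eq!(pkt, b"hello");
    println!("took {} bytes, slot refilled with {} bytes", pkt.len(), slot.buf.len());
}
```

The swap still allocates a replacement buffer. Whether the actual code allocates fresh each time or recycles buffers is not stated in the PR, so treat the fresh `vec!` here as an assumption.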

Testing:

- testing/static/scripts/bench-multirun.sh: multi-run iperf3 +
  ping bench. N reruns (default 5), median / min / max / CoV % /
  per-run outlier flag, avg ping RTT, ICMP loss %, TCP retransmit
  total. Plain client→dest labels + topology header. Pre-bench
  peer-convergence check (FIPS_BENCH_CONVERGE_SECS, default 15);
  per-path route verification via stats.bytes_sent deltas — fails
  fast if traffic exits via a non-static-peer link.

- testing/static/docker-compose.yml: passes FIPS_ENCRYPT_WORKERS /
  FIPS_DECRYPT_WORKERS / FIPS_PERF through to containers for A/B
  benchmarking without rebuilds.

- testing/static/scripts/iperf-test.sh: same plain client→dest
  labels + topology header (was multihop/direct/N hop, which
  conflated topology distance with on-wire path).

- .config/nextest.toml: synthetic UDP node tests serialized
  through a max-threads=1 test group. Localhost handshakes drop
  on shared CI runners under parallel load; one-at-a-time keeps
  assertions reliable.

- src/node/tests/spanning_tree.rs: repair_missing_edge_handshakes
  — retries up to 5 times for synthetic edges whose msg1 was
  dropped, with a drain after each edge retry instead of after
  each attempt's full burst.

Cherry-picks from mmalmi/master (paths translated from
crates/fips-core/src/ to src/): 9b7c723, 0deb5cb, 13f7339, e036c0e,
3740a68, 3792f83, 8510193, 4910b07, e53f545, e4e2896, 5fe4af5,
1d01ada, 8c37008, e12469e, 6eb2860.
mmalmi force-pushed the pr/sender-path-overhaul branch from c3d7652 to 2b6d402 on May 19, 2026 07:14
Contributor Author

mmalmi commented May 19, 2026

Rebased onto current master (2b6d402). Status is now MERGEABLE / CLEAN.

Conflicts resolved:

  • src/protocol/link.rs: master's #90 ("rx: avoid copies in receive hot paths") already added SessionDatagramRef (b1af151); dropped the PR's duplicate definition and kept master's slightly cleaner SessionDatagram::decode → SessionDatagramRef::decode + into_owned() flow.
  • src/node/handlers/forwarding.rs: same situation; kept master's ref-based handle_session_datagram (uses into_owned(); both branches independently arrived at the same zero-copy ref pattern).
  • src/transport/mod.rs, src/transport/udp/mod.rs: comment-only conflicts; kept master's comment in TransportAddr::from_socket_addr, kept the PR's mem::replace explanation in the recv loop.

Verified on Linux x86_64 (matches CI target):

  • cargo fmt --check clean
  • cargo clippy --all-targets --all-features -- -D warnings clean
  • cargo nextest run --all --profile ci: 1247 passed / 6 skipped / 0 failed (222s)

Also built on Windows — succeeds, same pre-existing Windows-only dead-code warnings as before the rebase (process_authentic_fmp_plaintext only referenced from cfg(unix) worker code; ESTABLISHED_HEADER_SIZE import unused on non-unix). Not regressions from the rebase, but happy to gate them on the next pass if you'd like.
