Data-plane perf overhaul (+~2× single-stream TCP) #91
0852b4f to c3d7652
This PR has gone DIRTY against current master. Three changes landed after it was opened: #89 (macOS recvmsg_x) at 59225cc, #90 (rx zero-copy) at b1af151, and the coord cache surgical-invalidation merge wave at f51dde6. It needs a rebase onto current master before it can land. Once you've rebased, I'll do a detailed pass and follow up. Thanks.
Moves both AEAD layers (ChaCha20-Poly1305, one round per layer per packet) plus the sendmsg syscall off the rx_loop task onto a per-shard worker pool, adds per-peer connect(2)-ed UDP with SO_REUSEPORT, and uses Linux UDP GSO (sendmsg + UDP_SEGMENT: the kernel splits one super-skb into N on-the-wire datagrams in a single TX-stack walk) when packets in a batch are uniform-size. Same kernel primitive WireGuard's in-kernel module and BoringTun use to hit 2.5–3.2 Gbps single-stream.

Single TCP stream on a 5-node docker-bridge mesh, 5 x 15 s x P=1:

  A→D: 1379 → 2708 Mbps (1.96x, RTT +0.12 ms)
  A→E: 1394 → 2663 Mbps (1.91x, RTT +0.11 ms)
  E→A: 1406 → 2624 Mbps (1.87x, RTT +0.19 ms)

Static-peer pairs only: every CoV under 3%, 0 outliers, 0% ICMP loss. The ~+100 µs RTT is the worker queue handoff cost; AEAD + sendmmsg now run on a separate core in exchange.

What lands:

- src/node/encrypt_worker.rs: std::thread + crossbeam_channel workers; hash-by-destination dispatch pins a TCP flow to one worker so wire ordering is preserved; per-worker sendmmsg(2) batching up to 32; Linux uses sendmsg(2) + UDP_SEGMENT when packets in a group are uniform-size.
- src/node/decrypt_worker.rs: receive-side mirror. Each shard owns its session's recv cipher + replay window in a thread-local HashMap (no shared RwLock/Mutex). Sessions are handed off at promote_connection and re-registered on K-bit flip / rekey cutover.
- src/node/handlers/session.rs try_send_session_data_pipelined: FSP+FMP both seal in-place in the worker on one wire-buffer alloc; no intermediate inner_plaintext / fsp_payload Vecs.
- src/transport/udp/connected_peer.rs + peer_drain.rs: per-peer connect(2)-ed UDP socket with SO_REUSEPORT (set on the listen socket too; without that, EADDRINUSE on activation and every packet falls back to the wildcard path); the worker sends with msg_name=NULL and the kernel uses its cached 5-tuple. Tick-driven activation in handlers/connected_udp.rs, idempotent.
- src/transport/udp/mod.rs: mem::replace the recvmmsg backing buffer instead of buf.to_vec() per packet (single pointer swap, no MTU-sized memcpy).
- src/protocol/link.rs SessionDatagramRef: zero-copy borrowed view used by handle_session_datagram for the bulk local-delivery path; handle_session_payload takes the borrowed payload directly (no payload[35..].to_vec()).
- src/transport/mod.rs TransportAddr::from_socket_addr: collapses the two-alloc from_string(addr.to_string()) pattern to one.
- src/node/handlers/rx_loop.rs: decrypt-fallback drain promoted ahead of packet_rx in the select! (TCP ACK starvation fix); interleaved fallback drain every 32 packets inside the rx burst loop.
- noise::Session: send_cipher_clone / recv_cipher_clone / recv_replay_snapshot_owned / take_send_counter / accept_replay so off-task workers can hold a cloned cipher + reserved counter while the dispatcher keeps replay/counter sequencing serial. CipherState::cipher_clone returns a refcount-bumped LessSafeKey.
- AsyncUdpSocket: AsRawFd so workers issue raw sendmmsg / sendmsg without going through the tokio reactor.
- Worker pool sizing: both default to num_cpus, overridable via FIPS_ENCRYPT_WORKERS=N / FIPS_DECRYPT_WORKERS=N.
- src/perf_profile.rs: optional per-stage timing reporter under FIPS_PERF=1. Off by default; zero overhead when disabled.
- All cfg(unix)-gated. Windows continues on the existing tokio-based send/recv.

Testing:

- testing/static/scripts/bench-multirun.sh: multi-run iperf3 + ping bench. N reruns (default 5), median / min / max / CoV % / per-run outlier flag, avg ping RTT, ICMP loss %, TCP retransmit total. Plain client→dest labels + topology header. Pre-bench peer-convergence check (FIPS_BENCH_CONVERGE_SECS, default 15); per-path route verification via stats.bytes_sent deltas, which fails fast if traffic exits via a non-static-peer link.
- testing/static/docker-compose.yml: passes FIPS_ENCRYPT_WORKERS / FIPS_DECRYPT_WORKERS / FIPS_PERF through to containers for A/B benchmarking without rebuilds.
- testing/static/scripts/iperf-test.sh: same plain client→dest labels + topology header (was multihop/direct/N hop, which conflated topology distance with on-wire path).
- .config/nextest.toml: synthetic UDP node tests serialized through a max-threads=1 test group. Localhost handshakes drop on shared CI runners under parallel load; one-at-a-time keeps assertions reliable.
- src/node/tests/spanning_tree.rs: repair_missing_edge_handshakes retries up to 5 times for synthetic edges whose msg1 was dropped, with a drain after each edge retry instead of after each attempt's full burst.

Cherry-picks from mmalmi/master (paths translated from crates/fips-core/src/ to src/): 9b7c723, 0deb5cb, 13f7339, e036c0e, 3740a68, 3792f83, 8510193, 4910b07, e53f545, e4e2896, 5fe4af5, 1d01ada, 8c37008, e12469e, 6eb2860.
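The hash-by-destination dispatch described for encrypt_worker.rs can be illustrated with a minimal std-only sketch. It is not the code in this PR: `shard_for` and `Dispatcher` are hypothetical names, `std::sync::mpsc` stands in for `crossbeam_channel`, and hashing the destination `SocketAddr` with `DefaultHasher` is an assumption about how the destination is keyed. What it shows is that a fixed destination always maps to the same FIFO queue, so packets of one flow cannot be reordered between workers.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;
use std::sync::mpsc::{channel, Receiver, Sender};

/// A queued packet: destination plus wire bytes (illustrative type).
pub type Job = (SocketAddr, Vec<u8>);

/// Hypothetical helper: map a destination address to a worker index.
/// The same destination always yields the same index for a given pool size.
pub fn shard_for(dest: &SocketAddr, workers: usize) -> usize {
    assert!(workers > 0, "need at least one worker");
    let mut hasher = DefaultHasher::new();
    dest.hash(&mut hasher);
    (hasher.finish() % workers as u64) as usize
}

/// Hypothetical dispatcher: one FIFO queue per worker.
pub struct Dispatcher {
    queues: Vec<Sender<Job>>,
}

impl Dispatcher {
    /// Builds the dispatcher and returns the receiving ends,
    /// one per worker thread.
    pub fn new(workers: usize) -> (Self, Vec<Receiver<Job>>) {
        let (txs, rxs): (Vec<Sender<Job>>, Vec<Receiver<Job>>) =
            (0..workers).map(|_| channel::<Job>()).unzip();
        (Self { queues: txs }, rxs)
    }

    /// Enqueues a packet on the worker chosen by its destination and
    /// returns that worker's index.
    pub fn dispatch(&self, dest: SocketAddr, packet: Vec<u8>) -> usize {
        let shard = shard_for(&dest, self.queues.len());
        // A send error only means the worker has exited; this sketch
        // drops the packet in that case.
        let _ = self.queues[shard].send((dest, packet));
        shard
    }
}
```

Because each queue is FIFO and drained by a single worker, ordering within a flow is preserved. The trade-off is that a single heavy flow stays on one core.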
c3d7652 to 2b6d402
Rebased onto current master (2b6d402). Status is now MERGEABLE / CLEAN. Conflicts resolved:
Verified on Linux x86_64 (matches CI target):
Also built on Windows — succeeds, same pre-existing Windows-only dead-code warnings as before the rebase ( |
Data-plane perf overhaul: off-task encrypt + decrypt, GSO, connected UDP
TL;DR
Moves both AEAD layers (AEAD = Authenticated Encryption with Associated Data; ChaCha20-Poly1305 in FIPS, one round per layer per packet) plus the `sendmsg` syscall off the rx_loop task onto a per-shard worker pool, adds per-peer `connect(2)`-ed UDP with `SO_REUSEPORT`, and uses Linux UDP GSO (Generic Segmentation Offload: the kernel splits one large super-skb into N on-the-wire datagrams in a single trip through the TX stack) when packets in a batch are uniform-size. GSO is the same kernel primitive WireGuard's in-kernel module and Cloudflare's BoringTun userspace tunnel use to hit 2.5–3.2 Gbps single-stream.

Single TCP stream: ~1.9× across every static-peer path.
`testing/static/scripts/bench-multirun.sh`, 5 × 15 s × 1 stream medians on a Linux x86_64 docker-bridge mesh (8 vCPU / 24 GiB, kernel 7.0). All CoV < 3 %, 0 outliers (Δ > 20 % from median), 0 % ICMP loss on both branches. The +100–200 µs RTT is the worker queue handoff added by moving AEAD off rx_loop.

Topology
5 nodes on one docker-bridge subnet (`testing/static/configs/topologies/mesh.yaml`). Static `peers:`:

Diff shape
30 files, +6173 / -101. ~85 % of new code is in six new files that drop in alongside the existing transport stack:
- `src/node/encrypt_worker.rs`
- `src/node/decrypt_worker.rs`
- `src/transport/udp/peer_drain.rs`
- `src/transport/udp/connected_peer.rs`: `connect(2)`-ed UDP socket
- `src/perf_profile.rs`
- `src/transport/udp/darwin_sockopts.rs`

The rest is integration glue. All `#[cfg(unix)]`-gated; Windows continues on the existing tokio-based send/recv.

What's in scope
- Off-task encrypt (`encrypt_worker`): `std::thread` + `crossbeam_channel`; hash-by-destination dispatch pins a TCP flow to one worker so wire ordering is preserved; per-worker `sendmmsg(2)` batching up to 32; Linux uses `sendmsg(2)` + `UDP_SEGMENT` when packets in a group are uniform-size.
- Off-task decrypt (`decrypt_worker`): receive-side mirror. Each shard owns its session's recv cipher + replay window in a thread-local `HashMap` (no shared `RwLock`/`Mutex`). Sessions are handed off at `promote_connection` and re-registered on K-bit flip / rekey cutover.
- FSP+FMP both seal in-place in the worker on one wire-buffer alloc; no intermediate `inner_plaintext` / `fsp_payload` Vecs.
- Per-peer `connect(2)`-ed UDP with `SO_REUSEPORT` so the per-peer connected socket can bind to the same wildcard port the listen socket holds; the worker sends with `msg_name = NULL` and the kernel uses its cached 5-tuple (skips per-packet route + neighbor lookup). Tick-driven activation, idempotent.
- `mem::replace` the recvmmsg backing buffer instead of `to_vec()` per packet; `SessionDatagramRef` zero-copy view for local delivery; `TransportAddr::from_socket_addr` collapses two allocs to one.
- Decrypt-fallback drain promoted ahead of `packet_rx` in the `select!`; interleaved fallback drain every 32 packets inside the rx burst loop so TCP ACKs don't pile up behind a 256-packet inbound burst.
- Worker pool sizing: both default to `num_cpus`, overridable via `FIPS_ENCRYPT_WORKERS=N` / `FIPS_DECRYPT_WORKERS=N`.
- `FIPS_PERF=1`: optional per-stage timing reporter. Off by default, zero overhead when disabled.
- Multi-run bench (`testing/static/scripts/bench-multirun.sh`): N reruns (default 5), median / min / max / CoV % / per-run outlier flag, avg ping RTT, ICMP loss %, TCP retransmit total. Pre-bench peer-convergence check + per-path route verification via `stats.bytes_sent` deltas, which fails fast if traffic exits via a non-static-peer link.

Test plan
- `cargo fmt --check` clean
- `cargo clippy --all-targets --all-features -- -D warnings` clean on Linux (what CI runs)
- `cargo nextest run --all --profile ci`: 1228 passed / 4 skipped / 0 failed
- `bash -n` on both modified shell scripts
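The `mem::replace` buffer swap listed under "What's in scope" might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in `src/transport/udp/mod.rs`: `RecvSlot`, `fill`, `copy_out`, and `take` are hypothetical names, `fill` merely stands in for the kernel writing a datagram via recvmmsg, and allocating a fresh zeroed replacement buffer on every swap is an assumption (a real implementation could draw it from a pool instead).

```rust
use std::mem;

/// Hypothetical receive slot: one MTU-sized buffer that a receive call fills.
pub struct RecvSlot {
    buf: Vec<u8>,
    len: usize, // bytes written by the last receive
    mtu: usize,
}

impl RecvSlot {
    pub fn new(mtu: usize) -> Self {
        Self { buf: vec![0u8; mtu], len: 0, mtu }
    }

    /// Stand-in for the kernel writing a datagram into the slot.
    /// Datagrams longer than the MTU are truncated.
    pub fn fill(&mut self, datagram: &[u8]) {
        let n = datagram.len().min(self.mtu);
        self.buf[..n].copy_from_slice(&datagram[..n]);
        self.len = n;
    }

    /// Copying variant: duplicates the received bytes into a new Vec.
    pub fn copy_out(&self) -> Vec<u8> {
        self.buf[..self.len].to_vec()
    }

    /// Swap variant: hands the filled buffer to the caller and installs
    /// a fresh one in its place, so the payload bytes are never copied.
    pub fn take(&mut self) -> Vec<u8> {
        let mut packet = mem::replace(&mut self.buf, vec![0u8; self.mtu]);
        packet.truncate(self.len);
        self.len = 0;
        packet
    }
}
```

Both variants return the same bytes. The swap variant moves ownership of the existing allocation to the caller, which is where the saving over a per-packet `to_vec()` would come from.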