A clean-room WireGuard implementation in Rust — userspace and kernel module. No libwg. Cross-platform userspace, Linux kernel module targeting 6.10+.
8,500+ lines of Rust · 80 tests · 7 crates · userspace + kernel module
Built as an experiment to understand WireGuard internals by implementing the full protocol from scratch, pushing userspace performance to its limits, then going in-kernel to eliminate the TUN bottleneck entirely.
rustguard-crypto/ X25519, ChaCha20-Poly1305, HMAC-BLAKE2s, HKDF, TAI64N (dual std/no_std)
rustguard-core/ Noise_IK handshake, transport sessions, replay window, timers (dual std/no_std)
rustguard-tun/ macOS utun, Linux TUN, multi-queue TUN, AF_XDP, io_uring, BPF loader
rustguard-daemon/ Standard wg.conf tunnel mode (rustguard up)
rustguard-enroll/ Zero-config enrollment, IP pool, Zigbee-style pairing, persistence
rustguard-cli/ CLI: up, serve, join, open, close, status, genkey, pubkey
rustguard-kmod/ Linux kernel module (Rust + C shim, out-of-tree, targets 6.10+)
Commit 1 — Working tunnel in pure Rust
Started with the WireGuard whitepaper and built up from cryptographic primitives. X25519 key exchange, ChaCha20-Poly1305 AEAD, BLAKE2s hashing with HKDF key derivation, TAI64N timestamps. Implemented the Noise_IKpsk2 handshake (initiation → response → transport), wire format serialization, and a macOS utun TUN device via raw kernel control sockets. Wired it all into a tunnel loop: TUN reads plaintext → encrypt → UDP send, UDP recv → decrypt → TUN write.
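The wire format is small enough to sketch. Below is a minimal, self-contained example (not the project's actual code) of the transport-data header from the whitepaper: a 1-byte message type (4 for transport data), 3 reserved zero bytes, a 4-byte receiver index, and an 8-byte nonce counter, all multi-byte fields little-endian, followed by the ciphertext.

```rust
// Sketch of the WireGuard transport-data (type 4) header layout.
// Little-endian fields per the whitepaper; ciphertext follows the header.
fn encode_transport_header(receiver_index: u32, counter: u64) -> [u8; 16] {
    let mut h = [0u8; 16];
    h[0] = 4; // message type: transport data
    // h[1..4] stay zero (reserved)
    h[4..8].copy_from_slice(&receiver_index.to_le_bytes());
    h[8..16].copy_from_slice(&counter.to_le_bytes());
    h
}

fn decode_transport_header(h: &[u8; 16]) -> Option<(u32, u64)> {
    if h[0] != 4 || h[1..4] != [0, 0, 0] {
        return None; // wrong type or non-zero reserved bytes
    }
    let recv = u32::from_le_bytes(h[4..8].try_into().ok()?);
    let ctr = u64::from_le_bytes(h[8..16].try_into().ok()?);
    Some((recv, ctr))
}
```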
First test: two peers on localhost, ICMP ping through utun interfaces. 1.2ms RTT. It worked.
Commit 2 — Replay protection, timers, Linux TUN
Added the 2048-bit sliding window anti-replay bitmap. Built the timer state machine (rekey after 120s or 2^60 messages, keepalive, handshake retry with jitter, session expiry). Ported TUN to Linux (/dev/net/tun with IFF_TUN | IFF_NO_PI). Added signal handling for clean shutdown with route cleanup. Replaced the nanosecond clock hack for sender indices with a proper CSPRNG.
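The anti-replay logic can be sketched in a few lines. This is a simplified illustration, not the project's code: it uses a single u64 window instead of the 2048-bit bitmap, and splits the side-effect-free check() from the update() that runs only after AEAD verification, so forged counters can never advance the window.

```rust
// Minimal sliding-window anti-replay sketch (RFC 6479-style logic,
// shrunk from 2048 bits to one u64 for readability).
struct ReplayWindow {
    highest: u64, // highest counter accepted so far
    bitmap: u64,  // bit i set => counter (highest - i) already seen
}

impl ReplayWindow {
    const SIZE: u64 = 64;

    fn new() -> Self {
        Self { highest: 0, bitmap: 0 }
    }

    // Pure check: safe to call before the AEAD tag is verified.
    fn check(&self, counter: u64) -> bool {
        if counter == 0 {
            return false; // counters start at 1 in this sketch
        }
        if counter > self.highest {
            return true; // ahead of the window: always fresh
        }
        let delta = self.highest - counter;
        if delta >= Self::SIZE {
            return false; // fell off the back of the window
        }
        self.bitmap & (1 << delta) == 0 // not yet marked seen
    }

    // Called only after decryption succeeds.
    fn update(&mut self, counter: u64) {
        if counter > self.highest {
            let shift = counter - self.highest;
            self.bitmap = if shift >= Self::SIZE { 0 } else { self.bitmap << shift };
            self.bitmap |= 1;
            self.highest = counter;
        } else {
            self.bitmap |= 1 << (self.highest - counter);
        }
    }
}
```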
Tested on two Debian 12 VMs on a bridged network. 0% packet loss, <1ms RTT.
Commit 3 — Cookie mechanism, PSK support, XChaCha20
Implemented WireGuard's DoS protection: CookieChecker (server side, rotating secrets) and CookieState (client side). XChaCha20-Poly1305 for cookie encryption. MAC1 always present, MAC2 required under load. Added pre-shared key support with backward-compatible wrappers. Cookie Reply message (type 3) with full wire format.
Commit 4 — IPv6 dual-stack
Extended the tunnel to handle both IPv4 and IPv6 packets based on IP version byte inspection. Dual-stack TUN configuration on both macOS (ifconfig inet6) and Linux (ip -6 addr add). AllowedIPs parser handles both v4 and v6 CIDR ranges. Route table updated for both families.
Dual-stack ping (v4 + v6) between Debian VMs, 0% packet loss.
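The dispatch itself is one nibble. A minimal sketch of the version-byte inspection described above (illustrative names, not the project's API):

```rust
// The first byte of a raw IP packet read from TUN carries the IP
// version in its high nibble: 4 for IPv4, 6 for IPv6.
#[derive(Debug, PartialEq)]
enum IpVersion {
    V4,
    V6,
}

fn ip_version(packet: &[u8]) -> Option<IpVersion> {
    match packet.first()? >> 4 {
        4 => Some(IpVersion::V4),
        6 => Some(IpVersion::V6),
        _ => None, // not an IP packet: drop it
    }
}
```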
Commit 5 — Critical and high fixes (OWASP-style audit)
Went through the codebase with an adversarial mindset:
- HMAC-BLAKE2s was wrong. We were using keyed BLAKE2s directly instead of RFC 2104 HMAC (ipad/opad double-hash), so the KDF output was incorrect and non-interoperable. Fixed.
- MAC1 checked too late. DH operations happened before MAC1 verification — an attacker could burn CPU with unauthenticated packets. Moved the MAC1 check first.
- Timestamp replay not enforced. Stale handshake initiations were accepted, enabling session hijack via captured packets. Fixed.
- Replay window poisoning. The window was updated before AEAD verification — garbage counters could advance it. Split into check()/update().
- FD leaks on TUN creation error paths (both platforms).
- Linux ifreq padding was 12 bytes instead of 20 — reading stack garbage for MTU.
- Peer lookup operator precedence bug for handshake responses.
- constant_time_eq now uses black_box to prevent LLVM from optimizing out the comparison.
Commit 6 — Medium fixes
Replaced hand-rolled constant-time comparisons with subtle::ConstantTimeEq. Swapped /dev/urandom for the getrandom crate. Added ZeroizeOnDrop on handshake state (chaining key, hash, PSK). encrypt() returns Option instead of panicking on nonce exhaustion. O_CLOEXEC on all file descriptors.
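The nonce-exhaustion change is simple but load-bearing: nonce reuse under the same key breaks ChaCha20-Poly1305 entirely. A hypothetical sketch (names are illustrative, not the crate's API):

```rust
// A send counter that surfaces exhaustion instead of panicking:
// at u64::MAX the session must rekey, never wrap.
struct SendCounter(u64);

impl SendCounter {
    fn next(&mut self) -> Option<u64> {
        if self.0 == u64::MAX {
            return None; // exhausted: caller triggers a rekey
        }
        let n = self.0;
        self.0 += 1;
        Some(n)
    }
}
```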
Commit 7 — serve + join
The standard WireGuard setup flow (generate keys on both sides, exchange public keys, assign IPs, write config files) is painful for homelabs. Built a new rustguard-enroll crate with a custom enrollment protocol: token-derived XChaCha20 key encrypts the key exchange, CIDR-based IP pool allocator (server gets .1, clients get sequential IPs).
# Server
rustguard serve --pool 10.150.0.0/24 --token mysecret
# Client (any machine)
rustguard join 1.2.3.4:51820 --token mysecret
No config files. No manual key exchange. No IP planning. Backward compatible — rustguard up wg0.conf still works.
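The pool allocator described above can be sketched in a few lines, assuming sequential assignment. Names and the exact policy (skip network and broadcast, server takes .1, clients start at .2) are illustrative, not the crate's actual API.

```rust
use std::net::Ipv4Addr;

// Sequential CIDR pool: server gets host .1, clients get the next
// free host address in order.
struct IpPool {
    base: u32,  // network address as an integer, e.g. 10.150.0.0
    hosts: u32, // usable host count for the prefix
    next: u32,  // next host offset to hand out (server already holds 1)
}

impl IpPool {
    fn new(network: Ipv4Addr, prefix: u32) -> Self {
        let hosts = (1u32 << (32 - prefix)) - 2; // minus network + broadcast
        Self { base: u32::from(network), hosts, next: 2 }
    }

    fn server_ip(&self) -> Ipv4Addr {
        Ipv4Addr::from(self.base + 1)
    }

    fn allocate(&mut self) -> Option<Ipv4Addr> {
        if self.next > self.hosts {
            return None; // pool exhausted
        }
        let ip = Ipv4Addr::from(self.base + self.next);
        self.next += 1;
        Some(ip)
    }
}
```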
Commit 8 — Zigbee-style pairing window
Always-open enrollment felt wrong. Added a UNIX domain control socket for runtime management. The server starts with enrollment closed. Physical-presence model:
rustguard open 60 # open enrollment for 60 seconds
rustguard close # close immediately
rustguard status # show window state + peer count
Window auto-closes on expiry via atomic timestamp. Existing peers unaffected.
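The auto-close needs no background thread. A sketch of the idea, assuming a single atomic deadline in seconds (illustrative names, not the crate's code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Enrollment window as one atomic deadline: open(n) publishes now+n,
// is_open() compares against now, so expiry is implicit.
struct EnrollWindow {
    deadline: AtomicU64, // UNIX seconds; 0 = closed
}

impl EnrollWindow {
    fn now() -> u64 {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
    }

    fn open(&self, secs: u64) {
        self.deadline.store(Self::now() + secs, Ordering::SeqCst);
    }

    fn close(&self) {
        self.deadline.store(0, Ordering::SeqCst);
    }

    fn is_open(&self) -> bool {
        Self::now() < self.deadline.load(Ordering::SeqCst)
    }
}
```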
Commit 9 — Persistence + benchmarks
Server state (peer keys, assigned IPs) persisted to ~/.rustguard/state.json. Survives restarts. Added benchmark infrastructure.
The interesting part. Test rig: two Debian 12 VMs on an n305 hypervisor, Intel i225 2.5G NICs passed through (enp7s0, 192.168.99.1/2). All benchmarks with iperf3, single TCP stream unless noted.
Baseline (no VPN): 2,356 Mbps
Kernel WireGuard: 1,096 Mbps (46.5% of wire speed)
Commit 10 — Lock optimization
Replaced global Mutex with RwLock for peer table (concurrent reads for packet routing) and per-peer Mutex for session state. Arc<EnrolledPeer> for lock-free references after lookup. In-place AEAD (seal_to / open_to) to avoid heap allocation per packet.
Result: 175 Mbps. TUN syscall overhead dominates, not locking.
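The locking layout is worth a concrete sketch (simplified types, not the project's actual structures): an RwLock over the peer map allows concurrent read-only lookups, each peer's mutable session state lives behind its own Mutex, and Arc lets a worker drop the table lock before touching the peer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

struct Session {
    send_counter: u64,
}

struct Peer {
    session: Mutex<Session>,
}

// Many packet threads can hold the read lock simultaneously;
// writes (enrollment) take the write lock rarely.
type PeerTable = RwLock<HashMap<[u8; 32], Arc<Peer>>>;

fn handle_packet(table: &PeerTable, pubkey: &[u8; 32]) -> Option<u64> {
    // Read lock held only for the lookup; clone the Arc, then release it.
    let peer = table.read().unwrap().get(pubkey)?.clone();
    // Per-peer lock: packets for other peers proceed in parallel.
    let mut s = peer.session.lock().unwrap();
    s.send_counter += 1;
    Some(s.send_counter)
}
```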
Commit 11 — recvmmsg batched UDP
recvmmsg with MSG_WAITFORONE — receive up to 32 datagrams per syscall on the inbound path. macOS fallback to single recvfrom.
Result: 173 Mbps. Confirmed: TUN read/write is the bottleneck, not UDP.
Commit 12 — AF_XDP zero-copy sockets
Built a full AF_XDP stack: UMEM allocation (mmap'd shared memory), RX/TX/Fill/Completion ring buffers with atomics, XDP_USE_NEED_WAKEUP for minimal syscalls, frame pool for TX buffers. All unsafe contained behind a safe Rust API.
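The ring discipline AF_XDP uses is the same on all four rings: free-running producer/consumer indices, slots addressed modulo a power-of-two size, release/acquire ordering between the two sides. The real rings live in UMEM-mapped memory shared with the kernel; this sketch runs the same index logic over a plain Vec.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Single-producer/single-consumer ring with free-running u32 indices,
// mirroring the AF_XDP ring layout (producer, consumer, descriptor array).
struct Ring {
    producer: AtomicU32,
    consumer: AtomicU32,
    slots: Vec<AtomicU32>, // frame descriptors in a real ring
}

impl Ring {
    fn new(size: usize) -> Self {
        assert!(size.is_power_of_two());
        Self {
            producer: AtomicU32::new(0),
            consumer: AtomicU32::new(0),
            slots: (0..size).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    fn push(&self, desc: u32) -> bool {
        let p = self.producer.load(Ordering::Relaxed);
        let c = self.consumer.load(Ordering::Acquire);
        if p.wrapping_sub(c) as usize == self.slots.len() {
            return false; // ring full
        }
        self.slots[p as usize & (self.slots.len() - 1)].store(desc, Ordering::Relaxed);
        // Publish the slot before advancing the producer index.
        self.producer.store(p.wrapping_add(1), Ordering::Release);
        true
    }

    fn pop(&self) -> Option<u32> {
        let c = self.consumer.load(Ordering::Relaxed);
        let p = self.producer.load(Ordering::Acquire);
        if c == p {
            return None; // ring empty
        }
        let desc = self.slots[c as usize & (self.slots.len() - 1)].load(Ordering::Relaxed);
        self.consumer.store(c.wrapping_add(1), Ordering::Release);
        Some(desc)
    }
}
```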
Commit 13 — BPF loader + XDP filter
Wrote a minimal BPF ELF loader (no libbpf dependency) that parses section headers, extracts programs, and patches map relocations. The XDP program filters WireGuard packets (UDP port 51820, message types 1-4) and redirects them to the AF_XDP socket. Non-WireGuard traffic passes through the kernel stack normally.
Commit 14 — AF_XDP wired into the server
rustguard serve --pool 10.150.0.0/24 --token s --xdp enp7s0 — the full pipeline. BPF program loaded, XDP filter attached, AF_XDP socket receiving WireGuard packets zero-copy.
Commit 15 — BPF ELF relocation fix
The BPF loader's relocation patching was broken on real hardware (worked in VMs by luck). Fixed the ELF section parsing to correctly patch map FD references in BPF instructions.
AF_XDP result: ~175 Mbps. Same as before — TUN is still the bottleneck. UDP path is now zero-copy but it doesn't matter when the slow side is TUN read/write syscalls.
Commit 16 — Multi-queue TUN
IFF_MULTI_QUEUE on Linux opens N TUN file descriptors for the same interface. N outbound worker threads, each with its own TUN fd and UDP socket. Inbound still single-threaded (one UDP recv loop dispatching to peers).
rustguard serve --pool 10.150.0.0/24 --token s --queues 2
Result: 472 Mbps with 2 queues. Nearly 3× improvement. This is the real win — parallelizing TUN I/O across cores.
Commit 17 — io_uring TUN engine
Replaced blocking read()/write() on TUN fds with io_uring. Pre-registered buffer pool (256 slots × 2KB). Pre-fill SQ with ReadFixed SQEs. On completion: process packet, push WriteFixed SQE. One io_uring_enter() per batch instead of per packet.
Result: 225 Mbps. Slower than multi-queue. At 30K packets/sec, the io_uring submission overhead exceeds the savings from batching. io_uring wins at higher packet rates (small packets, many flows) but loses to raw multi-queue for bulk throughput.
The userspace experiments proved TUN is the bottleneck. The only way to close the gap: stop leaving the kernel.
Commits 18-28 — Rust kernel module
Built an out-of-tree Linux kernel module in Rust targeting 6.10+:
- no_std refactor: Made rustguard-crypto and rustguard-core dual std/no_std with feature flags. Handshake functions gained _with variants that accept an explicit RNG and timestamps instead of relying on OsRng/SystemTime.
- C shim architecture: The kernel's Rust bindings (6.12) don't expose networking or crypto APIs. Solution: thin C shim files handle net_device registration (wg_net.c), crypto (wg_crypto.c), and UDP sockets (wg_socket.c). Rust handles the WireGuard protocol logic and packet routing.
- Crypto: Uses the kernel's own chacha20poly1305 library (the same functions Jason uses in the C WireGuard module) — not the kernel crypto API. Direct buffer-oriented calls, no scatterlists, no TFM allocation on the hot path.
- Net device: Registers wg0 as a POINTOPOINT | NOARP interface with MTU 1420. ndo_start_xmit calls into Rust for encryption. The UDP encap_rcv callback calls into Rust for decryption.
- The skb_linearize lesson: Large packets (full MSS TCP segments) arrive in fragmented skbs. Reading skb->data only gets the linear portion. One line of skb_linearize() before decryption fixed the AEAD authentication failures that killed TCP bulk throughput.
| Configuration | Throughput | % of Wire Speed |
|---|---|---|
| Bare metal (no VPN) | 2,356 Mbps | 100% |
| Kernel WireGuard | 1,096 Mbps | 46.5% |
| RustGuard (2 queues) | 472 Mbps | 20.0% |
| RustGuard (io_uring) | 225 Mbps | 9.6% |
| RustGuard (single queue) | 175 Mbps | 7.4% |
| Configuration | Throughput | Retransmits | % of Kernel WG |
|---|---|---|---|
| Bare wire (I226-V 2.5G) | 2,350 Mbps | 0 | — |
| Kernel WireGuard (C) | 2,230 Mbps | 0 | 100% |
| RustGuard sync (default) | 2,130 Mbps | 25K | 95.5% |
| RustGuard async (workqueue) | 1,530 Mbps | 7K | 68.6% |
Two modes via the async_crypto module param:
- sync (default): encrypt/decrypt inline in softirq. Max throughput, higher retransmits.
- async (async_crypto=1): SG encrypt/decrypt in a per-CPU workqueue. Lower retransmits (75% reduction), lower throughput from scheduling overhead.
| Configuration | Throughput | % of Wire Speed |
|---|---|---|
| Bare wire (virtio-net) | 33,100 Mbps | 100% |
| Kernel WireGuard (C) | 4,190 Mbps | 12.7% |
| RustGuard kmod | 1,180 Mbps | 3.6% |
We spent 15+ commits debugging "SG crashes" — trying skb_cow_data, skb_copy, pskb_expand_head, reverting to buffer paths, blaming page boundaries. The root cause was a 16-byte stack buffer overflow in the AEAD verification path:
let mut verify = [0u8; 16]; // 16 bytes
// chacha20poly1305_decrypt writes src_len - 16 bytes to dst
// For a 1436-byte ciphertext: writes 1420 bytes into a 16-byte buffer
wg_chacha20poly1305_decrypt(..., verify.as_mut_ptr()) // STACK SMASH

Once fixed, the SG scatter-gather AEAD path worked perfectly in the workqueue — both encrypt and decrypt. Every "crash" we attributed to SG, page boundaries, or skb_cow_data was this one bug. Serial console output on the crash confirmed it: RIP: rustguard_rx+0x248b.
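The invariant the bug violated is simple: a ChaCha20-Poly1305 ciphertext is the plaintext plus a 16-byte tag, so the decrypt destination must hold src_len - 16 bytes, not just the tag. A sketch of that sizing check (illustrative helpers, not the module's code):

```rust
// ChaCha20-Poly1305 appends a 16-byte Poly1305 tag to the plaintext.
const TAG_LEN: usize = 16;

// Decrypt output length; None if the input is shorter than a tag.
fn plaintext_len(ciphertext_len: usize) -> Option<usize> {
    ciphertext_len.checked_sub(TAG_LEN)
}

// The check the buggy path skipped: can dst hold the decrypted bytes?
fn dst_fits(ciphertext_len: usize, dst_capacity: usize) -> bool {
    match plaintext_len(ciphertext_len) {
        Some(n) => dst_capacity >= n,
        None => false, // invalid ciphertext
    }
}
```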
The gap between the async workqueue path (1.53 Gbps) and kernel WireGuard (2.23 Gbps) comes from per-packet overhead:
- Packet batching — Jason queues N packets then processes them in one worker invocation. We queue one work item per packet (kmalloc + queue_work per packet).
- TX skb reuse — we allocate a fresh skb per packet. Kernel WireGuard reuses the original with headroom expansion.
- NAPI integration — kernel WireGuard uses NAPI for RX batching. We inject via netif_rx per packet.
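The batching difference is the biggest of the three. A minimal sketch of the contrast, in plain userspace Rust rather than kernel workqueue code: the producer does a cheap append per packet, and the worker drains everything queued since its last run in one invocation instead of being scheduled once per packet.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Batched crypto queue sketch: many cheap enqueues, one drain per
// worker invocation (vs. one work item allocated and scheduled per packet).
struct CryptoQueue {
    pending: Mutex<VecDeque<Vec<u8>>>,
}

impl CryptoQueue {
    // Producer side (softirq in the kernel): append only.
    fn enqueue(&self, pkt: Vec<u8>) {
        self.pending.lock().unwrap().push_back(pkt);
    }

    // Worker side: one invocation processes the whole backlog.
    // Returns how many packets this run handled.
    fn drain_batch(&self) -> usize {
        let batch: Vec<Vec<u8>> = self.pending.lock().unwrap().drain(..).collect();
        // ... encrypt/decrypt each packet in `batch` here ...
        batch.len()
    }
}
```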
The fundamental result: going from userspace (TUN) to kernel (net_device) gave a 3.5–4.5x throughput improvement — from 472 Mbps to 1,530–2,130 Mbps — confirming that TUN was always the bottleneck.
rustguard up /etc/wireguard/wg0.conf

# Server
rustguard serve --pool 10.150.0.0/24 --token <shared-secret> --port 51820
# Client
rustguard join 203.0.113.1:51820 --token <shared-secret>

# Manage enrollment
rustguard open 60 # accept new peers for 60 seconds
rustguard close # stop accepting
rustguard status # show peers and window state

rustguard genkey # generate private key
rustguard genkey | rustguard pubkey # derive public key

rustguard serve --pool 10.150.0.0/24 --token s --queues 2 # multi-queue TUN
rustguard serve --pool 10.150.0.0/24 --token s --xdp enp7s0 # AF_XDP zero-copy
rustguard serve --pool 10.150.0.0/24 --token s --uring # io_uring batched I/O

9,400+ lines of Rust across 7 crates + 1,155 lines of C kernel shims. 95 commits.
Kernel module breakdown (3,300 lines total):
| File | Lines | Role |
|---|---|---|
| lib.rs | 926 | Module init, TX/RX paths, peer management, rekeying, timer tick |
| noise.rs | 540 | Noise_IK handshake, Curve25519 DH, BLAKE2s HKDF, key zeroization |
| allowedips.rs | 211 | IPv4/IPv6 radix trie for cryptokey routing |
| cookie.rs | 205 | DoS protection: MAC1/MAC2 verification, cookie reply |
| timers.rs | 154 | Session timers: rekey, keepalive, expiry state machine |
| replay.rs | 98 | Anti-replay 2048-bit sliding window |
| wg_net.c | 302 | net_device, module params, skb helpers, TX skb pipeline |
| wg_crypto.c | 267 | ChaCha20-Poly1305, BLAKE2s, HKDF, Curve25519, secure memory |
| wg_queue.c | 213 | Async crypto workqueue: SG encrypt/decrypt in process context |
| wg_socket.c | 179 | Kernel UDP socket, encap_rcv callback |
| wg_genl.c | 140 | Genetlink skeleton for wg tool compatibility |
| wg_timer.c | 54 | Periodic workqueue timer for rekey/keepalive |
65% Rust, 35% C. The C is pure plumbing (kernel APIs without Rust bindings). All protocol logic — handshake, routing, replay, timers, cookies, rekeying — is Rust. For reference, kernel WireGuard is ~4,000 lines of C; we're at 3,300 (83%) with full protocol coverage + two crypto modes.
cargo build --release
cargo build --release --target x86_64-unknown-linux-musl # cross-compile

Requires a Linux 6.10+ kernel source tree with CONFIG_RUST=y:

cd rustguard-kmod
make modules KDIR=/path/to/linux-source KBUILD_MODPOST_WARN=1 CONFIG_DEBUG_INFO_BTF_MODULES=

The Makefile stages shared crate sources, rewrites imports for Kbuild, and compiles via the kernel build system.
# Load with peer configuration
sudo insmod rustguard.ko peer_ip=0x0A0A0033 peer_port=51820 role=0 \
peer_pubkey=<64-char-hex>
# Configure interface
sudo ip addr add 10.150.0.1/24 dev wg0
sudo ip link set wg0 up

- Two Debian 12 VMs on an Intel N305 hypervisor
- Intel i225 2.5G NICs with PCIe passthrough
- Direct connection: 192.168.99.1 ↔ 192.168.99.2

- Same VMs, virtio-net (shared memory between VMs)
- Custom kernel 6.12.74 built with CONFIG_RUST=y on i9-9900K WSL2
- Module cross-compiled on WSL2, deployed via scp
- TUN is the bottleneck, not crypto. ChaCha20-Poly1305 runs at >10 Gbps on modern CPUs. The kernel boundary crossing for TUN read/write is what kills userspace VPN throughput. Proved it by going in-kernel: 472 Mbps → 1,180 Mbps.
- Multi-queue TUN is the biggest userspace win. Parallelizing TUN I/O across cores gave a 2.7× improvement. Everything else (AF_XDP, io_uring, recvmmsg) was noise in comparison.
- AF_XDP solves the wrong problem. Zero-copy UDP is fast, but the bottleneck is on the TUN side. AF_XDP would matter if both sides were AF_XDP (no TUN at all), but then you're building a kernel module with extra steps. Which is what we ended up doing.
- Rust in the kernel works, but you're on your own for networking. Kernel 6.12 Rust bindings only cover block devices and PHY drivers. For networking, you write C shim functions and call them from Rust via FFI. Not elegant, but it works.
- Jason's chacha20poly1305 library is the right abstraction. The kernel crypto API (crypto_alloc_aead, scatterlists, requests) is designed for block devices and TLS. WireGuard needs buffer-in, buffer-out. Jason wrote lib/crypto/chacha20poly1305.c for exactly this reason. We wasted time on the crypto API before finding this.
- skb_linearize is the kernel networking rite of passage. Large packets arrive in fragmented skbs. Reading skb->data only gives you the linear head. Your AEAD will fail with EBADMSG and you'll stare at it for an hour before realizing the data isn't contiguous. Every kernel network developer has this story.
- The WireGuard protocol is beautifully simple. Noise_IK, 1-RTT handshake, 64-bit nonce counter, rotating keys. Jason Donenfeld's 4,000 lines of kernel code is an achievement in discipline. Building it from scratch is the best way to appreciate that.
- Security audits find real bugs. The HMAC-BLAKE2s implementation was wrong (keyed BLAKE2s ≠ HMAC). MAC1 was checked too late. The replay window was updatable before AEAD verification. None of these showed up in functional tests.
MIT OR Apache-2.0