
Enable nested virtualization with performance benchmarks#50

Closed
ejc3 wants to merge 13 commits into main from fuse-latency-investigation

Conversation


ejc3 commented Dec 30, 2025

Summary

Enable ARM64 nested virtualization (FEAT_NV2) with working L1→L2 inception and comprehensive performance benchmarks at each level.

  • VHE mode enabled for guest KVM support
  • MMFR4 kernel override to fix ID register visibility issue
  • L2 inception test with egress and disk benchmarks
  • L3+ tests documented as blocked (FUSE chain latency)

Key Changes

Nested Virtualization Support

  • Switch from nVHE to VHE mode (kvm-arm.mode=nested)
  • Add arm64.nv2 boot parameter to override MMFR4 for recursive nesting
  • Kernel patch: kernel/patches/mmfr4-override.patch

Performance Benchmarks

L2 inception test now measures at each level:

  • Egress: curl to ifconfig.me (both L1 and L2 work)
  • Local disk: dd 10MB write/read
  • FUSE disk: dd 10MB write/read through vsock
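A minimal sketch of the per-level disk probe (paths and the use of `date +%s%N` for millisecond timing are illustrative; the actual test scripts live on the shared mount):

```shell
# Time a 10MB write and read with dd, reporting milliseconds.
# conv=fsync flushes to disk before the write timer stops.
FILE=/tmp/bench.img

t0=$(date +%s%N)
dd if=/dev/zero of="$FILE" bs=1M count=10 conv=fsync 2>/dev/null
t1=$(date +%s%N)
echo "write: $(( (t1 - t0) / 1000000 )) ms"

t0=$(date +%s%N)
dd if="$FILE" of=/dev/null bs=1M 2>/dev/null
t1=$(date +%s%N)
echo "read: $(( (t1 - t0) / 1000000 )) ms"

rm -f "$FILE"
```

Run once against the local rootfs and once against the FUSE mount to get the two disk rows per level.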

Results (c7g.metal)

| Metric | L1 (1-hop) | L2 (2-hop) | Ratio |
|---|---|---|---|
| Egress | ✅ OK | ✅ OK | Both work |
| Local Write 10MB | 4ms | 11ms | 2.75x |
| Local Read 10MB | 2ms | 5ms | 2.5x |
| FUSE Write 10MB | 77ms | 205ms | 2.66x |
| FUSE Read 10MB | 41ms | 138ms | 3.37x |
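The MB/s figures quoted in the commit log (≈130/244 MB/s at L1, ≈49/72 MB/s at L2) and the Ratio column follow directly from the table; a quick awk check:

```shell
# Convert "ms for 10MB" into MB/s and recompute one ratio from the table.
awk 'BEGIN {
  # t[1..8] = local-write L1/L2, local-read L1/L2, FUSE-write L1/L2, FUSE-read L1/L2 (ms)
  split("4 11 2 5 77 205 41 138", t)
  printf "FUSE write L1: %.0f MB/s\n", 10 / (t[5] / 1000)   # 130
  printf "FUSE write L2: %.0f MB/s\n", 10 / (t[6] / 1000)   # 49
  printf "FUSE read  L1: %.0f MB/s\n", 10 / (t[7] / 1000)   # 244
  printf "FUSE read  L2: %.0f MB/s\n", 10 / (t[8] / 1000)   # 72
  printf "local write ratio: %.2fx\n", t[2] / t[1]          # 2.75x
}'
```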

L3 Limitation Documented

L3+ tests are marked #[ignore] because the 3-hop FUSE chain adds ~200-500ms of per-request latency (bulk throughput, by contrast, remains acceptable). Boot takes 10+ minutes due to thousands of small-file accesses.
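The 10-minute boot figure is consistent with the per-request latency; a back-of-envelope check (the ~2000-request count is illustrative, not measured):

```shell
# Thousands of small-file FUSE requests at hundreds of ms each dominate L3 boot.
requests=2000     # assumed request count during boot
latency_ms=300    # mid-range of the observed 200-500ms per request
total_s=$(( requests * latency_ms / 1000 ))
echo "${total_s}s ≈ $(( total_s / 60 )) minutes"   # 600s ≈ 10 minutes
```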

Test Plan

  • make test-root FILTER=inception_l2 passes (~41s)
  • L1 and L2 egress verified (same public IP)
  • FUSE bulk throughput measured at each level
  • L3 limitation documented with clear root cause

ejc3 added 13 commits December 29, 2025 15:06

Switch from nVHE to VHE mode for nested virtualization support.
VHE mode (E2H=1) allows the guest kernel to run at EL2, which is
required for kvm-arm.mode=nested to work in the guest.

Changes:
- podman.rs: Set FCVM_NV2=1 when --kernel is used, change boot
  param from kvm-arm.mode=nvhe to kvm-arm.mode=nested
- test_kvm.rs: Remove #[ignore] from 4-level inception test,
  add proper exit status checking to catch failures
- Makefile: Add rebuild-fc and dev-fc-test targets for Firecracker
  development workflow

Results:
- L1 guest now boots with VHE mode: "kvm [1]: VHE mode initialized"
- Basic nested KVM works (KVM_CREATE_VM succeeds in L1)
- KVM_CAP_ARM_EL2 is reported as 1048576 in L1

Known limitation:
Recursive nested virtualization (L1 creating L2 with HAS_EL2) fails:
  "Error initializing the vcpu: No such file or directory"
L1's KVM advertises the capability but KVM_ARM_VCPU_INIT with HAS_EL2
fails. This is a kernel limitation - NV2 patches note recursive
nesting is "not tested yet".

Tested on AWS c7g.metal (Graviton3) with kernel 6.18.2.

Root cause analysis for recursive nesting failure:
- Host KVM correctly stores emulated ID values (MMFR4=0xe100000, PFR0.EL2=1)
- But ID register reads from virtual EL2+VHE bypass KVM emulation
- Guest reads hardware directly: MMFR4=0, PFR0.EL2=0
- Evidence: 38,904 sysreg traps observed, zero of them for ID registers; access_id_reg never called

Also adds Makefile targets for inception VM debugging:
- inception-vm: Start single development VM
- inception-exec: Run commands in VM
- inception-stop: Stop VM
- inception-status: Show VM status

Tested: make inception-vm, make inception-exec CMD="cat /proc/cpuinfo"

The problem: L1 guest's KVM reported KVM_CAP_ARM_EL2=0, blocking L2+ VM
creation. Root cause is that virtual EL2 reads ID registers directly
from hardware - HCR_EL2.TID3 only traps EL1 reads, not EL2 reads. So
L1 kernel saw MMFR4=0 instead of the emulated NV2_ONLY value.

The solution: Add arm64.nv2 boot parameter that overrides MMFR4.NV_frac
to advertise NV2 support. Key insight: the kernel's override mechanism
normally only allows lowering feature values (FTR_LOWER_SAFE). Changed
to FTR_HIGHER_SAFE for MMFR4 NV_frac to allow upward overrides.

Changes:
- kernel/patches/mmfr4-override.patch: Kernel patch adding MMFR4
  override support with FTR_HIGHER_SAFE, plus arm64.nv2 alias
- kernel/build.sh: Update to kernel 6.18, auto-compute SHA from
  build inputs, apply MMFR4 patch during build
- src/commands/podman.rs: Add arm64.nv2 to boot args for inception

Tested: Successfully ran L2 VM inside L1:
  [ctr:stdout] Hello from L2

The recursive nesting chain now works: Host -> L1 -> L2.

- Add "Complex/Advanced PRs" section with detailed template for
  architectural changes, workarounds, and kernel patches
- Update inception docs to reflect that recursive nesting now works
- Replace "LIMITATION" with "Solved" and document the solution

- test_inception_l2 runs fcvm inside L1 VM to start L2
- Uses script file on shared storage to avoid shell escaping complexity
- L1 imports image from shared cache via skopeo, then runs fcvm
- L2 echoes marker to verify successful execution
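The script-on-shared-storage trick can be sketched as follows, with a temp dir standing in for the PR's /mnt/fcvm-btrfs/inception-scripts/ mount:

```shell
# Write the L2 payload as a script file instead of a quoted one-liner,
# so nothing needs escaping through the nested exec layers.
dir=$(mktemp -d)
cat > "$dir/run-l2.sh" <<'EOF'
#!/bin/sh
echo "Hello from L2"
EOF
chmod +x "$dir/run-l2.sh"

# L1 invokes the script by path; capture its marker output.
out=$("$dir/run-l2.sh")
echo "$out"    # prints: Hello from L2
rm -rf "$dir"
```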

Key components:
- Inception kernel 6.18 with FUSE_REMAP_FILE_RANGE support
- Shared storage /mnt/fcvm-btrfs via FUSE-over-vsock
- Image cache for sharing container images between levels

Also includes:
- ensure_inception_image() with content-addressable caching
- Updated Makefile with inception image build target
- fc-agent FUSE logging improvements

Tested: make test-root FILTER=inception_l2 (passes in ~46s)

L2 inception test passes but L3+ tests blocked by FUSE chain latency:
- 3-hop FUSE chain (L3→L2→L1→HOST) causes ~3-5 second latency per request
- PassthroughFs + spawn_blocking serialization at each hop
- FUSE mount initialization alone takes 10+ minutes at L3

Changes:
- Add test_inception_l3 and test_inception_l4 tests with #[ignore]
- Add run_inception_n_levels() helper for arbitrary nesting depth
- Add request count logging to fuse-pipe client/server for diagnostics
- Add detailed error logging for socket write failures (error type, kind)
- Update disk.rs docs about FUSE_REMAP_FILE_RANGE kernel requirement

The L3/L4 tests document the current limitation. Future work needed:
- Request pipelining to avoid spawn_blocking serialization
- Async PassthroughFs implementation

Replace simple marker echo with comprehensive benchmarks at each level:
- Egress test: curl to ifconfig.me to verify network works
- Local disk: dd 10MB write/read to measure rootfs performance
- FUSE disk: dd 10MB write/read to measure vsock FUSE performance

Results from c7g.metal:
- Egress: Both L1 and L2 can reach internet (same public IP)
- Local disk L1: 4ms write, 2ms read (10MB)
- Local disk L2: 11ms write, 5ms read (10MB) - ~2.5x nested overhead
- FUSE L1: 77ms write, 41ms read - ~130/244 MB/s
- FUSE L2: 205ms write, 138ms read - ~49/72 MB/s

Key finding: FUSE bulk throughput is usable at L2 (~3x slower per hop),
but L3 fails due to per-request latency during boot (hundreds of ms per
small file access through 3-hop chain).

Scripts written to /mnt/fcvm-btrfs/inception-scripts/ for reuse.

Changes:
- Host-side: Export localhost/ images as OCI archive tar files using
  `podman save --format oci-archive` instead of skopeo directory format
- Guest-side: Run directly from `oci-archive:/path` format without import
- Strip sha256: prefix from digest for valid filenames
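The export/run flow described above looks roughly like this (paths illustrative; direct-run support for the `oci-archive:` transport is assumed per the commit message):

```
# Host side: export a localhost/ image as a single OCI archive tar
podman save --format oci-archive -o /mnt/cache/test-oci.tar localhost/test-oci:latest

# Guest side: run directly from the archive, no skopeo import step
podman run oci-archive:/mnt/cache/test-oci.tar
```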

Benefits:
- Single file transfer over FUSE (previously directory with many blobs)
- No skopeo import step needed in guest - podman runs directly from archive
- Faster startup for localhost/ images

Tested:
  sudo fcvm podman run localhost/test-oci:latest
  Container output: OCI_ARCHIVE_TEST_SUCCESS
  make test-root FILTER=sanity  # 2 passed

- Update Makefile to copy binaries to artifacts/ instead of bin/
- Update Containerfile.inception COPY paths to use artifacts/
- Update test_kvm.rs to use artifacts/ for SHA marker and binaries
- Add Containerfile.inception and inception.sh source files

The artifacts/ directory is already in .gitignore, so derived files
(fcvm, fc-agent, firecracker-nv2) won't be tracked.

Tested: test_inception_l2 passes in 68.87s

Benchmark results (256 workers, 1024 × 4KB files):
- 1 reader:   3.0s writes (serialization bottleneck)
- 64 readers: 165ms writes ≈ host 161ms
- 256 readers: 162ms writes (3ms improvement, 4x memory)

64 readers achieves near-host performance while using 1/4 the memory
(512MB vs 2GB for reader stacks at 8MB each).
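The memory claim checks out at 8MB per reader stack:

```shell
# Reader-stack memory for the two configurations (figures from the results above).
echo "64 readers:  $(( 64 * 8 )) MB"           # 512 MB
echo "256 readers: $(( 256 * 8 / 1024 )) GB"   # 2 GB
```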

Added FCVM_FUSE_TRACE_RATE env var to enable per-operation tracing:
- Passed to guest via kernel boot param (fuse_trace_rate=N)
- Shows operation name, total latency, server time, fs time
- Handles clock skew between VMs (shows "?" for invalid deltas)

Key findings from L2 investigation:
- Async writes: 3x slower (expected for 2 FUSE hops)
- Fsync: 16x slower (blocks synchronously through both layers)
- Combined sync writes: 7x slower

Files changed:
- fc-agent/src/fuse/mod.rs: Parse fuse_trace_rate from /proc/cmdline
- src/commands/podman.rs: Pass FCVM_FUSE_TRACE_RATE via boot args
- fuse-pipe/src/protocol/wire.rs: Fix overflow, add op_name to trace
- fuse-pipe/src/client/multiplexer.rs: Pass op_name to print()
- tests/test_kvm.rs: Enable tracing in L2 test
- .claude/CLAUDE.md: Document tracing capabilities

Tested: make test-root FILTER=inception_l2 (passed)

Three fixes for TTY mode exec:

1. Set stdin to null in test TTY wrapper
   - Prevents garbage from test harness being sent to VM's PTY
   - Root cause of null byte (`^@`) in TTY output

2. Fix output lost when exit JSON in same buffer
   - When command output and exit JSON arrive in same packet,
     the old code returned immediately without writing output
   - Now writes data BEFORE the exit JSON before returning

3. Increase vsock backlog from 5 to 128
   - Prevents connection refused under parallel exec stress
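Fix 2 can be illustrated with a naive split on the first `{` (illustrative only; the real code uses its own framing, and command output could itself contain a brace):

```shell
# When command output and the exit JSON land in the same packet,
# emit the output bytes BEFORE handling the JSON.
buf='hello world{"exit_code":0}'
out=${buf%%\{*}        # bytes before the exit JSON
json=${buf#"$out"}     # the exit JSON itself
printf '%s\n' "$out"   # prints: hello world
printf '%s\n' "$json"  # prints: {"exit_code":0}
```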

Added test_exec_parallel_tty_stress test:
- 100 parallel TTY execs
- Verifies no null bytes or lost output
- Requires 100% success rate

Tested:
  sudo -E cargo test --release -p fcvm --test test_exec \
    test_exec_parallel_tty_stress --features integration-fast
  # 100/100 success, 248 execs/sec

Matches podman/docker semantics:
- -t: allocate PTY (colors/formatting)
- -i: forward stdin
- -it: both (interactive shell)
- neither: plain exec
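Hypothetical invocations for the four combinations (exact fcvm exec CLI syntax assumed here, mirroring podman):

```
fcvm exec myvm -t  -- ls --color     # PTY: colors/formatting, no stdin
echo hi | fcvm exec myvm -i -- cat   # stdin forwarded, no PTY
fcvm exec myvm -it -- /bin/sh        # interactive shell
fcvm exec myvm -- uname -a           # plain exec
```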

Changes:
- Add exec-proto crate for binary framing protocol
- Unify run_framed_mode() for both TTY and non-TTY interactive
- fc-agent uses pipes for non-TTY interactive, PTY only with -t
- Add tests for all 4 flag combinations (VM + container)

ejc3 commented Dec 31, 2025

Changes already in the PR chain

@ejc3 ejc3 closed this Dec 31, 2025
@ejc3 ejc3 deleted the fuse-latency-investigation branch December 31, 2025 17:47