Enable nested virtualization with performance benchmarks#50
Closed
Switch from nVHE to VHE mode for nested virtualization support.

VHE mode (E2H=1) allows the guest kernel to run at EL2, which is required for kvm-arm.mode=nested to work in the guest.

Changes:
- podman.rs: Set FCVM_NV2=1 when --kernel is used; change the boot param from kvm-arm.mode=nvhe to kvm-arm.mode=nested
- test_kvm.rs: Remove #[ignore] from the 4-level inception test; add proper exit-status checking to catch failures
- Makefile: Add rebuild-fc and dev-fc-test targets for the Firecracker development workflow

Results:
- L1 guest now boots in VHE mode: "kvm [1]: VHE mode initialized"
- Basic nested KVM works (KVM_CREATE_VM succeeds in L1)
- KVM_CAP_ARM_EL2 is reported as 1048576 in L1

Known limitation: recursive nested virtualization (L1 creating L2 with HAS_EL2) fails with "Error initializing the vcpu: No such file or directory". L1's KVM advertises the capability, but KVM_ARM_VCPU_INIT with HAS_EL2 fails. This is a kernel limitation; the NV2 patches note that recursive nesting is "not tested yet".

Tested on AWS c7g.metal (Graviton3) with kernel 6.18.2.
Root cause analysis for the recursive nesting failure:
- Host KVM correctly stores the emulated ID values (MMFR4=0xe100000, PFR0.EL2=1)
- But ID register reads from virtual EL2+VHE bypass KVM emulation
- The guest reads hardware directly: MMFR4=0, PFR0.EL2=0
- Evidence: 38,904 sysreg traps, ZERO ID registers, access_id_reg never called

Also adds Makefile targets for inception VM debugging:
- inception-vm: Start a single development VM
- inception-exec: Run commands in the VM
- inception-stop: Stop the VM
- inception-status: Show VM status

Tested: make inception-vm, make inception-exec CMD="cat /proc/cpuinfo"
The problem: L1 guest's KVM reported KVM_CAP_ARM_EL2=0, blocking L2+ VM creation. The root cause is that virtual EL2 reads ID registers directly from hardware; HCR_EL2.TID3 only traps EL1 reads, not EL2 reads, so the L1 kernel saw MMFR4=0 instead of the emulated NV2_ONLY value.

The solution: Add an arm64.nv2 boot parameter that overrides MMFR4.NV_frac to advertise NV2 support. Key insight: the kernel's override mechanism normally only allows lowering feature values (FTR_LOWER_SAFE). Changed to FTR_HIGHER_SAFE for MMFR4 NV_frac to allow upward overrides.

Changes:
- kernel/patches/mmfr4-override.patch: Kernel patch adding MMFR4 override support with FTR_HIGHER_SAFE, plus an arm64.nv2 alias
- kernel/build.sh: Update to kernel 6.18, auto-compute the SHA from build inputs, apply the MMFR4 patch during the build
- src/commands/podman.rs: Add arm64.nv2 to the boot args for inception

Tested: Successfully ran an L2 VM inside L1: [ctr:stdout] Hello from L2

The recursive nesting chain now works: Host -> L1 -> L2.
- Add a "Complex/Advanced PRs" section with a detailed template for architectural changes, workarounds, and kernel patches
- Update the inception docs to reflect that recursive nesting now works
- Replace "LIMITATION" with "Solved" and document the solution
- test_inception_l2 runs fcvm inside the L1 VM to start L2
- Uses a script file on shared storage to avoid shell-escaping complexity
- L1 imports the image from the shared cache via skopeo, then runs fcvm
- L2 echoes a marker to verify successful execution

Key components:
- Inception kernel 6.18 with FUSE_REMAP_FILE_RANGE support
- Shared storage /mnt/fcvm-btrfs via FUSE-over-vsock
- Image cache for sharing container images between levels

Also includes:
- ensure_inception_image() with content-addressable caching
- Updated Makefile with an inception image build target
- fc-agent FUSE logging improvements

Tested: make test-root FILTER=inception_l2 (passes in ~46s)
The L2 inception test passes, but L3+ tests are blocked by FUSE chain latency:
- The 3-hop FUSE chain (L3→L2→L1→HOST) causes ~3-5 seconds of latency per request
- PassthroughFs + spawn_blocking serialization at each hop
- FUSE mount initialization alone takes 10+ minutes at L3

Changes:
- Add test_inception_l3 and test_inception_l4 tests with #[ignore]
- Add a run_inception_n_levels() helper for arbitrary nesting depth
- Add request-count logging to the fuse-pipe client/server for diagnostics
- Add detailed error logging for socket write failures (error type, kind)
- Update disk.rs docs about the FUSE_REMAP_FILE_RANGE kernel requirement

The L3/L4 tests document the current limitation. Future work needed:
- Request pipelining to avoid spawn_blocking serialization
- An async PassthroughFs implementation
Replace the simple marker echo with comprehensive benchmarks at each level:
- Egress test: curl to ifconfig.me to verify the network works
- Local disk: dd 10MB write/read to measure rootfs performance
- FUSE disk: dd 10MB write/read to measure vsock FUSE performance

Results from c7g.metal:
- Egress: Both L1 and L2 can reach the internet (same public IP)
- Local disk L1: 4ms write, 2ms read (10MB)
- Local disk L2: 11ms write, 5ms read (10MB); ~2.5x nested overhead
- FUSE L1: 77ms write, 41ms read; ~130/244 MB/s
- FUSE L2: 205ms write, 138ms read; ~49/72 MB/s

Key finding: FUSE bulk throughput is usable at L2 (~3x slower per hop), but L3 fails due to per-request latency during boot (hundreds of ms per small file access through the 3-hop chain).

Scripts are written to /mnt/fcvm-btrfs/inception-scripts/ for reuse.
Changes:
- Host-side: Export localhost/ images as OCI archive tar files using `podman save --format oci-archive` instead of the skopeo directory format
- Guest-side: Run directly from the `oci-archive:/path` format without an import step
- Strip the sha256: prefix from the digest to form a valid filename

Benefits:
- Single file transfer over FUSE (previously a directory with many blobs)
- No skopeo import step needed in the guest; podman runs directly from the archive
- Faster startup for localhost/ images

Tested:
sudo fcvm podman run localhost/test-oci:latest
Container output: OCI_ARCHIVE_TEST_SUCCESS
make test-root FILTER=sanity # 2 passed
- Update the Makefile to copy binaries to artifacts/ instead of bin/
- Update Containerfile.inception COPY paths to use artifacts/
- Update test_kvm.rs to use artifacts/ for the SHA marker and binaries
- Add the Containerfile.inception and inception.sh source files

The artifacts/ directory is already in .gitignore, so derived files (fcvm, fc-agent, firecracker-nv2) won't be tracked.

Tested: test_inception_l2 passes in 68.87s
Benchmark results (256 workers, 1024 × 4KB files):
- 1 reader: 3.0s writes (serialization bottleneck)
- 64 readers: 165ms writes ≈ host 161ms
- 256 readers: 162ms writes (3ms improvement, 4x memory)

64 readers achieve near-host performance while using 1/4 the memory (512MB vs 2GB for reader stacks at 8MB each).
Added an FCVM_FUSE_TRACE_RATE env var to enable per-operation tracing:
- Passed to the guest via a kernel boot param (fuse_trace_rate=N)
- Shows the operation name, total latency, server time, and fs time
- Handles clock skew between VMs (shows "?" for invalid deltas)

Key findings from the L2 investigation:
- Async writes: 3x slower (expected for 2 FUSE hops)
- Fsync: 16x slower (blocks synchronously through both layers)
- Combined sync writes: 7x slower

Files changed:
- fc-agent/src/fuse/mod.rs: Parse fuse_trace_rate from /proc/cmdline
- src/commands/podman.rs: Pass FCVM_FUSE_TRACE_RATE via boot args
- fuse-pipe/src/protocol/wire.rs: Fix an overflow, add op_name to the trace
- fuse-pipe/src/client/multiplexer.rs: Pass op_name to print()
- tests/test_kvm.rs: Enable tracing in the L2 test
- .claude/CLAUDE.md: Document the tracing capabilities

Tested: make test-root FILTER=inception_l2 (passed)
Three fixes for TTY mode exec:
1. Set stdin to null in test TTY wrapper
- Prevents garbage from test harness being sent to VM's PTY
- Root cause of null byte (`^@`) in TTY output
2. Fix output lost when exit JSON in same buffer
- When command output and exit JSON arrive in same packet,
the old code returned immediately without writing output
- Now writes the output data before handling the exit JSON
3. Increase vsock backlog from 5 to 128
- Prevents connection refused under parallel exec stress
Added test_exec_parallel_tty_stress test:
- 100 parallel TTY execs
- Verifies no null bytes or lost output
- Requires 100% success rate
Tested:
sudo -E cargo test --release -p fcvm --test test_exec \
test_exec_parallel_tty_stress --features integration-fast
# 100/100 success, 248 execs/sec
Matches podman/docker semantics:
- -t: allocate a PTY (colors/formatting)
- -i: forward stdin
- -it: both (interactive shell)
- neither: plain exec

Changes:
- Add an exec-proto crate for the binary framing protocol
- Unify run_framed_mode() for both TTY and non-TTY interactive modes
- fc-agent uses pipes for non-TTY interactive, a PTY only with -t
- Add tests for all 4 flag combinations (VM + container)
Summary
Enable ARM64 nested virtualization (FEAT_NV2) with L1→L2 inception working and comprehensive performance benchmarks.
Key Changes
Nested Virtualization Support
- Boot the guest in VHE mode (kvm-arm.mode=nested) so the L1 kernel runs at EL2
- Add an arm64.nv2 boot parameter to override MMFR4 for recursive nesting
- Kernel patch: kernel/patches/mmfr4-override.patch

Performance Benchmarks
The L2 inception test now measures at each level:
- Egress (curl to ifconfig.me)
- Local disk (dd 10MB write/read)
- FUSE disk (dd 10MB write/read)

Results (c7g.metal)
- Local disk: ~2.5x overhead at L2 (11ms/5ms vs 4ms/2ms write/read for 10MB)
- FUSE: ~130/244 MB/s at L1, ~49/72 MB/s at L2
L3 Limitation Documented
L3+ tests are #[ignore] because the 3-hop FUSE chain causes ~200-500ms per-request latency (bulk throughput remains acceptable). Boot takes 10+ minutes due to thousands of small file accesses.

Test Plan

make test-root FILTER=inception_l2 passes (~41s)