Enable VHE mode and recursive nested virtualization #49
Merged
Switch from nVHE to VHE mode for nested virtualization support. VHE mode (E2H=1) allows the guest kernel to run at EL2, which is required for kvm-arm.mode=nested to work in the guest.

Changes:
- podman.rs: Set FCVM_NV2=1 when --kernel is used, change boot param from kvm-arm.mode=nvhe to kvm-arm.mode=nested
- test_kvm.rs: Remove #[ignore] from 4-level inception test, add proper exit status checking to catch failures
- Makefile: Add rebuild-fc and dev-fc-test targets for Firecracker development workflow

Results:
- L1 guest now boots with VHE mode: "kvm [1]: VHE mode initialized"
- Basic nested KVM works (KVM_CREATE_VM succeeds in L1)
- KVM_CAP_ARM_EL2 is reported as 1048576 in L1

Known limitation: Recursive nested virtualization (L1 creating L2 with HAS_EL2) fails with:
"Error initializing the vcpu: No such file or directory"
L1's KVM advertises the capability, but KVM_ARM_VCPU_INIT with HAS_EL2 fails. This is a kernel limitation: the NV2 patches note that recursive nesting is "not tested yet".

Tested on AWS c7g.metal (Graviton3) with kernel 6.18.2.
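The boot-parameter and exit-status changes described in this commit could look roughly like the sketch below. The function names and signatures are hypothetical and are not taken from podman.rs or test_kvm.rs; only the strings kvm-arm.mode=nested, kvm-arm.mode=nvhe and FCVM_NV2=1 come from the commit message.

```rust
use std::process::Command;

/// Hypothetical helper: KVM-related kernel boot arguments for a guest.
/// `nested` selects kvm-arm.mode=nested (guest kernel at EL2) over nvhe.
fn kvm_boot_args(nested: bool) -> Vec<String> {
    let mode = if nested { "nested" } else { "nvhe" };
    vec![format!("kvm-arm.mode={mode}")]
}

/// Hypothetical helper: extra environment for the VMM process.
/// Mirrors "set FCVM_NV2=1 when --kernel is used".
fn vmm_env(custom_kernel: Option<&str>) -> Vec<(String, String)> {
    match custom_kernel {
        Some(_) => vec![("FCVM_NV2".to_string(), "1".to_string())],
        None => Vec::new(),
    }
}

/// Runs a command and turns a non-zero exit status into an error,
/// so a test cannot pass silently when the nested VM fails.
fn run_checked(cmd: &mut Command) -> Result<(), String> {
    let status = cmd
        .status()
        .map_err(|e| format!("failed to spawn command: {e}"))?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("command failed: {status}"))
    }
}
```

A test would call `run_checked(...)` and unwrap the result instead of discarding the exit status.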
Root cause analysis for recursive nesting failure:
- Host KVM correctly stores emulated ID values (MMFR4=0xe100000, PFR0.EL2=1)
- But ID register reads from virtual EL2+VHE bypass KVM emulation
- Guest reads hardware directly: MMFR4=0, PFR0.EL2=0
- Evidence: 38,904 sysreg traps, ZERO ID registers, access_id_reg never called

Also adds Makefile targets for inception VM debugging:
- inception-vm: Start single development VM
- inception-exec: Run commands in VM
- inception-stop: Stop VM
- inception-status: Show VM status

Tested: make inception-vm, make inception-exec CMD="cat /proc/cpuinfo"
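To make the raw register values above easier to read, here is a small sketch that decodes 4-bit ID register fields. The field positions are assumptions based on my reading of the Arm architecture's ID_AA64MMFR4_EL1 and ID_AA64PFR0_EL1 layouts, not something this PR states, so check them against the architecture reference before relying on them.

```rust
/// Extracts the 4-bit ID register field whose lowest bit is `shift`.
fn id_field(reg: u64, shift: u32) -> u64 {
    (reg >> shift) & 0xf
}

// Assumed field positions (not taken from this PR):
const MMFR4_NV_FRAC_SHIFT: u32 = 20; // ID_AA64MMFR4_EL1.NV_frac, bits [23:20]
const MMFR4_E2H0_SHIFT: u32 = 24; // ID_AA64MMFR4_EL1.E2H0, bits [27:24]
const PFR0_EL2_SHIFT: u32 = 8; // ID_AA64PFR0_EL1.EL2, bits [11:8]

/// Value KVM emulates for MMFR4 according to the commit message.
const EMULATED_MMFR4: u64 = 0xe100000;
```

Under those assumptions, `id_field(EMULATED_MMFR4, MMFR4_NV_FRAC_SHIFT)` yields 1 and `id_field(EMULATED_MMFR4, MMFR4_E2H0_SHIFT)` yields 0xe, while the hardware value MMFR4=0 decodes to 0 in both fields.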
The problem: L1 guest's KVM reported KVM_CAP_ARM_EL2=0, blocking L2+ VM creation. Root cause is that virtual EL2 reads ID registers directly from hardware: HCR_EL2.TID3 only traps EL1 reads, not EL2 reads. So the L1 kernel saw MMFR4=0 instead of the emulated NV2_ONLY value.

The solution: Add an arm64.nv2 boot parameter that overrides MMFR4.NV_frac to advertise NV2 support.

Key insight: the kernel's override mechanism normally only allows lowering feature values (FTR_LOWER_SAFE). Changed to FTR_HIGHER_SAFE for MMFR4 NV_frac to allow upward overrides.

Changes:
- kernel/patches/mmfr4-override.patch: Kernel patch adding MMFR4 override support with FTR_HIGHER_SAFE, plus arm64.nv2 alias
- kernel/build.sh: Update to kernel 6.18, auto-compute SHA from build inputs, apply MMFR4 patch during build
- src/commands/podman.rs: Add arm64.nv2 to boot args for inception

Tested: Successfully ran L2 VM inside L1:
[ctr:stdout] Hello from L2

The recursive nesting chain now works: Host -> L1 -> L2.
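The difference between the two field types can be illustrated with a simplified model. This is a sketch of the override rule as I understand it (an override is kept only when it is the "safe" choice relative to the hardware value), written in Rust for illustration; it is not the kernel's C code, and it handles unsigned fields only.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FtrType {
    LowerSafe,  // the smaller of two values is the safe one
    HigherSafe, // the larger of two values is the safe one
}

/// Picks the "safe" value of two candidates for a field type.
fn safe_value(kind: FtrType, a: u64, b: u64) -> u64 {
    match kind {
        FtrType::LowerSafe => a.min(b),
        FtrType::HigherSafe => a.max(b),
    }
}

/// Applies a boot-time override: it is accepted only if the override
/// itself is the safe value; otherwise the hardware value is kept.
fn apply_override(kind: FtrType, hardware: u64, requested: u64) -> u64 {
    if safe_value(kind, requested, hardware) == requested {
        requested
    } else {
        hardware
    }
}
```

In this model, with hardware NV_frac=0 and a requested value of 1, `apply_override(FtrType::LowerSafe, 0, 1)` keeps 0 (override rejected), whereas `apply_override(FtrType::HigherSafe, 0, 1)` returns 1, which is why the patch changes the field type.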
- Add "Complex/Advanced PRs" section with detailed template for architectural changes, workarounds, and kernel patches
- Update inception docs to reflect that recursive nesting now works
- Replace "LIMITATION" with "Solved" and document the solution
- test_inception_l2 runs fcvm inside L1 VM to start L2
- Uses script file on shared storage to avoid shell escaping complexity
- L1 imports image from shared cache via skopeo, then runs fcvm
- L2 echoes marker to verify successful execution

Key components:
- Inception kernel 6.18 with FUSE_REMAP_FILE_RANGE support
- Shared storage /mnt/fcvm-btrfs via FUSE-over-vsock
- Image cache for sharing container images between levels

Also includes:
- ensure_inception_image() with content-addressable caching
- Updated Makefile with inception image build target
- fc-agent FUSE logging improvements

Tested: make test-root FILTER=inception_l2 (passes in ~46s)
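Content-addressable caching, as mentioned for ensure_inception_image(), generally means deriving the cache file name from a hash of the build inputs so that unchanged inputs reuse the existing image. The sketch below illustrates the idea only: the function names and the file-name pattern are hypothetical, and FNV-1a is used as a dependency-free stand-in for whatever hash the real code uses (a cryptographic hash such as SHA-256 would be the usual choice).

```rust
use std::path::{Path, PathBuf};

/// 64-bit FNV-1a hash. Stand-in only; not a cryptographic hash.
fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Derives a cache key from all build inputs. Each input is
/// length-prefixed so that ["ab", "c"] and ["a", "bc"] differ.
fn cache_key(inputs: &[&[u8]]) -> String {
    let mut buf = Vec::new();
    for input in inputs {
        buf.extend_from_slice(&(input.len() as u64).to_le_bytes());
        buf.extend_from_slice(input);
    }
    format!("{:016x}", fnv1a64(&buf))
}

/// Hypothetical file-name scheme for the cached image.
fn cached_image_path(cache_dir: &Path, inputs: &[&[u8]]) -> PathBuf {
    cache_dir.join(format!("inception-{}.tar", cache_key(inputs)))
}
```

A caller would rebuild the image only when the path returned by `cached_image_path` does not exist yet.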
L2 inception test passes but L3+ tests blocked by FUSE chain latency:
- 3-hop FUSE chain (L3→L2→L1→HOST) causes ~3-5 second latency per request
- PassthroughFs + spawn_blocking serialization at each hop
- FUSE mount initialization alone takes 10+ minutes at L3

Changes:
- Add test_inception_l3 and test_inception_l4 tests with #[ignore]
- Add run_inception_n_levels() helper for arbitrary nesting depth
- Add request count logging to fuse-pipe client/server for diagnostics
- Add detailed error logging for socket write failures (error type, kind)
- Update disk.rs docs about FUSE_REMAP_FILE_RANGE kernel requirement

The L3/L4 tests document the current limitation. Future work needed:
- Request pipelining to avoid spawn_blocking serialization
- Async PassthroughFs implementation
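A rough back-of-the-envelope model shows why serialization dominates here. The sketch below is illustrative arithmetic, not a measurement: the request count of 150 is an assumption chosen to match the figures above (at 4 seconds per request, 150 strictly serialized requests take 10 minutes), and the pipelined variant idealizes the proposed future work by assuming requests in flight complete concurrently.

```rust
use std::time::Duration;

/// Total time when each request must finish before the next starts.
fn serialized_total(requests: u32, per_request: Duration) -> Duration {
    per_request * requests
}

/// Idealized total time when up to `in_flight` requests overlap.
/// Assumes per-request latency is unchanged by concurrency.
fn pipelined_total(requests: u32, per_request: Duration, in_flight: u32) -> Duration {
    let width = in_flight.max(1);
    let batches = (requests + width - 1) / width; // ceiling division
    per_request * batches
}
```

For example, `serialized_total(150, Duration::from_secs(4))` is 600 seconds, while `pipelined_total(150, Duration::from_secs(4), 16)` is 40 seconds under the same assumptions.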
Summary
Enable ARM64 FEAT_NV2 nested virtualization with VHE mode, allowing fcvm to run recursively inside itself (inception). This PR enables:
What Works
- test_inception_l2: Host spawns L1, L1 spawns L2, L2 echoes marker
- KVM_CAP_ARM_EL2=1 via MMFR4 override

What Doesn't Work Yet
- L3+ nesting: blocked by FUSE chain latency, with spawn_blocking serialization at each hop
- L3/L4 tests: #[ignore]d pending async PassthroughFs implementation

The ID Register Problem (Solved)
When running fcvm inside an fcvm VM, the L1 guest's KVM initially reported KVM_CAP_ARM_EL2=0.

Root cause: Virtual EL2 reads ID registers directly from hardware.
- Host KVM emulates MMFR4.NV_frac=NV2_ONLY
- HCR_EL2.TID3 traps ID register reads, but only for EL1 reads
- The L1 kernel sees MMFR4=0 (hardware) instead of the emulated value

Solution: Use the kernel's ID register override mechanism with the arm64.nv2 boot parameter.

Commits
- VHE mode for Firecracker (81a817d) - Boot guest in VHE mode with HAS_EL2 feature
- Document ID register limitation (4c0a086) - Why L1 sees MMFR4=0
- Enable recursive nesting (bc115b0) - Kernel patch for MMFR4 override
- Document complex PR format (04bbb00) - Update CLAUDE.md
- L1→L2 inception test working (40934a2) - test_inception_l2 passes
- Add L3/L4 tests and FUSE diagnostics (647101f) - #[ignore] and explanatory comments

Files Changed
- kernel/patches/mmfr4-override.patch - MMFR4 override with FTR_HIGHER_SAFE
- kernel/build.sh - Build 6.18 kernel with patch
- src/commands/podman.rs - VHE boot args
- tests/test_kvm.rs - L2/L3/L4 inception tests
- fuse-pipe/src/client/multiplexer.rs - Request logging and error details
- fuse-pipe/src/server/pipelined.rs - Request count logging
- src/storage/disk.rs - Document FUSE_REMAP_FILE_RANGE requirement

Test Results
Test Plan
forced to 2