Enable VHE mode and recursive nested virtualization #49

Merged: ejc3 merged 6 commits into main from vhe-mode-inception on Dec 31, 2025
Conversation

@ejc3 (Owner) commented Dec 29, 2025

Summary

Enable ARM64 FEAT_NV2 nested virtualization with VHE mode, allowing fcvm to run recursively inside itself (inception). This PR covers:

  • L1 nesting: Host → L1 VM with full KVM support
  • L2 nesting: L1 → L2 via shared FUSE storage
  • L3+ tests: documented but blocked by FUSE chain latency

What Works

  • L1→L2 inception test passes (test_inception_l2): Host spawns L1, L1 spawns L2, L2 echoes marker
  • Shared storage via FUSE-over-vsock: L2 imports container image from L1's FUSE mount
  • Recursive KVM: L1 kernel sees KVM_CAP_ARM_EL2=1 via MMFR4 override

What Doesn't Work Yet

  • L3+ inception blocked by FUSE latency: 3-hop FUSE chain causes ~3-5 second latency per request
  • Root cause: PassthroughFs + spawn_blocking serialization at each hop
  • FUSE mount initialization alone takes 10+ minutes at L3
  • Tests added but #[ignore]d pending async PassthroughFs implementation
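As a back-of-the-envelope model of the blockage above (the ~3-5 s/request and 10-minute figures come from this PR; the per-hop latency and request count below are assumptions for illustration, not measurements):

```rust
// Rough latency model for a fully serialized multi-hop FUSE chain.
// per_hop_ms and mount_requests are illustrative assumptions; the
// 3-5 s/request and ~10 min figures are the PR's observations.

fn chain_latency_ms(hops: u32, per_hop_ms: u32) -> u32 {
    // With spawn_blocking serialization at each hop, nothing
    // overlaps: each hop contributes its full round-trip latency.
    hops * per_hop_ms
}

fn main() {
    let per_hop_ms = 1_300; // assumed per-hop round trip
    let l3_request_ms = chain_latency_ms(3, per_hop_ms);
    assert_eq!(l3_request_ms, 3_900); // ~3.9 s, in the observed 3-5 s range

    // If mount initialization issues ~150 sequential requests
    // (an assumption), that alone is ~150 * 3.9 s ≈ 10 minutes,
    // consistent with the observed stall at L3.
    let mount_requests = 150;
    let mount_init_s = mount_requests * l3_request_ms / 1_000;
    println!("per-request: {} ms, mount init: {} s", l3_request_ms, mount_init_s);
}
```

Pipelining requests (so hops overlap instead of adding) is exactly what the future-work items below target.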

The ID Register Problem (Solved)

When running fcvm inside an fcvm VM, the L1 guest's KVM initially reported KVM_CAP_ARM_EL2=0.

Root cause: Virtual EL2 reads ID registers directly from hardware.

  1. Host KVM properly emulates ID registers - sets MMFR4.NV_frac=NV2_ONLY
  2. HCR_EL2.TID3 traps ID register reads - but only for EL1 reads
  3. Virtual EL2 accesses bypass TID3 and read hardware directly
  4. Guest sees MMFR4=0 (hardware) instead of emulated value

Solution: Use kernel's ID register override mechanism with arm64.nv2 boot parameter.
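For illustration, the field check at the heart of this can be sketched as plain bit arithmetic (assuming the Arm ARM layout where ID_AA64MMFR4_EL1.NV_frac occupies bits [23:20]; this is a sketch, not code from the PR or the kernel):

```rust
// Sketch: extracting ID_AA64MMFR4_EL1.NV_frac, assumed at bits [23:20].
// A nonzero NV_frac is what lets KVM advertise KVM_CAP_ARM_EL2.

const NV_FRAC_SHIFT: u64 = 20;
const NV_FRAC_MASK: u64 = 0xf;

fn nv_frac(mmfr4: u64) -> u64 {
    (mmfr4 >> NV_FRAC_SHIFT) & NV_FRAC_MASK
}

fn main() {
    // Hardware value seen from virtual EL2 before the fix:
    assert_eq!(nv_frac(0x0), 0); // no NV2 advertised -> KVM_CAP_ARM_EL2=0
    // Emulated value the host KVM stores (0xe100000, per the commit
    // message later in this PR): NV_frac is nonzero, advertising NV2.
    assert_eq!(nv_frac(0x0e10_0000), 1);
    println!("ok");
}
```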

Commits

  1. VHE mode for Firecracker (81a817d) - Boot guest in VHE mode with HAS_EL2 feature

  2. Document ID register limitation (4c0a086) - Why L1 sees MMFR4=0

  3. Enable recursive nesting (bc115b0) - Kernel patch for MMFR4 override

  4. Document complex PR format (04bbb00) - Update CLAUDE.md

  5. L1→L2 inception test working (40934a2) - test_inception_l2 passes

  6. Add L3/L4 tests and FUSE diagnostics (647101f)

    • Add test_inception_l3/l4 with #[ignore] and explanatory comments
    • Add request count logging to fuse-pipe for debugging
    • Add detailed socket error logging with error type/kind

Files Changed

  • kernel/patches/mmfr4-override.patch - MMFR4 override with FTR_HIGHER_SAFE
  • kernel/build.sh - Build 6.18 kernel with patch
  • src/commands/podman.rs - VHE boot args
  • tests/test_kvm.rs - L2/L3/L4 inception tests
  • fuse-pipe/src/client/multiplexer.rs - Request logging and error details
  • fuse-pipe/src/server/pipelined.rs - Request count logging
  • src/storage/disk.rs - Document FUSE_REMAP_FILE_RANGE requirement

Test Results

# L1→L2 test passes:
$ cargo test test_inception_l2 --release
test test_inception_l2 ... ok

# L3 blocked by FUSE latency (documented, test ignored):
# Each L3 FUSE request takes 3-5 seconds through 3-hop chain

Test Plan

  • L1 VM boots with VHE mode
  • MMFR4 override applied: forced to 2
  • KVM_CAP_ARM_EL2=1 in L1
  • test_inception_l2 passes
  • test_inception_l3 (blocked by FUSE latency, needs async PassthroughFs)

ejc3 added 3 commits December 29, 2025 15:06
Switch from nVHE to VHE mode for nested virtualization support.
VHE mode (E2H=1) allows the guest kernel to run at EL2, which is
required for kvm-arm.mode=nested to work in the guest.

Changes:
- podman.rs: Set FCVM_NV2=1 when --kernel is used, change boot
  param from kvm-arm.mode=nvhe to kvm-arm.mode=nested
- test_kvm.rs: Remove #[ignore] from 4-level inception test,
  add proper exit status checking to catch failures
- Makefile: Add rebuild-fc and dev-fc-test targets for Firecracker
  development workflow

Results:
- L1 guest now boots with VHE mode: "kvm [1]: VHE mode initialized"
- Basic nested KVM works (KVM_CREATE_VM succeeds in L1)
- KVM_CAP_ARM_EL2 is reported as 1048576 in L1

Known limitation:
Recursive nested virtualization (L1 creating L2 with HAS_EL2) fails:
  "Error initializing the vcpu: No such file or directory"
L1's KVM advertises the capability but KVM_ARM_VCPU_INIT with HAS_EL2
fails. This is a kernel limitation - NV2 patches note recursive
nesting is "not tested yet".

Tested on AWS c7g.metal (Graviton3) with kernel 6.18.2.

Root cause analysis for recursive nesting failure:
- Host KVM correctly stores emulated ID values (MMFR4=0xe100000, PFR0.EL2=1)
- But ID register reads from virtual EL2+VHE bypass KVM emulation
- Guest reads hardware directly: MMFR4=0, PFR0.EL2=0
- Evidence: 38,904 sysreg traps, ZERO ID registers, access_id_reg never called

Also adds Makefile targets for inception VM debugging:
- inception-vm: Start single development VM
- inception-exec: Run commands in VM
- inception-stop: Stop VM
- inception-status: Show VM status

Tested: make inception-vm, make inception-exec CMD="cat /proc/cpuinfo"

The problem: L1 guest's KVM reported KVM_CAP_ARM_EL2=0, blocking L2+ VM
creation. Root cause is that virtual EL2 reads ID registers directly
from hardware - HCR_EL2.TID3 only traps EL1 reads, not EL2 reads. So
L1 kernel saw MMFR4=0 instead of the emulated NV2_ONLY value.

The solution: Add arm64.nv2 boot parameter that overrides MMFR4.NV_frac
to advertise NV2 support. Key insight: the kernel's override mechanism
normally only allows lowering feature values (FTR_LOWER_SAFE). Changed
to FTR_HIGHER_SAFE for MMFR4 NV_frac to allow upward overrides.

Changes:
- kernel/patches/mmfr4-override.patch: Kernel patch adding MMFR4
  override support with FTR_HIGHER_SAFE, plus arm64.nv2 alias
- kernel/build.sh: Update to kernel 6.18, auto-compute SHA from
  build inputs, apply MMFR4 patch during build
- src/commands/podman.rs: Add arm64.nv2 to boot args for inception

Tested: Successfully ran L2 VM inside L1:
  [ctr:stdout] Hello from L2

The recursive nesting chain now works: Host -> L1 -> L2.
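The FTR_LOWER_SAFE vs FTR_HIGHER_SAFE distinction described above can be sketched as follows (a heavily simplified model with illustrative names; the kernel's real logic lives in its arm64 cpufeature handling and is considerably more involved):

```rust
// Simplified model of the feature-override safety rule. FtrPolicy
// and apply_override are illustrative names, not the kernel's API.

#[derive(Clone, Copy)]
enum FtrPolicy {
    LowerSafe,  // overrides may only lower the field value
    HigherSafe, // overrides may only raise it
}

fn apply_override(current: u64, requested: u64, policy: FtrPolicy) -> u64 {
    match policy {
        FtrPolicy::LowerSafe if requested < current => requested,
        FtrPolicy::HigherSafe if requested > current => requested,
        _ => current, // unsafe direction: keep the existing value
    }
}

fn main() {
    // Under the default LOWER_SAFE policy, raising NV_frac 0 -> 1
    // via a boot parameter is refused:
    assert_eq!(apply_override(0, 1, FtrPolicy::LowerSafe), 0);
    // After the patch marks NV_frac as HIGHER_SAFE, the upward
    // override from arm64.nv2 takes effect:
    assert_eq!(apply_override(0, 1, FtrPolicy::HigherSafe), 1);
    println!("ok");
}
```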
@ejc3 changed the title from "Enable VHE mode for nested virtualization" to "Enable VHE mode and recursive nested virtualization" on Dec 29, 2025
ejc3 added 3 commits December 29, 2025 18:59
- Add "Complex/Advanced PRs" section with detailed template for
  architectural changes, workarounds, and kernel patches
- Update inception docs to reflect that recursive nesting now works
- Replace "LIMITATION" with "Solved" and document the solution

- test_inception_l2 runs fcvm inside L1 VM to start L2
- Uses script file on shared storage to avoid shell escaping complexity
- L1 imports image from shared cache via skopeo, then runs fcvm
- L2 echoes marker to verify successful execution

Key components:
- Inception kernel 6.18 with FUSE_REMAP_FILE_RANGE support
- Shared storage /mnt/fcvm-btrfs via FUSE-over-vsock
- Image cache for sharing container images between levels

Also includes:
- ensure_inception_image() with content-addressable caching
- Updated Makefile with inception image build target
- fc-agent FUSE logging improvements

Tested: make test-root FILTER=inception_l2 (passes in ~46s)
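The content-addressable caching idea in ensure_inception_image() can be sketched like this (a minimal sketch using std's DefaultHasher as a stand-in for a real digest such as SHA-256; the function name and input strings are hypothetical, not the PR's code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of content-addressable caching: derive the cache key from
// the build inputs, so any input change yields a new key (forcing a
// rebuild) while identical inputs hit the cache.
fn cache_key(inputs: &[&str]) -> String {
    let mut h = DefaultHasher::new();
    for input in inputs {
        input.hash(&mut h);
    }
    format!("{:016x}", h.finish())
}

fn main() {
    let key = cache_key(&["kernel-6.18", "mmfr4-override.patch", "config"]);
    let same = cache_key(&["kernel-6.18", "mmfr4-override.patch", "config"]);
    let different = cache_key(&["kernel-6.18", "other.patch", "config"]);
    assert_eq!(key, same);      // identical inputs -> cache hit
    assert_ne!(key, different); // changed input -> new key, rebuild
    println!("cache key: {}", key);
}
```

The same idea underlies the kernel build.sh change ("auto-compute SHA from build inputs") in commit 3.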
L2 inception test passes but L3+ tests blocked by FUSE chain latency:
- 3-hop FUSE chain (L3→L2→L1→HOST) causes ~3-5 second latency per request
- PassthroughFs + spawn_blocking serialization at each hop
- FUSE mount initialization alone takes 10+ minutes at L3

Changes:
- Add test_inception_l3 and test_inception_l4 tests with #[ignore]
- Add run_inception_n_levels() helper for arbitrary nesting depth
- Add request count logging to fuse-pipe client/server for diagnostics
- Add detailed error logging for socket write failures (error type, kind)
- Update disk.rs docs about FUSE_REMAP_FILE_RANGE kernel requirement

The L3/L4 tests document the current limitation. Future work needed:
- Request pipelining to avoid spawn_blocking serialization
- Async PassthroughFs implementation
@ejc3 merged commit 647101f into main on Dec 31, 2025
3 of 4 checks passed