
Enable nested virtualization with performance benchmarks#50

Closed
ejc3 wants to merge 13 commits into main from fuse-latency-investigation

Conversation


ejc3 commented Dec 30, 2025

Summary

Enable ARM64 nested virtualization (FEAT_NV2) with working L1→L2 inception and comprehensive performance benchmarks at each level.

  • VHE mode enabled for guest KVM support
  • MMFR4 kernel override to fix ID register visibility issue
  • L2 inception test with egress and disk benchmarks
  • L3+ tests documented as blocked (FUSE chain latency)

Key Changes

Nested Virtualization Support

  • Switch from nVHE to VHE mode (kvm-arm.mode=nested)
  • Add arm64.nv2 boot parameter to override MMFR4 for recursive nesting
  • Kernel patch: kernel/patches/mmfr4-override.patch

Performance Benchmarks

L2 inception test now measures at each level:

  • Egress: curl to ifconfig.me (both L1 and L2 work)
  • Local disk: dd 10MB write/read
  • FUSE disk: dd 10MB write/read through vsock
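A minimal sketch of the per-level disk probe (paths and the use of `date +%s%N` for millisecond timing are illustrative; the actual test scripts live on the shared mount):

```shell
# Time a 10MB write and read with dd, reporting milliseconds.
# conv=fsync flushes to disk before the write timer stops.
FILE=/tmp/bench.img

t0=$(date +%s%N)
dd if=/dev/zero of="$FILE" bs=1M count=10 conv=fsync 2>/dev/null
t1=$(date +%s%N)
echo "write: $(( (t1 - t0) / 1000000 )) ms"

t0=$(date +%s%N)
dd if="$FILE" of=/dev/null bs=1M 2>/dev/null
t1=$(date +%s%N)
echo "read: $(( (t1 - t0) / 1000000 )) ms"

rm -f "$FILE"
```

Run once against the local rootfs and once against the FUSE mount to get the two disk rows per level.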

Results (c7g.metal)

| Metric | L1 (1-hop) | L2 (2-hop) | Ratio |
|---|---|---|---|
| Egress | ✅ OK | ✅ OK | Both work |
| Local Write 10MB | 4ms | 11ms | 2.75x |
| Local Read 10MB | 2ms | 5ms | 2.5x |
| FUSE Write 10MB | 77ms | 205ms | 2.66x |
| FUSE Read 10MB | 41ms | 138ms | 3.37x |
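The MB/s figures quoted in the commit log (≈130/244 MB/s at L1, ≈49/72 MB/s at L2) and the Ratio column follow directly from the table; a quick awk check:

```shell
# Convert "ms for 10MB" into MB/s and recompute one ratio from the table.
awk 'BEGIN {
  # t[1..8] = local-write L1/L2, local-read L1/L2, FUSE-write L1/L2, FUSE-read L1/L2 (ms)
  split("4 11 2 5 77 205 41 138", t)
  printf "FUSE write L1: %.0f MB/s\n", 10 / (t[5] / 1000)   # 130
  printf "FUSE write L2: %.0f MB/s\n", 10 / (t[6] / 1000)   # 49
  printf "FUSE read  L1: %.0f MB/s\n", 10 / (t[7] / 1000)   # 244
  printf "FUSE read  L2: %.0f MB/s\n", 10 / (t[8] / 1000)   # 72
  printf "local write ratio: %.2fx\n", t[2] / t[1]          # 2.75x
}'
```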

L3 Limitation Documented

L3+ tests are marked #[ignore] because the 3-hop FUSE chain adds ~200-500ms of per-request latency (bulk throughput, by contrast, remains acceptable). Boot takes 10+ minutes due to thousands of small-file accesses.
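The 10-minute boot figure is consistent with the per-request latency; a back-of-envelope check (the ~2000-request count is illustrative, not measured):

```shell
# Thousands of small-file FUSE requests at hundreds of ms each dominate L3 boot.
requests=2000     # assumed request count during boot
latency_ms=300    # mid-range of the observed 200-500ms per request
total_s=$(( requests * latency_ms / 1000 ))
echo "${total_s}s ≈ $(( total_s / 60 )) minutes"   # 600s ≈ 10 minutes
```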

Test Plan

  • make test-root FILTER=inception_l2 passes (~41s)
  • L1 and L2 egress verified (same public IP)
  • FUSE bulk throughput measured at each level
  • L3 limitation documented with clear root cause

ejc3 added 13 commits December 29, 2025 15:06

Switch from nVHE to VHE mode for nested virtualization support.
VHE mode (E2H=1) allows the guest kernel to run at EL2, which is
required for kvm-arm.mode=nested to work in the guest.

Changes:
- podman.rs: Set FCVM_NV2=1 when --kernel is used, change boot
  param from kvm-arm.mode=nvhe to kvm-arm.mode=nested
- test_kvm.rs: Remove #[ignore] from 4-level inception test,
  add proper exit status checking to catch failures
- Makefile: Add rebuild-fc and dev-fc-test targets for Firecracker
  development workflow

Results:
- L1 guest now boots with VHE mode: "kvm [1]: VHE mode initialized"
- Basic nested KVM works (KVM_CREATE_VM succeeds in L1)
- KVM_CAP_ARM_EL2 is reported as 1048576 in L1

Known limitation:
Recursive nested virtualization (L1 creating L2 with HAS_EL2) fails:
  "Error initializing the vcpu: No such file or directory"
L1's KVM advertises the capability but KVM_ARM_VCPU_INIT with HAS_EL2
fails. This is a kernel limitation - NV2 patches note recursive
nesting is "not tested yet".

Tested on AWS c7g.metal (Graviton3) with kernel 6.18.2.

Root cause analysis for recursive nesting failure:
- Host KVM correctly stores emulated ID values (MMFR4=0xe100000, PFR0.EL2=1)
- But ID register reads from virtual EL2+VHE bypass KVM emulation
- Guest reads hardware directly: MMFR4=0, PFR0.EL2=0
- Evidence: 38,904 sysreg traps observed, zero of them for ID registers; access_id_reg never called

Also adds Makefile targets for inception VM debugging:
- inception-vm: Start single development VM
- inception-exec: Run commands in VM
- inception-stop: Stop VM
- inception-status: Show VM status

Tested: make inception-vm, make inception-exec CMD="cat /proc/cpuinfo"

The problem: L1 guest's KVM reported KVM_CAP_ARM_EL2=0, blocking L2+ VM
creation. Root cause is that virtual EL2 reads ID registers directly
from hardware - HCR_EL2.TID3 only traps EL1 reads, not EL2 reads. So
L1 kernel saw MMFR4=0 instead of the emulated NV2_ONLY value.

The solution: Add arm64.nv2 boot parameter that overrides MMFR4.NV_frac
to advertise NV2 support. Key insight: the kernel's override mechanism
normally only allows lowering feature values (FTR_LOWER_SAFE). Changed
to FTR_HIGHER_SAFE for MMFR4 NV_frac to allow upward overrides.

Changes:
- kernel/patches/mmfr4-override.patch: Kernel patch adding MMFR4
  override support with FTR_HIGHER_SAFE, plus arm64.nv2 alias
- kernel/build.sh: Update to kernel 6.18, auto-compute SHA from
  build inputs, apply MMFR4 patch during build
- src/commands/podman.rs: Add arm64.nv2 to boot args for inception

Tested: Successfully ran L2 VM inside L1:
  [ctr:stdout] Hello from L2

The recursive nesting chain now works: Host -> L1 -> L2.

- Add "Complex/Advanced PRs" section with detailed template for
  architectural changes, workarounds, and kernel patches
- Update inception docs to reflect that recursive nesting now works
- Replace "LIMITATION" with "Solved" and document the solution

- test_inception_l2 runs fcvm inside L1 VM to start L2
- Uses script file on shared storage to avoid shell escaping complexity
- L1 imports image from shared cache via skopeo, then runs fcvm
- L2 echoes marker to verify successful execution
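The script-on-shared-storage trick can be sketched as follows, with a temp dir standing in for the PR's /mnt/fcvm-btrfs/inception-scripts/ mount:

```shell
# Write the L2 payload as a script file instead of a quoted one-liner,
# so nothing needs escaping through the nested exec layers.
dir=$(mktemp -d)
cat > "$dir/run-l2.sh" <<'EOF'
#!/bin/sh
echo "Hello from L2"
EOF
chmod +x "$dir/run-l2.sh"

# L1 invokes the script by path; capture its marker output.
out=$("$dir/run-l2.sh")
echo "$out"    # prints: Hello from L2
rm -rf "$dir"
```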

Key components:
- Inception kernel 6.18 with FUSE_REMAP_FILE_RANGE support
- Shared storage /mnt/fcvm-btrfs via FUSE-over-vsock
- Image cache for sharing container images between levels

Also includes:
- ensure_inception_image() with content-addressable caching
- Updated Makefile with inception image build target
- fc-agent FUSE logging improvements

Tested: make test-root FILTER=inception_l2 (passes in ~46s)

L2 inception test passes but L3+ tests blocked by FUSE chain latency:
- 3-hop FUSE chain (L3→L2→L1→HOST) causes ~3-5 second latency per request
- PassthroughFs + spawn_blocking serialization at each hop
- FUSE mount initialization alone takes 10+ minutes at L3

Changes:
- Add test_inception_l3 and test_inception_l4 tests with #[ignore]
- Add run_inception_n_levels() helper for arbitrary nesting depth
- Add request count logging to fuse-pipe client/server for diagnostics
- Add detailed error logging for socket write failures (error type, kind)
- Update disk.rs docs about FUSE_REMAP_FILE_RANGE kernel requirement

The L3/L4 tests document the current limitation. Future work needed:
- Request pipelining to avoid spawn_blocking serialization
- Async PassthroughFs implementation

Replace simple marker echo with comprehensive benchmarks at each level:
- Egress test: curl to ifconfig.me to verify network works
- Local disk: dd 10MB write/read to measure rootfs performance
- FUSE disk: dd 10MB write/read to measure vsock FUSE performance

Results from c7g.metal:
- Egress: Both L1 and L2 can reach internet (same public IP)
- Local disk L1: 4ms write, 2ms read (10MB)
- Local disk L2: 11ms write, 5ms read (10MB) - ~2.5x nested overhead
- FUSE L1: 77ms write, 41ms read - ~130/244 MB/s
- FUSE L2: 205ms write, 138ms read - ~49/72 MB/s

Key finding: FUSE bulk throughput is usable at L2 (~3x slower per hop),
but L3 fails due to per-request latency during boot (hundreds of ms per
small file access through 3-hop chain).

Scripts written to /mnt/fcvm-btrfs/inception-scripts/ for reuse.

Changes:
- Host-side: Export localhost/ images as OCI archive tar files using
  `podman save --format oci-archive` instead of skopeo directory format
- Guest-side: Run directly from `oci-archive:/path` format without import
- Strip sha256: prefix from digest for valid filenames
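The export/run flow described above looks roughly like this (paths illustrative; direct-run support for the `oci-archive:` transport is assumed per the commit message):

```
# Host side: export a localhost/ image as a single OCI archive tar
podman save --format oci-archive -o /mnt/cache/test-oci.tar localhost/test-oci:latest

# Guest side: run directly from the archive, no skopeo import step
podman run oci-archive:/mnt/cache/test-oci.tar
```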

Benefits:
- Single file transfer over FUSE (previously directory with many blobs)
- No skopeo import step needed in guest - podman runs directly from archive
- Faster startup for localhost/ images

Tested:
  sudo fcvm podman run localhost/test-oci:latest
  Container output: OCI_ARCHIVE_TEST_SUCCESS
  make test-root FILTER=sanity  # 2 passed

- Update Makefile to copy binaries to artifacts/ instead of bin/
- Update Containerfile.inception COPY paths to use artifacts/
- Update test_kvm.rs to use artifacts/ for SHA marker and binaries
- Add Containerfile.inception and inception.sh source files

The artifacts/ directory is already in .gitignore, so derived files
(fcvm, fc-agent, firecracker-nv2) won't be tracked.

Tested: test_inception_l2 passes in 68.87s

Benchmark results (256 workers, 1024 × 4KB files):
- 1 reader:   3.0s writes (serialization bottleneck)
- 64 readers: 165ms writes ≈ host 161ms
- 256 readers: 162ms writes (3ms improvement, 4x memory)

64 readers achieves near-host performance while using 1/4 the memory
(512MB vs 2GB for reader stacks at 8MB each).
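The memory claim checks out at 8MB per reader stack:

```shell
# Reader-stack memory for the two configurations (figures from the results above).
echo "64 readers:  $(( 64 * 8 )) MB"           # 512 MB
echo "256 readers: $(( 256 * 8 / 1024 )) GB"   # 2 GB
```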

Added FCVM_FUSE_TRACE_RATE env var to enable per-operation tracing:
- Passed to guest via kernel boot param (fuse_trace_rate=N)
- Shows operation name, total latency, server time, fs time
- Handles clock skew between VMs (shows "?" for invalid deltas)

Key findings from L2 investigation:
- Async writes: 3x slower (expected for 2 FUSE hops)
- Fsync: 16x slower (blocks synchronously through both layers)
- Combined sync writes: 7x slower

Files changed:
- fc-agent/src/fuse/mod.rs: Parse fuse_trace_rate from /proc/cmdline
- src/commands/podman.rs: Pass FCVM_FUSE_TRACE_RATE via boot args
- fuse-pipe/src/protocol/wire.rs: Fix overflow, add op_name to trace
- fuse-pipe/src/client/multiplexer.rs: Pass op_name to print()
- tests/test_kvm.rs: Enable tracing in L2 test
- .claude/CLAUDE.md: Document tracing capabilities

Tested: make test-root FILTER=inception_l2 (passed)

Three fixes for TTY mode exec:

1. Set stdin to null in test TTY wrapper
   - Prevents garbage from test harness being sent to VM's PTY
   - Root cause of null byte (`^@`) in TTY output

2. Fix output lost when exit JSON in same buffer
   - When command output and exit JSON arrive in same packet,
     the old code returned immediately without writing output
   - Now writes data BEFORE the exit JSON before returning

3. Increase vsock backlog from 5 to 128
   - Prevents connection refused under parallel exec stress
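Fix 2 can be illustrated with a naive split on the first `{` (illustrative only; the real code uses its own framing, and command output could itself contain a brace):

```shell
# When command output and the exit JSON land in the same packet,
# emit the output bytes BEFORE handling the JSON.
buf='hello world{"exit_code":0}'
out=${buf%%\{*}        # bytes before the exit JSON
json=${buf#"$out"}     # the exit JSON itself
printf '%s\n' "$out"   # prints: hello world
printf '%s\n' "$json"  # prints: {"exit_code":0}
```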

Added test_exec_parallel_tty_stress test:
- 100 parallel TTY execs
- Verifies no null bytes or lost output
- Requires 100% success rate

Tested:
  sudo -E cargo test --release -p fcvm --test test_exec \
    test_exec_parallel_tty_stress --features integration-fast
  # 100/100 success, 248 execs/sec

Matches podman/docker semantics:
- -t: allocate PTY (colors/formatting)
- -i: forward stdin
- -it: both (interactive shell)
- neither: plain exec
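Hypothetical invocations for the four combinations (exact fcvm exec CLI syntax assumed here, mirroring podman):

```
fcvm exec myvm -t  -- ls --color     # PTY: colors/formatting, no stdin
echo hi | fcvm exec myvm -i -- cat   # stdin forwarded, no PTY
fcvm exec myvm -it -- /bin/sh        # interactive shell
fcvm exec myvm -- uname -a           # plain exec
```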

Changes:
- Add exec-proto crate for binary framing protocol
- Unify run_framed_mode() for both TTY and non-TTY interactive
- fc-agent uses pipes for non-TTY interactive, PTY only with -t
- Add tests for all 4 flag combinations (VM + container)

ejc3 commented Dec 31, 2025

Changes already in the PR chain

@ejc3 ejc3 closed this Dec 31, 2025
@ejc3 ejc3 deleted the fuse-latency-investigation branch December 31, 2025 17:47