102 changes: 88 additions & 14 deletions .claude/CLAUDE.md
@@ -19,9 +19,7 @@ fcvm is a Firecracker VM manager for running Podman containers in lightweight mi
## Nested Virtualization (Inception)

fcvm supports running inside another fcvm VM ("inception") using ARM64 FEAT_NV2.

**LIMITATION**: Only **one level** of nesting currently works (Host → L1). Recursive nesting
(L1 → L2 → L3...) is blocked because L1's KVM reports `KVM_CAP_ARM_EL2=0`.
Recursive nesting (Host → L1 → L2 → ...) is enabled via the `arm64.nv2` kernel boot parameter.

### Requirements

@@ -32,11 +30,12 @@ fcvm supports running inside another fcvm VM ("inception") using ARM64 FEAT_NV2.
### How It Works

1. Set `FCVM_NV2=1` environment variable (auto-set when `--kernel` flag is used)
2. fcvm passes `--enable-nv2` to Firecracker, which enables `HAS_EL2` + `HAS_EL2_E2H0` vCPU features
3. vCPU boots at EL2h so guest kernel sees HYP mode available
4. EL2 registers are initialized: HCR_EL2, CNTHCTL_EL2, VMPIDR_EL2, VPIDR_EL2
5. Guest kernel initializes KVM: "Hyp nVHE mode initialized successfully"
6. Nested fcvm can now create VMs using the guest's KVM
2. fcvm passes `--enable-nv2` to Firecracker, which enables the `HAS_EL2` vCPU feature
3. vCPU boots at EL2h in VHE mode (E2H=1) so guest kernel sees HYP mode available
4. EL2 registers are initialized: HCR_EL2, VMPIDR_EL2, VPIDR_EL2
5. Guest kernel initializes KVM: "VHE mode initialized successfully"
6. `arm64.nv2` boot param overrides MMFR4 to advertise NV2 support
7. L1 KVM reports `KVM_CAP_ARM_EL2=1`, enabling recursive L2+ VMs
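
Steps 6 and 7 can be spot-checked from inside the L1 guest. The following is a minimal sketch, assuming a POSIX shell is available in the guest; the helper name `cmdline_has_param` is hypothetical and not part of fcvm. It only confirms that the boot parameter is present on the kernel command line, not that KVM accepted the override.

```shell
#!/bin/sh
# Hypothetical helper (not part of fcvm): succeed if a kernel command line
# string contains the given parameter as a whole word, bare or as key=value.
cmdline_has_param() {
    cmdline=$1
    param=$2
    for word in $cmdline; do
        case $word in
            "$param"|"$param"=*) return 0 ;;
        esac
    done
    return 1
}

# Usage inside the guest: read the live command line from /proc/cmdline.
if [ -r /proc/cmdline ] && cmdline_has_param "$(cat /proc/cmdline)" arm64.nv2; then
    echo "arm64.nv2 is set"
fi
```

Matching whole words avoids a false positive on a longer parameter that merely starts with `arm64.nv2`.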

### Running Inception

@@ -61,9 +60,9 @@ fcvm podman run --name inner --network bridged alpine:latest

Firecracker fork with NV2 support: `ejc3/firecracker:nv2-inception`

- `HAS_EL2` (bit 7): Enables virtual EL2 for guest
- `HAS_EL2_E2H0` (bit 8): Forces nVHE mode (avoids timer trap storm)
- `HAS_EL2` (bit 7): Enables virtual EL2 for guest in VHE mode
- Boot at EL2h: Guest kernel must see CurrentEL=EL2 on boot
- VHE mode (E2H=1): Required for NV2 support in guest (nVHE mode doesn't support NV2)
- VMPIDR_EL2/VPIDR_EL2: Proper processor IDs for nested guests
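
When reading debug output, it can help to turn the bit numbers above into mask values. This is a small sketch, assuming the bit numbers index a plain bitmask as the list states; `feature_mask` is a hypothetical helper name, not something provided by Firecracker or fcvm.

```shell
#!/bin/sh
# Hypothetical helper: OR together the given bit positions into one mask.
feature_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

HAS_EL2=7        # bit number taken from the list above
HAS_EL2_E2H0=8   # bit number from the pre-change list (nVHE variant)

feature_mask "$HAS_EL2"                  # VHE configuration: prints 0x80
feature_mask "$HAS_EL2" "$HAS_EL2_E2H0"  # earlier nVHE configuration: prints 0x180
```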

### Tests
@@ -75,13 +74,36 @@ make test-root FILTER=inception
- `test_kvm_available_in_vm`: Verifies /dev/kvm works in guest
- `test_inception_run_fcvm_inside_vm`: Full inception test

### Recursive Nesting Limitation
### Recursive Nesting: The ID Register Problem (Solved)

**Problem**: L1's KVM initially reported `KVM_CAP_ARM_EL2=0`, blocking L2+ VMs.

**Root cause**: ARM architecture provides no mechanism to virtualize ID registers for virtual EL2.

1. Host KVM stores correct emulated ID values in `kvm->arch.id_regs[]`
2. `HCR_EL2.TID3` controls trapping of ID register reads - but only for **EL1 reads**
3. When guest runs at virtual EL2 (with NV2), ID register reads are EL2-level accesses
4. EL2-level accesses don't trap via TID3 - they read hardware directly
5. Guest sees `MMFR4=0` (hardware), not `MMFR4=NV2_ONLY` (emulated)

L1's KVM reports `KVM_CAP_ARM_EL2=0`, blocking L2+ VMs.
**Solution**: Use kernel's ID register override mechanism with `arm64.nv2` boot parameter.

**Root cause**: `kvm-arm.mode=nested` requires VHE (kernel at EL2), but NV2's E2H0 flag forces nVHE (kernel at EL1). E2H0 is required to avoid timer trap storms.
1. Added `arm64.nv2` alias for `id_aa64mmfr4.nv_frac=2` (NV2_ONLY)
2. Changed `FTR_LOWER_SAFE` to `FTR_HIGHER_SAFE` for MMFR4 to allow upward overrides
3. Kernel patch: `kernel/patches/mmfr4-override.patch`

**Status**: Waiting for kernel improvements. Kernel NV2 patches mark recursive nesting as "not tested yet".
**Why it's safe**: The host KVM *does* provide NV2 emulation - we're just fixing the guest's
view of this capability. We're not faking a feature, we're correcting a visibility issue.

**Verification**:
```
$ dmesg | grep mmfr4
CPU features: SYS_ID_AA64MMFR4_EL1[23:20]: forced to 2

$ check_kvm_caps
KVM_CAP_ARM_EL2 (cap 240) = 1
-> Nested virtualization IS supported by KVM (VHE mode)
```
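
The `[23:20]` range in the dmesg line is a 4-bit field of the register. A minimal sketch of decoding it from a raw register value, assuming the value is available as a hex string; `mmfr4_field_23_20` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract bits [23:20] from a register value.
mmfr4_field_23_20() {
    echo $(( ($1 >> 20) & 0xF ))
}

mmfr4_field_23_20 0x0        # hardware view described above (MMFR4=0): prints 0
mmfr4_field_23_20 0x200000   # field forced to 2 by the override: prints 2
```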

## Quick Reference

@@ -319,6 +341,58 @@ Tested locally:
Fixed CI. Tested and it works.
```

#### Complex/Advanced PRs

**For non-trivial changes (architectural, workarounds, kernel patches), include:**

1. **The Problem** - What was failing and why. Include root cause analysis.
2. **The Solution** - How you fixed it. Explain the approach, not just "what" but "why this way".
3. **Why It's Safe** - For workarounds or unusual approaches, explain why it won't break things.
4. **Alternatives Considered** - What else you tried and why it didn't work.
5. **Test Results** - Actual command output proving it works.

**Example structure for complex PRs:**

```markdown
## Summary
One-line description of what this enables.

## The Problem
- What was broken
- Root cause analysis (be specific)
- Why existing approaches didn't work

## The Solution
1. First key change and why
2. Second key change and why
3. Why this approach over alternatives

### Why This Is Safe
- Explain non-obvious safety guarantees
- Address potential concerns upfront

### Alternatives Considered
1. Alternative A - why it didn't work
2. Alternative B - why it was more invasive

## Test Results
\`\`\`
$ actual-command-run
actual output proving it works
\`\`\`

## Test Plan
- [x] Test case 1
- [x] Test case 2
```

**When to use this format:**
- Kernel patches or low-level system changes
- Workarounds for architectural limitations
- Changes that might seem "wrong" without context
- Multi-commit PRs with complex interactions
- Anything where a reviewer might ask "why not just...?"

**Why evidence matters:**
- Proves the fix works, not just "looks right"
- Local testing is sufficient; CI doesn't need to be green first
202 changes: 201 additions & 1 deletion Makefile
@@ -63,7 +63,8 @@ CONTAINER_RUN := podman run --rm --privileged \
test test-unit test-fast test-all test-root \
_test-unit _test-fast _test-all _test-root \
container-build container-test container-test-unit container-test-fast container-test-all \
container-shell container-clean setup-btrfs setup-fcvm setup-pjdfstest bench lint fmt
container-shell container-clean setup-btrfs setup-fcvm setup-pjdfstest setup-inception bench lint fmt \
rebuild-fc dev-fc-test inception-vm inception-exec inception-wait-exec inception-stop inception-status

all: build

@@ -227,6 +228,37 @@ _setup-fcvm:
fi
./target/release/fcvm setup

# Inception test setup - builds container with matching CAS chain
# Ensures: bin/fc-agent == target/release/fc-agent, initrd SHA matches, container cached
setup-inception: setup-fcvm
@echo "==> Setting up inception test container..."
@echo "==> Copying binaries to bin/..."
mkdir -p bin
cp target/release/fcvm bin/
cp target/$(MUSL_TARGET)/release/fc-agent bin/
cp /usr/local/bin/firecracker firecracker-nv2 2>/dev/null || true
@echo "==> Building inception-test container..."
podman rmi localhost/inception-test 2>/dev/null || true
podman build -t localhost/inception-test -f Containerfile.inception .
@echo "==> Exporting container to CAS cache..."
@DIGEST=$$(podman inspect localhost/inception-test --format '{{.Digest}}'); \
CACHE_DIR="/mnt/fcvm-btrfs/image-cache/$${DIGEST}"; \
if [ -d "$$CACHE_DIR" ]; then \
echo "Cache already exists: $$CACHE_DIR"; \
else \
echo "Creating cache: $$CACHE_DIR"; \
sudo mkdir -p "$$CACHE_DIR"; \
sudo skopeo copy containers-storage:localhost/inception-test "dir:$$CACHE_DIR"; \
fi
@echo "==> Verification..."
@echo "fc-agent SHA: $$(sha256sum bin/fc-agent | cut -c1-12)"
@echo "Container fc-agent SHA: $$(podman run --rm localhost/inception-test sha256sum /usr/local/bin/fc-agent | cut -c1-12)"
@echo "Initrd: $$(ls -1 /mnt/fcvm-btrfs/initrd/fc-agent-*.initrd | tail -1)"
@DIGEST=$$(podman inspect localhost/inception-test --format '{{.Digest}}'); \
echo "Image digest: $$DIGEST"; \
echo "Cache path: /mnt/fcvm-btrfs/image-cache/$$DIGEST"
@echo "==> Inception setup complete!"

bench: build
@echo "==> Running benchmarks..."
sudo cargo bench -p fuse-pipe --bench throughput
@@ -238,3 +270,171 @@ lint:

fmt:
cargo fmt

# Firecracker development targets
# Rebuild Firecracker from source and install to /usr/local/bin
# Usage: make rebuild-fc
FIRECRACKER_SRC ?= /home/ubuntu/firecracker
FIRECRACKER_BIN := $(FIRECRACKER_SRC)/build/cargo_target/release/firecracker

rebuild-fc:
@echo "==> Force rebuilding Firecracker..."
touch $(FIRECRACKER_SRC)/src/vmm/src/arch/aarch64/vcpu.rs
cd $(FIRECRACKER_SRC) && cargo build --release
@echo "==> Installing Firecracker to /usr/local/bin..."
sudo rm -f /usr/local/bin/firecracker
sudo cp $(FIRECRACKER_BIN) /usr/local/bin/firecracker
@echo "==> Verifying installation..."
@strings /usr/local/bin/firecracker | grep -q "NV2 DEBUG" && echo "NV2 debug strings: OK" || echo "WARNING: NV2 debug strings missing"
/usr/local/bin/firecracker --version

# Full rebuild cycle: Firecracker + fcvm + run test
# Usage: make dev-fc-test FILTER=inception
dev-fc-test: rebuild-fc build
@echo "==> Running test with FILTER=$(FILTER)..."
FCVM_DATA_DIR=$(ROOT_DATA_DIR) \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_RUNNER='sudo -E' \
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER='sudo -E' \
RUST_LOG=debug \
$(NEXTEST) $(NEXTEST_CAPTURE) --features privileged-tests $(FILTER)

# =============================================================================
# Inception VM development targets
# =============================================================================
# These targets manage a SINGLE inception VM for debugging.
# Only ONE VM can exist at a time - inception-vm kills any existing VM first.

# Find the inception kernel (latest vmlinux-*.bin with KVM support)
INCEPTION_KERNEL := $(shell ls -t /mnt/fcvm-btrfs/kernels/vmlinux-*.bin 2>/dev/null | head -1)
INCEPTION_VM_NAME := inception-dev
INCEPTION_VM_LOG := /tmp/inception-vm.log
INCEPTION_VM_PID := /tmp/inception-vm.pid

# Start an inception VM (kills any existing VM first)
# Usage: make inception-vm
inception-vm: build
@echo "==> Ensuring clean environment (killing ALL existing VMs)..."
@sudo pkill -9 firecracker 2>/dev/null || true
@sudo pkill -9 -f "fcvm podman" 2>/dev/null || true
@sleep 2
@if pgrep firecracker >/dev/null 2>&1; then \
echo "ERROR: Could not kill existing firecracker"; \
exit 1; \
fi
@sudo rm -f $(INCEPTION_VM_PID) $(INCEPTION_VM_LOG)
@sudo rm -rf /mnt/fcvm-btrfs/state/vm-*.json
@if [ -z "$(INCEPTION_KERNEL)" ]; then \
echo "ERROR: No inception kernel found. Run ./kernel/build.sh first."; \
exit 1; \
fi
@echo "==> Starting SINGLE inception VM"
@echo "==> Kernel: $(INCEPTION_KERNEL)"
@echo "==> Log: $(INCEPTION_VM_LOG)"
@echo "==> Use 'make inception-exec CMD=...' to run commands"
@echo "==> Use 'make inception-stop' to stop"
@sudo ./target/release/fcvm podman run \
--name $(INCEPTION_VM_NAME) \
--network bridged \
--kernel $(INCEPTION_KERNEL) \
--privileged \
--map /mnt/fcvm-btrfs:/mnt/fcvm-btrfs \
--cmd "sleep infinity" \
alpine:latest > $(INCEPTION_VM_LOG) 2>&1 & \
sleep 2; \
FCVM_PID=$$(pgrep -n -f "fcvm podman run.*$(INCEPTION_VM_NAME)"); \
echo "$$FCVM_PID" | sudo tee $(INCEPTION_VM_PID) > /dev/null; \
echo "==> VM started with fcvm PID $$FCVM_PID"; \
echo "==> Waiting for boot..."; \
sleep 20; \
FC_COUNT=$$(pgrep -c firecracker || true); \
if [ "$$FC_COUNT" -ne 1 ]; then \
echo "ERROR: Expected 1 firecracker, got $$FC_COUNT"; \
exit 1; \
fi; \
echo "==> VM ready. Tailing log (Ctrl+C to stop tail, VM keeps running):"; \
tail -f $(INCEPTION_VM_LOG)

# Run a command inside the running inception VM
# Usage: make inception-exec CMD="ls -la /dev/kvm"
# Usage: make inception-exec CMD="/mnt/fcvm-btrfs/check_kvm_caps"
CMD ?= uname -a
inception-exec:
@if [ ! -f $(INCEPTION_VM_PID) ]; then \
echo "ERROR: No PID file found at $(INCEPTION_VM_PID)"; \
echo "Start a VM with 'make inception-vm' first."; \
exit 1; \
fi; \
PID=$$(cat $(INCEPTION_VM_PID)); \
if ! kill -0 $$PID 2>/dev/null; then \
echo "ERROR: VM process $$PID is not running"; \
echo "Start a VM with 'make inception-vm' first."; \
rm -f $(INCEPTION_VM_PID); \
exit 1; \
fi; \
echo "==> Running in VM (PID $$PID): $(CMD)"; \
sudo ./target/release/fcvm exec --pid $$PID -- $(CMD)

# Wait for VM to be ready and then run a command
# Usage: make inception-wait-exec CMD="/mnt/fcvm-btrfs/check_kvm_caps"
inception-wait-exec: build
@echo "==> Waiting for inception VM to be ready..."
@if [ ! -f $(INCEPTION_VM_PID) ]; then \
echo "ERROR: No PID file found. Start a VM with 'make inception-vm &' first."; \
exit 1; \
fi; \
PID=$$(cat $(INCEPTION_VM_PID)); \
for i in $$(seq 1 30); do \
if ! kill -0 $$PID 2>/dev/null; then \
echo "ERROR: VM process $$PID exited"; \
rm -f $(INCEPTION_VM_PID); \
exit 1; \
fi; \
if sudo ./target/release/fcvm exec --pid $$PID -- true 2>/dev/null; then \
echo "==> VM ready (PID $$PID)"; \
echo "==> Running: $(CMD)"; \
sudo ./target/release/fcvm exec --pid $$PID -- $(CMD); \
exit 0; \
fi; \
sleep 2; \
echo " Waiting... ($$i/30)"; \
done; \
echo "ERROR: Timeout waiting for VM to be ready"; \
exit 1

# Stop the inception VM
inception-stop:
@if [ -f $(INCEPTION_VM_PID) ]; then \
PID=$$(cat $(INCEPTION_VM_PID)); \
if kill -0 $$PID 2>/dev/null; then \
echo "==> Stopping VM (PID $$PID)..."; \
sudo kill $$PID 2>/dev/null || true; \
sleep 1; \
if kill -0 $$PID 2>/dev/null; then \
echo "==> Force killing..."; \
sudo kill -9 $$PID 2>/dev/null || true; \
fi; \
echo "==> VM stopped."; \
else \
echo "==> VM process $$PID not running (stale PID file)"; \
fi; \
rm -f $(INCEPTION_VM_PID); \
else \
echo "==> No PID file found. No VM to stop."; \
fi

# Show VM status
inception-status:
@echo "=== Inception VM Status ==="
@if [ -f $(INCEPTION_VM_PID) ]; then \
PID=$$(cat $(INCEPTION_VM_PID)); \
if kill -0 $$PID 2>/dev/null; then \
echo "VM PID: $$PID (running)"; \
ps -p $$PID -o pid,ppid,user,%cpu,%mem,etime,cmd --no-headers 2>/dev/null || true; \
else \
echo "VM PID: $$PID (NOT running - stale PID file)"; \
rm -f $(INCEPTION_VM_PID); \
fi; \
else \
echo "No PID file found at $(INCEPTION_VM_PID)"; \
echo "No VM running."; \
fi
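
The `inception-exec`, `inception-stop`, and `inception-status` targets above all repeat the same PID-file check. A minimal sketch of that check as a standalone shell function follows; the name `pid_file_status` is hypothetical and the function is not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: classify a PID file as missing, stale, or running.
pid_file_status() {
    pid_file=$1
    if [ ! -f "$pid_file" ]; then
        echo missing
        return 0
    fi
    pid=$(cat "$pid_file")
    # kill -0 sends no signal; it only tests whether we could signal the PID.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}

# Path matches INCEPTION_VM_PID in the Makefile above.
pid_file_status /tmp/inception-vm.pid
```

Note that `kill -0` also fails when the caller lacks permission to signal the process, so an unprivileged caller may see `stale` for a VM that was started under `sudo`.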