Merged
56 changes: 56 additions & 0 deletions .claude/CLAUDE.md
Expand Up @@ -16,6 +16,62 @@ Examples of hacks to avoid:
## Overview
fcvm is a Firecracker VM manager for running Podman containers in lightweight microVMs. This document tracks implementation findings and decisions.

## Nested Virtualization (Inception)

fcvm supports running inside another fcvm VM ("inception") using ARM64 FEAT_NV2.

### Requirements

- **Hardware**: ARM64 with FEAT_NV2 (Graviton3+, c7g.metal)
- **Host kernel**: 6.18+ with `kvm-arm.mode=nested`
- **Inception kernel**: Custom kernel with CONFIG_KVM=y (built by `kernel/build.sh`)

### How It Works

1. Set `FCVM_NV2=1` environment variable (auto-set when `--kernel` flag is used)
2. fcvm passes `--enable-nv2` to Firecracker, which enables `HAS_EL2` + `HAS_EL2_E2H0` vCPU features
3. vCPU boots at EL2h so the guest kernel sees HYP mode as available
4. EL2 registers are initialized: HCR_EL2, CNTHCTL_EL2, VMPIDR_EL2, VPIDR_EL2
5. Guest kernel initializes KVM: "Hyp nVHE mode initialized successfully"
6. Nested fcvm can now create VMs using the guest's KVM

### Running Inception

```bash
# Build inception kernel (first time only, ~10-20 min)
./kernel/build.sh

# Run outer VM with inception kernel (--kernel auto-sets FCVM_NV2=1)
sudo fcvm podman run \
--name outer \
--network bridged \
--kernel /mnt/fcvm-btrfs/kernels/vmlinux-6.12.10-*.bin \
--privileged \
--map /mnt/fcvm-btrfs:/mnt/fcvm-btrfs \
nginx:alpine

# Inside outer VM, run inner fcvm
fcvm podman run --name inner --network bridged alpine:latest
```

### Key Firecracker Changes

Firecracker fork with NV2 support: `ejc3/firecracker:nv2-inception`

- `HAS_EL2` (bit 7): Enables virtual EL2 for guest
- `HAS_EL2_E2H0` (bit 8): Forces nVHE mode (avoids timer trap storm)
- Boot at EL2h: Guest kernel must see CurrentEL=EL2 on boot
- VMPIDR_EL2/VPIDR_EL2: Proper processor IDs for nested guests
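
As a rough illustration of the first three bullets, the sketch below shows how the two feature bits and the EL2h boot state could be computed. The bit positions come from the list above; the helper names (`set_feature`, `vcpu_features`, `boot_pstate`) and the 7-word feature array are assumptions for illustration and are not taken from the Firecracker fork, which also sets other vCPU features that are omitted here.

```rust
/// Bit positions as given in the list above.
const KVM_ARM_VCPU_HAS_EL2: u32 = 7;
const KVM_ARM_VCPU_HAS_EL2_E2H0: u32 = 8;

/// PSTATE mode field values and the combined D/A/I/F mask bits.
const PSR_MODE_EL1H: u64 = 0x5;
const PSR_MODE_EL2H: u64 = 0x9;
const PSR_DAIF_MASKED: u64 = 0x3c0;

/// Set one feature bit in an array of 32-bit feature words.
fn set_feature(features: &mut [u32; 7], bit: u32) {
    features[(bit / 32) as usize] |= 1 << (bit % 32);
}

/// Hypothetical helper: feature words for a plain or NV2-enabled vCPU.
/// Other features a real VMM would set are left out of this sketch.
fn vcpu_features(enable_nv2: bool) -> [u32; 7] {
    let mut features = [0u32; 7];
    if enable_nv2 {
        set_feature(&mut features, KVM_ARM_VCPU_HAS_EL2);
        set_feature(&mut features, KVM_ARM_VCPU_HAS_EL2_E2H0);
    }
    features
}

/// Hypothetical helper: initial PSTATE with interrupts masked,
/// EL2h when NV2 is enabled and EL1h otherwise.
fn boot_pstate(enable_nv2: bool) -> u64 {
    let mode = if enable_nv2 { PSR_MODE_EL2H } else { PSR_MODE_EL1H };
    PSR_DAIF_MASKED | mode
}
```

With NV2 enabled, the first feature word has bits 7 and 8 set and the mode field selects EL2h, which is what lets the guest kernel observe CurrentEL=EL2 at boot.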

### Tests

```bash
make test-root FILTER=inception
```

- `test_kvm_available_in_vm`: Verifies /dev/kvm works in guest
- `test_inception_run_fcvm_inside_vm`: Full inception test

## Quick Reference

### Shell Scripts to /tmp
90 changes: 90 additions & 0 deletions README.md
Expand Up @@ -283,6 +283,96 @@ sudo fcvm podman run --name full \

---

## Nested Virtualization (Inception)

fcvm supports running inside another fcvm VM using ARM64 FEAT_NV2 nested virtualization. This enables "inception": VMs running inside VMs.

### Requirements

| Requirement | Details |
|-------------|---------|
| **Hardware** | ARM64 with FEAT_NV2 (Graviton3+: c7g.metal, c7gn.metal, r7g.metal) |
| **Host kernel** | 6.18+ with `kvm-arm.mode=nested` boot parameter |
| **Inception kernel** | Custom kernel with CONFIG_KVM=y (built by `kernel/build.sh`) |
| **Firecracker** | Fork with NV2 support: `ejc3/firecracker:nv2-inception` |
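
The host kernel requirement can be checked before attempting inception. The following is a minimal sketch, assuming the release string has the usual `major.minor...` shape reported by `uname -r`; the function names are hypothetical and are not part of fcvm.

```rust
/// Parse the leading "major.minor" of a kernel release string such as
/// "6.18.0-arm64". Returns None if either number is missing.
fn parse_major_minor(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor_field = parts.next()?;
    // Keep only the leading digits so suffixes like "18-rc1" still parse.
    let digits: String = minor_field
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// True if the release string is at least 6.18, the minimum in the table above.
fn host_kernel_new_enough(release: &str) -> bool {
    matches!(parse_major_minor(release), Some(v) if v >= (6, 18))
}
```

This only covers the version number; the `kvm-arm.mode=nested` boot parameter and the hardware requirement still have to be verified separately.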

### Building the Inception Kernel

```bash
# Build kernel with KVM support (~10-20 minutes first time)
./kernel/build.sh

# Kernel will be at /mnt/fcvm-btrfs/kernels/vmlinux-6.12.10-*.bin
```

The inception kernel adds these configs on top of the standard Firecracker kernel:
- `CONFIG_KVM=y` - KVM hypervisor support
- `CONFIG_VIRTUALIZATION=y` - Virtualization support
- `CONFIG_TUN=y`, `CONFIG_VETH=y` - Network devices for nested VMs
- `CONFIG_NETFILTER*` - iptables/nftables for bridged networking
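
To confirm that a built kernel carries these options, one could scan its `.config` text. The sketch below is an assumption-laden illustration: the function name is hypothetical, it only checks the four options listed with an exact `=y` value, and it skips the `CONFIG_NETFILTER*` wildcard because the exact option names are not given here.

```rust
/// Options the list above gives with an exact `=y` setting.
const REQUIRED: [&str; 4] = [
    "CONFIG_KVM",
    "CONFIG_VIRTUALIZATION",
    "CONFIG_TUN",
    "CONFIG_VETH",
];

/// Return the required options that are not set to `y` in a kernel `.config` text.
fn missing_options(config_text: &str) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|opt| {
            let wanted = format!("{opt}=y");
            // An option counts as present only if some line is exactly "<OPT>=y".
            !config_text.lines().any(|line| line.trim() == wanted)
        })
        .collect()
}
```

An empty result means all four built-in options are present; options built as modules (`=m`) or left unset are reported as missing.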

### Running Inception

**Step 1: Start outer VM with inception kernel**
```bash
# FCVM_NV2=1 is auto-set when --kernel flag is used
sudo fcvm podman run \
--name outer-vm \
--network bridged \
--kernel /mnt/fcvm-btrfs/kernels/vmlinux-6.12.10-*.bin \
--privileged \
--map /mnt/fcvm-btrfs:/mnt/fcvm-btrfs \
--map /path/to/fcvm/binary:/opt/fcvm \
nginx:alpine
```

**Step 2: Verify nested KVM works**
```bash
# Check guest sees HYP mode
fcvm exec --pid <outer_pid> --vm -- dmesg | grep -i kvm
# Should show: "kvm [1]: Hyp nVHE mode initialized successfully"

# Verify /dev/kvm is accessible
fcvm exec --pid <outer_pid> --vm -- ls -la /dev/kvm
```
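
The same check can be automated over captured `dmesg` output. This is a minimal sketch, assuming the log text has already been collected (for example via the `fcvm exec` command above); the function name is hypothetical, and the exact message wording may vary between kernel versions.

```rust
/// Message quoted in Step 2; the exact wording may depend on the kernel version.
const KVM_READY: &str = "Hyp nVHE mode initialized successfully";

/// True if any line of captured `dmesg` output reports that KVM
/// initialized in nVHE mode.
fn guest_kvm_initialized(dmesg: &str) -> bool {
    dmesg
        .lines()
        .any(|line| line.contains("kvm") && line.contains(KVM_READY))
}
```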

**Step 3: Run inner VM**
```bash
# Inside outer VM (via exec or SSH)
cd /mnt/fcvm-btrfs
/opt/fcvm/fcvm podman run --name inner-vm --network bridged alpine:latest echo "Hello from inception!"
```

### How It Works

1. **FCVM_NV2=1** environment variable (auto-set when `--kernel` is used) triggers fcvm to pass `--enable-nv2` to Firecracker
2. **HAS_EL2 + HAS_EL2_E2H0** vCPU features are enabled
   - HAS_EL2 (bit 7): Enables virtual EL2 for guest
   - HAS_EL2_E2H0 (bit 8): Forces nVHE mode (avoids timer trap storm)
3. **vCPU boots at EL2h** so guest kernel's `is_hyp_mode_available()` returns true
4. **EL2 registers initialized**: HCR_EL2, CNTHCTL_EL2, VMPIDR_EL2, VPIDR_EL2
5. Guest kernel boots at EL2 ("CPU: All CPU(s) started at EL2") and initializes KVM ("Hyp nVHE mode initialized successfully")
6. Nested fcvm creates VMs using the guest's KVM
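
Step 1 can be modeled as a small decision function. The sketch below is an assumption about how that logic might look, not fcvm's actual code: the function names and the use of `Option<&str>` for the environment value and the `--kernel` path are hypothetical; only `FCVM_NV2=1`, `--kernel`, and `--enable-nv2` come from the steps above.

```rust
/// Hypothetical model of step 1: NV2 is requested if FCVM_NV2=1 is set,
/// or implicitly when a custom kernel is passed with --kernel.
fn nv2_requested(fcvm_nv2_env: Option<&str>, kernel_flag: Option<&str>) -> bool {
    fcvm_nv2_env == Some("1") || kernel_flag.is_some()
}

/// Extra Firecracker arguments implied by that decision.
fn firecracker_nv2_args(
    fcvm_nv2_env: Option<&str>,
    kernel_flag: Option<&str>,
) -> Vec<&'static str> {
    if nv2_requested(fcvm_nv2_env, kernel_flag) {
        vec!["--enable-nv2"]
    } else {
        Vec::new()
    }
}
```

Keeping the decision in a pure function makes the "auto-set when `--kernel` is used" behavior easy to unit test without starting a VM.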

### Testing Inception

```bash
# Run inception tests
make test-root FILTER=inception

# Tests:
# - test_kvm_available_in_vm: Verifies /dev/kvm works in guest
# - test_inception_run_fcvm_inside_vm: Full inception (fcvm inside fcvm)
```

### Limitations

- ARM64 only (x86_64 nested virtualization uses a different mechanism)
- Requires bare-metal instance (c7g.metal) or host with nested virt enabled
- Maximum 2 levels tested (host → outer VM → inner VM)

---

## Project Structure

```
4 changes: 1 addition & 3 deletions fc-agent/src/main.rs
Expand Up @@ -1106,9 +1106,7 @@ fn create_kvm_device() {
let err = std::io::Error::last_os_error();
// ENOENT means the kernel doesn't have KVM support
// This is expected with standard Firecracker kernel
- if err.kind() == std::io::ErrorKind::NotFound
- || err.raw_os_error() == Some(libc::ENOENT)
- {
+ if err.kind() == std::io::ErrorKind::NotFound || err.raw_os_error() == Some(libc::ENOENT) {
eprintln!("[fc-agent] /dev/kvm not available (kernel needs CONFIG_KVM)");
} else {
eprintln!("[fc-agent] WARNING: failed to create /dev/kvm: {}", err);
30 changes: 20 additions & 10 deletions fuse-pipe/src/server/passthrough.rs
Expand Up @@ -1597,10 +1597,10 @@ mod tests {

// Call remap_file_range (FICLONE equivalent - whole file)
let resp = fs.remap_file_range(
- src_ino, src_fh, 0, // source: ino, fh, offset
- dst_ino, dst_fh, 0, // dest: ino, fh, offset
- 0, // len = 0 means whole file clone
- 0, // no special flags
+ src_ino, src_fh, 0, // source: ino, fh, offset
+ dst_ino, dst_fh, 0, // dest: ino, fh, offset
+ 0, // len = 0 means whole file clone
+ 0, // no special flags
);

match resp {
Expand Down Expand Up @@ -1628,7 +1628,10 @@ mod tests {
// EOPNOTSUPP or EINVAL is expected on filesystems without reflink support
// tmpfs returns EINVAL, ext4/xfs without reflinks return EOPNOTSUPP
if errno == libc::EOPNOTSUPP || errno == libc::EINVAL {
- eprintln!("FICLONE not supported on this filesystem (errno={}) - OK", errno);
+ eprintln!(
+ "FICLONE not supported on this filesystem (errno={}) - OK",
+ errno
+ );
// Check filesystem type
eprintln!("tempdir path: {:?}", dir.path());
// Try direct FICLONE to confirm
Expand All @@ -1649,7 +1652,10 @@ mod tests {
};
if result < 0 {
let err = std::io::Error::last_os_error();
- eprintln!("Direct FICLONE also failed: {} - filesystem doesn't support reflinks", err);
+ eprintln!(
+ "Direct FICLONE also failed: {} - filesystem doesn't support reflinks",
+ err
+ );
}
} else {
panic!(
Expand Down Expand Up @@ -1706,10 +1712,14 @@ mod tests {

// Clone second block from source to first block of destination
let resp = fs.remap_file_range(
- src_ino, src_fh, block_size as u64, // source offset: second block
- dst_ino, dst_fh, 0, // dest offset: first block
- block_size as u64, // length: one block
- 0, // no special flags
+ src_ino,
+ src_fh,
+ block_size as u64, // source offset: second block
+ dst_ino,
+ dst_fh,
+ 0, // dest offset: first block
+ block_size as u64, // length: one block
+ 0, // no special flags
);

match resp {
24 changes: 19 additions & 5 deletions fuse-pipe/tests/integration_root.rs
Expand Up @@ -199,6 +199,9 @@ fn test_nonroot_mkdir_with_readers(num_readers: usize) {
/// Test copy_file_range through FUSE.
/// This tests the server-side implementation of copy_file_range which enables
/// instant reflinks on btrfs filesystems.
+ ///
+ /// Note: copy_file_range through FUSE requires kernel support (FUSE protocol 7.28+,
+ /// Linux 4.20+). If the kernel doesn't support it, this test is skipped.
#[test]
fn test_copy_file_range() {
use std::os::unix::io::AsRawFd;
Expand Down Expand Up @@ -232,11 +235,22 @@ fn test_copy_file_range() {
libc::copy_file_range(fd_in, &mut off_in, fd_out, &mut off_out, test_data.len(), 0)
};

- assert!(
- result >= 0,
- "copy_file_range failed: {}",
- std::io::Error::last_os_error()
- );
+ // Check if kernel supports copy_file_range through FUSE
+ if result < 0 {
+ let err = std::io::Error::last_os_error();
+ let errno = err.raw_os_error().unwrap_or(0);
+ // EINVAL (22) or ENOSYS (38) means kernel doesn't support copy_file_range on FUSE
+ // EXDEV (18) can also occur if cross-device copy isn't supported
+ if errno == libc::EINVAL || errno == libc::ENOSYS || errno == libc::EXDEV {
+ eprintln!(
+ "SKIP: copy_file_range not supported through FUSE on this kernel ({})",
+ err
+ );
+ return;
+ }
+ panic!("copy_file_range failed unexpectedly: {}", err);
+ }

assert_eq!(result as usize, test_data.len(), "should copy all bytes");

// Sync and verify
47 changes: 36 additions & 11 deletions fuse-pipe/tests/test_remap_file_range.rs
Expand Up @@ -99,9 +99,9 @@ fn check_kernel_remap_support(mount_path: &std::path::Path) -> Option<bool> {
} else {
let errno = std::io::Error::last_os_error().raw_os_error().unwrap_or(0);
match errno {
- libc::ENOSYS => None, // Kernel doesn't support
+ libc::ENOSYS => None, // Kernel doesn't support
  libc::EOPNOTSUPP | libc::EINVAL => Some(false), // Kernel supports, fs doesn't
- _ => Some(false), // Other error, assume kernel supports
+ _ => Some(false), // Other error, assume kernel supports
}
}
}
Expand Down Expand Up @@ -154,7 +154,9 @@ fn run_ficlone_test_with_paths(data_dir: &std::path::Path, mount_dir: &std::path
// Check kernel support first
match check_kernel_remap_support(mount) {
None => {
- eprintln!("SKIP: test_ficlone_whole_file requires kernel FUSE_REMAP_FILE_RANGE support");
+ eprintln!(
+ "SKIP: test_ficlone_whole_file requires kernel FUSE_REMAP_FILE_RANGE support"
+ );
eprintln!(" Got ENOSYS - kernel patch not applied");
return;
}
Expand Down Expand Up @@ -186,15 +188,23 @@ fn run_ficlone_test_with_paths(data_dir: &std::path::Path, mount_dir: &std::path

if ret != 0 {
let err = std::io::Error::last_os_error();
- panic!("FICLONE failed: {} (errno {})", err, err.raw_os_error().unwrap_or(0));
+ panic!(
+ "FICLONE failed: {} (errno {})",
+ err,
+ err.raw_os_error().unwrap_or(0)
+ );
}

drop(src_file);
drop(dst_file);

// Verify content is identical
let dst_content = fs::read(&dst_path).expect("read dest");
- assert_eq!(dst_content.len(), test_data.len(), "cloned file size mismatch");
+ assert_eq!(
+ dst_content.len(),
+ test_data.len(),
+ "cloned file size mismatch"
+ );
assert_eq!(dst_content, test_data, "cloned file content mismatch");

// Verify on underlying filesystem that extents are shared
Expand Down Expand Up @@ -243,7 +253,9 @@ fn run_ficlonerange_test_with_paths(data_dir: &std::path::Path, mount_dir: &std:
// Check kernel support first
match check_kernel_remap_support(mount) {
None => {
- eprintln!("SKIP: test_ficlonerange_partial requires kernel FUSE_REMAP_FILE_RANGE support");
+ eprintln!(
+ "SKIP: test_ficlonerange_partial requires kernel FUSE_REMAP_FILE_RANGE support"
+ );
return;
}
Some(false) => {
Expand All @@ -257,7 +269,9 @@ fn run_ficlonerange_test_with_paths(data_dir: &std::path::Path, mount_dir: &std:
// btrfs block size is typically 4096
let block_size = 4096usize;
let num_blocks = 4;
- let test_data: Vec<u8> = (0..block_size * num_blocks).map(|i| (i % 256) as u8).collect();
+ let test_data: Vec<u8> = (0..block_size * num_blocks)
+ .map(|i| (i % 256) as u8)
+ .collect();
let src_path = mount.join("clonerange_source.bin");
let dst_path = mount.join("clonerange_dest.bin");

Expand All @@ -276,7 +290,7 @@ fn run_ficlonerange_test_with_paths(data_dir: &std::path::Path, mount_dir: &std:
// Clone middle 2 blocks from source to dest
let clone_range = FileCloneRange {
src_fd: src_file.as_raw_fd() as i64,
- src_offset: block_size as u64, // Start at block 1
+ src_offset: block_size as u64, // Start at block 1
src_length: (block_size * 2) as u64, // Clone 2 blocks
dest_offset: block_size as u64, // Write to same offset in dest
};
Expand All @@ -291,7 +305,11 @@ fn run_ficlonerange_test_with_paths(data_dir: &std::path::Path, mount_dir: &std:

if ret != 0 {
let err = std::io::Error::last_os_error();
- panic!("FICLONERANGE failed: {} (errno {})", err, err.raw_os_error().unwrap_or(0));
+ panic!(
+ "FICLONERANGE failed: {} (errno {})",
+ err,
+ err.raw_os_error().unwrap_or(0)
+ );
}

drop(src_file);
Expand Down Expand Up @@ -379,7 +397,11 @@ fn run_cp_reflink_test_with_paths(data_dir: &std::path::Path, mount_dir: &std::p

// Run cp --reflink=always
let output = std::process::Command::new("cp")
- .args(["--reflink=always", src_path.to_str().unwrap(), dst_path.to_str().unwrap()])
+ .args([
+ "--reflink=always",
+ src_path.to_str().unwrap(),
+ dst_path.to_str().unwrap(),
+ ])
.output()
.expect("run cp");

Expand Down Expand Up @@ -421,7 +443,10 @@ fn verify_shared_extents(src: &std::path::Path, dst: &std::path::Path) {
}
}
Err(e) => {
- eprintln!("Note: filefrag not available ({}), skipping extent verification", e);
+ eprintln!(
+ "Note: filefrag not available ({}), skipping extent verification",
+ e
+ );
}
}
}