22 changes: 22 additions & 0 deletions DESIGN.md
@@ -1677,6 +1677,28 @@ The fuse-pipe library passes the pjdfstest POSIX compliance suite. Tests run via

---

## Known Limitations

### FUSE Volume Cache Coherency

`--map` volumes use FUSE-over-vsock with `WRITEBACK_CACHE` and `AUTO_INVAL_DATA`. When a host process modifies a file in a mapped directory, the guest sees the change on its next read, but only once the guest kernel notices that the file's mtime has changed. That check has roughly 1-second granularity, so a host write that lands within the same second as an earlier one may not be visible immediately.

Directory changes (new files, deletions) are subject to the kernel's directory entry cache TTL. A new file created on the host may not appear in guest `readdir()` until the cache expires.

There are no push notifications from host to guest. The guest discovers changes only on access. inotify/fanotify in the guest watches the FUSE mount, not the host filesystem, so host-side changes don't trigger guest notifications.
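
Because the guest only learns about host-side changes when it touches the file, a guest process that needs to react to them has to poll. The following is a minimal sketch of such a poller using only the standard library; the function name `wait_for_mtime_change` and its parameters are illustrative and not part of fcvm. It is still bounded by the attribute caching described above, so it cannot observe a change any sooner than the kernel does.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant, SystemTime};

/// Polls `path` until its mtime differs from `last_seen` or `timeout` elapses.
/// Returns `Ok(Some(new_mtime))` on change and `Ok(None)` on timeout.
/// Hypothetical helper for illustration; not an fcvm API.
pub fn wait_for_mtime_change(
    path: &Path,
    last_seen: SystemTime,
    interval: Duration,
    timeout: Duration,
) -> io::Result<Option<SystemTime>> {
    let deadline = Instant::now() + timeout;
    loop {
        // Each metadata() call is an access, which is the only point at
        // which the guest can pick up a host-side modification.
        let mtime = fs::metadata(path)?.modified()?;
        if mtime != last_seen {
            return Ok(Some(mtime));
        }
        if Instant::now() >= deadline {
            return Ok(None);
        }
        thread::sleep(interval);
    }
}
```

A caller would record the mtime it last processed and pass it back in as `last_seen` on the next wait.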

**Potential fix**: Use `FUSE_NOTIFY_INVAL_INODE` and `FUSE_NOTIFY_INVAL_ENTRY`, the FUSE protocol's server-initiated invalidation notifications. The host VolumeServer would watch mapped directories with inotify and push invalidations through the FUSE connection when files change. Production network filesystems rely on comparable server-initiated mechanisms (for example, NFSv4 delegation recalls and SMB/CIFS oplock or lease breaks).
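
One way to picture the proposed fix is as a pure mapping from host filesystem events to the invalidations the server would push. The sketch below assumes hypothetical `HostEvent` and `Invalidation` types (neither exists in fuse-pipe today) and leaves out the inotify watcher and the vsock transport entirely. The offset and length conventions follow libfuse's documented semantics for inode invalidation: a length of 0 means "everything from the offset", and a negative offset means "attributes only".

```rust
/// Hypothetical host-side change event, e.g. derived from inotify.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HostEvent {
    Modified { ino: u64 },
    Created { parent: u64, name: String },
    Removed { parent: u64, name: String },
}

/// Hypothetical invalidation message the server would push to the guest.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Invalidation {
    /// Would be sent as FUSE_NOTIFY_INVAL_INODE.
    Inode { ino: u64, offset: i64, len: i64 },
    /// Would be sent as FUSE_NOTIFY_INVAL_ENTRY.
    Entry { parent: u64, name: String },
}

/// Maps one host event to the invalidations a server might emit for it.
pub fn invalidations_for(event: &HostEvent) -> Vec<Invalidation> {
    match event {
        // File content changed: drop cached attributes and all cached data.
        HostEvent::Modified { ino } => vec![Invalidation::Inode {
            ino: *ino,
            offset: 0,
            len: 0,
        }],
        // Directory membership changed: drop the cached name lookup, then
        // refresh the parent's attributes only (negative offset).
        HostEvent::Created { parent, name } | HostEvent::Removed { parent, name } => vec![
            Invalidation::Entry {
                parent: *parent,
                name: name.clone(),
            },
            Invalidation::Inode {
                ino: *parent,
                offset: -1,
                len: 0,
            },
        ],
    }
}
```

Keeping the mapping free of I/O makes it easy to unit-test separately from the watcher; whether the parent-attribute invalidation is actually needed is a design question this sketch does not settle.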

### Nested VM Performance (NV2)

ARM64 FEAT_NV2 has architectural issues with cache coherency under double Stage 2 translation. The DSB SY kernel patch fixes this for vsock/FUSE data paths, but multi-vCPU L2 VMs still hit interrupt delivery issues (seen as NETDEV WATCHDOG timeouts). L2 VMs are therefore limited to a single vCPU.

### Snapshot + FUSE Volumes

Snapshots are disabled when `--map` volumes are present because the FUSE-over-vsock connection state may not survive the pause/resume cycle cleanly. This means VMs with volume mounts always do a fresh boot. Block device mounts (`--disk`, `--disk-dir`) do not have this limitation.

---

## Future Enhancements

### Phase 2 (Post-MVP)
2 changes: 2 additions & 0 deletions README.md
@@ -780,6 +780,8 @@ See [DESIGN.md](DESIGN.md#cli-interface) for architecture and design decisions.
-t, --tty Allocate pseudo-TTY (for vim, colors, etc.)
--setup Auto-setup if kernel/rootfs missing (rootless only)
--no-snapshot Disable automatic snapshot creation (for testing)
--forward-localhost <PORTS> Forward localhost ports to host (e.g., 1421,9099)
--rootfs-size <SIZE> Minimum free space on rootfs (default: 10G)
```

**`fcvm exec`** - Execute in VM/container:
17 changes: 0 additions & 17 deletions fc-agent/src/fuse/mod.rs
@@ -121,20 +121,3 @@ pub fn mount_vsock(port: u32, mount_point: &str) -> anyhow::Result<()> {
);
fuse_pipe::mount_vsock_with_options(HOST_CID, port, mount_point, num_readers, trace_rate)
}

/// Mount a FUSE filesystem with multiple reader threads.
///
/// Same as `mount_vsock` but creates multiple FUSE reader threads for
/// better parallel performance.
#[allow(dead_code)]
pub fn mount_vsock_with_readers(
port: u32,
mount_point: &str,
num_readers: usize,
) -> anyhow::Result<()> {
eprintln!(
"[fc-agent] mounting FUSE volume at {} via vsock port {} ({} readers)",
mount_point, port, num_readers
);
fuse_pipe::mount_vsock_with_readers(HOST_CID, port, mount_point, num_readers)
}
149 changes: 137 additions & 12 deletions fc-agent/src/main.rs
@@ -33,6 +33,12 @@ struct Plan {
/// Path to OCI archive for localhost/ images (run directly without import)
#[serde(default)]
image_archive: Option<String>,
/// Run container as USER:GROUP (e.g., "1000:1000")
#[serde(default)]
user: Option<String>,
/// Localhost ports to forward to host gateway via iptables DNAT
#[serde(default)]
forward_localhost: Vec<String>,
/// Run container in privileged mode (allows mknod, device access, etc.)
#[serde(default)]
privileged: bool,
@@ -1693,23 +1699,29 @@ fn mount_fuse_volumes(volumes: &[VolumeMount]) -> Result<Vec<String>> {
mounted_paths.push(vol.guest_path.clone());
}

-     // Give FUSE mounts time to initialize
-     if !volumes.is_empty() {
-         eprintln!("[fc-agent] waiting for FUSE mounts to initialize...");
-         std::thread::sleep(std::time::Duration::from_millis(500));
- 
-         // Verify each mount point is accessible
-         for vol in volumes {
-             let path = std::path::Path::new(&vol.guest_path);
+     // Wait for each FUSE mount to become accessible (up to 30s per mount)
+     for vol in volumes {
+         let path = std::path::Path::new(&vol.guest_path);
+         let mut ready = false;
+         for attempt in 1..=60 {
              if let Ok(entries) = std::fs::read_dir(path) {
                  let count = entries.count();
                  eprintln!(
-                     "[fc-agent] ✓ mount {} accessible ({} entries)",
-                     vol.guest_path, count
+                     "[fc-agent] ✓ mount {} ready ({} entries, {}ms)",
+                     vol.guest_path,
+                     count,
+                     (attempt - 1) * 500
                  );
-             } else {
-                 eprintln!("[fc-agent] ✗ mount {} NOT accessible", vol.guest_path);
+                 ready = true;
+                 break;
              }
+             std::thread::sleep(std::time::Duration::from_millis(500));
          }
+         if !ready {
+             return Err(anyhow::anyhow!(
+                 "mount {} not accessible after 30s",
+                 vol.guest_path
+             ));
+         }
      }
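
The new readiness loop in this hunk is an instance of a bounded retry: probe, and on failure sleep and try again, up to a fixed number of attempts. As a sketch only, the same pattern can be factored into a small generic helper; `wait_until` is a hypothetical name and is not part of this change.

```rust
use std::thread;
use std::time::Duration;

/// Calls `probe` up to `max_attempts` times, sleeping `interval` after each
/// failed attempt. On success returns the probe's value together with the
/// number of failed attempts that preceded it.
/// Hypothetical helper for illustration; not part of fc-agent.
pub fn wait_until<T>(
    max_attempts: u32,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<(T, u32)> {
    for failed_so_far in 0..max_attempts {
        if let Some(value) = probe() {
            return Some((value, failed_so_far));
        }
        thread::sleep(interval);
    }
    None
}
```

With 60 attempts and a 500 ms interval this gives the same 30-second budget per mount as the loop above, and the probe for a mount would be a closure around `std::fs::read_dir(path).ok().map(|entries| entries.count())`.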

@@ -2240,6 +2252,38 @@ async fn run_agent() -> Result<()> {
// Save proxy settings for exec commands to use
save_proxy_settings(&plan);

// Forward specific localhost ports to host gateway via iptables DNAT.
// Only the listed ports are redirected — other localhost traffic stays local.
if !plan.forward_localhost.is_empty() {
let _ = std::process::Command::new("sysctl")
.args(["-w", "net.ipv4.conf.all.route_localnet=1"])
.output();
for port in &plan.forward_localhost {
let _ = std::process::Command::new("iptables")
.args([
"-t",
"nat",
"-A",
"OUTPUT",
"-d",
"127.0.0.0/8",
"-p",
"tcp",
"--dport",
port,
"-j",
"DNAT",
"--to-destination",
"10.0.2.2",
])
.output();
}
eprintln!(
"[fc-agent] ✓ forwarding localhost ports to host: {:?}",
plan.forward_localhost
);
}
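
The block above discards the result of every `sysctl` and `iptables` invocation, so a failed rule would still be followed by the success log line. The following sketch shows one way to build the same DNAT argument list from a typed port and surface failures; `dnat_rule_args` and `add_dnat_rule` are hypothetical names, and the gateway address is passed in rather than hard-coded.

```rust
use std::process::Command;

/// Builds the iptables arguments for redirecting one localhost TCP port to
/// `gateway`, matching the rule shape used in the hunk above.
/// Hypothetical helper for illustration.
pub fn dnat_rule_args(port: u16, gateway: &str) -> Vec<String> {
    let port = port.to_string();
    [
        "-t", "nat", "-A", "OUTPUT",
        "-d", "127.0.0.0/8",
        "-p", "tcp",
        "--dport", port.as_str(),
        "-j", "DNAT",
        "--to-destination", gateway,
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}

/// Runs iptables with those arguments and reports a non-zero exit as an error.
/// Requires root and an iptables binary on PATH.
pub fn add_dnat_rule(port: u16, gateway: &str) -> Result<(), String> {
    let output = Command::new("iptables")
        .args(dnat_rule_args(port, gateway))
        .output()
        .map_err(|e| format!("failed to spawn iptables: {e}"))?;
    if output.status.success() {
        Ok(())
    } else {
        Err(format!(
            "iptables exited with {}: {}",
            output.status,
            String::from_utf8_lossy(&output.stderr).trim()
        ))
    }
}
```

Separating argument construction from execution keeps the rule shape testable without root privileges.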

// Sync VM clock from host before launching container
// This ensures TLS certificate validation works immediately
if let Err(e) = sync_clock_from_host().await {
@@ -2357,6 +2401,13 @@ async fn run_agent() -> Result<()> {
let image_ref = if let Some(archive_path) = &plan.image_archive {
eprintln!("[fc-agent] using Docker archive: {}", archive_path);

// Make block device readable by non-root (needed with --userns=keep-id)
if archive_path.starts_with("/dev/") {
let _ = std::process::Command::new("chmod")
.args(["444", archive_path])
.output();
}

format!("docker-archive:{}", archive_path)
} else {
// Pull image with retries to handle transient DNS/network errors
@@ -2544,6 +2595,80 @@ async fn run_agent() -> Result<()> {
"nofile=65536:65536".to_string(),
];

// User mapping: run podman as the specified user with --userns=keep-id
// This replicates host behavior where rootless podman maps the user as root
// inside the container while keeping the real UID on shared mounts.
if let Some(ref user_spec) = plan.user {
// Parse "uid:gid" format
let parts: Vec<&str> = user_spec.split(':').collect();
let uid = parts[0];
let gid = parts.get(1).unwrap_or(&"100");
let username = String::from("fcvm-user");

eprintln!(
"[fc-agent] setting up user mapping: uid={} gid={}",
uid, gid
);

// Create group and user in the VM
let _ = std::process::Command::new("groupadd")
.args(["-g", gid, &username])
.output();
let _ = std::process::Command::new("useradd")
.args(["-u", uid, "-g", gid, "-m", "-s", "/bin/sh", &username])
.output();

// Set up subuid/subgid for rootless podman
let subuid_entry = format!("{}:100000:65536\n", username);
let _ = std::fs::write("/etc/subuid", &subuid_entry);
let _ = std::fs::write("/etc/subgid", &subuid_entry);

// Ensure XDG_RUNTIME_DIR exists for rootless podman
let runtime_dir = format!("/run/user/{}", uid);
let _ = std::fs::create_dir_all(&runtime_dir);
let _ = std::process::Command::new("chown")
.args([&format!("{}:{}", uid, gid), &runtime_dir])
.output();

// Delegate cgroup subtree to the user for rootless podman
let cgroup_dir = format!("/sys/fs/cgroup/user.slice/user-{}.slice", uid);
let _ = std::fs::create_dir_all(&cgroup_dir);
let _ = std::process::Command::new("chown")
.args(["-R", &format!("{}:{}", uid, gid), &cgroup_dir])
.output();
// Enable controllers in the user's cgroup
for path in &[
"/sys/fs/cgroup/cgroup.subtree_control",
&format!("{}/cgroup.subtree_control", cgroup_dir),
] {
let _ = std::fs::write(path, "+cpu +memory +pids");
}

// Delegate fc-agent's own cgroup to the user so rootless podman can create sub-cgroups
if let Ok(cgroup_path) = std::fs::read_to_string("/proc/self/cgroup") {
// Format: "0::/system.slice/fc-agent.service"
if let Some(path) = cgroup_path.trim().strip_prefix("0::") {
let full_path = format!("/sys/fs/cgroup{}", path);
let _ = std::process::Command::new("chown")
.args(["-R", &format!("{}:{}", uid, gid), &full_path])
.output();
eprintln!("[fc-agent] delegated cgroup {} to user {}", full_path, uid);
}
}

// Remove --cgroups=split (rootless podman uses cgroupfs, not split)
podman_args.retain(|a| a != "--cgroups=split");

// Add --userns=keep-id to podman args (replicates host behavior)
podman_args.push("--userns=keep-id".to_string());

// Wrap entire command with runuser to run podman as the target user
podman_args.insert(0, "--".to_string());
podman_args.insert(0, username.clone());
podman_args.insert(0, "-u".to_string());
podman_args.insert(0, "runuser".to_string());
}
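
The user-mapping block above passes the `uid` and `gid` substrings straight to `groupadd`, `useradd`, and `chown` without checking that they are numeric. A minimal sketch of validating the spec up front is shown below; `parse_user_spec` is a hypothetical helper, and its fallback GID of 100 simply mirrors the `"100"` default in the hunk. Unlike the hunk, it rejects a spec with more than one colon.

```rust
/// Parses a "UID" or "UID:GID" spec into numeric ids.
/// Falls back to GID 100 when no group is given, mirroring the default used
/// in the hunk above. Hypothetical helper for illustration.
pub fn parse_user_spec(spec: &str) -> Result<(u32, u32), String> {
    const DEFAULT_GID: u32 = 100;

    // split_once leaves everything after the first ':' in the gid part, so
    // "1:2:3" fails numeric parsing instead of being silently truncated.
    let (uid_str, gid_str) = match spec.split_once(':') {
        Some((uid, gid)) => (uid, Some(gid)),
        None => (spec, None),
    };

    let uid = uid_str
        .parse::<u32>()
        .map_err(|e| format!("invalid uid {uid_str:?}: {e}"))?;
    let gid = match gid_str {
        Some(g) => g
            .parse::<u32>()
            .map_err(|e| format!("invalid gid {g:?}: {e}"))?,
        None => DEFAULT_GID,
    };
    Ok((uid, gid))
}
```

Parsing once into integers also lets later steps format paths such as `/run/user/{uid}` from a value that is known to be numeric.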

// Privileged mode: allows mknod, device access, etc. for POSIX compliance tests
if plan.privileged {
eprintln!("[fc-agent] privileged mode enabled");
26 changes: 3 additions & 23 deletions fuse-pipe/src/client/fuse.rs
@@ -605,7 +605,7 @@ impl Filesystem for FuseClient {
});

match response {
- VolumeResponse::Written { size } => reply.written(size),
+ VolumeResponse::Written { size } => reply.written(size as u32),
VolumeResponse::Error { errno } => reply.error(Errno::from_i32(errno)),
_ => reply.error(Errno::EIO),
}
@@ -990,26 +990,6 @@ impl Filesystem for FuseClient {
}

fn getxattr(&self, req: &Request, ino: INodeNo, name: &OsStr, size: u32, reply: ReplyXattr) {
// Fast path: The kernel calls getxattr("security.capability") on every write
// to check if file capabilities need to be cleared. This is extremely common
// and almost always returns ENODATA (no capabilities set). Short-circuit this
// to avoid the expensive server round-trip (~32µs savings per write).
//
// This is safe because:
// 1. If capabilities ARE set, they're preserved (we'd need setxattr to clear)
// 2. The kernel's capability check is advisory - it clears caps on successful write
// 3. Container workloads rarely use file capabilities
//
// Can be disabled via FCVM_NO_XATTR_FASTPATH=1 for debugging.
if std::env::var("FCVM_NO_XATTR_FASTPATH").is_err() {
if let Some(name_str) = name.to_str() {
if name_str == "security.capability" {
reply.error(Errno::ENODATA);
return;
}
}
}

let response = self.send_request_sync(VolumeRequest::Getxattr {
ino: ino.into(),
name: name.to_string_lossy().to_string(),
@@ -1198,7 +1178,7 @@ impl Filesystem for FuseClient {
});

match response {
- VolumeResponse::Written { size } => reply.written(size),
+ VolumeResponse::Written { size } => reply.written(size as u32),
VolumeResponse::Error { errno } => reply.error(Errno::from_i32(errno)),
_ => reply.error(Errno::EIO),
}
@@ -1241,7 +1221,7 @@ impl Filesystem for FuseClient {
);

match response {
- VolumeResponse::Written { size } => reply.written(size as u32),
+ VolumeResponse::Written { size } => reply.written(size as u32),
VolumeResponse::Error { errno } => reply.error(Errno::from_i32(errno)),
_ => reply.error(Errno::EIO),
}
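
With `Written { size }` widened to `u64`, the three `size as u32` casts in this file narrow the value again at the reply boundary. In Rust an `as` cast from `u64` to `u32` keeps only the low 32 bits, so a count of 4 GiB or more would wrap silently. A minimal sketch of a checked alternative follows; `written_for_reply` is a hypothetical helper, and mapping overflow to `EIO` is one possible policy rather than what the PR does.

```rust
/// Linux errno value for an I/O error.
const EIO: i32 = 5;

/// Narrows a 64-bit byte count to the 32-bit value a write reply carries,
/// returning an errno instead of wrapping when it does not fit.
/// Hypothetical helper for illustration; not part of fuse-pipe.
pub fn written_for_reply(size: u64) -> Result<u32, i32> {
    u32::try_from(size).map_err(|_| EIO)
}
```

A call site could then match on the result and send either `reply.written(n)` or an error reply. Clamping to `u32::MAX` is another option where the caller is expected to retry with the remaining range.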
2 changes: 1 addition & 1 deletion fuse-pipe/src/protocol/response.rs
@@ -31,7 +31,7 @@ pub enum VolumeResponse {
Data { data: Vec<u8> },

/// Number of bytes written.
- Written { size: u32 },
+ Written { size: u64 },

/// File opened response.
Opened { fh: u64, flags: u32 },
10 changes: 5 additions & 5 deletions fuse-pipe/src/server/passthrough.rs
@@ -656,7 +656,7 @@ impl FilesystemHandler for PassthroughFs {
) {
Ok(n) => {
tracing::debug!(target: "passthrough", fh, written = n, "write succeeded");
- VolumeResponse::Written { size: n as u32 }
+ VolumeResponse::Written { size: n as u64 }
}
Err(e) => {
tracing::debug!(target: "passthrough", fh, error = ?e, "write failed");
@@ -1149,7 +1149,7 @@ impl FilesystemHandler for PassthroughFs {
) {
Ok(n) => {
tracing::debug!(target: "passthrough", copied = n, "copy_file_range succeeded");
- VolumeResponse::Written { size: n as u32 }
+ VolumeResponse::Written { size: n as u64 }
}
Err(e) => {
tracing::debug!(target: "passthrough", error = ?e, "copy_file_range failed");
@@ -1190,7 +1190,7 @@ impl FilesystemHandler for PassthroughFs {
) {
Ok(n) => {
tracing::debug!(target: "passthrough", cloned = n, "remap_file_range succeeded");
- VolumeResponse::Written { size: n as u32 }
+ VolumeResponse::Written { size: n as u64 }
}
Err(e) => {
tracing::debug!(target: "passthrough", error = ?e, "remap_file_range failed");
@@ -1607,7 +1607,7 @@ mod tests {
// For whole-file clone (len=0), we return the file size on success
assert_eq!(
size,
- test_data.len() as u32,
+ test_data.len() as u64,
"FICLONE should return file size for whole file (len=0)"
);

@@ -1726,7 +1726,7 @@ mod tests {
match resp {
VolumeResponse::Written { size } => {
eprintln!("FICLONERANGE succeeded, size={}", size);
- assert_eq!(size, block_size as u32, "should clone requested size");
+ assert_eq!(size, block_size as u64, "should clone requested size");

// Verify: first block of dest should equal second block of source
let resp = fs.read(dst_ino, dst_fh, 0, block_size as u32, uid, gid, 0);
2 changes: 1 addition & 1 deletion rootfs-config.toml
@@ -57,7 +57,7 @@ path = "opt/kata/share/kata-containers/vmlinux-6.12.47-173"

[packages]
# Container runtime
- runtime = ["podman", "crun", "fuse-overlayfs", "skopeo"]
+ runtime = ["podman", "crun", "fuse-overlayfs", "skopeo", "uidmap"]

# FUSE support for overlay filesystem
fuse = ["fuse3"]
11 changes: 11 additions & 0 deletions src/cli/args.rs
@@ -169,6 +169,17 @@ pub struct RunArgs {
#[arg(long)]
pub health_check: Option<String>,

/// Run container as USER:GROUP (e.g., --user 1000:1000)
/// Equivalent to podman run --userns=keep-id on the host
#[arg(long)]
pub user: Option<String>,

/// Forward specific localhost ports to the host gateway via iptables DNAT.
/// Enables containers to reach host-only services via localhost.
/// Comma-separated port list, e.g., --forward-localhost 1421,9099
#[arg(long, value_delimiter = ',')]
pub forward_localhost: Vec<u16>,
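
The CLI field is a `Vec<u16>` while the agent's `Plan.forward_localhost` is a `Vec<String>`, so the ports are validated as numbers on the host and travel to the guest as strings. The sketch below approximates that path with the standard library only: `parse_port_list` stands in for the comma-delimited parsing that the `value_delimiter` attribute requests (it is not clap's implementation, and the whitespace trimming is an addition), and `ports_for_plan` is a hypothetical name for the conversion step.

```rust
/// Parses a comma-separated port list such as "1421,9099".
/// Rejects anything that is not a valid u16. Hypothetical helper.
pub fn parse_port_list(raw: &str) -> Result<Vec<u16>, String> {
    raw.split(',')
        .map(|p| {
            let p = p.trim();
            p.parse::<u16>()
                .map_err(|e| format!("invalid port {p:?}: {e}"))
        })
        .collect()
}

/// Converts validated ports to the string form carried in the plan.
pub fn ports_for_plan(ports: &[u16]) -> Vec<String> {
    ports.iter().map(|p| p.to_string()).collect()
}
```

Because the strings are produced from already-validated integers, the agent side can pass them to `--dport` without re-parsing.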

/// Run container in privileged mode (allows mknod, device access, etc.)
/// Use for POSIX compliance tests that need full filesystem capabilities
#[arg(long)]