Network changes:
- slirp0 now uses the 10.0.2.100/24 address for DNAT compatibility
- Add a DNAT rule to redirect hostfwd traffic (10.0.2.100) to the guest IP
- This enables port forwarding to work with the dual-TAP architecture

VM namespace handling:
- Add user_namespace_path and net_namespace_path to VmManager
- Implement pre_exec setns for entering the user namespace before mount (see the sketch after this list)
- Enable mount namespace isolation for the vsock socket redirect in clones

Snapshot improvements:
- Add a userfaultfd access check with detailed error messages
- Better handling of rootless clone network setup

Test improvements:
- Add a unique_names() helper in tests/common for test isolation
- Update all snapshot/clone tests to use unique names (PID + counter)
- Prevents conflicts when tests run in parallel or with different users
- Add test_clone_port_forward_bridged and test_clone_port_forward_rootless
- Rootless tests FAIL loudly if run as root (instead of silently skipping)

Documentation:
- Document the clone port forwarding capability in the README
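A minimal sketch of the pre_exec setns step, calling libc directly; spawn_in_userns is a hypothetical helper, and the namespace path would come from VmManager's user_namespace_path field rather than being passed in like this:

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Illustrative helper: enter a user namespace before exec, so any mount
// namespace the child later creates is owned by that user namespace.
fn spawn_in_userns(userns_path: &str, mut cmd: Command) -> std::io::Result<Child> {
    let ns = File::open(userns_path)?; // e.g. "/proc/<pid>/ns/user"
    let fd = ns.as_raw_fd();
    // Safety: the closure runs after fork and before exec, and only makes
    // the setns syscall, which is safe in that window.
    unsafe {
        cmd.pre_exec(move || {
            if libc::setns(fd, libc::CLONE_NEWUSER) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn() // `ns` stays open across the fork, keeping fd valid
}
```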
- Add network mode guards in the fcvm binary (podman.rs, snapshot.rs)
  - Bridged without root: fails with a helpful error message
  - Rootless with root: warns that it is unnecessary
- Add dynamic NBD device selection in rootfs.rs (scans nbd0-nbd15; sketched after this list)
  - Enables parallel rootfs creation without conflicts
  - Includes retry logic for race conditions
- Add a require_non_root() helper in tests/common/mod.rs
  - All rootless tests now fail loudly if run as root
- Update all tests to use unique names (unique_names() or PID-based)
  - test_exec, test_egress, test_sanity, test_signal_cleanup, etc.
- Split Makefile targets by network mode
  - test-vm-exec-bridged/rootless, test-vm-egress-bridged/rootless
  - container-test-vm-exec-bridged/rootless, etc.
  - Bridged targets run with sudo, rootless targets without
- Remove silent test skips in test_readme_examples.rs
  - Tests now fail properly when run without the required privileges
- Fix clippy warnings (double-reference issues in test_snapshot_clone.rs)
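A hedged sketch of the dynamic NBD selection; the actual rootfs.rs logic may differ. It assumes the kernel convention that /sys/block/nbdN/pid exists only while a client (such as qemu-nbd) is attached, and the retry loop covers the race between picking a device and connecting to it:

```rust
use std::path::{Path, PathBuf};
use std::{thread, time::Duration};

// Find an unused /dev/nbd0..nbd15. An NBD device exposes
// /sys/block/nbdN/pid only while a client is attached.
fn pick_free_nbd() -> Option<PathBuf> {
    (0..16).find_map(|i| {
        let busy = Path::new(&format!("/sys/block/nbd{i}/pid")).exists();
        let dev = PathBuf::from(format!("/dev/nbd{i}"));
        (!busy && dev.exists()).then_some(dev)
    })
}

// Retry a few times in case another process grabs the device between
// the scan and the connect (the race the commit message mentions).
fn pick_free_nbd_with_retry(attempts: u32) -> Option<PathBuf> {
    for _ in 0..attempts {
        if let Some(dev) = pick_free_nbd() {
            return Some(dev);
        }
        thread::sleep(Duration::from_millis(200));
    }
    None
}
```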
The firecracker tarball from GitHub contains files owned by the packager's UID (647281167). When rootless podman tries to load an image with UIDs outside its subuid range, it fails with "lchown: invalid argument". Fix by adding chown root:root after extracting the firecracker binary; UID 0 is always mappable in rootless podman.
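A minimal sketch of the fix using std::os::unix::fs::chown (stable since Rust 1.73); the extraction path and function name are placeholders, and resetting ownership to root still requires running the extraction step with enough privilege to chown:

```rust
use std::io;
use std::os::unix::fs::chown;

// After untarring the release, reset ownership to root:root (UID/GID 0),
// which is always mappable inside a rootless podman user namespace.
fn fix_firecracker_ownership(extracted_binary: &str) -> io::Result<()> {
    chown(extracted_binary, Some(0), Some(0))
}
```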
The rootless container (using rootless podman) was running processes as UID 0 inside the container. The require_non_root() guard in tests correctly detected this and failed. Add --user testuser to CONTAINER_RUN_ROOTLESS so tests run as non-root inside the container, matching the actual rootless use case.
Bridged tests create the rootfs as root. Rootless tests then use the pre-created rootfs. Running rootless first fails because testuser can't access NBD devices to create the rootfs. Order changed:
- container-test-vm-exec: bridged first, then rootless
- container-test-vm-egress: bridged first, then rootless
- container-test-vm: bridged first, then rootless
- Use rootless podman with --privileged for user namespace capabilities
- Add --group-add keep-groups to preserve the kvm group for /dev/kvm access
- Update require_non_root() to detect the container environment via the /run/.containerenv or /.dockerenv marker files (sketched after this list)
- The container is the isolation boundary, not the UID inside it
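A sketch of what the updated guard might look like; the real helper lives in tests/common/mod.rs and its exact checks and messages may differ:

```rust
use std::path::Path;

// Treat the container itself as the isolation boundary: inside a container
// (detected via podman's /run/.containerenv or docker's /.dockerenv),
// UID 0 is fine; on a bare host, rootless tests must not run as root.
fn require_non_root() {
    let in_container = Path::new("/run/.containerenv").exists()
        || Path::new("/.dockerenv").exists();
    // Safety: geteuid cannot fail and touches no Rust state.
    let euid = unsafe { libc::geteuid() };
    assert!(
        in_container || euid != 0,
        "rootless test must not run as root on the host (euid=0)"
    );
}
```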
- Create /dev/userfaultfd if missing (mknod c 10 126); see the sketch after this list
- Set permissions to 666 for container access
- Enable the vm.unprivileged_userfaultfd=1 sysctl
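The same setup expressed as a hedged Rust sketch (the actual change is likely a few shell lines in the container setup); it uses libc's mknod/makedev and writes the sysctl through procfs:

```rust
use std::ffi::CString;
use std::fs::{self, Permissions};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// Create the userfaultfd misc device (char major 10, minor 126) if the
// container image lacks it, open its permissions, and enable the sysctl.
fn setup_userfaultfd() -> std::io::Result<()> {
    let dev = Path::new("/dev/userfaultfd");
    if !dev.exists() {
        let path = CString::new("/dev/userfaultfd").unwrap();
        let rc = unsafe {
            libc::mknod(path.as_ptr(), libc::S_IFCHR | 0o666, libc::makedev(10, 126))
        };
        if rc != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    // mknod honors the umask, so set the mode explicitly afterwards.
    fs::set_permissions(dev, Permissions::from_mode(0o666))?;
    fs::write("/proc/sys/vm/unprivileged_userfaultfd", "1")
}
```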
Each CI job runs on a different BuildJet runner, which means each needs to recreate the rootfs via virt-customize. This was causing timeouts because virt-customize can be slow or hang on some runners.

Combine all VM tests (sanity, exec, egress) into a single job that runs them sequentially. The rootfs is created once during the sanity test and reused for the exec and egress tests. Also add verbose output to virt-customize for debugging.
- Combine lint + build + unit tests + CLI tests + FUSE integration into a single build-and-test job
- Combine noroot + root FUSE tests into a single fuse-tests job
- Combine bridged + exec + egress VM tests into a single vm-tests job
- Remove verbose diagnostic output from VM setup steps
- Each job now compiles once and runs all related tests sequentially

Reduces from 9 jobs to 4, eliminating ~5 redundant cargo builds.
- Add CI=1 mode to the Makefile that uses host directories instead of named volumes
- Add a container-build-only target for compiling without running tests
- CI workflow: the Build job compiles inside the container and uploads target/release
- FUSE Tests and POSIX Compliance download the artifact and run tests without rebuilding
- Lint and Native Tests run in parallel using rust-cache
- VM Tests run independently on BuildJet (separate build)

Dependency graph:
- Build, Lint, Native Tests, and VM Tests start in parallel
- FUSE Tests and POSIX Compliance wait for Build, then run in parallel
- Container tests reuse the pre-built binaries (no recompilation)
Replace the virt-customize/NBD approach with a fully rootless setup:
- No sudo required - only kvm group membership for /dev/kvm
- initrd boots with busybox, mounts the rootfs and packages ISO
- Packages delivered via ISO9660 (genisoimage, no root needed)
- chroot installs packages with bind-mounted /proc, /sys, /dev

Content-addressable caching:
- SHA256 of the complete init script (mounts + install + setup)
- Layer 2 is rebuilt only when the init script content changes
- fc-agent is NOT in Layer 2 - it is injected per-VM via a separate initrd

Rootless operations used throughout:
- qemu-img convert (qcow2 → raw)
- sfdisk --json for GPT partition parsing
- dd skip/count for partition extraction
- truncate + resize2fs for filesystem expansion
- debugfs for fstab fixes (removes BOOT/UEFI entries)
- genisoimage for packages ISO creation
- cpio for initrd archive creation

New rootfs-plan.toml config file:
- Defines the base image URL per architecture
- Lists packages: runtime (podman, crun), fuse, system
- Specifies services to enable/disable

Success is detected via an FCVM_SETUP_COMPLETE marker in the serial output instead of timing-based heuristics (see the sketch below).
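A sketch of that marker-based success detection, assuming the serial console is captured to a log file; the function name and polling interval are illustrative:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::{Duration, Instant};
use std::thread;

// Poll the captured serial log for the completion marker instead of
// guessing from elapsed time. Returns true once the marker appears.
fn wait_for_setup_complete(serial_log: &str, timeout: Duration) -> std::io::Result<bool> {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        // Re-read the log each pass; it is small during Layer 2 setup.
        let reader = BufReader::new(File::open(serial_log)?);
        for line in reader.lines() {
            if line?.contains("FCVM_SETUP_COMPLETE") {
                return Ok(true);
            }
        }
        thread::sleep(Duration::from_millis(500));
    }
    Ok(false)
}
```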
Replace the custom kernel build with the Kata Containers kernel:
- Download from the Kata 3.24.0 release (kernel 6.12.47)
- The Kata kernel has CONFIG_FUSE_FS=y built in
- Cache by URL hash; auto-download on first run
- Add a kernel config section to rootfs-plan.toml

Embed packages directly in the initrd instead of an ISO:
- No ISO9660/SquashFS filesystem driver needed
- Packages are copied from /packages in the initrd to the rootfs
- initrd size ~205MB (317 packages embedded)
- Only one disk needed during Layer 2 setup

Update the SHA calculation (sketched after this list):
- Include the kernel URL in the Layer 2 hash
- Changing the kernel URL triggers a Layer 2 rebuild

Add the hex crate dependency for SHA encoding.
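A sketch of the Layer 2 cache key with the kernel URL folded in, using the sha2 and hex crates the commits mention; the function and parameter names are illustrative:

```rust
use sha2::{Digest, Sha256};

// Content-addressable cache key for Layer 2: hash the full init script
// plus the kernel URL, so changing either triggers a rebuild.
fn layer2_key(init_script: &str, kernel_url: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(init_script.as_bytes());
    hasher.update(kernel_url.as_bytes());
    hex::encode(hasher.finalize())
}
```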
ejc3 added a commit that referenced this pull request on Feb 23, 2026
…hang

Replace the racy double-rebind approach with a deterministic handshake chain that guarantees the exec server's AsyncFd epoll is re-registered before the host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset (VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll registration stale. The previous fix (c15aa6b) removed the duplicate rebind signal from agent.rs but left a timing gap: if the restore-epoch watcher's single signal arrived late, the host's health monitor would start exec calls against a stale listener, hanging for ~60s until the kernel's vsock cleanup expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

T+0.009s   Exec call #1 → WORKS (167+144+176+123+71+27 bytes response)
T+0.076s   Exec call #2 → WORKS
T+0.520s   Exec call #3 → guest ACKs, receives request, sends NOTHING → 5s timeout
T+5.5-55s  Exec calls #4-#9 → same pattern: kernel accepts, app never processes
T+60.5s    Guest sends RST for ALL stale connections simultaneously
T+60.5s    Exec call #10 → WORKS → "container running status running=true"

The container started at T+0.28s. The exec server was broken for 60 more seconds because the duplicate re_register() from agent.rs corrupted the edge-triggered epoll: the old AsyncFd consumed the edge notification, and the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                            ↓
                                           host accepts output connection
                                                            ↓
                                                health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

### fc-agent side (4 files):
- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects the output vsock
- agent.rs: Removed the duplicate rebind signal after notify_cache_ready_and_wait; added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: Threads the new params through watch_restore_epoch to both handle_clone_restore call sites

### Host side (3 files):
- listeners.rs: Added a connected_tx oneshot to run_output_listener(), fired on the first output connection accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning the health monitor; removed the stale output_reconnect.notify_one() for startup snapshots
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After: restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms (max single: 15.3ms), zero timeouts.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
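A sketch of the AtomicBool + Notify pair that exec.rs signals and restore.rs awaits; the real code threads these through Arcs in agent.rs, so the type and method names here are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::Notify;
use tokio::time::{timeout, Duration};

// One link in the handshake chain: the exec server flips `done` after
// re-registering its AsyncFd, then wakes the restore task, which only
// reconnects the output vsock once the flag is confirmed.
#[derive(Default)]
struct RebindDone {
    done: AtomicBool,
    notify: Notify,
}

impl RebindDone {
    // Called by the exec server right after re_register().
    fn signal(&self) {
        self.done.store(true, Ordering::SeqCst);
        // notify_one() stores a permit when no task is waiting yet, so a
        // signal that races ahead of wait() is not lost.
        self.notify.notify_one();
    }

    // Called by the restore task; false means the timeout (5s in the
    // commit) fired before the exec server confirmed the rebind.
    async fn wait(&self, dur: Duration) -> bool {
        timeout(dur, async {
            while !self.done.load(Ordering::SeqCst) {
                self.notify.notified().await;
            }
        })
        .await
        .is_ok()
    }
}
```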
Summary
Part 1 of 5 - splitting a large branch into logical PRs.