Network changes:
- slirp0 now uses the 10.0.2.100/24 address for DNAT compatibility
- Add a DNAT rule to redirect hostfwd traffic (10.0.2.100) to the guest IP
- This enables port forwarding to work with the dual-TAP architecture

VM namespace handling:
- Add user_namespace_path and net_namespace_path to VmManager
- Implement pre_exec setns for entering the user namespace before mount (see the sketch after this list)
- Enable mount namespace isolation for the vsock socket redirect in clones

Snapshot improvements:
- Add a userfaultfd access check with detailed error messages
- Better handling of rootless clone network setup

Test improvements:
- Add a unique_names() helper in tests/common for test isolation
- Update all snapshot/clone tests to use unique names (PID + counter)
- Prevents conflicts when tests run in parallel or with different users
- Add test_clone_port_forward_bridged and test_clone_port_forward_rootless
- Rootless tests FAIL loudly if run as root (instead of silently skipping)

Documentation:
- Document the clone port forwarding capability in the README
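A minimal sketch of the pre_exec setns step, calling libc directly; spawn_in_userns is a hypothetical helper, and the namespace path would come from VmManager's user_namespace_path field rather than being passed in like this:

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Illustrative helper: enter a user namespace before exec, so any mount
// namespace the child later creates is owned by that user namespace.
fn spawn_in_userns(userns_path: &str, mut cmd: Command) -> std::io::Result<Child> {
    let ns = File::open(userns_path)?; // e.g. "/proc/<pid>/ns/user"
    let fd = ns.as_raw_fd();
    // Safety: the closure runs after fork and before exec, and only makes
    // the setns syscall, which is safe in that window.
    unsafe {
        cmd.pre_exec(move || {
            if libc::setns(fd, libc::CLONE_NEWUSER) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn() // `ns` stays open across the fork, keeping fd valid
}
```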
- Add network mode guards in the fcvm binary (podman.rs, snapshot.rs)
  - Bridged without root: fails with a helpful error message
  - Rootless with root: warns that it is unnecessary
- Add dynamic NBD device selection in rootfs.rs (scans nbd0-nbd15; sketched after this list)
  - Enables parallel rootfs creation without conflicts
  - Includes retry logic for race conditions
- Add a require_non_root() helper in tests/common/mod.rs
  - All rootless tests now fail loudly if run as root
- Update all tests to use unique names (unique_names() or PID-based)
  - test_exec, test_egress, test_sanity, test_signal_cleanup, etc.
- Split Makefile targets by network mode
  - test-vm-exec-bridged/rootless, test-vm-egress-bridged/rootless
  - container-test-vm-exec-bridged/rootless, etc.
  - Bridged targets run with sudo, rootless targets without
- Remove silent test skips in test_readme_examples.rs
  - Tests now fail properly when run without the required privileges
- Fix clippy warnings (double-reference issues in test_snapshot_clone.rs)
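A hedged sketch of the dynamic NBD selection; the actual rootfs.rs logic may differ. It assumes the kernel convention that /sys/block/nbdN/pid exists only while a client (such as qemu-nbd) is attached, and the retry loop covers the race between picking a device and connecting to it:

```rust
use std::path::{Path, PathBuf};
use std::{thread, time::Duration};

// Find an unused /dev/nbd0..nbd15. An NBD device exposes
// /sys/block/nbdN/pid only while a client is attached.
fn pick_free_nbd() -> Option<PathBuf> {
    (0..16).find_map(|i| {
        let busy = Path::new(&format!("/sys/block/nbd{i}/pid")).exists();
        let dev = PathBuf::from(format!("/dev/nbd{i}"));
        (!busy && dev.exists()).then_some(dev)
    })
}

// Retry a few times in case another process grabs the device between
// the scan and the connect (the race the commit message mentions).
fn pick_free_nbd_with_retry(attempts: u32) -> Option<PathBuf> {
    for _ in 0..attempts {
        if let Some(dev) = pick_free_nbd() {
            return Some(dev);
        }
        thread::sleep(Duration::from_millis(200));
    }
    None
}
```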
The firecracker tarball from GitHub contains files owned by the packager's UID (647281167). When rootless podman tries to load an image with UIDs outside its subuid range, it fails with "lchown: invalid argument". Fix by adding chown root:root after extracting the firecracker binary; UID 0 is always mappable in rootless podman.
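A minimal sketch of the fix using std::os::unix::fs::chown (stable since Rust 1.73); the extraction path and function name are placeholders, and resetting ownership to root still requires running the extraction step with enough privilege to chown:

```rust
use std::io;
use std::os::unix::fs::chown;

// After untarring the release, reset ownership to root:root (UID/GID 0),
// which is always mappable inside a rootless podman user namespace.
fn fix_firecracker_ownership(extracted_binary: &str) -> io::Result<()> {
    chown(extracted_binary, Some(0), Some(0))
}
```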
The rootless container (using rootless podman) was running processes as UID 0 inside the container. The require_non_root() guard in tests correctly detected this and failed. Add --user testuser to CONTAINER_RUN_ROOTLESS so tests run as non-root inside the container, matching the actual rootless use case.
Bridged tests create the rootfs as root. Rootless tests then use the pre-created rootfs. Running rootless first fails because testuser can't access NBD devices to create the rootfs. Order changed:
- container-test-vm-exec: bridged first, then rootless
- container-test-vm-egress: bridged first, then rootless
- container-test-vm: bridged first, then rootless
- Use rootless podman with --privileged for user namespace capabilities
- Add --group-add keep-groups to preserve the kvm group for /dev/kvm access
- Update require_non_root() to detect the container environment via the /run/.containerenv or /.dockerenv marker files (sketched after this list)
- The container is the isolation boundary, not the UID inside it
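A sketch of what the updated guard might look like; the real helper lives in tests/common/mod.rs and its exact checks and messages may differ:

```rust
use std::path::Path;

// Treat the container itself as the isolation boundary: inside a container
// (detected via podman's /run/.containerenv or docker's /.dockerenv),
// UID 0 is fine; on a bare host, rootless tests must not run as root.
fn require_non_root() {
    let in_container = Path::new("/run/.containerenv").exists()
        || Path::new("/.dockerenv").exists();
    // Safety: geteuid cannot fail and touches no Rust state.
    let euid = unsafe { libc::geteuid() };
    assert!(
        in_container || euid != 0,
        "rootless test must not run as root on the host (euid=0)"
    );
}
```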
- Create /dev/userfaultfd if missing (mknod c 10 126); see the sketch after this list
- Set permissions to 666 for container access
- Enable the vm.unprivileged_userfaultfd=1 sysctl
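The same setup expressed as a hedged Rust sketch (the actual change is likely a few shell lines in the container setup); it uses libc's mknod/makedev and writes the sysctl through procfs:

```rust
use std::ffi::CString;
use std::fs::{self, Permissions};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// Create the userfaultfd misc device (char major 10, minor 126) if the
// container image lacks it, open its permissions, and enable the sysctl.
fn setup_userfaultfd() -> std::io::Result<()> {
    let dev = Path::new("/dev/userfaultfd");
    if !dev.exists() {
        let path = CString::new("/dev/userfaultfd").unwrap();
        let rc = unsafe {
            libc::mknod(path.as_ptr(), libc::S_IFCHR | 0o666, libc::makedev(10, 126))
        };
        if rc != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    // mknod honors the umask, so set the mode explicitly afterwards.
    fs::set_permissions(dev, Permissions::from_mode(0o666))?;
    fs::write("/proc/sys/vm/unprivileged_userfaultfd", "1")
}
```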
Each CI job runs on a different BuildJet runner, which means each needs to recreate the rootfs via virt-customize. This was causing timeouts because virt-customize can be slow or hang on some runners.

Combine all VM tests (sanity, exec, egress) into a single job that runs them sequentially. The rootfs is created once during the sanity test and reused for the exec and egress tests. Also add verbose output to virt-customize for debugging.
- Combine lint + build + unit tests + CLI tests + FUSE integration into a single build-and-test job
- Combine noroot + root FUSE tests into a single fuse-tests job
- Combine bridged + exec + egress VM tests into a single vm-tests job
- Remove verbose diagnostic output from VM setup steps
- Each job now compiles once and runs all related tests sequentially

Reduces from 9 jobs to 4, eliminating ~5 redundant cargo builds.
- Add CI=1 mode to the Makefile that uses host directories instead of named volumes
- Add a container-build-only target for compiling without running tests
- CI workflow: the Build job compiles inside the container and uploads target/release
- FUSE Tests and POSIX Compliance download the artifact and run tests without rebuilding
- Lint and Native Tests run in parallel using rust-cache
- VM Tests run independently on BuildJet (separate build)

Dependency graph:
- Build, Lint, Native Tests, and VM Tests start in parallel
- FUSE Tests and POSIX Compliance wait for Build, then run in parallel
- Container tests reuse the pre-built binaries (no recompilation)
Replace the virt-customize/NBD approach with a fully rootless setup:
- No sudo required - only kvm group membership for /dev/kvm
- initrd boots with busybox, mounts the rootfs and packages ISO
- Packages delivered via ISO9660 (genisoimage, no root needed)
- chroot installs packages with bind-mounted /proc, /sys, /dev

Content-addressable caching:
- SHA256 of the complete init script (mounts + install + setup)
- Layer 2 is rebuilt only when the init script content changes
- fc-agent is NOT in Layer 2 - it is injected per-VM via a separate initrd

Rootless operations used throughout:
- qemu-img convert (qcow2 → raw)
- sfdisk --json for GPT partition parsing
- dd skip/count for partition extraction
- truncate + resize2fs for filesystem expansion
- debugfs for fstab fixes (removes BOOT/UEFI entries)
- genisoimage for packages ISO creation
- cpio for initrd archive creation

New rootfs-plan.toml config file:
- Defines the base image URL per architecture
- Lists packages: runtime (podman, crun), fuse, system
- Specifies services to enable/disable

Success is detected via an FCVM_SETUP_COMPLETE marker in the serial output instead of timing-based heuristics (see the sketch below).
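A sketch of that marker-based success detection, assuming the serial console is captured to a log file; the function name and polling interval are illustrative:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::{Duration, Instant};
use std::thread;

// Poll the captured serial log for the completion marker instead of
// guessing from elapsed time. Returns true once the marker appears.
fn wait_for_setup_complete(serial_log: &str, timeout: Duration) -> std::io::Result<bool> {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        // Re-read the log each pass; it is small during Layer 2 setup.
        let reader = BufReader::new(File::open(serial_log)?);
        for line in reader.lines() {
            if line?.contains("FCVM_SETUP_COMPLETE") {
                return Ok(true);
            }
        }
        thread::sleep(Duration::from_millis(500));
    }
    Ok(false)
}
```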
Replace the custom kernel build with the Kata Containers kernel:
- Download from the Kata 3.24.0 release (kernel 6.12.47)
- The Kata kernel has CONFIG_FUSE_FS=y built in
- Cache by URL hash; auto-download on first run
- Add a kernel config section to rootfs-plan.toml

Embed packages directly in the initrd instead of an ISO:
- No ISO9660/SquashFS filesystem driver needed
- Packages are copied from /packages in the initrd to the rootfs
- initrd size ~205MB (317 packages embedded)
- Only one disk needed during Layer 2 setup

Update the SHA calculation (sketched after this list):
- Include the kernel URL in the Layer 2 hash
- Changing the kernel URL triggers a Layer 2 rebuild

Add the hex crate dependency for SHA encoding.
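A sketch of the Layer 2 cache key with the kernel URL folded in, using the sha2 and hex crates the commits mention; the function and parameter names are illustrative:

```rust
use sha2::{Digest, Sha256};

// Content-addressable cache key for Layer 2: hash the full init script
// plus the kernel URL, so changing either triggers a rebuild.
fn layer2_key(init_script: &str, kernel_url: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(init_script.as_bytes());
    hasher.update(kernel_url.as_bytes());
    hex::encode(hasher.finalize())
}
```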
ejc3 added a commit that referenced this pull request on Feb 23, 2026
…hang

Replace the racy double-rebind approach with a deterministic handshake chain that guarantees the exec server's AsyncFd epoll is re-registered before the host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset (VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll registration stale. The previous fix (c15aa6b) removed the duplicate rebind signal from agent.rs but left a timing gap: if the restore-epoch watcher's single signal arrived late, the host's health monitor would start exec calls against a stale listener, hanging for ~60s until the kernel's vsock cleanup expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

T+0.009s   Exec call #1 → WORKS (167+144+176+123+71+27 bytes response)
T+0.076s   Exec call #2 → WORKS
T+0.520s   Exec call #3 → guest ACKs, receives request, sends NOTHING → 5s timeout
T+5.5-55s  Exec calls #4-#9 → same pattern: kernel accepts, app never processes
T+60.5s    Guest sends RST for ALL stale connections simultaneously
T+60.5s    Exec call #10 → WORKS → "container running status running=true"

The container started at T+0.28s. The exec server was broken for 60 more seconds because the duplicate re_register() from agent.rs corrupted the edge-triggered epoll: the old AsyncFd consumed the edge notification, and the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                            ↓
                                           host accepts output connection
                                                            ↓
                                                health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

### fc-agent side (4 files):
- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects the output vsock
- agent.rs: Removed the duplicate rebind signal after notify_cache_ready_and_wait; added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: Threads the new params through watch_restore_epoch to both handle_clone_restore call sites

### Host side (3 files):
- listeners.rs: Added a connected_tx oneshot to run_output_listener(), fired on the first output connection accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning the health monitor; removed the stale output_reconnect.notify_one() for startup snapshots
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After: restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms (max single: 15.3ms), zero timeouts.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
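A sketch of the AtomicBool + Notify pair that exec.rs signals and restore.rs awaits; the real code threads these through Arcs in agent.rs, so the type and method names here are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::Notify;
use tokio::time::{timeout, Duration};

// One link in the handshake chain: the exec server flips `done` after
// re-registering its AsyncFd, then wakes the restore task, which only
// reconnects the output vsock once the flag is confirmed.
#[derive(Default)]
struct RebindDone {
    done: AtomicBool,
    notify: Notify,
}

impl RebindDone {
    // Called by the exec server right after re_register().
    fn signal(&self) {
        self.done.store(true, Ordering::SeqCst);
        // notify_one() stores a permit when no task is waiting yet, so a
        // signal that races ahead of wait() is not lost.
        self.notify.notify_one();
    }

    // Called by the restore task; false means the timeout (5s in the
    // commit) fired before the exec server confirmed the rebind.
    async fn wait(&self, dur: Duration) -> bool {
        timeout(dur, async {
            while !self.done.load(Ordering::SeqCst) {
                self.notify.notified().await;
            }
        })
        .await
        .is_ok()
    }
}
```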
Summary
Part 1 of 5 - splitting a large branch into logical PRs.