
CI infrastructure & Kata kernel #9

Merged
ejc3 merged 19 commits into main from pr1-ci-kata on Dec 24, 2025
Conversation

ejc3 (Owner) commented Dec 24, 2025

Summary

  • CI debugging and job consolidation
  • Kata kernel with FUSE support
  • Initrd-based rootfs setup
  • Rootless podman container fixes

Part 1 of 5 - splitting large branch into logical PRs.

ejc3 added 19 commits December 21, 2025 07:28
Network changes:
- slirp0 now uses 10.0.2.100/24 address for DNAT compatibility
- Add DNAT rule to redirect hostfwd traffic (10.0.2.100) to guest IP
- This enables port forwarding to work with dual-TAP architecture

VM namespace handling:
- Add user_namespace_path and net_namespace_path to VmManager
- Implement pre_exec setns for entering user namespace before mount
- Enable mount namespace isolation for vsock socket redirect in clones

Snapshot improvements:
- Add userfaultfd access check with detailed error messages
- Better handling of rootless clone network setup

Test improvements:
- Add unique_names() helper in tests/common for test isolation
- Update all snapshot/clone tests to use unique names (PID + counter)
- Prevents conflicts when tests run in parallel or with different users
- Add test_clone_port_forward_bridged and test_clone_port_forward_rootless
- Rootless tests FAIL loudly if run as root (not silently skip)

Documentation:
- Document clone port forwarding capability in README
- Add network mode guards in fcvm binary (podman.rs, snapshot.rs)
  - Bridged without root: fails with helpful error message
  - Rootless with root: warns that it's unnecessary
- Add dynamic NBD device selection in rootfs.rs (scans nbd0-nbd15)
  - Enables parallel rootfs creation without conflicts
  - Includes retry logic for race conditions
- Add require_non_root() helper in tests/common/mod.rs
  - All rootless tests now fail loudly if run as root
- Update all tests to use unique names (unique_names() or PID-based)
  - test_exec, test_egress, test_sanity, test_signal_cleanup, etc.
- Split Makefile targets by network mode
  - test-vm-exec-bridged/rootless, test-vm-egress-bridged/rootless
  - container-test-vm-exec-bridged/rootless, etc.
  - Bridged targets run with sudo, rootless without
- Remove silent test skips in test_readme_examples.rs
  - Tests now fail properly when run without required privileges
- Fix clippy warnings (double-reference issues in test_snapshot_clone.rs)
The firecracker tarball from GitHub contains files owned by the
packager's UID (647281167). When rootless podman tries to load an
image with UIDs outside its subuid range, it fails with:
"lchown: invalid argument"

Fix by adding chown root:root after extracting firecracker binary.
UID 0 is always mappable in rootless podman.
The rootless container (using rootless podman) was running processes as
UID 0 inside the container. The require_non_root() guard in tests
correctly detected this and failed.

Add --user testuser to CONTAINER_RUN_ROOTLESS so tests run as
non-root inside the container, matching the actual rootless use case.
Bridged tests create the rootfs as root. Rootless tests then use
the pre-created rootfs. Running rootless first fails because testuser
can't access NBD devices to create the rootfs.

Order changed:
- container-test-vm-exec: bridged first, then rootless
- container-test-vm-egress: bridged first, then rootless
- container-test-vm: bridged first, then rootless
- Use rootless podman with --privileged for user namespace capabilities
- Add --group-add keep-groups to preserve kvm group for /dev/kvm access
- Update require_non_root() to detect container environment via
  /run/.containerenv or /.dockerenv marker files
- Container is the isolation boundary, not UID inside it
- Create /dev/userfaultfd if missing (mknod c 10 126)
- Set permissions to 666 for container access
- Enable vm.unprivileged_userfaultfd=1 sysctl
Each CI job runs on a different BuildJet runner, which means each
needs to recreate the rootfs via virt-customize. This was causing
timeouts because virt-customize can be slow or hang on some runners.

Combine all VM tests (sanity, exec, egress) into a single job that
runs them sequentially. The rootfs is created once during the sanity
test and reused for exec and egress tests.

Also add verbose output to virt-customize for debugging.
- Combine lint + build + unit tests + CLI tests + FUSE integration into single build-and-test job
- Combine noroot + root FUSE tests into single fuse-tests job
- Combine bridged + exec + egress VM tests into single vm-tests job
- Remove verbose diagnostic output from VM setup steps
- Each job now compiles once and runs all related tests sequentially

Reduces from 9 jobs to 4 jobs, eliminating ~5 redundant cargo builds.
- Add CI=1 mode to Makefile that uses host directories instead of named volumes
- Add container-build-only target for compiling without running tests
- CI workflow: Build job compiles inside container, uploads target/release
- FUSE Tests and POSIX Compliance download artifact, run tests without rebuild
- Lint and Native Tests run in parallel using rust-cache
- VM Tests run independently on BuildJet (separate build)

Dependency graph:
- Build, Lint, Native Tests, VM Tests start in parallel
- FUSE Tests and POSIX Compliance wait for Build, then run in parallel
- Container tests reuse pre-built binaries (no recompilation)
Replace virt-customize/NBD approach with fully rootless setup:

- No sudo required - only kvm group membership for /dev/kvm
- initrd boots with busybox, mounts rootfs and packages ISO
- Packages delivered via ISO9660 (genisoimage, no root needed)
- chroot installs packages with bind-mounted /proc, /sys, /dev

Content-addressable caching:
- SHA256 of complete init script (mounts + install + setup)
- Layer 2 rebuilt only when init script content changes
- fc-agent NOT in Layer 2 - injected per-VM via separate initrd

Rootless operations used throughout:
- qemu-img convert (qcow2 → raw)
- sfdisk --json for GPT partition parsing
- dd skip/count for partition extraction
- truncate + resize2fs for filesystem expansion
- debugfs for fstab fixes (removes BOOT/UEFI entries)
- genisoimage for packages ISO creation
- cpio for initrd archive

New rootfs-plan.toml config file:
- Defines base image URL per architecture
- Lists packages: runtime (podman, crun), fuse, system
- Specifies services to enable/disable
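A plausible shape for rootfs-plan.toml, based on the fields listed above — the section and key names here are illustrative assumptions, not copied from the actual file:

```toml
# Hypothetical rootfs-plan.toml layout (field names are assumptions).
[base]
url_x86_64 = "https://cloud-images.example/server-cloudimg-amd64.img"
url_aarch64 = "https://cloud-images.example/server-cloudimg-arm64.img"

[packages]
runtime = ["podman", "crun"]
fuse = ["fuse3"]
system = ["ca-certificates"]

[services]
enable = ["podman.socket"]
disable = ["snapd"]

[kernel]
url = "https://github.com/kata-containers/kata-containers/releases/..."
```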

Success detection via FCVM_SETUP_COMPLETE marker in serial
output instead of timing-based heuristics.
Replace custom kernel build with Kata Containers kernel:
- Download from Kata 3.24.0 release (kernel 6.12.47)
- Kata kernel has CONFIG_FUSE_FS=y built-in
- Cache by URL hash, auto-download on first run
- Add kernel config section to rootfs-plan.toml

Embed packages directly in initrd instead of ISO:
- No ISO9660/SquashFS filesystem driver needed
- Packages copied from /packages in initrd to rootfs
- initrd size ~205MB (317 packages embedded)
- Only one disk needed during Layer 2 setup

Update SHA calculation:
- Include kernel URL in Layer 2 hash
- Changing kernel URL triggers Layer 2 rebuild

Add hex crate dependency for SHA encoding.
@ejc3 ejc3 merged commit 818a902 into main Dec 24, 2025
2 of 4 checks passed
@ejc3 ejc3 deleted the pr1-ci-kata branch December 24, 2025 08:30
ejc3 added a commit that referenced this pull request Feb 23, 2026
…hang

Replace the racy double-rebind approach with a deterministic handshake chain
that guarantees the exec server's AsyncFd epoll is re-registered before the
host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset
(VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll
registration stale. The previous fix (c15aa6b) removed the duplicate rebind
signal from agent.rs but left a timing gap: if the restore-epoch watcher's
single signal arrived late, the host's health monitor would start exec calls
against a stale listener, hanging for ~60s until the kernel's vsock cleanup
expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

  T+0.009s  Exec call #1 → WORKS (167+144+176+123+71+27 bytes response)
  T+0.076s  Exec call #2 → WORKS
  T+0.520s  Exec call #3 → guest ACKs, receives request, sends NOTHING → 5s timeout
  T+5.5-55s Exec calls #4-#9 → same pattern: kernel accepts, app never processes
  T+60.5s   Guest sends RST for ALL stale connections simultaneously
  T+60.5s   Exec call #10 → WORKS → "container running status running=true"

The container started at T+0.28s. The exec server was broken for 60 more
seconds because the duplicate re_register() from agent.rs corrupted the
edge-triggered epoll: the old AsyncFd consumed the edge notification, and
the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

  exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                              ↓
                                               host accepts output connection
                                                              ↓
                                                  health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

### fc-agent side (4 files):

- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s
  timeout), THEN reconnects output vsock
- agent.rs: Removed duplicate rebind signal after notify_cache_ready_and_wait;
  added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: Threads new params through watch_restore_epoch to both
  handle_clone_restore call sites

### Host side (3 files):

- listeners.rs: Added connected_tx oneshot to run_output_listener(), fired
  on first output connection accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning
  health monitor; removed stale output_reconnect.notify_one() for startup
  snapshots
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After:  restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms
(max single: 15.3ms), zero timeouts.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
ejc3 added a commit that referenced this pull request Mar 2, 2026
CI infrastructure & Kata kernel
ejc3 added a commit that referenced this pull request Mar 2, 2026
…hang
ejc3 added a commit that referenced this pull request Mar 2, 2026
CI infrastructure & Kata kernel
ejc3 added a commit that referenced this pull request Mar 2, 2026
…hang