Fix parallel test execution with proper root/rootless isolation#8
Merged
Network changes:
- slirp0 now uses 10.0.2.100/24 address for DNAT compatibility
- Add DNAT rule to redirect hostfwd traffic (10.0.2.100) to guest IP
- This enables port forwarding to work with dual-TAP architecture

VM namespace handling:
- Add user_namespace_path and net_namespace_path to VmManager
- Implement pre_exec setns for entering user namespace before mount
- Enable mount namespace isolation for vsock socket redirect in clones

Snapshot improvements:
- Add userfaultfd access check with detailed error messages
- Better handling of rootless clone network setup

Test improvements:
- Add unique_names() helper in tests/common for test isolation
- Update all snapshot/clone tests to use unique names (PID + counter)
- Prevents conflicts when tests run in parallel or with different users
- Add test_clone_port_forward_bridged and test_clone_port_forward_rootless
- Rootless tests FAIL loudly if run as root (not silently skip)

Documentation:
- Document clone port forwarding capability in README
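The PID + counter naming scheme described above could look roughly like the following. This is a minimal sketch, not the actual helper in tests/common: the return shape (baseline, clone, snapshot) and the name layout are assumptions made for illustration.

```rust
use std::process;
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter: distinguishes calls made inside one test binary.
static NAME_COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical shape of a `unique_names()` helper. Returns
/// (baseline, clone, snapshot) names that share one unique suffix.
pub fn unique_names(prefix: &str) -> (String, String, String) {
    let id = NAME_COUNTER.fetch_add(1, Ordering::SeqCst);
    // The PID separates concurrent test processes (and different users);
    // the counter separates tests within the same process.
    let suffix = format!("{}-{}", process::id(), id);
    (
        format!("{prefix}-base-{suffix}"),
        format!("{prefix}-clone-{suffix}"),
        format!("{prefix}-snap-{suffix}"),
    )
}
```

Because the suffix combines the process ID with an atomic counter, two tests never share a name whether they run as threads in one binary or as separate processes under nextest.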
- Add network mode guards in fcvm binary (podman.rs, snapshot.rs)
  - Bridged without root: fails with helpful error message
  - Rootless with root: warns that it's unnecessary
- Add dynamic NBD device selection in rootfs.rs (scans nbd0-nbd15)
  - Enables parallel rootfs creation without conflicts
  - Includes retry logic for race conditions
- Add require_non_root() helper in tests/common/mod.rs
  - All rootless tests now fail loudly if run as root
- Update all tests to use unique names (unique_names() or PID-based)
  - test_exec, test_egress, test_sanity, test_signal_cleanup, etc.
- Split Makefile targets by network mode
  - test-vm-exec-bridged/rootless, test-vm-egress-bridged/rootless
  - container-test-vm-exec-bridged/rootless, etc.
  - Bridged targets run with sudo, rootless without
- Remove silent test skips in test_readme_examples.rs
  - Tests now fail properly when run without required privileges
- Fix clippy warnings (double-reference issues in test_snapshot_clone.rs)
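The network mode guard can be illustrated with a small pure function. This is a sketch under assumptions: the `NetworkMode` enum, the function name, and the message wording are hypothetical and not taken from podman.rs or snapshot.rs.

```rust
/// Hypothetical network mode type; the real fcvm type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkMode {
    Bridged,
    Rootless,
}

/// Returns Err for an unusable combination, Ok(Some(warning)) for a
/// combination that works but is unnecessary, and Ok(None) otherwise.
pub fn check_network_mode(mode: NetworkMode, is_root: bool) -> Result<Option<String>, String> {
    match (mode, is_root) {
        // Bridged networking needs root: fail early with a hint.
        (NetworkMode::Bridged, false) => Err(
            "bridged networking requires root: re-run with sudo or pass --network rootless"
                .to_string(),
        ),
        // Rootless networking works as root, but root is not needed.
        (NetworkMode::Rootless, true) => Ok(Some(
            "rootless networking does not need root; running as root is unnecessary".to_string(),
        )),
        _ => Ok(None),
    }
}
```

Keeping the decision in a function that takes `is_root` as a parameter makes the guard testable without actually changing privileges.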
The firecracker tarball from GitHub contains files owned by the packager's UID (647281167). When rootless podman tries to load an image with UIDs outside its subuid range, it fails with: "lchown: invalid argument" Fix by adding chown root:root after extracting firecracker binary. UID 0 is always mappable in rootless podman.
The rootless container (using rootless podman) was running processes as UID 0 inside the container. The require_non_root() guard in tests correctly detected this and failed. Add --user testuser to CONTAINER_RUN_ROOTLESS so tests run as non-root inside the container, matching the actual rootless use case.
Bridged tests create the rootfs as root. Rootless tests then use the pre-created rootfs. Running rootless first fails because testuser can't access NBD devices to create the rootfs. Order changed: - container-test-vm-exec: bridged first, then rootless - container-test-vm-egress: bridged first, then rootless - container-test-vm: bridged first, then rootless
- Use rootless podman with --privileged for user namespace capabilities - Add --group-add keep-groups to preserve kvm group for /dev/kvm access - Update require_non_root() to detect container environment via /run/.containerenv or /.dockerenv marker files - Container is the isolation boundary, not UID inside it
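Container detection through the marker files named above might be sketched as follows. The function names are hypothetical, and the `root` parameter exists only so the check can be exercised against a scratch directory.

```rust
use std::path::Path;

/// Marker files named in the commit message, relative to the filesystem root.
const CONTAINER_MARKERS: [&str; 2] = ["run/.containerenv", ".dockerenv"];

/// True if any container marker file exists under `root`.
pub fn in_container_at(root: &Path) -> bool {
    CONTAINER_MARKERS.iter().any(|marker| root.join(marker).exists())
}

/// Convenience wrapper that checks the real filesystem root.
pub fn in_container() -> bool {
    in_container_at(Path::new("/"))
}
```

A guard such as require_non_root() could then skip its UID check when `in_container()` returns true, reflecting that the container, not the UID inside it, is the isolation boundary.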
- Create /dev/userfaultfd if missing (mknod c 10 126) - Set permissions to 666 for container access - Enable vm.unprivileged_userfaultfd=1 sysctl
Each CI job runs on a different BuildJet runner, which means each needs to recreate the rootfs via virt-customize. This was causing timeouts because virt-customize can be slow or hang on some runners. Combine all VM tests (sanity, exec, egress) into a single job that runs them sequentially. The rootfs is created once during the sanity test and reused for exec and egress tests. Also add verbose output to virt-customize for debugging.
- Combine lint + build + unit tests + CLI tests + FUSE integration into single build-and-test job - Combine noroot + root FUSE tests into single fuse-tests job - Combine bridged + exec + egress VM tests into single vm-tests job - Remove verbose diagnostic output from VM setup steps - Each job now compiles once and runs all related tests sequentially Reduces from 9 jobs to 4 jobs, eliminating ~5 redundant cargo builds.
- Add CI=1 mode to Makefile that uses host directories instead of named volumes - Add container-build-only target for compiling without running tests - CI workflow: Build job compiles inside container, uploads target/release - FUSE Tests and POSIX Compliance download artifact, run tests without rebuild - Lint and Native Tests run in parallel using rust-cache - VM Tests run independently on BuildJet (separate build) Dependency graph: - Build, Lint, Native Tests, VM Tests start in parallel - FUSE Tests and POSIX Compliance wait for Build, then run in parallel - Container tests reuse pre-built binaries (no recompilation)
Replace virt-customize/NBD approach with fully rootless setup:
- No sudo required - only kvm group membership for /dev/kvm
- initrd boots with busybox, mounts rootfs and packages ISO
- Packages delivered via ISO9660 (genisoimage, no root needed)
- chroot installs packages with bind-mounted /proc, /sys, /dev

Content-addressable caching:
- SHA256 of complete init script (mounts + install + setup)
- Layer 2 rebuilt only when init script content changes
- fc-agent NOT in Layer 2 - injected per-VM via separate initrd

Rootless operations used throughout:
- qemu-img convert (qcow2 → raw)
- sfdisk --json for GPT partition parsing
- dd skip/count for partition extraction
- truncate + resize2fs for filesystem expansion
- debugfs for fstab fixes (removes BOOT/UEFI entries)
- genisoimage for packages ISO creation
- cpio for initrd archive

New rootfs-plan.toml config file:
- Defines base image URL per architecture
- Lists packages: runtime (podman, crun), fuse, system
- Specifies services to enable/disable

Success detection via FCVM_SETUP_COMPLETE marker in serial output instead of timing-based heuristics.
Replace custom kernel build with Kata Containers kernel:
- Download from Kata 3.24.0 release (kernel 6.12.47)
- Kata kernel has CONFIG_FUSE_FS=y built-in
- Cache by URL hash, auto-download on first run
- Add kernel config section to rootfs-plan.toml

Embed packages directly in initrd instead of ISO:
- No ISO9660/SquashFS filesystem driver needed
- Packages copied from /packages in initrd to rootfs
- initrd size ~205MB (317 packages embedded)
- Only one disk needed during Layer 2 setup

Update SHA calculation:
- Include kernel URL in Layer 2 hash
- Changing kernel URL triggers Layer 2 rebuild

Add hex crate dependency for SHA encoding.
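The content-addressable cache key can be sketched as a hash over every input that affects the image. Note the swapped-in technique: the commits use SHA256 (hex-encoded via the hex crate), while this dependency-free sketch substitutes the standard library's `DefaultHasher`, which is neither cryptographic nor guaranteed stable across Rust releases. The function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives a cache key from every input that influences the Layer 2 image.
/// ASSUMPTION: DefaultHasher stands in for SHA256 to keep the sketch
/// free of external crates.
pub fn layer2_cache_key(init_script: &str, kernel_url: &str) -> String {
    let mut hasher = DefaultHasher::new();
    // `str::hash` appends a terminator byte, so ("ab", "c") and
    // ("a", "bc") feed different byte streams into the hasher.
    init_script.hash(&mut hasher);
    kernel_url.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Image file name derived from the key, following the `layer2-{sha}.raw` layout.
pub fn layer2_image_name(init_script: &str, kernel_url: &str) -> String {
    format!("layer2-{}.raw", layer2_cache_key(init_script, kernel_url))
}
```

Because the kernel URL is hashed together with the init script, changing either one yields a new file name, so a rebuild is triggered simply because the expected file does not exist yet.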
When root creates /tmp/fcvm-layer2-initrd and then non-root
tries to use it, permission denied errors occur.
Fix by including UID in temp directory names:
- /tmp/fcvm-layer2-initrd-{uid}
- /tmp/fcvm-layer2-setup-{uid}
Each user gets their own temp directory, avoiding conflicts.
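As a small illustration of the per-UID naming, the paths can be built from the UID like this. The function name is hypothetical, and the caller is assumed to obtain the UID elsewhere (for example via `getuid`).

```rust
use std::path::PathBuf;

/// Builds the per-user temp directories named in the commit message.
/// Returns (initrd_dir, setup_dir) for the given UID.
pub fn layer2_temp_dirs(uid: u32) -> (PathBuf, PathBuf) {
    let tmp = PathBuf::from("/tmp");
    (
        tmp.join(format!("fcvm-layer2-initrd-{uid}")),
        tmp.join(format!("fcvm-layer2-setup-{uid}")),
    )
}
```

With this layout, root (UID 0) and an unprivileged user (say UID 1000) never touch the same directory, so a directory created by root can no longer block a later rootless run.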
For clones, port mappings now DNAT to veth_inner_ip (10.x.y.2), which the host can route to. The existing blanket DNAT rule inside the namespace (set up by setup_in_namespace_nat) forwards traffic from veth_inner_ip to guest_ip.

Changes:
- Track veth_inner_ip in BridgedNetwork for clones
- Port mappings target veth_inner_ip for clones, guest_ip for baseline
- Update test to treat direct guest access as N/A for clones (by design)

The test now passes:
- Port forward (host IP): curl host:19080 → clone nginx ✓
- Localhost port forward: curl localhost:19080 → clone nginx ✓
- Delete snapshot directory (memory.bin, disk.raw, etc.) on SIGTERM/SIGINT - Add double Ctrl-C protection: warns about running clones first, requires confirmation within 3 seconds to force shutdown - Prevents disk space exhaustion from orphaned snapshots (5.6GB each) - Each snapshot has ~2GB memory.bin that cannot be reflinked, so cleanup is essential for repeated test runs
- delete_state() now removes .json.lock and .json.tmp files - Prevents accumulation of orphaned lock files during test runs - Lock files are harmless but clutter the state directory
The rootless user_data_dir() and is_writable() fallback was overly complex and not needed. All fcvm operations require the btrfs filesystem at /mnt/fcvm-btrfs anyway, so the automatic fallback to ~/.local/share/fcvm was misleading - it would fail later when btrfs operations were attempted. Changes: - Remove user_data_dir() and is_writable() helpers - Simplify base_dir(), kernel_dir(), rootfs_dir() to just use DEFAULT_BASE_DIR - Remove fallback paths that check both user and system locations
Changes to disk format and error handling: - Rename disk files from .ext4 to .raw (reflects raw disk format) - Remove fallback to regular cp when reflink fails - Require btrfs filesystem explicitly with clear error message - Update test assertions to use .raw extension The fallback copy was problematic because: 1. Without reflinks, each VM would use ~10GB disk space 2. Regular copy would succeed but defeat the CoW benefit 3. Better to fail fast with a clear error about btrfs requirement
Implements bidirectional I/O channel between fc-agent and host for container stdout/stderr streaming.

fc-agent changes:
- Add OUTPUT_VSOCK_PORT (4997) for dedicated I/O channel
- Create vsock connection on container start
- Stream stdout/stderr to host as "stdout:line" / "stderr:line"
- Accept stdin from host as "stdin:line" (bidirectional)
- Wait for output tasks to complete before closing connection

Host changes (podman.rs):
- Add run_output_listener() for vsock output handling
- Parse raw line format and print with [ctr:stream] prefix
- Send ack for bidirectional protocol

This separates container output from the status channel (port 4999) for cleaner protocol handling.
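The host-side parsing of the "stdout:line" / "stderr:line" format might look like the sketch below. The type and function names are hypothetical, and the exact spacing of the printed `[ctr:stream]` prefix is an assumption; the vsock transport and the ack are left out.

```rust
/// Which container stream a line came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stream {
    Stdout,
    Stderr,
}

/// Splits a raw "stdout:line" / "stderr:line" message into stream and payload.
/// Only the first ':' is a separator, so payloads may contain colons.
pub fn parse_output_line(raw: &str) -> Option<(Stream, &str)> {
    let (tag, rest) = raw.split_once(':')?;
    let stream = match tag {
        "stdout" => Stream::Stdout,
        "stderr" => Stream::Stderr,
        _ => return None, // unknown tag: let the caller decide how to report it
    };
    Some((stream, rest))
}

/// Renders a raw message the way the host prints it.
/// ASSUMPTION: the "[ctr:<stream>] <line>" layout is illustrative.
pub fn format_for_host(raw: &str) -> Option<String> {
    let (stream, line) = parse_output_line(raw)?;
    let name = match stream {
        Stream::Stdout => "stdout",
        Stream::Stderr => "stderr",
    };
    Some(format!("[ctr:{name}] {line}"))
}
```

Splitting on the first colon only matters in practice: container output such as `GET /index.html: 200` must survive the round trip unchanged.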
Tests that use bridged networking or modify iptables require root. Adding #[cfg(feature = "privileged-tests")] allows running unprivileged tests separately from privileged ones. Affected tests: - test_sanity_bridged - test_egress_fresh_bridged, test_egress_clone_bridged - test_egress_stress_bridged - test_exec_bridged - test_fuse_in_vm_smoke, test_fuse_in_vm_full - test_posix_all_sequential_bridged (renamed for clarity) - test_port_forward_bridged Rootless variants remain unprivileged and run without the feature flag.
Changes enable tests to run concurrently without resource conflicts: tests/common/mod.rs: - Make require_non_root() a no-op (testing shows unshare works as root) - Keep for API compatibility test_health_monitor.rs: - Use create_unique_test_dir() instead of shared base dir - Remove serial_test dependency for this file test_clone_connection.rs: - Use unique_names() helper for VM/snapshot names - Update name pattern for clarity test_localhost_image.rs: - Use unique_names() for test isolation - Update assertions for new naming test_readme_examples.rs: - Use unique_names() throughout - Fix test_quick_start to use unique names test_signal_cleanup.rs: - Use unique VM names per test run This fixes failures when tests run in parallel by ensuring each test uses unique resource names (VMs, snapshots, temp directories).
Documentation: - CLAUDE.md: Update development patterns and test isolation notes - DESIGN.md: Reflect current architecture changes - README.md: Update usage examples and descriptions Build system: - Makefile: Improve test targets and feature flag handling - .gitignore: Add container marker files Minor code: - args.rs: Add example to --cmd flag documentation - setup/mod.rs: Minor cleanup
When multiple VMs start simultaneously, they all try to create the same fc-agent initrd. The previous code had a TOCTOU race where:
1. Process A checks if initrd exists (no)
2. Process B checks if initrd exists (no)
3. Process A creates temp dir and starts building
4. Process B does remove_dir_all(&temp_dir), deleting A's work
5. Process A fails with "No such file or directory"

Fix:
- Add flock-based exclusive lock around initrd creation
- Double-check pattern: check existence before AND after acquiring lock
- Use PID in temp dir name as extra safety measure
- Release lock on error and success paths
Problem: When running `make test-vm` at full parallelism (64 CPUs), rootless tests failed under the sudo (privileged) phase with "namespace holder died" errors. Multiple holder processes (sleep infinity) died within milliseconds of each other, suggesting a mass kill event.

Root cause: Rootless tests were running TWICE:
1. In the unprivileged phase (no sudo) - PASSED
2. In the privileged phase (with sudo) - FAILED

When running rootless tests under sudo with high parallelism, nextest's process group signal handling causes cross-test interference. The exact mechanism involves holder processes being killed when other tests in the sudo session fail or time out.

Fix:
- Filter rootless tests from privileged run: -E '!test(/rootless/)'
- Rootless tests only need to run once (without sudo)
- Bridged tests only need to run once (with sudo)
- Remove max-threads=16 workaround now that root cause is fixed

Result:
- Unprivileged: 66 tests, 66 passed (208s)
- Privileged: 71 tests, 71 passed, 18 skipped (152s)
- Total: 137 tests at full parallelism, 0 failures
- state/manager.rs: Add cleanup_stale_state() to remove orphaned state files from dead processes, freeing loopback IPs - test_signal_cleanup.rs: Track specific firecracker PIDs instead of global process counts (works with parallel tests) - test_egress_stress.rs: Dynamic port allocation via find_free_port() instead of fixed port 18080 - test_exec.rs: Use public.ecr.aws instead of ifconfig.me for connectivity tests (more reliable) - CLAUDE.md: Add race condition debugging protocol and log preservation best practices
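Dynamic port allocation is commonly done by binding to port 0 and reading back the port the kernel chose. The sketch below shows that approach; whether the real find_free_port() works this way is an assumption.

```rust
use std::io;
use std::net::TcpListener;

/// Asks the kernel for an unused TCP port on the loopback interface.
pub fn find_free_port() -> io::Result<u16> {
    // Port 0 means "pick any free port".
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    let port = listener.local_addr()?.port();
    // The listener is dropped here, which frees the port again. Another
    // process could claim it before the caller binds, so callers should
    // still handle a bind failure.
    Ok(port)
}
```

Compared with a fixed port such as 18080, this lets any number of stress tests run side by side, at the cost of a small window between releasing the port and reusing it.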
- Add busybox-static package to Containerfile - Use find_busybox() to prefer statically-linked binary - Required because initrd runs before root filesystem is mounted
- Kernel now auto-downloaded from Kata Containers 3.24.0 release
- Remove references to custom kernel build (~/linux-firecracker)
- Update data layout: layer2-{sha}.raw, initrd/, rootfs.raw
- Expand NO LEGACY policy to cover Makefile and documentation
- Remove obsolete setup-kernel and rootfs target documentation
- Remove setup-kernel target (kernel auto-downloaded by fcvm) - Remove rootfs target (fc-agent injected via initrd now) - Remove rebuild target (just use make build) - Remove container-test-fcvm legacy alias - Remove unused TEST_VM variable - Remove KERNEL_DIR variable - Add initrd/ to btrfs directory creation
The function was a no-op kept for "API compatibility" - exactly what our NO LEGACY policy prohibits. Rootless tests work fine under sudo. Removed function and all 12 call sites across test files.
- handler.rs: Clarify trait default is for simple test handlers - multiplexer.rs: "legacy behavior" -> "No collector - print directly" - types.rs: "backward compatibility" -> "JSON convention" in test
Include guidance on detailed commit messages with: - What changed, why it changed - How it was tested with "show don't tell" (actual commands) - Example of good vs bad commit messages
Makefile: - Remove sudo from all podman commands - rootless with --privileged grants sufficient capabilities within user namespace - Add --group-add keep-groups for proper group handling - Remove /dev/nbd0 device (no longer needed) - Simplify rootless marker (no export/import needed) Containerfile: - Use Rust 1.92.0 (edition 2024 is stable since 1.85) - Remove libguestfs-tools (not needed for current rootfs build) - Install rust toolchain for testuser to prevent re-downloading when running as --user testuser CI: - Remove nbd module loading step (no longer needed) Documentation: - Remove libguestfs-tools from prerequisites - Remove dynamic NBD device selection docs
Container test fixes: - CONTAINER_RUN_FCVM now uses VOLUME_TARGET_ROOT for proper isolation from rootless podman builds (different UID context) - CTEST_VM_* commands wrapped in sh -c to ensure && runs inside container, not on host shell - Added cargo build --release before nextest to ensure fc-agent is built - Added cpio to Containerfile for initrd creation Initrd creation fixes: - Changed from sh to bash for pipefail support - Added set -o pipefail so cpio errors aren't masked by gzip success - Removed 2>/dev/null to surface actual errors - Improved error messages to include both stdout and stderr Root cause: Pipeline 'find | cpio | gzip' reported success even when cpio was missing because gzip (last command) succeeded. Empty initrd caused silent VM boot failures. Tested: Container VM sanity test recompiles correctly on source changes
fc-agent was writing logs to stderr, but the systemd service file wasn't configured to forward output to the console. This made it impossible to diagnose issues like image pull failures from the host - we only saw systemd's "Started fc-agent.service" messages. Changes: - Add StandardOutput=journal+console to FC_AGENT_SERVICE - Add StandardError=journal+console to FC_AGENT_SERVICE - Same for FC_AGENT_SERVICE_STRACE This also includes previous uncommitted strace debugging support: - Init script checks for fc_agent_strace=1 kernel cmdline parameter - Strace wrapper script tees output to /dev/console for visibility - Initrd cache invalidation now includes all service file content - Debug packages support in rootfs-plan.toml Tested: Host sanity tests pass with fc-agent logs now visible
fc-agent was built with glibc, but when running in a container with a different glibc version (Debian Bookworm 2.36 vs Ubuntu 24.04 2.39), fc-agent would fail to start due to library compatibility. Changes: - Add musl targets to rust-toolchain.toml for static linking - Update Containerfile to install musl-tools and add musl targets - Update Makefile to build fc-agent with --target aarch64-unknown-linux-musl - Makefile now copies the musl binary to target/release/fc-agent This ensures fc-agent is fully statically linked and works across all Linux distributions regardless of glibc version. Tested: Host VM sanity tests pass with statically linked fc-agent
When fc-agent fails to start or crashes, it's hard to diagnose without seeing what system calls it's making. This adds strace support via kernel cmdline parameter.

Changes:
- Add --strace-agent flag to RunArgs (args.rs)
- Pass fc_agent_strace=1 to boot args when flag is set (podman.rs)
- Init script detects flag and uses strace wrapper service
- tests/common: Add maybe_add_strace_flag() for FCVM_STRACE_AGENT env var
- rootfs-plan.toml: Add debug packages section with strace

Usage:
fcvm podman run --strace-agent --name test nginx:alpine
# Or in tests:
make test-vm STRACE=1 FILTER=sanity
Image pull failures were hard to diagnose because: 1. Output wasn't streamed in real-time during pulls 2. Error messages weren't prominent enough 3. Retries weren't logged clearly Changes: - Stream podman pull output line-by-line in real-time - Add prominent banners around image pull attempts - Show attempt number (e.g., "attempt 2/3") - Capture stderr lines for final error message - Log clearly when all retry attempts fail This complements the console output fix in the systemd service to ensure image pull errors are visible in VM serial console.
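The retry logging described above can be sketched as a loop that announces each attempt and keeps every error for the final message. The function name, banner text, and error type are hypothetical; the real code streams `podman pull` output rather than calling a closure.

```rust
/// Runs `pull` up to `max_attempts` times, logging each attempt.
/// On total failure, the error lists what went wrong in every attempt.
pub fn pull_with_retry<F>(image: &str, max_attempts: u32, mut pull: F) -> Result<(), String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut errors = Vec::new();
    for attempt in 1..=max_attempts {
        // Prominent banner so the attempt is easy to spot in the serial console.
        eprintln!("==== pulling {image} (attempt {attempt}/{max_attempts}) ====");
        match pull(image) {
            Ok(()) => return Ok(()),
            Err(err) => {
                eprintln!("==== pull of {image} failed: {err} ====");
                errors.push(format!("attempt {attempt}: {err}"));
            }
        }
    }
    Err(format!(
        "all {max_attempts} attempts to pull {image} failed: {}",
        errors.join("; ")
    ))
}
```

Collecting the per-attempt errors means the final failure message is self-contained, which helps when only the tail of the console log is available.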
Configure nextest to warn after 60 seconds and terminate after 300 seconds. Helps identify tests that are hanging or taking too long.
Add documentation for: - STREAM=1 flag to see test output in real-time - Container build layer caching (no workarounds needed) - Symlinks for sudo access in container
When VMs crash without cleanup, state files persist. If OS reuses that PID for a new VM, queries by PID would find the stale entry. Fix: save_state() now checks for and deletes any other state file claiming the same PID before saving (logs warning when this happens). Also reverts DNS DNAT approach - now mounts /run/systemd/resolve in container.
- Add warn! log when cleanup_stale_state removes dead process state files
- Fix unused import warnings in ls.rs and bridged.rs
The cleanup_stale_state function runs during loopback IP allocation
and removes state files where /proc/{pid} doesn't exist. Adding
logging helps debug parallel test failures.
Tested: make test-vm-privileged (71 passed twice in a row)
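The liveness test behind cleanup_stale_state, checking whether /proc/{pid} exists, can be sketched as below. This is Linux-specific, the function names are hypothetical, and the real function also deletes the matching state files and logs a warning.

```rust
use std::path::Path;

/// True if a process with this PID currently exists (Linux /proc check).
pub fn process_exists(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).exists()
}

/// Returns the PIDs from `recorded` whose processes are gone, i.e. the
/// state entries that a cleanup pass would remove.
pub fn stale_pids(recorded: &[u32]) -> Vec<u32> {
    recorded
        .iter()
        .copied()
        .filter(|pid| !process_exists(*pid))
        .collect()
}
```

This check cannot tell a surviving VM from an unrelated process that reused the PID, which is why save_state() separately removes any other state file claiming the same PID.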
CONTAINER_RUN_FCVM was missing resource limits that CONTAINER_RUN_FUSE had: - --ulimit nofile=65536:65536 - --ulimit nproc=65536:65536 - --pids-limit=-1 Without these, parallel VM tests in the container hit EAGAIN (os error 11) on fork/spawn due to default container process limits. Tested: container-test-vm-privileged Before: 50 passed, 5 failed, 16 timed out After: 69 passed, 2 failed (podman build issues, not resource limits)
The test_localhost_image test uses skopeo to copy images from localhost/
registry to OCI directory format. The container was missing both podman
(for building test images) and skopeo (for copying them).
Tested: make container-test-vm-privileged FILTER=localhost (passed)
make container-test-vm-privileged FILTER=fuse_in_vm (passed)
The container was re-downloading the Rust toolchain on every test run because sudo resets RUSTUP_HOME, causing rustup to look in /root/.rustup (empty).

Fix: Configure nextest target runner in .cargo/config.toml with "sudo -E". This allows cargo/rustup to run as normal user (preserving RUSTUP_HOME) while test binaries run via sudo with environment preserved.

Also simplifies Makefile by merging separate unprivileged/privileged targets:
- test-vm-unprivileged + test-vm-privileged → test-vm
- container-test-vm-unprivileged + container-test-vm-privileged → container-test-vm

Tested:
make test-vm FILTER=sanity # 5.3s second run (no downloads)
make container-test-vm FILTER=sanity # No toolchain re-download
Removes global target runner from .cargo/config.toml and instead sets
CARGO_TARGET_{ARCH}_RUNNER env vars explicitly in Makefile for tests
that need sudo. This is more secure (opt-in to privileges) and avoids
needing workarounds for non-root tests.
Changes:
- .cargo/config.toml: Remove global runner, add explanatory comment
- Makefile: Add explicit CARGO_TARGET_*_RUNNER='sudo -E' to TEST_VM
- Makefile: Add --user root to CONTAINER_RUN_FUSE_ROOT
- namespace.rs: Gate test_exec_in_namespace behind privileged-tests
- veth.rs: Gate entire tests module behind privileged-tests feature,
remove redundant per-test runtime euid checks
Tested:
make container-test-noroot # 92 tests passed
make container-test-vm FILTER=sanity # 1 test passed
Each POSIX category (chmod, chown, link, etc.) now runs as a separate #[test] function, allowing nextest to parallelize them across processes. Changes: - Add pjdfstest_matrix.rs with 17 test functions (one per category) - Add run_single_category() to pjdfstest_common.rs for isolated FUSE mounts - Remove pjdfstest_full.rs, pjdfstest_fast.rs, pjdfstest_stress.rs - Update Makefile to use pjdfstest_matrix - Update fuse-pipe/Cargo.toml test entries - Update scripts/run_fuse_pipe_tests.sh
- Remove references to test-vm-unprivileged/test-vm-privileged (merged) - Remove container-test-vm-privileged (merged into container-test-vm) - Document STREAM=1 for live test output - Update fuse-pipe test file lists for pjdfstest_matrix.rs
Previous CI had 6 jobs with artifact sharing complexity. Now: - container-rootless: lint + unit + FUSE noroot (rootless podman) - container-sudo: FUSE root + pjdfstest (sudo podman) - vm: VM tests on buildjet (KVM required) Each job builds independently - simpler than artifact passing.
ejc3
added a commit
that referenced
this pull request
Mar 2, 2026
Fix parallel test execution with proper root/rootless isolation
Summary
This PR enables parallel test execution by properly isolating resources and separating root/rootless test execution.
Key Changes
- Network mode guards: fcvm binary now enforces proper network mode usage
  - `sudo` or `--network rootless`
- Test isolation: All tests use unique resource names for parallel execution
  - `unique_names()` helper generates timestamp+counter-based names
- Makefile target splitting: Tests now run with appropriate privileges
  - `test-vm-exec-bridged` (sudo) / `test-vm-exec-rootless` (no sudo)
  - `test-vm-egress-bridged` (sudo) / `test-vm-egress-rootless` (no sudo)
- Dynamic NBD device selection: Scans nbd0-nbd15 to find free device
- Rootless networking documentation: Updated DESIGN.md with detailed dual-TAP architecture
Test Plan
All tests verified locally: