Conversation
- Remove needless Ok()? in find_plan_file - Remove useless format!() in fix_fstab_in_image - Use .map() instead of if let Some for Option::map pattern - Use function reference instead of closure for is_process_alive - Replace deprecated into_path() with keep()
- Add CACHE_REGISTRY variable to Makefile (optional)
- When set, podman build uses --cache-from and --cache-to flags
- CI sets CACHE_REGISTRY=ghcr.io/${{ github.repository }}/cache
- All jobs login to ghcr.io before building
- Removes actions/cache in favor of native podman registry caching
- Renamed job to "VM (bare metal)" for clarity - Install Rust, deps, Firecracker, cargo-nextest directly on host - Run make test-vm instead of make container-test-vm - Removes podman version issues on buildjet runner
- Container (rootless): make ci-container-rootless (fuse-pipe tests) - Container (sudo): make ci-container-sudo (fuse-pipe root tests) - VM (bare metal): make test-vm (Firecracker VM tests) All on buildjet-32vcpu-ubuntu-2204 to avoid GHA ulimit restrictions.
- CI container targets now call make test-noroot/test-root inside container - Remove redundant test enumeration variables - Remove separate pjdfstest target (included in test-root) - test-noroot runs: unit tests + non-root fuse-pipe tests - test-root runs: root fuse-pipe tests (integration_root, permission_edge, pjdfstest)
Major changes: - Replace enumeration-based test filtering with compile-time features - test-rootless: no features (privileged tests not compiled) - test-root: --features privileged-tests (requires sudo + KVM) - Container tests call back to Makefile (single source of truth) - pjdfstest built via Makefile, not Containerfile (setup-pjdfstest target) - All pjdfstest moved to root-only (C suite requires chown/mknod/user-switch) Linting: - Add tests/lint.rs - runs fmt, clippy, audit, deny in parallel via nextest - Add deny.toml for license/security checks - Replace unmaintained atty crate with std::io::IsTerminal Makefile: - Removed ~300 lines of redundant targets - test-rootless, test-root, test (both) - container-test-rootless, container-test-root, container-test (both) - setup-pjdfstest builds pjdfstest on first test run (idempotent) Files: - pjdfstest_matrix.rs -> pjdfstest_matrix_root.rs (all categories need root) - Containerfile: removed pjdfstest build (now via Makefile) - src/main.rs: use std::io::IsTerminal instead of atty - Cargo.toml: removed atty dependency
Previously, fcvm would automatically download the kernel and create the rootfs on first run of `fcvm podman run`. This caused issues in CI where parallel tests would timeout waiting for the 5-minute setup to complete. Changes: - Add `fcvm setup` command that downloads kernel, creates rootfs, and creates fc-agent initrd - Add `--setup` flag to `fcvm podman run` to opt-in to auto-setup - Without --setup, fcvm now fails immediately with a helpful error if any required asset is missing - All ensure_* functions now take `allow_create: bool` parameter: - ensure_kernel(allow_create) - ensure_rootfs(allow_create) - ensure_fc_agent_initrd(allow_create) - Internal rootfs creation still allows kernel download (since creating rootfs requires booting a VM with the kernel) Error messages direct users to run `fcvm setup` or use `--setup` flag. Tested: - Removed rootfs and ran without --setup: fails immediately with "Rootfs not found. Run 'fcvm setup' first, or use --setup flag." - Ran `fcvm setup`: creates kernel, rootfs, and initrd successfully - Ran `fcvm podman run` after setup: works without --setup flag
- New target `setup-fcvm` runs `fcvm setup` to download kernel and create rootfs before tests - Updated dependencies: - test-root: setup-fcvm (instead of just setup-btrfs) - container-test-root: setup-fcvm - bench-exec: setup-fcvm - ci-host: setup-fcvm - ci-root: setup-fcvm - setup-fcvm depends on build and setup-btrfs (correct order) - Added to help text and .PHONY list This ensures rootfs is created once before parallel tests run, avoiding the timeout issues from previous approach where each test would race to create the rootfs.
Documentation updates: - CLAUDE.md: Document fcvm setup command and --setup flag behavior - README.md: Add setup section to Quick Start, add fcvm setup to CLI Reference - DESIGN.md: Add fcvm setup command documentation - CONTRIBUTING.md: Add setup step to development workflow Code changes: - Disallow --setup flag when running as root (must use fcvm setup explicitly) - Add disk space check to Makefile setup-fcvm target with remediation options - Explain btrfs CoW usage in disk space error message The --setup flag is only available for rootless mode. For bridged networking (which requires root), run `fcvm setup` first.
Introduces three test tiers using Cargo feature flags:
- test-unit: No VMs, ~1s (107 tests) - cli parsing, state manager, lint
- test-integration-fast: Quick VMs, ~80s (176 tests) - sanity, exec, port forward
- test-root: All tests, ~6min (196 tests) - includes clone, snapshot, POSIX
Implementation:
- Cargo.toml: Added integration-fast and integration-slow features
with default = ["integration-fast", "integration-slow"]
- Test files: Added #![cfg(feature = "...")] to categorize by speed
- integration-fast: sanity, signal_cleanup, exec, port_forward,
localhost_image, readme_examples, lint
- integration-slow: snapshot_clone, clone_connection, fuse_posix,
fuse_in_vm, egress, egress_stress
- Makefile: New targets test-unit, test-integration-fast
(test-root unchanged, runs all tests)
- Container targets: container-test-unit, container-test-integration-fast
- CI targets: ci-unit, ci-integration-fast
Unit tests always compile (no feature flag) so test-unit uses
--no-default-features to exclude VM tests entirely.
Tested:
make test-unit # 107 tests, 1.2s
make test-integration-fast # 176 tests, 81s
make test-root # 196 tests, 367s (all passed)
Problem: pjdfstest (POSIX compliance tests, ~80s) was running in
test-integration-fast when it should only run in test-root.
Root cause: Cargo feature unification was merging features across
workspace members, causing fuse-pipe to get `integration-slow` even
when `--no-default-features` was passed.
Changes:
1. Workspace (Cargo.toml):
- Add resolver = "2" for proper workspace-wide feature handling
2. fuse-pipe/Cargo.toml:
- Remove fuse-client feature (fuser is now always included)
- Add integration-slow feature to gate slow tests
- default = ["integration-slow"] (all tests run by default)
3. fc-agent/Cargo.toml:
- Use default-features = false to exclude integration-slow
- This prevents feature unification from enabling slow tests
4. fuse-pipe/tests/pjdfstest_matrix_root.rs:
- Add #![cfg(all(feature = "privileged-tests", feature = "integration-slow"))]
5. Makefile:
- Add LIST=1 flag for fast test listing without running
- Remove test pre-compilation from build target
Test counts:
test-unit: 189 tests (~1s)
test-integration-fast: 243 tests (~12s, no pjdfstest)
test-root: 280 tests (~6min, includes pjdfstest)
Tested: make test-unit LIST=1, make test-integration-fast LIST=1,
make test-root LIST=1 (verified pjdfstest excluded from fast tier)
Add Build Performance section to DESIGN.md with benchmarks from c6g.metal (64 ARM cores): - Cold build: 44s (~12 parallel rustc, dependency-graph limited) - Incremental: 13s (only recompiles changed crate) - test-unit LIST: 24s cold, 1.2s warm Tested mold linker and sccache: - mold: ~1s savings, not worth the config complexity - sccache: adds overhead for local dev, may help CI Conclusion: defaults are fine, compilation (not linking) is the bottleneck.
README.md (684 → 445 lines, -35%): - Consolidate detailed sections with links to DESIGN.md - Remove dnsmasq from prerequisites (not needed - VMs use host DNS) - Update Makefile targets (test-vm → test-root/integration-fast/unit) - Add --strace-agent option documentation DESIGN.md: - Fix nftables → iptables (6 locations) to match actual implementation - Remove fake config file section (we use env vars, not config files) - Update state file path and format to match actual code - Add pjdfstest skip breakdown (54/237 files skipped on Linux) - Add fcvm exec command documentation - Fix directory structure (remove memory_server.rs, add exec.rs) src/cli/args.rs: - Fix setup description: "~500MB download" → "kernel ~15MB, rootfs ~10GB" Makefile: - Add LIST=$(LIST) passthrough to container test targets
The problem: all test tiers were running with `sudo -E` as the test binary runner, even tests that don't need root (rootless networking). New test tiers: - test-unit: no features, no sudo, no VMs - test-fast: integration-fast, no sudo, rootless VMs only - test-all: default features (fast+slow), no sudo, rootless VMs only - test-root: + privileged-tests, sudo via runner, bridged + pjdfstest Key changes: - Add TEST_ENV_BASE without sudo for rootless tiers - Only TEST_ROOT uses CARGO_TARGET_*_RUNNER='sudo -E' - Align container targets (container-test-fast, container-test-all) - Align CI targets (ci-fast instead of ci-integration-fast) - Add namespace holder retry logic (2s deadline) for transient failures - Switch egress tests to checkip.amazonaws.com (faster, more reliable) - Reduce curl/wget timeouts from 15s to 5s - Update CLAUDE.md with test tier documentation Tested: make test-unit # 107 passed make test-fast # 115 passed make test-all # 127 passed make test-root # 196 passed, 3 skipped (intentionally ignored)
New test file: tests/test_fuse_in_vm_matrix.rs - 17 tests (one per pjdfstest category) - Each test spins up a VM with pjdfstest container - Runs category inside VM against FUSE-mounted volume - Nextest runs all 17 in parallel This complements the host-side matrix (fuse-pipe/tests/pjdfstest_matrix_root.rs): - Host matrix: tests fuse-pipe FUSE directly (no VM) - VM matrix: tests full stack (host VolumeServer → vsock → guest FUSE) Removed obsolete files: - tests/test_fuse_in_vm.rs (replaced by matrix) - tests/test_fuse_posix.rs (ignored sequential test) Tested: make test-root FILTER=test_pjdfstest_vm_posix_fallocate (passed)
- CLAUDE.md: Fix test tier tables (test-unit/fast/all/root) - CLAUDE.md: Document both pjdfstest matrices with test counts - CLAUDE.md: Update project structure (remove old test files) - Cross-reference between host-side and in-VM matrices
Containerfile (118 → 48 lines): - Consolidate RUN commands to reduce layers - Remove verbose comments - Keep --no-same-owner for tar (fixes rootless builds) - Install cargo-audit and cargo-deny alongside nextest - Add uidmap package inline with other apt deps Makefile (269 → 90 lines): - Remove separate CONTAINER_RUN_ROOTLESS/ROOT configs - Use single CONTAINER_RUN with --userns=keep-id for UID mapping - Use bind mounts (./target, ./cargo-home) instead of named volumes - Remove CI-specific targets (ci-host, ci-unit, ci-fast, ci-root) - Remove container-build-root (single container-build works for all) - Consolidate help text - Remove verbose disk space error messages Tested: make container-test-all (125 tests passed in 211s)
DNAT scoping fix: - Scope iptables DNAT rules to veth IP (`-d 172.30.x.y`) instead of matching any destination. This allows parallel VMs to use the same host port since each rule only matches traffic to its specific veth IP. - Update tests to curl the veth IP instead of localhost. Timeout fixes for pjdfstest: - Increase localhost test timeout from 60s to 120s to handle podman storage lock contention during parallel test runs. - Add 15-minute timeout for pjdfstest_vm tests in nextest.toml. Bottleneck analysis (proved via controlled experiment): - Single VM: 25s total (skopeo export: 3.2s, fuse import: 6.5s) - 2 VMs parallel: 29s for chmod (skopeo export: 6.2s, fuse import: 6.6s) - Bottleneck is skopeo export (podman storage lock), NOT fuse-pipe. - fuse-pipe maintains 66 MB/s read speed regardless of parallelism. Files changed: - src/network/bridged.rs: Scope DNAT to veth IP - src/network/namespace.rs, portmap.rs: Update veth IP accessor - tests/test_*.rs: Use veth IP for curl, increase timeouts - .config/nextest.toml: Add pjdfstest_vm timeout override Tested: Single VM chmod: 25s 2 parallel VMs (chmod + chown): 29s + 96s fuse-pipe import speed: 66 MB/s (consistent)
fc-agent was using only 1 FUSE reader thread, severely limiting parallel I/O throughput over vsock. Benchmarks showed 256 readers gives best performance. Change: - fc-agent/src/fuse/mod.rs: Add NUM_READERS=256 constant - Use mount_vsock_with_readers() instead of mount_vsock() Performance improvement (image import via FUSE over vsock): - Before (1 reader): 6.6s for 430MB pjdfstest image - After (256 readers): 5.6s (15% faster) The import phase now properly utilizes parallel I/O, reducing the per-VM overhead during pjdfstest matrix runs.
Problem: When running 17 pjdfstest VM tests in parallel, all tests
serialize on `skopeo copy containers-storage:localhost/pjdfstest`.
This creates a "thundering herd" where all tests wait ~120s, then
all try to start VMs at once, causing Firecracker crashes.
From logs:
# FAILING (truncate):
05:01:26 Exporting image with skopeo
05:03:34 Image exported (122s later - lock contention!)
05:03:34.835 Firecracker spawned
05:03:34.859 VM setup failed (24ms - crashed immediately)
# PASSING (chmod):
05:01:27 Exporting image with skopeo
05:03:10 Image exported (103s - finished earlier)
05:03:11.258 Firecracker spawned
05:03:11.258 API server received request (success)
Solution: Content-addressable cache at /mnt/fcvm-btrfs/image-cache/{digest}/
- Get image digest with `podman image inspect`
- First test exports to cache (with file lock to prevent races)
- All other tests hit cache instantly
Result: 17 pjdfstest VM tests now complete in 95s with 0 failures
(was 128s with 2 failures from Firecracker crashes)
Also updated CLAUDE.md race condition debugging section with this
real example to emphasize "show, don't tell" - always find the
smoking gun in logs rather than guessing.
Host runner (bare metal with KVM): test-unit → test-fast → test-root Container runner (podman): container-test-unit → container-test-fast → container-test-all Each target runs sequentially within its runner. Both runners execute in parallel.
The test was calling link() with an inode from create() without an intermediate lookup(). In real FUSE, the kernel calls LOOKUP on the source file before LINK to resolve the path to an inode. This lookup refreshes the inode reference in fuse-backend-rs. Without this lookup, the inode may be unreachable after release() because fuse-backend-rs tracks inode references internally and the create() reference may not persist correctly across all environments. The fix: 1. After release(), call lookup() to refresh the inode reference 2. Use the inode from lookup() for the link() call This simulates what the kernel does and makes the test work correctly on all environments (not just by accident on some filesystems). Also: - Reverted fuse-pipe tests to use /tmp (the .local/ workaround was wrong) - Added POSIX compliance testing guidelines to CLAUDE.md Tested: cargo test -p fuse-pipe --lib test_passthrough_hardlink -- passes
The previous fix only updated the hash function, not the actual Command that executes podman. This adds --cgroups=disabled to the real download command at line 1552.
Remove duplicate script definition - now generate_download_script() is used for both hashing AND execution. This prevents the bug where the hash version had --cgroups=disabled but the execution version didn't.
bfdbf6a to
b27bb30
Compare
Add lint-tests feature to gate fmt/clippy/audit/deny tests. These were causing test-fast to fail due to corrupt cargo-audit DB. Now run lint explicitly with: make lint
The Host job was missing cargo-audit and cargo-deny, causing lint tests to fail with 'unsupported CVSS version: 4.0' from the RustSec DB. Added cargo install for both tools alongside cargo-nextest.
Root cause: 15 snapshot tests running in parallel, each creating a 5.6GB snapshot (2GB memory + 3.6GB disk). With 20GB btrfs, only ~3 tests fit. Changes: - Increase btrfs loopback from 20G to 60G - Add snapshot-tests group with max-threads=3 in nextest.toml - Assign snapshot/clone tests to this group This limits concurrent snapshots to ~17GB disk usage, well under the 60GB limit. Belt and suspenders approach ensures CI stability.
Container tests were failing with "userfaultfd access check failed" because the Container job wasn't setting vm.unprivileged_userfaultfd=1. The Host job already had this, but Container was missing it. Containers inherit host sysctl settings, so setting it on the host before running podman allows snapshot cloning to work inside the container.
Snapshot cloning requires /dev/userfaultfd device, not just the sysctl. - Create device with mknod in CI setup - Pass device to container via --device flag
a34b64d to
5df20f4
Compare
Pin specific versions instead of using --force: - cargo-nextest@0.9.115 - cargo-audit@0.22.0 (supports CVSS 4.0) - cargo-deny@0.18.9 This way cargo only reinstalls if cached version doesn't match.
5df20f4 to
e99bb18
Compare
Firecracker uses the userfaultfd() syscall, not the /dev/userfaultfd device. The device check was a pre-flight check that gave misleading error messages. - Remove check_userfaultfd_access() function and call - Remove --device /dev/userfaultfd from Makefile - Simplify CI to just set vm.unprivileged_userfaultfd=1 Tested: make test-root FILTER=test_snapshot_run_exec_rootless passes
All tests now automatically capture debug-level logs to files:
- Log files: /tmp/fcvm-test-logs/{name}-{timestamp}.log
- Console shows INFO/WARN/ERROR (clean)
- Files contain DEBUG/TRACE (full detail for debugging)
- CI uploads logs as artifacts (7 day retention)
Changes:
- tests/common: Add TestLogger struct, spawn_fcvm always logs to file
- src/main.rs: Fix RUST_LOG handling (was ignoring env var)
- tests/common: Add --setup flag to podman/snapshot run commands
- CI: Upload /tmp/fcvm-test-logs/ as artifacts
- Makefile: Mount test log dir into container
- src/uffd/server.rs: Use Debug format for UFFD errors (shows errno)
Tested: make test-fast FILTER=sanity - passes with 528 DEBUG lines in log
- main.rs: Fix formatting (single line for EnvFilter) - CI: Add --force to cargo install to override stale cached versions (cargo-audit was failing with CVSS 4.0 parse error due to old version)
Container tests run as root, which rejects --setup flag. CI already runs setup-fcvm before tests, so --setup is not needed. Removes unconditional --setup addition from maybe_add_test_flags().
Log fault address and offset when UFFD copy fails to help diagnose the "Copy failed" errors seen in CI. The error message now includes: - vm_id - fault_addr (hex) - offset_in_file - full error with errno This will show which errno (ENOENT, EEXIST, EINVAL, etc.) is causing the copy failures during clone VM boot.
Capture kernel logs filtered for UFFD/KVM/memory related messages: - userfault, uffd, kvm, firecracker - oom, killed, segfault, page fault This helps debug UFFD copy failures at the kernel level without exposing sensitive system info.
ac205ed to
4544e28
Compare
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add 3 consolidated CI make targets, one per job:
make ci-container-rootless- lint + rootless FUSE testsmake ci-container-sudo- root FUSE tests + POSIX compliancemake container-test-vm- VM tests (already existed)Each CI job now runs exactly one make command
Fix clippy warnings that broke lint
Test plan
make ci-container-rootlesspasses (58 tests)make ci-container-sudopasses (46 tests)