UFFD EEXIST fix and test stability (4/4) by ejc3 · Pull Request #25 · ejc3/fcvm

ejc3 · 2025-12-26T10:52:03Z

Summary

Final PR in the series. Builds on #24.

UFFD EEXIST Fix (Core Issue):

Handle EEXIST error for older kernels that don't support UFFDIO_COPY on already-copied pages
Add detailed UFFD copy error logging with Debug format
Remove unnecessary userfaultfd device check (sysctl is sufficient)

Debug Logging Infrastructure:

Automatic debug logging to test files
CI artifact upload for test logs (30 days for failures, 7 for passes)
Print debug log path at end of each fcvm process
Add filtered dmesg to CI artifacts

Test Stability:

Double test timeouts from 60s to 120s
Add file lock to prevent parallel pjdfstest container builds
Fix signal cleanup test: poll health for specific VM only

CI Improvements:

Add workflow_dispatch for manual triggers with job selector
Add cargo symlinks for lint tests under sudo
Pin cargo tool versions for CVSS 4.0 support

Test plan

CI passes
Merge after CI: Debug logging and caching (3/4) #24
Close Fix UFFD EEXIST handling for older kernels #17 after this merges

Pin specific versions instead of using --force: - cargo-nextest@0.9.115 - cargo-audit@0.22.0 (supports CVSS 4.0) - cargo-deny@0.18.9 This way cargo only reinstalls if cached version doesn't match.

Firecracker uses the userfaultfd() syscall, not the /dev/userfaultfd device. The device check was a pre-flight check that gave misleading error messages. - Remove check_userfaultfd_access() function and call - Remove --device /dev/userfaultfd from Makefile - Simplify CI to just set vm.unprivileged_userfaultfd=1 Tested: make test-root FILTER=test_snapshot_run_exec_rootless passes

All tests now automatically capture debug-level logs to files: - Log files: /tmp/fcvm-test-logs/{name}-{timestamp}.log - Console shows INFO/WARN/ERROR (clean) - Files contain DEBUG/TRACE (full detail for debugging) - CI uploads logs as artifacts (7 day retention) Changes: - tests/common: Add TestLogger struct, spawn_fcvm always logs to file - src/main.rs: Fix RUST_LOG handling (was ignoring env var) - tests/common: Add --setup flag to podman/snapshot run commands - CI: Upload /tmp/fcvm-test-logs/ as artifacts - Makefile: Mount test log dir into container - src/uffd/server.rs: Use Debug format for UFFD errors (shows errno) Tested: make test-fast FILTER=sanity - passes with 528 DEBUG lines in log

- main.rs: Fix formatting (single line for EnvFilter) - CI: Add --force to cargo install to override stale cached versions (cargo-audit was failing with CVSS 4.0 parse error due to old version)

Container tests run as root, which rejects --setup flag. CI already runs setup-fcvm before tests, so --setup is not needed. Removes unconditional --setup addition from maybe_add_test_flags().

Log fault address and offset when UFFD copy fails to help diagnose the "Copy failed" errors seen in CI. The error message now includes: - vm_id - fault_addr (hex) - offset_in_file - full error with errno This will show which errno (ENOENT, EEXIST, EINVAL, etc.) is causing the copy failures during clone VM boot.

Capture kernel logs filtered for UFFD/KVM/memory related messages: - userfault, uffd, kvm, firecracker - oom, killed, segfault, page fault This helps debug UFFD copy failures at the kernel level without exposing sensitive system info.

On older kernels (e.g., CI's 5.15 vs local 6.14), page fault coalescing is less aggressive, leading to multiple faults for the same page being queued. When the second fault tries to copy, it gets EEXIST because the page was already filled. Our code was treating ALL copy errors as fatal, disconnecting the VM. This is wrong - EEXIST just means "page already valid". Fix: Check for CopyFailed(EEXIST) and continue instead of returning an error. The Linux kernel documentation confirms this is expected behavior: "the kernel must cope with it returning -EEXIST from ioctl(UFFDIO_COPY) as expected" See: https://docs.kernel.org/admin-guide/mm/userfaultfd.html Verified from CI logs: error=CopyFailed(EEXIST) Tested: cargo check, cargo clippy, cargo fmt

Lint tests (fmt, clippy, audit, deny) run under test-root with sudo via CARGO_TARGET_*_RUNNER. The sudo secure_path doesn't include ~/.cargo/bin, so cargo commands fail with ENOENT. Added symlinks for cargo, cargo-fmt, and cargo-clippy in /usr/local/bin alongside the existing cargo-audit and cargo-deny symlinks. Also fixed retention-days: failure() function not valid outside if conditions, changed to fixed 14 days.

The test was using `fcvm ls --json` and checking if ANY VM was healthy. When running in parallel with other tests, this caused false positives - the health check would pass immediately if another test's VM was healthy, even though this test's VM hadn't started yet. Fix: Use `--pid` flag to query only the specific VM being tested. Root cause analysis from CI logs: - test_sigterm_cleanup_rootless took only 0.507s (should take ~27s) - "VM is healthy after 120.538145ms" - way too fast - pgrep found 17 firecracker processes from parallel tests - Our VM's firecracker (PID 441147) hadn't even started yet when the health check passed (pgrep ran at 03:56:56.028, firecracker started at 03:56:56.684)

Root cause analysis from CI run 20516048896: - Early tests: boot in 19s, image pull in 28s = 47s total (success) - Late tests: boot in 32s, image pull ongoing = >60s (timeout) Resource contention from parallel VMs causes variable boot times. Combined with 27-33s image pulls, late-starting tests exceed 60s. Changes: - poll_health_by_pid timeout: 60 → 120 seconds - tokio::time::timeout for clone health: 60 → 120 seconds - Health polling loops in signal tests: 60 → 120 seconds Files updated: - test_clone_connection.rs - test_egress.rs - test_egress_stress.rs - test_port_forward.rs - test_signal_cleanup.rs - test_snapshot_clone.rs

When 17 pjdfstest_vm tests run in parallel via nextest, they all check if localhost/pjdfstest image exists and try to build it simultaneously. This causes podman overlay storage race conditions: error extracting layer: lgetxattr .../America/Winnipeg: no such file Fix: Use fs2 file lock around the check+build section so only one test builds the container while others wait.

ejc3 added 23 commits December 25, 2025 22:13

CI: Pin cargo tool versions for CVSS 4.0 support

e99bb18

Pin specific versions instead of using --force: - cargo-nextest@0.9.115 - cargo-audit@0.22.0 (supports CVSS 4.0) - cargo-deny@0.18.9 This way cargo only reinstalls if cached version doesn't match.

Fix formatting

62ec598

Print debug log path at end of each fcvm process

04769e0

Update CLAUDE.md with debug logging design

903474a

Fix lint failures: fmt and cargo-audit cache

5929893

- main.rs: Fix formatting (single line for EnvFilter) - CI: Add --force to cargo install to override stale cached versions (cargo-audit was failing with CVSS 4.0 parse error due to old version)

Remove auto-add of --setup flag from tests

efebcee

Container tests run as root, which rejects --setup flag. CI already runs setup-fcvm before tests, so --setup is not needed. Removes unconditional --setup addition from maybe_add_test_flags().

Add filtered dmesg to CI artifacts

bc9c760

Capture kernel logs filtered for UFFD/KVM/memory related messages: - userfault, uffd, kvm, firecracker - oom, killed, segfault, page fault This helps debug UFFD copy failures at the kernel level without exposing sensitive system info.

Fix UFFD error logging: use Debug format to show errno

4544e28

CI: Add workflow_dispatch for manual triggers

29ed0e2

docs: Add manual CI trigger to CLAUDE.md

afedd80

CI: Add job selector for workflow_dispatch

329dfc2

CI: Add symlinks for cargo-audit/deny so sudo can find them

719f7aa

CI: Keep failing test logs for 30 days, passing for 7

2ea6e07

Add debug output to signal cleanup test to diagnose process tree

5a4a2c0

Fix formatting in signal cleanup debug output

6645aa5

ejc3 mentioned this pull request Dec 26, 2025

CI: Debug logging and caching (3/4) #24

Closed

2 tasks

ejc3 deleted the branch pr/ci-debug December 26, 2025 11:05

ejc3 closed this Dec 26, 2025

ejc3 deleted the pr/uffd-stability branch December 26, 2025 11:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UFFD EEXIST fix and test stability (4/4)#25

UFFD EEXIST fix and test stability (4/4)#25
ejc3 wants to merge 23 commits intopr/ci-debugfrom
pr/uffd-stability

ejc3 commented Dec 26, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ejc3 commented Dec 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ejc3 commented Dec 26, 2025 •

edited

Loading