Fix UFFD EEXIST handling for older kernels #17
Closed
ejc3 wants to merge 12 commits into fix/ci-simplify
On older kernels (e.g., CI's 5.15 vs local 6.14), page fault coalescing is less aggressive, so multiple faults for the same page can be queued. When the second fault tries to copy, it gets EEXIST because the page was already filled.

Our code was treating ALL copy errors as fatal and disconnecting the VM. This is wrong: EEXIST just means "page already valid".

Fix: Check for CopyFailed(EEXIST) and continue instead of returning an error.

The Linux kernel documentation confirms this is expected behavior: "the kernel must cope with it returning -EEXIST from ioctl(UFFDIO_COPY) as expected"
See: https://docs.kernel.org/admin-guide/mm/userfaultfd.html

Verified from CI logs: error=CopyFailed(EEXIST)

Tested: cargo check, cargo clippy, cargo fmt
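The error handling described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `UffdError`, `CopyOutcome`, and `classify_copy` are hypothetical names, and the errno is modeled as a plain `i32` (EEXIST is 17 on Linux) so the snippet has no dependencies.

```rust
/// Linux errno value for EEXIST ("File exists").
const EEXIST: i32 = 17;

/// Hypothetical error type standing in for the fault handler's real one.
#[derive(Debug, PartialEq)]
enum UffdError {
    /// UFFDIO_COPY failed with the given errno.
    CopyFailed(i32),
}

/// What the fault loop should do after a copy attempt.
#[derive(Debug, PartialEq)]
enum CopyOutcome {
    /// This call populated the page.
    Filled,
    /// An earlier fault already populated the page; nothing left to do.
    AlreadyPresent,
}

/// Treats EEXIST as a benign "page already valid" result and passes
/// every other copy error through unchanged.
fn classify_copy(result: Result<(), UffdError>) -> Result<CopyOutcome, UffdError> {
    match result {
        Ok(()) => Ok(CopyOutcome::Filled),
        Err(UffdError::CopyFailed(errno)) if errno == EEXIST => Ok(CopyOutcome::AlreadyPresent),
        Err(other) => Err(other),
    }
}
```

With this shape, the fault loop can keep serving faults on `AlreadyPresent` and only tear down the connection when `classify_copy` returns `Err`.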
187c1a7 to 16fcee6
Lint tests (fmt, clippy, audit, deny) run under test-root with sudo via CARGO_TARGET_*_RUNNER. The sudo secure_path doesn't include ~/.cargo/bin, so cargo commands fail with ENOENT.

Added symlinks for cargo, cargo-fmt, and cargo-clippy in /usr/local/bin alongside the existing cargo-audit and cargo-deny symlinks.

Also fixed retention-days: the failure() function is not valid outside `if` conditions, so the value is now a fixed 14 days.
16fcee6 to 3cad4cb
The test was using `fcvm ls --json` and checking if ANY VM was healthy. When running in parallel with other tests, this caused false positives: the health check would pass immediately if another test's VM was healthy, even though this test's VM hadn't started yet.

Fix: Use the `--pid` flag to query only the specific VM being tested.

Root cause analysis from CI logs:
- test_sigterm_cleanup_rootless took only 0.507s (should take ~27s)
- "VM is healthy after 120.538145ms" - way too fast
- pgrep found 17 firecracker processes from parallel tests
- Our VM's firecracker (PID 441147) hadn't even started yet when the health check passed (pgrep ran at 03:56:56.028, firecracker started at 03:56:56.684)
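The difference between the old and new checks can be illustrated with a small sketch. `VmEntry` is a hypothetical, simplified stand-in for one entry of the `fcvm ls --json` output (the real schema is not shown in this PR), and the actual fix filters on the CLI side via `--pid` rather than in the test process.

```rust
/// Hypothetical, simplified view of one VM entry; the real JSON schema may differ.
#[derive(Debug, Clone)]
struct VmEntry {
    pid: u32,
    healthy: bool,
}

/// The buggy check: passes as soon as any VM on the host is healthy,
/// including VMs started by other tests running in parallel.
fn any_vm_healthy(vms: &[VmEntry]) -> bool {
    vms.iter().any(|vm| vm.healthy)
}

/// The fixed check: only the VM owned by this test counts.
fn vm_healthy_by_pid(vms: &[VmEntry], pid: u32) -> bool {
    vms.iter().any(|vm| vm.pid == pid && vm.healthy)
}
```

If another test's VM is already healthy while this test's VM is still booting, `any_vm_healthy` returns true but `vm_healthy_by_pid` does not, which matches the false positive seen in the CI logs.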
f592fd7 to 21f2b15
Root cause analysis from CI run 20516048896:
- Early tests: boot in 19s, image pull in 28s = 47s total (success)
- Late tests: boot in 32s, image pull ongoing = >60s (timeout)

Resource contention from parallel VMs causes variable boot times. Combined with 27-33s image pulls, late-starting tests exceed 60s.

Changes:
- poll_health_by_pid timeout: 60 → 120 seconds
- tokio::time::timeout for clone health: 60 → 120 seconds
- Health polling loops in signal tests: 60 → 120 seconds

Files updated:
- test_clone_connection.rs
- test_egress.rs
- test_egress_stress.rs
- test_port_forward.rs
- test_signal_cleanup.rs
- test_snapshot_clone.rs
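For readers unfamiliar with the pattern, a health polling loop with a deadline looks roughly like the sketch below. It is a synchronous, std-only illustration; the tests in this PR use tokio, and `poll_until` is a hypothetical helper, not `poll_health_by_pid` itself.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `check` every `interval` until it returns true or `timeout` elapses.
/// Returns the elapsed time on success, or None if the deadline passed first.
fn poll_until<F: FnMut() -> bool>(
    mut check: F,
    timeout: Duration,
    interval: Duration,
) -> Option<Duration> {
    let start = Instant::now();
    loop {
        if check() {
            return Some(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return None;
        }
        sleep(interval);
    }
}
```

Raising the `timeout` argument from 60 to 120 seconds, as this commit does for the real helpers, costs nothing when the VM becomes healthy early, because the loop returns on the first successful check.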
When 17 pjdfstest_vm tests run in parallel via nextest, they all check if the localhost/pjdfstest image exists and try to build it simultaneously. This causes podman overlay storage race conditions:

  error extracting layer: lgetxattr .../America/Winnipeg: no such file

Fix: Use an fs2 file lock around the check+build section so only one test builds the container while others wait.
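The check+build serialization can be sketched as below. Note that this sketch deliberately swaps the technique: instead of an fs2 advisory lock (which the commit uses), it uses a std-only lock file created with `create_new`, so the example has no external dependencies. Unlike an advisory lock, a lock file is not released automatically if the process crashes. `acquire_lock`, `LockGuard`, and `ensure_built` are hypothetical names.

```rust
use std::fs::{self, OpenOptions};
use std::io::ErrorKind;
use std::path::{Path, PathBuf};
use std::thread::sleep;
use std::time::Duration;

/// Removes the lock file when dropped, which releases the lock.
struct LockGuard {
    path: PathBuf,
}

impl Drop for LockGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}

/// Retries until this caller manages to create `path` exclusively.
fn acquire_lock(path: &Path) -> std::io::Result<LockGuard> {
    loop {
        match OpenOptions::new().write(true).create_new(true).open(path) {
            Ok(_) => return Ok(LockGuard { path: path.to_path_buf() }),
            // Someone else holds the lock: wait briefly and try again.
            Err(e) if e.kind() == ErrorKind::AlreadyExists => sleep(Duration::from_millis(10)),
            Err(e) => return Err(e),
        }
    }
}

/// Runs `build` only if `exists` reports the artifact missing. Both the
/// check and the build happen while the lock is held, so at most one
/// caller builds. Returns true if this caller did the build.
fn ensure_built(
    lock_path: &Path,
    exists: impl Fn() -> bool,
    build: impl FnOnce(),
) -> std::io::Result<bool> {
    let _guard = acquire_lock(lock_path)?;
    if exists() {
        return Ok(false);
    }
    build();
    Ok(true)
}
```

The important design point, shared with the fs2 approach, is that the existence check sits inside the locked region. Locking only the build would still let several tests decide to build before the first one finishes.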
ba1e9f5 to 983a36a
Summary
On older kernels (e.g., CI's 5.15 vs local 6.14), page fault coalescing is less aggressive. Multiple faults for the same page get queued, and when the second fault tries to copy, the kernel returns EEXIST because the page was already filled.
Our code was treating ALL copy errors as fatal, disconnecting the VM. This caused the egress stress tests to fail on CI but pass locally.
Changes:
- Check for CopyFailed(EEXIST) and continue instead of returning an error
Evidence
CI logs confirmed the hypothesis: error=CopyFailed(EEXIST)
References
Linux kernel documentation confirms this is expected: "the kernel must cope with it returning -EEXIST from ioctl(UFFDIO_COPY) as expected"
Source: https://docs.kernel.org/admin-guide/mm/userfaultfd.html
Test plan
Dependencies
Based on #13 (fix/ci-simplify)