Fix blocking I/O in UFFD server accept loop#286

Merged
ejc3 merged 1 commit into main from review-blocking-io
Feb 8, 2026

@ejc3 ejc3 commented Feb 7, 2026

Stacked on: review-networking (PR #283)

Summary

  • UFFD server async handshake (CI infrastructure & Kata kernel #9): Replace blocking recv_with_fd in the tokio select! loop with async receive_uffd_async() using AsyncFd (same pattern as handle_vm_page_faults). Each connection spawns a per-connection task with a 30s timeout, so the accept loop is never blocked by slow/misbehaving clones.
  • State manager comment (Fix parallel test execution with proper root/rootless isolation #8): Document why Flock::lock() is acceptable as blocking I/O (microsecond hold time, near-zero contention).
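The retry-until-deadline shape of the handshake can be illustrated without tokio. The following is a minimal std-only sketch, not the code from this PR: `recv_handshake_with_deadline` is a hypothetical name, a plain `read` stands in for `recv_with_fd`, and a short sleep stands in for the readiness wait that `AsyncFd` provides in the real implementation.

```rust
use std::io::{self, Read};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: read one handshake message from a non-blocking
/// socket, retrying on WouldBlock until `timeout` elapses.
///
/// Assumption: in the PR the waiting is done by tokio's AsyncFd readiness
/// notification; the 1 ms sleep below is only a std-only stand-in for it.
fn recv_handshake_with_deadline(
    stream: &mut UnixStream,
    buf: &mut [u8],
    timeout: Duration,
) -> io::Result<usize> {
    // A non-blocking socket returns WouldBlock instead of stalling the thread.
    stream.set_nonblocking(true)?;
    let deadline = Instant::now() + timeout;
    loop {
        match stream.read(buf) {
            // Ok(0) means the peer closed the connection; the caller decides
            // whether that counts as a failed handshake.
            Ok(n) => return Ok(n),
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                if Instant::now() >= deadline {
                    return Err(io::Error::new(
                        io::ErrorKind::TimedOut,
                        "handshake timed out",
                    ));
                }
                // No data yet: yield briefly, then try again.
                thread::sleep(Duration::from_millis(1));
            }
            // Interrupted by a signal: retry immediately.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Called on one end of a `UnixStream::pair()`, the helper returns a `TimedOut` error while the peer stays silent and returns the byte count once the peer has written. The point of the design is that a silent peer costs only its own deadline and never stalls whoever is accepting new connections.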

Test plan

cargo check --workspace
cargo clippy --workspace
cargo fmt -p fcvm -p fuse-pipe -p fc-agent -- --check
cargo test --lib -p fcvm -- uffd  # 8/8 pass

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c37e1223b9

Comment thread src/uffd/server.rs
Comment on lines +446 to +448
let std_stream = stream.into_std().context("converting to std stream")?;
// Keep non-blocking — AsyncFd handles readiness
let async_stream = AsyncFd::new(std_stream).context("creating AsyncFd for handshake socket")?;

P2: Ensure handshake socket stays nonblocking before AsyncFd

If tokio::net::UnixStream::into_std() returns a blocking socket (Tokio’s documented behavior), AsyncFd::new(std_stream) will wrap a blocking FD and recv_with_fd can block the runtime thread if Firecracker connects but delays sending the UFFD message. In that case the 30s timeout won’t fire because the task is stuck in a blocking syscall, and enough stalled connections can exhaust Tokio worker threads. Consider explicitly setting the std stream to nonblocking (or using from_std on a nonblocking fd) before wrapping in AsyncFd.
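Whether `into_std()` preserves the non-blocking flag is a point on which the reviews in this thread disagree. Setting the flag explicitly makes the code independent of the answer. The following is a minimal std-only sketch of that defensive step; `would_block_when_idle` is a hypothetical diagnostic helper, not something from the PR.

```rust
use std::io::{self, Read};
use std::os::unix::net::UnixStream;

/// Hypothetical diagnostic: force non-blocking mode, then report whether an
/// immediate read returns WouldBlock (no data yet) rather than stalling.
///
/// Note: if data is already queued this consumes one byte, so it is meant
/// for illustration and tests, not for use on a live handshake socket.
fn would_block_when_idle(stream: &mut UnixStream) -> io::Result<bool> {
    // The explicit call is cheap and idempotent, so it is safe to make even
    // when the socket is already non-blocking.
    stream.set_nonblocking(true)?;
    let mut byte = [0u8; 1];
    match stream.read(&mut byte) {
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(true),
        Ok(_) => Ok(false),
        Err(e) => Err(e),
    }
}
```

In the PR's code the equivalent would be a single `set_nonblocking(true)` call on `std_stream` before it is passed to `AsyncFd::new`.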

claude-claude bot commented Feb 7, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. The changes are correct and well-implemented:

  • Async handshake pattern: The receive_uffd_async() function correctly uses the AsyncFd readability-loop pattern (wait for readable → try non-blocking recv_with_fd → clear_ready on EAGAIN) — matching the existing pattern in handle_vm_page_faults. This properly avoids blocking the tokio runtime.
  • Per-connection task spawn: Moving the handshake into a spawned task with a 30s timeout prevents a slow/misbehaving clone from blocking the accept loop. The timeout value is reasonable.
  • Error handling: All three outcomes (success, handshake error, timeout) are logged with appropriate severity and context. The vmm_sys_util::errno::Error type's errno() method correctly returns i32 for comparison against libc::EWOULDBLOCK/EAGAIN.
  • Socket non-blocking mode: After stream.into_std(), the fd retains tokio's non-blocking flag — the comment "Keep non-blocking — AsyncFd handles readiness" is accurate.
  • State manager comment: The added documentation for Flock::lock() clearly explains the trade-off and suggests a concrete alternative if the assumption changes.

Summary

Clean PR that correctly replaces blocking recv_with_fd in the accept loop with an async AsyncFd-based approach. No bugs, race conditions, or correctness issues found.


Review by Claude | Run

claude-claude bot commented Feb 7, 2026

🔍 Claude Review

SEVERITY: none

Findings

No new issues found. Thorough analysis of the implementation confirms correctness:

  • AsyncFd readability loop: The pattern (wait for readable → try non-blocking recv_with_fd → clear_ready on EAGAIN) is correctly implemented, matching the existing handle_vm_page_faults pattern.
  • Error type compatibility: vmm_sys_util::errno::Error implements std::error::Error + Send + Sync + 'static, so the Err(e).context(...) conversion to anyhow::Error on line 460 is sound.
  • recv_with_fd via get_inner(): The ScmSocket::recv_with_fd takes &self, so guard.get_inner() (which returns &T) is correct — no need for get_inner_mut().
  • Socket lifecycle: The AsyncFd takes ownership of the std stream and drops it after the handshake. This is correct since the UFFD fd is a separate descriptor received via SCM_RIGHTS.
  • Spawned task captures: vm_id (String), mmap (Arc<Mmap>), and stream (tokio UnixStream) are all Send + 'static, satisfying JoinSet::spawn bounds.
  • Per-connection timeout: 30s timeout via tokio::time::timeout wrapping the async handshake is a clean approach to prevent slow/misbehaving clones from exhausting resources.
  • State manager comment: The Flock::lock() blocking I/O documentation accurately describes the trade-off and provides a concrete non-blocking alternative.
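The `Send + 'static` requirement on spawned work can be checked at compile time. The sketch below uses `std::thread::spawn`, which places the same bounds on its closure that the review attributes to `JoinSet::spawn`. The helper names are hypothetical, and `Arc<Vec<u8>>` stands in for `Arc<Mmap>`.

```rust
use std::sync::Arc;
use std::thread;

/// Compile-time check: only accepts values whose type may be moved into a
/// spawned thread or task.
fn assert_send_static<T: Send + 'static>(_value: &T) {}

/// Hypothetical stand-in for the per-connection task: takes ownership of a
/// VM id and a shared read-only buffer (playing the role of Arc<Mmap>).
fn spawn_connection_worker(
    vm_id: String,
    memory: Arc<Vec<u8>>,
) -> thread::JoinHandle<(String, usize)> {
    // These calls fail to compile if either capture is not Send + 'static.
    assert_send_static(&vm_id);
    assert_send_static(&memory);
    // `move` transfers ownership of both values into the worker.
    thread::spawn(move || (vm_id, memory.len()))
}
```

Replacing the `Arc` with an `Rc`, or the `String` with a borrowed `&str` tied to a local, turns the mistake into a compile error instead of a runtime hazard.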

As noted in the previous review, no issues were found. This second review confirms the analysis with deeper verification of type compatibility and API contracts.

Summary

Clean, correct PR. The async handshake refactoring properly avoids blocking the tokio runtime and the per-connection task spawn with timeout is a well-designed improvement.


@ejc3 ejc3 force-pushed the review-networking branch from 8125ce6 to 02c6c4b Compare February 7, 2026 23:50
@ejc3 ejc3 force-pushed the review-blocking-io branch 2 times, most recently from 39e0b69 to edeff8a Compare February 7, 2026 23:51

claude-claude bot commented Feb 7, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. Two previous reviews (review 1, review 2) reached the same conclusion. This third independent review confirms correctness after verifying:

  • AsyncFd readability loop: The pattern (wait for readable → try non-blocking recv_with_fd → clear_ready on EAGAIN) correctly matches the existing handle_vm_page_faults pattern. The ScmSocket::recv_with_fd takes &self, so guard.get_inner() (returning &T) is the correct accessor.
  • Error handling: vmm_sys_util::errno::Error::errno() returns i32, matching libc::EWOULDBLOCK/EAGAIN. The Err(e).context(...) conversion to anyhow::Error is sound since the error type implements std::error::Error + Send + Sync + 'static.
  • Per-connection task spawn: All captured values (vm_id: String, mmap: Arc<Mmap>, stream: tokio UnixStream) satisfy Send + 'static bounds for JoinSet::spawn. The 30s timeout prevents resource exhaustion from slow/misbehaving clones.
  • Socket non-blocking mode: After stream.into_std(), the fd retains tokio's non-blocking flag, which is correct for AsyncFd usage. The UFFD fd received via SCM_RIGHTS is a separate descriptor, so dropping the AsyncFd<UnixStream> after handshake is safe.
  • State manager comment: The Flock::lock() documentation accurately describes the trade-off (microsecond hold time, near-zero contention) and provides a concrete non-blocking alternative.

Summary

Clean, correct PR. The async handshake refactoring properly avoids blocking the tokio runtime, and the per-connection task spawn with timeout is a well-designed improvement. No changes needed.


@ejc3 ejc3 changed the base branch from review-networking to main February 8, 2026 01:15

The UFFD server's run() loop called recv_with_fd() inline in the
tokio select! loop after converting the socket to blocking mode.
This serialized all clone connections and would hang the server
forever if a clone misbehaved during handshake.

Replace the blocking handshake with an async receive_uffd_async()
that uses AsyncFd (same pattern already used for handle_vm_page_faults
in the same file). Each connection is now spawned as a per-connection
task with a 30s timeout, so the accept loop returns immediately.

Also document why Flock::lock() in the state manager is acceptable
as blocking I/O in an async context (microsecond hold time, near-zero
contention).

@ejc3 ejc3 force-pushed the review-blocking-io branch from edeff8a to ad99d7c Compare February 8, 2026 01:16

claude-claude bot commented Feb 8, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. Three previous reviews (review 1, review 2, review 3) reached the same conclusion. This fourth independent review confirms correctness.

The async handshake refactoring is well-implemented:

  • The AsyncFd readability loop in receive_uffd_async correctly matches the existing handle_vm_page_faults pattern
  • Per-connection task spawn with 30s timeout properly isolates slow/misbehaving clones from the accept loop
  • All error paths are handled correctly, including EAGAIN/EWOULDBLOCK clearing readiness
  • The Flock::lock() blocking I/O comment is accurate and provides a concrete alternative

Summary

Clean, correct PR. No changes needed.


@ejc3 ejc3 merged commit d617bd8 into main Feb 8, 2026
16 checks passed
@ejc3 ejc3 deleted the review-blocking-io branch February 8, 2026 02:47
ejc3 added a commit that referenced this pull request Mar 2, 2026
Fix blocking I/O in UFFD server accept loop
ejc3 added a commit that referenced this pull request Mar 2, 2026
Fix blocking I/O in UFFD server accept loop