Skip to content

fix: output pipeline drops data after snapshot restore#412

Merged
ejc3 merged 8 commits intosnapshot-disk-consistencyfrom
fix-output-pipeline-restore
Feb 19, 2026
Merged

fix: output pipeline drops data after snapshot restore#412
ejc3 merged 8 commits intosnapshot-disk-consistencyfrom
fix-output-pipeline-restore

Conversation

@ejc3
Copy link
Copy Markdown
Owner

@ejc3 ejc3 commented Feb 19, 2026

Summary

Fix output pipeline data loss and exec server failures after snapshot restore, plus several test reliability improvements.

Output Pipeline Fixes (commits 854ff7a, 195e6ae, 042b363, 6026ba1)

After snapshot restore (both pre-start and clone), the vsock transport is reset (VIRTIO_VSOCK_EVENT_TRANSPORT_RESET), breaking all connections. The output pipeline had three bugs causing silent data loss:

  1. Stale connection race (854ff7a): The output listener read from connection Replace custom PassthroughFs with fuse-backend-rs #1 while fc-agent wrote to connection #N. Fixed with a select! loop that races read_line against listener.accept(), always switching to the newest connection.

  2. Double reconnect (195e6ae): Both handle_clone_restore() and notify_cache_ready_and_wait() triggered output.reconnect() for the same restore event. The second reconnect killed the connection the first just established. Fixed by removing the duplicate — only the restore-epoch watcher reconnects output.

  3. Notify stored-permit cascade (042b363): tokio::sync::Notify::notify_one() stores a permit if no one is waiting. After restore, multiple permits accumulated, causing the output writer to cycle through ghost connections in rapid succession, each consuming a stored permit. Fixed by draining stored permits.

  4. Dead vsock detection via write probe (6026ba1): notify_cache_ready_and_wait() polled for POLLHUP to detect snapshot restore, but POLLHUP wasn't reliably delivered in rootless mode. The agent waited the full 30s timeout before continuing. Fixed by actively writing to the connection — dead sockets fail instantly with EPIPE/ECONNRESET.

Exec Server Fix (commits 464c471, 6cb312e)

After vsock transport reset, the exec server's VsockListener wrapped in AsyncFd has stale epoll registration. accept() hangs forever because tokio never delivers readability events.

  • Initial fix (464c471): Drop old listener and rebind. This failed when active connections kept the port bound (EADDRINUSE infinite retry loop).
  • Correct fix (6cb312e): Use VsockListener::re_register() to extract the socket fd from AsyncFd (deregistering from epoll) and re-wrap it (re-registering), without closing or rebinding the socket. Falls back to full rebind with 50-retry limit.

Test Improvements

  • Output-based detection (19ea3b5): Snapshot restore test uses container output polling instead of exec-based health monitoring, which is unreliable when exec server is recovering.
  • Interactive TTY test (03b6e13, 4932652, 62c2329): Use heartbeat stdin, poll()-based approach, and SIGKILL for clean shutdown.
  • CI timeouts (8918b57): Increase SnapshotEnabled x64 timeouts.

Other Fixes

  • Snapshot disk/memory consistency (ac701b3): Fix disk copy ordering and fc-agent output deadlock during snapshot creation.
  • Filesystem detection (5186304): Use findmnt -T to find filesystem type containing a path.
  • DESIGN.md (c890b99): Update snapshot flow documentation.

Test Plan

  • test_heavy_output_after_snapshot_restore passes locally (2/2 runs)
  • All CI jobs green (11/11): Host, Container, SnapshotEnabled, SnapshotDisabled on both arm64 and x64

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: low

Findings

[LOW] Removed KVM trace documentation comments (common.rs): Two descriptive comment lines about KVM trace behavior were removed ("Captures KVM exit reasons (NPF, shutdown, etc.) to /tmp/fcvm-kvm-trace-{vm_id}.log" and "Requires: sudo access (ftrace needs debugfs). Safe to set without sudo — just skips."). These provided useful context for the FCVM_KVM_TRACE env var. Minor, since the feature still works and the remaining comment is adequate.

[LOW] No timeout on listener.accept().await in EOF and reconnect_notify branches (listeners.rs:222, 285): When hitting EOF (line 222) or receiving a reconnect signal (line 285), the code calls listener.accept().await without a timeout. If no new connection ever arrives (e.g., fc-agent crashed), this blocks forever. In practice this is mitigated because VM exit cleans up the socket, and the prior code had the same no-timeout pattern in the outer accept loop. However, a timeout (e.g., 5 minutes matching status listener's 1 hour) would make the listener more robust to edge cases.

Summary

Clean, well-reasoned fix for three real bugs in the output pipeline after snapshot restore. The put_mmds reordering, conditional notify_one, and connection-racing select\! loop are all correct. The test additions properly verify that container output reaches the host. No critical or medium issues found.


Review by Claude | Run

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1332028db6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +261 to +263
Err(e) => {
warn!(vm_id = %vm_id, error = %e, "Read error on output connection");
break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep listening after transient output read failures

This read_line error path now exits run_output_listener entirely, so a single transient socket read failure (for example ECONNRESET during snapshot/vsock churn) permanently stops host-side output collection for the VM. Before this change, a read error only dropped the current connection and the listener went back to accept(); the new behavior turns recoverable reconnect events into permanent log loss.

Useful? React with 👍 / 👎.

Comment on lines +273 to +275
info!(vm_id = %vm_id, connection_count, lines_read, "Switching to newer output connection");
reader = BufReader::new(new_stream);
lines_read = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drain old output socket before switching connections

Replacing reader immediately on any newly accepted connection drops the previous stream without draining already-buffered lines, so output that was successfully written on the old socket just before reconnect can be discarded. Under heavy output during snapshot reconnects, accept and read_line can both be ready, and taking accept first here reintroduces line loss instead of preserving in-flight data.

Useful? React with 👍 / 👎.

@ejc3 ejc3 force-pushed the snapshot-disk-consistency branch from fbe6638 to 03b6e13 Compare February 19, 2026 11:35
@ejc3 ejc3 force-pushed the fix-output-pipeline-restore branch from 1332028 to 754c23c Compare February 19, 2026 11:43
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: none

Findings

No new issues found. The previous review raised only [LOW] items which remain minor and are not repeated here.

Code quality checks:

  • put_mmds reordering (common.rs): Correctly moved after patch_vm_state("Resumed"). The comment at line 850-852 accurately documents the invariant that MMDS data isn't guest-visible until after resume.

  • Conditional notify_one (snapshot.rs): args.startup_snapshot_base_key.is_some() correctly distinguishes startup vs pre-start snapshots. The comment explains the permit-poisoning issue clearly.

  • Output listener rewrite (listeners.rs): The select\! loop racing read_line against listener.accept() correctly ensures the listener always switches to the newest connection. This eliminates the stale-connection bug where the listener read from connection Replace custom PassthroughFs with fuse-backend-rs #1 while fc-agent wrote to connection #N.

  • Test additions (test_output_after_restore.rs): The new output verification (lines 189-228) directly tests the fix by asserting container output lines appear in warm start logs. The --user / --privileged changes look correct for running under sudo.

  • PR description vs commits: Single commit 754c23c covers all three fixes described in the PR description. Consistent.

  • Documentation: DESIGN.md's snapshot restore flow (line 389: "Update MMDS with new config") doesn't specify ordering relative to resume, so the code change introduces no doc inconsistency.

Summary

Well-structured fix for three real bugs causing silent output loss after snapshot restore. The put_mmds reordering, conditional notify, and connection-racing select loop are all correct. Tests properly verify the fix.


Review by Claude | Run

@ejc3 ejc3 force-pushed the fix-output-pipeline-restore branch from 754c23c to b22e086 Compare February 19, 2026 12:30
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] DESIGN.md snapshot restore ordering inconsistent with code: DESIGN.md (lines 378-389) lists "Clone Process" with step 3 as "Patch identity" and step 5 as "Resume VM". The "Identity Patching" section includes "Update MMDS with new config" as a bullet, implying MMDS update happens as part of step 3 (before resume). However, this PR correctly moves put_mmds to after VM resume with the comment "MUST be after VM resume — Firecracker accepts PUT /mmds while paused but the guest-visible MMDS data isn't updated until after resume." The DESIGN.md should be updated to reflect this ordering constraint — either move the MMDS bullet out of Identity Patching into a new step 6, or add a note that MMDS update happens post-resume.

No [LOW] issues previously reported are repeated here.

Summary

Well-structured fix for three real bugs in the output pipeline after snapshot restore. All code changes are correct. The only issue is a DESIGN.md inconsistency: the snapshot restore flow description implies MMDS update happens before VM resume (as part of identity patching), but the code now correctly does it after resume.


Review by Claude | Run

@ejc3 ejc3 force-pushed the snapshot-disk-consistency branch from 0dad146 to 4932652 Compare February 19, 2026 12:33
@ejc3 ejc3 force-pushed the fix-output-pipeline-restore branch 2 times, most recently from a05ba5c to 2b6aa60 Compare February 19, 2026 12:37
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, the MMDS update ordering issue remains unfixed. DESIGN.md lines 385-389 list "Update MMDS with new config" under Identity Patching (step 3, before resume), but the code now correctly does it after VM resume (step 5). Will fix in a stacked PR.

No [LOW] issues from previous reviews are repeated. No new issues found in the code changes — the put_mmds reordering, conditional notify_one, and connection-racing select loop are all correct.

Summary

Well-structured fix for three real output pipeline bugs after snapshot restore. The only remaining issue is the DESIGN.md doc inconsistency flagged in a prior review. Creating a fix PR for this.


Review by Claude | Run

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔧 Auto-Fix Created

I found issues and created a fix PR: #413

Please review and merge the fix PR first, then this PR.

View Claude Run

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

CI Failure Analysis

I analyzed CI #22182121076 but couldn't create an automatic fix.

Summary

412 tests ran: 410 passed (24 flaky), 2 failed, 14 skipped. Only the Host-Root-x64-SnapshotEnabled job failed — all other jobs (including Host-Root-arm64-SnapshotEnabled, Host-Root-x64-SnapshotDisabled) passed.

Final Failures (after 3 retries each)

1. test_podman_run_no_tty — Output pipeline data loss

  • Tries 1 & 2: timed out at 120s (VM never produced output)
  • Try 3: failed in 0.631s — logs show Switching to newer output connection ... lines_read=0Output connection EOF ... lines_read=0. The container's tty command ran but its output ("not a tty") was never delivered to the host.
  • This is the exact bug this PR fixes — the output pipeline drops data after snapshot restore.

2. test_sigterm_kills_firecracker_bridged — Process cleanup timing

  • Failed after 70s — pgrep showed 59 firecracker processes running concurrently from other tests.
  • Pre-existing flaky test exacerbated by resource pressure from snapshot operations.

Why Not Fixable in a Separate PR

These failures are caused by the bugs this PR is fixing. The 24 flaky tests (that passed on retry) and the 2 final failures all stem from the output pipeline instability in snapshot-enabled mode on x86_64. The fix is already in this PR — a re-run should have a high chance of passing, and once this PR lands the snapshot-enabled test suite should be significantly more stable.

Recommendation

Re-run the failed job (Host-Root-x64-SnapshotEnabled). The failures are timing-dependent and 24/26 initially-failing tests already passed on retry.

View Claude Run

Three bugs caused container stdout to not reach the host after snapshot restore:

1. put_mmds before VM resume: Firecracker accepts PUT /mmds while the VM is
   paused but the guest-visible data isn't updated. fc-agent never sees the
   new restore-epoch so it never reconnects. Fix: move put_mmds after resume.

2. Unconditional notify_one: cmd_snapshot_run fired output_reconnect.notify_one()
   for all snapshot types. For pre-start snapshots (container not yet running),
   the listener has no dead connection to drop. The stored Notify permit poisons
   the first real connection by immediately triggering the reconnect branch in
   select!, dropping a valid connection. Fix: only notify for startup snapshots.

3. Listener stuck on stale connection: fc-agent reconnects multiple times during
   snapshot create/restore cycles. Each reconnect creates a new vsock connection
   queued in the listener's accept backlog. The listener accepted the FIRST
   connection and blocked on read_line, while fc-agent wrote to the LATEST
   connection. Fix: race read_line against listener.accept() in tokio::select!,
   always switching to the newest connection.

Also adds output verification to test_heavy_output_after_snapshot_restore to
catch this class of bug — the test previously only checked health and exec,
not that container output actually reached the host.
@ejc3 ejc3 force-pushed the fix-output-pipeline-restore branch from 2b6aa60 to 854ff7a Compare February 19, 2026 15:08
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, DESIGN.md lines 385-389 list "Update MMDS with new config" under Identity Patching (step 3, before resume), but the code now correctly does it after VM resume (step 5). Fix PR #413 is open but not yet merged.

No new issues found. All [LOW] items from previous reviews remain minor and are not repeated.

Summary

Well-structured fix for three real bugs causing silent output loss after snapshot restore. The put_mmds reordering, conditional notify_one, and connection-racing select loop are all correct. The only outstanding issue is the DESIGN.md doc inconsistency already tracked in PR #413.


Review by Claude | Run

After pre-start snapshot restore, output.reconnect() was called twice:
1. By handle_clone_restore() via the restore-epoch watcher
2. By the explicit call at agent.rs:230 after notify_cache_ready_and_wait

The double reconnect combined with tokio::sync::Notify's stored-permit
behavior caused the output writer to cycle through ghost connections.
Each reconnect created a new vsock connection that was immediately
abandoned when the next reconnect fired, before any data was written.

The host listener saw empty connections (0 lines, immediate EOF),
causing container output like "not a tty" to be dropped entirely.

Fix: remove the explicit reconnect at agent.rs:230. The epoch watcher
already handles output reconnection via handle_clone_restore().
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] PR description incomplete — missing 4th bug fix: The PR now has 2 commits but the description only covers the first commit's 3 fixes. Commit 195e6ae ("fix: remove double output reconnect that drops container output") fixes a 4th bug: the explicit output.reconnect() call in fc-agent/src/agent.rs after the cache-ready handshake races with handle_clone_restore()'s reconnect (triggered by the restore-epoch watcher), causing a Notify stored-permit cascade that cycles through ghost connections and drops data. The description's "Summary" (3 bugs) and "Fixes" (3 bullets) sections should be updated to include this.

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, DESIGN.md line 389 lists "Update MMDS with new config" under Identity Patching (before resume), but the code now correctly does it after VM resume. Fix PR #413 is open but not yet merged.

No new code correctness issues. The new commit (195e6ae) correctly removes the double reconnect — handle_clone_restore() via the restore-epoch watcher is the single source of truth for output vsock reconnection. All [LOW] items from previous reviews remain minor and are not repeated.

Summary

The 4th bug fix (removing double output.reconnect()) is correct and well-commented, but the PR description needs updating to reflect all 4 fixes across both commits. DESIGN.md inconsistency tracked in PR #413.


Review by Claude | Run

The output writer used tokio::sync::Notify to interrupt blocking waits
during reconnection. When the select! handler consumed a notification,
it re-stored the permit via notify_one() (intended for the disconnected
branch). This created a self-poisoning cycle: the next select! would
fire immediately on the ghost permit, dropping the freshly-reconnected
stream before any data was written.

Fix:
- Remove Notify re-store in select! handlers (the flag drives reconnect)
- Extract reconnect logic into try_reconnect() helper
- Use the AtomicBool flag as the single source of truth
- Disconnected branch uses timeout fallback (200ms) for robustness

Add 3 tests:
- test_double_reconnect_does_not_drop_messages: verifies flag+notify
  collapse multiple reconnect() calls into one cycle
- test_no_notify_permit_cascade: verifies no ghost permits after
  consuming a notification (the old code re-stored them)
- test_messages_survive_reconnect_window: verifies messages queued
  during reconnection are preserved in the channel
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] PR description now missing 2 bug fixes (was 1): As noted in previous review, the description covers only 3 bugs from commit 854ff7a. Since then, a new commit 042b363 ("guard output writer against Notify stored-permit cascade") was added — a 5th fix that removes the notify_one() re-store inside select\! handlers and adds a 200ms timeout fallback in the disconnected branch. The description's Summary/Fixes sections should cover all 5 fixes across all 3 commits.

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, fix PR #413 is open but not yet merged.

New commit 042b363 code review — no issues found:

  • try_reconnect() extraction is clean, identical retry logic to the previous inline code
  • Removing notify_one() re-store in all three select\! reconnect branches correctly eliminates the stored-permit cascade
  • Disconnected branch's select\! with 200ms timeout is a sound fallback for edge cases where the Notify permit was consumed during the connected→disconnected transition
  • Explicit drop(stream.take()) before reconnecting ensures the old fd is closed
  • All 3 new tests correctly verify the flag/notify interaction, ghost permit absence, and channel backpressure

No [LOW] issues from previous reviews are repeated.

Summary

New commit 042b363 is a correct and well-tested hardening of the output writer's reconnect logic. The only issues are doc-related: the PR description needs updating for all 5 fixes, and the DESIGN.md inconsistency is tracked in PR #413.


Review by Claude | Run

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔧 CI Auto-Fix

Created fix PR: #414

Problem: cargo fmt --check failed — fc-agent/src/output.rs had 4 formatting issues (long lines, macro formatting).

Fix: Applied rustfmt-compliant formatting. No logic changes.

View Claude Run

After snapshot restore, the vsock transport is reset but POLLHUP is not
reliably delivered on the restored fd (observed in rootless mode). This
causes notify_cache_ready_and_wait() to fall through to its 30s timeout,
delaying container startup by ~45s.

Add an active write probe in the poll timeout branch: writing to a dead
connection fails immediately with EPIPE/ECONNRESET/ENOTCONN, providing
reliable detection that a snapshot was taken. With this fix, the dead
connection is detected in <1s instead of waiting 30s.

Also adds a timing assertion to the integration test (warm_start_secs < 30)
to catch future regressions in snapshot restore detection.

Tested: test_heavy_output_after_snapshot_restore passes — warm start
healthy in ~2s (was 45s+ without write probe).
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] PR description still incomplete — now missing 3 of 4 commits: As noted in previous review, the description covers only commit 1's 3 bugs. Since then, a 4th commit (6026ba1 — write probe for dead vsock detection) was added. The description should cover all 6 fixes across 4 commits:

  • Commit 1 (854ff7a): put_mmds reordering, conditional notify_one, listener connection racing ✅ described
  • Commit 2 (195e6ae): remove double output reconnect ❌ missing
  • Commit 3 (042b363): guard against Notify stored-permit cascade ❌ missing
  • Commit 4 (6026ba1): write probe to detect dead vsock after snapshot restore ❌ missing

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, fix PR #413 is open but not yet merged.

New commit 6026ba1 — no code issues found:

  • Write probe (b"\n") on STATUS_PORT is safe — host status listener ignores empty lines via msg.is_empty() check
  • Error handling covers all relevant errno values (EPIPE, ECONNRESET, ENOTCONN, ECONNREFUSED, EAGAIN)
  • Timing assertion (warm_start_secs < 30) is a reasonable regression guard
  • cargo fmt fixes in output.rs correctly applied (supersedes PR fix: apply cargo fmt formatting to fc-agent output.rs #414)

No [LOW] issues from previous reviews are repeated.

Summary

All code changes across all 4 commits are correct and well-tested. The write probe for dead vsock detection (commit 4) is a sound fix for the POLLHUP delivery gap in rootless mode. The only outstanding issues are documentation: the PR description needs updating for all commits, and the DESIGN.md fix (PR #413) is pending merge.


Review by Claude | Run

ejc3 added 3 commits February 19, 2026 17:42
The test previously used exec-based health monitoring (poll_health_by_pid)
to determine when the warm-started VM was ready. After snapshot restore,
the guest exec server's vsock listener sometimes fails to accept new
connections due to a transport reset race, causing the test to time out
despite the output pipeline working correctly.

Switch to polling the log file for container output lines (COUNT:/BURST:)
instead. This directly tests what the test cares about — does output
reach the host after snapshot restore? — without depending on the exec
server which has a separate transport reset issue.

The test now:
- Polls warm start log for container output (not exec health)
- Asserts output appears within 30s (write probe timing)
- Verifies output continues flowing after 15s (not just initial burst)
- Verifies snapshot was used
- Passes 2/2 runs locally (was ~50% failure rate with exec)
After snapshot restore, VIRTIO_VSOCK_EVENT_TRANSPORT_RESET invalidates
the exec server's VsockListener. The AsyncFd epoll registration becomes
stale — accept() hangs forever because tokio never delivers readability
events for incoming connections.

Add a Notify-based rebind signal shared between the exec server, the
restore-epoch watcher, and the cache-ready handshake. When either
detects a snapshot restore, it signals the exec server to drop the old
listener and bind a fresh socket with new epoll registration.

Changes:
- exec.rs: accept rebind_signal, select! between accept and rebind
- agent.rs: create exec_rebind Notify, wire to exec server + watcher
- restore.rs: signal exec rebind in handle_clone_restore()
- mmds.rs: pass exec_rebind through watch_restore_epoch()
@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] PR description still incomplete — now missing 6 of 7 commits: As noted in previous review, the description covers only commit 1's 3 bugs. Since then, 3 more commits were added (total 7). The description should cover all fixes:

  • Commit 1 (854ff7a): put_mmds reordering, conditional notify_one, listener connection racing ✅ described
  • Commit 2 (195e6ae): remove double output reconnect ❌ missing
  • Commit 3 (042b363): guard against Notify stored-permit cascade ❌ missing
  • Commit 4 (6026ba1): detect dead vsock via write probe ❌ missing
  • Commit 5 (19ea3b5): switch test to output-based detection ❌ missing
  • Commit 6 (464c471): rebind exec server listener after vsock transport reset ❌ missing
  • Commit 7 (1d7c4ce): cargo fmt (trivial, OK to omit)

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, fix PR #413 is open but not yet merged.

New commits review (19ea3b5, 464c471, 1d7c4ce) — no code issues found:

  • Exec server rebind (464c471): The select! between accept() and rebind_signal.notified() correctly handles vsock transport reset. The drop(listener) before re-bind is correct. The infinite retry loop in rebind is asymmetric with the output writer's 30-attempt limit, but is benign — binding is a one-time setup operation with no data loss or backpressure risk, and persistent failure indicates systemic VM corruption anyway.

  • Double notify_one from two callers (restore-epoch watcher + cache-ready handshake): Both can fire exec_rebind.notify_one() for the same snapshot event. This is safe — Notify stores at most one permit so two rapid calls collapse, and a redundant rebind would fail harmlessly with EADDRINUSE and retry.

  • Test rewrite (19ea3b5): Switching from exec-based health monitoring to output-based log polling is sound. The test now directly verifies what it cares about (output reaches host) without depending on the exec server which has a separate transport reset issue. The 30s timing assertion catches write-probe regression.

  • cargo fmt (1d7c4ce): Formatting-only, correct.

No [LOW] issues from previous reviews are repeated.

Summary

All code changes across all 7 commits are correct and well-tested. The exec server rebind mechanism and test rewrite (new since last review) are sound. The only outstanding issues are documentation: the PR description needs updating for all commits, and the DESIGN.md fix (PR #413) is pending merge.


Review by Claude | Run

@claude-claude
Copy link
Copy Markdown
Contributor

claude-claude bot commented Feb 19, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] PR description still incomplete — now missing 7 of 8 commits: As noted in previous review, the description covers only commit 1's 3 bugs. The new commit 6cb312e (re-register exec listener epoll instead of drop+rebind) is the 8th commit not reflected in the description. The description's Summary/Fixes sections should cover all fixes across all 8 commits.

[MEDIUM] DESIGN.md snapshot restore ordering still inconsistent with code: As noted in previous review, fix PR #413 is open but not yet merged.

New commit 6cb312e — no code issues found:

  • VsockListener::re_register() correctly extracts the socket fd from AsyncFd (deregistering from epoll) and re-wraps it in a new AsyncFd (re-registering), without closing or rebinding the socket. This is the right approach for preserving active connections.
  • The error path in re_register() correctly consumes self via into_inner() — if AsyncFd::new() fails, the OwnedFd is dropped and the socket is closed. The fallback to full rebind is correct.
  • The 50-retry limit on the rebind fallback (vs the previous infinite loop) prevents the exec server from hanging forever on EADDRINUSE.
  • The run_server doc comment (lines 13-18) still describes the old approach ("Re-binding creates a fresh socket + epoll registration") rather than the new re-registration approach, but this is minor (LOW, not repeated from prior reviews).

No [LOW] issues from previous reviews are repeated.

Summary

The exec listener epoll re-registration (commit 8) is a correct improvement over the previous drop+rebind approach — it preserves active connections and avoids EADDRINUSE. All code changes across all 8 commits are correct. The only outstanding issues are documentation: the PR description needs updating for all commits, and the DESIGN.md fix (PR #413) is pending merge.


Review by Claude | Run

The committed rebind approach (drop listener, bind new socket) fails
when spawned handle_connection tasks hold active connections on the
port. bind() returns EADDRINUSE and the exec server gets stuck in an
infinite retry loop, unable to accept new connections.

Fix: extract the socket fd from AsyncFd (deregistering from epoll)
and re-wrap it in a new AsyncFd (re-registering with epoll), without
closing the socket. This refreshes the stale epoll registration after
vsock transport reset while preserving active connections.

Falls back to full drop+rebind with 50-retry limit if re-register
fails, preventing infinite loops.

Add VsockListener::re_register() method to vsock.rs.
@ejc3 ejc3 force-pushed the fix-output-pipeline-restore branch from 6cb312e to 210239f Compare February 19, 2026 20:23
@ejc3 ejc3 merged commit 4e6b880 into snapshot-disk-consistency Feb 19, 2026
1 check passed
@ejc3 ejc3 deleted the fix-output-pipeline-restore branch February 19, 2026 20:23
ejc3 added a commit that referenced this pull request Mar 2, 2026
fix: output pipeline drops data after snapshot restore
ejc3 added a commit that referenced this pull request Mar 2, 2026
fix: output pipeline drops data after snapshot restore
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant