Add PR-based development workflow to CLAUDE.md by ejc3 · Pull Request #2 · ejc3/fcvm

ejc3 · 2025-12-20T19:51:36Z

No description provided.

When PRs are stacked, the branch for PR #2 must actually be based on PR #1's branch (via git ancestry), not just have the GitHub base set correctly. If PR #2 is based on main instead of PR #1, tests will fail because PR #2 won't have PR #1's changes. Added verification command and fix instructions to CLAUDE.md.
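
The exact verification command added to CLAUDE.md is not reproduced on this page. One standard way to test git ancestry is `git merge-base --is-ancestor`, which exits 0 when the first commit is in the second commit's history and 1 when it is not. The sketch below wraps that call in Rust; the function names are hypothetical and the branch names are whatever the caller supplies.

```rust
use std::io;
use std::process::Command;

/// Arguments for `git merge-base --is-ancestor <parent> <child>`.
/// Git exits 0 when `parent` is an ancestor of `child`, and 1 when it is not.
fn is_ancestor_args(parent: &str, child: &str) -> Vec<String> {
    vec![
        "merge-base".to_string(),
        "--is-ancestor".to_string(),
        parent.to_string(),
        child.to_string(),
    ]
}

/// Returns Ok(true) when `parent` is in the history of `child`.
/// Must be run from inside the repository (hypothetical helper, not from the PR).
fn is_stacked_on(parent: &str, child: &str) -> io::Result<bool> {
    let status = Command::new("git")
        .args(is_ancestor_args(parent, child))
        .status()?;
    match status.code() {
        Some(0) => Ok(true),
        Some(1) => Ok(false),
        // Any other exit code (or death by signal) means git itself failed.
        other => Err(io::Error::new(
            io::ErrorKind::Other,
            format!("git merge-base failed: {:?}", other),
        )),
    }
}
```

A result of `Ok(false)` for the PR #1 branch against the PR #2 branch would indicate that PR #2 is not actually stacked on PR #1, whatever GitHub shows as its base.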

Deduplicate snapshot creation logic between user snapshots (fcvm snapshot create) and system snapshots (podman cache).

Both now call create_snapshot_core(), which handles:
- Pause VM via Firecracker API
- Create Firecracker snapshot (Full type)
- Resume VM immediately (regardless of result)
- Copy disk via btrfs reflink
- Write config.json with pre-built SnapshotConfig
- Atomic rename temp dir to final location

Caller-specific logic remains:
- snapshot.rs: VM state lookup, RW disk validation, volume parsing, vsock chain
- podman.rs: Lock handling, SnapshotCreationParams

No functional changes - same behavior, just deduplicated code. Prepares for PR #2 which will add diff snapshot support.
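
The ordering of the shared steps matters: the VM is resumed whether or not the snapshot succeeded, and the snapshot directory only becomes visible under its final name once it is complete. The following is a minimal std-only sketch of that ordering, not the actual `create_snapshot_core()`. The `VmControl` trait, the function names, and the `disk.raw` file name are assumptions for illustration, and a plain `fs::copy` stands in for the btrfs reflink.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the Firecracker API client.
trait VmControl {
    fn pause(&mut self) -> io::Result<()>;
    fn snapshot_full(&mut self, dir: &Path) -> io::Result<()>;
    fn resume(&mut self) -> io::Result<()>;
}

/// Builds the snapshot in a sibling temp dir, then renames it into place.
fn create_snapshot_sketch<V: VmControl>(
    vm: &mut V,
    disk: &Path,
    config_json: &str,
    final_dir: &Path,
) -> io::Result<()> {
    // A sibling path keeps the final rename on the same filesystem.
    let tmp = final_dir.with_extension("tmp");
    fs::create_dir_all(&tmp)?;

    let result = build_in_tmp(vm, disk, config_json, &tmp)
        // The rename publishes the snapshot only once everything is written.
        .and_then(|()| fs::rename(&tmp, final_dir));
    if result.is_err() {
        // Best-effort cleanup so a failed attempt leaves no partial snapshot.
        let _ = fs::remove_dir_all(&tmp);
    }
    result
}

fn build_in_tmp<V: VmControl>(
    vm: &mut V,
    disk: &Path,
    config_json: &str,
    tmp: &Path,
) -> io::Result<()> {
    vm.pause()?;
    let snap = vm.snapshot_full(tmp);
    // Resume immediately, regardless of whether the snapshot succeeded.
    let resumed = vm.resume();
    snap?;
    resumed?;

    // Plain copy for portability; the source describes a btrfs reflink here.
    fs::copy(disk, tmp.join("disk.raw"))?;
    fs::write(tmp.join("config.json"), config_json)
}
```

Capturing the snapshot result before calling `resume()` and only propagating it afterwards is what keeps a failed snapshot from leaving the VM paused.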

…hang

Replace the racy double-rebind approach with a deterministic handshake chain that guarantees the exec server's AsyncFd epoll is re-registered before the host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset (VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll registration stale. The previous fix (c15aa6b) removed the duplicate rebind signal from agent.rs but left a timing gap: if the restore-epoch watcher's single signal arrived late, the host's health monitor would start exec calls against a stale listener, hanging for ~60s until the kernel's vsock cleanup expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

    T+0.009s    Exec call #1 → WORKS (167+144+176+123+71+27 bytes response)
    T+0.076s    Exec call #2 → WORKS
    T+0.520s    Exec call #3 → guest ACKs, receives request, sends NOTHING → 5s timeout
    T+5.5-55s   Exec calls #4-#9 → same pattern: kernel accepts, app never processes
    T+60.5s     Guest sends RST for ALL stale connections simultaneously
    T+60.5s     Exec call #10 → WORKS → "container running status running=true"

The container started at T+0.28s. The exec server was broken for 60 more seconds because the duplicate re_register() from agent.rs corrupted the edge-triggered epoll: the old AsyncFd consumed the edge notification, and the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

    exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                                  ↓
                                                   host accepts output connection
                                                                  ↓
                                                        health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

### fc-agent side (4 files):

- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects output vsock
- agent.rs: Removed duplicate rebind signal after notify_cache_ready_and_wait; added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: Threads new params through watch_restore_epoch to both handle_clone_restore call sites

### Host side (3 files):

- listeners.rs: Added connected_tx oneshot to run_output_listener(), fired on first output connection accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning health monitor; removed stale output_reconnect.notify_one() for startup snapshots
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After: restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms (max single: 15.3ms), zero timeouts.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
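
The handshake chain described above can be illustrated with a small std-only model. This is a sketch under stated assumptions, not the fc-agent code: `std::sync::mpsc` channels stand in for the tokio `Notify` and oneshot primitives named in the commit message, three threads play the exec server, the restore handler, and the host, and the function and step names are hypothetical. Only the 5s and 30s timeouts come from the source.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Log = Arc<Mutex<Vec<&'static str>>>;

fn record(log: &Log, step: &'static str) {
    log.lock().unwrap().push(step);
}

/// Runs the chain once and returns the steps in the order they happened.
fn run_handshake_sketch() -> Result<Vec<&'static str>, RecvTimeoutError> {
    let (rebind_signal_tx, rebind_signal_rx) = mpsc::channel::<()>();
    let (rebind_done_tx, rebind_done_rx) = mpsc::channel::<()>();
    let (output_connected_tx, output_connected_rx) = mpsc::channel::<()>();
    let log: Log = Arc::new(Mutex::new(Vec::new()));

    // Exec server: re-register only when signalled, then confirm completion.
    let exec_log = Arc::clone(&log);
    let exec = thread::spawn(move || {
        if rebind_signal_rx.recv().is_ok() {
            record(&exec_log, "exec_re_register");
            let _ = rebind_done_tx.send(());
        }
    });

    // Restore handler: signal the rebind, wait for confirmation (5s),
    // and only THEN reconnect the output channel.
    let restore_log = Arc::clone(&log);
    let restore = thread::spawn(move || -> Result<(), RecvTimeoutError> {
        record(&restore_log, "exec_rebind_signal");
        let _ = rebind_signal_tx.send(());
        rebind_done_rx.recv_timeout(Duration::from_secs(5))?;
        record(&restore_log, "rebind_done");
        record(&restore_log, "output_reconnect");
        let _ = output_connected_tx.send(());
        Ok(())
    });

    // Host: do not start health checks until the output connection arrives (30s).
    output_connected_rx.recv_timeout(Duration::from_secs(30))?;
    record(&log, "host_accepts_output");
    record(&log, "health_monitor_spawns");

    exec.join().expect("exec thread panicked");
    restore.join().expect("restore thread panicked")?;
    let steps = log.lock().unwrap().clone();
    Ok(steps)
}
```

Because each stage blocks on the previous stage's explicit message, the recorded order is the same on every run, which is the property the commit message describes as having no timing dependencies.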

Co-authored-by: ejc3 <ejc3@users.noreply.github.com>

Add PR-based development workflow to CLAUDE.md

6932dfb

ejc3 merged commit 91267ca into main Dec 20, 2025
8 of 11 checks passed

ejc3 deleted the docs/pr-workflow branch December 20, 2025 19:52

ejc3 mentioned this pull request Dec 28, 2025

Add critical guidance on maintaining stacked PR coherence #43

Merged

ejc3 mentioned this pull request Jan 25, 2026

feat: add automatic diff-based snapshot support #188

Merged

8 tasks

claude-claude bot mentioned this pull request Feb 18, 2026

fix: conditional podman reset + container build lock #393

Merged

3 tasks

ejc3 added a commit that referenced this pull request Mar 2, 2026

Add PR-based development workflow to CLAUDE.md (#2)

49eec5c

Co-authored-by: ejc3 <ejc3@users.noreply.github.com>

ejc3 added a commit that referenced this pull request Mar 2, 2026

Add PR-based development workflow to CLAUDE.md (#2)

5aa11c0

Co-authored-by: ejc3 <ejc3@users.noreply.github.com>

ejc3 merged 1 commit into main from docs/pr-workflow
