
fix: update DESIGN.md snapshot restore flow to reflect post-resume MMDS ordering #413

Closed
claude-claude[bot] wants to merge 2 commits into fix-output-pipeline-restore from claude/fix-22182064883

Conversation


claude-claude[bot] commented Feb 19, 2026

Auto-Fix for PR #412

Issues Fixed

  • [MEDIUM] DESIGN.md snapshot restore ordering inconsistent with code: The "Clone Process" section listed "Update MMDS with new config" under Identity Patching (step 3, before VM resume), but the code now correctly performs the MMDS update after VM resume. Moved it to a new step 6 with a note explaining the ordering constraint.

Changes

  • DESIGN.md: Moved "Update MMDS with new config" from the Identity Patching bullet list to a new step 6 after "Resume VM" (step 5), with an explanatory note that guest-visible MMDS data isn't updated until the VM is running.

Generated by Claude | Review Run

EJ Campbell and others added 2 commits February 19, 2026 12:36
Three bugs caused container stdout to not reach the host after snapshot restore:

1. put_mmds before VM resume: Firecracker accepts PUT /mmds while the VM is
   paused but the guest-visible data isn't updated. fc-agent never sees the
   new restore-epoch so it never reconnects. Fix: move put_mmds after resume.

2. Unconditional notify_one: cmd_snapshot_run fired output_reconnect.notify_one()
   for all snapshot types. For pre-start snapshots (container not yet running),
   the listener has no dead connection to drop. The stored Notify permit poisons
   the first real connection by immediately triggering the reconnect branch in
   select!, dropping a valid connection. Fix: only notify for startup snapshots.

3. Listener stuck on stale connection: fc-agent reconnects multiple times during
   snapshot create/restore cycles. Each reconnect creates a new vsock connection
   queued in the listener's accept backlog. The listener accepted the FIRST
   connection and blocked on read_line, while fc-agent wrote to the LATEST
   connection. Fix: race read_line against listener.accept() in tokio::select!,
   always switching to the newest connection.

Also adds output verification to test_heavy_output_after_snapshot_restore to
catch this class of bug — the test previously only checked health and exec,
not that container output actually reached the host.
…DS ordering

Move "Update MMDS with new config" from Identity Patching (step 3) to a
new step 6 after Resume VM (step 5), matching the code change in
restore_from_snapshot() where put_mmds was moved after patch_vm_state("Resumed").

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ejc3 ejc3 force-pushed the fix-output-pipeline-restore branch from 2b6aa60 to 854ff7a on February 19, 2026 15:08
@ejc3 ejc3 deleted the branch fix-output-pipeline-restore February 19, 2026 20:23
@ejc3 ejc3 closed this Feb 19, 2026
@ejc3 ejc3 deleted the claude/fix-22182064883 branch March 2, 2026 07:16