fix: wait for network restore before exec in snapshot clones#211

Closed
claude-claude[bot] wants to merge 1 commit into fix-slirp-snapshot-restore from claude/fix-21613960041

Conversation

claude-claude bot (Contributor) commented Feb 3, 2026

CI Fix

Fixes CI #21605344777

Problem

The test test_snapshot_run_exec_rootless (and its bridged variant, test_snapshot_run_exec_bridged) was failing with:

  • exec_output_found=true - command output "EXEC_TEST_SUCCESS" was found ✓
  • exit_success=false - but fcvm exited with code 125 ✗

The actual error from the logs:

Error: no such container
2026-02-02T20:24:01.402562Z  INFO fcvm::commands::snapshot: exec completed with exit code 125

Exit code 125 is the code Podman returns when the error is in Podman itself (a container runtime error), not in the command being run.
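
To make the meaning of exit code 125 concrete, here is a minimal sketch of how a caller could classify the status returned by a Podman exec. It follows Podman's documented convention (125 for Podman itself, 126 for a command that cannot be invoked, 127 for a command that cannot be found). The `ExecOutcome` type and `classify_exit_code` function are hypothetical names used for illustration and are not taken from fcvm.

```rust
/// Hypothetical classification of an exit status returned by a Podman exec.
#[derive(Debug, PartialEq, Eq)]
enum ExecOutcome {
    /// 0: the contained command succeeded.
    Success,
    /// 125: Podman itself failed (for example "no such container").
    RuntimeError,
    /// 126: the contained command could not be invoked.
    CannotInvoke,
    /// 127: the contained command could not be found.
    CommandNotFound,
    /// Any other code is the contained command's own exit code.
    CommandFailed(i32),
}

/// Map a raw exit code to an outcome (illustrative helper, not fcvm code).
fn classify_exit_code(code: i32) -> ExecOutcome {
    match code {
        0 => ExecOutcome::Success,
        125 => ExecOutcome::RuntimeError,
        126 => ExecOutcome::CannotInvoke,
        127 => ExecOutcome::CommandNotFound,
        other => ExecOutcome::CommandFailed(other),
    }
}
```

Under this reading, the failing run falls into `RuntimeError`: the command's output was seen, but the exit status still reported a failure in Podman itself.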

Root Cause

After snapshot restore with rootless networking (slirp4netns), there's a timing race:

  1. VM resumes immediately after snapshot restore
  2. fc-agent network restore happens asynchronously:
    • Polls MMDS every 100ms for restore-epoch changes
    • When detected, runs handle_clone_restore():
      • Flushes ARP cache
      • Sends gratuitous NDP Neighbor Advertisement for IPv6
      • This announces the VM's MAC address to the new slirp4netns process
  3. --exec attempted the exec immediately after the vsock socket was ready (17µs!), before step 2 had completed
  4. Container networking is not ready yet → Podman can't communicate → "no such container"

The NDP NA announcement is critical for IPv6 DNS to work (slirp4netns routes via fd00::3).
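
The polling step in the list above can be sketched as a generic loop. This is only an illustration of the timing, not fc-agent's actual code: `wait_for_epoch_change` and its `read_epoch` closure are hypothetical, and the closure stands in for whatever MMDS query fc-agent really performs.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll `read_epoch` every `interval` until it returns a value different
/// from `last_seen`, or until `timeout` has elapsed.
/// Returns the new epoch on change, or `None` on timeout.
/// (Hypothetical helper; `read_epoch` stands in for an MMDS lookup.)
fn wait_for_epoch_change<F>(
    mut read_epoch: F,
    last_seen: u64,
    interval: Duration,
    timeout: Duration,
) -> Option<u64>
where
    F: FnMut() -> Option<u64>,
{
    let deadline = Instant::now() + timeout;
    loop {
        // A failed read (None) is treated like "no change yet".
        if let Some(epoch) = read_epoch() {
            if epoch != last_seen {
                return Some(epoch);
            }
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}
```

With a 100ms interval, a change can go unnoticed for up to one full interval, which is the window in which an immediate exec races the network restore.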

Solution

Add a 300ms delay after the vsock socket is ready but before executing container commands:

  • 100ms max for restore-epoch detection (polls every 100ms)
  • 200ms for network setup (ARP flush + NDP NA transmission)

This is intended to give IPv6 routing time to come up before any container exec is attempted.
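
The delay budget described above can be written down as follows. This is a sketch under assumptions: the constant and function names are hypothetical, and the actual change in this PR may structure the wait differently.

```rust
use std::thread;
use std::time::Duration;

/// Worst-case time for fc-agent to notice the restore-epoch change
/// (it polls every 100ms). Hypothetical constant name.
const RESTORE_EPOCH_POLL_INTERVAL: Duration = Duration::from_millis(100);

/// Allowance for the ARP flush and NDP NA transmission. Hypothetical name.
const NETWORK_SETUP_ALLOWANCE: Duration = Duration::from_millis(200);

/// Total delay before exec: 100ms + 200ms = 300ms.
fn post_restore_exec_delay() -> Duration {
    RESTORE_EPOCH_POLL_INTERVAL + NETWORK_SETUP_ALLOWANCE
}

/// Block the caller for the full budget. Intended to be called after the
/// vsock socket is ready and before the container exec is issued.
fn wait_for_network_restore() {
    thread::sleep(post_restore_exec_delay());
}
```

A fixed sleep is simple but is still a timing assumption: it does not observe whether the network restore has actually finished.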

Testing

This fix specifically addresses the failing tests:

  • test_snapshot_run_exec_rootless - was failing with exit code 125
  • test_snapshot_run_exec_bridged - likely same issue

Generated by Claude | Fix Run

After snapshot restore with slirp4netns, fc-agent needs time to complete
network initialization before container exec will work:

1. fc-agent polls MMDS for restore-epoch every 100ms
2. When detected, it runs handle_clone_restore() which:
   - Flushes ARP cache
   - Sends gratuitous NDP NA for IPv6
   - This announces the VM's MAC to the new slirp4netns process

Without this delay, `fcvm snapshot run --exec` fails with exit code 125
("no such container") because the container's IPv6 networking isn't ready.

The test `test_snapshot_run_exec_rootless` was failing with:
  exec_output_found=true (command ran)
  exit_success=false (exit code 125)

This fix adds a 300ms delay after vsock socket is ready:
- 100ms max for restore-epoch detection
- 200ms for network setup (ARP flush + NDP NA)

This ensures IPv6 routing is established before trying to exec into
the container, allowing Podman to communicate properly.

Fixes CI test: test_snapshot_run_exec_rootless
Fixes CI test: test_snapshot_run_exec_bridged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

ejc3 (Owner) commented Feb 3, 2026

Superseded by PR #217 (http-proxy)

ejc3 closed this Feb 3, 2026
ejc3 deleted the claude/fix-21613960041 branch February 8, 2026 18:28