
fc-agent: clock sync, ss -K fix, --no-swap cgroup isolation, --no-dirty-tracking #536

Merged: ejc3 merged 11 commits into main from routed-mode-fixes on Mar 2, 2026

ejc3 (Owner) commented Mar 2, 2026

Summary

  • Clock sync on boot + restore: Start chronyd as root after FUSE mounts with makestep 1 -1 for snapshot restore clock jumps. On restore, sync clock from MMDS host time + chronyc makestep before network reconfiguration.
  • Fix ss -K filter: Changed from ! dst 127.0.0.1 to grouped filter (! dst 127.0.0.0/8 and ! dst [::1] and ! dst 10.0.2.0/24) — preserves gateway and loopback connections after snapshot restore. The old filter killed gateway connections, breaking networking.
  • --no-swap cgroup isolation: Creates /sys/fs/cgroup/fcvm.slice/fcvm-{pid}.scope with memory.swap.max=0 and moves Firecracker into it. Replaces broken mlockall(MCL_FUTURE) which doesn't survive exec. Each VM gets its own cgroup scope (doesn't affect the session).
  • --no-dirty-tracking: Passes track_dirty_pages=false to Firecracker for clones that don't need subsequent diff snapshots, keeping file-backed pages shared through the host page cache.
  • SnapshotFixture: Extracts baseline→snapshot→serve boilerplate into a reusable test fixture in tests/common/mod.rs.
  • DESIGN.md: Documents Clone Memory Sharing analysis (File vs UFFD MISSING+COPY vs UFFD MINOR+CONTINUE).
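The --no-swap bullet above describes creating a per-VM scope, setting memory.swap.max=0, and moving the process into it. A minimal sketch of those three steps follows. It is not the PR's disable_cgroup_swap code: the function name `isolate_without_swap` and the `cgroup_root` parameter are assumptions made for illustration (the root is a parameter so the sketch can be exercised against a scratch directory). On a real cgroup v2 host the memory controller would also need to be enabled for the parent slice, and the caller needs write access to the hierarchy.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create `<cgroup_root>/fcvm.slice/fcvm-<pid>.scope`, forbid swap for it,
/// and move `pid` into it. Hypothetical helper, not the PR's actual code.
fn isolate_without_swap(cgroup_root: &Path, pid: u32) -> io::Result<PathBuf> {
    let scope = cgroup_root
        .join("fcvm.slice")
        .join(format!("fcvm-{pid}.scope"));
    fs::create_dir_all(&scope)?;
    // memory.swap.max = 0: processes in this cgroup may not use swap.
    fs::write(scope.join("memory.swap.max"), "0")?;
    // Writing a PID to cgroup.procs migrates that process into the cgroup.
    fs::write(scope.join("cgroup.procs"), pid.to_string())?;
    Ok(scope)
}
```

Because the scope name includes the PID, each VM lands in its own directory, which matches the "doesn't affect the session" claim in the summary.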

Test Results

$ make test-root FILTER="-E 'test(/sanity|snapshot_clone|egress|exec_rootless|startup_snapshot|diff_snapshot|hugepage|output_after/)'" STREAM=1
47 tests run: 47 passed, 447 skipped

$ make test-root FILTER=cgroup_swap STREAM=1
2 tests run: 2 passed

$ make test-root FILTER="-E 'test(/clock_synced|ss_filter|no_swap_creates|no_dirty_tracking/)'" STREAM=1
4 tests run: 4 passed

ejc3 added 3 commits March 2, 2026 02:45
- fc-agent/agent.rs: Start chronyd as root after FUSE mounts, add NTP
  servers dynamically via chronyc (config parser can't handle bare IPv6),
  strip xleave option from host chrony.conf
- fc-agent/network.rs: Fix ss -K syntax (remove broken [fd00::]/64 filter),
  fix connection counting to include 10.0.2.* and [fd00: as local
- fc-agent/restore.rs: Add clock sync from MMDS + chronyc makestep on
  snapshot restore, before network reconfiguration
- src/cli/args.rs: Add --no-dirty-tracking and --mlock CLI flags
- src/commands/common.rs: Add track_dirty_pages and mlock to RestoreParams,
  add disable_cgroup_swap() via systemctl set-property
- src/commands/snapshot.rs: Wire track_dirty_pages and mlock through
  RestoreParams for snapshot run
- src/commands/podman/mod.rs: Wire mlock flag through to snapshot args
- DESIGN.md: Add Clone Memory Sharing section documenting file backend,
  UFFD MISSING+COPY, and proposed UFFD MINOR+CONTINUE architecture
- scripts/: Add memory sharing test scripts
- Replace broken mlockall(MCL_FUTURE) with cgroup-based swap control
  (MCL_FUTURE doesn't survive exec). --no-swap creates a dedicated
  cgroup under /sys/fs/cgroup/fcvm.slice/ with memory.swap.max=0
  and moves the Firecracker process into it.
- Wire --no-dirty-tracking through to Firecracker's track_dirty_pages
  API for clones that don't need subsequent diff snapshots.
- Add test_cgroup_swap: verifies process isolation in fcvm.slice,
  memory.swap.max=0, and separate scopes per VM.
- Add test_clone_restore_fixes: integration tests for clock sync after
  restore, ss -K gateway preservation, --no-swap cgroup creation, and
  --no-dirty-tracking clone behavior.
- Document nextest expression filter syntax in CLAUDE.md.

Tested:
  make test-root FILTER=cgroup_swap STREAM=1           # 2/2 pass
  make test-root FILTER="-E 'test(/clock_synced|ss_filter|no_swap_creates|no_dirty_tracking/)'" STREAM=1  # 4/4 pass
  make test-root FILTER=sanity STREAM=1                # 5/5 pass
Add SnapshotFixture to tests/common/mod.rs — encapsulates the
baseline→snapshot→serve setup that was duplicated across 4 tests
in test_clone_restore_fixes.rs. Each test now creates the fixture
in one line and focuses on its actual assertions.

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Findings

[MEDIUM] Hugepage clones regressed: dirty tracking enabled by default breaks 2MB TLB entries

The old code in restore_from_snapshot automatically disabled dirty tracking for hugepage VMs:

track_dirty_pages: Some(!restore_config.hugepages),

The removed comment explained: "Skip for hugepage VMs because KVM splits 2MB Stage 2 block mappings to 4K for dirty tracking, negating the TLB benefit of hugepages."

The new code in snapshot.rs defaults to dirty tracking ON (!args.no_dirty_tracking evaluates to true) regardless of hugepage status:

let needs_dirty_tracking = if args.startup_snapshot_base_key.is_some() {
    true
} else {
    !args.no_dirty_tracking // always true unless CLI flag passed
};

This means hugepage clones (fcvm snapshot run from a hugepage snapshot) will have KVM dirty tracking enabled, splitting their 2MB Stage 2 block mappings to 4K and negating the TLB benefit of hugepages — a silent performance regression.

Fix: Add hugepages check to the needs_dirty_tracking logic, similar to how the old code handled it.
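One possible shape for that guard is sketched below. The struct `RestoreArgs` is a hypothetical stand-in for the real argument types; only the field names come from the snippets quoted in this review. Checking hugepages first is an assumption, based on the follow-up commit message stating that hugepage VMs always disable dirty tracking.

```rust
/// Hypothetical stand-in for the real CLI/restore arguments.
struct RestoreArgs {
    startup_snapshot_base_key: Option<String>,
    no_dirty_tracking: bool,
    hugepages: bool,
}

/// Decide whether KVM dirty page tracking should be enabled on restore.
fn needs_dirty_tracking(args: &RestoreArgs) -> bool {
    if args.hugepages {
        // Dirty tracking splits 2MB Stage 2 block mappings into 4K pages,
        // so hugepage VMs keep it off (assumed to take precedence).
        return false;
    }
    if args.startup_snapshot_base_key.is_some() {
        // Startup snapshot path: a later diff snapshot needs the bitmap.
        return true;
    }
    // Plain clone: on by default, off with --no-dirty-tracking.
    !args.no_dirty_tracking
}
```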

[MEDIUM] RestoreParams.track_dirty_pages doc says "Default: false for clones" but actual default is true

The doc comment on RestoreParams::track_dirty_pages says "Default: false for clones" but the only call site sets it to !args.no_dirty_tracking which evaluates to true by default. Misleading docs on a field that controls a performance-sensitive KVM feature.

[LOW] SnapshotFixture doc comment says "Cleans up all processes on drop" but has no Drop impl

The struct doc says "Cleans up all processes on drop" but there's no impl Drop for SnapshotFixture. Cleanup relies on manual fixture.cleanup().await calls. The doc comment is misleading, though the tests do call cleanup explicitly.

Summary

The main issue is a performance regression for hugepage clones: the refactoring of track_dirty_pages from an automatic hugepage-aware default to a CLI-controlled boolean lost the hugepage guard, causing KVM to split 2MB TLB entries to 4K on hugepage snapshot restores.


Review by Claude | Run

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 632667c44d


Comment thread fc-agent/src/agent.rs Outdated
Comment on lines +452 to +454
if line.starts_with("server ") {
    if let Some(addr) = line.split_whitespace().nth(1) {
        server_addrs.push(addr.to_string());
[P2] Parse pool directives when importing host NTP sources

The chrony source import only accepts lines starting with "server ", so hosts whose /etc/chrony.conf is pool-only (a common default) leave server_addrs empty; chronyd then starts with no upstream sources and the VM stops doing ongoing clock convergence after boot/restore. Please parse both server and pool entries (or reuse the host config directly) so time sync works across host distros.
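A minimal sketch of accepting both directives is shown below. The helper name `ntp_sources` is hypothetical, and the sketch keeps only the address: it drops per-source options and the distinction between a single server and a pool, both of which real code may need to preserve.

```rust
/// Collect upstream NTP addresses from chrony.conf text, accepting both
/// `server` and `pool` directives. Hypothetical helper for illustration.
fn ntp_sources(conf: &str) -> Vec<String> {
    conf.lines()
        .filter_map(|line| {
            let mut words = line.split_whitespace();
            match words.next() {
                // The second word on the line is the host name or address.
                Some("server") | Some("pool") => words.next().map(str::to_string),
                // Comments, blank lines, and other directives are skipped.
                _ => None,
            }
        })
        .collect()
}
```

Matching on the first whitespace-separated word also tolerates indented lines, which a plain starts_with("server ") check would miss.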


Comment thread src/commands/snapshot.rs Outdated
Comment on lines +880 to +883
let needs_dirty_tracking = if args.startup_snapshot_base_key.is_some() {
    true // podman path — needs dirty tracking for startup snapshot
} else {
    !args.no_dirty_tracking // CLI default: on. --no-dirty-tracking: off.
[P2] Keep dirty tracking disabled for hugepage clone restores

This logic now enables dirty tracking for clones by default unless --no-dirty-tracking is set, but it no longer gates on hugepages; for hugepage snapshots, dirty tracking forces 2MB stage-2 mappings to split into 4KB pages, which removes the intended hugepage performance characteristics and can significantly increase restore/runtime overhead. Preserve a hugepage-specific guard so track_dirty_pages stays off when restoring hugepage VMs unless a caller explicitly requires diff snapshots.


ejc3 added 3 commits March 2, 2026 04:21
- Restore hugepage guard: hugepage VMs always disable dirty tracking
  (KVM splits 2MB Stage 2 block mappings to 4K, negating TLB benefit).
  The refactoring to --no-dirty-tracking lost this automatic guard.
- Fix RestoreParams.track_dirty_pages doc: said "Default: false" but
  actual default was true.
- Fix SnapshotFixture doc: said "on drop" but has no Drop impl.
Hosts using "pool" instead of "server" in chrony.conf (common default
on Ubuntu/Debian) left chronyd with no upstream NTP sources.
Two automated reviewers on this repo:
- Claude review: posts PR-level comments (gh pr view --json comments)
- Codex review: posts PR reviews with inline comments on specific lines
  (gh api repos/$REPO/pulls/<N>/comments, filter by codex author)

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Previous Review Status

All three issues from the previous review (hugepage dirty tracking guard, RestoreParams doc, SnapshotFixture doc) have been fixed in commit 5264b3d. ✅

New Findings

[MEDIUM] --no-dirty-tracking CLI help text contradicts DESIGN.md empirical results

The CLI help text (src/cli/args.rs:354-356) claims:

Disable KVM dirty page tracking. File-backed pages stay shared through
the host page cache — multiple clones from the same snapshot share
physical memory pages.

The same claim is repeated in src/commands/snapshot.rs:876-877:

// Without it, pages stay shared through the host page cache — multiple
// clones from the same snapshot share physical memory.

But DESIGN.md (lines 2213-2218) documents the opposite, based on measured results:

Firecracker maps memory.bin with MAP_PRIVATE | PROT_READ | PROT_WRITE.
When KVM handles a guest page fault, even for a read, the page becomes
Private_Clean... The track_dirty_pages flag (--no-dirty-tracking CLI)
controls KVM's dirty bitmap tracking but does NOT prevent the
Private_Clean CoW behavior — that's inherent to MAP_PRIVATE with
writable mappings.

The File backend row in the table shows 33.6 GB Private_Clean per clone with 1.8 MB Shared — pages are clearly NOT staying shared. The CLI flag description actively misleads users into thinking --no-dirty-tracking enables memory sharing when DESIGN.md proves it doesn't. The flag still has value (avoids the additional overhead of dirty bitmap maintenance), but the sharing claim is false.

[MEDIUM] ss -K filter does not preserve fd00:: IPv6 gateway connections despite detection logic counting them as local

In fc-agent/src/network.rs, the detection loop (lines 124-128) correctly classifies [fd00: connections as "local":

|| peer.starts_with("[fd00:")

But the ss -K filter (lines 150-168) only excludes 127.0.0.0/8, [::1], and 10.0.2.0/24:

"(", "!", "dst", "127.0.0.0/8",
"and", "!", "dst", "[::1]",
"and", "!", "dst", "10.0.2.0/24", ")"

The comment on line 149 claims "fd00:: traffic goes through the gateway anyway (preserved by 10.0.2.0/24 rule)" — this is incorrect. fd00:: is IPv6; 10.0.2.0/24 is IPv4. Any established TCP connections to fd00::2 (the pasta IPv6 gateway) or other fd00::/64 addresses would be killed by ss -K.

The practical impact is limited since most TCP connections through fd00::2 have external destinations (not fd00:: itself), but the counting/logging tells the operator "preserving N local" while the kill filter actually kills the fd00:: ones in that count.
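The counting/killing mismatch can be illustrated with two predicates and a check for peers that satisfy one but not the other. This is a sketch under assumptions: the function names are hypothetical, the "127." and "[::1]" prefixes in the counting side are guessed (only "10.0.2." and "[fd00:" are quoted in this PR), and peers are assumed to look like `ip:port` or `[ipv6]:port`.

```rust
use std::net::{IpAddr, SocketAddr};

/// String-prefix check in the style of the detection loop (partly assumed).
fn counted_as_local(peer: &str) -> bool {
    peer.starts_with("127.")
        || peer.starts_with("[::1]")
        || peer.starts_with("10.0.2.")
        || peer.starts_with("[fd00:")
}

/// What the quoted kill filter spares: 127.0.0.0/8, ::1, and 10.0.2.0/24.
fn spared_by_kill_filter(peer: &str) -> bool {
    match peer.parse::<SocketAddr>().map(|sa| sa.ip()) {
        Ok(IpAddr::V4(v4)) => {
            let o = v4.octets();
            o[0] == 127 || (o[0] == 10 && o[1] == 0 && o[2] == 2)
        }
        Ok(IpAddr::V6(v6)) => v6.is_loopback(),
        Err(_) => false,
    }
}

/// Peers that would be logged as preserved but still killed.
fn miscounted<'a>(peers: &[&'a str]) -> Vec<&'a str> {
    peers
        .iter()
        .copied()
        .filter(|p| counted_as_local(p) && !spared_by_kill_filter(p))
        .collect()
}
```

For a peer list containing `[fd00::2]:443`, `miscounted` returns exactly that entry, which is the inconsistency this finding describes.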

[LOW] New CLI flags --no-swap and --no-dirty-tracking not documented in README.md

The README has a "Key podman run flags" section (lines 322-339) but no equivalent for snapshot run flags. The two new flags are only discoverable via --help. Minor since README says "Run fcvm --help for full options."

Summary

The main issue is a misleading --no-dirty-tracking description: the CLI help and code comments claim the flag enables memory sharing between clones, but DESIGN.md's own empirical data shows this doesn't work with the File backend (Private_Clean CoW happens regardless). The secondary issue is a counting/killing inconsistency in the ss -K filter for fd00:: IPv6 connections.



claude-claude bot (Contributor) commented Mar 2, 2026

🔧 Auto-Fix Created

I found issues and created a fix PR: #537

Please review and merge the fix PR first, then this PR.


The test_no_dirty_tracking_clone assertion looked for track_dirty_pages
in the Firecracker API body, which isn't logged. Added track_dirty_pages
to the snapshot load info line so the test can verify the flag was set.

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Previous Review Status

  • First review (comment): All 3 issues (hugepage dirty tracking guard, RestoreParams doc, SnapshotFixture doc) fixed in commit 5264b3d. ✅
  • Second review (comment): 2 [MEDIUM] issues remain unfixed — fix PR #537 was created but has not been merged yet:
    1. --no-dirty-tracking CLI help text falsely claims page sharing (contradicts DESIGN.md empirical data)
    2. ss -K filter missing fd00::/64 exclusion (counts fd00:: as "local" but kills it anyway)

New Findings

No new [MEDIUM] or [CRITICAL] issues found. The code changes are well-structured with good test coverage.

[LOW] SnapshotFixture doc example comment still implies Drop behavior

Line 1076 of tests/common/mod.rs says // fixture drops and kills baseline + serve processes but there's no Drop impl — cleanup requires explicit fixture.cleanup().await calls. The main doc comment was correctly fixed to say "Call cleanup() to kill all fixture processes" but the example comment is stale. Minor since all test code correctly calls cleanup explicitly.

[LOW] Cgroup directories not cleaned up on VM exit

disable_cgroup_swap creates /sys/fs/cgroup/fcvm.slice/fcvm-{pid}.scope but never removes it when the Firecracker process exits. Empty cgroup directories accumulate over time. The kernel retains some metadata per directory, but the practical impact is negligible for typical VM counts.

Summary

The two [MEDIUM] issues from the previous review remain unfixed pending merge of PR #537. No new significant issues found — the clock sync, cgroup swap isolation, and dirty tracking refactoring are solid. The new integration tests (test_clone_restore_fixes.rs) and unit tests (test_cgroup_swap.rs) provide good coverage of the new features.



The clock sync + chronyc makestep + gateway ping (3s timeout) added
to handle_clone_restore extends the restore from ~0.5s to ~3.5s.
The 3s wait before reading the log was barely enough; on loaded CI
runners the restore messages hadn't flushed to the log yet.

ejc3 force-pushed the routed-mode-fixes branch 2 times, most recently from 1443dae to d69f101, on March 2, 2026 05:16

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Previous Review Status

  • First review (comment): All 3 issues fixed ✅
  • Second review (comment): 2 [MEDIUM] issues remain unfixed — fix PR #537 still open:
    1. --no-dirty-tracking CLI help falsely claims page sharing (contradicts DESIGN.md)
    2. ss -K filter missing fd00::/64 exclusion (counts as local but kills anyway)

New Commits Reviewed

Two new commits since last review:

d69f101 — Replace blocking ping with non-blocking ARP ✅ Good fix. ping -c 1 -W 3 required raw sockets and always failed in rootless mode (pasta doesn't respond to ICMP), burning 3s on every restore. ip neigh replace ... nud stale triggers ARP resolution asynchronously via the kernel's neighbor subsystem without blocking.

90b5b43 — Poll for restore-complete instead of fixed sleep ✅ Good fix. Replaces sleep(3) with a polling loop that checks for "[fc-agent] restore complete" in the log — properly event-driven rather than a race-prone fixed delay.
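The polling approach can be sketched as a deadline-bounded loop. This version is synchronous and uses only the standard library; the real test may well be async, and the name `wait_for_marker` plus the closure-based log reader are assumptions for illustration.

```rust
use std::time::{Duration, Instant};

/// Poll `read_log` until its output contains `marker` or `timeout` elapses.
/// Returns true if the marker was seen. Hypothetical helper.
fn wait_for_marker<F>(
    mut read_log: F,
    marker: &str,
    timeout: Duration,
    interval: Duration,
) -> bool
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    loop {
        if read_log().contains(marker) {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Sleep briefly between checks instead of one long fixed delay.
        std::thread::sleep(interval);
    }
}
```

Unlike a fixed sleep(3), this returns as soon as the marker appears and fails with a clear timeout on a slow runner instead of reading a log that has not been flushed yet.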

New Findings

No new [MEDIUM] or [CRITICAL] issues.

[LOW] Stale comment in serial console test references removed "gateway ping"

Line 82 of tests/test_serial_console.rs says "clock sync + chronyc + gateway ping" but commit d69f101 replaced the blocking ping with non-blocking ip neigh. The comment is now stale ("gateway ping" → "ARP resolution"), though it doesn't affect test correctness since the polling approach works regardless.

[LOW] send_gratuitous_arp() function name no longer matches behavior

After d69f101, the function adds a STALE neighbor entry via ip neigh replace — it no longer sends a gratuitous ARP. The name is now misleading, though callers aren't affected.

Summary

The two new commits are well-reasoned fixes: non-blocking ARP eliminates a 3s timeout on every rootless restore, and polling for restore-complete replaces a race-prone fixed sleep. The two [MEDIUM] issues from the second review remain unfixed pending merge of PR #537.



ejc3 force-pushed the routed-mode-fixes branch from d69f101 to 17014b7 on March 2, 2026 05:28

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Previous Review Status

No new commits since last review (HEAD is still 17014b7).

  • First review (comment): All 3 issues fixed ✅
  • Second review (comment): 2 [MEDIUM] issues remain unfixed — fix PR #537 is still open and unmerged:
    1. --no-dirty-tracking CLI help and code comments falsely claim page sharing (contradicts DESIGN.md empirical data)
    2. ss -K filter missing fd00::/64 exclusion (counts fd00:: as "local" but kills it anyway)

New Findings

None. No new commits since the previous review.

Summary

The two [MEDIUM] issues from the second review remain the only outstanding items, with a fix PR (#537) already created but not yet merged.



send_gratuitous_arp() used `ping -c 1 -W 3` to trigger ARP resolution,
but ping requires ICMP raw sockets and always failed in rootless mode
(pasta doesn't respond to ICMP), burning 3 seconds on every restore.

Replace with `arping -c 1 -U -I eth0 <our_ip>` which sends a gratuitous
ARP directly at layer 2 without needing ICMP. This broadcasts our
MAC→IP mapping to the network (bridge/pasta) instantly.

Add iputils-arping to rootfs system packages.

ejc3 force-pushed the routed-mode-fixes branch from 17014b7 to 9bd8bc4 on March 2, 2026 05:57

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Previous Review Status

No new commits since last review (HEAD is still 9bd8bc4).

  • First review (comment): All 3 issues fixed ✅
  • Second review (comment): 2 [MEDIUM] issues remain unfixed — fix PR #537 is still open and unmerged:
    1. --no-dirty-tracking CLI help and code comments falsely claim page sharing (contradicts DESIGN.md empirical data)
    2. ss -K filter missing fd00::/64 exclusion (counts fd00:: as "local" but kills it anyway)

New Findings

None. No new commits since the previous review.

Summary

The two [MEDIUM] issues from the second review remain the only outstanding items, with a fix PR (#537) already created but not yet merged.



claude-claude bot (Contributor) commented Mar 2, 2026

🔧 CI Auto-Fix

Created fix PR: #538

Root cause: benches/clone.rs clone_http() made a single HTTP request immediately after snapshot restore, but pasta networking needs a moment to establish L4 translation after clone restore. The same function in benches/exec.rs already had retry logic for this exact race condition — clone.rs was simply not updated to match.


The clone bench's HTTP request through pasta's loopback port forward
can get an empty response on heavily loaded CI runners, even though
verify_port_forwarding() confirmed the port works moments earlier.
Retry up to 3 times with 500ms backoff instead of panicking on the
first empty response.
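
The retry described in this commit message can be sketched as a small generic helper. The name `retry` and its signature are assumptions for illustration, not the code in benches/clone.rs; the backoff is a fixed delay between attempts, matching the "500ms backoff" wording.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` up to `attempts` times (at least once), sleeping `backoff`
/// between failures. Returns the first success or the last error.
fn retry<T, E, F>(attempts: u32, backoff: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut tries = 0;
    loop {
        tries += 1;
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error to the caller.
            Err(e) if tries >= attempts => return Err(e),
            // Otherwise wait and try again.
            Err(_) => sleep(backoff),
        }
    }
}
```

A call such as `retry(3, Duration::from_millis(500), || fetch())`, where `fetch` is a hypothetical function returning a `Result`, would correspond to the 3 attempts with 500ms backoff mentioned above.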

claude-claude bot (Contributor) commented Mar 2, 2026

🔍 Claude Review

SEVERITY: medium

Previous Review Status

  • First review (comment): All 3 issues fixed ✅
  • Second review (comment): 2 [MEDIUM] issues remain unfixed — fix PR #537 is still open and unmerged:
    1. --no-dirty-tracking CLI help and code comments falsely claim page sharing (contradicts DESIGN.md empirical data)
    2. ss -K filter missing fd00::/64 exclusion (counts fd00:: as "local" but kills it anyway)

New Commit Reviewed

e5fa3b0 — retry HTTP in clone bench for pasta forwarding under load ✅ Adds retry logic (3 attempts, 500ms backoff) to clone_http() in benches/clone.rs, addressing the same pasta L4 translation race that exec.rs already handles. The author's fix supersedes the auto-fix in PR #538 (which can now be closed).

New Findings

[LOW] Retry count inconsistency between clone.rs (3) and exec.rs (10)

benches/exec.rs:584 retries 10 times while the new benches/clone.rs:274 retries only 3 times for the identical race condition (pasta networking not ready after clone restore). If 10 retries were needed in exec.rs to handle heavy CI load, 3 may still be insufficient in clone.rs. Not flagging higher since this is benchmark infrastructure, not production code.

Summary

The new commit is a reasonable fix for the CI failure. The two [MEDIUM] issues from the second review remain the only outstanding items, with a fix PR (#537) already created but not yet merged. PR #538 is now superseded by the author's own fix in e5fa3b0.



ejc3 merged commit cde74e4 into main on Mar 2, 2026
18 checks passed
ejc3 added a commit that referenced this pull request Mar 2, 2026
fc-agent: clock sync, ss -K fix, --no-swap cgroup isolation, --no-dirty-tracking