2 changes: 2 additions & 0 deletions .claude/CLAUDE.md
@@ -884,6 +884,8 @@ assert!(localhost_works, "Localhost port forwarding should work (requires route_
- No feature flag: Unprivileged tests run by default
- Features are compile-time gates - tests won't exist unless the feature is enabled
- Use `FILTER=` to further filter by name pattern: `make test-root FILTER=exec`
- For multiple tests or regex: `make test-root FILTER="-E 'test(/pattern1|pattern2/)'" STREAM=1`
- FILTER is a nextest substring match on test function name, NOT file name. Use `-E` for expressions.

**Common parallel test pitfalls and fixes:**

36 changes: 31 additions & 5 deletions .claude/skills/pr-workflow/SKILL.md
@@ -13,8 +13,9 @@ user-invocable: true
|------|---------|
| CI status (non-blocking) | `gh pr view <N> --json statusCheckRollup --jq '.statusCheckRollup[] \| "\(.name): \(.conclusion // "pending")"'` |
| CI status (blocking) | `gh pr checks <N>` |
| Read PR comments | `gh pr view <N> --json comments --jq '.comments[] \| "---\n" + .body'` |
| Read inline review comments | `gh api repos/$REPO/pulls/<N>/comments --jq '.[] \| "---\nfile: \(.path):\(.line // .original_line)\n\(.body)"'` |
| Read PR comments (Claude) | `gh pr view <N> --json comments --jq '.comments[] \| "---\n" + .body'` |
| Read inline comments (Codex) | `gh api repos/$REPO/pulls/<N>/comments --jq '.[] \| select(.user.login \| test("codex")) \| "---\nfile: \(.path):\(.line // .original_line)\n\(.body)"'` |
| Read all inline comments | `gh api repos/$REPO/pulls/<N>/comments --jq '.[] \| "---\nfile: \(.path):\(.line // .original_line)\n\(.body)"'` |
| Create PR | `git push -u origin <branch> && gh pr create --fill` |
| Merge standalone PR | `gh pr merge <N> --merge --delete-branch` |
| Merge stacked PR (base) | `gh pr merge <N> --merge` (NO `--delete-branch`!) |
@@ -100,16 +100,41 @@ git push

## Before Merging: Read ALL PR Comments

**MANDATORY before any merge, push, or PR update.** Run BOTH commands:
**MANDATORY before any merge, push, or PR update.** Run ALL THREE commands:

```bash
REPO=$(gh repo view --json nameWithOwner --jq '.nameWithOwner')

# PR-level comments
# PR-level comments (Claude review posts here)
gh pr view <N> --json comments --jq '.comments[] | "---\n" + .body'

# Inline code review comments (often more actionable)
# Inline code review comments (Codex review posts here)
gh api "repos/$REPO/pulls/<N>/comments" --jq '.[] | "---\nfile: \(.path):\(.line // .original_line)\n\(.body)"'

# PR reviews with body text (Codex review summary lives here)
gh api "repos/$REPO/pulls/<N>/reviews" --jq '.[] | select(.body != "") | "---\nauthor: \(.user.login)\n\(.body)"'
```

### Two Automated Reviewers

This repo has **two** automated code reviewers. Read findings from BOTH before merging.

**Claude Review** (`claude-claude` bot):
- Posts as a **PR-level comment** (visible via `gh pr view --json comments`)
- Runs as a GitHub Actions workflow on every push
- Severity levels: LOW, MEDIUM, HIGH

**Codex Review** (`chatgpt-codex-connector[bot]`):
- Posts a **PR review** with inline comments on specific lines
- Review summary: `gh api repos/$REPO/pulls/<N>/reviews` (look for `chatgpt-codex-connector`)
- Inline suggestions: `gh api repos/$REPO/pulls/<N>/comments` (filter by codex author)
- Priority badges: P0 (critical), P1 (important), P2 (suggestion)

**Read Codex inline comments specifically:**
```bash
# Codex inline review comments only
gh api "repos/$REPO/pulls/<N>/comments" \
--jq '.[] | select(.user.login | test("codex")) | "---\nfile: \(.path):\(.line // .original_line)\n\(.body)"'
```

### Check for auto-fix PRs
100 changes: 98 additions & 2 deletions DESIGN.md
@@ -2191,6 +2191,101 @@ Additional defense:

---

## Clone Memory Sharing

### Problem

Multiple clones from the same snapshot should share physical memory pages for read-only data.
A large container VM may have 131 GB of guest memory, but most of it is identical across clones
(kernel, application code, page cache). Only pages each clone writes to should be unique
(Private_Dirty).

### Current State (2026-02-28)

Three memory backends were tested. Results for two 131 GB clones:

| Backend | Per-clone RSS | Shared | Private_Clean | Pressure | Status |
|---------|--------------|--------|---------------|----------|--------|
| **File** (`MAP_PRIVATE` on memory.bin) | 44 GB | 1.8 MB | 33.6 GB | 40% | Broken — KVM CoW-copies pages into Private_Clean even for reads |
| **UFFD MISSING+COPY** | 21 GB | 0 | 0 | 11% | Works but no sharing — each fault copies data to fresh anon page |
| **UFFD MINOR+CONTINUE** (not implemented) | ~5 GB (est.) | ~80 GB (est.) | 0 | ~2% (est.) | True sharing via shared memfd |

**File backend**: Firecracker maps memory.bin with `MAP_PRIVATE | PROT_READ | PROT_WRITE`.
When KVM handles a guest page fault, even for a read, the page becomes Private_Clean in the
process's address space. This happens because the kernel creates a private copy of the
file-backed page when setting up writable EPT mappings. The `track_dirty_pages` flag
(`--no-dirty-tracking` CLI) controls KVM's dirty bitmap tracking but does NOT prevent
the Private_Clean CoW behavior — that's inherent to MAP_PRIVATE with writable mappings.

**UFFD MISSING+COPY**: Firecracker creates anonymous memory (`MAP_PRIVATE | MAP_ANONYMOUS`)
and registers it with UFFD in MISSING mode. On each page fault, the UFFD server reads from
memory.bin and calls `UFFDIO_COPY` to fill the page. Each clone gets its own physical copy.
No Private_Clean bloat (no file-backed mapping), but no sharing either.
RSS is lower than File mode because only faulted pages are populated (lazy loading).

**KSM**: Disabled (`/sys/kernel/mm/ksm/run=0`). Firecracker doesn't mark guest memory
with `MADV_MERGEABLE`. Even if enabled, KSM is after-the-fact dedup with scanning overhead.

### Proposed: UFFD MINOR Mode with Shared Memfd

The kernel (6.13+) supports `UFFD_FEATURE_MINOR_SHMEM` — verified on our host.
The `userfaultfd` crate (0.9.0) supports `register_with_mode()` with raw bits.

**Architecture:**

```
┌─────────────────────────────────────────────────────┐
│ fcvm snapshot serve │
│ │
│ 1. memfd_create("snapshot", 131 GB) │
│ 2. Populate memfd from memory.bin │
│ 3. Accept clone connections via UDS │
│ 4. Send memfd fd + UFFD fd to each clone │
│ │
│ On MINOR fault from clone: │
│ UFFDIO_CONTINUE → maps existing memfd page │
│ (zero-copy, page shared across all clones) │
└─────────────────────────────────────────────────────┘
│ memfd fd shared via UDS
┌──────────────────────┐ ┌──────────────────────┐
│ Clone 1 (Firecracker) │ │ Clone 2 (Firecracker) │
│ │ │ │
│ Guest memory: │ │ Guest memory: │
│ MAP_SHARED on memfd │ │ MAP_SHARED on memfd │
│ + UFFD MINOR mode │ │ + UFFD MINOR mode │
│ │ │ │
│ Read → shared page │ │ Read → shared page │
│ Write → kernel CoW │ │ Write → kernel CoW │
└──────────────────────┘ └──────────────────────┘
```

**Changes required:**

1. **Firecracker** (`persist.rs`):
- `guest_memory_from_uffd()`: Use `memfd_backed()` instead of `anonymous()` for guest memory
- Pass memfd fd from the UFFD server (received via UDS alongside the UFFD fd)
- `uffd.register_with_mode(ptr, size, RegisterMode::from_bits_truncate(4))` for MINOR mode

2. **fcvm UFFD server** (`src/uffd/server.rs`):
- Create memfd, populate from memory.bin (one-time cost at serve start)
- Send memfd fd to each clone via UDS handshake
- On MINOR fault: `UFFDIO_CONTINUE` (maps existing page) instead of `UFFDIO_COPY` (copies data)

3. **fcvm serve** (`src/commands/snapshot.rs`):
- `snapshot serve` creates and populates the memfd once
- Each clone receives the same memfd fd

**Why this works:** With `MAP_SHARED` on the memfd, all clones' page tables can point to the
same physical pages. `UFFDIO_CONTINUE` resolves a MINOR fault by installing a PTE pointing to
the already-populated memfd page — no data copy. Writes trigger kernel-level CoW (the page gets
copied to anonymous memory for that process only). This is the same mechanism used by CRIU for
lazy migration and by cloud providers for VM density.

**Kernel support:** Verified `UFFD_FEATURE_MINOR_SHMEM` (bit 10) is available on our
kernel 6.13. The `userfaultfd` crate 0.9.0 doesn't export a `MINOR` constant but
`RegisterMode::from_bits_truncate(4)` works since it's a bitflags struct.

## References

- [Firecracker Documentation](https://github.com/firecracker-microvm/firecracker/tree/main/docs)
@@ -2199,11 +2294,12 @@ Additional defense:
- [passt/pasta](https://passt.top/)
- [iptables Documentation](https://netfilter.org/documentation/)
- [KVM Documentation](https://www.linux-kvm.org/page/Documents)
- [Linux UFFD Documentation](https://docs.kernel.org/admin-guide/mm/userfaultfd.html)

---

**End of Design Specification**

*Version: 2.3*
*Date: 2025-12-25*
*Version: 2.4*
*Date: 2026-02-28*
*Author: fcvm project*
50 changes: 30 additions & 20 deletions benches/clone.rs
@@ -263,31 +263,41 @@ impl CloneFixture {
std::thread::sleep(Duration::from_millis(50));
};

// Make HTTP request to nginx
// verify_port_forwarding() runs after snapshot restore and confirms end-to-end
// data flow through pasta before health monitor starts.
// Make HTTP request to nginx via loopback port forward (pasta).
// verify_port_forwarding() already confirmed the port works, but under
// heavy CI load the first request can get an empty response if pasta's
// internal forwarding state is briefly inconsistent. Retry up to 3 times.
let addr = format!("{}:{}", loopback_ip, health_port);
let mut stream = TcpStream::connect(&addr).expect("failed to connect to nginx");
stream
.set_read_timeout(Some(Duration::from_secs(5)))
.unwrap();
stream
.set_write_timeout(Some(Duration::from_secs(5)))
.unwrap();

let request = "GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n";
stream
.write_all(request.as_bytes())
.expect("failed to send HTTP request");

let mut response = Vec::new();
let _ = stream.read_to_end(&mut response);
let mut last_response = String::new();
for attempt in 0..3 {
if attempt > 0 {
std::thread::sleep(Duration::from_millis(500));
}
let Ok(mut stream) = TcpStream::connect(&addr) else {
continue;
};
let _ = stream.set_read_timeout(Some(Duration::from_secs(5)));
let _ = stream.set_write_timeout(Some(Duration::from_secs(5)));

if stream.write_all(request.as_bytes()).is_err() {
continue;
}

let mut response = Vec::new();
let _ = stream.read_to_end(&mut response);
last_response = String::from_utf8_lossy(&response).to_string();

if last_response.contains("200 OK") {
break;
}
}

let response_str = String::from_utf8_lossy(&response);
if !response_str.contains("200 OK") {
if !last_response.contains("200 OK") {
panic!(
"unexpected HTTP response: {}",
&response_str[..std::cmp::min(200, response_str.len())]
"unexpected HTTP response after 3 attempts: {}",
&last_response[..std::cmp::min(200, last_response.len())]
);
}

76 changes: 76 additions & 0 deletions fc-agent/src/agent.rs
@@ -152,6 +152,11 @@ pub async fn run() -> Result<()> {
};
let has_shared_volume = mounted_fuse_paths.iter().any(|p| p == "/mnt/shared");

// Start chronyd for ongoing NTP time sync.
// Must be after FUSE mounts so /etc-host/chrony.conf is readable.
// makestep 1 -1 allows stepping the clock at any time (critical after snapshot restore).
start_chronyd().await;

let mounted_disk_paths = if !plan.extra_disks.is_empty() {
eprintln!(
"[fc-agent] mounting {} extra disk(s)",
@@ -432,3 +437,74 @@ pub async fn run() -> Result<()> {

system::shutdown_vm(exit_code).await
}

/// Start chronyd for NTP time sync.
///
/// Writes a chrony config using the host's NTP servers (from /etc-host/chrony.conf),
/// then starts chronyd as a daemon. `makestep 1 -1` allows stepping the clock at
/// any time, which is critical after snapshot restore when the drift can be hours.
async fn start_chronyd() {
// Read NTP server addresses from host's chrony.conf
let host_conf = std::path::Path::new("/etc-host/chrony.conf");
let mut server_addrs = Vec::new();
if let Ok(content) = tokio::fs::read_to_string(host_conf).await {
for line in content.lines() {
// Parse both "server" and "pool" directives (some distros only use pool)
if line.starts_with("server ") || line.starts_with("pool ") {
if let Some(addr) = line.split_whitespace().nth(1) {
server_addrs.push(addr.to_string());
}
}
}
}

// Write minimal config — servers are added dynamically via chronyc because
// chronyd's config parser doesn't handle bare IPv6 addresses correctly.
let config = "makestep 1 -1\ndriftfile /var/lib/chrony/drift\ncmdallow 127.0.0.1\n";

let _ = tokio::fs::create_dir_all("/var/lib/chrony").await;
let _ = tokio::fs::create_dir_all("/var/run/chrony").await;
let _ = tokio::fs::write("/etc/chrony.conf", config).await;

// Kill any existing chronyd (systemd may have started one as _chrony user,
// which can't send UDP in this VM). Then restart as root.
let _ = tokio::process::Command::new("pkill")
.args(["-x", "chronyd"])
.output()
.await;
tokio::time::sleep(std::time::Duration::from_millis(500)).await;

match tokio::process::Command::new("/usr/sbin/chronyd")
.args(["-u", "root"])
.output()
.await
{
Ok(out) if out.status.success() => {
eprintln!("[fc-agent] chronyd started (NTP time sync)");
}
Ok(out) => {
eprintln!(
"[fc-agent] WARNING: chronyd failed: {}",
String::from_utf8_lossy(&out.stderr).trim()
);
return;
}
Err(e) => {
eprintln!("[fc-agent] WARNING: failed to start chronyd: {}", e);
return;
}
}

// Add servers dynamically via chronyc (works reliably with IPv6)
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
for addr in &server_addrs {
let _ = tokio::process::Command::new("/usr/bin/chronyc")
.args(["add", "server", addr, "iburst"])
.output()
.await;
}
eprintln!(
"[fc-agent] added {} NTP servers via chronyc",
server_addrs.len()
);
}