ejc3 · Dec 29, 2025
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
@@ -1,5 +1,16 @@
 # fcvm Development Log
 
+## STACKED PRs BY DEFAULT
+
+**All work goes in stacked PRs.** Each new PR should be based on the previous one, not main.
+
+```
+main → PR#55 → PR#56 → PR#57  (correct)
+main → PR#55, main → PR#56    (wrong - parallel branches)
+```
+
+Only branch directly from main when explicitly starting independent work.
+
 ## NO HACKS
 
 **Fix the root cause, not the symptom.** When something fails:
@@ -13,15 +24,37 @@ Examples of hacks to avoid:
 - Clearing caches instead of updating tools
 - Using `|| true` to ignore errors
 
+## Debugging Test Hangs
+
+**When a test hangs, look at what it's ACTUALLY DOING - don't blame "stale processes".**
+
+```bash
+# WRONG approach: blindly killing "old" processes
+ps aux | grep fcvm   # "I see old processes, they must be blocking!"
+sudo pkill -9 fcvm   # "Fixed it!" (No, you didn't debug anything)
+
+# CORRECT approach: understand what the test is doing
+ps aux | grep -E "fcvm|script|cat"
+# See: script -q -c ./target/release/fcvm exec --pid 1083915 -t -- cat
+# The test is running `cat` in TTY mode - it's waiting for input!
+# The bug is in the test, not "stale processes"
+```
+
+**Common causes of hanging tests:**
+- Command waiting for stdin (like `cat` without EOF signal)
+- Missing Ctrl+D (0x04) in TTY mode tests
+- Blocking reads without timeout
+- Deadlocks in async code
+
+**The process list tells you EXACTLY what's happening.** Read it.
+
 ## Overview
 fcvm is a Firecracker VM manager for running Podman containers in lightweight microVMs. This document tracks implementation findings and decisions.
 
 ## Nested Virtualization (Inception)
 
 fcvm supports running inside another fcvm VM ("inception") using ARM64 FEAT_NV2.
-
-**LIMITATION**: Only **one level** of nesting currently works (Host → L1). Recursive nesting
-(L1 → L2 → L3...) is blocked because L1's KVM reports `KVM_CAP_ARM_EL2=0`.
+Recursive nesting (Host → L1 → L2 → ...) is enabled via the `arm64.nv2` kernel boot parameter.
 
 ### Requirements
 
@@ -32,11 +65,12 @@ fcvm supports running inside another fcvm VM ("inception") using ARM64 FEAT_NV2.
 ### How It Works
 
 1. Set `FCVM_NV2=1` environment variable (auto-set when `--kernel` flag is used)
-2. fcvm passes `--enable-nv2` to Firecracker, which enables `HAS_EL2` + `HAS_EL2_E2H0` vCPU features
-3. vCPU boots at EL2h so guest kernel sees HYP mode available
-4. EL2 registers are initialized: HCR_EL2, CNTHCTL_EL2, VMPIDR_EL2, VPIDR_EL2
-5. Guest kernel initializes KVM: "Hyp nVHE mode initialized successfully"
-6. Nested fcvm can now create VMs using the guest's KVM
+2. fcvm passes `--enable-nv2` to Firecracker, which enables the `HAS_EL2` vCPU feature
+3. vCPU boots at EL2h in VHE mode (E2H=1) so guest kernel sees HYP mode available
+4. EL2 registers are initialized: HCR_EL2, VMPIDR_EL2, VPIDR_EL2
+5. Guest kernel initializes KVM: "VHE mode initialized successfully"
+6. `arm64.nv2` boot param overrides MMFR4 to advertise NV2 support
+7. L1 KVM reports `KVM_CAP_ARM_EL2=1`, enabling recursive L2+ VMs
 
 ### Running Inception
 
@@ -61,9 +95,9 @@ fcvm podman run --name inner --network bridged alpine:latest
 
 Firecracker fork with NV2 support: `ejc3/firecracker:nv2-inception`
 
-- `HAS_EL2` (bit 7): Enables virtual EL2 for guest
-- `HAS_EL2_E2H0` (bit 8): Forces nVHE mode (avoids timer trap storm)
+- `HAS_EL2` (bit 7): Enables virtual EL2 for guest in VHE mode
 - Boot at EL2h: Guest kernel must see CurrentEL=EL2 on boot
+- VHE mode (E2H=1): Required for NV2 support in guest (nVHE mode doesn't support NV2)
 - VMPIDR_EL2/VPIDR_EL2: Proper processor IDs for nested guests
 
 ### Tests
@@ -75,13 +109,89 @@ make test-root FILTER=inception
 - `test_kvm_available_in_vm`: Verifies /dev/kvm works in guest
 - `test_inception_run_fcvm_inside_vm`: Full inception test
 
-### Recursive Nesting Limitation
+### Recursive Nesting: The ID Register Problem (Solved)
 
-L1's KVM reports `KVM_CAP_ARM_EL2=0`, blocking L2+ VMs.
+**Problem**: L1's KVM initially reported `KVM_CAP_ARM_EL2=0`, blocking L2+ VMs.
 
-**Root cause**: `kvm-arm.mode=nested` requires VHE (kernel at EL2), but NV2's E2H0 flag forces nVHE (kernel at EL1). E2H0 is required to avoid timer trap storms.
+**Root cause**: The ARM architecture provides no mechanism to virtualize ID registers for virtual EL2.
 
-**Status**: Waiting for kernel improvements. Kernel NV2 patches mark recursive nesting as "not tested yet".
+1. Host KVM stores correct emulated ID values in `kvm->arch.id_regs[]`
+2. `HCR_EL2.TID3` controls trapping of ID register reads - but only for **EL1 reads**
+3. When guest runs at virtual EL2 (with NV2), ID register reads are EL2-level accesses
+4. EL2-level accesses don't trap via TID3 - they read hardware directly
+5. Guest sees `MMFR4=0` (hardware), not `MMFR4=NV2_ONLY` (emulated)
+
+**Solution**: Use the kernel's ID register override mechanism with the `arm64.nv2` boot parameter.
+
+1. Added `arm64.nv2` alias for `id_aa64mmfr4.nv_frac=2` (NV2_ONLY)
+2. Changed `FTR_LOWER_SAFE` to `FTR_HIGHER_SAFE` for MMFR4 to allow upward overrides
+3. Kernel patch: `kernel/patches/mmfr4-override.patch`
+
+**Why it's safe**: The host KVM *does* provide NV2 emulation - we're just fixing the guest's
+view of this capability. We're not faking a feature, we're correcting a visibility issue.
+
+**Verification**:
+```
+$ dmesg | grep mmfr4
+CPU features: SYS_ID_AA64MMFR4_EL1[23:20]: forced to 2
+
+$ check_kvm_caps
+KVM_CAP_ARM_EL2 (cap 240) = 1
+  -> Nested virtualization IS supported by KVM (VHE mode)
+```
+
+## FUSE Performance Tracing
+
+Enable per-operation tracing to diagnose FUSE latency issues (especially in nested VMs).
+
+### Enabling Tracing
+
+Set `FCVM_FUSE_TRACE_RATE=N` to trace every Nth FUSE operation:
+
+```bash
+# Trace every 100th request (recommended for benchmarks)
+FCVM_FUSE_TRACE_RATE=100 fcvm podman run --name test nginx:alpine
+
+# Trace every request (high overhead, use for debugging specific issues)
+FCVM_FUSE_TRACE_RATE=1 fcvm podman run ...
+```
+
+The env var is automatically passed to the guest via kernel boot parameters (`fuse_trace_rate=N`).
+
+### Trace Output Format
+
+```
+[TRACE     lookup] total=8940µs srv=159µs | fs=149 | to_srv=33 to_cli=1974
+[TRACE      fsync] total=70000µs srv=3000µs | fs=2900 | to_srv=? to_cli=?
+```
+
+| Field | Meaning |
+|-------|---------|
+| `total` | End-to-end client round-trip time |
+| `srv` | Server-side processing (reliable) |
+| `fs` | Filesystem operation time (subset of srv) |
+| `to_srv` | Network: client → server (may show `?` if clocks differ) |
+| `to_cli` | Network: server → client (may show `?` if clocks differ) |
+
+### L2 Performance Expectations
+
+Based on FUSE-over-FUSE architecture:
+
+| Operation | Expected L2/L1 Ratio | Notes |
+|-----------|---------------------|-------|
+| `stat`/metadata | ~2x | One extra FUSE layer |
+| Async writes | ~3x | Data transfer overhead |
+| Sync writes (fsync) | ~8-10x | fsync propagates synchronously through layers |
+
+The fsync amplification occurs because each L2 fsync must wait for L1's fsync to complete,
+which itself waits for the host disk sync. This is fundamental to FUSE-over-FUSE durability.
+
+### Related Configuration
+
+```bash
+# Reduce FUSE readers for nested VMs (saves memory)
+FCVM_FUSE_READERS=8 fcvm podman run ...  # Default: 64 readers × 8MB stack = 512MB
+```
 
 ## Quick Reference
 
@@ -319,6 +429,58 @@ Tested locally:
 Fixed CI. Tested and it works.
 ```
 
+#### Complex/Advanced PRs
+
+**For non-trivial changes (architectural, workarounds, kernel patches), include:**
+
+1. **The Problem** - What was failing and why. Include root cause analysis.
+2. **The Solution** - How you fixed it. Explain the approach: not just what changed, but why it was done this way.
+3. **Why It's Safe** - For workarounds or unusual approaches, explain why it won't break things.
+4. **Alternatives Considered** - What else you tried and why it didn't work.
+5. **Test Results** - Actual command output proving it works.
+
+**Example structure for complex PRs:**
+
+```markdown
+## Summary
+One-line description of what this enables.
+
+## The Problem
+- What was broken
+- Root cause analysis (be specific)
+- Why existing approaches didn't work
+
+## The Solution
+1. First key change and why
+2. Second key change and why
+3. Why this approach over alternatives
+
+### Why This Is Safe
+- Explain non-obvious safety guarantees
+- Address potential concerns upfront
+
+### Alternatives Considered
+1. Alternative A - why it didn't work
+2. Alternative B - why it was more invasive
+
+## Test Results
+\`\`\`
+$ actual-command-run
+actual output proving it works
+\`\`\`
+
+## Test Plan
+- [x] Test case 1
+- [x] Test case 2
+```
+
+**When to use this format:**
+- Kernel patches or low-level system changes
+- Workarounds for architectural limitations
+- Changes that might seem "wrong" without context
+- Multi-commit PRs with complex interactions
+- Anything where a reviewer might ask "why not just...?"
+
 **Why evidence matters:**
 - Proves the fix works, not just "looks right"
 - Local testing is sufficient - don't need CI green first
@@ -382,18 +544,25 @@ Why: String matching breaks when JSON formatting changes (spaces, newlines, fiel
 **This project is designed for extreme scale, speed, and correctness.** Test failures are bugs, not excuses.
 
 **NEVER dismiss failures as:**
-- "Resource contention"
-- "Timing issues"
-- "Flaky tests"
-- "Works on my machine"
+- "Resource contention" - **This is NEVER the answer. It's always a race condition.**
+- "Timing issues" - **This means there's a race condition. Find and fix it.**
+- "Flaky tests" - **No such thing. The test found a bug. Fix the bug.**
+- "Works on my machine" - **Your machine just got lucky. The bug is real.**
+
+**"Resource contention" is a lie you tell yourself to avoid finding the real bug.** When a test fails under load:
+1. The test is correct - it found a bug
+2. The bug only manifests under certain timing conditions
+3. This is called a **race condition**
+4. You MUST find the race and fix it
 
 **ALWAYS:**
-1. Investigate the actual root cause
-2. Find evidence in logs, traces, or code
-3. Fix the underlying bug
-4. Add regression tests if needed
+1. **Look at the logs** - The answer is always there
+2. Investigate the actual root cause with evidence
+3. Find the race condition - there IS one
+4. Fix the underlying bug
+5. Add regression tests if needed
 
-If a test fails intermittently, that's a **concurrency bug** or **race condition** that must be fixed, not ignored.
+If a test fails intermittently or only under parallel execution, that's a **concurrency bug** or **race condition** that must be fixed, not ignored. The test passed in isolation? Great - that narrows down the timing window where the race occurs.
 
 ### POSIX Compliance Testing
 
@@ -1294,6 +1463,16 @@ RUST_LOG="fuse_pipe=info,fuse-pipe=info,passthrough=debug" sudo -E cargo test --
 RUST_LOG="passthrough=debug" sudo -E cargo test --release -p fuse-pipe --test integration test_list_directory -- --nocapture
 ```
 
+## Exec Command Flags
+
+`fcvm exec` uses `-i` and `-t` separately, matching podman/docker:
+- `-t`: allocate PTY (for colors/formatting)
+- `-i`: forward stdin
+- `-it`: both (interactive shell)
+- neither: plain exec
+
+**NO backward compatibility wrappers.** When the API changed from `run_tty_mode(stream)` to `run_tty_mode(stream, interactive)`, all callers were updated directly - no deprecated functions or compatibility shims.
+
 ## References
 - Main documentation: `README.md`
 - Design specification: `DESIGN.md`
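
Note on the "Debugging Test Hangs" section in the CLAUDE.md diff above: the first listed cause (a command such as `cat` waiting for stdin with no EOF signal) can be reproduced with the standard library alone. The sketch below is illustrative and not taken from the fcvm code base; `cat_with_eof` is a hypothetical helper name, and it uses a plain pipe rather than a PTY. Per the diff, the EOF role in TTY mode is played by Ctrl+D (0x04) instead of closing the pipe.

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};

/// Feed `input` to `cat` over a pipe and return what it echoes back.
/// Hypothetical helper for illustration; intended for small inputs only,
/// since a large write could fill the pipe before we start reading.
fn cat_with_eof(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut child = Command::new("cat")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Take ownership of stdin so it can be closed explicitly.
    let mut stdin = child.stdin.take().expect("stdin was piped");
    stdin.write_all(input)?;

    // Dropping the handle closes the pipe, so `cat` sees EOF and exits.
    // Without this drop, the read below would block forever: exactly the
    // "hang" that the process list would show as a waiting `cat`.
    drop(stdin);

    let mut output = Vec::new();
    child
        .stdout
        .take()
        .expect("stdout was piped")
        .read_to_end(&mut output)?;
    child.wait()?;
    Ok(output)
}

fn main() -> std::io::Result<()> {
    let echoed = cat_with_eof(b"hello\n")?;
    print!("{}", String::from_utf8_lossy(&echoed));
    Ok(())
}
```

Commenting out the `drop(stdin)` line makes the program hang, and `ps` then shows the idle `cat` child, which is the diagnosis path the section describes.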

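Note on the "Trace Output Format" section in the CLAUDE.md diff above: when comparing L1 and L2 runs it may help to parse the trace lines instead of reading them by eye. The following is a minimal sketch that assumes the format is exactly what the two sample lines show (values in microseconds, an optional `µs` suffix, `?` for unknown network legs). `TraceLine`, `parse_us`, and `parse_trace_line` are hypothetical names, not part of fcvm or fuse-pipe.

```rust
/// One parsed trace line; all durations are in microseconds.
#[derive(Debug, PartialEq)]
struct TraceLine {
    op: String,
    total_us: u64,
    srv_us: u64,
    fs_us: u64,
    to_srv_us: Option<u64>, // None when the line shows `?`
    to_cli_us: Option<u64>, // None when the line shows `?`
}

/// Parse a value such as `8940µs` or `149`; `?` (or anything non-numeric) yields None.
fn parse_us(value: &str) -> Option<u64> {
    value.trim_end_matches("µs").parse().ok()
}

/// Parse a line shaped like the samples in the diff.
/// Returns None if the prefix or a required field is missing.
fn parse_trace_line(line: &str) -> Option<TraceLine> {
    let rest = line.trim().strip_prefix("[TRACE")?;
    let (op, fields) = rest.split_once(']')?;

    let (mut total, mut srv, mut fs) = (None, None, None);
    let (mut to_srv, mut to_cli) = (None, None);

    // Tokens without `=` (the `|` separators) are skipped.
    for token in fields.split_whitespace() {
        if let Some((key, value)) = token.split_once('=') {
            match key {
                "total" => total = parse_us(value),
                "srv" => srv = parse_us(value),
                "fs" => fs = parse_us(value),
                "to_srv" => to_srv = parse_us(value),
                "to_cli" => to_cli = parse_us(value),
                _ => {} // ignore unknown fields
            }
        }
    }

    Some(TraceLine {
        op: op.trim().to_string(),
        total_us: total?,
        srv_us: srv?,
        fs_us: fs?,
        to_srv_us: to_srv,
        to_cli_us: to_cli,
    })
}

fn main() {
    let samples = [
        "[TRACE     lookup] total=8940µs srv=159µs | fs=149 | to_srv=33 to_cli=1974",
        "[TRACE      fsync] total=70000µs srv=3000µs | fs=2900 | to_srv=? to_cli=?",
    ];
    for line in samples {
        println!("{:?}", parse_trace_line(line));
    }
}
```

Keeping `to_srv` and `to_cli` as `Option` mirrors the table's note that these fields may show `?` when clocks differ, while `total`, `srv`, and `fs` are treated as required.
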
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,7 +1,7 @@
 [workspace]
-members = [".", "fuse-pipe", "fc-agent"]
+members = [".", "fuse-pipe", "fc-agent", "exec-proto"]
 # Build all members by default (not just the root package)
-default-members = [".", "fuse-pipe", "fc-agent"]
+default-members = [".", "fuse-pipe", "fc-agent", "exec-proto"]
 # Exclude sync-test (used only for Makefile sync verification)
 exclude = ["sync-test"]
 # Resolver v2 makes --no-default-features work across all workspace members
@@ -52,7 +52,7 @@ tracing = "0.1"
 tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
 tracing-log = "0.2"  # Bridge log crate to tracing (for fuse-backend-rs)
 libc = "0.2"
-nix = { version = "0.29", features = ["sched", "net", "fs", "user", "mount", "signal"] }
+nix = { version = "0.29", features = ["sched", "net", "fs", "user", "mount", "signal", "term"] }
 chrono = { version = "0.4", features = ["serde"] }
 tempfile = "3"
 rand = "0.8"
@@ -65,6 +65,7 @@ memmap2 = "0.9"
 vmm-sys-util = "0.12"
 shell-words = "1"
 fuse-pipe = { path = "fuse-pipe", default-features = false }
+exec-proto = { path = "exec-proto" }
 url = "2"
 tokio-util = "0.7"
 regex = "1.12.2"