fix(windows): make port/process takeover actually free the port#2552
Conversation
Three bugs in the Windows side of the stale-listener recovery path:
1. `kill_pid_force` returned `Err` for every non-zero taskkill exit,
including exit code 128 ("There is no running instance"). That's the
normal race when the process exits between our pid lookup and the
force-kill — recovery would bail out instead of treating the port as
freed. Now classified through `classify_taskkill_force_status`, with
the same "already gone is success" semantics Unix already has for
ESRCH. Also recognizes the stderr "not found" message when an
intermediate shell normalizes the exit code.
2. `parse_netstat_pid` happily returned PIDs 0 (System Idle) and 4
(NT Kernel) when HTTP.sys / driver-level reservations showed up as
LISTENING. Those can't be killed from user mode — taskkill errors,
and recovery aborts. Now skipped so callers fall back to the
port-reroute path. Also walks past the protected entry to the real
user-mode owner on dual-stack (`[::]:port` + `127.0.0.1:port`) lines.
3. `kill_pid_term` / `kill_pid_force` now refuse pids 0 and 4 as a
second line of defense in case the parser is ever bypassed.
Tests:
- Parser: existing `parse_netstat_pid_finds_listening_entry` plus new
`parse_netstat_pid_skips_protected_kernel_pids` and
`parse_netstat_pid_falls_through_protected_to_real_owner_on_dual_stack`
(cross-platform).
- Windows-only unit: `classify_taskkill_force_*` covers exit 0, exit
128, stderr-only "not found", and genuine access-denied; plus
`is_protected_windows_pid` and refusal of pids 0/4 at the public API.
- Windows-only end-to-end: `windows_port_takeover_finds_and_kills_listener`
spawns a real PowerShell TCP listener on an ephemeral port, walks the
production path (`find_pid_on_port` → `kill_pid_force` → poll port),
asserts the port is freed, and re-calls `kill_pid_force` to verify
idempotency on an already-gone pid.
|
Warning Review limit reached
Your plan currently allows 2 reviews/hour. Refill in 36 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR prevents attempts to kill kernel-owned Windows PIDs (0 and 4) by skipping them during netstat parsing, refusing termination requests, adding taskkill result classification, and providing unit and end-to-end tests validating the flow and idempotency. ChangesWindows Protected PID Handling for Port Takeover
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src-tauri/src/core_process_tests.rs`:
- Around line 384-389: The assertion comparing pid and child.id() can panic
before the spawned child is cleaned up; modify the test around the assert_eq! so
that if pid != child.id() you first ensure the spawned child (child) is
terminated (call child.kill() and wait/ignore errors) and/or join its handle,
then panic or assert-fail with the same message; alternatively replace the
direct assert_eq! with an if-check that performs child.kill().ok() and
child.wait().ok() before invoking panic! or failing the test, referencing the
existing pid and child.id() variables.
In `@app/src-tauri/src/process_kill.rs`:
- Around line 213-216: The current predicate in classify_taskkill_force_status
treats any stderr_str containing "could not be terminated" as a harmless
"already gone" case; change the condition so that stderr_str is only considered
Ok when it clearly indicates the process is absent (e.g., contains "not found"
or "no running instance") and do NOT treat a plain "could not be terminated" as
Ok—require it to be paired with "not found" (i.e., ("could not be terminated" &&
"not found")) or explicitly exclude stderr_str that contains "Access is denied";
update the if condition around stderr_str in classify_taskkill_force_status
accordingly to ensure access-denied errors are returned Err instead of Ok.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7da6c907-65b6-46d2-a6f6-d6f2d471cc88
📒 Files selected for processing (3)
app/src-tauri/src/core_process.rsapp/src-tauri/src/core_process_tests.rsapp/src-tauri/src/process_kill.rs
…ess paired CodeRabbit feedback on PR tinyhumansai#2552: 1. `classify_taskkill_force_status` was treating any stderr containing "could not be terminated" as success, but that phrase also appears in access-denied messages like "...could not be terminated. Reason: Access is denied." — masking real failures and leaving the process alive. Now requires "no running instance of the task" OR a pairing with "not found" before treating the run as success; bare access-denied stderr now propagates Err as intended. Added explicit tests for both the access-denied case (must fail) and the canonical `/T`-tree "no running instance of the task" case (must succeed). 2. `windows_port_takeover_finds_and_kills_listener` could panic on the pid sanity check before the spawned PowerShell listener was reaped, leaving a 60-second listener squatting on the test port. Now tears down the child via `kill()` + `wait()` before any panic.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src-tauri/src/process_kill.rs`:
- Around line 281-288: The test
classify_taskkill_force_treats_no_running_instance_as_success currently passes
Some(128) which triggers the early exit path and never exercises the stderr
fallback branch; change the exit code argument to a non-128 masked value (e.g.,
Some(1) or Some(0)) so classify_taskkill_force_status actually inspects stderr
and validates the "no running instance of the task" message in the test body,
ensuring the stderr-matching branch is covered; update only the argument passed
to classify_taskkill_force_status in that test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: acb78815-7269-4458-bf18-03b735b26640
📒 Files selected for processing (2)
app/src-tauri/src/core_process_tests.rsapp/src-tauri/src/process_kill.rs
Per CodeRabbit on PR tinyhumansai#2552: passing Some(128) hit the early exit-128 fast path in classify_taskkill_force_status and never evaluated the new "no running instance of the task" stderr branch — so a regression in the fallback matcher would silently still pass. Use Some(1) (a masked non-128 code) so the stderr matching is actually covered.
Summary
kill_pid_forcenow treatstaskkillexit 128 ("process not found") as success, matching Unix ESRCH semantics. Previously every race between PID lookup and force-kill aborted recovery.parse_netstat_pidskips kernel-protected PIDs 0 (System Idle) and 4 (NT Kernel & System), which surface for HTTP.sys / driver-level socket reservations and cannot be signalled from user mode.TcpListenerand walks the fullfind_pid_on_port→kill_pid_forceflow end to end, including idempotency on an already-gone pid.Problem
On Windows, when a previous OpenHuman session left a stale core process bound to the RPC port, startup recovery (
CoreProcessHandle::takeover_stale_listener) often failed to free the port:kill_pid_forceusedstatus().success()to detect failure.taskkill /F /T /PID <pid>returns exit code 128 ("There is no running instance of the task") when the process exits between our pid lookup and the kill — a normal race. The Unix branch already handles the equivalent ESRCH as success; the Windows branch was strictly stricter and aborted recovery instead.parse_netstat_pidhappily returned PID 0 (System Idle Process) or PID 4 (NT Kernel & System) when those showed up as LISTENING on the port. Those PIDs are kernel-owned andtaskkillwill always fail against them, breaking recovery. They can appear when HTTP.sys or another driver has the socket reserved — in that case the right behavior is to fall back to a different port, not to try to kill the kernel.kill_pid_term/kill_pid_forcehad no defense in depth against ever receiving these protected PIDs from a future caller.Solution
app/src-tauri/src/process_kill.rs:classify_taskkill_force_status(code, stderr, pid)— exit 0 → Ok, exit 128 → Ok (already gone), stderr containing "not found" / "could not be terminated" → Ok (for hosts that normalize the exit code), everything else → Err with full diagnostic.kill_pid_forcenow uses.output()and routes through the classifier so the stderr fallback path is reachable.is_protected_windows_pid(pid)—pid == 0 || pid == 4. Bothkill_pid_termandkill_pid_forcerefuse protected PIDs with a descriptive error.app/src-tauri/src/core_process.rs:parse_netstat_pidskips lines whose PID is 0 or 4 and continues scanning (so a dual-stack listener like[::]:portshowing under PID 4 ahead of the real127.0.0.1:portowner still returns the genuine PID).Tests:
parse_netstat_pid_skips_protected_kernel_pids,parse_netstat_pid_falls_through_protected_to_real_owner_on_dual_stack.#[cfg(all(test, windows))]inprocess_kill.rs): exit 0 / exit 128 / stderr-only "not found" / genuine access-denied classifications, refusal of pids 0 and 4 by both kill functions, and a real-process round-trip that spawns acmd /C timeoutchild, force-kills it, and asserts the same PID can be force-killed again with no error.#[cfg(windows)]incore_process_tests.rs):windows_port_takeover_finds_and_kills_listenerspawns a real PowerShell TCP listener on an ephemeral port, walks the production path (find_pid_on_port→kill_pid_force→ pollis_port_open), and asserts the port is actually freed.Submission Checklist
#[cfg(windows)]; macOS/Linux coverage runs cannot exercise those paths. Cross-platform parser changes are covered by the newparse_netstat_pid_*unit tests that run on every host.Impact
app/src-tauri(the Tauri host). macOS and Linux paths are unchanged.taskkillfrom this code path.Related
Follows up on the stale-listener recovery work in Proactively kill stale old OpenHuman RPC processes before the Tauri app reuses their port #1130.
Closes:
Follow-up PR(s)/TODOs:
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/windows-port-kill69eea0121ee1da464e25655c0ecb3447e5089b72Validation Run
pnpm --filter openhuman-app format:checkpnpm typecheck— pre-existing failure onupstream/main(missingqrcode.react,@noble/ciphers/*,@tauri-apps/plugin-barcode-scannerdeps in iOS-client files I did not touch). Reproduced on a cleanupstream/maincheckout. Usedgit push --no-verifyper the CLAUDE.md policy for pre-existing unrelated breakage.cargo test --manifest-path app/src-tauri/Cargo.toml --lib parse_netstat_pid— 3/3 pass.cargo check --manifest-path app/src-tauri/Cargo.toml --tests— clean.Validation Blocked
command:pnpm typecheckerror:Cannot find module 'qrcode.react' / '@noble/ciphers/chacha' / '@noble/ciphers/webcrypto' / '@tauri-apps/plugin-barcode-scanner'impact:None — reproduces verbatim on a cleanupstream/maincheckout, files are iOS-client code I did not modify.Behavior Changes
Parity Contract
kill_pid_forcestill surfaces genuine failures (access denied, spawn failure, unexpected exit codes) — only the "already gone" race is reclassified.parse_netstat_pidstill returnsSome(pid)for the same real-listener inputs the existing test covers; the new skip applies only to PIDs 0 and 4.Duplicate / Superseded PR Handling
Summary by CodeRabbit
Bug Fixes
Tests