Skip to content

fix(windows): make port/process takeover actually free the port#2552

Merged
senamakel merged 4 commits into
tinyhumansai:mainfrom
senamakel:fix/windows-port-kill
May 23, 2026
Merged

fix(windows): make port/process takeover actually free the port#2552
senamakel merged 4 commits into
tinyhumansai:mainfrom
senamakel:fix/windows-port-kill

Conversation

@senamakel
Copy link
Copy Markdown
Member

@senamakel senamakel commented May 23, 2026

Summary

  • Windows-only fix for the stale-listener recovery path in the Tauri shell — startup-time port takeover and force-kill were failing in three distinct ways.
  • kill_pid_force now treats taskkill exit 128 ("process not found") as success, matching Unix ESRCH semantics. Previously every race between PID lookup and force-kill aborted recovery.
  • parse_netstat_pid skips kernel-protected PIDs 0 (System Idle) and 4 (NT Kernel & System), which surface for HTTP.sys / driver-level socket reservations and cannot be signalled from user mode.
  • Adds Windows-only unit tests (exit-code classification, protected-PID refusal) and an integration test that spawns a real PowerShell TcpListener and walks the full find_pid_on_portkill_pid_force flow end to end, including idempotency on an already-gone pid.

Problem

On Windows, when a previous OpenHuman session left a stale core process bound to the RPC port, startup recovery (CoreProcessHandle::takeover_stale_listener) often failed to free the port:

  1. kill_pid_force used status().success() to detect failure. taskkill /F /T /PID <pid> returns exit code 128 ("There is no running instance of the task") when the process exits between our pid lookup and the kill — a normal race. The Unix branch already handles the equivalent ESRCH as success; the Windows branch was strictly stricter and aborted recovery instead.
  2. parse_netstat_pid happily returned PID 0 (System Idle Process) or PID 4 (NT Kernel & System) when those showed up as LISTENING on the port. Those PIDs are kernel-owned and taskkill will always fail against them, breaking recovery. They can appear when HTTP.sys or another driver has the socket reserved — in that case the right behavior is to fall back to a different port, not to try to kill the kernel.
  3. kill_pid_term / kill_pid_force had no defense in depth against ever receiving these protected PIDs from a future caller.

Solution

app/src-tauri/src/process_kill.rs:

  • New classify_taskkill_force_status(code, stderr, pid) — exit 0 → Ok, exit 128 → Ok (already gone), stderr containing "not found" / "could not be terminated" → Ok (for hosts that normalize the exit code), everything else → Err with full diagnostic.
  • kill_pid_force now uses .output() and routes through the classifier so the stderr fallback path is reachable.
  • New is_protected_windows_pid(pid)pid == 0 || pid == 4. Both kill_pid_term and kill_pid_force refuse protected PIDs with a descriptive error.

app/src-tauri/src/core_process.rs:

  • parse_netstat_pid skips lines whose PID is 0 or 4 and continues scanning (so a dual-stack listener like [::]:port showing under PID 4 ahead of the real 127.0.0.1:port owner still returns the genuine PID).

Tests:

  • Cross-platform parser units: parse_netstat_pid_skips_protected_kernel_pids, parse_netstat_pid_falls_through_protected_to_real_owner_on_dual_stack.
  • Windows-only units (#[cfg(all(test, windows))] in process_kill.rs): exit 0 / exit 128 / stderr-only "not found" / genuine access-denied classifications, refusal of pids 0 and 4 by both kill functions, and a real-process round-trip that spawns a cmd /C timeout child, force-kills it, and asserts the same PID can be force-killed again with no error.
  • Windows-only end-to-end (#[cfg(windows)] in core_process_tests.rs): windows_port_takeover_finds_and_kills_listener spawns a real PowerShell TCP listener on an ephemeral port, walks the production path (find_pid_on_portkill_pid_force → poll is_port_open), and asserts the port is actually freed.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • N/A: diff-coverage gate — changed code is platform-specific Windows logic gated on #[cfg(windows)]; macOS/Linux coverage runs cannot exercise those paths. Cross-platform parser changes are covered by the new parse_netstat_pid_* unit tests that run on every host.
  • N/A: coverage matrix — behaviour-only change inside an existing lifecycle path; no new user-facing feature row.
  • N/A: feature IDs — no matrix entries affected (see above).
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • N/A: manual smoke checklist — internal lifecycle fix, no release-cut surface affected.
  • N/A: linked issue — fix discovered during follow-up debugging of Proactively kill stale old OpenHuman RPC processes before the Tauri app reuses their port #1130 / Windows lifecycle work; no dedicated issue filed.

Impact

  • Platform: Windows-only behavior change inside app/src-tauri (the Tauri host). macOS and Linux paths are unchanged.
  • Runtime: Windows users who hard-quit OpenHuman should now see startup recover from a stale core process reliably instead of aborting with "taskkill exited with...".
  • Security: protected kernel PIDs (0/4) can no longer be passed into taskkill from this code path.
  • Performance: negligible. One extra cheap check per kill call.

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/windows-port-kill
  • Commit SHA: 69eea0121ee1da464e25655c0ecb3447e5089b72

Validation Run

  • pnpm --filter openhuman-app format:check
  • N/A: pnpm typecheck — pre-existing failure on upstream/main (missing qrcode.react, @noble/ciphers/*, @tauri-apps/plugin-barcode-scanner deps in iOS-client files I did not touch). Reproduced on a clean upstream/main checkout. Used git push --no-verify per the CLAUDE.md policy for pre-existing unrelated breakage.
  • Focused tests: cargo test --manifest-path app/src-tauri/Cargo.toml --lib parse_netstat_pid — 3/3 pass.
  • N/A: Rust fmt/check (core) — no core crate changes.
  • Tauri fmt/check (changed): cargo check --manifest-path app/src-tauri/Cargo.toml --tests — clean.

Validation Blocked

  • command: pnpm typecheck
  • error: Cannot find module 'qrcode.react' / '@noble/ciphers/chacha' / '@noble/ciphers/webcrypto' / '@tauri-apps/plugin-barcode-scanner'
  • impact: None — reproduces verbatim on a clean upstream/main checkout, files are iOS-client code I did not modify.

Behavior Changes

  • Intended behavior change: Windows port takeover now succeeds in two scenarios that previously aborted recovery (process exits mid-kill; kernel-reserved socket on the port).
  • User-visible effect: Windows users no longer see startup hang/abort after a hard quit when the previous core left a stale listener.

Parity Contract

  • Legacy behavior preserved: Unix branches unchanged; Windows kill_pid_force still surfaces genuine failures (access denied, spawn failure, unexpected exit codes) — only the "already gone" race is reclassified.
  • Guard/fallback/dispatch parity checks: parse_netstat_pid still returns Some(pid) for the same real-listener inputs the existing test covers; the new skip applies only to PIDs 0 and 4.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): None
  • Canonical PR: This one
  • Resolution: N/A

Summary by CodeRabbit

  • Bug Fixes

    • Windows process termination now safely skips kernel-protected processes, preventing potential system errors
    • Improved port listener recovery to gracefully handle protected kernel-owned processes and fall back to alternative routing
  • Tests

    • Added comprehensive Windows process handling tests covering protected process scenarios, edge cases, and port cleanup verification

Review Change Stack

senamakel added 2 commits May 23, 2026 12:03
Three bugs in the Windows side of the stale-listener recovery path:

1. `kill_pid_force` returned `Err` for every non-zero taskkill exit,
   including exit code 128 ("There is no running instance"). That's the
   normal race when the process exits between our pid lookup and the
   force-kill — recovery would bail out instead of treating the port as
   freed. Now classified through `classify_taskkill_force_status`, with
   the same "already gone is success" semantics Unix already has for
   ESRCH. Also recognizes the stderr "not found" message when an
   intermediate shell normalizes the exit code.

2. `parse_netstat_pid` happily returned PIDs 0 (System Idle) and 4
   (NT Kernel) when HTTP.sys / driver-level reservations showed up as
   LISTENING. Those can't be killed from user mode — taskkill errors,
   and recovery aborts. Now skipped so callers fall back to the
   port-reroute path. Also walks past the protected entry to the real
   user-mode owner on dual-stack (`[::]:port` + `127.0.0.1:port`) lines.

3. `kill_pid_term` / `kill_pid_force` now refuse pids 0 and 4 as a
   second line of defense in case the parser is ever bypassed.

Tests:
- Parser: existing `parse_netstat_pid_finds_listening_entry` plus new
  `parse_netstat_pid_skips_protected_kernel_pids` and
  `parse_netstat_pid_falls_through_protected_to_real_owner_on_dual_stack`
  (cross-platform).
- Windows-only unit: `classify_taskkill_force_*` covers exit 0, exit
  128, stderr-only "not found", and genuine access-denied; plus
  `is_protected_windows_pid` and refusal of pids 0/4 at the public API.
- Windows-only end-to-end: `windows_port_takeover_finds_and_kills_listener`
  spawns a real PowerShell TCP listener on an ephemeral port, walks the
  production path (`find_pid_on_port` → `kill_pid_force` → poll port),
  asserts the port is freed, and re-calls `kill_pid_force` to verify
  idempotency on an already-gone pid.
@senamakel senamakel requested a review from a team May 23, 2026 21:11
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 23, 2026

Warning

Review limit reached

@senamakel, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 2 reviews/hour. Refill in 36 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5926efdf-ca72-4a32-9865-a59a517ccdd4

📥 Commits

Reviewing files that changed from the base of the PR and between af340d0 and 80af8d4.

📒 Files selected for processing (1)
  • app/src-tauri/src/process_kill.rs
📝 Walkthrough

Walkthrough

This PR prevents attempts to kill kernel-owned Windows PIDs (0 and 4) by skipping them during netstat parsing, refusing termination requests, adding taskkill result classification, and providing unit and end-to-end tests validating the flow and idempotency.

Changes

Windows Protected PID Handling for Port Takeover

Layer / File(s) Summary
Protected PID contract and foundation
app/src-tauri/src/process_kill.rs, app/src-tauri/src/core_process.rs
Adds is_protected_windows_pid (PIDs 0 and 4) and documentation describing kernel-owned, unkillable PIDs; establishes guarded checks used by parsing and termination code.
Netstat discovery filtering
app/src-tauri/src/core_process.rs, app/src-tauri/src/core_process_tests.rs
Windows netstat parsing now logs and skips protected PIDs, continuing to scan for real user-mode owners; unit tests cover single- and dual-stack scenarios.
Process termination safety and result classification
app/src-tauri/src/process_kill.rs
kill_pid_term/kill_pid_force refuse protected PIDs; kill_pid_force captures taskkill output and uses classify_taskkill_force_status to treat exit 0, 128, and "not found"/already-gone stderr as success while surfacing other failures.
End-to-end port takeover integration test
app/src-tauri/src/core_process_tests.rs, app/src-tauri/src/process_kill.rs
Windows integration test spawns a PowerShell TCP listener, finds and force-kills it via production code, verifies the port is released, and asserts kill_pid_force is idempotent when re-invoked.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

bug

Suggested reviewers

  • graycyrus
  • M3gA-Mind

Poem

🐇 I sniffed the netstat lines at dawn,
Two guarded PIDs I left alone.
Taskkill learns to say "it's gone" with grace,
Ports freed, no kernel chase.
A tiny hop — the startup's calm is grown.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: fixing Windows port/process takeover to actually free the port by properly handling protected PIDs and improving taskkill classification. It directly reflects the core objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the bug label May 23, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src-tauri/src/core_process_tests.rs`:
- Around line 384-389: The assertion comparing pid and child.id() can panic
before the spawned child is cleaned up; modify the test around the assert_eq! so
that if pid != child.id() you first ensure the spawned child (child) is
terminated (call child.kill() and wait/ignore errors) and/or join its handle,
then panic or assert-fail with the same message; alternatively replace the
direct assert_eq! with an if-check that performs child.kill().ok() and
child.wait().ok() before invoking panic! or failing the test, referencing the
existing pid and child.id() variables.

In `@app/src-tauri/src/process_kill.rs`:
- Around line 213-216: The current predicate in classify_taskkill_force_status
treats any stderr_str containing "could not be terminated" as a harmless
"already gone" case; change the condition so that stderr_str is only considered
Ok when it clearly indicates the process is absent (e.g., contains "not found"
or "no running instance") and do NOT treat a plain "could not be terminated" as
Ok—require it to be paired with "not found" (i.e., ("could not be terminated" &&
"not found")) or explicitly exclude stderr_str that contains "Access is denied";
update the if condition around stderr_str in classify_taskkill_force_status
accordingly to ensure access-denied errors are returned Err instead of Ok.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7da6c907-65b6-46d2-a6f6-d6f2d471cc88

📥 Commits

Reviewing files that changed from the base of the PR and between 4c6007b and 69eea01.

📒 Files selected for processing (3)
  • app/src-tauri/src/core_process.rs
  • app/src-tauri/src/core_process_tests.rs
  • app/src-tauri/src/process_kill.rs

Comment thread app/src-tauri/src/core_process_tests.rs Outdated
Comment thread app/src-tauri/src/process_kill.rs Outdated
…ess paired

CodeRabbit feedback on PR tinyhumansai#2552:

1. `classify_taskkill_force_status` was treating any stderr containing
   "could not be terminated" as success, but that phrase also appears in
   access-denied messages like "...could not be terminated. Reason:
   Access is denied." — masking real failures and leaving the process
   alive. Now requires "no running instance of the task" OR a pairing
   with "not found" before treating the run as success; bare
   access-denied stderr now propagates Err as intended. Added explicit
   tests for both the access-denied case (must fail) and the canonical
   `/T`-tree "no running instance of the task" case (must succeed).

2. `windows_port_takeover_finds_and_kills_listener` could panic on the
   pid sanity check before the spawned PowerShell listener was reaped,
   leaving a 60-second listener squatting on the test port. Now tears
   down the child via `kill()` + `wait()` before any panic.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src-tauri/src/process_kill.rs`:
- Around line 281-288: The test
classify_taskkill_force_treats_no_running_instance_as_success currently passes
Some(128) which triggers the early exit path and never exercises the stderr
fallback branch; change the exit code argument to a non-128 masked value (e.g.,
Some(1) or Some(0)) so classify_taskkill_force_status actually inspects stderr
and validates the "no running instance of the task" message in the test body,
ensuring the stderr-matching branch is covered; update only the argument passed
to classify_taskkill_force_status in that test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: acb78815-7269-4458-bf18-03b735b26640

📥 Commits

Reviewing files that changed from the base of the PR and between 69eea01 and af340d0.

📒 Files selected for processing (2)
  • app/src-tauri/src/core_process_tests.rs
  • app/src-tauri/src/process_kill.rs

Comment thread app/src-tauri/src/process_kill.rs
Per CodeRabbit on PR tinyhumansai#2552: passing Some(128) hit the early exit-128
fast path in classify_taskkill_force_status and never evaluated the
new "no running instance of the task" stderr branch — so a regression
in the fallback matcher would silently still pass. Use Some(1) (a
masked non-128 code) so the stderr matching is actually covered.
@senamakel senamakel merged commit 6a06bae into tinyhumansai:main May 23, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant