fix(tauri): clean shutdown + orphan reap (#1060)#1248
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds CancellationToken-driven graceful shutdown for the embedded core server; extracts cross-platform pid-kill helpers; implements macOS startup recovery to reap stale OpenHuman processes; improves orphan-child cleanup with TERM→KILL escalation; and exposes a Tauri diagnostic command to list owned processes. ChangesProcess lifecycle, recovery, and wiring
Sequence Diagram(s)sequenceDiagram
participant Tauri as Tauri App
participant Recovery as process_recovery
participant Core as Embedded Core
participant Axum as Axum Server
participant Children as Child Processes
Tauri->>Recovery: reap_stale_openhuman_processes() (macOS startup)
Recovery->>Recovery: check CEF SingletonLock (may skip)
Recovery->>Recovery: enumerate ps output & identify stale PIDs
Recovery->>Children: SIGTERM stale processes
Recovery->>Recovery: sleep 500ms
Recovery->>Recovery: re-enumerate & match PID+command
Recovery->>Children: SIGKILL holdouts
Recovery-->>Tauri: recovery done
Tauri->>Core: spawn embedded core with CancellationToken
Core->>Axum: run_server_embedded(shutdown_token)
Axum->>Axum: with_graceful_shutdown(shutdown_token)
Tauri->>Core: send_terminate_signal()
Core->>Core: cancel shutdown_token
Axum-->>Axum: graceful shutdown triggered
Tauri->>Tauri: event loop ends
Tauri->>Children: sweep_orphan_children() -> pkill -TERM -P
Tauri->>Tauri: sleep 500ms
Tauri->>Children: sweep_orphan_children() -> pkill -KILL -P (holdouts)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
7b0e4cd to
3d3bb09
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/src-tauri/src/core_process.rs (1)
367-381:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
shutdown()still aborts before graceful shutdown can finish.
shutdown()is used by restart/update flows, but it cancels the token and immediately aborts the server task. That makesrun_server_embedded(...).with_graceful_shutdown(...)mostly a race: in-flight RPCs are still cut off unless the server task happens to win beforeabort()runs.I'd keep
send_terminate_signal()as the fire-and-forget exit path, but letshutdown()wait a bounded time for the task to exit cleanly before falling back toabort().Possible direction
pub async fn shutdown(&self) { self.cancel_shutdown_token("").await; - self.abort_task("").await; + let mut task_guard = self.task.lock().await; + if let Some(mut task) = task_guard.take() { + if tokio::time::timeout(Duration::from_secs(2), &mut task) + .await + .is_err() + { + log::warn!("[core] graceful shutdown timed out; aborting embedded core server task"); + task.abort(); + } + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/core_process.rs` around lines 367 - 381, shutdown() currently cancels the shutdown token then immediately aborts the server task, which prevents graceful shutdown; change shutdown() to call cancel_shutdown_token("") then wait for the server task to exit with a bounded timeout (use tokio::time::timeout with a short duration, e.g. 5s) by awaiting the server task's JoinHandle completion, and only call abort_task("") if the timeout elapses or the join returns an error; keep send_terminate_signal() unchanged as the fire-and-forget path and reuse the existing cancel_shutdown_token and abort_task helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/core/jsonrpc_tests.rs`:
- Around line 11-31: EnvVarGuard must serialize process-global env var mutation
to avoid races: add a static lock (e.g.,
once_cell::sync::Lazy<std::sync::Mutex<()>> named ENV_LOCK) and have EnvVarGuard
acquire that lock in EnvVarGuard::set and store the resulting MutexGuard in the
struct (e.g., add a field like lock: std::sync::MutexGuard<'static, ()> or
Option<...>), then let Drop both restore the old env and drop the guard
(releasing the lock). Update EnvVarGuard::set to lock ENV_LOCK before
reading/setting the env and return the guard-bearing EnvVarGuard, and ensure the
Drop impl still restores or removes the env value then lets the stored guard go
out of scope to release serialization.
---
Outside diff comments:
In `@app/src-tauri/src/core_process.rs`:
- Around line 367-381: shutdown() currently cancels the shutdown token then
immediately aborts the server task, which prevents graceful shutdown; change
shutdown() to call cancel_shutdown_token("") then wait for the server task to
exit with a bounded timeout (use tokio::time::timeout with a short duration,
e.g. 5s) by awaiting the server task's JoinHandle completion, and only call
abort_task("") if the timeout elapses or the join returns an error; keep
send_terminate_signal() unchanged as the fire-and-forget path and reuse the
existing cancel_shutdown_token and abort_task helpers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 92e9f774-165e-4060-a980-2a7266d5ed0d
⛔ Files ignored due to path filters (1)
app/src-tauri/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (9)
app/src-tauri/Cargo.tomlapp/src-tauri/src/core_process.rsapp/src-tauri/src/core_process_tests.rsapp/src-tauri/src/lib.rsapp/src-tauri/src/process_kill.rsapp/src-tauri/src/process_recovery.rsdocs/src-tauri/README.mdsrc/core/jsonrpc.rssrc/core/jsonrpc_tests.rs
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
app/src-tauri/src/lib.rs (1)
1728-1820: ⚡ Quick winMove the Unix sweep helpers out of
lib.rs.This entrypoint is already carrying startup, tray, updater, IPC, and shutdown orchestration. Adding another ~100 lines of Unix process-enumeration/kill plumbing here makes it harder to maintain than extracting it into
process_kill.rsor a small shutdown module.As per coding guidelines,
Prefer modules ≤ ~500 lines; split growing modules.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/lib.rs` around lines 1728 - 1820, The lib.rs Unix sweep helpers (sweep_orphan_children, sweep_orphan_children_unix, direct_child_pids, parse_pgrep_pids, pkill_children, log_unexpected_pkill_status) should be moved into a new module (e.g., process_kill.rs): create the file, copy those functions preserving their #[cfg(unix)] attributes and visibility (make them pub(crate) if called from lib.rs), add mod process_kill; in lib.rs replace the inline implementations with a single call to process_kill::sweep_orphan_children() (retain the non-Unix noop variant in lib.rs or re-export a non-unix wrapper), ensure any used imports (log, std) remain available or are referenced fully-qualified, and add the new module to the crate so the build passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src-tauri/src/lib.rs`:
- Around line 1744-1776: The code currently skips pkill_children when
direct_child_pids() returns Err; change sweep logic so pkill_children("TERM")
and pkill_children("KILL") are invoked regardless of whether
direct_child_pids(parent_pid) succeeded—use the result of direct_child_pids only
for counting/log messages (term_count/kill_count) but do not gate the
pkill_children calls on its Ok state; keep calling log_unexpected_pkill_status
on successful pkill_children and log warnings on Err, and also fix the final log
total to compute total as term_count + kill_count (instead of term_count twice)
so the messages reflect counts correctly.
- Around line 79-94: The current process_diagnostics_list_owned function hides
errors by returning an empty Vec on failure; change its signature to return a
Result<Vec<process_recovery::ProcessInfo>, String> (or other serializable error
type Tauri expects), propagate the error from
process_recovery::enumerate_openhuman_processes() instead of converting it to
Vec::new() (e.g. map_err(|e| e.to_string())), and keep logging the error before
returning Err; update callers (Tauri command registration) if needed so the
command returns the Result and the real error surfaces to the caller.
---
Nitpick comments:
In `@app/src-tauri/src/lib.rs`:
- Around line 1728-1820: The lib.rs Unix sweep helpers (sweep_orphan_children,
sweep_orphan_children_unix, direct_child_pids, parse_pgrep_pids, pkill_children,
log_unexpected_pkill_status) should be moved into a new module (e.g.,
process_kill.rs): create the file, copy those functions preserving their
#[cfg(unix)] attributes and visibility (make them pub(crate) if called from
lib.rs), add mod process_kill; in lib.rs replace the inline implementations with
a single call to process_kill::sweep_orphan_children() (retain the non-Unix noop
variant in lib.rs or re-export a non-unix wrapper), ensure any used imports
(log, std) remain available or are referenced fully-qualified, and add the new
module to the crate so the build passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7ac963b9-2594-48b5-ba3d-c09be0b8fe67
⛔ Files ignored due to path filters (1)
app/src-tauri/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (10)
app/src-tauri/Cargo.tomlapp/src-tauri/src/core_process.rsapp/src-tauri/src/core_process_tests.rsapp/src-tauri/src/lib.rsapp/src-tauri/src/process_kill.rsapp/src-tauri/src/process_recovery.rsdocs/src-tauri/README.mdsrc/core/jsonrpc.rssrc/core/jsonrpc_tests.rssrc/openhuman/agent/prompts/mod_tests.rs
✅ Files skipped from review due to trivial changes (3)
- app/src-tauri/Cargo.toml
- docs/src-tauri/README.md
- src/core/jsonrpc.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- app/src-tauri/src/core_process_tests.rs
- src/core/jsonrpc_tests.rs
- app/src-tauri/src/core_process.rs
- app/src-tauri/src/process_kill.rs
- app/src-tauri/src/process_recovery.rs
There was a problem hiding this comment.
🧹 Nitpick comments (1)
app/src-tauri/src/process_kill.rs (1)
28-39: 💤 Low valueRemove redundant
Pid::from_rawand dead code.Line 29 duplicates the
Pid::from_raw(pid as i32)conversion whentarget(line 28) already holds that value. Line 34 (let _ = target;) is a no-op that appears to be leftover from refactoring.♻️ Proposed fix
pub(crate) fn kill_pid_force(pid: u32) -> Result<(), String> { use nix::sys::signal::{kill, Signal}; use nix::unistd::Pid; let target = Pid::from_raw(pid as i32); - match kill(Pid::from_raw(pid as i32), Signal::SIGKILL) { + match kill(target, Signal::SIGKILL) { Ok(()) => Ok(()), // ESRCH means the process exited between our re-validation and the // SIGKILL — the resource is freeing on its own, treat as success. Err(nix::errno::Errno::ESRCH) => { - let _ = target; Ok(()) } Err(e) => Err(format!("SIGKILL pid {pid}: {e}")), } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/process_kill.rs` around lines 28 - 39, The code creates target with Pid::from_raw(pid as i32) but then calls kill(Pid::from_raw(pid as i32), ...) and leaves a no-op let _ = target; in the ESRCH arm; change the kill call to use the existing target (kill(target, Signal::SIGKILL)) and remove the dead statement let _ = target; so the function uses the single Pid value and eliminates redundant conversion and dead code in process_kill.rs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@app/src-tauri/src/process_kill.rs`:
- Around line 28-39: The code creates target with Pid::from_raw(pid as i32) but
then calls kill(Pid::from_raw(pid as i32), ...) and leaves a no-op let _ =
target; in the ESRCH arm; change the kill call to use the existing target
(kill(target, Signal::SIGKILL)) and remove the dead statement let _ = target; so
the function uses the single Pid value and eliminates redundant conversion and
dead code in process_kill.rs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e7d644ae-dde4-402a-96b4-76a81977c9ec
📒 Files selected for processing (5)
app/src-tauri/src/core_process.rsapp/src-tauri/src/lib.rsapp/src-tauri/src/process_kill.rsdocs/src-tauri/README.mdsrc/core/jsonrpc_tests.rs
✅ Files skipped from review due to trivial changes (1)
- docs/src-tauri/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- src/core/jsonrpc_tests.rs
3f885e7 to
67f936a
Compare
Summary
CancellationToken.Problem
RunEvent::ExitRequested, leaving CEF/core helper processes behind.Solution
process_kill.rsfor reusable TERM/KILL helpers.process_recovery.rsstartup reaper that scopes ownership to argv0 paths inside the launching.app/Contents/, skips self, skipsOPENHUMAN_CORE_REUSE_EXISTING=1, and avoids killing a live SingletonLock holder.CancellationTokenplumbing throughCoreProcessHandleintorun_server_embedded, and call Axum.with_graceful_shutdown(token.cancelled()).send_terminate_signal/shutdownto cancel before aborting the embedded server task.sweep_orphan_childrento count direct children, send TERM, wait 500ms, KILL holdouts, and log[app] sweep: term=N kill=M total=K.process_diagnostics_list_ownedfor diagnostics and document stuck-process recovery indocs/src-tauri/README.md.Submission Checklist
docs/TESTING-STRATEGY.mdN/A: Diff coverage >= 80% local merged coverage was not run; this infrastructure PR is covered by targeted Rust tests plus macOS smoke, and CI coverage remains authoritative.
N/A: Coverage matrix updated - process lifecycle infrastructure has no feature row to add/remove/rename.
N/A: All affected feature IDs from the matrix are listed in the PR description under
## Related- no matrix feature IDs apply.docs/TESTING-STRATEGY.md)N/A: Manual smoke checklist updated if this touches release-cut surfaces - no release checklist row exists for this process lifecycle path; local S1-S6-style smoke results are included below.
Closes #NNNin the## RelatedsectionImpact
Manual verification:
cargo fmt --check --allpassed.cargo checkpassed with existing warnings.cargo check --manifest-path app/src-tauri/Cargo.tomlpassed with existing warnings.cargo clippy --manifest-path app/src-tauri/Cargo.toml --testspassed with existing warnings.cargo clippy --testsfailed only on known pre-existing deniedapprox_constantlints outside this PR.cargo test --manifest-path app/src-tauri/Cargo.toml --libpassed: 177 passed, 4 ignored.shutdown_token_stops_axum_listener_within_timeout,send_terminate_signal_cancels_shutdown_token, andprocess_recovery.cargo test -p openhuman-tauri-shellcould not run because this checkout has noopenhuman-tauri-shellpackage; the app package isOpenHuman.pnpm --filter openhuman-app macos:build:debugbuilt the branch debug.appand DMG successfully.macOS smoke from the branch-built bundle:
.app/Contents/processes.kill -9of the branch main PID left no branch-owned CEF helpers in this run.OPENHUMAN_CORE_REUSE_EXISTING=1attach path preserved the standalone listener using the repo's actualopenhuman-core servebin./var/.../openhuman-accessibility-helper/unified_helper_binPPID-1 processes pre-existed this worktree and are outside the branch bundle-path ownership boundary.Related
Closes #1060
Cross-reference: #1130
Cross-reference: #920
Cross-reference: #1120
Follow-up PR(s)/TODOs: N/A: none for this PR.
Summary by CodeRabbit
New Features
Documentation
Tests
Chores