Ctrl+C can become non-functional during /review and MCP startup (startup/shutdown deadlock) #11267

@mingley

Summary

Ctrl+C can become non-functional in startup-heavy flows (notably /review and MCP initialization), leaving Codex running in a state where keyboard interrupts do not exit the app.

In affected runs, repeated Ctrl+C presses do not recover the session. The only reliable escape is external process termination (closing the terminal/tab, killing the tmux pane, kill, etc.).

User impact

  • Breaks the primary safety/escape interaction in the CLI.
  • Can strand users in long-running or hung startup paths.
  • Makes /review and initial MCP setup feel deadlocked even when the user requests cancellation.

Reproduction scenarios

Scenario A: /review during startup pressure

  1. Start Codex in an environment where startup is slow/hanging (for example MCP server startup delays).
  2. Trigger /review.
  3. Press Ctrl+C one or more times.

Observed in failing cases: the app does not exit, and the keyboard interrupt appears to be ignored.

Scenario B: MCP initialization with unresolved client future

  1. Start with MCP servers where one startup future does not resolve promptly.
  2. During initialization, press Ctrl+C.

Observed in failing cases: the shutdown request can be starved; the process stays alive until externally killed.

Root cause analysis

This was a compound bug with multiple contributing paths:

  1. Startup op starvation in TUI agent startup path

    • Before the fix, spawn_agent awaited start_thread before consuming any queued ops.
    • If startup hung, Op::Shutdown (triggered by the quit flow / Ctrl+C) could not be handled.
  2. Unbounded or effectively unbounded MCP client waits in startup-sensitive paths

    • Several MCP manager call sites awaited shared client futures directly.
    • If client creation/startup got stuck, startup progress and related status flows could stall.
  3. Shutdown-first exit lacked a hard escalation path

    • Repeated quit requests could continue routing through ShutdownFirst without a guaranteed fallback to immediate exit when shutdown completion was stuck.
  4. Review sub-agent startup was not cancellation-aware while spawning

    • The /review path called sub-agent spawn directly; the spawn await was not raced against the cancellation token.

Implemented fix

I prepared and validated a two-commit fix on branch mingley/ctrlc-interrupt-fix:

  • e403fd6c7dd5aa277ad6d6f6b6c1323b7a1d102a
  • 3d8dccdea337b2b5ff44db929a50d57a1ecdf565

1) Keep startup interruptible in TUI (codex-rs/tui/src/chatwidget/agent.rs)

  • While startup is pending, consume ops from the op channel.
  • If Op::Shutdown arrives pre-startup, emit AppEvent::Exit(ExitMode::Immediate) and return.
  • Buffer non-shutdown startup ops and flush after startup completes.
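
A minimal sketch of the buffering behavior described above, not the actual patch. The real TUI code is async; this version uses only std channels and merges ops and the startup-completion signal into one event stream so it stays self-contained. `Event`, `StartupOutcome`, `drive_startup`, and the `UserInput` variant are hypothetical names for illustration; only `Op::Shutdown` and the "exit immediately" outcome come from the description.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical stand-in for the real op type; only `Shutdown` is named in this issue.
#[derive(Debug, PartialEq)]
enum Op {
    Shutdown,
    UserInput(String),
}

/// Hypothetical merged event stream: either an op or the startup-completion signal.
#[derive(Debug)]
enum Event {
    Op(Op),
    StartupDone,
}

/// Hypothetical result of the startup phase.
#[derive(Debug, PartialEq)]
enum StartupOutcome {
    /// Startup finished; the buffered ops should now be flushed in order.
    Started { buffered: Vec<Op> },
    /// Shutdown arrived before startup finished (maps to an immediate exit).
    ExitImmediately,
}

/// Keeps consuming ops while startup is pending instead of blocking on startup alone.
fn drive_startup(events: &Receiver<Event>) -> StartupOutcome {
    let mut buffered = Vec::new();
    // Blocking recv wakes on either an op or startup completion.
    while let Ok(event) = events.recv() {
        match event {
            // Shutdown is honored right away, even though startup has not finished.
            Event::Op(Op::Shutdown) => return StartupOutcome::ExitImmediately,
            // Any other op is held back until startup completes.
            Event::Op(op) => buffered.push(op),
            Event::StartupDone => return StartupOutcome::Started { buffered },
        }
    }
    // Every sender was dropped before startup finished: nothing left to wait for.
    StartupOutcome::ExitImmediately
}
```

The key property is that a hung startup can no longer starve `Op::Shutdown`: the loop never waits on startup without also watching the op source.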

2) Bound MCP startup waits across manager call paths (codex-rs/core/src/mcp_connection_manager.rs)

  • Added startup timeout-aware client helper and applied it to startup-sensitive operations.
  • Added timeout-based behavior for startup-failure checks and aggregate listing paths.
  • Updated the remaining raw startup waits (notify_sandbox_state_change, startup join task) to use bounded waits.
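
To illustrate the bounded-wait idea, here is a rough std-only sketch. It is an assumption about the shape of the change, not the code in mcp_connection_manager.rs: `wait_for_client`, `list_ready_tools`, and `StartupWaitError` are hypothetical names, and each pending client is modeled as a channel receiver rather than a shared future.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical error for a bounded startup wait.
#[derive(Debug, PartialEq)]
enum StartupWaitError {
    /// The client did not become ready within the timeout.
    TimedOut,
    /// The startup task went away without producing a client.
    Failed,
}

/// Waits for a client to become ready, but never longer than `timeout`.
fn wait_for_client<T>(ready: &Receiver<T>, timeout: Duration) -> Result<T, StartupWaitError> {
    match ready.recv_timeout(timeout) {
        Ok(client) => Ok(client),
        Err(RecvTimeoutError::Timeout) => Err(StartupWaitError::TimedOut),
        Err(RecvTimeoutError::Disconnected) => Err(StartupWaitError::Failed),
    }
}

/// Aggregates tools from servers that are ready in time and skips the rest.
/// Servers are waited on one after another, so the worst case is roughly
/// `servers.len() * timeout`.
fn list_ready_tools(
    servers: &[(String, Receiver<Vec<String>>)],
    timeout: Duration,
) -> Vec<String> {
    let mut tools = Vec::new();
    for (name, ready) in servers {
        // A pending or failed server is skipped instead of stalling the listing.
        if let Ok(server_tools) = wait_for_client(ready, timeout) {
            for tool in server_tools {
                tools.push(format!("{}/{}", name, tool));
            }
        }
    }
    tools
}
```

With this shape, one server whose startup never resolves costs at most one timeout instead of blocking the whole listing.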

3) Add shutdown escalation fallback (codex-rs/tui/src/app.rs)

  • Added explicit escalation behavior: a repeated ShutdownFirst quit request exits immediately.
  • This preserves the graceful shutdown attempt first, but guarantees a hard user escape hatch on a repeated request.
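
The escalation rule can be sketched as a tiny state machine. This is an illustration under assumptions, not the code in app.rs: `QuitState` and `on_quit_request` are hypothetical names, while the `ShutdownFirst` and `Immediate` exit modes are the ones referred to in this issue.

```rust
/// Exit modes as referred to in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExitMode {
    ShutdownFirst,
    Immediate,
}

/// Hypothetical tracker for whether a graceful shutdown is already in flight.
#[derive(Debug, Default)]
struct QuitState {
    shutdown_requested: bool,
}

impl QuitState {
    /// Returns the exit mode to act on for this quit request.
    fn on_quit_request(&mut self, requested: ExitMode) -> ExitMode {
        match requested {
            // An explicit immediate exit is always honored.
            ExitMode::Immediate => ExitMode::Immediate,
            // Second ShutdownFirst request: graceful shutdown is stuck or slow, so escalate.
            ExitMode::ShutdownFirst if self.shutdown_requested => ExitMode::Immediate,
            // First ShutdownFirst request: try the graceful path and remember that we did.
            ExitMode::ShutdownFirst => {
                self.shutdown_requested = true;
                ExitMode::ShutdownFirst
            }
        }
    }
}
```

The first Ctrl+C still gets a graceful shutdown attempt; only a repeat while that attempt is outstanding turns into a hard exit.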

4) Make review sub-agent spawn cancellation-aware (codex-rs/core/src/codex_delegate.rs)

  • Wrapped Codex::spawn(...) with cancellation (or_cancel) in review/delegate startup path.
  • Prevents /review from hanging uninterruptibly during spawn waits.
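
As a rough illustration of racing a spawn against cancellation, here is a std-only sketch. The actual change wraps the async spawn with or_cancel; this version approximates the same race by polling a cancellation flag between short waits. `wait_or_cancel`, `SpawnError`, and the 5 ms polling interval are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical error for a cancellable spawn wait.
#[derive(Debug, PartialEq)]
enum SpawnError {
    /// Cancellation was requested before the spawn completed.
    Cancelled,
    /// The spawn task went away without producing a result.
    Failed,
}

/// Waits for a spawn result, returning early if cancellation is requested.
/// Cancellation is checked first, so it wins if both are ready at once.
fn wait_or_cancel<T>(spawned: &Receiver<T>, cancelled: &AtomicBool) -> Result<T, SpawnError> {
    loop {
        if cancelled.load(Ordering::SeqCst) {
            return Err(SpawnError::Cancelled);
        }
        // Wait in short slices so a cancellation request is noticed promptly.
        match spawned.recv_timeout(Duration::from_millis(5)) {
            Ok(value) => return Ok(value),
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return Err(SpawnError::Failed),
        }
    }
}
```

An async implementation would not need the polling interval; the point is only that the wait on the spawn has a second exit that the user can trigger.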

Test coverage and validation

Added/validated tests relevant to these regressions:

  • mcp_connection_manager::tests::required_startup_failures_times_out_pending_server
  • mcp_connection_manager::tests::list_all_tools_skips_pending_server_after_timeout
  • app::tests::shutdown_first_exit_escalates_on_second_request
  • chatwidget::tests::ctrl_c_shutdown_works_with_caps_lock

Validation commands run:

  • cargo fmt
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test -p codex-core required_startup_failures_times_out_pending_server
  • cargo test -p codex-core list_all_tools_skips_pending_server_after_timeout
  • cargo test -p codex-tui shutdown_first_exit_escalates_on_second_request
  • cargo test -p codex-tui ctrl_c_shutdown_works_with_caps_lock
  • cargo test -p codex-tui
  • cargo test (workspace)

Workspace test note:

  • cargo test fails in this environment only on pre-existing core tests unrelated to this fix:
    • shell::tests::detects_bash
    • seatbelt::tests::create_seatbelt_args_with_read_only_git_pointer_file
    • seatbelt::tests::create_seatbelt_args_with_read_only_git_and_codex_subpaths

Why this contribution should be accepted

  1. High user impact: restores reliability of Ctrl+C, the primary fail-safe control.
  2. Clear causal diagnosis: fix addresses each independent blocking path, not just symptoms.
  3. Surgical scope: targeted changes in affected startup/shutdown paths only.
  4. Regression coverage: added focused tests for timeout and shutdown escalation behavior.
  5. Low behavior risk: normal successful flow remains unchanged; only failure/cancellation paths are hardened.

Request for contribution path

External PRs are currently invitation-only. Please consider one of:

  1. Allowing me to submit this as a PR directly, or
  2. Cherry-picking the two commits above.

I can also split the commits further if maintainers want each root-cause fix isolated.

Labels: TUI, bug