fix(loop): prevent stale exit signals from causing premature exit (#194) by frankbria · Pull Request #218 · frankbria/ralph-claude-code

frankbria · 2026-03-16T17:18:13Z

Summary

Fixes #194: Ralph ends early despite remaining steps in fix_plan.md

Root cause: When a previous Ralph run ended without calling reset_session() (crash, SIGKILL, or API-limit user exit), stale completion_indicators persisted in .exit_signals. On the next invocation, should_exit_gracefully() read these stale signals and exited the loop before execute_claude_code() was ever called.

Four fixes:

1. Unconditionally reset .exit_signals and .response_analysis at startup before the main loop (primary fix)
2. Add reset_session("api_limit_exit") to the API-limit user-exit path
3. Add diagnostic logging of signal counts in should_exit_gracefully() for future debugging
4. Include current_loop: 0 in circuit breaker init/reset, add jq fallback in show_circuit_status() to prevent #null display
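
To make the primary fix concrete, here is a minimal sketch of an unconditional startup reset. The JSON shape and the `EXIT_SIGNALS_FILE` / `RESPONSE_ANALYSIS_FILE` variable names appear elsewhere in this thread; the temporary directory, the stale sample data, and the `reset_startup_state` function name are assumptions made for illustration and are not taken from ralph_loop.sh.

```shell
#!/usr/bin/env bash
# Sketch only: file locations and the function name are assumptions.
STATE_DIR="$(mktemp -d)"
EXIT_SIGNALS_FILE="$STATE_DIR/.exit_signals"
RESPONSE_ANALYSIS_FILE="$STATE_DIR/.response_analysis"

# Simulate leftovers from a run that was killed before reset_session() ran.
echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": [1, 2]}' > "$EXIT_SIGNALS_FILE"
echo '{"stale": true}' > "$RESPONSE_ANALYSIS_FILE"

# Hypothetical helper: unconditional reset, called once before the main loop.
reset_startup_state() {
    echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": []}' > "$EXIT_SIGNALS_FILE"
    rm -f "$RESPONSE_ANALYSIS_FILE"
}

reset_startup_state
cat "$EXIT_SIGNALS_FILE"
```

Because the reset does not depend on how the previous run ended, it covers crashes and SIGKILL, where no cleanup code gets a chance to run.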

Acceptance Criteria

New ralph invocation always starts with clean .exit_signals
API-limit user-exit cleans up session state
should_exit_gracefully() logs signal counts for diagnosability
--circuit-status shows 0 instead of #null for fresh state
No regressions — all 580 tests pass

Test Plan

7 new tests written (TDD approach): 4 in test_exit_detection.bats, 3 in test_circuit_breaker_recovery.bats
All 580 tests passing (0 failures)
Documentation updated (CLAUDE.md startup state reset section + test counts)

Closes #194

Summary by CodeRabbit

Bug Fixes
- Fixed stale exit signals causing premature termination at startup.
New Features
- Enhanced circuit breaker state tracking with loop iteration monitoring.
- Added diagnostic logging for exit-check visibility in verbose mode.
Tests
- Expanded test suite from 573 to 580 tests with comprehensive coverage for exit signal handling and circuit breaker operations.
Documentation
- Updated release documentation to reflect exit signal initialization and circuit breaker improvements.

When a previous Ralph run ended without calling reset_session() (crash, SIGKILL, or API-limit user exit), stale completion_indicators persisted in .exit_signals. On the next invocation, should_exit_gracefully() read these stale signals and exited the loop before execute_claude_code() was ever called (visible as "Current loop: #null" in circuit breaker status).

Four fixes:

1. Unconditionally reset .exit_signals and .response_analysis at startup before the main loop (primary fix)
2. Add reset_session("api_limit_exit") to the API-limit user-exit path
3. Add diagnostic logging of signal counts in should_exit_gracefully()
4. Include current_loop:0 in circuit breaker init/reset, add jq fallback in show_circuit_status() to prevent #null display

coderabbitai · 2026-03-16T17:18:33Z

Walkthrough

This PR fixes premature exit behavior caused by stale exit signals. It introduces startup state reset logic to clear exit signal files, adds a current_loop field to circuit breaker state tracking with safe display fallback, and implements diagnostic logging for exit-detection debugging during the main loop.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `CLAUDE.md` | Added startup state reset section, expanded completion indicators with EXIT_SIGNAL gate rules, updated test counts and references to related fixes. |
| Circuit Breaker State Management `lib/circuit_breaker.sh` | Introduced `current_loop` field (initialized to 0) with fallback display logic to prevent `#null` output when field is missing. |
| Main Loop Exit-Signal Handling `ralph_loop.sh` | Added unconditional exit signal reset at startup, diagnostic logging for exit-check state in `should_exit_gracefully`, and `reset_session` call in API limit exit path. |
| Unit Tests - Circuit Breaker `tests/unit/test_circuit_breaker_recovery.bats` | Added three new tests verifying `current_loop` initialization, reset, and safe display with fallback values. |
| Unit Tests - Exit Detection `tests/unit/test_exit_detection.bats` | Added comprehensive test suite covering stale exit signal prevention, diagnostic logging, session reset invocation, and EXIT_SIGNAL interaction with completion indicators and edge cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

feat(session): add session lifecycle management with auto-reset triggers #66: Introduces session lifecycle functions (reset_session, init_session_tracking) that are now called during startup and API-limit exit paths in this PR.
fix: Recognize STATUS: COMPLETE as progress in circuit breaker #140: Modifies response_analysis and completion indicator handling; this PR removes stale response_analysis files to prevent false exit triggers.
fix(loop): respect Claude's EXIT_SIGNAL when checking completion indicators #90: Alters should_exit_gracefully logic; this PR adds diagnostic logging and startup resets to the same function to prevent stale signal issues.

Poem

🐰 A loop fresh-scrubbed, no ghosts of old
Current marked, its story told
Signals cleared with startup's cheer
Ralph hops on—no false exits here! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main fix: preventing stale exit signals from causing premature exit, with issue reference `#194`. |
| Linked Issues check | ✅ Passed | The PR implements all requirements from `#194`: resets stale exit signals at startup, fixes `#null` display in circuit status, adds diagnostic logging, and calls reset_session on API-limit exit. |
| Out of Scope Changes check | ✅ Passed | All changes directly address `#194` requirements: exit signal reset, circuit breaker current_loop handling, diagnostic logging, and reset_session call are all in scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

github-actions · 2026-03-16T17:19:57Z

Code Review Summary

This PR effectively addresses Issue #194 by adding unconditional startup state reset to prevent stale exit signals from causing premature exits. The implementation is clean and well-tested.

✅ Strengths

Correct placement of startup reset - The reset is positioned correctly after init_session_tracking() and before the main loop, ensuring every ralph invocation starts with clean state
Comprehensive cleanup - Both .exit_signals and .response_analysis are cleared, preventing any stale state from crashes, SIGKILL, or prior runs
Backward compatibility - The jq fallback in show_circuit_status() handles old state files without the current_loop field
Good test coverage - 7 new tests (4 in test_exit_detection.bats, 3 in test_circuit_breaker_recovery.bats) verify the fix comprehensively
Minimal and focused changes - The fix is surgical, only touching what's necessary to solve the problem

📝 Minor Observations

JSON creation method - The startup reset uses a heredoc to create the JSON instead of jq. This is intentional and appropriate for creating known-good JSON. Using jq would be overkill for this use case.
API limit exit path - The addition of reset_session at line 1952 is correctly placed before the break statement, ensuring cleanup occurs.
Diagnostic logging - The verbose logging at line 514 uses the existing VERBOSE_PROGRESS flag, avoiding log spam unless explicitly enabled.

✅ Acceptance Criteria Verification

New ralph invocation always starts with clean .exit_signals ✅ (lines 1813-1817)
API-limit user-exit cleans up session state ✅ (line 1952)
should_exit_gracefully() logs signal counts for diagnosability ✅ (line 514)
--circuit-status shows 0 instead of #null for fresh state ✅ (line 383 with fallback)
No regressions — all 580 tests pass ✅ (test count updated in CLAUDE.md)

🔍 Code Quality

Best practices: Follows bash conventions, uses proper quoting, handles errors appropriately
Security: No concerns - only internal state files are modified
Performance: Negligible impact - two file operations at startup only
Test coverage: Excellent - tests verify the fix from multiple angles

🎯 Recommendation

APPROVE - This is a solid, well-tested fix that correctly addresses the reported issue without introducing new problems or complexity. The changes are minimal, focused, and aligned with the project's code quality standards.

github-actions · 2026-03-16T17:20:07Z

Code Review Summary

This PR effectively addresses Issue #194 by adding unconditional startup state reset to prevent stale exit signals from causing premature exits. The implementation is clean and well-tested.

Strengths

Correct placement of startup reset - The reset is positioned correctly after init_session_tracking() and before the main loop, ensuring every ralph invocation starts with clean state
Comprehensive cleanup - Both .exit_signals and .response_analysis are cleared, preventing any stale state from crashes, SIGKILL, or prior runs
Backward compatibility - The jq fallback in show_circuit_status() handles old state files without the current_loop field
Good test coverage - 7 new tests (4 in test_exit_detection.bats, 3 in test_circuit_breaker_recovery.bats) verify the fix comprehensively
Minimal and focused changes - The fix is surgical, only touching what is necessary to solve the problem

Minor Observations

JSON creation method - The startup reset uses a heredoc to create the JSON instead of jq. This is intentional and appropriate for creating known-good JSON.
API limit exit path - The addition of reset_session at line 1952 is correctly placed before the break statement, ensuring cleanup occurs.
Diagnostic logging - The verbose logging at line 514 uses the existing VERBOSE_PROGRESS flag, avoiding log spam unless explicitly enabled.

Acceptance Criteria Verification

New ralph invocation always starts with clean .exit_signals (lines 1813-1817)
API-limit user-exit cleans up session state (line 1952)
should_exit_gracefully() logs signal counts for diagnosability (line 514)
--circuit-status shows 0 instead of null for fresh state (line 383 with fallback)
No regressions - all 580 tests pass (test count updated in CLAUDE.md)

Code Quality

Best practices: Follows bash conventions, uses proper quoting, handles errors appropriately
Security: No concerns - only internal state files are modified
Performance: Negligible impact - two file operations at startup only
Test coverage: Excellent - tests verify the fix from multiple angles

Recommendation

APPROVE - This is a solid, well-tested fix that correctly addresses the reported issue without introducing new problems or complexity. The changes are minimal, focused, and aligned with the project code quality standards.

github-actions · 2026-03-16T17:20:11Z

Review comment posted successfully at: #218 (comment)

github-actions · 2026-03-16T17:20:33Z

Review: fix(loop) - prevent stale exit signals causing premature exit (Issue 194)

This is a well-targeted fix for a real operational issue. The root cause analysis is accurate and the primary fix is appropriately unconditional. A few observations:

What works well

Primary fix is clean. Resetting .exit_signals and removing .response_analysis unconditionally in main() before the loop is the correct approach. No conditional logic needed — every new invocation should start fresh.

Backward compatibility in circuit breaker. Using // "N/A" as a jq fallback for old state files without current_loop is good defensive coding, preventing the #null display bug without requiring a migration step.
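
As an illustration of the fallback described above, the following sketch applies jq's alternative operator to two made-up state objects. The state contents are invented for the example and jq must be on the PATH; only the `// "N/A"` expression and the `current_loop` field come from this review.

```shell
#!/usr/bin/env bash
# Sketch only: the state contents are made up; requires jq on PATH.
legacy_state='{"state": "CLOSED"}'
fresh_state='{"state": "CLOSED", "current_loop": 0}'

# `//` yields the right-hand side only when the left is null or false,
# so a missing field falls back while a real 0 is kept.
legacy_loop="$(jq -r '.current_loop // "N/A"' <<< "$legacy_state")"
fresh_loop="$(jq -r '.current_loop // "N/A"' <<< "$fresh_state")"

echo "Current loop: #$legacy_loop"
echo "Current loop: #$fresh_loop"
```

A state file written before this PR would therefore display N/A, while a freshly initialized one displays 0.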

Gated diagnostic logging. The VERBOSE_PROGRESS guard on the new should_exit_gracefully() log line is the right call — doesn't clutter normal output while still being available for debugging.
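
A rough sketch of what such a guard can look like follows. The `log_status` stub, the `log_exit_check` helper, and the message format are assumptions for illustration; only the `VERBOSE_PROGRESS` flag and the three counter names come from this PR thread.

```shell
#!/usr/bin/env bash
# Sketch only: this log_status stub and the message format are assumptions.
log_status() {
    echo "[$1] $2"
}

# Hypothetical helper; the real check lives inside should_exit_gracefully().
log_exit_check() {
    local recent_test_loops="$1"
    local recent_done_signals="$2"
    local recent_completion_indicators="$3"
    if [[ "${VERBOSE_PROGRESS:-false}" == "true" ]]; then
        log_status "DEBUG" "exit check: test_loops=$recent_test_loops done_signals=$recent_done_signals completion_indicators=$recent_completion_indicators"
    fi
}

# Each call runs in its own subshell so the flag does not leak between them.
quiet_output="$(VERBOSE_PROGRESS=false; log_exit_check 0 0 2)"
verbose_output="$(VERBOSE_PROGRESS=true; log_exit_check 0 0 2)"
echo "$verbose_output"
```

With the flag off the helper prints nothing, so normal runs stay quiet and the counts only appear when someone is actively debugging an early exit.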

reset_session on API limit exit. Good defense-in-depth. Even though the startup reset handles the stale-signals case, explicitly cleaning session state on user-initiated API-limit exit prevents session ID issues on the next run.
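
The ordering matters here: cleanup has to run before control leaves the loop. The sketch below shows that shape with a stubbed `reset_session`; the loop structure, the stub body, and the hard-coded `user_choice` value are assumptions, not the code from ralph_loop.sh.

```shell
#!/usr/bin/env bash
# Sketch only: the loop shape and this reset_session stub are assumptions.
RESET_REASON=""

# Stub standing in for the project's reset_session(); it only records the reason.
reset_session() {
    RESET_REASON="$1"
}

# Hypothetical stand-in for the answer given at the API-limit prompt.
user_choice="2"

while true; do
    if [[ "$user_choice" == "2" ]]; then
        # Clean up session state first, then leave the loop.
        reset_session "api_limit_exit"
        break
    fi
done

echo "reset reason: $RESET_REASON"
```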

Issues worth addressing

1. Weak "before reset" assertion in stale exit signals do not cause premature exit test

run should_exit_gracefully
[[ "$output" != "" ]]  # Would exit

The comment says "Would exit" but the assertion only checks that output is non-empty, not that an exit condition was actually triggered. should_exit_gracefully may emit output for reasons unrelated to exiting. A stronger assertion would verify the return value or the echoed exit reason (e.g., [[ "$output" == *"project_complete"* ]]). As written, the test could pass even if the stale-signal detection regressed.
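
The difference between the two assertions can be shown in plain bash. In this sketch the `should_exit_gracefully` stub is deliberately wrong: it prints an unrelated diagnostic line and no exit reason. The stub and its message are invented for the example.

```shell
#!/usr/bin/env bash
# Sketch only: this stub prints a diagnostic line but reports no exit reason.
should_exit_gracefully() {
    echo "DEBUG: exit check ran"
}

output="$(should_exit_gracefully)"

# Weak check: passes on any output at all.
[[ "$output" != "" ]] && weak="pass" || weak="fail"

# Stronger check: passes only when the expected exit reason is echoed.
[[ "$output" == *"project_complete"* ]] && strong="pass" || strong="fail"

echo "weak=$weak strong=$strong"
```

The weak check reports a pass even though no exit condition was signalled, which is exactly the gap described above.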

2. rm -f ... 2>/dev/null is redundant

rm -f "$RESPONSE_ANALYSIS_FILE" 2>/dev/null

rm -f already suppresses "no such file" errors. The 2>/dev/null adds no value and may mask genuine errors (e.g., permission denied). Consider dropping it.
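
A quick way to confirm the first half of this claim is to run `rm -f` on a path that does not exist and capture stderr instead of discarding it. The path below is a throwaway created for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: rm -f on a path that does not exist.
missing_file="$(mktemp -d)/.response_analysis"

# Capture stderr rather than sending it to /dev/null.
rm_stderr="$(rm -f "$missing_file" 2>&1)"
rm_status=$?

echo "status=$rm_status stderr='${rm_stderr}'"
```

The command succeeds and prints nothing, so the extra redirection only affects other failures, such as permission problems, which are usually worth seeing.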

3. Static analysis tests are fragile

The tests startup resets stale exit signals before main loop and should_exit_gracefully logs diagnostic signal counts grep for specific strings in the source code. These will break on any refactor that renames variables or log messages, even if the behavior remains correct. Behavioral tests (actually calling the functions or running the script) are more robust and meaningful. The structural check approach is understandable for integration-level concerns, but it's worth noting the tradeoff.
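
Source-grep checks can also pass when the behavior is broken, not only fail when it is intact. The sketch below builds a tiny invented script whose reset line has been commented out, then compares a grep-based check with a behavioral one. Nothing here is taken from the real test suite.

```shell
#!/usr/bin/env bash
# Sketch only: the tiny script below is invented for the demonstration.
work_dir="$(mktemp -d)"
export EXIT_SIGNALS_FILE="$work_dir/.exit_signals"
echo '{"completion_indicators": [1, 2]}' > "$EXIT_SIGNALS_FILE"

# A script whose reset line has been commented out by a bad refactor.
cat > "$work_dir/loop.sh" <<'EOF'
# echo '{"completion_indicators": []}' > "$EXIT_SIGNALS_FILE"
echo "main loop would start here"
EOF

# Source-grep check: still finds the text, so it passes.
grep -q 'EXIT_SIGNALS_FILE' "$work_dir/loop.sh" && grep_check="pass" || grep_check="fail"

# Behavioral check: run the script, then inspect the file it should have reset.
bash "$work_dir/loop.sh" > /dev/null
if grep -qF '"completion_indicators": []' "$EXIT_SIGNALS_FILE"; then
    behavior_check="pass"
else
    behavior_check="fail"
fi

echo "grep_check=$grep_check behavior_check=$behavior_check"
```

Only the behavioral check notices that the stale indicators are still on disk.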

4. Hardcoded exit signals JSON structure

The startup reset inlines the structure:

echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": []}' > "$EXIT_SIGNALS_FILE"

This same pattern presumably appears elsewhere in the codebase (e.g., reset_session()). If the .exit_signals schema changes, all occurrences need updating. A reset_exit_signals() helper function would centralize this. Not a blocker for this PR, but worth a follow-up.
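
One possible shape for that follow-up is sketched here. The names `reset_exit_signals` and `EMPTY_EXIT_SIGNALS` are hypothetical, and the temporary file stands in for the real state file; the JSON shape is the one quoted above.

```shell
#!/usr/bin/env bash
# Sketch only: reset_exit_signals and EMPTY_EXIT_SIGNALS are hypothetical names.
EXIT_SIGNALS_FILE="$(mktemp -d)/.exit_signals"

# The schema is written down in exactly one place.
EMPTY_EXIT_SIGNALS='{"test_only_loops": [], "done_signals": [], "completion_indicators": []}'

reset_exit_signals() {
    printf '%s\n' "$EMPTY_EXIT_SIGNALS" > "$EXIT_SIGNALS_FILE"
}

# Any caller that needs a clean slate can share the helper.
echo '{"completion_indicators": [1, 2]}' > "$EXIT_SIGNALS_FILE"
reset_exit_signals
cat "$EXIT_SIGNALS_FILE"
```

With this layout a schema change touches one variable instead of every call site.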

Minor nit

The current_loop: 0 initialized in init_circuit_breaker() and reset_circuit_breaker() is fine functionally, but 0 could be confused with "loop 0 actually ran." null might be more semantically accurate for "no loops yet," though the display fallback // "N/A" covers both cases cleanly. Not a blocker.

Overall

The fix is correct and addresses the root cause. The circuit breaker display polish is a nice addition. The main actionable item before merge is strengthening the assertion in the stale exit signals do not cause premature exit test — as currently written, that test doesn't fully validate the behavior it claims to cover.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/circuit_breaker.sh (1)

46-57: ⚠️ Potential issue | 🟠 Major

Include opened_at in initialized/reset circuit state payloads.

The state JSON written in these paths omits opened_at, so the schema is inconsistent across writers. Please include opened_at (and keep fallback-to-last_change only for legacy files).

Suggested patch

 {
     "state": "$CB_STATE_CLOSED",
     "last_change": "$(get_iso_timestamp)",
     "consecutive_no_progress": 0,
     "consecutive_same_error": 0,
     "consecutive_permission_denials": 0,
     "last_progress_loop": 0,
     "total_opens": 0,
     "reason": "",
-    "current_loop": 0
+    "current_loop": 0,
+    "opened_at": ""
 }

 {
     "state": "$CB_STATE_CLOSED",
     "last_change": "$(get_iso_timestamp)",
     "consecutive_no_progress": 0,
     "consecutive_same_error": 0,
     "consecutive_permission_denials": 0,
     "last_progress_loop": 0,
     "total_opens": 0,
     "reason": "$reason",
-    "current_loop": 0
+    "current_loop": 0,
+    "opened_at": ""
 }

As per coding guidelines, "Circuit breaker state files must include opened_at timestamp field; fall back to last_change for backward compatibility with old state files".

Also applies to: 420-431
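
For readers unfamiliar with the fallback the guideline refers to, here is one way a reader of the state file could implement it with jq. The filter, the `read_opened_at` helper, and the timestamps are assumptions for illustration; how the project actually reads `opened_at` is not shown in this thread. Requires jq on the PATH.

```shell
#!/usr/bin/env bash
# Sketch only: timestamps are made up and the jq filter is an assumption.
legacy_state='{"state": "OPEN", "last_change": "2026-03-16T10:00:00Z"}'
new_state='{"state": "OPEN", "last_change": "2026-03-16T10:05:00Z", "opened_at": "2026-03-16T10:00:00Z"}'
empty_state='{"state": "CLOSED", "last_change": "2026-03-16T10:05:00Z", "opened_at": ""}'

# Hypothetical reader: prefer opened_at, fall back to last_change when it is absent.
read_opened_at() {
    jq -r '.opened_at // .last_change' <<< "$1"
}

legacy_opened="$(read_opened_at "$legacy_state")"
new_opened="$(read_opened_at "$new_state")"
empty_opened="$(read_opened_at "$empty_state")"

echo "legacy=$legacy_opened new=$new_opened empty='$empty_opened'"
```

Note that jq's `//` treats an empty string as a value, so a reader written this way returns an empty result for `"opened_at": ""` instead of falling back. Whether that matters depends on how the real reader is written.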

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/circuit_breaker.sh` around lines 46 - 57, The JSON written to
CB_STATE_FILE when initializing/resetting the circuit breaker (using
CB_STATE_CLOSED and get_iso_timestamp) is missing the opened_at field; update
the payload written in the here-doc to include "opened_at":
"$(get_iso_timestamp)" (or set opened_at to the same value as last_change) so
new state files include opened_at while retaining logic elsewhere that falls
back to last_change for legacy files; ensure the same change is applied to the
other initialization/reset block referenced in the comment (around the section
noted as also applies to lines 420-431) so all writers produce the opened_at
field.

🧹 Nitpick comments (2)

tests/unit/test_exit_detection.bats (1)

1245-1315: Prefer behavior assertions over source-grep assertions for these new tests.

These checks are currently tied to script text and can pass on false positives. Consider executing the relevant path and asserting file/state effects (.exit_signals cleared, .response_analysis removed, and session reset side effects) instead.

Based on learnings: "Test suite must achieve 100% test pass rate with comprehensive coverage of ... exit detection ... edge cases ..."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_exit_detection.bats` around lines 1245 - 1315, Replace
fragile source-grep assertions with behavior-based assertions: for the "startup
resets stale exit signals before main loop" test, actually run the startup path
(or source and call main startup helper) and assert EXIT_SIGNALS_FILE is
cleared/rewritten and RESPONSE_ANALYSIS_FILE is removed; for "stale exit signals
do not cause premature exit" leave the simulated .exit_signals and
.response_analysis setup but invoke the actual startup reset code (or call the
reset helper used by main) and then call should_exit_gracefully() and assert it
returns/prints nothing, not by grepping the script; for "should_exit_gracefully
logs diagnostic signal counts" call should_exit_gracefully() and assert the
runtime output contains the expected diagnostic strings (recent_test_loops,
recent_done_signals, recent_completion_indicators) and that log_status was
invoked by checking the log output, not the source; for "API limit user-exit
path calls reset_session" exercise the API-limit user-exit flow (simulate
user_choice == "2" or call the function handling that path) and assert side
effects of reset_session (e.g., session file/state cleared or a known marker
updated) occur before the break; reference functions/variables to change:
should_exit_gracefully, reset_session, main/startup reset helper,
EXIT_SIGNALS_FILE, RESPONSE_ANALYSIS_FILE, and user_choice handling code in
ralph_loop.sh.

tests/unit/test_circuit_breaker_recovery.bats (1)

450-488: Add one regression test for CB_AUTO_RESET=true state rewrite.

Current additions cover init/reset/display, but not the OPEN→CLOSED auto-reset transition writer. A dedicated assertion for current_loop (and full expected schema) there would prevent gaps.
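
A schema assertion of this kind could be sketched as below. The key list is the one named in this review plus `current_loop`; the sample state, which deliberately omits `current_loop`, and the variable names are invented for the example. Requires jq on the PATH.

```shell
#!/usr/bin/env bash
# Sketch only: the sample state is invented and omits current_loop on purpose.
expected_keys='["state","last_change","consecutive_no_progress","consecutive_same_error","consecutive_permission_denials","last_progress_loop","total_opens","reason","current_loop"]'

state='{"state": "CLOSED", "last_change": "2026-03-16T10:00:00Z", "consecutive_no_progress": 0, "consecutive_same_error": 0, "consecutive_permission_denials": 0, "last_progress_loop": 0, "total_opens": 1, "reason": "auto-reset"}'

# Array subtraction leaves only the expected keys that the object lacks.
missing_keys="$(jq -r --argjson expected "$expected_keys" '$expected - keys | join(",")' <<< "$state")"

if [[ -z "$missing_keys" ]]; then
    echo "schema complete"
else
    echo "missing keys: $missing_keys"
fi
```

In a real test the state would come from the file written by the auto-reset path, and an empty `missing_keys` would be the passing condition.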

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_circuit_breaker_recovery.bats` around lines 450 - 488, Add a
regression test that verifies the OPEN→CLOSED auto-reset path (when
CB_AUTO_RESET=true) rewrites the state file including current_loop and the full
expected schema: create an OPEN state JSON, export CB_AUTO_RESET=true, invoke
the same command/function that performs the auto-reset transition (the code path
under test that flips OPEN→CLOSED), then assert the resulting CB_STATE_FILE
contains "current_loop": 0 and all expected keys (state, last_change,
consecutive_no_progress, consecutive_same_error, consecutive_permission_denials,
last_progress_loop, total_opens, reason) to ensure the auto-reset writer
preserves the schema.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6a071d0-24b0-4a25-82d7-b0c5a23bd6e6

📥 Commits

Reviewing files that changed from the base of the PR and between 13e35c4 and 257bfea.

📒 Files selected for processing (5)

CLAUDE.md
lib/circuit_breaker.sh
ralph_loop.sh
tests/unit/test_circuit_breaker_recovery.bats
tests/unit/test_exit_detection.bats

Test User added 2 commits March 16, 2026 10:16

docs: update exit conditions and test count for #194

257bfea

coderabbitai Bot reviewed Mar 16, 2026

frankbria merged commit 12e4710 into main Mar 16, 2026
7 checks passed

This was referenced Mar 16, 2026

fix(loop): detect Claude Extra Usage quota exhaustion (#100) #219

Merged

Exit confidence threshold (≥40) causes false-positive completions on documentation keywords #224

Closed

frankbria deleted the fix/194-ralph-ends-early branch March 24, 2026 23:50

coderabbitai Bot mentioned this pull request Apr 2, 2026

Feature Request: Add "Optional" or "Future" Section Support in fix_plan.md #239

Open

This was referenced Apr 11, 2026

fix: grep -c pattern produces "0\n0" when no matches found #251

Open

ralph_loop.sh line 716: syntax error in arithmetic expression when fix_plan.md has non-standard line endings or heading-prefixed checkboxes #255

Open

This was referenced May 11, 2026

on-stop.sh mutates Ralph state in interactive Claude Code sessions, inflating cost counters and tripping no_status_block_3x halt #263

Closed

fix(loop): trust agent RALPH_STATUS before halting on permission denials #264

Open
