
fix(loop): prevent stale exit signals from causing premature exit (#194) #218

Merged
frankbria merged 2 commits into main from fix/194-ralph-ends-early
Mar 16, 2026

Conversation


@frankbria frankbria commented Mar 16, 2026

Summary

Fixes #194: Ralph ends early despite remaining steps in fix_plan.md

Root cause: When a previous Ralph run ended without calling reset_session() (crash, SIGKILL, or API-limit user exit), stale completion_indicators persisted in .exit_signals. On the next invocation, should_exit_gracefully() read these stale signals and exited the loop before execute_claude_code() was ever called.

Four fixes:

  • Unconditionally reset .exit_signals and .response_analysis at startup before the main loop (primary fix)
  • Add reset_session("api_limit_exit") to the API-limit user-exit path
  • Add diagnostic logging of signal counts in should_exit_gracefully() for future debugging
  • Include current_loop: 0 in circuit breaker init/reset, add jq fallback in show_circuit_status() to prevent #null display
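
The primary fix can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `startup_state_reset`, the temporary directory, and the sample stale values are assumptions made for the example, while the file variables and the empty JSON structure follow what is quoted elsewhere in this conversation.

```shell
#!/usr/bin/env bash
# Sketch only: paths, sample values, and the function name are illustrative assumptions.
set -u

STATE_DIR="$(mktemp -d)"
trap 'rm -rf "$STATE_DIR"' EXIT
EXIT_SIGNALS_FILE="$STATE_DIR/.exit_signals"
RESPONSE_ANALYSIS_FILE="$STATE_DIR/.response_analysis"

# Simulate what a crashed or SIGKILLed run could leave behind (made-up values).
echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": [1, 2]}' > "$EXIT_SIGNALS_FILE"
echo '{"stale": true}' > "$RESPONSE_ANALYSIS_FILE"

# Hypothetical name for the startup step; it runs unconditionally,
# before the main loop ever consults the exit signals.
startup_state_reset() {
    echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": []}' > "$EXIT_SIGNALS_FILE"
    rm -f "$RESPONSE_ANALYSIS_FILE"
}

startup_state_reset
cat "$EXIT_SIGNALS_FILE"
```

Because the reset does not depend on how the previous run ended, it covers crashes and kills that no exit-path cleanup could catch.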

Acceptance Criteria

  • New ralph invocation always starts with clean .exit_signals
  • API-limit user-exit cleans up session state
  • should_exit_gracefully() logs signal counts for diagnosability
  • --circuit-status shows 0 instead of #null for fresh state
  • No regressions — all 580 tests pass

Test Plan

  • 7 new tests written (TDD approach): 4 in test_exit_detection.bats, 3 in test_circuit_breaker_recovery.bats
  • All 580 tests passing (0 failures)
  • Documentation updated (CLAUDE.md startup state reset section + test counts)

Closes #194

Summary by CodeRabbit

  • Bug Fixes

    • Fixed stale exit signals causing premature termination at startup.
  • New Features

    • Enhanced circuit breaker state tracking with loop iteration monitoring.
    • Added diagnostic logging for exit-check visibility in verbose mode.
  • Tests

    • Expanded test suite from 573 to 580 tests with comprehensive coverage for exit signal handling and circuit breaker operations.
  • Documentation

    • Updated release documentation to reflect exit signal initialization and circuit breaker improvements.

Test User added 2 commits March 16, 2026 10:16
When a previous Ralph run ended without calling reset_session() — crash,
SIGKILL, or API-limit user exit — stale completion_indicators persisted
in .exit_signals. On the next invocation, should_exit_gracefully() read
these stale signals and exited the loop before execute_claude_code() was
ever called (visible as "Current loop: #null" in circuit breaker status).

Four fixes:
1. Unconditionally reset .exit_signals and .response_analysis at startup
   before the main loop (primary fix)
2. Add reset_session("api_limit_exit") to the API-limit user-exit path
3. Add diagnostic logging of signal counts in should_exit_gracefully()
4. Include current_loop:0 in circuit breaker init/reset, add jq fallback
   in show_circuit_status() to prevent #null display

coderabbitai Bot commented Mar 16, 2026

Walkthrough

This PR fixes premature exit behavior caused by stale exit signals. It introduces startup state reset logic to clear exit signal files, adds a current_loop field to circuit breaker state tracking with safe display fallback, and implements diagnostic logging for exit-detection debugging during the main loop.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: CLAUDE.md | Added startup state reset section, expanded completion indicators with EXIT_SIGNAL gate rules, updated test counts and references to related fixes. |
| Circuit Breaker State Management: lib/circuit_breaker.sh | Introduced current_loop field (initialized to 0) with fallback display logic to prevent #null output when field is missing. |
| Main Loop Exit-Signal Handling: ralph_loop.sh | Added unconditional exit signal reset at startup, diagnostic logging for exit-check state in should_exit_gracefully, and reset_session call in API limit exit path. |
| Unit Tests - Circuit Breaker: tests/unit/test_circuit_breaker_recovery.bats | Added three new tests verifying current_loop initialization, reset, and safe display with fallback values. |
| Unit Tests - Exit Detection: tests/unit/test_exit_detection.bats | Added comprehensive test suite covering stale exit signal prevention, diagnostic logging, session reset invocation, and EXIT_SIGNAL interaction with completion indicators and edge cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A loop fresh-scrubbed, no ghosts of old
Current marked, its story told
Signals cleared with startup's cheer
Ralph hops on—no false exits here! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main fix: preventing stale exit signals from causing premature exit, with issue reference #194. |
| Linked Issues check | ✅ Passed | The PR implements all requirements from #194: resets stale exit signals at startup, fixes #null display in circuit status, adds diagnostic logging, and calls reset_session on API-limit exit. |
| Out of Scope Changes check | ✅ Passed | All changes directly address #194 requirements: exit signal reset, circuit breaker current_loop handling, diagnostic logging, and reset_session call are all in scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@github-actions

Code Review Summary

This PR effectively addresses Issue #194 by adding unconditional startup state reset to prevent stale exit signals from causing premature exits. The implementation is clean and well-tested.

Strengths

  1. Correct placement of startup reset - The reset is positioned correctly after init_session_tracking() and before the main loop, ensuring every ralph invocation starts with clean state

  2. Comprehensive cleanup - Both .exit_signals and .response_analysis are cleared, preventing any stale state from crashes, SIGKILL, or prior runs

  3. Backward compatibility - The jq fallback in show_circuit_status() handles old state files without the current_loop field

  4. Good test coverage - 7 new tests (4 in test_exit_detection.bats, 3 in test_circuit_breaker_recovery.bats) verify the fix comprehensively

  5. Minimal and focused changes - The fix is surgical, only touching what is necessary to solve the problem

Minor Observations

  1. JSON creation method - The startup reset uses a heredoc to create the JSON instead of jq. This is intentional and appropriate for creating known-good JSON.

  2. API limit exit path - The addition of reset_session at line 1952 is correctly placed before the break statement, ensuring cleanup occurs.

  3. Diagnostic logging - The verbose logging at line 514 uses the existing VERBOSE_PROGRESS flag, avoiding log spam unless explicitly enabled.

Acceptance Criteria Verification

  • New ralph invocation always starts with clean .exit_signals (lines 1813-1817)
  • API-limit user-exit cleans up session state (line 1952)
  • should_exit_gracefully() logs signal counts for diagnosability (line 514)
  • --circuit-status shows 0 instead of #null for fresh state (line 383 with fallback)
  • No regressions - all 580 tests pass (test count updated in CLAUDE.md)

Code Quality

  • Best practices: Follows bash conventions, uses proper quoting, handles errors appropriately
  • Security: No concerns - only internal state files are modified
  • Performance: Negligible impact - two file operations at startup only
  • Test coverage: Excellent - tests verify the fix from multiple angles

Recommendation

APPROVE - This is a solid, well-tested fix that correctly addresses the reported issue without introducing new problems or complexity. The changes are minimal, focused, and aligned with the project's code quality standards.

@github-actions

Review: fix(loop) - prevent stale exit signals causing premature exit (Issue 194)

This is a well-targeted fix for a real operational issue. The root cause analysis is accurate and the primary fix is appropriately unconditional. A few observations:


What works well

Primary fix is clean. Resetting .exit_signals and removing .response_analysis unconditionally in main() before the loop is the correct approach. No conditional logic needed — every new invocation should start fresh.

Backward compatibility in circuit breaker. Using // "N/A" as a jq fallback for old state files without current_loop is good defensive coding, preventing the #null display bug without requiring a migration step.
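
The fallback behavior described here can be illustrated with a small sketch. `show_current_loop` is a hypothetical helper written for this example, not the project's `show_circuit_status()`, and the two state files contain only a subset of fields. It relies on jq's `//` alternative operator, which substitutes the right-hand value only when the left side is `null` or `false`.

```shell
#!/usr/bin/env bash
# Sketch only: show_current_loop is a hypothetical helper for illustration.
show_current_loop() {
    # A real 0 is preserved; a missing (null) field becomes "N/A".
    jq -r '.current_loop // "N/A"' "$1"
}

state_dir="$(mktemp -d)"
trap 'rm -rf "$state_dir"' EXIT
echo '{"total_opens": 0}' > "$state_dir/legacy.json"                    # old file, no current_loop
echo '{"total_opens": 0, "current_loop": 0}' > "$state_dir/fresh.json"  # freshly initialized

if command -v jq >/dev/null 2>&1; then
    echo "legacy: #$(show_current_loop "$state_dir/legacy.json")"
    echo "fresh:  #$(show_current_loop "$state_dir/fresh.json")"
else
    echo "jq not installed; skipping demo" >&2
fi
```

Since `0` is neither `null` nor `false` in jq, a freshly initialized state still displays its real value rather than the placeholder.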

Gated diagnostic logging. The VERBOSE_PROGRESS guard on the new should_exit_gracefully() log line is the right call — doesn't clutter normal output while still being available for debugging.
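
A guard of this kind might look like the sketch below. The `log_status` stand-in, the `log_exit_check` wrapper, and the exact message format are assumptions for illustration; only the `VERBOSE_PROGRESS` flag and the three counter names appear in this PR's discussion.

```shell
#!/usr/bin/env bash
# Sketch only: log_status is a stand-in; the real helper's signature is not shown in this PR.
VERBOSE_PROGRESS="${VERBOSE_PROGRESS:-false}"

log_status() {
    echo "[$1] $2" >&2
}

# Hypothetical wrapper around the diagnostic line added to should_exit_gracefully().
log_exit_check() {
    if [ "$VERBOSE_PROGRESS" = "true" ]; then
        log_status "DEBUG" "exit check: recent_test_loops=$1 recent_done_signals=$2 recent_completion_indicators=$3"
    fi
}

# Silent by default; prints one line to stderr when VERBOSE_PROGRESS=true.
log_exit_check 0 0 2
```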

reset_session on API limit exit. Good defense-in-depth. Even though the startup reset handles the stale-signals case, explicitly cleaning session state on user-initiated API-limit exit prevents session ID issues on the next run.


Issues worth addressing

1. Weak "before reset" assertion in the "stale exit signals do not cause premature exit" test

run should_exit_gracefully
[[ "$output" != "" ]]  # Would exit

The comment says "Would exit" but the assertion only checks that output is non-empty, not that an exit condition was actually triggered. should_exit_gracefully may emit output for reasons unrelated to exiting. A stronger assertion would verify the return value or the echoed exit reason (e.g., [[ "$output" == *"project_complete"* ]]). As written, the test could pass even if the stale-signal detection regressed.
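
The difference between the two assertions can be shown with a plain-shell sketch. The stub below is not the real `should_exit_gracefully()`; it is a deliberately simplified model, controlled by a made-up `STALE_SIGNALS` variable, of a function that may print something without signalling an exit.

```shell
#!/usr/bin/env bash
# Sketch only: this stub is NOT the real should_exit_gracefully().
STALE_SIGNALS="${STALE_SIGNALS:-false}"

should_exit_gracefully() {
    if [ "$STALE_SIGNALS" = "true" ]; then
        echo "project_complete"               # exit reason, as in the reviewer's example
    else
        echo "DEBUG: no exit condition met"   # output unrelated to exiting
    fi
}

output="$(should_exit_gracefully)"

# Weak check: passes in both cases, so it cannot detect a regression.
[ "$output" != "" ] && echo "weak assertion passed"

# Stronger check: passes only when the exit reason is actually reported.
case "$output" in
    *project_complete*) echo "strong assertion passed" ;;
    *) echo "strong assertion failed" ;;
esac
```

In a Bats test the same idea would translate to matching on `$output` (or checking `$status`) after `run should_exit_gracefully`, rather than only testing for non-empty output.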

2. rm -f ... 2>/dev/null is redundant

rm -f "$RESPONSE_ANALYSIS_FILE" 2>/dev/null

rm -f already suppresses "no such file" errors. The 2>/dev/null adds no value and may mask genuine errors (e.g., permission denied). Consider dropping it.
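
A quick way to see this is the sketch below. It uses a directory operand as a stand-in for a genuine failure, because a permission-denied case is hard to reproduce reliably (for example when running as root); the variable names are illustrative.

```shell
#!/usr/bin/env bash
work_dir="$(mktemp -d)"
trap 'rm -rf "$work_dir"' EXIT

# Case 1: the file does not exist. rm -f stays silent and succeeds on its own.
missing_err="$(rm -f "$work_dir/.response_analysis" 2>&1)"
missing_status=$?

# Case 2: a genuine failure (a directory operand without -r) is still reported
# by rm -f; adding 2>/dev/null would hide this message.
mkdir "$work_dir/subdir"
real_err="$(rm -f "$work_dir/subdir" 2>&1)"
real_status=$?

echo "missing file: status=$missing_status"
echo "real failure: status=$real_status"
```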

3. Static analysis tests are fragile

The tests "startup resets stale exit signals before main loop" and "should_exit_gracefully logs diagnostic signal counts" grep for specific strings in the source code. These will break on any refactor that renames variables or log messages, even if the behavior remains correct. Behavioral tests (actually calling the functions or running the script) are more robust and meaningful. The structural-check approach is understandable for integration-level concerns, but the tradeoff is worth noting.

4. Hardcoded exit signals JSON structure

The startup reset inlines the structure:

echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": []}' > "$EXIT_SIGNALS_FILE"

This same pattern presumably appears elsewhere in the codebase (e.g., reset_session()). If the .exit_signals schema changes, all occurrences need updating. A reset_exit_signals() helper function would centralize this. Not a blocker for this PR, but worth a follow-up.
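
One possible shape for such a helper is sketched below. Only the name `reset_exit_signals` and the JSON structure come from this review; the function body, the temporary file default, and the call sites named in the comments are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: the body is an assumption, not code from the PR.
EXIT_SIGNALS_FILE="${EXIT_SIGNALS_FILE:-$(mktemp)}"

reset_exit_signals() {
    # Single place that defines the empty .exit_signals schema.
    printf '%s\n' '{"test_only_loops": [], "done_signals": [], "completion_indicators": []}' > "$EXIT_SIGNALS_FILE"
}

# Callers would then share one definition, for example:
#   main():          reset_exit_signals   # at startup, before the loop
#   reset_session(): reset_exit_signals   # during session cleanup
echo '{"completion_indicators": [7]}' > "$EXIT_SIGNALS_FILE"   # made-up stale content
reset_exit_signals
cat "$EXIT_SIGNALS_FILE"
```

With this arrangement a schema change would touch one function instead of every inline `echo`.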


Minor nit

The current_loop: 0 initialized in init_circuit_breaker() and reset_circuit_breaker() is fine functionally, but 0 could be confused with "loop 0 actually ran." null might be more semantically accurate for "no loops yet," though the display fallback // "N/A" covers both cases cleanly. Not a blocker.


Overall

The fix is correct and addresses the root cause. The circuit breaker display polish is a nice addition. The main actionable item before merge is strengthening the assertion in the "stale exit signals do not cause premature exit" test. As currently written, that test doesn't fully validate the behavior it claims to cover.


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/circuit_breaker.sh (1)

46-57: ⚠️ Potential issue | 🟠 Major

Include opened_at in initialized/reset circuit state payloads.

The state JSON written in these paths omits opened_at, so the schema is inconsistent across writers. Please include opened_at (and keep fallback-to-last_change only for legacy files).

Suggested patch
 {
     "state": "$CB_STATE_CLOSED",
     "last_change": "$(get_iso_timestamp)",
     "consecutive_no_progress": 0,
     "consecutive_same_error": 0,
     "consecutive_permission_denials": 0,
     "last_progress_loop": 0,
     "total_opens": 0,
     "reason": "",
-    "current_loop": 0
+    "current_loop": 0,
+    "opened_at": ""
 }
 {
     "state": "$CB_STATE_CLOSED",
     "last_change": "$(get_iso_timestamp)",
     "consecutive_no_progress": 0,
     "consecutive_same_error": 0,
     "consecutive_permission_denials": 0,
     "last_progress_loop": 0,
     "total_opens": 0,
     "reason": "$reason",
-    "current_loop": 0
+    "current_loop": 0,
+    "opened_at": ""
 }

As per coding guidelines, "Circuit breaker state files must include opened_at timestamp field; fall back to last_change for backward compatibility with old state files".

Also applies to: 420-431

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/circuit_breaker.sh` around lines 46 - 57, The JSON written to
CB_STATE_FILE when initializing/resetting the circuit breaker (using
CB_STATE_CLOSED and get_iso_timestamp) is missing the opened_at field; update
the payload written in the here-doc to include "opened_at":
"$(get_iso_timestamp)" (or set opened_at to the same value as last_change) so
new state files include opened_at while retaining logic elsewhere that falls
back to last_change for legacy files; ensure the same change is applied to the
other initialization/reset block referenced in the comment (around the section
noted as also applies to lines 420-431) so all writers produce the opened_at
field.
🧹 Nitpick comments (2)
tests/unit/test_exit_detection.bats (1)

1245-1315: Prefer behavior assertions over source-grep assertions for these new tests.

These checks are currently tied to script text and can pass on false positives. Consider executing the relevant path and asserting file/state effects (.exit_signals cleared, .response_analysis removed, and session reset side effects) instead.

Based on learnings: "Test suite must achieve 100% test pass rate with comprehensive coverage of ... exit detection ... edge cases ..."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_exit_detection.bats` around lines 1245 - 1315, Replace
fragile source-grep assertions with behavior-based assertions: for the "startup
resets stale exit signals before main loop" test, actually run the startup path
(or source and call main startup helper) and assert EXIT_SIGNALS_FILE is
cleared/rewritten and RESPONSE_ANALYSIS_FILE is removed; for "stale exit signals
do not cause premature exit" leave the simulated .exit_signals and
.response_analysis setup but invoke the actual startup reset code (or call the
reset helper used by main) and then call should_exit_gracefully() and assert it
returns/prints nothing, not by grepping the script; for "should_exit_gracefully
logs diagnostic signal counts" call should_exit_gracefully() and assert the
runtime output contains the expected diagnostic strings (recent_test_loops,
recent_done_signals, recent_completion_indicators) and that log_status was
invoked by checking the log output, not the source; for "API limit user-exit
path calls reset_session" exercise the API-limit user-exit flow (simulate
user_choice == "2" or call the function handling that path) and assert side
effects of reset_session (e.g., session file/state cleared or a known marker
updated) occur before the break; reference functions/variables to change:
should_exit_gracefully, reset_session, main/startup reset helper,
EXIT_SIGNALS_FILE, RESPONSE_ANALYSIS_FILE, and user_choice handling code in
ralph_loop.sh.
tests/unit/test_circuit_breaker_recovery.bats (1)

450-488: Add one regression test for CB_AUTO_RESET=true state rewrite.

Current additions cover init/reset/display, but not the OPEN→CLOSED auto-reset transition writer. A dedicated assertion for current_loop (and full expected schema) there would prevent gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_circuit_breaker_recovery.bats` around lines 450 - 488, Add a
regression test that verifies the OPEN→CLOSED auto-reset path (when
CB_AUTO_RESET=true) rewrites the state file including current_loop and the full
expected schema: create an OPEN state JSON, export CB_AUTO_RESET=true, invoke
the same command/function that performs the auto-reset transition (the code path
under test that flips OPEN→CLOSED), then assert the resulting CB_STATE_FILE
contains "current_loop": 0 and all expected keys (state, last_change,
consecutive_no_progress, consecutive_same_error, consecutive_permission_denials,
last_progress_loop, total_opens, reason) to ensure the auto-reset writer
preserves the schema.

📥 Commits

Reviewing files that changed from the base of the PR and between 13e35c4 and 257bfea.

📒 Files selected for processing (5)
  • CLAUDE.md
  • lib/circuit_breaker.sh
  • ralph_loop.sh
  • tests/unit/test_circuit_breaker_recovery.bats
  • tests/unit/test_exit_detection.bats
