fix(loop): prevent stale exit signals from causing premature exit (#194)#218
Conversation
When a previous Ralph run ended without calling reset_session() — crash,
SIGKILL, or API-limit user exit — stale completion_indicators persisted
in .exit_signals. On the next invocation, should_exit_gracefully() read
these stale signals and exited the loop before execute_claude_code() was
ever called (visible as "Current loop: #null" in circuit breaker status).
Four fixes:
1. Unconditionally reset .exit_signals and .response_analysis at startup
before the main loop (primary fix)
2. Add reset_session("api_limit_exit") to the API-limit user-exit path
3. Add diagnostic logging of signal counts in should_exit_gracefully()
4. Include current_loop:0 in circuit breaker init/reset, add jq fallback
in show_circuit_status() to prevent #null display
WalkthroughThis PR fixes premature exit behavior caused by stale exit signals. It introduces startup state reset logic to clear exit signal files, adds a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
Code Review SummaryThis PR effectively addresses Issue #194 by adding unconditional startup state reset to prevent stale exit signals from causing premature exits. The implementation is clean and well-tested. ✅ Strengths
📝 Minor Observations
✅ Acceptance Criteria Verification
🔍 Code Quality
🎯 RecommendationAPPROVE - This is a solid, well-tested fix that correctly addresses the reported issue without introducing new problems or complexity. The changes are minimal, focused, and aligned with the project's code quality standards. |
Code Review SummaryThis PR effectively addresses Issue #194 by adding unconditional startup state reset to prevent stale exit signals from causing premature exits. The implementation is clean and well-tested. Strengths
Minor Observations
Acceptance Criteria Verification
Code Quality
RecommendationAPPROVE - This is a solid, well-tested fix that correctly addresses the reported issue without introducing new problems or complexity. The changes are minimal, focused, and aligned with the project code quality standards. |
|
Review comment posted successfully at: #218 (comment) |
Review: fix(loop) - prevent stale exit signals causing premature exit (Issue 194)This is a well-targeted fix for a real operational issue. The root cause analysis is accurate and the primary fix is appropriately unconditional. A few observations: What works wellPrimary fix is clean. Resetting Backward compatibility in circuit breaker. Using Gated diagnostic logging. The
Issues worth addressing1. Weak "before reset" assertion in run should_exit_gracefully
[[ "$output" != "" ]] # Would exitThe comment says "Would exit" but the assertion only checks that output is non-empty, not that an exit condition was actually triggered. 2. rm -f "$RESPONSE_ANALYSIS_FILE" 2>/dev/null
3. Static analysis tests are fragile The tests 4. Hardcoded exit signals JSON structure The startup reset inlines the structure: echo '{"test_only_loops": [], "done_signals": [], "completion_indicators": []}' > "$EXIT_SIGNALS_FILE"This same pattern presumably appears elsewhere in the codebase (e.g., Minor nitThe OverallThe fix is correct and addresses the root cause. The circuit breaker display polish is a nice addition. The main actionable item before merge is strengthening the assertion in the |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/circuit_breaker.sh (1)
46-57:⚠️ Potential issue | 🟠 MajorInclude
opened_atin initialized/reset circuit state payloads.The state JSON written in these paths omits
opened_at, so the schema is inconsistent across writers. Please includeopened_at(and keep fallback-to-last_changeonly for legacy files).Suggested patch
{ "state": "$CB_STATE_CLOSED", "last_change": "$(get_iso_timestamp)", "consecutive_no_progress": 0, "consecutive_same_error": 0, "consecutive_permission_denials": 0, "last_progress_loop": 0, "total_opens": 0, "reason": "", - "current_loop": 0 + "current_loop": 0, + "opened_at": "" }{ "state": "$CB_STATE_CLOSED", "last_change": "$(get_iso_timestamp)", "consecutive_no_progress": 0, "consecutive_same_error": 0, "consecutive_permission_denials": 0, "last_progress_loop": 0, "total_opens": 0, "reason": "$reason", - "current_loop": 0 + "current_loop": 0, + "opened_at": "" }As per coding guidelines, "Circuit breaker state files must include
opened_attimestamp field; fall back tolast_changefor backward compatibility with old state files".Also applies to: 420-431
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/circuit_breaker.sh` around lines 46 - 57, The JSON written to CB_STATE_FILE when initializing/resetting the circuit breaker (using CB_STATE_CLOSED and get_iso_timestamp) is missing the opened_at field; update the payload written in the here-doc to include "opened_at": "$(get_iso_timestamp)" (or set opened_at to the same value as last_change) so new state files include opened_at while retaining logic elsewhere that falls back to last_change for legacy files; ensure the same change is applied to the other initialization/reset block referenced in the comment (around the section noted as also applies to lines 420-431) so all writers produce the opened_at field.
🧹 Nitpick comments (2)
tests/unit/test_exit_detection.bats (1)
1245-1315: Prefer behavior assertions over source-grep assertions for these new tests.These checks are currently tied to script text and can pass on false positives. Consider executing the relevant path and asserting file/state effects (
.exit_signalscleared,.response_analysisremoved, and session reset side effects) instead.Based on learnings: "Test suite must achieve 100% test pass rate with comprehensive coverage of ... exit detection ... edge cases ..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_exit_detection.bats` around lines 1245 - 1315, Replace fragile source-grep assertions with behavior-based assertions: for the "startup resets stale exit signals before main loop" test, actually run the startup path (or source and call main startup helper) and assert EXIT_SIGNALS_FILE is cleared/rewritten and RESPONSE_ANALYSIS_FILE is removed; for "stale exit signals do not cause premature exit" leave the simulated .exit_signals and .response_analysis setup but invoke the actual startup reset code (or call the reset helper used by main) and then call should_exit_gracefully() and assert it returns/prints nothing, not by grepping the script; for "should_exit_gracefully logs diagnostic signal counts" call should_exit_gracefully() and assert the runtime output contains the expected diagnostic strings (recent_test_loops, recent_done_signals, recent_completion_indicators) and that log_status was invoked by checking the log output, not the source; for "API limit user-exit path calls reset_session" exercise the API-limit user-exit flow (simulate user_choice == "2" or call the function handling that path) and assert side effects of reset_session (e.g., session file/state cleared or a known marker updated) occur before the break; reference functions/variables to change: should_exit_gracefully, reset_session, main/startup reset helper, EXIT_SIGNALS_FILE, RESPONSE_ANALYSIS_FILE, and user_choice handling code in ralph_loop.sh.tests/unit/test_circuit_breaker_recovery.bats (1)
450-488: Add one regression test forCB_AUTO_RESET=truestate rewrite.Current additions cover init/reset/display, but not the OPEN→CLOSED auto-reset transition writer. A dedicated assertion for
current_loop(and full expected schema) there would prevent gaps.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_circuit_breaker_recovery.bats` around lines 450 - 488, Add a regression test that verifies the OPEN→CLOSED auto-reset path (when CB_AUTO_RESET=true) rewrites the state file including current_loop and the full expected schema: create an OPEN state JSON, export CB_AUTO_RESET=true, invoke the same command/function that performs the auto-reset transition (the code path under test that flips OPEN→CLOSED), then assert the resulting CB_STATE_FILE contains "current_loop": 0 and all expected keys (state, last_change, consecutive_no_progress, consecutive_same_error, consecutive_permission_denials, last_progress_loop, total_opens, reason) to ensure the auto-reset writer preserves the schema.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@lib/circuit_breaker.sh`:
- Around line 46-57: The JSON written to CB_STATE_FILE when
initializing/resetting the circuit breaker (using CB_STATE_CLOSED and
get_iso_timestamp) is missing the opened_at field; update the payload written in
the here-doc to include "opened_at": "$(get_iso_timestamp)" (or set opened_at to
the same value as last_change) so new state files include opened_at while
retaining logic elsewhere that falls back to last_change for legacy files;
ensure the same change is applied to the other initialization/reset block
referenced in the comment (around the section noted as also applies to lines
420-431) so all writers produce the opened_at field.
---
Nitpick comments:
In `@tests/unit/test_circuit_breaker_recovery.bats`:
- Around line 450-488: Add a regression test that verifies the OPEN→CLOSED
auto-reset path (when CB_AUTO_RESET=true) rewrites the state file including
current_loop and the full expected schema: create an OPEN state JSON, export
CB_AUTO_RESET=true, invoke the same command/function that performs the
auto-reset transition (the code path under test that flips OPEN→CLOSED), then
assert the resulting CB_STATE_FILE contains "current_loop": 0 and all expected
keys (state, last_change, consecutive_no_progress, consecutive_same_error,
consecutive_permission_denials, last_progress_loop, total_opens, reason) to
ensure the auto-reset writer preserves the schema.
In `@tests/unit/test_exit_detection.bats`:
- Around line 1245-1315: Replace fragile source-grep assertions with
behavior-based assertions: for the "startup resets stale exit signals before
main loop" test, actually run the startup path (or source and call main startup
helper) and assert EXIT_SIGNALS_FILE is cleared/rewritten and
RESPONSE_ANALYSIS_FILE is removed; for "stale exit signals do not cause
premature exit" leave the simulated .exit_signals and .response_analysis setup
but invoke the actual startup reset code (or call the reset helper used by main)
and then call should_exit_gracefully() and assert it returns/prints nothing, not
by grepping the script; for "should_exit_gracefully logs diagnostic signal
counts" call should_exit_gracefully() and assert the runtime output contains the
expected diagnostic strings (recent_test_loops, recent_done_signals,
recent_completion_indicators) and that log_status was invoked by checking the
log output, not the source; for "API limit user-exit path calls reset_session"
exercise the API-limit user-exit flow (simulate user_choice == "2" or call the
function handling that path) and assert side effects of reset_session (e.g.,
session file/state cleared or a known marker updated) occur before the break;
reference functions/variables to change: should_exit_gracefully, reset_session,
main/startup reset helper, EXIT_SIGNALS_FILE, RESPONSE_ANALYSIS_FILE, and
user_choice handling code in ralph_loop.sh.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6a071d0-24b0-4a25-82d7-b0c5a23bd6e6
📒 Files selected for processing (5)
CLAUDE.mdlib/circuit_breaker.shralph_loop.shtests/unit/test_circuit_breaker_recovery.batstests/unit/test_exit_detection.bats

Summary
Fixes #194: Ralph ends early despite remaining steps in fix_plan.md
Root cause: When a previous Ralph run ended without calling
reset_session()(crash, SIGKILL, or API-limit user exit), stalecompletion_indicatorspersisted in.exit_signals. On the next invocation,should_exit_gracefully()read these stale signals and exited the loop beforeexecute_claude_code()was ever called.Four fixes:
.exit_signalsand.response_analysisat startup before the main loop (primary fix)reset_session("api_limit_exit")to the API-limit user-exit pathshould_exit_gracefully()for future debuggingcurrent_loop: 0in circuit breaker init/reset, add jq fallback inshow_circuit_status()to prevent#nulldisplayAcceptance Criteria
.exit_signalsshould_exit_gracefully()logs signal counts for diagnosability--circuit-statusshows0instead of#nullfor fresh stateTest Plan
test_exit_detection.bats, 3 intest_circuit_breaker_recovery.batsCloses #194
Summary by CodeRabbit
Bug Fixes
New Features
Tests
Documentation