fix(analyzer): suppress heuristic exit in JSON mode, raise text threshold (#224) #230
Conversation
The heuristic exit condition at `response_analyzer.sh:641` used `>= 40 OR has_completion_signal` in the text-mode fallback path. A single keyword match (+10) was enough to set `has_completion_signal=true`, satisfying the OR condition alone — so writing a CHANGELOG ("implementation complete") or README ("setup is done") could silently exit the loop mid-task.

Fix: two-part change in the text-mode fallback path (step 9):

- JSON mode (`output_format == "json"` but parse failed): suppress heuristics entirely. A malformed/truncated JSON response is not a completion signal; only an explicit `EXIT_SIGNAL` from a `RALPH_STATUS` block can trigger exit.
- Text mode: raise threshold from `>= 40 OR has_completion_signal` to `>= 70 AND has_completion_signal`, requiring both a high accumulated confidence score AND a structural completion signal simultaneously.

Note: the JSON SUCCESS path already returns early at line 477 and was never affected by this bug.

Adds 4 regression tests covering both modes. Fixes #224
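As an illustration, the old vs. new conditions described above can be sketched in plain shell. Variable names are taken from this PR description; the real logic lives in `lib/response_analyzer.sh` and differs in detail — this is only a sketch of the decision change:

```shell
# Inputs as in the bug scenario: a single completion keyword in a doc edit.
confidence_score=10          # one keyword match, e.g. "implementation complete"
has_completion_signal=true
output_format="text"

# Old condition: the OR short-circuits, so one keyword was enough to exit.
if [ "$confidence_score" -ge 40 ] || [ "$has_completion_signal" = "true" ]; then
  old_exit="true"
else
  old_exit="false"
fi

# New condition: JSON mode suppresses heuristics entirely; text mode needs
# BOTH a high accumulated score AND a structural completion signal.
if [ "$output_format" = "json" ]; then
  new_exit="false"   # only an explicit EXIT_SIGNAL may trigger exit
elif [ "$confidence_score" -ge 70 ] && [ "$has_completion_signal" = "true" ]; then
  new_exit="true"
else
  new_exit="false"
fi

echo "old=$old_exit new=$new_exit"   # → old=true new=false
```

The same keyword-only input that exited the loop under the old condition no longer does under the new one.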
Walkthrough

Updated exit-signal determination in the response analyzer: JSON output mode now ignores heuristic-based exit and requires an explicit `EXIT_SIGNAL` to exit.

Changes
Sequence Diagram(s)

(omitted — changes are heuristic threshold adjustments internal to a single component and do not introduce new multi-component control flow warranting a sequence diagram)

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 5 passed
Code Review: Issue #224 Fix

This is a well-designed fix that addresses the false-positive exit detection issue. The changes are minimal, focused, and properly tested.

Strengths
Considerations
Security

No security concerns. The fix is purely about heuristic logic and doesn't handle user input or external data.

Recommendation

The PR is ready to merge as-is. The fix correctly addresses Issue #224 with minimal risk of regressions.
Review: fix(analyzer): suppress heuristic exit in JSON mode, raise text threshold (Issue 224)

The fix is well-targeted and addresses a real problem. The logic split between JSON mode (suppress entirely) and text mode (AND + higher threshold) is the right architecture. Some observations below.

Correctness: the text-mode threshold is effectively unreachable without a RALPH_STATUS block

Looking at every confidence accumulation point in the text-fallback path:
The new threshold is `>= 70 AND has_completion_signal`. In practice, the heuristic path alone cannot reach 70.

Test No.3 comment is misleading

The test creates a commit, but without `.ralph/.loop_start_sha` the git-change boost never fires, so the comment overstates the confidence score. Consider leaving the change uncommitted so the git diff actually detects it.

Missing positive test case

The test plan checklist includes a check for: Text mode, confidence >= 70 AND has_completion_signal=true should produce exit_signal: true. This case is not implemented in the PR. Given the analysis above (max reachable score is 55), it would require a RALPH_STATUS block with `STATUS: COMPLETE`.

Minor: debug log symmetry

Overall

The JSON-mode suppression is solid and correctly addresses the false-positive exits in Issue 224. The four regression tests cover the important negative cases well. The items above (unreachable branch documentation, misleading test comment, missing positive test) are non-blocking but worth addressing before merge.
🧹 Nitpick comments (1)
tests/unit/test_json_parsing.bats (1)
1126-1146: Test is valid, but the comment about the git confidence boost is inaccurate.

The test correctly asserts that `exit_signal` remains `false` with low confidence. However, the comment on lines 1129 and 1134-1135 claims git changes boost confidence to 30, but this won't happen because:

- `.ralph/.loop_start_sha` is never created in the test setup
- The file is committed, so `git diff --name-only` and `git diff --cached` return nothing

Actual confidence is only 10 (from the "done" keyword). The test still passes since 10 < 70, but the comment is misleading.
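The second point can be demonstrated in a throwaway repo (paths and names here are illustrative, not from the PR): once the change is committed, both diffs the analyzer would consult come back empty.

```shell
# Build a scratch repo, stage and commit a file — exactly what the test does.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
touch modified_file.txt
git add modified_file.txt
git -c user.email=t@example.com -c user.name=t commit -q -m "test"

# After the commit there is nothing left for either diff to report,
# so a git-based confidence boost cannot fire.
unstaged=$(git diff --name-only)
staged=$(git diff --cached --name-only)
echo "unstaged='$unstaged' staged='$staged'"   # → unstaged='' staged=''
```

Leaving the change staged but uncommitted (or comparing against a recorded start SHA) is what makes the boost observable.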
💡 Suggested fix: Update comment or adjust test setup
Either update the comment to reflect actual behavior:
Either update the comment to reflect actual behavior:

```diff
-  # Create text output with completion keyword + git changes (score: 10+20=30 < 70)
-  # to show it won't exit on low confidence
+  # Create text output with completion keyword (score: 10 < 70)
+  # to show it won't exit on low confidence
   cat > "$output_file" << 'EOF'
 All tasks are done. The implementation is complete.
 EOF
-  # Simulate git file changes to boost confidence to 30 (10 keyword + 20 git)
-  # Still below 70 — should NOT exit
-  touch modified_file.txt
-  git add modified_file.txt
-  git commit -m "test" --allow-empty > /dev/null 2>&1
```

Or, to actually test with the git confidence boost, create the `.loop_start_sha` and leave changes uncommitted:

```diff
+  # Store initial SHA for git change detection
+  git rev-parse HEAD > "$RALPH_DIR/.loop_start_sha"
+
   # Create text output with completion keyword + git changes (score: 10+20=30 < 70)
   # to show it won't exit on low confidence
   cat > "$output_file" << 'EOF'
 All tasks are done. The implementation is complete.
 EOF
-  # Simulate git file changes to boost confidence to 30 (10 keyword + 20 git)
+  # Simulate uncommitted git file changes to boost confidence to 30 (10 keyword + 20 git)
   # Still below 70 — should NOT exit
   touch modified_file.txt
   git add modified_file.txt
-  git commit -m "test" --allow-empty > /dev/null 2>&1
+  # Leave staged but uncommitted so git diff --cached detects it
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_json_parsing.bats` around lines 1126 - 1146, The comment claiming git changes boost confidence to 30 is inaccurate because the test never creates .loop_start_sha and commits the change (so git diff returns nothing), so actual confidence is only 10; fix by either updating the comment to state that only the "done" keyword provides confidence (reference analyze_response and $RALPH_DIR/.response_analysis, modified_file.txt and the git add/commit lines), or change the test setup to actually produce a git diff-based boost by creating a .loop_start_sha snapshot and leaving modified_file.txt as an unstaged/uncommitted change (so git diff --name-only detects it) before calling analyze_response; pick one of these two approaches and apply it to the test block containing the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/unit/test_json_parsing.bats`:
- Around line 1126-1146: The comment claiming git changes boost confidence to 30
is inaccurate because the test never creates .loop_start_sha and commits the
change (so git diff returns nothing), so actual confidence is only 10; fix by
either updating the comment to state that only the "done" keyword provides
confidence (reference analyze_response and $RALPH_DIR/.response_analysis,
modified_file.txt and the git add/commit lines), or change the test setup to
actually produce a git diff-based boost by creating a .loop_start_sha snapshot
and leaving modified_file.txt as an unstaged/uncommitted change (so git diff
--name-only detects it) before calling analyze_response; pick one of these two
approaches and apply it to the test block containing the assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b47dbc18-43b8-48f6-a12c-5ab513e88d8b
📒 Files selected for processing (2)
- `lib/response_analyzer.sh`
- `tests/unit/test_json_parsing.bats`
PR Review: Issue #224 Fix

This is a well-designed fix that addresses the false-positive exit detection issue. The changes are minimal, focused, and properly tested.

✅ Code Quality
Follow-up Review

The misleading test comment (test No.3) has been corrected in the latest commit -- thanks for addressing that. The test now accurately states that the git-change boost does not fire due to the missing `.ralph/.loop_start_sha`.

Two items from the earlier review remain open:

1. The text-mode threshold remains effectively unreachable via pure heuristics
| Source | Max pts |
|---|---|
| COMPLETION_KEYWORDS match | +10 |
| NO_WORK_PATTERNS match | +15 |
| git file changes detected | +20 |
| Output length decline | +10 |
| Total | 55 |
Since 55 < 70, the `elif` branch can never fire via pure heuristics. When a `STATUS: COMPLETE` block is present (score 100), `exit_signal` is already set before this block is reached, making the `elif` redundant at best.
This is likely the desired outcome -- effectively mandating a RALPH_STATUS block for any exit in text mode. The concern is the current comment implies the threshold is merely strict, not unreachable. A more accurate comment would clarify that 70 is intentionally above the heuristic ceiling and that the branch exists as a safety net if scoring weights are ever increased.
This is non-blocking but could cause confusion during future maintenance.
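To make the arithmetic explicit, summing the table's maxima (numbers taken from the table above) shows the ceiling sits below the new threshold:

```shell
# Maximum points per heuristic source, per the table above.
keyword=10        # COMPLETION_KEYWORDS match
no_work=15        # NO_WORK_PATTERNS match
git_changes=20    # git file changes detected
length_decline=10 # output length decline

ceiling=$((keyword + no_work + git_changes + length_decline))
echo "$ceiling"   # → 55, below the >= 70 threshold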
2. Missing positive test case from the PR checklist
The PR's test plan includes:
Text mode, confidence >= 70 AND has_completion_signal=true -> exit_signal: true
This is still not covered. Given that the only reachable path to score 100 in text mode is via a RALPH_STATUS STATUS: COMPLETE block, the positive test would effectively exercise the RALPH_STATUS path (already covered elsewhere). If the intent is that the elif branch is intentionally unreachable, the checklist item could be marked N/A with a note explaining why. Either close the loop in the test plan or add a test using STATUS: COMPLETE without an explicit EXIT_SIGNAL to document the surviving positive case.
Overall: The core fix (JSON-mode heuristic suppression) is solid and correctly resolves Issue 224. The misleading comment fix is appreciated. The two remaining items are documentation and test-plan hygiene -- non-blocking, but worth resolving to keep the code self-documenting for future contributors.
Code Review: Issue #224 Fix

This is a well-designed fix that correctly addresses the false-positive exit detection problem. The changes are minimal, focused, and properly tested.

Code Quality
Documentation Gap: Effectively Unreachable Branch

The heuristic `elif` branch at line 646 is now unreachable for new input without a RALPH_STATUS block:
Since 55 < 70, this branch can never fire via pure heuristics. When a `STATUS: COMPLETE` RALPH_STATUS block is present, `exit_signal` is already set to `true` and `confidence_score=100` at lines 507-509 — before reaching the heuristic block. This appears intentional (effectively mandating a RALPH_STATUS block for any exit), but the current comment at line 648 implies the threshold is merely "strict" rather than unreachable. Consider adding a note clarifying that 70 is intentionally above the heuristic ceiling and that the branch exists as a safety net if scoring weights are ever increased.

Test Coverage

The 4 new regression tests cover the important negative cases well:
The PR checklist includes: "Text mode, confidence >= 70 AND has_completion_signal=true -> exit_signal: true". Given that the only reachable path to score >= 70 in text mode is via a RALPH_STATUS "STATUS: COMPLETE" block (already covered by existing tests at lines 500-509), this checklist item is effectively covered. Consider adding a note to the PR description or marking it N/A with explanation.

Performance & Security
Recommendation

APPROVED — The fix correctly addresses Issue #224 with minimal risk of regressions. The core JSON-mode suppression is solid, and the text-mode threshold correctly prevents false-positive exits from documentation keywords. The two items above (unreachable branch documentation, test plan checklist clarity) are non-blocking but worth addressing to keep the code self-documenting for future contributors.
Review comment posted: #230 (comment)

The fix is solid and correctly addresses Issue #224. The two minor items noted (effectively unreachable branch documentation and test plan clarity) are non-blocking but would improve future maintainability.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Around line 564-571: The table header "### Test Files (588 tests total)" is
out of sync with the real count; update that header to "### Test Files (591
tests total)" and verify the per-file counts (e.g.,
`test_circuit_breaker_recovery.bats`, `test_cli_parsing.bats`,
`test_cli_modern.bats`, `test_json_parsing.bats`, etc.) to ensure their summed
total equals 591, correcting any individual file counts in the table so the sum
matches the header.
```diff
 ### Test Files (588 tests total)

 | File | Tests | Description |
 |------|-------|-------------|
 | `test_circuit_breaker_recovery.bats` | 22 | Cooldown timer, auto-reset, parse_iso_to_epoch, CLI flag (Issue #160) + current_loop init/display fix (#194) |
 | `test_cli_parsing.bats` | 35 | CLI argument parsing for all flags + monitor parameter forwarding |
 | `test_cli_modern.bats` | 111 | Modern CLI commands (Phase 1.1) + build_claude_command fix + live mode text format fix (#164) + errexit pipeline guard (#175) + ALLOWED_TOOLS tightening (#149) + API limit false positive detection (#183) + Claude CLI command validation (#97) + stale call counter fix (#196) + is_error detection (#134, #199) + set-e removal (#208) + question detection + version check + semver comparison + stderr separation (#190) + productive timeout detection + session ID fallback + stale analysis cleanup (#198) + Extra Usage quota detection (#100) |
-| `test_json_parsing.bats` | 52 | JSON output format parsing + Claude CLI format + session management + array format + question detection (#190) |
+| `test_json_parsing.bats` | 56 | JSON output format parsing + Claude CLI format + session management + array format + question detection (#190) + heuristic exit threshold tests (#224) |
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Count @test blocks across bats files:"
rg -n --type-add 'bats:*.bats' --type bats '^\s*@test\s+' tests | wc -l
echo
echo "Current CLAUDE.md declared total:"
rg -n 'Test Files \([0-9]+ tests total\)' CLAUDE.md
echo
echo "Current CLAUDE.md per-file rows:"
rg -n '^\| `test_.*\.bats` \| [0-9]+ \|' CLAUDE.md
```

Repository: frankbria/ralph-claude-code
Length of output: 2854
Update CLAUDE.md test table total from 588 to 591 tests.
The actual test count is 591 @test blocks across all .bats files, but CLAUDE.md currently documents 588 tests. The per-file test counts in the table (22+35+111+56+26+54+11+20+25+15+50+33+38+23+24+20+15+10 = 588) appear to be missing 3 tests. Review each test file to confirm the correct per-file counts and update the table header to reflect the actual total of 591 tests.
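The per-file sum quoted in the comment can be checked mechanically (counts copied from the sentence above; the discrepancy against the 591 `@test` blocks is what needs reconciling):

```shell
# Per-file test counts as currently listed in the CLAUDE.md table.
counts="22 35 111 56 26 54 11 20 25 15 50 33 38 23 24 20 15 10"

total=0
for n in $counts; do
  total=$((total + n))
done
echo "$total"   # → 588, three short of the 591 counted @test blocks
```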
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CLAUDE.md` around lines 564 - 571, The table header "### Test Files (588
tests total)" is out of sync with the real count; update that header to "###
Test Files (591 tests total)" and verify the per-file counts (e.g.,
`test_circuit_breaker_recovery.bats`, `test_cli_parsing.bats`,
`test_cli_modern.bats`, `test_json_parsing.bats`, etc.) to ensure their summed
total equals 591, correcting any individual file counts in the table so the sum
matches the header.
Review: Issue 224 Fix

This is a clean, focused fix. The root cause analysis is accurate: the old `>= 40 OR has_completion_signal` condition was trivially satisfied by a single keyword match (score: 10), making the threshold meaningless.

What is working well

Format-aware logic is correct. The `output_format` variable (set at line 356 via `detect_output_format()`) correctly distinguishes two scenarios at line 641:
Inline comments are clear. The Issue 224 references and rationale in both the code and the CLAUDE.md update make this easy to maintain.

Debug logging is additive. The `VERBOSE_PROGRESS` debug line goes to stderr and does not affect normal operation.

Missing test: positive text-mode exit case

The 4 new regression tests all verify that conditions do NOT trigger exit. There is no test verifying that the new threshold DOES trigger exit when conditions are genuinely met. A test constructing output with enough signals to push `confidence_score >= 70` alongside `has_completion_signal=true`, then asserting `exit_signal == true`, would protect the happy path. Without a positive case, the threshold could be raised to an arbitrarily high value and all 4 new tests would still pass. This is the most notable gap.

Minor observations

Existing test implicitly changed. The test "analyze_response handles plain text completion signals" (around line 416 in test_json_parsing.bats) checks `has_completion_signal == true` but not `exit_signal`. Under the old behavior that plain text would have set `exit_signal=true`; under the new behavior it would not. The test does not break, but does not capture this behavioral change. Pre-existing, not a blocker.

Threshold rationale not documented. The PR explains the problem but does not mention how 70 was chosen vs. 60 or 80. A brief note on the empirical basis or design intent would help future reviewers.

`output_format` recorded as text when JSON parse fails (line 660). The hardcoded `--arg output_format text` in the jq call is correct since text parsing was what actually ran. But the saved analysis will say `output_format: text` even when the input file was JSON that failed to parse. A brief comment would prevent future confusion.

Bottom line

The fix is correct and addresses Issue 224.
The main recommendation before merge is adding at least one positive test case verifying that `confidence_score >= 70` AND `has_completion_signal == true` actually does set `exit_signal=true`, protecting the happy path from future regressions.
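A shell sketch of the positive case being requested (variable names are assumed from the PR description, not taken from `lib/response_analyzer.sh`; the real scoring differs): a `STATUS: COMPLETE` block pushes the score to 100, so the `>= 70 AND` branch genuinely fires.

```shell
# Simulated agent output containing a structural completion block.
response='RALPH_STATUS
STATUS: COMPLETE'

confidence_score=0
has_completion_signal=false

# A STATUS: COMPLETE line scores 100 and sets the structural signal
# (hypothetical scoring, mirroring the behavior described in the reviews).
if printf '%s\n' "$response" | grep -q '^STATUS: COMPLETE$'; then
  confidence_score=100
  has_completion_signal=true
fi

# The new text-mode condition: both parts must hold.
exit_signal=false
if [ "$confidence_score" -ge 70 ] && [ "$has_completion_signal" = "true" ]; then
  exit_signal=true
fi
echo "exit_signal: $exit_signal"   # → exit_signal: true
```

A bats test built on the same input would pin the happy path so a future threshold bump cannot silently disable exits.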


Summary
Fixes #224 — documentation keywords like "setup is done" or "implementation complete" could silently exit the Ralph loop mid-task.
Root cause: The text-mode heuristic at `response_analyzer.sh:641` used `>= 40 OR has_completion_signal`. A single keyword match sets `has_completion_signal=true` (+10 confidence), which alone satisfied the OR condition — the threshold was never meaningfully enforced.

Fix (text-mode fallback path only — the JSON success path already `return 0`s at line 477):

- JSON mode (parse failed): suppress heuristics entirely; only an explicit `EXIT_SIGNAL` from a `RALPH_STATUS` block can trigger exit.
- Text mode: `>= 40 OR has_completion_signal` → `>= 70 AND has_completion_signal`. Requires both a high accumulated score AND a structural signal simultaneously.

Changes
- `lib/response_analyzer.sh` — 1 condition replaced with format-aware guard + debug log line
- `tests/unit/test_json_parsing.bats` — 4 regression tests

Test plan
- … → `exit_signal: false`
- … → `exit_signal: false`
- Text mode: `has_completion_signal=true` alone (confidence < 70) → `exit_signal: false`
- Text mode: confidence >= 70 AND `has_completion_signal=true` → `exit_signal: true`

Summary by CodeRabbit
Bug Fixes
Tests
Documentation