Skip to content

fix(analyzer): suppress heuristic exit in JSON mode, raise text threshold (#224)#230

Merged
frankbria merged 3 commits into
mainfrom
fix/224-confidence-threshold-false-positives
Mar 24, 2026
Merged

fix(analyzer): suppress heuristic exit in JSON mode, raise text threshold (#224)#230
frankbria merged 3 commits into
mainfrom
fix/224-confidence-threshold-false-positives

Conversation

@frankbria
Copy link
Copy Markdown
Owner

@frankbria frankbria commented Mar 24, 2026

Summary

Fixes #224 — documentation keywords like "setup is done" or "implementation complete" could silently exit the Ralph loop mid-task.

Root cause: The text-mode heuristic at response_analyzer.sh:641 used >= 40 OR has_completion_signal. A single keyword match sets has_completion_signal=true (+10 confidence), which alone satisfied the OR condition — threshold was never meaningfully enforced.

Fix (text-mode fallback path only — JSON success path already return 0s at line 477):

  • JSON mode with failed parse: suppress heuristics entirely. Only an explicit EXIT_SIGNAL from a RALPH_STATUS block can trigger exit.
  • Text mode: raised from >= 40 OR has_completion_signal>= 70 AND has_completion_signal. Requires both a high accumulated score AND a structural signal simultaneously.

Changes

  • lib/response_analyzer.sh — 1 condition replaced with format-aware guard + debug log line
  • tests/unit/test_json_parsing.bats — 4 regression tests

Test plan

  • JSON mode output with completion keywords but no RALPH_STATUS → exit_signal: false
  • JSON mode, multiple documentation phrases → exit_signal: false
  • Text mode, has_completion_signal=true alone (confidence < 70) → exit_signal: false
  • Text mode, confidence ≥ 70 AND has_completion_signal=trueexit_signal: true
  • All 591 existing tests pass

Summary by CodeRabbit

  • Bug Fixes

    • Exit detection is now mode-specific: JSON output requires an explicit exit indicator; text output uses a stricter heuristic (higher confidence plus completion signal) to reduce false positives.
  • Tests

    • Added unit tests validating exit behavior for both JSON and text modes.
  • Documentation

    • Updated documentation to describe the new mode-specific exit rules and rationale.

…hold (#224)

The heuristic exit condition at response_analyzer.sh:641 used `>= 40 OR
has_completion_signal` in the text-mode fallback path. A single keyword
match (+10) was enough to set has_completion_signal=true, satisfying the
OR condition alone — so writing a CHANGELOG ("implementation complete") or
README ("setup is done") could silently exit the loop mid-task.

Fix: two-part change in the text-mode fallback path (step 9):
- JSON mode (output_format == "json" but parse failed): suppress heuristics
  entirely. A malformed/truncated JSON response is not a completion signal;
  only an explicit EXIT_SIGNAL from a RALPH_STATUS block can trigger exit.
- Text mode: raise threshold from `>= 40 OR has_completion_signal` to
  `>= 70 AND has_completion_signal`, requiring both a high accumulated
  confidence score AND a structural completion signal simultaneously.

Note: the JSON SUCCESS path already returns early at line 477 and was
never affected by this bug.

Adds 4 regression tests covering both modes.

Fixes #224
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 24, 2026

Walkthrough

Updated exit-signal determination in the response analyzer: JSON output mode now ignores heuristic-based exit and requires an explicit EXIT_SIGNAL; text mode heuristics require both confidence_score >= 70 and has_completion_signal == "true" to set exit_signal=true.

Changes

Cohort / File(s) Summary
Response Analyzer Logic
lib/response_analyzer.sh
Changed analyze_response() to gate heuristic exit by output_format: suppress heuristics in JSON mode (only explicit EXIT_SIGNAL allowed); in text mode require both confidence_score >= 70 AND has_completion_signal == "true" to trigger exit.
Test Coverage
tests/unit/test_json_parsing.bats
Added four Bats tests validating the new gating: two JSON-mode cases ensure natural-language completion phrases without RALPH_STATUS do not set exit_signal; two text-mode cases confirm low-confidence or missing conditions do not trigger exit.
Documentation
CLAUDE.md
Rewrote “Intelligent Exit Detection”/thresholds to document mode-specific rules: JSON mode requires explicit EXIT_SIGNAL; text mode requires confidence_score >= 70 plus completion signal; updated test count and references to Issue #224.

Sequence Diagram(s)

(omitted — changes are heuristic threshold adjustments internal to a single component and do not introduce new multi-component control flow warranting a sequence diagram)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I hopped through lines of bash at dawn,

Tuned the exit so false flags are gone.
JSON waits for a signal, text counts to seventy,
I thumped my foot — now endings are steady. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: suppressing heuristic exit in JSON mode and raising the text threshold, directly addressing issue #224.
Linked Issues check ✅ Passed The code changes fully implement the requirements from issue #224: JSON mode heuristics are suppressed (only explicit EXIT_SIGNAL triggers exit), text mode threshold raised to 70 AND completion signal required, and comprehensive tests added.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing issue #224: analyzer logic updated, tests added for the new behavior, and documentation updated to reflect the changes. No unrelated modifications present.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/224-confidence-threshold-false-positives

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

Code Review: Issue #224 Fix

This is a well-designed fix that addresses the false-positive exit detection issue. The changes are minimal, focused, and properly tested.

✅ Strengths

  1. Correct root cause analysis: The PR accurately identifies that allowed a single keyword match (score: 10) to trigger exit, making the threshold meaningless.

  2. Format-aware strategy:

    • JSON mode: Suppresses heuristics entirely when JSON detection succeeds. This is appropriate because:
      • Valid JSON should rely on explicit from structured output
      • If JSON parsing fails despite valid syntax, the output structure is unreliable
    • Text mode: Raises threshold from → , requiring both high confidence and structural signals
  3. Comprehensive test coverage: 4 regression tests added covering key scenarios:

    • JSON mode without RALPH_STATUS: keywords don't trigger exit
    • JSON mode with multiple completion phrases: still no exit
    • Text mode: requires both confidence>=70 AND has_completion_signal
    • Text mode: has_completion_signal alone (score<70) doesn't exit
  4. Clear documentation: Inline comments explain the rationale with Issue Exit confidence threshold (≥40) causes false-positive completions on documentation keywords #224 references, making future maintenance easier.

  5. Debug logging: JSON mode path includes optional debug output for troubleshooting.

⚠️ Considerations

  1. Breaking change for text mode: The threshold change (OR → AND with higher value) is a breaking change, but this is intentional as the old behavior was buggy. No existing tests appear to rely on heuristic-based exit (all tests use explicit RALPH_STATUS or JSON signals).

  2. Potential edge case: When fails on valid JSON (e.g., missing expected fields), heuristics are suppressed. This is conservative but appropriate—the output format is unreliable in this case.

  3. Performance: Negligible impact. Added one conditional check with no complex operations.

🔒 Security

No security concerns. The fix is purely about heuristic logic and doesn't handle user input or external data.

📝 Recommendations

The PR is ready to merge as-is. The fix correctly addresses Issue #224 with minimal risk of regressions.

@github-actions
Copy link
Copy Markdown

Code Review: Issue #224 Fix

This is a well-designed fix that addresses the false-positive exit detection issue. The changes are minimal, focused, and properly tested.

Strengths

  1. Correct root cause analysis: The PR accurately identifies that the threshold of ">= 40 OR has_completion_signal" allowed a single keyword match (score: 10) to trigger exit, making the threshold meaningless.

  2. Format-aware strategy:

    • JSON mode: Suppresses heuristics entirely when JSON detection succeeds. This is appropriate because valid JSON should rely on explicit EXIT_SIGNAL from structured output, and if JSON parsing fails despite valid syntax, the output structure is unreliable.
    • Text mode: Raises threshold from ">= 40 OR" to ">= 70 AND", requiring both high confidence and structural signals.
  3. Comprehensive test coverage: 4 regression tests added covering key scenarios:

    • JSON mode without RALPH_STATUS: keywords don't trigger exit
    • JSON mode with multiple completion phrases: still no exit
    • Text mode: requires both confidence>=70 AND has_completion_signal
    • Text mode: has_completion_signal alone (score<70) doesn't exit
  4. Clear documentation: Inline comments explain the rationale with Issue Exit confidence threshold (≥40) causes false-positive completions on documentation keywords #224 references, making future maintenance easier.

  5. Debug logging: JSON mode path includes optional debug output for troubleshooting.

Considerations

  1. Breaking change for text mode: The threshold change (OR to AND with higher value) is a breaking change, but this is intentional as the old behavior was buggy. No existing tests appear to rely on heuristic-based exit (all tests use explicit RALPH_STATUS or JSON signals).

  2. Potential edge case: When parse_json_response fails on valid JSON (e.g., missing expected fields), heuristics are suppressed. This is conservative but appropriate—the output format is unreliable in this case.

  3. Performance: Negligible impact. Added one conditional check with no complex operations.

Security

No security concerns. The fix is purely about heuristic logic and doesn't handle user input or external data.

Recommendation

The PR is ready to merge as-is. The fix correctly addresses Issue #224 with minimal risk of regressions.

@github-actions
Copy link
Copy Markdown


Review: fix(analyzer): suppress heuristic exit in JSON mode, raise text threshold (Issue 224)

The fix is well-targeted and addresses a real problem. The logic split between JSON mode (suppress entirely) and text mode (AND + higher threshold) is the right architecture. Some observations below.

Correctness: the text-mode threshold is effectively unreachable without a RALPH_STATUS block

Looking at every confidence accumulation point in the text-fallback path:

Source Max points
COMPLETION_KEYWORDS match +10
NO_WORK_PATTERNS match +15
git file changes detected +20
Output length decline +10
Total maximum 55

The new threshold is >= 70 AND has_completion_signal. Since 55 < 70, the elif branch at line 646 is now unreachable in text mode unless a RALPH_STATUS block is present. When a RALPH_STATUS block with an EXIT_SIGNAL is present, explicit_exit_signal_found=true skips the entire block. When STATUS:COMPLETE is present without EXIT_SIGNAL, confidence_score is set to 100 and exit_signal is already set to true before reaching line 646 — so the elif is entered but only redundantly re-sets an already-true value.

In practice, the heuristic elif branch is dead code for new input. This may be intentional — effectively requiring a RALPH_STATUS block for any exit — but it is worth calling out explicitly in the comment rather than implying the threshold is merely "hard to reach."

Test No.3 comment is misleading

The test creates a commit, but without .ralph/.loop_start_sha set and with no staged/unstaged changes remaining after the commit, the git detection block (lines 591-613) finds zero modified files, so the +20 points are not added. The actual score is 10, not 30. The test still passes (10 < 70), but the comment misrepresents what score is actually achieved.

Consider using git add without committing, or explicitly setting .ralph/.loop_start_sha to an earlier SHA so the committed changes are detected.

Missing positive test case

The test plan checklist includes a check for: Text mode, confidence >= 70 AND has_completion_signal=true should produce exit_signal: true. This case is not implemented in the PR. Given the analysis above (max reachable score is 55), it would require a RALPH_STATUS block with STATUS: COMPLETE (no EXIT_SIGNAL) to reach score=100. A test covering that path would close the checklist item and document the surviving positive case.

Minor: debug log symmetry

The VERBOSE_PROGRESS debug log at line 645 fires when JSON-mode heuristics are suppressed. A matching debug line inside the elif branch (text mode, when exit IS triggered) would make both paths traceable at the same verbosity level.

Overall

The JSON-mode suppression is solid and correctly addresses the false-positive exits in Issue 224. The four regression tests cover the important negative cases well. The items above (unreachable branch documentation, misleading test comment, missing positive test) are non-blocking but worth addressing before merge.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unit/test_json_parsing.bats (1)

1126-1146: Test is valid, but the comment about git confidence boost is inaccurate.

The test correctly asserts that exit_signal remains false with low confidence. However, the comment on lines 1129 and 1134-1135 claims git changes boost confidence to 30, but this won't happen because:

  1. .ralph/.loop_start_sha is never created in the test setup
  2. The file is committed, so git diff --name-only and git diff --cached return nothing

Actual confidence is only 10 (from the "done" keyword). The test still passes since 10 < 70, but the comment is misleading.

💡 Suggested fix: Update comment or adjust test setup

Either update the comment to reflect actual behavior:

-    # Create text output with completion keyword + git changes (score: 10+20=30 < 70)
-    # to show it won't exit on low confidence
+    # Create text output with completion keyword (score: 10 < 70)
+    # to show it won't exit on low confidence
     cat > "$output_file" << 'EOF'
 All tasks are done. The implementation is complete.
 EOF
-    # Simulate git file changes to boost confidence to 30 (10 keyword + 20 git)
-    # Still below 70 — should NOT exit
-    touch modified_file.txt
-    git add modified_file.txt
-    git commit -m "test" --allow-empty > /dev/null 2>&1

Or, to actually test with git confidence boost, create the .loop_start_sha and leave changes uncommitted:

+    # Store initial SHA for git change detection
+    git rev-parse HEAD > "$RALPH_DIR/.loop_start_sha"
+
     # Create text output with completion keyword + git changes (score: 10+20=30 < 70)
     # to show it won't exit on low confidence
     cat > "$output_file" << 'EOF'
 All tasks are done. The implementation is complete.
 EOF
-    # Simulate git file changes to boost confidence to 30 (10 keyword + 20 git)
+    # Simulate uncommitted git file changes to boost confidence to 30 (10 keyword + 20 git)
     # Still below 70 — should NOT exit
     touch modified_file.txt
     git add modified_file.txt
-    git commit -m "test" --allow-empty > /dev/null 2>&1
+    # Leave staged but uncommitted so git diff --cached detects it
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_json_parsing.bats` around lines 1126 - 1146, The comment
claiming git changes boost confidence to 30 is inaccurate because the test never
creates .loop_start_sha and commits the change (so git diff returns nothing), so
actual confidence is only 10; fix by either updating the comment to state that
only the "done" keyword provides confidence (reference analyze_response and
$RALPH_DIR/.response_analysis, modified_file.txt and the git add/commit lines),
or change the test setup to actually produce a git diff-based boost by creating
a .loop_start_sha snapshot and leaving modified_file.txt as an
unstaged/uncommitted change (so git diff --name-only detects it) before calling
analyze_response; pick one of these two approaches and apply it to the test
block containing the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/unit/test_json_parsing.bats`:
- Around line 1126-1146: The comment claiming git changes boost confidence to 30
is inaccurate because the test never creates .loop_start_sha and commits the
change (so git diff returns nothing), so actual confidence is only 10; fix by
either updating the comment to state that only the "done" keyword provides
confidence (reference analyze_response and $RALPH_DIR/.response_analysis,
modified_file.txt and the git add/commit lines), or change the test setup to
actually produce a git diff-based boost by creating a .loop_start_sha snapshot
and leaving modified_file.txt as an unstaged/uncommitted change (so git diff
--name-only detects it) before calling analyze_response; pick one of these two
approaches and apply it to the test block containing the assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b47dbc18-43b8-48f6-a12c-5ab513e88d8b

📥 Commits

Reviewing files that changed from the base of the PR and between f702543 and 5517bf7.

📒 Files selected for processing (2)
  • lib/response_analyzer.sh
  • tests/unit/test_json_parsing.bats

Repository owner deleted a comment from github-actions Bot Mar 24, 2026
@github-actions
Copy link
Copy Markdown

PR Review: Issue #224 Fix

This is a well-designed fix that addresses the false-positive exit detection issue. The changes are minimal, focused, and properly tested.

✅ Code Quality

  1. Correct root cause analysis: The PR accurately identifies that >= 40 OR has_completion_signal allowed a single keyword match (score: 10) to trigger exit, making the threshold meaningless.

  2. Format-aware strategy:

    • JSON mode: Suppresses heuristics entirely when JSON detection succeeds. This is appropriate because valid JSON should rely on explicit EXIT_SIGNAL from structured output.
    • Text mode: Raises threshold from >= 40 OR to >= 70 AND, requiring both high confidence and structural signals.
  3. Clear documentation: Inline comments at lib/response_analyzer.sh:641-650 explain the rationale with Issue Exit confidence threshold (≥40) causes false-positive completions on documentation keywords #224 references, making future maintenance easier.

⚠️ Potential Issues

  1. Effectively dead code: The heuristic branch at line 646 is now unreachable for new input without a RALPH_STATUS block:

    • Max confidence without RALPH_STATUS: 55 (10 keyword + 15 no-work + 20 git + 10 decline)
    • New threshold: >= 70 AND has_completion_signal
    • Result: 55 < 70, so the elif branch is never entered

    This appears intentional—effectively requiring a RALPH_STATUS block for any exit—but should be documented in the comment rather than implying the threshold is merely "hard to reach."

  2. Comment issue (already fixed): The misleading comment in test Quality improvements: Bun migration, code quality tooling, documentation consolidation #3 about git changes has been corrected in commit 199e07b.

✅ Performance

Negligible impact. Added one conditional check with no complex operations.

✅ Security

No security concerns. The fix is purely about heuristic logic and doesn't handle user input or external data.

✅ Test Coverage

  • 4 regression tests added covering key negative scenarios
  • Missing test plan item: "Text mode, confidence >= 70 AND has_completion_signal=true → exit_signal: true" is already covered by existing tests (STATUS: COMPLETE without EXIT_SIGNAL sets confidence=100 and exit_signal=true at lines 500-509 before reaching the heuristic block)

Recommendation

The PR is ready to merge as-is. The fix correctly addresses Issue #224 with minimal risk of regressions.

Minor suggestion: Consider adding a comment at line 646 noting that this branch is effectively unreachable without a RALPH_STATUS block (max confidence 55 < threshold 70), to aid future maintenance.

@github-actions
Copy link
Copy Markdown

PR review comment posted to #230. The fix is well-designed and correctly addresses Issue #224, with one minor suggestion to document the effectively unreachable heuristic branch.

New%20session%20-%202026-03-24T23%3A30%3A09.281Z
opencode session  |  github run

@github-actions
Copy link
Copy Markdown


Follow-up Review

The misleading test comment (test No.3) has been corrected in the latest commit -- thanks for addressing that. The test now accurately states that the git-change boost does not fire due to missing .ralph/.loop_start_sha, giving an actual confidence of 10.

Two items from the earlier review remain open:

1. The text-mode elif branch is effectively dead code

The comment says the threshold was "raised" from 40 to 70, implying 70 is achievable. In practice, the maximum score reachable without a RALPH_STATUS block in text mode is:

Source Max pts
COMPLETION_KEYWORDS match +10
NO_WORK_PATTERNS match +15
git file changes detected +20
Output length decline +10
Total 55

Since 55 < 70, the elif branch can never fire via pure heuristics. When a STATUS: COMPLETE block is present (score 100), exit_signal is already set before this block is reached, making the elif redundant at best.

This is likely the desired outcome -- effectively mandating a RALPH_STATUS block for any exit in text mode. The concern is the current comment implies the threshold is merely strict, not unreachable. A more accurate comment would clarify that 70 is intentionally above the heuristic ceiling and that the branch exists as a safety net if scoring weights are ever increased.

This is non-blocking but could cause confusion during future maintenance.

2. Missing positive test case from the PR checklist

The PR's test plan includes:

Text mode, confidence >= 70 AND has_completion_signal=true -> exit_signal: true

This is still not covered. Given that the only reachable path to score 100 in text mode is via a RALPH_STATUS STATUS: COMPLETE block, the positive test would effectively exercise the RALPH_STATUS path (already covered elsewhere). If the intent is that the elif branch is intentionally unreachable, the checklist item could be marked N/A with a note explaining why. Either close the loop in the test plan or add a test using STATUS: COMPLETE without an explicit EXIT_SIGNAL to document the surviving positive case.


Overall: The core fix (JSON-mode heuristic suppression) is solid and correctly resolves Issue 224. The misleading comment fix is appreciated. The two remaining items are documentation and test-plan hygiene -- non-blocking, but worth resolving to keep the code self-documenting for future contributors.

@github-actions
Copy link
Copy Markdown

Code Review: Issue #224 Fix

This is a well-designed fix that correctly addresses the false-positive exit detection problem. The changes are minimal, focused, and properly tested.

Code Quality

  1. Correct root cause analysis: The PR accurately identifies that ">= 40 OR has_completion_signal" allowed a single keyword match (score: 10) to trigger exit, making the threshold meaningless.

  2. Format-aware strategy:

    • JSON mode (line 641-645): Suppresses heuristics entirely when JSON detection succeeds. This is appropriate because valid JSON should rely on explicit EXIT_SIGNAL from structured output.
    • Text mode (line 646-650): Raises threshold from ">= 40 OR" to ">= 70 AND", requiring both high confidence and structural signals.
  3. Clear documentation: Inline comments at lib/response_analyzer.sh:641-650 explain the rationale with Issue Exit confidence threshold (≥40) causes false-positive completions on documentation keywords #224 references, making future maintenance easier.

  4. Test Quality improvements: Bun migration, code quality tooling, documentation consolidation #3 comment fix: The misleading comment about git changes was correctly updated in the latest commit to accurately reflect that ".ralph/.loop_start_sha" is not set, so the +20 boost does not fire.

Documentation Gap: Effectively Unreachable Branch

The heuristic "elif" branch at line 646 is now unreachable for new input without a RALPH_STATUS block:

Source Max pts
COMPLETION_KEYWORDS match +10
NO_WORK_PATTERNS match +15
git file changes detected +20
Output length decline +10
Total 55

Since 55 < 70, this branch can never fire via pure heuristics. When a "STATUS: COMPLETE" RALPH_STATUS block is present, exit_signal is already set to "true" and confidence_score=100 at lines 507-509 — before reaching the heuristic block.

This appears intentional (effectively mandating a RALPH_STATUS block for any exit), but the current comment at line 648 implies the threshold is merely "strict" rather than unreachable. Consider adding a note clarifying that 70 is intentionally above the heuristic ceiling and that the branch exists as a safety net if scoring weights are ever increased.

Test Coverage

The 4 new regression tests cover the important negative cases well:

  • JSON mode without RALPH_STATUS: keywords do not trigger exit (lines 1095-1109)
  • JSON mode with multiple phrases: still no exit (lines 1111-1124)
  • Text mode: requires both confidence>=70 AND has_completion_signal (lines 1126-1142)
  • Text mode: has_completion_signal alone does not exit (lines 1144-1157)

The PR checklist includes: "Text mode, confidence >= 70 AND has_completion_signal=true -> exit_signal: true". Given that the only reachable path to score >=70 in text mode is via a RALPH_STATUS "STATUS: COMPLETE" block (already covered by existing tests at lines 500-509), this checklist item is effectively covered. Consider adding a note to the PR description or marking it N/A with explanation.

Performance & Security

  • Performance: Negligible impact. Added one conditional check with no complex operations.
  • Security: No concerns. The fix is purely about heuristic logic and does not handle user input or external data.

Recommendation

APPROVED — The fix correctly addresses Issue #224 with minimal risk of regressions. The core JSON-mode suppression is solid, and the text-mode threshold correctly prevents false-positive exits from documentation keywords.

The two items above (unreachable branch documentation, test plan checklist clarity) are non-blocking but worth addressing to keep the code self-documenting for future contributors.

@github-actions
Copy link
Copy Markdown

Review comment posted: #230 (comment)

The fix is solid and correctly addresses Issue #224. The two minor items noted (effectively unreachable branch documentation and test plan clarity) are non-blocking but would improve future maintainability.

New%20session%20-%202026-03-24T23%3A41%3A54.353Z
opencode session  |  github run

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Around line 564-571: The table header "### Test Files (588 tests total)" is
out of sync with the real count; update that header to "### Test Files (591
tests total)" and verify the per-file counts (e.g.,
`test_circuit_breaker_recovery.bats`, `test_cli_parsing.bats`,
`test_cli_modern.bats`, `test_json_parsing.bats`, etc.) to ensure their summed
total equals 591, correcting any individual file counts in the table so the sum
matches the header.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 066e7876-c063-4636-a9e9-c60a5b82e060

📥 Commits

Reviewing files that changed from the base of the PR and between 199e07b and b759e80.

📒 Files selected for processing (1)
  • CLAUDE.md

Comment thread CLAUDE.md
Comment on lines +564 to +571
### Test Files (588 tests total)

| File | Tests | Description |
|------|-------|-------------|
| `test_circuit_breaker_recovery.bats` | 22 | Cooldown timer, auto-reset, parse_iso_to_epoch, CLI flag (Issue #160) + current_loop init/display fix (#194) |
| `test_cli_parsing.bats` | 35 | CLI argument parsing for all flags + monitor parameter forwarding |
| `test_cli_modern.bats` | 111 | Modern CLI commands (Phase 1.1) + build_claude_command fix + live mode text format fix (#164) + errexit pipeline guard (#175) + ALLOWED_TOOLS tightening (#149) + API limit false positive detection (#183) + Claude CLI command validation (#97) + stale call counter fix (#196) + is_error detection (#134, #199) + set-e removal (#208) + question detection + version check + semver comparison + stderr separation (#190) + productive timeout detection + session ID fallback + stale analysis cleanup (#198) + Extra Usage quota detection (#100) |
| `test_json_parsing.bats` | 52 | JSON output format parsing + Claude CLI format + session management + array format + question detection (#190) |
| `test_json_parsing.bats` | 56 | JSON output format parsing + Claude CLI format + session management + array format + question detection (#190) + heuristic exit threshold tests (#224) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Count `@test` blocks across bats files:"
rg -n --type-add 'bats:*.bats' --type bats '^\s*@test\s+' tests | wc -l

echo
echo "Current CLAUDE.md declared total:"
rg -n 'Test Files \([0-9]+ tests total\)' CLAUDE.md

echo
echo "Current CLAUDE.md per-file rows:"
rg -n '^\| `test_.*\.bats` \| [0-9]+ \|' CLAUDE.md

Repository: frankbria/ralph-claude-code

Length of output: 2854


Update CLAUDE.md test table total from 588 to 591 tests.

The actual test count is 591 @test blocks across all .bats files, but CLAUDE.md currently documents 588 tests. The per-file test counts in the table (22+35+111+56+26+54+11+20+25+15+50+33+38+23+24+20+15+10 = 588) appear to be missing 3 tests. Review each test file to confirm the correct per-file counts and update the table header to reflect the actual total of 591 tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 564 - 571, The table header "### Test Files (588
tests total)" is out of sync with the real count; update that header to "###
Test Files (591 tests total)" and verify the per-file counts (e.g.,
`test_circuit_breaker_recovery.bats`, `test_cli_parsing.bats`,
`test_cli_modern.bats`, `test_json_parsing.bats`, etc.) to ensure their summed
total equals 591, correcting any individual file counts in the table so the sum
matches the header.

@github-actions
Copy link
Copy Markdown

test comment

@github-actions
Copy link
Copy Markdown

PLACEHOLDER - will be replaced with full review

@github-actions
Copy link
Copy Markdown

Review: Issue 224 Fix

This is a clean, focused fix. The root cause analysis is accurate: the old >= 40 OR has_completion_signal condition was trivially satisfied by a single keyword match (score: 10), making the threshold meaningless.

What is working well

Format-aware logic is correct. The output_format variable (set at line 356 via detect_output_format()) correctly distinguishes two scenarios at line 641:

  • output_format == json at the heuristic check means the file looked like JSON but parsing failed - heuristics are suppressed entirely (the safe, conservative choice)
  • output_format == text means genuine text output - apply the stricter >= 70 AND has_completion_signal gate

Inline comments are clear. The Issue 224 references and rationale in both the code and the CLAUDE.md update make this easy to maintain.

Debug logging is additive. The VERBOSE_PROGRESS debug line goes to stderr and does not affect normal operation.


Missing test: positive text-mode exit case

The 4 new regression tests all verify that conditions do NOT trigger exit. There is no test verifying that the new threshold DOES trigger exit when conditions are genuinely met. A test constructing output with enough signals to push confidence_score >= 70 alongside has_completion_signal=true, then asserting exit_signal == true, would protect the happy path.

Without a positive case, the threshold could be raised to an arbitrarily high value and all 4 new tests would still pass. This is the most notable gap.


Minor observations

Existing test implicitly changed. The test analyze_response handles plain text completion signals (around line 416 in test_json_parsing.bats) checks has_completion_signal == true but not exit_signal. Under the old behavior that plain text would have set exit_signal=true; under the new behavior it would not. The test does not break, but does not capture this behavioral change. Pre-existing, not a blocker.

Threshold rationale not documented. The PR explains the problem but does not mention how 70 was chosen vs. 60 or 80. A brief note on the empirical basis or design intent would help future reviewers.

output_format recorded as text when JSON parse fails (line 660). The hardcoded --arg output_format text in the jq call is correct since text parsing was what actually ran. But the saved analysis will say output_format: text even when the input file was JSON that failed to parse. A brief comment would prevent future confusion.


Bottom line

The fix is correct and addresses Issue 224. The main recommendation before merge is adding at least one positive test case verifying that confidence_score >= 70 AND has_completion_signal == true actually does set exit_signal=true, protecting the happy path from future regressions.

@frankbria frankbria merged commit d72edcd into main Mar 24, 2026
7 of 8 checks passed
@frankbria frankbria deleted the fix/224-confidence-threshold-false-positives branch March 24, 2026 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exit confidence threshold (≥40) causes false-positive completions on documentation keywords

1 participant