fix(reflection): detect planning loops via GenAI prompts, fix inferTaskType misclassification (#115) #117

Merged: dzianisv merged 1 commit into main from fix/115-reflection-stuck-research-misclassification on Feb 15, 2026

Summary

Fixes #115 — sessions where the agent only reads/explores files but never implements code changes were being marked as "complete" by the reflection plugin.

Root Cause (two interacting bugs, plus a missing check)

  1. inferTaskType() misclassification: The regex research|investigate|analyze|compare|evaluate|study matched before fix|bug|issue|error|regression. Tasks containing both "investigate" AND "fix" were classified as "research" instead of "coding".

  2. Research tasks bypass ALL workflow gates: When taskType === "research", requiresTests, requiresBuild, requiresPR, and requiresCI are all set to false. With no requirements, evaluateSelfAssessment() finds missing.length === 0, and if the LLM returns status: "complete", the task is marked complete without feedback.

  3. No planning loop detection in GenAI prompts: The self-assessment and judge prompts had no rule checking whether the agent actually made code changes vs only reading/exploring.
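
The interaction between the first two items can be illustrated with a minimal sketch. It is a hypothetical reconstruction, not the plugin's actual code: only the two regexes, the inferTaskType name, and the four requires* flags come from the description above. The "other" fallback, the gatesFor() helper, and the sample task string are assumptions made for illustration.

```typescript
// Hypothetical reconstruction of the pre-fix behavior (illustration only).
type TaskType = "research" | "coding" | "other";

interface WorkflowGates {
  requiresTests: boolean;
  requiresBuild: boolean;
  requiresPR: boolean;
  requiresCI: boolean;
}

// Pre-fix ordering: the research regex is tested first, so it wins
// whenever a task mentions both kinds of keyword.
function inferTaskTypeBefore(task: string): TaskType {
  if (/research|investigate|analyze|compare|evaluate|study/i.test(task)) {
    return "research";
  }
  if (/fix|bug|issue|error|regression/i.test(task)) {
    return "coding";
  }
  return "other"; // assumption: the real fallback is not described in this PR
}

// Assumption: gates are simplified to "everything on unless research".
// The description only states that research turns all four gates off.
function gatesFor(taskType: TaskType): WorkflowGates {
  const enabled = taskType !== "research";
  return {
    requiresTests: enabled,
    requiresBuild: enabled,
    requiresPR: enabled,
    requiresCI: enabled,
  };
}

const task = "Investigate and fix the login bug";
const taskType = inferTaskTypeBefore(task); // "research", although a fix was requested
const gates = gatesFor(taskType);           // all four gates disabled
console.log(taskType, gates);
```

With every gate disabled, nothing can appear in the missing list, so a status: "complete" from the LLM is accepted as-is.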

Changes

  • inferTaskType() refactored — prioritizes coding action keywords (fix, implement, add, create, etc.) over research classification; adds GitHub issue URL detection
  • Planning loop detection via GenAI prompts — added "PLANNING LOOP CHECK" rules to self-assessment prompt and judge/analyze prompt telling the LLM to set status: "in_progress" / complete: false when a coding task shows only read operations
  • Stuck-detection eval prompt enhanced — added planning loop rule scoped to message_completed: true (avoids interfering with "WORKING" priority when tools are still running)
  • Mirror fix in test-helpers: inferTaskType() in reflection-3.test-helpers.ts updated identically
  • Unit tests — 5 new tests for inferTaskType, evaluateSelfAssessment, detectPlanningLoop, buildEscalatingFeedback
  • Eval test case — new stuck-detection case: "Planning loop - agent only read/explored, never wrote code"
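
The reordering described in the first bullet might look roughly like the sketch below. It is not the merged implementation: the keyword list is limited to the four examples named above (the "etc." is left unfilled), and the issue-URL pattern, the decision to treat an issue URL as a coding signal, the word-boundary matching, and the "other" fallback are all assumptions.

```typescript
// Hypothetical sketch of the post-fix ordering (illustration only).
type TaskType = "research" | "coding" | "other";

// Only the keywords named in this PR description; the real list is longer.
const CODING_ACTION = /\b(fix|implement|add|create)\b/i;

// Assumption: a GitHub issue URL is treated as a signal for a coding task.
const GITHUB_ISSUE_URL = /https?:\/\/github\.com\/[^\/\s]+\/[^\/\s]+\/issues\/\d+/i;

const RESEARCH = /\b(research|investigate|analyze|compare|evaluate|study)\b/i;

function inferTaskTypeSketch(task: string): TaskType {
  // Coding signals are checked first, so "investigate and fix" is coding.
  if (CODING_ACTION.test(task) || GITHUB_ISSUE_URL.test(task)) {
    return "coding";
  }
  if (RESEARCH.test(task)) {
    return "research";
  }
  return "other"; // assumption: the real fallback is not described in this PR
}

console.log(inferTaskTypeSketch("Investigate and fix the login bug"));
console.log(inferTaskTypeSketch("Compare two caching libraries"));
```

Checking action keywords first means a task is only classified as research when it asks for no code change at all, which keeps the workflow gates active for mixed requests.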

Design Decision

Per feedback on PR #114, planning loop detection is done entirely via GenAI prompts, not mechanical heuristics or counters. The detectPlanningLoop() function still exists and is used for buildEscalatingFeedback() (choosing feedback text style), but does NOT mechanically override analysis.complete.
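
A minimal sketch of that separation of concerns follows, assuming a simplified shape for the analysis object and tool calls. The *Sketch function names, the ToolCall and Analysis interfaces, the tool names in WRITE_TOOLS, and the feedback strings are all hypothetical; only the idea that the detector picks feedback wording without overriding analysis.complete comes from the description above.

```typescript
// Hypothetical sketch: the detector only selects feedback wording.
interface ToolCall {
  name: string;
}

interface Analysis {
  complete: boolean; // verdict produced by the GenAI judge
}

// Assumption: these tool names are placeholders for "write" operations.
const WRITE_TOOLS = new Set(["write", "edit", "patch"]);

// True when the session used tools but none of them modified files.
function detectPlanningLoopSketch(calls: ToolCall[]): boolean {
  return calls.length > 0 && calls.every((c) => !WRITE_TOOLS.has(c.name));
}

function buildFeedbackSketch(analysis: Analysis, calls: ToolCall[]): string | null {
  // The LLM verdict stands: the detector never flips analysis.complete.
  if (analysis.complete) {
    return null;
  }
  // The detector is consulted only to choose the style of feedback text.
  return detectPlanningLoopSketch(calls)
    ? "You have only read files so far. Start making the code changes."
    : "The task is not complete yet. Continue with the remaining steps.";
}

const readOnly: ToolCall[] = [{ name: "read" }, { name: "grep" }];
console.log(buildFeedbackSketch({ complete: false }, readOnly));
```

Keeping the mechanical check out of the completion decision means a false positive in the detector can at worst change the tone of the feedback, never the verdict.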

Test Results

| Suite | Result |
| --- | --- |
| npm test | 320 pass, 5 skipped |
| eval:judge | 23/23 (100%) |
| eval:stuck | 18/18 (100%) |
| eval:compression | 12/12 (100%) |

…skType misclassification (#115)

Root cause: tasks containing both 'research' and 'fix/implement' keywords were
misclassified as 'research' because the research regex matched first. With
taskType='research', all workflow gates were disabled, allowing the LLM to mark
read-only sessions as 'complete'.

Changes:
- Refactor inferTaskType() to prioritize coding action keywords (fix, implement,
  add, create, etc.) over research classification. Add GitHub issue URL detection.
- Add PLANNING LOOP CHECK rules to self-assessment and judge GenAI prompts so
  the LLM itself detects when a coding task only has read operations.
- Add planning loop rule to stuck-detection eval prompt (scoped to
  message_completed=true to avoid interfering with 'working' priority).
- Mirror inferTaskType() fix in test-helpers.
- Add unit tests for inferTaskType, evaluateSelfAssessment, detectPlanningLoop,
  and buildEscalatingFeedback.
- Add eval test case for planning loop detection.

All evals pass: judge 23/23, stuck 18/18, compression 12/12.
Unit tests: 320 pass (5 skipped).
@dzianisv merged commit 8a3bf0a into main on Feb 15, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Session stuck again, reflection didn't push, task wasn't completed
