
Agent Pattern: Reflection / Self-Correction #172

@yasha-dev1


Overview

The Reflection (Self-Correction) pattern is a generate-critique-refine cycle in which an agent creates an initial response, reflects on its quality through self-critique or an external evaluator, and iteratively improves the output. Done well, this markedly improves both accuracy and output quality.

Performance Impact: Research shows reflection can improve accuracy from 78.6% to 97.1% on complex tasks by enabling agents to catch and correct their own mistakes.

How It Works

  1. Generate: Generator agent produces an initial output
  2. Reflect/Critique: Evaluator agent (or same agent in reflection mode) critiques the output
  3. Revise: Generator incorporates feedback to improve output
  4. Iterate: Repeat reflection cycle until quality threshold met or max iterations reached
  5. Terminate: Return final refined output

Control Flow:

    Task → Generator Agent
              ↓
        Initial Output
              ↓
        Evaluator Agent
              ↓
         Critique/Feedback
              ↓
      ┌─── Good enough? ───┐
      ↓ No                  ↓ Yes
  Generator                Final
  (revise)                Output
      ↓
  Improved Output
      ↓
  (loop back to Evaluator)
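The loop above can be sketched framework-free (hypothetical `generate`, `critique`, and `revise` callables; this is not the proposed PyWorkflow API):

```python
def reflect(task, generate, critique, revise, max_iterations=3, threshold=0.9):
    """Generic generate-critique-refine loop mirroring the steps above."""
    output = generate(task)                      # 1. Generate
    for _ in range(max_iterations):              # 4. Iterate (bounded)
        feedback = critique(task, output)        # 2. Reflect/Critique
        if feedback["score"] >= threshold:       # Good enough? -> exit loop
            break
        output = revise(task, output, feedback)  # 3. Revise
    return output                                # 5. Terminate
```

A real implementation would make these async LLM calls; the control flow is the same.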

Variants:

  • Self-Reflection: Same agent critiques its own output
  • Dual-Agent: Separate generator and evaluator agents
  • Multi-Perspective: Multiple evaluators provide different critiques

Reference Implementations

Proposed PyWorkflow Implementation

from pyworkflow_agents import ReflectiveAgent, Agent
from pyworkflow_agents.providers import AnthropicProvider
from pyworkflow import workflow, step, get_context
# Event / EventType (used in Method 2 below) are assumed to be importable
# from pyworkflow as well; their definitions are sketched under "Event Types".

# Method 1: Dual-Agent Reflection
generator = Agent(
    name="coder",
    provider=AnthropicProvider(model="claude-sonnet-4-5-20250929"),
    instructions="You generate Python code solutions.",
    tools=[code_execution_tool],
)

evaluator = Agent(
    name="reviewer",
    provider=AnthropicProvider(model="claude-sonnet-4-5-20250929"),
    instructions="You review code for bugs, style, and correctness. Provide specific feedback.",
)

reflective_agent = ReflectiveAgent(
    generator=generator,
    evaluator=evaluator,
    max_reflections=3,
    quality_threshold=0.9,  # Stop if evaluator score >= 0.9
)

@workflow(durable=True)
async def reflective_workflow(task: str):
    """
    Execute reflection pattern with event-sourced reflection cycles.
    """
    result = await reflective_agent.run(task)
    return result

# Method 2: Manual Reflection Loop
@workflow(durable=True)
async def manual_reflection_workflow(task: str):
    """
    Explicit reflection loop using PyWorkflow primitives.
    """
    ctx = get_context()
    
    # Initial generation
    output = await generate_as_step(task)
    
    # Reflection loop
    for iteration in range(3):  # max_reflections=3
        # Critique
        critique = await evaluate_as_step(output, task)
        
        # Record reflection event
        await ctx.storage.record_event(Event(
            run_id=ctx.run_id,
            type=EventType.REFLECTION_ITERATION,
            data={
                "iteration": iteration,
                "output": output,
                "critique": critique,
                "quality_score": critique.get("score", 0)
            }
        ))
        
        # Check if good enough
        if critique.get("approved", False):
            return {
                "output": output,
                "iterations": iteration + 1,
                "final_quality": critique.get("score")
            }
        
        # Revise based on feedback
        output = await revise_as_step(output, critique, task)
    
    # Max iterations reached
    return {
        "output": output,
        "iterations": 3,
        "warning": "Max reflections reached, may not meet quality threshold"
    }

@step()
async def generate_as_step(task: str):
    """Generate initial output."""
    response = await generator.run(task)
    return response.content

@step()
async def evaluate_as_step(output: str, original_task: str):
    """Evaluate output quality and provide critique."""
    prompt = f"""
    Evaluate this output for the task: {original_task}
    
    Output: {output}
    
    Provide:
    1. Quality score (0.0-1.0)
    2. Specific issues found
    3. Actionable feedback for improvement
    4. Approval (true/false)
    """
    response = await evaluator.run(prompt)
    # Structured critique as a dict, e.g.:
    # {"score": 0.85, "issues": [...], "feedback": "...", "approved": False}
    return response.structured_output

@step()
async def revise_as_step(output: str, critique: dict, task: str):
    """Revise output based on critique."""
    prompt = f"""
    Original task: {task}
    Current output: {output}
    Feedback: {critique["feedback"]}
    Issues: {critique["issues"]}
    
    Revise the output to address all feedback and issues.
    """
    response = await generator.run(prompt)
    return response.content

# Method 3: Self-Reflection (same agent)
@workflow(durable=True)
async def self_reflection_workflow(task: str):
    """
    Same agent reflects on its own output.
    """
    agent = Agent(
        name="self_reflective_agent",
        provider=AnthropicProvider(model="claude-opus-4-6"),
        instructions="You generate solutions and critically evaluate them."
    )
    
    output = await generate_with_self_reflection(agent, task)
    return output

@step()
async def generate_with_self_reflection(agent: Agent, task: str, max_iterations: int = 3):
    """
    Agent generates and self-critiques in iterations.
    """
    current_output = None
    
    for iteration in range(max_iterations):
        if iteration == 0:
            # Initial generation
            prompt = f"Task: {task}\n\nGenerate a solution."
        else:
            # Reflection prompt
            prompt = f"""
            Task: {task}
            Your previous output: {current_output}
            
            Reflect on your output:
            1. What are the weaknesses?
            2. How can you improve it?
            3. Generate an improved version.
            """
        
        response = await agent.run(prompt)
        current_output = response.content
        
        # Self-evaluation
        eval_prompt = f"Rate the quality of this output (0-10): {current_output}"
        eval_response = await agent.run(eval_prompt)
        quality_score = extract_score(eval_response.content)
        
        if quality_score >= 9:
            return current_output
    
    return current_output
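The `extract_score` helper used above is left undefined; a minimal sketch, assuming the evaluator's reply contains a bare number such as "8/10":

```python
import re

def extract_score(text: str) -> float:
    """Pull the first number out of an evaluator reply like 'Quality: 8/10'.

    Returns 0.0 when no number is found, so a malformed reply simply
    fails the quality check instead of raising.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0
```

A structured-output evaluator (as in Method 2) avoids this parsing entirely.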

Key Mapping to PyWorkflow Primitives:

  • Reflection cycle = Workflow loop with event-sourced iterations
  • Generate step = @step for generator agent
  • Evaluate step = @step for evaluator agent
  • Revise step = @step for generator with feedback
  • Reflection history = REFLECTION_ITERATION events in event log
  • Max iterations = Loop counter (prevent infinite reflection)
  • Quality threshold = Conditional check to exit loop
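As an illustration, replaying the REFLECTION_ITERATION events from the log recovers the quality-score progression (a sketch over plain event dicts, not the proposed storage API):

```python
def score_progression(events: list[dict]) -> list[float]:
    """Quality score per reflection iteration, in log order."""
    return [
        event["data"]["quality_score"]
        for event in events
        if event["type"] == "reflection_iteration"
    ]
```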

Event Types

New events for reflection pattern:

from enum import Enum

class EventType(str, Enum):
    # Existing events...
    REFLECTION_START = "reflection_start"           # Start reflection process
    REFLECTION_ITERATION = "reflection_iteration"   # Single reflect-revise cycle
    REFLECTION_APPROVED = "reflection_approved"     # Output approved by evaluator
    REFLECTION_MAX_REACHED = "reflection_max_reached"  # Max iterations without approval
    REFLECTION_COMPLETE = "reflection_complete"     # Final output

Event Data Schema:

# REFLECTION_START
{
    "task": "Generate Python function for Fibonacci",
    "generator_agent": "coder",
    "evaluator_agent": "reviewer",
    "max_reflections": 3,
    "quality_threshold": 0.9
}

# REFLECTION_ITERATION
{
    "iteration": 1,
    "generator_output": "def fib(n): ...",
    "evaluator_critique": {
        "score": 0.7,
        "issues": ["Missing docstring", "No input validation"],
        "feedback": "Add type hints and handle edge cases"
    },
    "approved": false
}

# REFLECTION_APPROVED
{
    "iteration": 2,
    "final_output": "def fib(n: int) -> int: ...",
    "quality_score": 0.95,
    "total_reflections": 2
}

# REFLECTION_MAX_REACHED
{
    "max_iterations": 3,
    "final_quality_score": 0.85,
    "quality_threshold": 0.9,
    "warning": "Quality threshold not met"
}

Trade-offs

Pros

  • Accuracy: 78.6% → 97.1% improvement (research-backed)
  • Quality: Multiple revision passes catch mistakes
  • Self-correction: Agents fix their own errors without human intervention
  • Debuggability: Full reflection history visible in event log
  • Event replay: Reflection cycles reconstructed during recovery
  • Adaptability: Can tune quality threshold and max iterations per task

Cons

  • Latency: Multiple LLM calls per reflection cycle (2-4x slower)
  • Cost: 2-4x LLM invocations vs single-pass generation
  • Diminishing returns: Later iterations may not improve much
  • Infinite loops: Need max_iterations to prevent endless reflection
  • Critique quality: Evaluator must be good at identifying issues

When to Use

  • High-stakes tasks requiring accuracy (code generation, medical analysis)
  • Complex problem-solving where first attempts often have flaws
  • Tasks where quality > speed (research reports, legal documents)
  • Self-improving systems (agent learns from mistakes)

When to Avoid

  • Simple tasks where first output is usually correct
  • Latency-sensitive applications (real-time chat)
  • Cost-sensitive scenarios (multiple LLM calls expensive)
  • Tasks where reflection provides little value (data retrieval)

Performance Benchmarks

Based on LangGraph and research papers:

  • Code generation: 78.6% → 97.1% accuracy with reflection
  • Reasoning tasks: 14-19% improvement (HALO framework)
  • Latency: 2-4x slower (depending on max_reflections)
  • Cost: 2-4x more LLM calls

Advanced Patterns

1. Multi-Perspective Reflection

Use multiple evaluators for diverse critiques:

reflective_agent = ReflectiveAgent(
    generator=generator,
    evaluators=[
        Agent(name="style_reviewer", ...),
        Agent(name="correctness_reviewer", ...),
        Agent(name="performance_reviewer", ...),
    ],
    aggregation="consensus",  # or "weighted", "all_must_approve"
)
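The aggregation strategies might behave roughly as follows (a sketch; the strategy names mirror the snippet above, but the exact semantics are an assumption):

```python
def aggregate(critiques, strategy="consensus", threshold=0.9, weights=None):
    """Combine per-evaluator critiques into one approve/reject decision.

    Each critique is assumed to carry 'score' (0.0-1.0) and 'approved' (bool).
    """
    if strategy == "all_must_approve":
        return all(c["approved"] for c in critiques)
    if strategy == "weighted":
        weights = weights or [1.0] * len(critiques)
        mean = sum(w * c["score"] for w, c in zip(weights, critiques)) / sum(weights)
        return mean >= threshold
    # "consensus": a strict majority of evaluators approve
    return sum(c["approved"] for c in critiques) > len(critiques) / 2
```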

2. Reflexion (Learning from Failures)

Store past failures in memory for future reference:

# Record failures in vector DB
await memory.store_failure(task, output, critique)

# In future iterations, retrieve similar failures
similar_failures = await memory.retrieve_similar_failures(task)
prompt = f"Avoid these past mistakes: {similar_failures}\n\nTask: {task}"
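Absent a vector DB, the memory interface can be sketched with naive keyword overlap (a toy stand-in, synchronous for brevity; a real implementation would use embeddings):

```python
class FailureMemory:
    """Toy failure store: similarity is word overlap with the task text."""

    def __init__(self):
        self._failures = []  # list of (task, output, critique) triples

    def store_failure(self, task, output, critique):
        self._failures.append((task, output, critique))

    def retrieve_similar_failures(self, task, top_k=3):
        query = set(task.lower().split())
        ranked = sorted(
            self._failures,
            key=lambda f: len(query & set(f[0].lower().split())),
            reverse=True,
        )
        return ranked[:top_k]
```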

3. Self-Correcting RAG

Reflection for retrieval-augmented generation:

@workflow(durable=True)
async def self_correcting_rag(question: str):
    """RAG with reflection on retrieval quality."""
    # Generate
    docs = await retrieve_docs(question)
    answer = await generate_answer(question, docs)
    
    # Reflect
    for _ in range(3):
        evaluation = await evaluate_answer(question, answer, docs)  # avoid shadowing builtin eval
        if evaluation["grade"] == "correct":
            return answer
        
        # Revise retrieval or generation
        if evaluation["issue"] == "missing_info":
            docs = await retrieve_additional_docs(question, evaluation["feedback"])
        
        answer = await generate_answer(question, docs, feedback=evaluation["feedback"])
    
    return answer

Implementation Checklist

  • Create ReflectiveAgent class in pyworkflow_agents/reflection.py
  • Implement dual-agent reflection (generator + evaluator)
  • Implement self-reflection (same agent)
  • Add REFLECTION_* event types
  • Add max_reflections limit (default: 3)
  • Add quality_threshold parameter (default: 0.9)
  • Implement multi-perspective reflection (multiple evaluators)
  • Add aggregation strategies for multi-evaluator (consensus, weighted)
  • Create generate_as_step(), evaluate_as_step(), revise_as_step() helpers
  • Add reflection history to event log
  • Implement event replay for reflection cycles
  • Create examples in examples/agents/reflection_pattern.py
  • Add tests for all reflection scenarios
  • Document performance benchmarks (accuracy vs latency/cost)
  • Add metrics: reflection count, quality score progression, approval rate
  • Add visualization of reflection cycles over time


Labels: agents (AI Agent module, pyworkflow_agents), feature (Feature to be implemented), multi-agent (Multi-agent orchestration patterns)
