Overview
The Reflection (Self-Correction) pattern is a generate-critique-refine cycle: an agent creates an initial response, reflects on its quality through self-critique or an external evaluator, and iteratively improves the output. The result is substantially higher accuracy and quality, at the cost of extra LLM calls.
Performance Impact: Research shows reflection can improve accuracy from 78.6% to 97.1% on complex tasks by enabling agents to catch and correct their own mistakes.
How It Works
- Generate: Generator agent produces an initial output
- Reflect/Critique: Evaluator agent (or same agent in reflection mode) critiques the output
- Revise: Generator incorporates feedback to improve output
- Iterate: Repeat reflection cycle until quality threshold met or max iterations reached
- Terminate: Return final refined output
Control Flow:
Task → Generator Agent
           ↓
     Initial Output
           ↓
     Evaluator Agent
           ↓
    Critique/Feedback
           ↓
  ┌── Good enough? ──┐
  ↓ No               ↓ Yes
Generator          Final
(revise)           Output
  ↓
Improved Output
  ↓
(loop back to Evaluator)
Variants:
- Self-Reflection: Same agent critiques its own output
- Dual-Agent: Separate generator and evaluator agents
- Multi-Perspective: Multiple evaluators provide different critiques
Reference Implementations
- LangGraph Reflection Tutorial - Official LangGraph implementation
- Reflexion Pattern - Learning through verbal feedback
- LangGraph Reflection Blog - Deep dive on reflection agents
- Building Self-Correcting AI - Reflexion agent deep dive
- Self-Reflective RAG - Agentic RAG with LangGraph
- Reflection Pattern Documentation - Agent patterns reference
- Reflection Agentic Design Pattern - Design pattern series
- LangGraph Self-Correcting RAG - Code generation example
Proposed PyWorkflow Implementation
from pyworkflow_agents import ReflectiveAgent, Agent
from pyworkflow_agents.providers import AnthropicProvider
from pyworkflow import workflow, step, get_context
# Method 1: Dual-Agent Reflection
generator = Agent(
    name="coder",
    provider=AnthropicProvider(model="claude-sonnet-4-5-20250929"),
    instructions="You generate Python code solutions.",
    tools=[code_execution_tool],
)

evaluator = Agent(
    name="reviewer",
    provider=AnthropicProvider(model="claude-sonnet-4-5-20250929"),
    instructions="You review code for bugs, style, and correctness. Provide specific feedback.",
)

reflective_agent = ReflectiveAgent(
    generator=generator,
    evaluator=evaluator,
    max_reflections=3,
    quality_threshold=0.9,  # Stop if evaluator score >= 0.9
)
@workflow(durable=True)
async def reflective_workflow(task: str):
    """
    Execute the reflection pattern with event-sourced reflection cycles.
    """
    result = await reflective_agent.run(task)
    return result
# Method 2: Manual Reflection Loop
@workflow(durable=True)
async def manual_reflection_workflow(task: str):
    """
    Explicit reflection loop using PyWorkflow primitives.
    """
    ctx = get_context()

    # Initial generation
    output = await generate_as_step(task)

    # Reflection loop
    for iteration in range(3):  # max_reflections=3
        # Critique
        critique = await evaluate_as_step(output, task)

        # Record reflection event
        await ctx.storage.record_event(Event(
            run_id=ctx.run_id,
            type=EventType.REFLECTION_ITERATION,
            data={
                "iteration": iteration,
                "output": output,
                "critique": critique,
                "quality_score": critique.get("score", 0),
            },
        ))

        # Check if good enough
        if critique.get("approved", False):
            return {
                "output": output,
                "iterations": iteration + 1,
                "final_quality": critique.get("score"),
            }

        # Revise based on feedback
        output = await revise_as_step(output, critique, task)

    # Max iterations reached
    return {
        "output": output,
        "iterations": 3,
        "warning": "Max reflections reached, may not meet quality threshold",
    }
@step()
async def generate_as_step(task: str):
    """Generate initial output."""
    response = await generator.run(task)
    return response.content

@step()
async def evaluate_as_step(output: str, original_task: str):
    """Evaluate output quality and provide critique."""
    prompt = f"""
    Evaluate this output for the task: {original_task}
    Output: {output}

    Provide:
    1. Quality score (0.0-1.0)
    2. Specific issues found
    3. Actionable feedback for improvement
    4. Approval (true/false)
    """
    response = await evaluator.run(prompt)
    # Pydantic model, e.g. {"score": 0.85, "issues": [...], "feedback": "...", "approved": False}
    return response.structured_output

@step()
async def revise_as_step(output: str, critique: dict, task: str):
    """Revise output based on critique."""
    prompt = f"""
    Original task: {task}
    Current output: {output}
    Feedback: {critique["feedback"]}
    Issues: {critique["issues"]}

    Revise the output to address all feedback and issues.
    """
    response = await generator.run(prompt)
    return response.content
# Method 3: Self-Reflection (same agent)
@workflow(durable=True)
async def self_reflection_workflow(task: str):
    """
    Same agent reflects on its own output.
    """
    agent = Agent(
        name="self_reflective_agent",
        provider=AnthropicProvider(model="claude-opus-4-6"),
        instructions="You generate solutions and critically evaluate them.",
    )
    output = await generate_with_self_reflection(agent, task)
    return output

@step()
async def generate_with_self_reflection(agent: Agent, task: str, max_iterations: int = 3):
    """
    Agent generates and self-critiques in iterations.
    """
    current_output = None
    for iteration in range(max_iterations):
        if iteration == 0:
            # Initial generation
            prompt = f"Task: {task}\n\nGenerate a solution."
        else:
            # Reflection prompt
            prompt = f"""
            Task: {task}
            Your previous output: {current_output}

            Reflect on your output:
            1. What are the weaknesses?
            2. How can you improve it?
            3. Generate an improved version.
            """
        response = await agent.run(prompt)
        current_output = response.content

        # Self-evaluation
        eval_prompt = f"Rate the quality of this output (0-10): {current_output}"
        eval_response = await agent.run(eval_prompt)
        quality_score = extract_score(eval_response.content)
        if quality_score >= 9:
            return current_output
    return current_output

Key Mapping to PyWorkflow Primitives:
- Reflection cycle = workflow loop with event-sourced iterations
- Generate step = @step for the generator agent
- Evaluate step = @step for the evaluator agent
- Revise step = @step for the generator with feedback
- Reflection history = REFLECTION_ITERATION events in the event log
- Max iterations = loop counter (prevents infinite reflection)
- Quality threshold = conditional check to exit the loop
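The self-reflection example above calls an extract_score() helper that is never defined. One possible sketch, assuming the model replies with a number somewhere in free text (a more robust approach is to request structured output instead):

```python
import re

def extract_score(text: str, default: float = 0.0) -> float:
    """Best-effort parse of a free-text rating like 'I'd rate this 8/10'.

    Returns the first number found in the text, or `default` if none.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else default
```

This is an illustrative helper, not part of the proposed API surface.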
Event Types
New events for reflection pattern:
class EventType(str, Enum):
    # Existing events...
    REFLECTION_START = "reflection_start"              # Start reflection process
    REFLECTION_ITERATION = "reflection_iteration"      # Single reflect-revise cycle
    REFLECTION_APPROVED = "reflection_approved"        # Output approved by evaluator
    REFLECTION_MAX_REACHED = "reflection_max_reached"  # Max iterations without approval
    REFLECTION_COMPLETE = "reflection_complete"        # Final output

Event Data Schema:
# REFLECTION_START
{
    "task": "Generate Python function for Fibonacci",
    "generator_agent": "coder",
    "evaluator_agent": "reviewer",
    "max_reflections": 3,
    "quality_threshold": 0.9
}

# REFLECTION_ITERATION
{
    "iteration": 1,
    "generator_output": "def fib(n): ...",
    "evaluator_critique": {
        "score": 0.7,
        "issues": ["Missing docstring", "No input validation"],
        "feedback": "Add type hints and handle edge cases"
    },
    "approved": false
}

# REFLECTION_APPROVED
{
    "iteration": 2,
    "final_output": "def fib(n: int) -> int: ...",
    "quality_score": 0.95,
    "total_reflections": 2
}

# REFLECTION_MAX_REACHED
{
    "max_iterations": 3,
    "final_quality_score": 0.85,
    "quality_threshold": 0.9,
    "warning": "Quality threshold not met"
}

Trade-offs
Pros
- Accuracy: 78.6% → 97.1% improvement (research-backed)
- Quality: Multiple revision passes catch mistakes
- Self-correction: Agents fix their own errors without human intervention
- Debuggability: Full reflection history visible in event log
- Event replay: Reflection cycles reconstructed during recovery
- Adaptability: Can tune quality threshold and max iterations per task
Cons
- Latency: Multiple LLM calls per reflection cycle (2-4x slower)
- Cost: 2-4x LLM invocations vs single-pass generation
- Diminishing returns: Later iterations may not improve much
- Infinite loops: Need max_iterations to prevent endless reflection
- Critique quality: Evaluator must be good at identifying issues
When to Use
- High-stakes tasks requiring accuracy (code generation, medical analysis)
- Complex problem-solving where first attempts often have flaws
- Tasks where quality > speed (research reports, legal documents)
- Self-improving systems (agent learns from mistakes)
When to Avoid
- Simple tasks where first output is usually correct
- Latency-sensitive applications (real-time chat)
- Cost-sensitive scenarios (multiple LLM calls expensive)
- Tasks where reflection provides little value (data retrieval)
Performance Benchmarks
Based on LangGraph and research papers:
- Code generation: 78.6% → 97.1% accuracy with reflection
- Reasoning tasks: 14-19% improvement (HALO framework)
- Latency: 2-4x slower (depending on max_reflections)
- Cost: 2-4x more LLM calls
Advanced Patterns
1. Multi-Perspective Reflection
Use multiple evaluators for diverse critiques:
reflective_agent = ReflectiveAgent(
    generator=generator,
    evaluators=[
        Agent(name="style_reviewer", ...),
        Agent(name="correctness_reviewer", ...),
        Agent(name="performance_reviewer", ...),
    ],
    aggregation="consensus",  # or "weighted", "all_must_approve"
)

2. Reflexion (Learning from Failures)
Store past failures in memory for future reference:
# Record failures in a vector DB
await memory.store_failure(task, output, critique)

# In future iterations, retrieve similar failures
similar_failures = await memory.retrieve_similar_failures(task)
prompt = f"Avoid these past mistakes: {similar_failures}\n\nTask: {task}"

3. Self-Correcting RAG
Reflection for retrieval-augmented generation:
@workflow(durable=True)
async def self_correcting_rag(question: str):
    """RAG with reflection on retrieval quality."""
    # Initial retrieval and generation
    docs = await retrieve_docs(question)
    answer = await generate_answer(question, docs)

    # Reflection loop
    for i in range(3):
        evaluation = await evaluate_answer(question, answer, docs)
        if evaluation["grade"] == "correct":
            return answer
        # Revise retrieval or generation based on the critique
        if evaluation["issue"] == "missing_info":
            docs = await retrieve_additional_docs(question, evaluation["feedback"])
        answer = await generate_answer(question, docs, feedback=evaluation["feedback"])
    return answer

Implementation Checklist
- Create ReflectiveAgent class in pyworkflow_agents/reflection.py
- Implement dual-agent reflection (generator + evaluator)
- Implement self-reflection (same agent)
- Add REFLECTION_* event types
- Add max_reflections limit (default: 3)
- Add quality_threshold parameter (default: 0.9)
- Implement multi-perspective reflection (multiple evaluators)
- Add aggregation strategies for multi-evaluator (consensus, weighted)
- Create generate_as_step(), evaluate_as_step(), revise_as_step() helpers
- Add reflection history to event log
- Implement event replay for reflection cycles
- Create examples in examples/agents/reflection_pattern.py
- Add tests for all reflection scenarios
- Document performance benchmarks (accuracy vs latency/cost)
- Add metrics: reflection count, quality score progression, approval rate
- Add visualization of reflection cycles over time
Related Issues
- Agent Pattern: Supervisor (Manager + Workers) #154 - Supervisor Agent - Can use reflection as a step quality check
- Agent Pattern: Parallel Agent (Scatter-Gather) #168 - Parallel Agent - Can use parallel evaluators for multi-perspective reflection
- Agent Pattern: Collaborative Agent (Shared Scratchpad) #164 - Collaborative Agent - Can store reflection history in scratchpad