Overview
The Reflection (Self-Correction) pattern is a generate-critique-refine cycle: an agent creates an initial response, reflects on its quality through self-critique or an external evaluator, and iteratively improves the output. The result is substantially higher accuracy and quality, at the cost of extra LLM calls.
Performance Impact: Research shows reflection can improve accuracy from 78.6% to 97.1% on complex tasks by enabling agents to catch and correct their own mistakes.
How It Works
- Generate: Generator agent produces an initial output
- Reflect/Critique: Evaluator agent (or same agent in reflection mode) critiques the output
- Revise: Generator incorporates feedback to improve output
- Iterate: Repeat reflection cycle until quality threshold met or max iterations reached
- Terminate: Return final refined output
Control Flow:
Task → Generator Agent
           ↓
     Initial Output
           ↓
     Evaluator Agent
           ↓
    Critique/Feedback
           ↓
  ┌── Good enough? ──┐
  ↓ No               ↓ Yes
Generator          Final
(revise)           Output
  ↓
Improved Output
  ↓
(loop back to Evaluator)
Variants:
- Self-Reflection: Same agent critiques its own output
- Dual-Agent: Separate generator and evaluator agents
- Multi-Perspective: Multiple evaluators provide different critiques
Reference Implementations
- LangGraph Reflection Tutorial - Official LangGraph implementation
- Reflexion Pattern - Learning through verbal feedback
- LangGraph Reflection Blog - Deep dive on reflection agents
- Building Self-Correcting AI - Reflexion agent deep dive
- Self-Reflective RAG - Agentic RAG with LangGraph
- Reflection Pattern Documentation - Agent patterns reference
- Reflection Agentic Design Pattern - Design pattern series
- LangGraph Self-Correcting RAG - Code generation example
Proposed PyWorkflow Implementation
from pyworkflow_agents import ReflectiveAgent, Agent
from pyworkflow_agents.providers import AnthropicProvider
from pyworkflow import workflow, step, get_context
# Method 1: Dual-Agent Reflection
generator = Agent(
    name="coder",
    provider=AnthropicProvider(model="claude-sonnet-4-5-20250929"),
    instructions="You generate Python code solutions.",
    tools=[code_execution_tool],
)

evaluator = Agent(
    name="reviewer",
    provider=AnthropicProvider(model="claude-sonnet-4-5-20250929"),
    instructions="You review code for bugs, style, and correctness. Provide specific feedback.",
)

reflective_agent = ReflectiveAgent(
    generator=generator,
    evaluator=evaluator,
    max_reflections=3,
    quality_threshold=0.9,  # Stop if evaluator score >= 0.9
)
@workflow(durable=True)
async def reflective_workflow(task: str):
    """
    Execute the reflection pattern with event-sourced reflection cycles.
    """
    result = await reflective_agent.run(task)
    return result
# Method 2: Manual Reflection Loop
@workflow(durable=True)
async def manual_reflection_workflow(task: str):
    """
    Explicit reflection loop using PyWorkflow primitives.
    """
    ctx = get_context()

    # Initial generation
    output = await generate_as_step(task)

    # Reflection loop
    for iteration in range(3):  # max_reflections=3
        # Critique
        critique = await evaluate_as_step(output, task)

        # Record reflection event
        await ctx.storage.record_event(Event(
            run_id=ctx.run_id,
            type=EventType.REFLECTION_ITERATION,
            data={
                "iteration": iteration,
                "output": output,
                "critique": critique,
                "quality_score": critique.get("score", 0),
            },
        ))

        # Check if good enough
        if critique.get("approved", False):
            return {
                "output": output,
                "iterations": iteration + 1,
                "final_quality": critique.get("score"),
            }

        # Revise based on feedback
        output = await revise_as_step(output, critique, task)

    # Max iterations reached
    return {
        "output": output,
        "iterations": 3,
        "warning": "Max reflections reached, may not meet quality threshold",
    }
@step()
async def generate_as_step(task: str):
    """Generate initial output."""
    response = await generator.run(task)
    return response.content

@step()
async def evaluate_as_step(output: str, original_task: str):
    """Evaluate output quality and provide critique."""
    prompt = f"""
    Evaluate this output for the task: {original_task}
    Output: {output}

    Provide:
    1. Quality score (0.0-1.0)
    2. Specific issues found
    3. Actionable feedback for improvement
    4. Approval (true/false)
    """
    response = await evaluator.run(prompt)
    # Pydantic model, e.g. {"score": 0.85, "issues": [...], "feedback": "...", "approved": False}
    return response.structured_output

@step()
async def revise_as_step(output: str, critique: dict, task: str):
    """Revise output based on critique."""
    prompt = f"""
    Original task: {task}
    Current output: {output}
    Feedback: {critique["feedback"]}
    Issues: {critique["issues"]}

    Revise the output to address all feedback and issues.
    """
    response = await generator.run(prompt)
    return response.content
# Method 3: Self-Reflection (same agent)
@workflow(durable=True)
async def self_reflection_workflow(task: str):
    """
    Same agent reflects on its own output.
    """
    agent = Agent(
        name="self_reflective_agent",
        provider=AnthropicProvider(model="claude-opus-4-6"),
        instructions="You generate solutions and critically evaluate them.",
    )
    output = await generate_with_self_reflection(agent, task)
    return output

@step()
async def generate_with_self_reflection(agent: Agent, task: str, max_iterations: int = 3):
    """
    Agent generates and self-critiques in iterations.
    """
    current_output = None
    for iteration in range(max_iterations):
        if iteration == 0:
            # Initial generation
            prompt = f"Task: {task}\n\nGenerate a solution."
        else:
            # Reflection prompt
            prompt = f"""
            Task: {task}
            Your previous output: {current_output}

            Reflect on your output:
            1. What are the weaknesses?
            2. How can you improve it?
            3. Generate an improved version.
            """
        response = await agent.run(prompt)
        current_output = response.content

        # Self-evaluation
        eval_prompt = f"Rate the quality of this output (0-10): {current_output}"
        eval_response = await agent.run(eval_prompt)
        quality_score = extract_score(eval_response.content)
        if quality_score >= 9:
            return current_output
    return current_output

Key Mapping to PyWorkflow Primitives:
- Reflection cycle = workflow loop with event-sourced iterations
- Generate step = @step for the generator agent
- Evaluate step = @step for the evaluator agent
- Revise step = @step for the generator with feedback
- Reflection history = REFLECTION_ITERATION events in the event log
- Max iterations = loop counter (prevents infinite reflection)
- Quality threshold = conditional check to exit the loop
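The self-reflection example above calls an extract_score() helper that is never defined. One possible sketch, assuming the model replies with a number somewhere in free text (a more robust approach is to request structured output instead):

```python
import re

def extract_score(text: str, default: float = 0.0) -> float:
    """Best-effort parse of a free-text rating like 'I'd rate this 8/10'.

    Returns the first number found in the text, or `default` if none.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else default
```

This is an illustrative helper, not part of the proposed API surface.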
Event Types
New events for reflection pattern:
class EventType(str, Enum):
    # Existing events...
    REFLECTION_START = "reflection_start"              # Start reflection process
    REFLECTION_ITERATION = "reflection_iteration"      # Single reflect-revise cycle
    REFLECTION_APPROVED = "reflection_approved"        # Output approved by evaluator
    REFLECTION_MAX_REACHED = "reflection_max_reached"  # Max iterations without approval
    REFLECTION_COMPLETE = "reflection_complete"        # Final output

Event Data Schema:
# REFLECTION_START
{
    "task": "Generate Python function for Fibonacci",
    "generator_agent": "coder",
    "evaluator_agent": "reviewer",
    "max_reflections": 3,
    "quality_threshold": 0.9
}

# REFLECTION_ITERATION
{
    "iteration": 1,
    "generator_output": "def fib(n): ...",
    "evaluator_critique": {
        "score": 0.7,
        "issues": ["Missing docstring", "No input validation"],
        "feedback": "Add type hints and handle edge cases"
    },
    "approved": false
}

# REFLECTION_APPROVED
{
    "iteration": 2,
    "final_output": "def fib(n: int) -> int: ...",
    "quality_score": 0.95,
    "total_reflections": 2
}

# REFLECTION_MAX_REACHED
{
    "max_iterations": 3,
    "final_quality_score": 0.85,
    "quality_threshold": 0.9,
    "warning": "Quality threshold not met"
}

Trade-offs
Pros
- Accuracy: 78.6% → 97.1% improvement (research-backed)
- Quality: Multiple revision passes catch mistakes
- Self-correction: Agents fix their own errors without human intervention
- Debuggability: Full reflection history visible in event log
- Event replay: Reflection cycles reconstructed during recovery
- Adaptability: Can tune quality threshold and max iterations per task
Cons
- Latency: Multiple LLM calls per reflection cycle (2-4x slower)
- Cost: 2-4x LLM invocations vs single-pass generation
- Diminishing returns: Later iterations may not improve much
- Infinite loops: Need max_iterations to prevent endless reflection
- Critique quality: Evaluator must be good at identifying issues
When to Use
- High-stakes tasks requiring accuracy (code generation, medical analysis)
- Complex problem-solving where first attempts often have flaws
- Tasks where quality > speed (research reports, legal documents)
- Self-improving systems (agent learns from mistakes)
When to Avoid
- Simple tasks where first output is usually correct
- Latency-sensitive applications (real-time chat)
- Cost-sensitive scenarios (multiple LLM calls expensive)
- Tasks where reflection provides little value (data retrieval)
Performance Benchmarks
Based on LangGraph and research papers:
- Code generation: 78.6% → 97.1% accuracy with reflection
- Reasoning tasks: 14-19% improvement (HALO framework)
- Latency: 2-4x slower (depending on max_reflections)
- Cost: 2-4x more LLM calls
Advanced Patterns
1. Multi-Perspective Reflection
Use multiple evaluators for diverse critiques:
reflective_agent = ReflectiveAgent(
    generator=generator,
    evaluators=[
        Agent(name="style_reviewer", ...),
        Agent(name="correctness_reviewer", ...),
        Agent(name="performance_reviewer", ...),
    ],
    aggregation="consensus",  # or "weighted", "all_must_approve"
)

2. Reflexion (Learning from Failures)
Store past failures in memory for future reference:
# Record failures in a vector DB
await memory.store_failure(task, output, critique)

# In future iterations, retrieve similar failures
similar_failures = await memory.retrieve_similar_failures(task)
prompt = f"Avoid these past mistakes: {similar_failures}\n\nTask: {task}"

3. Self-Correcting RAG
Reflection for retrieval-augmented generation:
@workflow(durable=True)
async def self_correcting_rag(question: str):
    """RAG with reflection on retrieval quality."""
    # Initial retrieval and generation
    docs = await retrieve_docs(question)
    answer = await generate_answer(question, docs)

    # Reflection loop
    for i in range(3):
        evaluation = await evaluate_answer(question, answer, docs)
        if evaluation["grade"] == "correct":
            return answer
        # Revise retrieval or generation based on the critique
        if evaluation["issue"] == "missing_info":
            docs = await retrieve_additional_docs(question, evaluation["feedback"])
        answer = await generate_answer(question, docs, feedback=evaluation["feedback"])
    return answer

Implementation Checklist
- Create ReflectiveAgent class in pyworkflow_agents/reflection.py
- Implement dual-agent reflection (generator + evaluator)
- Implement self-reflection (same agent)
- Add REFLECTION_* event types
- Add max_reflections limit (default: 3)
- Add quality_threshold parameter (default: 0.9)
- Implement multi-perspective reflection (multiple evaluators)
- Add aggregation strategies for multi-evaluator (consensus, weighted)
- Create generate_as_step(), evaluate_as_step(), revise_as_step() helpers
- Add reflection history to event log
- Implement event replay for reflection cycles
- Create examples in examples/agents/reflection_pattern.py
- Add tests for all reflection scenarios
- Document performance benchmarks (accuracy vs latency/cost)
- Add metrics: reflection count, quality score progression, approval rate
- Add visualization of reflection cycles over time
Related Issues
- Agent Pattern: Supervisor (Manager + Workers) #154 - Supervisor Agent - Can use reflection as a step quality check
- Agent Pattern: Parallel Agent (Scatter-Gather) #168 - Parallel Agent - Can use parallel evaluators for multi-perspective reflection
- Agent Pattern: Collaborative Agent (Shared Scratchpad) #164 - Collaborative Agent - Can store reflection history in scratchpad