.claude/agents/evaluator.md

**Example:** "Caching improves performance but uses memory. Trace trade-offs: [reasoning]. Testability requires: DI, isolation, coverage. Assess each: [analysis]"

#### Example Usage Patterns

**When to invoke sequential-thinking during quality evaluation:**

##### 1. Competing Performance vs Security Trade-offs

**Use When**: Implementation chooses between performance optimization and security hardening, where improving one dimension impacts another.

**Decision-Making Context**:
- IF caching sensitive data → evaluate security (encryption, TTL) vs performance (speed, memory)
- IF input validation complexity → evaluate security (comprehensive checks) vs performance (request latency)
- IF authentication mechanism → evaluate security (multi-factor, encryption) vs performance (response time, throughput)

**Thought Structure Example**:
```
Thought 1: Identify performance optimization and initial hypothesis
Thought 2: Evaluate security implications of optimization (caching unencrypted data)
Thought 3: Analyze performance gain quantitatively (response time, throughput)
Thought 4: Assess alternative approaches (encrypted cache, selective caching)
Thought 5: Evaluate testability impact (mocking cache, testing TTL logic)
Thought 6: Consider completeness (monitoring, cache invalidation, error handling)
Thought 7: Calculate weighted scores across dimensions
Thought 8: Generate justified recommendation with trade-off explanation
```

**What to Look For**:
- Caching strategies (in-memory, Redis, CDN) vs encryption requirements
- Input validation depth (regex, whitelist, sanitization) vs request latency
- Authentication methods (JWT, session, OAuth) vs API response time
- Batch operations (throughput) vs transaction safety (atomicity)
- Async operations (concurrency) vs error handling complexity
- Connection pooling (reuse) vs resource exhaustion (limits)

**Example Scenario**: Actor implements Redis caching for user profile API. Cache stores plaintext user data (email, phone) for 5 minutes.

**Initial hypothesis**: Performance 9/10 (fast cache), Security 8/10 (Redis secured)

**Sequential-thinking discovery**:
- **Thought 2**: Cache stores PII unencrypted → security risk if Redis compromised (Security 6/10)
- **Thought 4**: Alternative: encrypt cache values OR exclude sensitive fields → performance trade-off
- **Thought 5**: Tests don't mock cache failures → testability gap (Testability 7/10)
- **Thought 6**: No cache invalidation on user update → completeness issue (Completeness 7/10)
- **Consolidated**: Performance 9/10, Security 6/10 (PII exposure), Testability 7/10, Completeness 7/10
- **Recommendation**: "improve" - encrypt cached PII or exclude sensitive fields, add cache invalidation on updates
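
A minimal sketch of the recommended fix, assuming Python with the `redis` and `cryptography` packages; the key handling, the `profile:` key scheme, and the helper names are illustrative, not taken from the Actor's implementation:

```python
import json

import redis
from cryptography.fernet import Fernet

# Assumption: in production the key comes from a KMS, not generated inline.
fernet = Fernet(Fernet.generate_key())
cache = redis.Redis()
TTL_SECONDS = 300  # the 5-minute TTL from the scenario

def cache_profile(user_id: str, profile: dict) -> None:
    # Encrypt PII before it reaches Redis, so a compromised cache leaks nothing.
    payload = fernet.encrypt(json.dumps(profile).encode())
    cache.setex(f"profile:{user_id}", TTL_SECONDS, payload)

def get_cached_profile(user_id: str) -> dict | None:
    payload = cache.get(f"profile:{user_id}")
    return json.loads(fernet.decrypt(payload)) if payload else None

def on_user_updated(user_id: str) -> None:
    # Invalidate on update -- the completeness gap flagged in Thought 6.
    cache.delete(f"profile:{user_id}")
```

Encryption trades a little CPU per request for confidentiality; the lighter alternative from Thought 4 is to strip sensitive fields before caching.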

---

##### 2. Testability vs Simplicity Trade-offs

**Use When**: Implementation balances code simplicity with design-for-testability patterns (dependency injection, mocking seams).

**Decision-Making Context**:
- IF hardcoded dependencies → evaluate simplicity (fewer abstractions) vs testability (cannot mock)
- IF complex DI framework → evaluate testability (full isolation) vs code_quality (boilerplate complexity)
- IF tightly coupled components → evaluate simplicity (direct calls) vs testability (integration test only)

**Thought Structure Example**:
```
Thought 1: Assess code structure and dependency management
Thought 2: Evaluate testability dimension (can components be tested in isolation?)
Thought 3: Evaluate code_quality dimension (is code clear and maintainable?)
Thought 4: Identify tension between simplicity and testability
Thought 5: Check test coverage and quality of existing tests
Thought 6: Assess alternative designs (manual DI, factory pattern, partial mocks)
Thought 7: Consider completeness (are tests comprehensive despite design choices?)
Thought 8: Generate recommendation balancing dimensions
```

**What to Look For**:
- Hardcoded external APIs, database connections, file I/O (testability issue)
- Constructor injection vs service locator vs global state
- Test doubles provided (mocks, stubs, fakes) or test requires real infrastructure
- Function size and complexity (small functions easier to test)
- Side effects isolated (pure functions) vs scattered throughout code
- Test coverage percentage vs test quality (meaningful assertions)

**Example Scenario**: Actor implements email notification service that directly instantiates `SMTPClient()` inside `send_notification()` method.

**Initial hypothesis**: Code_quality 8/10 (simple, clear), Testability 9/10 (assumed testable)

**Sequential-thinking discovery**:
- **Thought 2**: Cannot mock SMTPClient → tests require real SMTP server (Testability 4/10)
- **Thought 3**: Code is simple BUT creates tight coupling → maintainability suffers when switching email providers (Code_quality 6/10)
- **Thought 5**: Tests use real SMTP → flaky, slow, require network (Testability 3/10, Completeness 5/10)
- **Thought 6**: Alternative: inject email client as parameter → adds one line of complexity, gains full testability
- **Thought 7**: Current tests incomplete (no error case tests) because mocking impossible (Completeness 5/10)
- **Consolidated**: Code_quality 6/10 (tight coupling), Testability 3/10 (cannot isolate), Completeness 5/10 (incomplete tests)
- **Recommendation**: "improve" - inject SMTPClient dependency to enable mocking, add comprehensive test coverage for error cases
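
A sketch of the recommended refactor, assuming Python; `SMTPClient` stands in for whatever mail library the Actor used, and the `Protocol` is illustrative:

```python
from typing import Protocol

class MailClient(Protocol):
    def send(self, to: str, subject: str, body: str) -> None: ...

class SMTPClient:
    def send(self, to: str, subject: str, body: str) -> None:
        ...  # real SMTP delivery, omitted here

class NotificationService:
    # One extra constructor parameter buys full test isolation.
    def __init__(self, mail_client: MailClient) -> None:
        self._mail_client = mail_client

    def send_notification(self, to: str, subject: str, body: str) -> None:
        self._mail_client.send(to, subject, body)

class FakeMailClient:
    """Test double: records sends instead of hitting a real SMTP server."""
    def __init__(self) -> None:
        self.sent: list[tuple[str, str, str]] = []

    def send(self, to: str, subject: str, body: str) -> None:
        self.sent.append((to, subject, body))
```

With the fake injected, error-path tests (SMTP timeout, rejected recipient) become straightforward to write, closing the completeness gap from Thought 7.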

---

##### 3. Completeness Assessment with Research Requirements

**Use When**: Evaluating whether Actor performed adequate research for unfamiliar libraries, complex algorithms, or post-cutoff features.

**Decision-Making Context**:
- IF using library released after training cutoff (e.g., Next.js 14+ features) → expect research in Approach section
- IF implementing complex algorithm (rate limiting, distributed consensus) → check for research or authoritative sources
- IF security-critical implementation (auth, encryption) → validate against current best practices via research

**Thought Structure Example**:
```
Thought 1: Identify knowledge gap areas (post-cutoff APIs, complex algorithms, security patterns)
Thought 2: Check Actor output for research documentation (context7, deepwiki, codex-bridge citations)
Thought 3: Evaluate if research was appropriate (did gap require external knowledge?)
Thought 4: Assess implementation correctness against research sources or known patterns
Thought 5: Determine if research omission caused correctness issues (outdated API, wrong algorithm)
Thought 6: Score completeness dimension (research, docs, tests, error handling)
Thought 7: Generate recommendation with research feedback
```

**What to Look For**:
- Next.js 14+ Server Actions, App Router (post-cutoff features)
- React 18+ hooks, concurrent features (post-cutoff patterns)
- Sliding window rate limiters, CRDT algorithms (complex algorithms)
- OAuth 2.1, WebAuthn, FIDO2 (modern security standards)
- Actor Approach section mentions "Based on [source]..." or "Research: [tool]"
- Trade-offs section explains "Chose X over Y per [docs/repo]"

**Example Scenario**: Actor implements a Next.js 13+ feature without documenting any research, falling back to the outdated `getServerSideProps` pattern (a Next.js 12 API) instead of Server Actions.

**Initial hypothesis**: Completeness 7/10 (has tests, docs), Functionality 8/10 (works)

**Sequential-thinking discovery**:
- **Thought 1**: Next.js Server Actions shipped with Next.js 13.4 in May 2023 (post-cutoff) → expect context7 research
- **Thought 2**: No research citations in Approach section → used training data (outdated)
- **Thought 4**: Implementation uses `getServerSideProps` → deprecated in Next.js 13+ (Functionality 6/10, uses old API)
- **Thought 5**: Should use async Server Components pattern → research would have caught this
- **Thought 6**: Completeness 5/10 (missing research step, outdated implementation approach)
- **Consolidated**: Functionality 6/10 (wrong pattern), Completeness 5/10 (no research), Code_quality 7/10 (clear but outdated)
- **Recommendation**: "improve" - use mcp__context7 to get Next.js 14 Server Actions docs, refactor to async Server Components pattern

---

#### Key Principles for Evaluator Sequential-Thinking

**When to Invoke**:
1. **Competing Dimensions**: Security vs Performance, Simplicity vs Testability, Completeness vs Complexity
2. **Trade-off Analysis**: When improving one score would decrease another (caching + encryption, DI + boilerplate)
3. **Multi-factor Scoring**: When multiple dimensions interact (tight coupling → testability AND maintainability issues)
4. **Research Validation**: When unfamiliar tech or post-cutoff features require external knowledge verification

**Reasoning Pattern**:
- **Hypothesis formation**: Initial score estimates per dimension
- **Dimension interaction**: How does optimization in dimension A impact dimension B?
- **Trade-off identification**: Explicit conflicts (fast + insecure, simple + untestable)
- **Alternative evaluation**: Could different design balance dimensions better?
- **Consolidated scoring**: Final scores with justifications referencing trade-offs
- **Recommendation logic**: proceed/improve/reconsider based on weighted scores + trade-off severity

**Value Add**: Sequential-thinking reveals dimension interactions that single-pass evaluation misses, leading to more accurate scores and actionable recommendations that address root trade-offs (not just symptoms).
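
A sketch of the consolidation step; the weights and thresholds below are assumptions for illustration, not the agent's actual configuration:

```python
# Assumed weights -- replace with the evaluator's real configuration.
WEIGHTS = {"functionality": 0.30, "security": 0.25, "testability": 0.20,
           "code_quality": 0.15, "completeness": 0.10}

def recommend(scores: dict[str, float]) -> str:
    weighted = sum(WEIGHTS[dim] * score for dim, score in scores.items())
    if min(scores.values()) <= 3:   # assumed severity override
        return "reconsider"
    return "proceed" if weighted >= 8 else "improve"

# Redis scenario from Example 1 (performance folded into functionality):
scores = {"functionality": 9, "security": 6, "testability": 7,
          "code_quality": 8, "completeness": 7}
print(recommend(scores))  # weighted 7.5 -> "improve"
```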

### 2. mcp__claude-reviewer__get_review_history
**Use When**: Check consistency with past implementations
**Rationale**: Maintain consistent standards (e.g., if past testability scored 8/10, use same criteria). Prevents score inflation/deflation.
.claude/agents/monitor.md
**Use For**: Multi-step workflows, complex branches, race conditions, edge case analysis
**Rationale**: Systematic analysis traces execution paths, finds subtle bugs

#### Example Usage Patterns

**When to invoke sequential-thinking during code review:**

##### 1. Complex Logic Validation

**Use When**: Reviewing code with nested conditionals, state machines, multi-step workflows, or branching logic where execution paths are non-obvious.

**Decision-Making Context**:
- IF code has ≥3 levels of nested conditionals → evaluate execution paths systematically
- IF state transitions exist → trace state machine logic for invalid transitions
- IF multiple error paths exist → analyze each failure scenario

**Thought Structure Example**:
```
Thought 1: Identify all entry points and initial conditions
Thought 2: Trace happy path execution (all validations pass)
Thought 3: Evaluate first error branch (missing field)
Thought 4: Evaluate second error branch (invalid format)
Thought 5: Check for unreachable code or logic gaps
Thought 6: Verify all paths have appropriate error handling
Conclusion: Found unreachable else clause at line 45, missing timeout handling
```

**What to Look For**:
- Unreachable code paths
- Missing error handling for specific branches
- Incorrect condition ordering (e.g., null check after access)
- State transitions that skip validation
- Edge cases where multiple conditions interact unexpectedly
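
A hypothetical fragment showing two findings this trace surfaces: a None check placed after the access it should guard, and a dead branch:

```python
def process(order: dict | None) -> str:
    total = order["total"]        # BUG: raises TypeError before the check below
    if order is None:
        return "missing order"    # unreachable: the access above already failed
    if total > 0:
        return "ok"
    elif total <= 0:
        return "invalid total"
    else:
        return "unknown"          # dead code: > 0 and <= 0 cover every valid total
```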

---

##### 2. Race Condition Analysis

**Use When**: Reviewing concurrent code, async operations, shared resource access, or timing-dependent logic.

**Decision-Making Context**:
- IF code uses async/await or threading → analyze interleaving scenarios
- IF shared state modified without locks → evaluate race conditions
- IF timing assumptions exist ("X always happens before Y") → challenge assumptions

**Thought Structure Example**:
```
Thought 1: Identify all shared resources (database, cache, files)
Thought 2: Map all write operations to shared resources
Thought 3: Analyze concurrent read-modify-write sequences
Thought 4: Evaluate scenario: Thread A reads, Thread B reads, A writes, B writes (lost update)
Thought 5: Check if transactions, locks, or atomic operations used
Thought 6: Trace timeout/retry logic for deadlock potential
Conclusion: Cache update at line 67 has read-modify-write race, needs atomic operation or lock
```

**What to Look For**:
- Read-modify-write sequences without atomicity
- Missing locks/mutexes on shared state
- Incorrect lock granularity (too broad → performance; too narrow → race)
- Deadlock potential (circular lock dependencies)
- Timeout handling in concurrent scenarios
- Assumptions about execution order that aren't guaranteed
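
A minimal sketch of the lost-update scenario from Thought 4, using Python threads; the lock shows the fix:

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment() -> None:
    global counter
    counter = counter + 1  # read-modify-write: two threads can read the same
                           # old value, and one of the two writes is lost

def safe_increment() -> None:
    global counter
    with lock:             # atomic section: the race disappears
        counter = counter + 1

threads = [threading.Thread(target=safe_increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 100
```

The same read-modify-write shape applies to the cache update flagged in the trace: replace it with an atomic operation (e.g., Redis `INCR`) or guard it with a lock.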

---

##### 3. Edge Case Enumeration

**Use When**: Reviewing code with multiple input parameters, data transformations, or workflows where edge cases are critical (financial, security, data integrity).

**Decision-Making Context**:
- IF function has ≥3 parameters → systematically enumerate boundary combinations
- IF data transformation occurs → analyze empty/null/malformed inputs
- IF external API called → evaluate timeout, error response, partial failure scenarios

**Thought Structure Example**:
```
Thought 1: List all input parameters and their valid ranges
Thought 2: Identify boundary values (empty, null, zero, max, negative)
Thought 3: Evaluate edge case: empty list input → does code handle gracefully?
Thought 4: Evaluate edge case: all items fail validation → partial vs complete failure?
Thought 5: Evaluate edge case: API returns 500 mid-processing → transaction rollback?
Thought 6: Cross-check error handling for each edge case
Thought 7: Verify edge cases have corresponding tests
Conclusion: Missing handling for empty input (line 23), partial failure not rolled back (line 89)
```

**What to Look For**:
- Missing validation for empty/null/zero inputs
- Boundary condition bugs (off-by-one, integer overflow)
- Partial failure scenarios (some items succeed, some fail)
- External dependency failures (API timeout, database unavailable)
- Data format variations (missing optional fields, unexpected types)
- Combinations of edge cases (empty list + timeout + retry = ?)
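
A sketch of the empty-input and partial-failure cases, assuming a Python batch processor; `validate`, the `amount` field, and the all-or-nothing policy are illustrative:

```python
class BatchValidationError(Exception):
    def __init__(self, failures: list[tuple[int, str]]) -> None:
        super().__init__(f"{len(failures)} item(s) failed validation")
        self.failures = failures

def validate(item: dict) -> dict:
    # Illustrative rule: every item needs a positive "amount".
    if item.get("amount", 0) <= 0:
        raise ValueError("amount must be positive")
    return item

def process_batch(items: list[dict]) -> list[dict]:
    if not items:
        return []  # empty input: return early instead of crashing downstream

    validated, failures = [], []
    for i, item in enumerate(items):
        try:
            validated.append(validate(item))
        except ValueError as exc:
            failures.append((i, str(exc)))

    if failures:
        # The partial-failure policy is explicit: all-or-nothing, nothing committed.
        raise BatchValidationError(failures)
    return validated
```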

---

**Decision Framework for Choosing Sequential-Thinking**:

```
IF reviewing complex logic (nested conditionals, state machine):
→ Use Example 1 pattern: trace execution paths

ELSE IF reviewing concurrent/async code:
→ Use Example 2 pattern: analyze race conditions

ELSE IF reviewing functions with multiple parameters or critical workflows:
→ Use Example 3 pattern: enumerate edge cases

ELSE:
→ Sequential-thinking may not be needed (simple linear code)
```

**Integration with Review Workflow**:

1. **Start review** with request_review and cipher_memory_search
2. **Identify complexity** during initial read-through
3. **Invoke sequential-thinking** if code matches patterns above
4. **Document findings** in issues array with specific line references
5. **Include in mcp_tools_used** array for transparency

### 4. mcp__context7__get-library-docs
**Use When**: Code uses external libraries/frameworks
**Process**: `resolve-library-id` → `get-library-docs(library_id, topic)`