fix: Re-check PR badge when session becomes active#459
Conversation
a30850d to
a1e608e
Compare
…runtime validation Updated both worker charters and orchestrator routing to address gaps where multi-agent sessions failed but single-agent sessions succeeded: Implementer charter now requires: - Implementing EVERY requirement from the original prompt (completeness) - Launching runnable apps and verifying at runtime (not just build+test) - Performing any validation steps specified in the prompt Challenger charter now requires: - Cross-referencing original prompt requirements vs implementation - Runtime validation (launching the app, not just static review) - Performing the same validation steps the prompt specifies Orchestrator routing now requires: - Forwarding the COMPLETE original prompt to workers (no summarizing) - Always including full original requirements for completeness checks Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cb41220 to
a49c977
Compare
…Challenger Implementer now follows 4 steps: Plan → Implement → Validate → Self-review. Creates a requirements checklist before writing code and verifies every item before reporting completion. Challenger now follows 4 steps: Build checklist → Code review → Completeness check → Runtime validation. Extracts requirements into a numbered checklist and verifies each item individually, matching the approach from proven multi-agent orchestration patterns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 R1 Review — PR #459Reviewer: PP PR REVIEWER-worker-5 (3-model consensus)
|
| # | Finding | Severity | Models | File / Lines |
|---|---|---|---|---|
| 1 | Misleading PR title/description — diff contains prompt improvements, not PR badge changes. Creates merge confusion risk: reviewers may approve thinking they reviewed a badge fix. Also pollutes git log --grep history. |
🟡 MODERATE | 3/3 (Opus + Sonnet + Codex) | PR metadata |
| 2 | Runtime validation hard requirement may cause stalls — New prompts mandate "MUST launch it and verify it works at runtime" for both Implementer and Challenger. In headless CI or contexts where runtime is unavailable, this could cause repeated failed loops instead of completing valid code-only tasks. | 🟢 MINOR | 2/3 (Opus + Codex) | ModelCapabilities.cs lines ~354-357, ~389-392 |
Non-consensus observations (1/3, informational only)
- Sonnet noted that the new raw string literals introduce leading/trailing
\nin the prompt strings (cosmetic, models tolerate it) - Sonnet noted Challenger "build the checklist" instruction relies on session history retention across iterations (fragile assumption, but works with current persistent sessions)
- Codex noted increased token/context pressure from forwarding full original request + full worker output each iteration (up to 10 reflections)
What's clean ✅
- All 3 models confirm: no code bugs, regressions, security issues, data loss, or race conditions
[[GROUP_REFLECT_COMPLETE]]sentinel preserved — reflection loop won't breakWorkerSystemPromptsarray length unchanged (2) — existing tests passMaxReflectIterations,DefaultWorktreeStrategyunchanged- Prompt content is well-structured and the intent (planning + completeness checking + runtime validation) is sound
Test coverage
No new code paths requiring tests — this is a prompt-text-only change. Existing tests (ImplementAndChallenge_Preset_HasDistinctPersonas, WorktSystemPrompts_MatchWorkerCount) remain valid.
Recommended Action: ⚠️ Request Changes
- Fix the PR title and description to match the actual code (e.g., "improve: Structured planning & validation steps for Implement & Challenge preset"). This is the only blocker.
- Consider softening the "MUST launch" language to "SHOULD launch when runtime is available" to avoid stalls in headless contexts (minor, non-blocking).
The code changes themselves are safe to merge once the metadata is corrected.
…eview Implement & Challenge: - Implementer Step 2: Examine existing files before coding to match patterns - Challenger Step 4: Must cite exact commands and output as evidence PR Review Squad: - Zero tolerance for test failures — always request changes, even for pre-existing/flaky tests. Every PR should leave the suite greener. - Report ALL findings including minor nits. Every PR is an opportunity to improve the codebase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Multi-Model Code Review — PR #459 R2PR: fix: Re-check PR badge when session becomes active (title unchanged) R1 Findings Status
Consensus Findings1. 🟡 MODERATE — PR title/description still mismatches diff (R1 carryover)Flagged by: Opus, Sonnet, Codex (3/3) Title says "fix: Re-check PR badge when session becomes active." Diff only changes prompt text for the PR Reviewer and Implement & Challenge presets. No badge-related code. Misleads reviewers, pollutes git history, breaks Fix: Rename to something like 2. 🔴 CRITICAL — Section 4b directly contradicts SharedContext delivered to the same agentFile: New 4b instructs: "Report ALL findings regardless of severity — even minor nits, naming inconsistencies, missing docs" and "Do not dismiss anything as 'too minor to mention.'" Existing SharedContext (injected into every worker of this preset) instructs: "Only flag real issues: bugs, security holes, logic errors" and "NEVER comment on style, formatting, naming conventions, or documentation." These are directly contradictory — the agent receives both "report all nits" and "NEVER comment on naming/docs" in the same prompt. Behavior becomes model-dependent and unpredictable. This also undermines the 2/3 consensus filter that is the multi-reviewer workflow's core value proposition. Fix: Either (a) update SharedContext to match the new 4b philosophy, or (b) remove/soften 4b to align with the existing "real issues only" filter. The two sections must be consistent. 3. 🟡 MODERATE — "MUST launch" + Section 4a conflict with CI/headless contexts and consensus mechanismFile:
Non-Consensus (1/3, informational)
What's Clean ✅
Verdict:
|
…ional runtime - Update PR Review Squad SharedContext to flag ALL severities including minor nits (was 'NEVER comment on style' which contradicted 4b) - Soften 'MUST launch' to 'launch when runtime is available' for headless/CI contexts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Multi-Model Code Review — PR #459 R3PR: fix: Re-check PR badge when session becomes active (title unchanged) R2 Findings Status
Remaining Finding🟡 MODERATE — PR title still mismatches diff content (R1 → R2 → R3 unresolved)Flagged by: Opus, Sonnet, Codex (3/3) Title: Breaks What's Clean ✅
Verdict: ✅ Approve (with title fix requested)The two substantive R2 findings (SharedContext contradiction, headless stall) are properly resolved. The only remaining item is the misleading PR title — this is a metadata issue, not a code issue. The code changes themselves are safe and ready to merge. Recommendation: Fix the PR title before or at merge time (e.g., edit via GitHub UI or R3 re-review · consensus threshold: 2/3 models must agree |
When clicking a session in the sidebar or switching expanded views, invalidate the PrLinkService cache and re-fetch if no PR was previously found. This ensures PR badges appear promptly after creating a PR or switching branches, without aggressive polling.
Changes:
IsActivetransitions; on activation, invalidate cache + re-fetch if no PR cachedNo change to SessionCard (dashboard) — the 5-minute cache TTL handles that naturally.