fix: Re-check PR badge when session becomes active#459

Merged
PureWeen merged 4 commits into main from fix/pr-badge-refresh-on-activate
Apr 4, 2026

@PureWeen
Owner

When clicking a session in the sidebar or switching expanded views, invalidate the PrLinkService cache and re-fetch if no PR was previously found. This ensures PR badges appear promptly after creating a PR or switching branches, without aggressive polling.

Changes:

  • SessionListItem: Track IsActive transitions; on activation, invalidate cache + re-fetch if no PR cached
  • ExpandedSessionView: Re-check on session switch if no PR cached

No change to SessionCard (dashboard) — the 5-minute cache TTL handles that naturally.
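
The activation behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's C# code: `PrLinkCache`, `on_active_changed`, and the injected `fetch` callable are hypothetical names, and the 300-second default only mirrors the 5-minute TTL mentioned in the description.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PrLinkCache:
    """Hypothetical stand-in for the PrLinkService cache (not the real API)."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        ttl_seconds: float = 300.0,  # assumed: mirrors the 5-minute TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # session id -> (time stored, PR URL or None when no PR was found)
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, session_id: str) -> Optional[str]:
        """Return the cached PR URL, re-fetching once the entry has expired."""
        entry = self._entries.get(session_id)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]
        url = self._fetch(session_id)
        self._entries[session_id] = (self._clock(), url)
        return url

    def invalidate(self, session_id: str) -> None:
        """Drop the cached entry so the next get() performs a fresh fetch."""
        self._entries.pop(session_id, None)


def on_active_changed(
    cache: PrLinkCache,
    session_id: str,
    was_active: bool,
    is_active: bool,
    cached_pr_url: Optional[str],
) -> Optional[str]:
    """Re-check only on an inactive -> active transition with no PR known."""
    if is_active and not was_active and cached_pr_url is None:
        cache.invalidate(session_id)
        return cache.get(session_id)
    return cached_pr_url
```

Gating the re-fetch on the transition, rather than on every render, is what keeps this cheaper than polling: a session that already shows a badge, or that stays active, triggers no extra lookups.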

PureWeen force-pushed the fix/pr-badge-refresh-on-activate branch from a30850d to a1e608e on April 1, 2026 20:13
…runtime validation

Updated both worker charters and orchestrator routing to address gaps
where multi-agent sessions failed but single-agent sessions succeeded:

Implementer charter now requires:
- Implementing EVERY requirement from the original prompt (completeness)
- Launching runnable apps and verifying at runtime (not just build+test)
- Performing any validation steps specified in the prompt

Challenger charter now requires:
- Cross-referencing original prompt requirements vs implementation
- Runtime validation (launching the app, not just static review)
- Performing the same validation steps the prompt specifies

Orchestrator routing now requires:
- Forwarding the COMPLETE original prompt to workers (no summarizing)
- Always including full original requirements for completeness checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen force-pushed the fix/pr-badge-refresh-on-activate branch from cb41220 to a49c977 on April 1, 2026 21:02
…Challenger

Implementer now follows 4 steps: Plan → Implement → Validate → Self-review.
Creates a requirements checklist before writing code and verifies every item
before reporting completion.

Challenger now follows 4 steps: Build checklist → Code review → Completeness
check → Runtime validation. Extracts requirements into a numbered checklist
and verifies each item individually, matching the approach from proven
multi-agent orchestration patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Owner Author

PureWeen commented Apr 3, 2026

🔍 R1 Review — PR #459

Reviewer: PP PR REVIEWER-worker-5 (3-model consensus)
Models: Claude Opus 4.6 · Claude Sonnet 4.6 · GPT-5.3-Codex
CI: No checks configured
Prior reviews: None


⚠️ Important: PR title/description do NOT match the code

The PR title ("fix: Re-check PR badge when session becomes active") and body (describing SessionListItem IsActive tracking, ExpandedSessionView cache invalidation, PrLinkService changes) describe completely different code than what is in the diff. The actual diff only changes worker prompt text in ModelCapabilities.cs for the "Implement & Challenge" group preset.

PR #469 (fix/pr-badge-refresh) is the one with the actual PR badge code changes.


Consensus Findings

| # | Finding | Severity | Models | File / Lines |
|---|---------|----------|--------|--------------|
| 1 | Misleading PR title/description: diff contains prompt improvements, not PR badge changes. Creates merge confusion risk: reviewers may approve thinking they reviewed a badge fix. Also pollutes git log --grep history. | 🟡 MODERATE | 3/3 (Opus + Sonnet + Codex) | PR metadata |
| 2 | Runtime validation hard requirement may cause stalls: new prompts mandate "MUST launch it and verify it works at runtime" for both Implementer and Challenger. In headless CI or contexts where runtime is unavailable, this could cause repeated failed loops instead of completing valid code-only tasks. | 🟢 MINOR | 2/3 (Opus + Codex) | ModelCapabilities.cs lines ~354-357, ~389-392 |

Non-consensus observations (1/3, informational only)

  • Sonnet noted that the new raw string literals introduce leading/trailing \n in the prompt strings (cosmetic, models tolerate it)
  • Sonnet noted Challenger "build the checklist" instruction relies on session history retention across iterations (fragile assumption, but works with current persistent sessions)
  • Codex noted increased token/context pressure from forwarding full original request + full worker output each iteration (up to 10 reflections)
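
The split between consensus findings and informational observations used in this review can be sketched as below. This is only an illustration of a 2-of-3 threshold filter: `split_by_consensus` and the finding keys are hypothetical, and it assumes each model's findings have already been normalized to comparable keys, which is the hard part in practice.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def split_by_consensus(
    reports: Dict[str, Iterable[str]], threshold: int = 2
) -> Tuple[List[str], List[str]]:
    """Split findings into (consensus, informational) by how many models flagged them."""
    flagged_by: Dict[str, Set[str]] = defaultdict(set)
    for model, findings in reports.items():
        for finding in findings:
            # A set means a model repeating itself still counts once.
            flagged_by[finding].add(model)
    consensus = sorted(f for f, models in flagged_by.items() if len(models) >= threshold)
    informational = sorted(f for f, models in flagged_by.items() if len(models) < threshold)
    return consensus, informational


# Hypothetical keys loosely modelled on the R1 findings above.
r1_reports = {
    "opus": ["title-mismatch", "must-launch"],
    "sonnet": ["title-mismatch", "raw-string-newlines"],
    "codex": ["title-mismatch", "must-launch", "token-pressure"],
}
consensus, informational = split_by_consensus(r1_reports)
```

With these inputs, `title-mismatch` (3/3) and `must-launch` (2/3) land in the consensus list, while the single-model observations stay informational.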

What's clean ✅

  • All 3 models confirm: no code bugs, regressions, security issues, data loss, or race conditions
  • [[GROUP_REFLECT_COMPLETE]] sentinel preserved — reflection loop won't break
  • WorkerSystemPrompts array length unchanged (2) — existing tests pass
  • MaxReflectIterations, DefaultWorktreeStrategy unchanged
  • Prompt content is well-structured and the intent (planning + completeness checking + runtime validation) is sound

Test coverage

No new code paths requiring tests — this is a prompt-text-only change. Existing tests (ImplementAndChallenge_Preset_HasDistinctPersonas, WorkerSystemPrompts_MatchWorkerCount) remain valid.


Recommended Action: ⚠️ Request Changes

  1. Fix the PR title and description to match the actual code (e.g., "improve: Structured planning & validation steps for Implement & Challenge preset"). This is the only blocker.
  2. Consider softening the "MUST launch" language to "SHOULD launch when runtime is available" to avoid stalls in headless contexts (minor, non-blocking).

The code changes themselves are safe to merge once the metadata is corrected.

…eview

Implement & Challenge:
- Implementer Step 2: Examine existing files before coding to match patterns
- Challenger Step 4: Must cite exact commands and output as evidence

PR Review Squad:
- Zero tolerance for test failures — always request changes, even for
  pre-existing/flaky tests. Every PR should leave the suite greener.
- Report ALL findings including minor nits. Every PR is an opportunity
  to improve the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Owner Author

PureWeen commented Apr 4, 2026

🔍 Multi-Model Code Review — PR #459 R2

PR: fix: Re-check PR badge when session becomes active (title unchanged)
Diff: 107 lines, 1 file (ModelCapabilities.cs) — prompt text only
CI: ⚠️ No checks configured
Prior reviews: R1 (⚠️ Request Changes — 2 findings)
Models: claude-opus-4.6 · claude-sonnet-4.6 · gpt-5.3-codex


R1 Findings Status

| # | R1 Finding | Status |
|---|------------|--------|
| 1 | 🟡 Misleading PR title/description (3/3) | Still present: title still says "Re-check PR badge" |
| 2 | 🟢 "MUST launch" hard requirement may stall (2/3) | Still present: see Finding #3 below |

Consensus Findings

1. 🟡 MODERATE — PR title/description still mismatches diff (R1 carryover)

Flagged by: Opus, Sonnet, Codex (3/3)

Title says "fix: Re-check PR badge when session becomes active." Diff only changes prompt text for the PR Reviewer and Implement & Challenge presets. No badge-related code. Misleads reviewers, pollutes git history, breaks git log --grep.

Fix: Rename to something like chore: strengthen reviewer and implement-challenge preset prompts.


2. 🔴 CRITICAL — Section 4b directly contradicts SharedContext delivered to the same agent

File: ModelCapabilities.cs, new section 4b (~lines 233–237) vs SharedContext (~lines 274–276)
Flagged by: Opus, Sonnet, Codex (3/3)

New 4b instructs: "Report ALL findings regardless of severity — even minor nits, naming inconsistencies, missing docs" and "Do not dismiss anything as 'too minor to mention.'"

Existing SharedContext (injected into every worker of this preset) instructs: "Only flag real issues: bugs, security holes, logic errors" and "NEVER comment on style, formatting, naming conventions, or documentation."

These are directly contradictory — the agent receives both "report all nits" and "NEVER comment on naming/docs" in the same prompt. Behavior becomes model-dependent and unpredictable. This also undermines the 2/3 consensus filter that is the multi-reviewer workflow's core value proposition.

Fix: Either (a) update SharedContext to match the new 4b philosophy, or (b) remove/soften 4b to align with the existing "real issues only" filter. The two sections must be consistent.


3. 🟡 MODERATE — "MUST launch" + Section 4a conflict with CI/headless contexts and consensus mechanism

File: ModelCapabilities.cs, 4a (~lines 227–232), Implementer prompt (~lines 357–359), Challenger prompt (~lines 389–392)
Flagged by: Opus, Sonnet, Codex (3/3)

  • "MUST launch": Both Implementer and Challenger mandate runtime launch ("Building alone is NOT sufficient"). In headless CI, container agents, or sessions without a display server, this is impossible. Agents will either stall attempting dotnet run, fabricate evidence, or reject valid work. Should be conditional: "If a runtime environment is available, launch and verify."

  • Section 4a: "ALWAYS request changes if ANY test fails, including pre-existing flaky tests" — the PR Reviewer operates on gh pr diff and may not have test output. Also creates tension with the existing CI distinction (PR-specific vs pre-existing failures).
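
The conditional wording suggested above ("If a runtime environment is available, launch and verify") amounts to a small gate in front of the validation step. The sketch below is an assumption-laden illustration, not code from this PR: `validation_plan` and its return strings are hypothetical, and treating a truthy `CI` environment variable as "headless" is a common convention (GitHub Actions sets `CI=true`) rather than a guarantee.

```python
import os
import shutil
from typing import Mapping


def validation_plan(env: Mapping[str, str], tool_on_path: bool) -> str:
    """Pick a validation strategy instead of unconditionally requiring a launch."""
    if not tool_on_path:
        # Nothing to launch with, so fall back to static checks.
        return "build-and-test-only"
    if env.get("CI", "").lower() in ("1", "true"):
        # Assumed convention: a truthy CI variable marks a headless CI runner.
        return "build-and-test-only"
    return "launch-and-verify"


def current_validation_plan(tool: str = "dotnet") -> str:
    """Evaluate the gate against the real process environment."""
    return validation_plan(os.environ, shutil.which(tool) is not None)
```

An agent that reports which branch it took, and why, gives the reviewer evidence either way, which avoids both the stall and the temptation to fabricate runtime output.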


Non-Consensus (1/3, informational)

| Observation | Model |
|-------------|-------|
| 4a "no exceptions" bypasses adversarial consensus gate for test failures | Opus |

What's Clean ✅

  • Implement & Challenge structured 4-step prompts (Plan, Implement, Validate, Self-review) are well-designed
  • [[GROUP_REFLECT_COMPLETE]] sentinel preserved
  • WorkerSystemPrompts array length unchanged (2)
  • RoutingContext improvements (forward COMPLETE request, verify completeness) are correct
  • No runtime code changes, no regressions, no security issues

Verdict: ⚠️ Request Changes

Three actions needed before merge:

  1. Fix the PR title and description to match the actual diff content
  2. Resolve the 4b ↔ SharedContext contradiction — these instructions are delivered to the same agent and directly conflict
  3. Soften "MUST launch" to conditional — headless agents cannot comply

The prompt engineering improvements are valuable — the structured planning, checklist verification, and completeness checking are good additions. Just need consistency with the existing review standards.


R2 re-review · consensus threshold: 2/3 models must agree

…ional runtime

- Update PR Review Squad SharedContext to flag ALL severities including
  minor nits (was 'NEVER comment on style' which contradicted 4b)
- Soften 'MUST launch' to 'launch when runtime is available' for
  headless/CI contexts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Owner Author

PureWeen commented Apr 4, 2026

🔍 Multi-Model Code Review — PR #459 R3

PR: fix: Re-check PR badge when session becomes active (title unchanged)
Diff: 122 lines, 1 file (ModelCapabilities.cs) — 4 commits
CI: ⚠️ No checks configured
Prior reviews: R1 (⚠️), R2 (⚠️ — 3 findings)
Models: claude-opus-4.6 · claude-sonnet-4.6 · gpt-5.3-codex


R2 Findings Status

| # | R2 Finding | Status |
|---|------------|--------|
| 1 | 🟡 PR title/description mismatches diff | Still open (3/3 confirm) |
| 2 | 🔴 Section 4b ↔ SharedContext contradiction | Fixed: SharedContext updated to "Flag ALL issues regardless of severity" (3/3 confirm) |
| 3 | 🟡 "MUST launch" stalls headless agents | Fixed: now conditional, "when a runtime environment is available" / "when possible" (2/3 confirm; Codex notes residual "MUST perform exact validation steps" for user-specified steps, but this is about following explicit user instructions, not general launch) |

Remaining Finding

🟡 MODERATE — PR title still mismatches diff content (R1 → R2 → R3 unresolved)

Flagged by: Opus, Sonnet, Codex (3/3)

Title: fix: Re-check PR badge when session becomes active
Diff: Prompt improvements for Implement & Challenge + PR Reviewer presets. Zero badge/session-active code.

Breaks git log traceability and reviewer triage. Should be: refine: Strengthen Implement & Challenge charters and PR reviewer prompts (or similar).

What's Clean ✅

  • SharedContext and 4b are now fully consistent — both say "flag all severities"
  • "MUST launch" conditionally gated for headless environments
  • [[GROUP_REFLECT_COMPLETE]] sentinel preserved
  • WorkerSystemPrompts array length unchanged (2)
  • Structured 4-step prompts (Plan → Implement → Validate → Self-review) well-designed
  • RoutingContext improvements (forward complete request, verify completeness) correct
  • No runtime code changes, no regressions, no security issues

Verdict: ✅ Approve (with title fix requested)

The two substantive R2 findings (SharedContext contradiction, headless stall) are properly resolved. The only remaining item is the misleading PR title — this is a metadata issue, not a code issue. The code changes themselves are safe and ready to merge.

Recommendation: Fix the PR title before or at merge time (e.g., edit via GitHub UI or gh pr edit 459 --title "refine: Strengthen Implement & Challenge charters and PR reviewer prompts").


R3 re-review · consensus threshold: 2/3 models must agree

@PureWeen PureWeen merged commit 32fe00e into main Apr 4, 2026
@PureWeen PureWeen deleted the fix/pr-badge-refresh-on-activate branch April 4, 2026 00:58