diff --git a/.github/skills/review-council/SKILL.md b/.github/skills/review-council/SKILL.md
index 4f60fd347df..6733bfbc5a4 100644
--- a/.github/skills/review-council/SKILL.md
+++ b/.github/skills/review-council/SKILL.md
@@ -40,6 +40,10 @@ Are the goals met? Cross-reference every claim in the PR description, linked iss
 - **Orphan changes**: code changed but not mentioned in any claim — why is this here?
 - **Partial implementations**: claim says X, code does X-minus-something — what's missing?
 
+**Before flagging**, apply these gates:
+- **Execution context**: understand how and where the code runs before judging it. A test harness, a standalone project, and the main compiler have different constraints. Do not assume the worst-case context.
+- **Direction**: classify each behavioral change as regression (breaks something), improvement (fixes bug, removes race, simplifies), or unclear. Only regressions are findings. Improvements are observations at most. For unclear, state uncertainty — do not present as a defect.
+
 ---
 
 ### Dimension 2: Test Coverage
@@ -71,6 +75,7 @@ Assess structural quality of the changes:
 - **Ad-hoc "if/then" patches**: flag any `if condition then specialCase else normalPath` that looks like a band-aid rather than a systematic fix. These are symptoms of not understanding the root cause — the fix should be at the source, not patched at a consumer. A conditional that exists only to work around a bug elsewhere is a code smell.
 - **Duplicated logic within the PR diff**: same or near-same code appearing in multiple places in the changeset
 - **Error handling**: not swallowing exceptions, not ignoring Result values, not using `failwith` where a typed error would be appropriate
+- **Respect intentional deviations**: some projects (standalone tests, isolated builds, special harnesses) deliberately diverge from repo-wide conventions. Before flagging a pattern as wrong, check whether the project has a structural reason to be different. Intentional isolation is not inconsistency.
 
 ---
 
@@ -103,11 +108,17 @@ Assess complexity of added/changed code:
 Collect all findings from the 15 agents. Then:
 
 1. **Deduplicate**: multiple agents will find the same issue. Merge findings that point at the same location and describe the same problem. Keep the best-written description.
-2. **Classify** into three buckets:
+2. **Filter false positives** before classifying:
+   - **Wrong context assumed**: finding assumes a different execution context than the code actually runs in → discard.
+   - **Correct convention**: finding flags a pattern that is idiomatic for the tool/context in use → discard.
+   - **Improvement, not regression**: change makes things better (bugfix, simplification, race fix) → downgrade to informational, do not include in actionable findings.
+   - **"Unexplained" ≠ "wrong"**: if the only concern is missing rationale in the commit message, that is a documentation gap — not a defect. Convert to comment, not finding.
+   - **Speculation consensus**: ≥2 models agree but reasoning relies on "could cause" without evidence it does → LOW confidence.
+3. **Classify** into three buckets:
    - **Behavioral**: missing feature coverage, missing tests, incorrect logic, claims not met — things that affect correctness
    - **Quality**: code structure, readability, complexity, reuse opportunities — it works, but could be better
    - **Nitpick**: typos, naming, formatting, minor style — agents love producing these to have *something* to say. Low priority. Only surface if there are no higher-level findings.
-3. **Rank within each bucket**: prefer findings flagged by more agents (cross-model agreement = higher confidence).
-4. **Present**: Behavioral first, then Quality, then Nitpicks (if any). For each finding: location, dimension, description, how many agents flagged it.
+4. **Rank within each bucket**: prefer findings flagged by more agents (cross-model agreement = higher confidence).
+5. **Present**: Behavioral first, then Quality, then Nitpicks (if any). For each finding: location, dimension, description, how many agents flagged it.
 
 If a model is unavailable at runtime, proceed with the remaining models. Minimum viable council = 2 models.