devdanzin · devdanzin · Apr 17, 2026 · Apr 17, 2026 · Apr 17, 2026 · Apr 17, 2026
diff --git a/plugins/cpython-review-toolkit/agents/c-complexity-analyzer.md b/plugins/cpython-review-toolkit/agents/c-complexity-analyzer.md
@@ -93,3 +93,10 @@ For each function with accidental complexity, provide:
 - **goto is idiomatic in C error handling**: High goto counts in CPython are normal and not a complexity concern unless the control flow is genuinely hard to follow.
 - **Context matters**: A complex function that is rarely modified and well-tested is lower priority than a complex function that is frequently changed.
 - **Suggest concrete refactorings**: Don't just say "this is complex" — show how to simplify it.
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/c-complexity-analyzer_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
diff --git a/plugins/cpython-review-toolkit/agents/error-path-analyzer.md b/plugins/cpython-review-toolkit/agents/error-path-analyzer.md
@@ -88,3 +88,39 @@ Beyond script findings, look for:
 - **SystemError is a real bug**: Functions that return NULL without setting an exception cause confusing errors for Python users.
 - **The canonical pattern is well-established**: CPython functions should use `goto error / goto done` with cleanup labels. Deviations aren't bugs but increase risk.
 - **Some APIs always succeed**: A few C API functions cannot fail (e.g., Py_INCREF on a known non-NULL pointer). Don't flag missing checks for these.
+
+## Safety Annotations
+
+`scan_error_paths.py` looks at C comments within +/- 5 lines of each candidate
+finding. If any comment contains one of the following keywords (case-insensitive
+substring match), the finding is downgraded to `confidence: low` and marked
+`suppressed_by_annotation: true`.
+
+Suppressing keywords:
+
+- `safety:` / `checked:` — reviewer vouches for the call site
+- `safe because` / `correct because` / `this is safe` — justification follows
+- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
+- `not a bug` — known-false-positive marker
+- `nolint` — general lint-suppression convention
+
+Example:
+```c
+/* checked: PyArg_ParseTuple validated above; err is 0 on this path. */
+return result;
+```
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/error-path-analyzer_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
+
+## Confidence
+
+- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
+- **MEDIUM** — similar with differences that require human verification; 70–89%.
+- **LOW** — superficially similar; requires code-context reading; 50–69%.
+
+Findings below LOW are not reported.
diff --git a/plugins/cpython-review-toolkit/agents/gil-discipline-checker.md b/plugins/cpython-review-toolkit/agents/gil-discipline-checker.md
@@ -84,3 +84,43 @@ For codebases targeting Python 3.13+:
 - **Blocking with GIL is a judgment call**: Short operations may not be worth the overhead of releasing/reacquiring the GIL. Flag but classify as CONSIDER.
 - **Free-threading is evolving**: For 3.13+ code, flag patterns that are unsafe under nogil but note that this is forward-looking guidance.
 - **Some modules are GIL-free by design**: Code in PC/ or Mac/ may use OS-level threading without the GIL. Understand context before flagging.
+
+## Safety Annotations
+
+`scan_gil_usage.py` looks at C comments within +/- 5 lines of each candidate
+finding. If any comment contains one of the following keywords (case-insensitive
+substring match), the finding is downgraded to `confidence: low` and marked
+`suppressed_by_annotation: true`.
+
+Suppressing keywords (GIL-specific terms included):
+
+- `gil held` / `gil-held` — the GIL is known-held on this path
+- `already locked` / `already protected` — mutex/GIL acquired by caller
+- `safety:` / `checked:` — reviewer vouches for the call site
+- `safe because` / `correct because` / `this is safe` — justification follows
+- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
+- `not a bug` — known-false-positive marker
+- `nolint` — general lint-suppression convention
+
+Example:
+```c
+Py_BEGIN_ALLOW_THREADS
+/* safety: gil-held by the _internal helper; no Python API called here. */
+ret = os_level_work();
+Py_END_ALLOW_THREADS
+```
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/gil-discipline-checker_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
+
+## Confidence
+
+- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
+- **MEDIUM** — similar with differences that require human verification; 70–89%.
+- **LOW** — superficially similar; requires code-context reading; 50–69%.
+
+Findings below LOW are not reported.
diff --git a/plugins/cpython-review-toolkit/agents/git-history-analyzer.md b/plugins/cpython-review-toolkit/agents/git-history-analyzer.md
@@ -247,3 +247,11 @@ Final summary:
 7. **Cap similar-bug findings at 10.** Cap risk matrix at 10 entries. Note totals if more exist.
 
 8. **Function-level churn uses regex for C files.** The script uses regex-based function boundary detection (consistent with other cpython-review-toolkit scripts). This handles most CPython functions including multi-line signatures and Argument Clinic `_impl` functions, but may miss functions with `#ifdef` brace imbalance.
+
+## Confidence
+
+- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
+- **MEDIUM** — similar with differences that require human verification; 70–89%.
+- **LOW** — superficially similar; requires code-context reading; 50–69%.
+
+Findings below LOW are not reported.
diff --git a/plugins/cpython-review-toolkit/agents/include-graph-mapper.md b/plugins/cpython-review-toolkit/agents/include-graph-mapper.md
@@ -106,3 +106,10 @@ From the script's `include_graph` and metrics:
 - **Internal headers are for internal use**: Code outside Include/internal/ using internal headers is a potential API stability issue.
 - **Include guards vs. pragma once**: CPython uses traditional include guards. Both are fine — don't flag this as an issue.
 - **Count accurately**: Report exact numbers from the script, not estimates.
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/include-graph-mapper_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
diff --git a/plugins/cpython-review-toolkit/agents/null-safety-scanner.md b/plugins/cpython-review-toolkit/agents/null-safety-scanner.md
@@ -72,3 +72,39 @@ Beyond script findings:
 - **Memory allocation can always fail**: Every malloc/PyMem_Malloc call must be checked, even if failure is unlikely.
 - **PyArg_ParseTuple failure leaves args uninitialized**: If the parse fails and the return isn't checked, extracted arguments are garbage.
 - **Some allocations are "infallible"**: PyMem_Malloc on small sizes almost never fails, but the check is still required per CPython coding standards.
+
+## Safety Annotations
+
+`scan_null_checks.py` looks at C comments within +/- 5 lines of each candidate
+finding. If any comment contains one of the following keywords (case-insensitive
+substring match), the finding is downgraded to `confidence: low` and marked
+`suppressed_by_annotation: true`.
+
+Suppressing keywords:
+
+- `safety:` / `checked:` — reviewer vouches for the call site
+- `safe because` / `correct because` / `this is safe` — justification follows
+- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
+- `not a bug` — known-false-positive marker
+- `nolint` — general lint-suppression convention
+
+Example:
+```c
+/* safety: ptr guaranteed non-NULL by caller contract. */
+ptr->field = value;
+```
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/null-safety-scanner_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
+
+## Confidence
+
+- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
+- **MEDIUM** — similar with differences that require human verification; 70–89%.
+- **LOW** — superficially similar; requires code-context reading; 50–69%.
+
+Findings below LOW are not reported.
diff --git a/plugins/cpython-review-toolkit/agents/pep7-style-checker.md b/plugins/cpython-review-toolkit/agents/pep7-style-checker.md
@@ -85,3 +85,10 @@ Provide actionable guidance:
 - **Generated code gets a pass**: Files like `Python/opcode_targets.h` are generated — don't flag style issues in generated files.
 - **Third-party code gets a pass**: Code under `Modules/_decimal/` or similar vendored directories follows their upstream style.
 - **Consistency matters more than perfection**: A file that consistently uses one style is better than a file with mixed styles, even if the consistent style isn't PEP 7.
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/pep7-style-checker_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
diff --git a/plugins/cpython-review-toolkit/agents/refcount-auditor.md b/plugins/cpython-review-toolkit/agents/refcount-auditor.md
@@ -132,3 +132,42 @@ Beyond script findings, look for these patterns in the code:
 - **Context matters**: A refcount leak in a rarely-called initialization function is less critical than one in a hot loop in ceval.c.
 - **CPython's own patterns**: CPython code sometimes intentionally leaks references to immortal objects (None, True, False) or module-level objects that live for the process lifetime. Don't flag these.
 - **Be precise**: Include exact line numbers, variable names, and API calls in every finding. Vague findings are not actionable.
+
+## Safety Annotations
+
+`scan_refcounts.py` looks at C comments within +/- 5 lines of each candidate
+finding. If any comment contains one of the following keywords (case-insensitive
+substring match), the finding is downgraded to `confidence: low` and marked
+`suppressed_by_annotation: true`. Reviewers should still eyeball these — the
+annotation is a hint, not a proof.
+
+Suppressing keywords (add to the comment nearest the flagged line):
+
+- `safety:` / `checked:` — reviewer vouches for the call site
+- `safe because` / `correct because` / `this is safe` — justification follows
+- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
+- `not a bug` — known-false-positive marker
+- `borrowed ok` — borrowed reference provably lives long enough
+- `refcount safe` — refcount is accounted for elsewhere
+- `nolint` — general lint-suppression convention
+
+Example:
+```c
+/* safety: PyList_GetItem returns a borrowed ref; list owned by caller. */
+PyObject *item = PyList_GetItem(list, 0);
+```
+
+## Running the script
+
+- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
+- Use a **unique temp filename** for the JSON output, e.g. `/tmp/refcount-auditor_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
+- Forward `--max-files N` and (where supported) `--workers N` from the caller.
+- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
+
+## Confidence
+
+- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
+- **MEDIUM** — similar with differences that require human verification; 70–89%.
+- **LOW** — superficially similar; requires code-context reading; 50–69%.
+
+Findings below LOW are not reported.
diff --git a/plugins/cpython-review-toolkit/commands/explore.md b/plugins/cpython-review-toolkit/commands/explore.md
@@ -137,31 +137,52 @@ After all agents complete, perform deduplication and conflict resolution, then p
 
 ## Findings by Priority
 
-### Must Fix (FIX)
-[Crash risks, memory corruption, undefined behavior]
+**Use global non-restarting numbering**: number ALL findings sequentially across all sections. FIX findings first (1-N), then CONSIDER (N+1-M), then POLICY (M+1-P). Use these same numbers in the action plan. This makes it easy to reference "Finding 37" in issue trackers and emails.
 
-### Should Consider (CONSIDER)
-[Improvement opportunities with trade-offs]
+### Must Fix (FIX) — N
+
+| # | Finding | File:Line | Agents |
+|---|---------|-----------|--------|
+| 1 | [Crash risk / memory corruption / UB — description] | [file:line] | [which agents found it] |
+
+### Should Consider (CONSIDER) — M
+
+| # | Finding | File:Line |
+|---|---------|-----------|
+| N+1 | [Improvement opportunity with trade-offs — description] | [file:line] |
 
 ### Tensions
 [Where agents disagree]
 
-### Policy Decisions (POLICY)
-[Team-level decisions needed]
+### Policy Decisions (POLICY) — P
+
+| # | Finding |
+|---|---------|
+| M+1 | [Team-level decision — description] |
+
+### Acceptable (ACCEPTABLE) — Q
+
+| # | Finding |
+|---|---------|
+| P+1 | [Informational / no action required — description] |
 
 ## Strengths
 [What the C code does well]
 
 ## Recommended Action Plan
 
-### Immediate
-1. [FIX items — safety-critical]
+Reference findings by their global number:
+
+### Immediate (FIX items)
+1. [Fix Finding 1 — description]
+2. [Fix Finding 2 — description]
 
-### Short-term
-1. [CONSIDER items — quality improvements]
+### Short-term (CONSIDER items)
+1. [Finding N+1 — description]
+2. [Finding N+2 — description]
 
-### Ongoing
-1. [POLICY decisions to make]
+### Longer-term (POLICY)
+1. [Finding M+1 — description]
 ```
 
 ## Usage Examples

diff --git a/plugins/cpython-review-toolkit/commands/health.md b/plugins/cpython-review-toolkit/commands/health.md
@@ -39,9 +39,14 @@ Run all agents in summary mode to produce a quick health dashboard. Each agent r
 ## Overall Health: X/10
 
 ## Top 3 Priorities
-1. [Most impactful improvement]
-2. [Next]
-3. [Next]
+
+Reference findings by global non-restarting numbering (FIX 1..N, CONSIDER N+1..M,
+POLICY M+1..P, ACCEPTABLE P+1..Q). Use the same numbers in `/explore` output so
+issues tracked here carry over to the detailed report.
+
+1. [Finding # — most impactful improvement]
+2. [Finding # — next]
+3. [Finding # — next]
 
 For detailed analysis, run:
   /cpython-review-toolkit:explore . [aspect] deep

diff --git a/plugins/cpython-review-toolkit/scripts/analyze_includes.py b/plugins/cpython-review-toolkit/scripts/analyze_includes.py
@@ -9,9 +9,10 @@
 - api_tiers: classification of headers into public/cpython/internal
 
 Usage:
-    python analyze_includes.py [path]
+    python analyze_includes.py [path] [--max-files N]
 
     path: directory, file, or omitted for current directory
+    --max-files N: cap the number of .c/.h files scanned (0 = unlimited)
 """
 
 import json

diff --git a/plugins/cpython-review-toolkit/scripts/check_pep7.py b/plugins/cpython-review-toolkit/scripts/check_pep7.py
@@ -8,9 +8,10 @@
 - trailing whitespace, missing braces
 
 Usage:
-    python check_pep7.py [path]
+    python check_pep7.py [path] [--max-files N]
 
     path: directory, file, or omitted for current directory
+    --max-files N: cap the number of .c/.h files scanned (0 = unlimited)
 """
 
 import json

diff --git a/plugins/cpython-review-toolkit/scripts/measure_c_complexity.py b/plugins/cpython-review-toolkit/scripts/measure_c_complexity.py
@@ -7,9 +7,10 @@
 - weighted score (1-10)
 
 Usage:
-    python measure_c_complexity.py [path]
+    python measure_c_complexity.py [path] [--max-files N]
 
     path: directory, file, or omitted for current directory
+    --max-files N: cap the number of .c/.h files scanned (0 = unlimited)
 """
 
 import json