Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -93,3 +93,10 @@ For each function with accidental complexity, provide:
- **goto is idiomatic in C error handling**: High goto counts in CPython are normal and not a complexity concern unless the control flow is genuinely hard to follow.
- **Context matters**: A complex function that is rarely modified and well-tested is lower priority than a complex function that is frequently changed.
- **Suggest concrete refactorings**: Don't just say "this is complex" — show how to simplify it.

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/c-complexity-analyzer_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
36 changes: 36 additions & 0 deletions plugins/cpython-review-toolkit/agents/error-path-analyzer.md
Original file line number Diff line number Diff line change
Expand Up @@ -88,3 +88,39 @@ Beyond script findings, look for:
- **SystemError is a real bug**: Functions that return NULL without setting an exception cause confusing errors for Python users.
- **The canonical pattern is well-established**: CPython functions should use `goto error / goto done` with cleanup labels. Deviations aren't bugs but increase risk.
- **Some APIs always succeed**: A few C API functions cannot fail (e.g., Py_INCREF on a known non-NULL pointer). Don't flag missing checks for these.

## Safety Annotations

`scan_error_paths.py` looks at C comments within +/- 5 lines of each candidate
finding. If any comment contains one of the following keywords (case-insensitive
substring match), the finding is downgraded to `confidence: low` and marked
`suppressed_by_annotation: true`.

Suppressing keywords:

- `safety:` / `checked:` — reviewer vouches for the call site
- `safe because` / `correct because` / `this is safe` — justification follows
- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
- `not a bug` — known-false-positive marker
- `nolint` — general lint-suppression convention

Example:
```c
/* checked: PyArg_ParseTuple validated above; err is 0 on this path. */
return result;
```

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/error-path-analyzer_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.

## Confidence

- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
- **MEDIUM** — similar with differences that require human verification; 70–89%.
- **LOW** — superficially similar; requires code-context reading; 50–69%.

Findings below LOW are not reported.
40 changes: 40 additions & 0 deletions plugins/cpython-review-toolkit/agents/gil-discipline-checker.md
Original file line number Diff line number Diff line change
Expand Up @@ -84,3 +84,43 @@ For codebases targeting Python 3.13+:
- **Blocking with GIL is a judgment call**: Short operations may not be worth the overhead of releasing/reacquiring the GIL. Flag but classify as CONSIDER.
- **Free-threading is evolving**: For 3.13+ code, flag patterns that are unsafe under nogil but note that this is forward-looking guidance.
- **Some modules are GIL-free by design**: Code in PC/ or Mac/ may use OS-level threading without the GIL. Understand context before flagging.

## Safety Annotations

`scan_gil_usage.py` looks at C comments within +/- 5 lines of each candidate
finding. If any comment contains one of the following keywords (case-insensitive
substring match), the finding is downgraded to `confidence: low` and marked
`suppressed_by_annotation: true`.

Suppressing keywords (GIL-specific terms included):

- `gil held` / `gil-held` — the GIL is known-held on this path
- `already locked` / `already protected` — mutex/GIL acquired by caller
- `safety:` / `checked:` — reviewer vouches for the call site
- `safe because` / `correct because` / `this is safe` — justification follows
- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
- `not a bug` — known-false-positive marker
- `nolint` — general lint-suppression convention

Example:
```c
Py_BEGIN_ALLOW_THREADS
/* safety: gil-held by the _internal helper; no Python API called here. */
ret = os_level_work();
Py_END_ALLOW_THREADS
```

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/gil-discipline-checker_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.

## Confidence

- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
- **MEDIUM** — similar with differences that require human verification; 70–89%.
- **LOW** — superficially similar; requires code-context reading; 50–69%.

Findings below LOW are not reported.
8 changes: 8 additions & 0 deletions plugins/cpython-review-toolkit/agents/git-history-analyzer.md
Original file line number Diff line number Diff line change
Expand Up @@ -247,3 +247,11 @@ Final summary:
7. **Cap similar-bug findings at 10.** Cap risk matrix at 10 entries. Note totals if more exist.

8. **Function-level churn uses regex for C files.** The script uses regex-based function boundary detection (consistent with other cpython-review-toolkit scripts). This handles most CPython functions including multi-line signatures and Argument Clinic `_impl` functions, but may miss functions with `#ifdef` brace imbalance.

## Confidence

- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
- **MEDIUM** — similar with differences that require human verification; 70–89%.
- **LOW** — superficially similar; requires code-context reading; 50–69%.

Findings below LOW are not reported.
7 changes: 7 additions & 0 deletions plugins/cpython-review-toolkit/agents/include-graph-mapper.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,3 +106,10 @@ From the script's `include_graph` and metrics:
- **Internal headers are for internal use**: Code outside Include/internal/ using internal headers is a potential API stability issue.
- **Include guards vs. pragma once**: CPython uses traditional include guards. Both are fine — don't flag this as an issue.
- **Count accurately**: Report exact numbers from the script, not estimates.

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/include-graph-mapper_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
36 changes: 36 additions & 0 deletions plugins/cpython-review-toolkit/agents/null-safety-scanner.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,3 +72,39 @@ Beyond script findings:
- **Memory allocation can always fail**: Every malloc/PyMem_Malloc call must be checked, even if failure is unlikely.
- **PyArg_ParseTuple failure leaves args uninitialized**: If the parse fails and the return isn't checked, extracted arguments are garbage.
- **Some allocations are "infallible"**: PyMem_Malloc on small sizes almost never fails, but the check is still required per CPython coding standards.

## Safety Annotations

`scan_null_checks.py` looks at C comments within +/- 5 lines of each candidate
finding. If any comment contains one of the following keywords (case-insensitive
substring match), the finding is downgraded to `confidence: low` and marked
`suppressed_by_annotation: true`.

Suppressing keywords:

- `safety:` / `checked:` — reviewer vouches for the call site
- `safe because` / `correct because` / `this is safe` — justification follows
- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
- `not a bug` — known-false-positive marker
- `nolint` — general lint-suppression convention

Example:
```c
/* safety: ptr guaranteed non-NULL by caller contract. */
ptr->field = value;
```

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/null-safety-scanner_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.

## Confidence

- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
- **MEDIUM** — similar with differences that require human verification; 70–89%.
- **LOW** — superficially similar; requires code-context reading; 50–69%.

Findings below LOW are not reported.
7 changes: 7 additions & 0 deletions plugins/cpython-review-toolkit/agents/pep7-style-checker.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,3 +85,10 @@ Provide actionable guidance:
- **Generated code gets a pass**: Files like `Python/opcode_targets.h` are generated — don't flag style issues in generated files.
- **Third-party code gets a pass**: Code under `Modules/_decimal/` or similar vendored directories follows their upstream style.
- **Consistency matters more than perfection**: A file that consistently uses one style is better than a file with mixed styles, even if the consistent style isn't PEP 7.

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/pep7-style-checker_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.
39 changes: 39 additions & 0 deletions plugins/cpython-review-toolkit/agents/refcount-auditor.md
Original file line number Diff line number Diff line change
Expand Up @@ -132,3 +132,42 @@ Beyond script findings, look for these patterns in the code:
- **Context matters**: A refcount leak in a rarely-called initialization function is less critical than one in a hot loop in ceval.c.
- **CPython's own patterns**: CPython code sometimes intentionally leaks references to immortal objects (None, True, False) or module-level objects that live for the process lifetime. Don't flag these.
- **Be precise**: Include exact line numbers, variable names, and API calls in every finding. Vague findings are not actionable.

## Safety Annotations

`scan_refcounts.py` looks at C comments within +/- 5 lines of each candidate
finding. If any comment contains one of the following keywords (case-insensitive
substring match), the finding is downgraded to `confidence: low` and marked
`suppressed_by_annotation: true`. Reviewers should still eyeball these — the
annotation is a hint, not a proof.

Suppressing keywords (add to the comment nearest the flagged line):

- `safety:` / `checked:` — reviewer vouches for the call site
- `safe because` / `correct because` / `this is safe` — justification follows
- `intentional` / `by design` / `deliberately` / `expected` — pattern is chosen
- `not a bug` — known-false-positive marker
- `borrowed ok` — borrowed reference provably lives long enough
- `refcount safe` — refcount is accounted for elsewhere
- `nolint` — general lint-suppression convention

Example:
```c
/* safety: PyList_GetItem returns a borrowed ref; list owned by caller. */
PyObject *item = PyList_GetItem(list, 0);
```

## Running the script

- Call the script with a Bash timeout of **300000 ms** (5 min). The default 120s kills on large repos.
- Use a **unique temp filename** for the JSON output, e.g. `/tmp/refcount-auditor_<scope>_$$.json` — the `$$` PID suffix prevents collisions when multiple agents run concurrently.
- Forward `--max-files N` and (where supported) `--workers N` from the caller.
- If the script **times out or errors, do NOT retry it.** Fall back to Grep/Read for the same question. Long-running runs should use `run_in_background`.

## Confidence

- **HIGH** — structurally identical to a known-bad pattern, or exact signature match; ≥90% likelihood of being a true positive.
- **MEDIUM** — similar with differences that require human verification; 70–89%.
- **LOW** — superficially similar; requires code-context reading; 50–69%.

Findings below LOW are not reported.
45 changes: 33 additions & 12 deletions plugins/cpython-review-toolkit/commands/explore.md
Original file line number Diff line number Diff line change
Expand Up @@ -137,31 +137,52 @@ After all agents complete, perform deduplication and conflict resolution, then p

## Findings by Priority

### Must Fix (FIX)
[Crash risks, memory corruption, undefined behavior]
**Use global non-restarting numbering**: number ALL findings sequentially across all sections. FIX findings first (1-N), then CONSIDER (N+1-M), then POLICY (M+1-P). Use these same numbers in the action plan. This makes it easy to reference "Finding 37" in issue trackers and emails.

### Should Consider (CONSIDER)
[Improvement opportunities with trade-offs]
### Must Fix (FIX) — N

| # | Finding | File:Line | Agents |
|---|---------|-----------|--------|
| 1 | [Crash risk / memory corruption / UB — description] | [file:line] | [which agents found it] |

### Should Consider (CONSIDER) — M

| # | Finding | File:Line |
|---|---------|-----------|
| N+1 | [Improvement opportunity with trade-offs — description] | [file:line] |

### Tensions
[Where agents disagree]

### Policy Decisions (POLICY)
[Team-level decisions needed]
### Policy Decisions (POLICY) — P

| # | Finding |
|---|---------|
| M+1 | [Team-level decision — description] |

### Acceptable (ACCEPTABLE) — Q

| # | Finding |
|---|---------|
| P+1 | [Informational / no action required — description] |

## Strengths
[What the C code does well]

## Recommended Action Plan

### Immediate
1. [FIX items — safety-critical]
Reference findings by their global number:

### Immediate (FIX items)
1. [Fix Finding 1 — description]
2. [Fix Finding 2 — description]

### Short-term
1. [CONSIDER items — quality improvements]
### Short-term (CONSIDER items)
1. [Finding N+1 — description]
2. [Finding N+2 — description]

### Ongoing
1. [POLICY decisions to make]
### Longer-term (POLICY)
1. [Finding M+1 — description]
```

## Usage Examples
Expand Down
11 changes: 8 additions & 3 deletions plugins/cpython-review-toolkit/commands/health.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,9 +39,14 @@ Run all agents in summary mode to produce a quick health dashboard. Each agent r
## Overall Health: X/10

## Top 3 Priorities
1. [Most impactful improvement]
2. [Next]
3. [Next]

Reference findings by global non-restarting numbering (FIX 1..N, CONSIDER N+1..M,
POLICY M+1..P, ACCEPTABLE P+1..Q). Use the same numbers in `/explore` output so
issues tracked here carry over to the detailed report.

1. [Finding # — most impactful improvement]
2. [Finding # — next]
3. [Finding # — next]

For detailed analysis, run:
/cpython-review-toolkit:explore . [aspect] deep
Expand Down
3 changes: 2 additions & 1 deletion plugins/cpython-review-toolkit/scripts/analyze_includes.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,9 +9,10 @@
- api_tiers: classification of headers into public/cpython/internal

Usage:
python analyze_includes.py [path]
python analyze_includes.py [path] [--max-files N]

path: directory, file, or omitted for current directory
--max-files N: cap the number of .c/.h files scanned (0 = unlimited)
"""

import json
Expand Down
3 changes: 2 additions & 1 deletion plugins/cpython-review-toolkit/scripts/check_pep7.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,10 @@
- trailing whitespace, missing braces

Usage:
python check_pep7.py [path]
python check_pep7.py [path] [--max-files N]

path: directory, file, or omitted for current directory
--max-files N: cap the number of .c/.h files scanned (0 = unlimited)
"""

import json
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,10 @@
- weighted score (1-10)

Usage:
python measure_c_complexity.py [path]
python measure_c_complexity.py [path] [--max-files N]

path: directory, file, or omitted for current directory
--max-files N: cap the number of .c/.h files scanned (0 = unlimited)
"""

import json
Expand Down
Loading