Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
b8dec8c
empty
AlexandreYang Mar 14, 2026
aa98631
update skills
AlexandreYang Mar 14, 2026
3d4486c
[iter 1] Fix FuzzEchoEscapes CI timeout by reusing temp directory
AlexandreYang Mar 14, 2026
2290224
Update address-pr-comments skill and add yash reference test scenarios
AlexandreYang Mar 14, 2026
f140744
[iter 1] Fix yash test scenarios: correct exit codes, remove source r…
AlexandreYang Mar 14, 2026
8976317
[iter 2] Rename yash directories to descriptive category names
AlexandreYang Mar 14, 2026
87bcf08
[iter 3] Remove duplicate blocked_commands tests and use exact stderr…
AlexandreYang Mar 14, 2026
cd4714a
[iter 4] Add while/until break/continue tests and fix command_not_fou…
AlexandreYang Mar 14, 2026
d2ea3dd
[iter 4] Fix bash comparison test for command_not_found scenario
AlexandreYang Mar 14, 2026
df4a130
[iter 4] Fix wc fuzz CI timeouts by reusing temp directory
AlexandreYang Mar 14, 2026
d94bbd8
[iter 4] Fix FuzzTail CI timeout by reusing temp directory
AlexandreYang Mar 14, 2026
eb36f85
[iter 6] Use stderr_contains instead of skip_assert_against_bash for …
AlexandreYang Mar 14, 2026
39f5498
[iter 9] Add YAML comments explaining why skip_assert_against_bash: t…
AlexandreYang Mar 14, 2026
f93bdc2
[iter 9] Fix FuzzCatNumberLines CI timeout by reusing temp directory
AlexandreYang Mar 14, 2026
5c1003a
[iter 9] Fix all remaining fuzz tests to reuse temp directory
AlexandreYang Mar 15, 2026
8b9e09c
[iter 9] Fix FuzzTestFileOps CI timeout by isolating per-iteration fi…
AlexandreYang Mar 15, 2026
0a8981e
[iter 9] Fix FuzzCatDisplayFlags CI timeout by isolating fuzz workers
AlexandreYang Mar 15, 2026
364da49
[iter 11] Remove unnecessary skip_assert_against_bash from two test s…
AlexandreYang Mar 15, 2026
000e115
[iter 2] Fix data race in fuzz tests by isolating per-iteration file …
AlexandreYang Mar 15, 2026
d882736
[iter 4] Fix stderr_contains usage and os.RemoveAll error handling
AlexandreYang Mar 15, 2026
4122ce7
[iter 4] Fix unreachable fuzz seeds in FuzzLsRecursive and FuzzString…
AlexandreYang Mar 15, 2026
52aff40
[iter 4] Fix FuzzTestNesting CI timeout by limiting expression comple…
AlexandreYang Mar 15, 2026
a4f2ac5
[iter 5] Extract FuzzIterDir helper, fix FuzzStringsRadix limit, tigh…
AlexandreYang Mar 15, 2026
8dbaa48
[iter 6] Use testutil.FuzzIterDir for ls and echo fuzz test isolation
AlexandreYang Mar 15, 2026
f96db17
[iter 6] Fix Fuzz (tail) CI timeout by removing AllowedPaths from fuz…
AlexandreYang Mar 15, 2026
5e53a1e
[iter 7] Fix variable naming nit in echo fuzz tests
AlexandreYang Mar 15, 2026
c0de65c
[iter 7] Fix FuzzUniq CI timeout by removing AllowedPaths from fuzz h…
AlexandreYang Mar 15, 2026
1a37b9a
[iter 8] Consolidate fuzzRunCtx into testutil.FuzzRunScriptCtx and fi…
AlexandreYang Mar 15, 2026
89229b8
[iter 8] Fix FuzzEcho CI timeout by isolating fuzz workers from Allow…
AlexandreYang Mar 15, 2026
51a56d0
Merge remote-tracking branch 'origin/main' into alex/test_cases
AlexandreYang Mar 16, 2026
114c904
[iter 1] Strengthen command-not-found assertions and restore AllowedP…
AlexandreYang Mar 16, 2026
cf08065
[iter 1] Remove unused cmdRunCtx function in echo fuzz tests
AlexandreYang Mar 16, 2026
5a3d943
[iter 5] Upgrade FuzzIterDir cleanup to t.Errorf to surface resource …
AlexandreYang Mar 16, 2026
f7f0edc
[iter 9] Improve skip comment for cat nonexistent_file test
AlexandreYang Mar 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
166 changes: 133 additions & 33 deletions .claude/skills/address-pr-comments/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,18 +16,32 @@ Determine the target PR:

```bash
# If argument provided, use it; otherwise detect from current branch
gh pr view $ARGUMENTS --json number,url,headRefName,baseRefName
gh pr view $ARGUMENTS --json number,url,headRefName,baseRefName,author
```

If no PR is found, stop and inform the user.

Extract owner, repo, and PR number for subsequent API calls:
Extract owner, repo, PR number, and **PR author login** for subsequent API calls:

```bash
gh repo view --json owner,name --jq '"\(.owner.login)/\(.name)"'
```

### 2. Fetch all review comments
### 2. Fetch review comments and summaries

#### 2a. Determine the latest review round

Find the timestamp of the most recent push to the PR branch — this marks the boundary of the current review round:

```bash
# Get the most recent push event (last commit pushed)
gh api repos/{owner}/{repo}/pulls/{pr-number}/commits \
--jq '.[-1].commit.committer.date'
```

Store this as `$LAST_PUSH_DATE`. Comments created **after** this timestamp are from the current (latest) review round. If no filtering by round is desired (e.g., first review), process all unresolved comments.

#### 2b. Fetch inline review comments

Retrieve all review comments (inline code comments) on the PR:

Expand All @@ -38,18 +52,28 @@ gh api repos/{owner}/{repo}/pulls/{pr-number}/comments \
2>&1 | head -500
```

Also fetch top-level review comments (review bodies):
#### 2c. Fetch review summaries

Fetch top-level review comments (review bodies/summaries). These often contain high-level feedback and action items:

```bash
gh api repos/{owner}/{repo}/pulls/{pr-number}/reviews \
--jq '.[] | select(.body != "" and .body != null) | {id: .id, user: .user.login, state: .state, body: .body}' \
--jq '.[] | select(.body != "" and .body != null) | {id: .id, user: .user.login, state: .state, body: .body, submitted_at: .submitted_at}' \
2>&1 | head -200
```

Filter out:
- Comments authored by the PR author (self-comments, unless they contain a TODO/action item)
**Pay special attention to review summaries** — they often list multiple action items in a single review body. Parse each action item from the summary as a separate work item.

#### 2d. Filter comments

**Include** comments from:
- **Reviewers** (anyone who is not the PR author) — standard review feedback
- **The PR author themselves** — self-comments are treated as actionable TODOs/notes-to-self that should be addressed
- **@codex** and other AI reviewers — treat their comments with the same weight as human reviewer comments

**Exclude**:
- Already-resolved threads
- Bot comments that are purely informational
- Bot comments that are purely informational (CI status, auto-generated labels, etc.) — but NOT @codex or other AI reviewer comments, which are substantive

Check which threads are already resolved:

Expand Down Expand Up @@ -79,6 +103,28 @@ gh api graphql -f query='

Only process **unresolved** threads with actionable comments.

#### 2e. Prioritize latest comments

When there are many unresolved comments, prioritize:
1. Comments from the **latest review round** (after `$LAST_PUSH_DATE`)
2. Comments from review summaries (they represent the reviewer's consolidated view)
3. Older unresolved comments that are still relevant

### 2b. Read the PR specs

Before evaluating any comment, read the PR description to check for a **SPECS** section:

```bash
gh pr view $ARGUMENTS --json body --jq '.body'
```

If a SPECS section is present, **it defines the authoritative requirements for this PR**. Specs override:
- Your assumptions about backward compatibility or design intent
- Inline code comments
- Conventions from other parts of the codebase

Store the specs for use in step 4 (validity evaluation). If a reviewer comment aligns with a spec, the comment is **valid by definition** — even if you think the current implementation is reasonable.

### 3. Understand each comment

For each unresolved review comment:
Expand All @@ -99,36 +145,45 @@ For each unresolved review comment:
| **Nitpick** | Minor optional suggestion | Evaluate — fix if trivial, otherwise reply explaining the tradeoff |
| **Invalid/outdated** | Comment doesn't apply or is based on a misunderstanding | Reply politely explaining why |

### 4. Evaluate validity — bash behavior is the source of truth
### 4. Evaluate validity — specs and bash behavior are the sources of truth

There are two sources of truth, checked in this order:

**The shell must match bash behavior unless it intentionally diverges** (e.g., sandbox restrictions, blocked commands, readonly enforcement). This principle overrides reviewer suggestions.
1. **PR specs** (from step 2b) — if present, specs are the highest authority for what this PR should do
2. **Bash behavior** — the shell must match bash unless it intentionally diverges (sandbox restrictions, blocked commands, readonly enforcement)

**CRITICAL: Never invent justifications for dismissing a comment.** If a reviewer says "the spec requires X" and the spec does require X, the comment is valid — even if you think the current implementation is a reasonable alternative. Do not fabricate reasons like "backward compatibility" or "design intent" unless those reasons are explicitly stated in the specs or CLAUDE.md.

For each comment, determine if it is **valid and actionable**:

1. **Verify against bash** — always check what bash actually does:
1. **Check against PR specs first** — if a SPECS section exists and the comment aligns with a spec, the comment is **valid by definition**. Do not dismiss it.
2. **Verify against bash** — for comments about shell behavior, check what bash actually does:
```bash
docker run --rm debian:bookworm-slim bash -c '<relevant script>'
```
2. **Read the relevant code** in full — not just the diff, but the surrounding implementation
3. **Check project conventions** in `CLAUDE.md` and `AGENTS.md`
4. **Consider side effects** — will the change break other tests or behaviors?
5. **Check for duplicates** — is the same issue raised in multiple comments? Group them
3. **Read the relevant code** in full — not just the diff, but the surrounding implementation
4. **Check project conventions** in `CLAUDE.md` and `AGENTS.md`
5. **Consider side effects** — will the change break other tests or behaviors?
6. **Check for duplicates** — is the same issue raised in multiple comments? Group them

Decision matrix:

| Reviewer says | Bash does | Shell intentionally diverges? | Action |
|--------------|-----------|-------------------------------|--------|
| "This is wrong" | Reviewer is right | No | **Fix the implementation** to match bash |
| "This is wrong" | Current code matches bash | No | **Reply** explaining it matches bash, with proof |
| "This is wrong" | N/A | Yes (sandbox/security) | **Reply** explaining the intentional divergence |
| "Do it differently" | Suggestion matches bash better | No | **Fix the implementation** to match bash |
| "Do it differently" | Current code already matches bash | No | **Reply** — bash compatibility takes priority |
| Reviewer says | Spec says | Bash does | Action |
|--------------|-----------|-----------|--------|
| "Spec requires X" | Spec does require X | N/A | **Fix the implementation** to match the spec |
| "Spec requires X" | No such spec exists | N/A | **Reply** noting the spec doesn't mention this |
| "This is wrong" | No spec relevant | Reviewer is right | **Fix the implementation** to match bash |
| "This is wrong" | No spec relevant | Current code matches bash | **Reply** explaining it matches bash, with proof |
| "This is wrong" | No spec relevant | N/A (sandbox/security) | **Reply** explaining the intentional divergence |
| "Do it differently" | No spec relevant | Suggestion matches bash better | **Fix the implementation** to match bash |
| "Do it differently" | No spec relevant | Current code already matches bash | **Reply** — bash compatibility takes priority |

If a comment is **not valid**:
- Prepare a polite reply with proof (e.g., "This matches bash behavior — verified with `docker run --rm debian:bookworm-slim bash -c '...'`")
- If the divergence is intentional, explain why (sandbox restriction, security, etc.)
- **Never claim "backward compatibility" or "design intent" unless you can point to a specific line in the specs or CLAUDE.md that says so**

If a comment is **valid** (i.e., fixing it brings the shell closer to bash, or addresses a real bug):
If a comment is **valid** (i.e., it aligns with a spec, brings the shell closer to bash, or addresses a real bug):
- Proceed to step 5

### 5. Implement fixes
Expand Down Expand Up @@ -181,7 +236,11 @@ If fixes span unrelated areas, prefer multiple focused commits over one large co

**All replies MUST be prefixed with `[<LLM model name>]`** (e.g. `[Claude Opus 4.6]`) so reviewers can tell the response came from an AI.

For each comment that was addressed:
Handle comments differently based on who authored them:

#### Reviewer comments (not the PR author)

For each reviewer comment that was addressed:

1. **Reply** explaining what was fixed:
```bash
Expand Down Expand Up @@ -221,16 +280,57 @@ For each comment that was addressed:
' -f threadId="<thread-id>"
```

For comments that were **not valid** or were **questions**, reply (prefixed with `[<MODELL NAME - VERSION>]`) with an explanation but do NOT resolve — let the reviewer decide.
#### PR author self-comments

For comments authored by the PR author (self-notes/TODOs):

1. **Fix the issue** described in the comment (these are actionable items the author left for themselves)
2. **Resolve** the thread (the PR author can resolve their own threads)
3. **Do NOT reply** to self-comments — just fix and resolve. No need for the AI to narrate back to the same person who wrote the note.

#### Review summary action items

For action items extracted from review summaries (step 2c):

1. **Fix each action item** as if it were an inline comment
2. **Reply to the review** with a summary of all action items addressed:
```bash
gh api repos/{owner}/{repo}/pulls/{pr-number}/reviews/{review-id}/comments \
-f body="[<MODEL NAME> - <VERSION>] Addressed the following from this review:
- <action item 1>: <what was done>
- <action item 2>: <what was done>"
```
If the `comments` endpoint doesn't work for review-level replies, use an issue comment instead:
```bash
gh api repos/{owner}/{repo}/issues/{pr-number}/comments \
-f body="[<MODEL NAME> - <VERSION>] Addressed review feedback from @{reviewer}:
- <action item 1>: <what was done>
- <action item 2>: <what was done>"
```

#### Invalid or question comments

For comments that were **not valid** or were **questions**, reply (prefixed with `[<MODEL NAME> - <VERSION>]`) with an explanation but do NOT resolve — let the reviewer decide.

**IMPORTANT: Never resolve a thread where the reviewer's comment aligns with a PR spec but the implementation doesn't match.** These are valid spec violations — fix the code instead. If you cannot fix it, leave the thread unresolved and explain the blocker.

### 8. Summary

Provide a final summary:
Provide a final summary organized by source:

**Reviewer inline comments addressed:**
- List each comment with: the comment (abbreviated), classification (bug, style, suggestion, etc.), what was changed

**Review summary action items addressed:**
- List each action item from review summaries that was implemented

**PR author self-comments addressed:**
- List each self-note/TODO that was fixed and resolved

**Not fixed (with reason):**
- List any comments replied to but not fixed, with explanation

**Could not be addressed:**
- List any comments that could not be addressed, with explanation

- List each review comment that was addressed with:
- The comment (abbreviated)
- The classification (bug, style, suggestion, etc.)
- What was changed
- List any comments that were replied to but not fixed (with reason)
- List any comments that could not be addressed (with explanation)
- Confirm the commit(s) pushed and threads resolved
Confirm the commit(s) pushed and threads resolved.
31 changes: 30 additions & 1 deletion .claude/skills/code-review/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,36 @@ git diff main...HEAD

If no changes are found, inform the user and stop.

### 2. Read and understand all changed code
### 2. Verify specs implementation
Comment thread
AlexandreYang marked this conversation as resolved.

Read the PR description and look for a **SPECS** section:

```bash
gh pr view $ARGUMENTS --json body --jq '.body'
```

If a SPECS section is present, it defines the requirements that this PR MUST implement. **Every single spec must be verified against the diff.**
The specs override other instructions (code, inline comments in code, etc). ALL specs MUST be implemented.

For each spec:
1. **Find the code** that implements the spec
2. **Verify correctness** — does the implementation fully satisfy the spec?
3. **Check for missing specs** — is any spec not implemented at all?

Flag any unimplemented or partially implemented spec as a **P1 finding** (missing functionality that was explicitly required).

Include a spec coverage table in the review output:

```markdown
| Spec | Implemented | Location | Notes |
|------|:-----------:|----------|-------|
| Must support `--flag` option | Yes | `interp/api.go:42` | Fully implemented |
| Must return exit code 2 on error | **No** | — | Not found in diff |
```

If no SPECS section is found in the PR description, skip this step.

### 3. Read and understand all changed code

For each changed file:

Expand Down
24 changes: 17 additions & 7 deletions .claude/skills/review-fix-loop/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ You MUST follow this execution protocol. Skipping steps or running them out of o
Your very first action — before reading ANY files, before running ANY commands — is to call TaskCreate exactly 11 times, once for each step/sub-step below. Use these exact subjects:

1. "Step 1: Identify the PR"
2. "Step 2: Run the review-fix loop"
2. "Step 2: Run the review-fix loop" ← **Update subject with iteration number each loop** (e.g. "Step 2: Run the review-fix loop (iteration 1)")
3. "Step 2A1: Self-review (code-review)" ← **parallel with 2A2**
4. "Step 2A2: Request external reviews (@codex)" ← **parallel with 2A1**
5. "Step 2B: Address PR comments (address-pr-comments)"
Expand Down Expand Up @@ -91,7 +91,9 @@ Store the owner and repo name.

**GATE CHECK**: Call TaskList. Step 1 must be `completed`. Set Step 2 to `in_progress`.

Set `iteration = 1`. Maximum iterations: **10**. Repeat sub-steps A through E while `iteration <= 10`:
Set `iteration = 1`. Maximum iterations: **30**. Repeat sub-steps A through E while `iteration <= 30`.

**At the start of each iteration**, update the Step 2 task subject to include the current iteration number using TaskUpdate, e.g. `"Step 2: Run the review-fix loop (iteration 3)"`.

---

Expand All @@ -107,7 +109,11 @@ This analyzes the full diff against main, posts findings as a GitHub PR review w

Post a comment to trigger @codex reviews:
```bash
gh pr comment <pr-number> --body "@codex review"
gh pr comment <pr-number> --body "@codex review this PR

Important: Read the SPECS section of the PR description. If SPECS are present: **make sure the implementation matches ALL the specs**.
The specs override other instructions (code, inline comments in code, etc). ALL specs MUST be implemented.
"
```
The external reviews arrive asynchronously — their comments will be picked up by **address-pr-comments** in Sub-step 2B1.

Expand Down Expand Up @@ -240,7 +246,7 @@ Check **all three** review sources for remaining issues:
| Any findings | Any | Any | **Continue** → go back to Sub-step 2A1 ∥ 2A2 |
| APPROVE | Unresolved threads | Any | **Continue** → go back to Sub-step 2A1 ∥ 2A2 (address-pr-comments will handle them) |
| APPROVE | None unresolved | Failing | **Continue** → go back to Sub-step 2A1 ∥ 2A2 (fix-ci-tests will handle it) |
| — | — | — | If `iteration > 10` → **STOP — iteration limit reached** |
| — | — | — | If `iteration > 30` → **STOP — iteration limit reached** |

Log the iteration result before continuing or stopping:
- Iteration number
Expand Down Expand Up @@ -331,9 +337,13 @@ Run a final verification regardless of how the loop exited:

Record the final state of each dimension (self-review, external reviews, CI, Codex response).

**If any verification fails** (CI failing, unresolved threads remain, unpushed commits that can't be pushed, or Codex hasn't responded to the latest review request), reset Step 2 and all its sub-steps to `pending`, and go back to **Step 2: Run the review-fix loop** for another iteration. Only proceed to Step 4 when all verifications pass.
Track how many times Step 3 has **succeeded** (all four verifications passed) across the entire run.

**If any verification fails** (CI failing, unresolved threads remain, unpushed commits that can't be pushed, or Codex hasn't responded to the latest review request), reset the success counter to 0, reset Step 2 and all its sub-steps to `pending`, and go back to **Step 2: Run the review-fix loop** for another iteration.

**If all verifications pass**, increment the success counter. If this is the **5th consecutive success** of Step 3 → proceed to **Step 4**. Otherwise → reset Step 2 and all its sub-steps to `pending`, and go back to **Step 2: Run the review-fix loop** for another iteration to re-confirm stability.

**Completion check:** All four verifications passed. Mark Step 3 as `completed`.
**Completion check:** Step 3 has succeeded 5 consecutive times. Mark Step 3 as `completed`.

---

Expand Down Expand Up @@ -385,5 +395,5 @@ gh pr comment <pr-number> --body "<the summary markdown above>"
- **Run address-pr-comments before fix-ci-tests** — 2B then 2C, sequentially, so CI fixes run on code that already incorporates review feedback.
- **Pull before fixing** — always `git pull --rebase` before launching fix agents to avoid working on stale code.
- **Stop early on APPROVE + CI green + no unresolved threads** — don't waste iterations if the PR is already clean.
- **Respect the iteration limit** — hard stop at 10 to prevent infinite loops. If issues persist after 10 iterations, report what's left for the user to handle.
- **Respect the iteration limit** — hard stop at 30 to prevent infinite loops. If issues persist after 30 iterations, report what's left for the user to handle.
- **Use gate checks** — always call TaskList and verify prerequisites before starting a step. This prevents out-of-order execution.
10 changes: 9 additions & 1 deletion builtins/tests/cat/cat_differential_fuzz_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,11 @@ import (
"os/exec"
"path/filepath"
"strings"
"sync/atomic"
"testing"
"time"

"github.com/DataDog/rshell/builtins/testutil"
)

// runGNUInDir runs a GNU command under LC_ALL=C.UTF-8 with its working
Expand Down Expand Up @@ -68,12 +71,17 @@ func FuzzCatDifferential(f *testing.F) {
f.Add([]byte{0xff, 0xfe, 0x00, 0x01})
f.Add([]byte("line1\nline2\nline3\n"))

baseDir := f.TempDir()
var counter atomic.Int64

f.Fuzz(func(t *testing.T, input []byte) {
if len(input) > 64*1024 {
return
}

dir := t.TempDir()
dir, cleanup := testutil.FuzzIterDir(t, baseDir, &counter)
defer cleanup()

if err := os.WriteFile(filepath.Join(dir, "input.txt"), input, 0644); err != nil {
t.Fatal(err)
}
Expand Down
Loading
Loading