feat(eval): pluggable benchmark harness with in-house coding-agent corpus #562
…rpus

Adds eval/ tree (outside files field so npm tarball stays thin) with Adapter interface, three reference adapters (grep / vector / agentmemory-hybrid), two benchmarks (LongMemEval _s public, coding-agent-life-v1 in-house 15 sessions), scoring (P@K, R@K, hit, top-gold-rank), NDJSON output, sandbox script.

coding-agent-life-v1 published scorecard at docs/benchmarks/2026-05-20-coding-agent-life-v1.md: agentmemory-hybrid R@5=0.967 P@5=0.578 (100% hit) vs grep R@5=0.967 P@5=0.267. 2.2x better precision on identical input, sandbox-reproducible.

Adapter contract:
  init(sessions, config) -> State
  query(q, state, k) -> RankedDoc[]

npm scripts:
  npm run eval:coding-life (no download, no API key for grep)
  npm run eval:longmemeval (needs OPENAI key + 278MB download)

eval/scripts/sandbox.sh boots clean agentmemory + iii-engine on ports 3411/3412 with isolated data dir; tears down on exit.

README headline updated. 1072/1072 tests pass + 5 new eval tests.
📝 Walkthrough
Adds an evaluation framework for retrieval: core types, three adapters (grep, vector, agentmemory), datasets and loader, two CLI runners, scoring+aggregation, sandbox script, tests, benchmark docs, and package scripts.
Changes
Agentmemory Evaluation Framework
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 13
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@eval/data/coding-agent-life-v1/sessions.json`:
- Line 40: The text in the session content incorrectly labels the migration as a
"two-phase migration" but then enumerates three phases; update the wording in
the string (the value of "content") to either "three-phase migration" or
"multi-phase migration" so it matches the listed phases (phase 1: add nullable
new column + dual-write; phase 2: backfill + flip reads; phase 3: drop old
column), ensuring the summary and the detailed phase list are consistent.
In `@eval/README.md`:
- Around line 31-34: Remove the unsafe rm "$(which iii)" removal and instead
install the pinned iii release into a user-local bin directory without deleting
existing binaries: ensure ~/.local/bin exists, extract the downloaded tarball
into that directory (the existing curl ... | tar -xz -C ~/.local/bin command is
fine) and instruct users to prepend ~/.local/bin to PATH (or add it to their
shell profile) so the pinned v0.11.2 binary is used; do not attempt to remove or
overwrite whatever which iii currently points to.
- Line 69: The README.md contains a fenced code block starting with ``` that
lacks a language tag (triggers markdownlint MD040); update that opening fence to
include a language identifier (for example change ``` to ```text or ```bash) so
the block is explicitly tagged, keeping the rest of the block content unchanged;
locate the fenced block in README.md (the tree-like snippet under the eval/
heading) and modify its opening fence accordingly.
In `@eval/runner/adapters/agentmemory.ts`:
- Around line 81-83: The code currently computes memId then looks up sessionId
via state.observationToSession (using memId) and ignores any direct
row.sessionId; change the logic in the block around memId/sessionId (variables
memId, sessionId and usage of state.observationToSession and seen) to prefer
row.sessionId if present: first set sessionId = row.sessionId (if defined),
otherwise derive memId = row.obsId ?? row.id ?? row.observationId and then set
sessionId = memId ? state.observationToSession.get(memId) : undefined; keep the
seen check and continue behavior the same so rows with an explicit sessionId are
not discarded.
In `@eval/runner/adapters/vector.ts`:
- Around line 40-46: In embedBatch, validate the embeddings response before
building the Float32Array array: ensure data.data length equals texts.length,
each item has a numeric index within [0, texts.length-1], indexes are unique,
and each embedding is a non-empty numeric array with expected dimensionality; if
any check fails, throw a descriptive error (including the offending
index/response snippet) instead of returning undefined entries so callers never
receive incomplete vectors. Use the function name embedBatch and the local
variables data, row.index, and row.embedding to locate and implement these
checks.
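The `embedBatch` checks requested above could look roughly like the sketch below. It is a minimal illustration, not the adapter's actual code: the `EmbeddingRow` interface and the `toVectors` helper are hypothetical names, and the row shape is assumed from the `row.index` and `row.embedding` fields mentioned in the comment.

```typescript
// Hypothetical row shape, assumed from the `row.index` / `row.embedding`
// fields named in the review comment.
interface EmbeddingRow {
  index: number;
  embedding: number[];
}

// Validate the response rows for `expectedCount` inputs and return the
// vectors ordered by input index. Throws instead of returning holes.
function toVectors(rows: EmbeddingRow[], expectedCount: number): Float32Array[] {
  if (rows.length !== expectedCount) {
    throw new Error(`embedBatch: expected ${expectedCount} rows, got ${rows.length}`);
  }
  const out = new Array<Float32Array | undefined>(expectedCount).fill(undefined);
  let dim = -1;
  for (const row of rows) {
    if (!Number.isInteger(row.index) || row.index < 0 || row.index >= expectedCount) {
      throw new Error(`embedBatch: index out of range: ${row.index}`);
    }
    if (out[row.index] !== undefined) {
      throw new Error(`embedBatch: duplicate index ${row.index}`);
    }
    const emb = row.embedding;
    if (!Array.isArray(emb) || emb.length === 0 || !emb.every((x) => Number.isFinite(x))) {
      throw new Error(`embedBatch: empty or non-numeric embedding at index ${row.index}`);
    }
    // All rows must share the dimensionality of the first row seen.
    if (dim === -1) dim = emb.length;
    else if (emb.length !== dim) {
      throw new Error(`embedBatch: dimension mismatch at index ${row.index}`);
    }
    out[row.index] = Float32Array.from(emb);
  }
  // Length, range, and uniqueness checks together guarantee every slot is filled.
  return out as Float32Array[];
}
```

Because the row count matches and every index is unique and in range, no slot can remain undefined by the time the function returns.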
In `@eval/runner/coding-life.ts`:
- Line 37: The code currently does const k = Number(opts.k); which allows NaN, 0
or negative values; update the validation after that line to ensure k is a
positive integer (e.g., if (!Number.isInteger(k) || k <= 0) throw new Error or
return with a clear message) so downstream metric calculations and adapter calls
always receive a valid k; reference the variable k and the opts.k source to
locate where to add this check.
- Around line 65-79: The code currently calls adapter.init(sessions) and then
iterates queries with adapter.query, but if any query throws the
adapter.teardown(state) call is skipped; wrap the query loop and related
per-adapter logic in a try/finally so that adapter.teardown(state) always runs:
after obtaining state from adapter.init(sessions) run the for-loop and all
processing inside a try block and invoke await adapter.teardown(state) inside
the finally (guarding existence with if (adapter.teardown)) to guarantee
resource cleanup even on exceptions.
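As a rough sketch of the try/finally shape described above: the `SketchAdapter` interface and `runQueries` function are hypothetical stand-ins for the runner's real types, and the async signatures are an assumption based on the `await adapter.teardown(state)` call quoted in the comment.

```typescript
// Hypothetical minimal adapter shape; `teardown` is optional, as in the comment.
interface SketchAdapter<S> {
  init(sessions: string[]): Promise<S>;
  query(q: string, state: S, k: number): Promise<string[]>;
  teardown?(state: S): Promise<void>;
}

async function runQueries<S>(
  adapter: SketchAdapter<S>,
  sessions: string[],
  queries: string[],
  k: number,
): Promise<string[][]> {
  const state = await adapter.init(sessions);
  const results: string[][] = [];
  try {
    for (const q of queries) {
      results.push(await adapter.query(q, state, k));
    }
  } finally {
    // Runs whether the loop completed or a query threw.
    if (adapter.teardown) await adapter.teardown(state);
  }
  return results;
}
```

The error from a failing query still propagates to the caller; the `finally` block only guarantees that cleanup happens first.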
In `@eval/runner/load.ts`:
- Around line 23-26: The mapping silently injects empty content when
haystack_session_ids and haystack_sessions lengths differ; add an explicit
structural check before building haystack (compare r.haystack_session_ids.length
to r.haystack_sessions.length) and fail fast—throw or return an error with a
clear message that includes both lengths (and any relevant request id) instead
of using the null-coalescing fallback; then only construct haystack using the
validated arrays and continue to call flattenSession for each corresponding
element.
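One way to express the fail-fast length check, as a sketch only: `RawRecord` and `pairHaystack` are hypothetical names, and the `question_id` field is an assumed identifier used purely for the error message.

```typescript
// Hypothetical record shape using the two field names from the comment.
interface RawRecord {
  question_id: string; // assumed identifier, only used in the error message
  haystack_session_ids: string[];
  haystack_sessions: unknown[];
}

// Pair each session id with its raw session, refusing mismatched arrays
// instead of silently substituting empty content.
function pairHaystack(r: RawRecord): Array<{ id: string; raw: unknown }> {
  const ids = r.haystack_session_ids;
  const sessions = r.haystack_sessions;
  if (ids.length !== sessions.length) {
    throw new Error(
      `record ${r.question_id}: ${ids.length} session ids vs ${sessions.length} sessions`,
    );
  }
  return ids.map((id, i) => ({ id, raw: sessions[i] }));
}
```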
In `@eval/runner/longmemeval.ts`:
- Around line 46-48: The variables k, limit, and perType are parsed from opts
without validation and can become NaN or non-positive; update the parsing in
longmemeval.ts to explicitly parse and validate these CLI options (e.g., use
parseInt/Number, then check Number.isFinite(...) and value > 0 for k and
perType, and value > 0 if limit must be positive) and handle invalid values by
either throwing a clear Error or falling back to a safe default; locate the
parsing site where k, limit, and perType are assigned and replace it with
validated parsing that rejects NaN/non-positive inputs and surfaces a clear
message referencing opts.k, opts.limit and opts.stratify (or sets a documented
default) so downstream code using k, limit, and perType never receives invalid
numbers.
- Around line 74-80: Wrap per-question work in a try/finally so
adapter.teardown(state) always runs: after obtaining state from
adapter.init(q.haystack) call adapter.query(...) and the subsequent
scoring/appendFileSync/rows.push inside a try block, and call
adapter.teardown(state) in the finally block (guarding that state is defined).
Update the logic around adapter.init, adapter.query, scoreQuestion,
appendFileSync(ndjsonPath, ...), rows.push(row) and adapter.teardown to ensure
teardown executes even if adapter.query or scoreQuestion throws.
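The option validation requested for `k`, `limit`, and `perType` can be factored into one small helper. The sketch below assumes every one of these options must be a positive integer when supplied; `parsePositiveInt` is a hypothetical name, not a function from the PR.

```typescript
// Parse a CLI option value and reject NaN, zero, negatives, and fractions.
function parsePositiveInt(flag: string, raw: unknown): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`--${flag} must be a positive integer, got ${String(raw)}`);
  }
  return n;
}

// Usage with hypothetical option values:
const kValue = parsePositiveInt("k", "5");
console.log(kValue); // prints 5
```

An optional flag such as `--limit` would be checked only when present, so that leaving it unset keeps whatever default the runner documents.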
In `@eval/runner/score.ts`:
- Line 13: The precision calculation uses topK.length as the denominator which
inflates P@K when fewer than k results are returned; change the computation in
precisionAtK to divide by k (the requested cutoff) instead of topK.length, while
still handling k === 0 safely (return 0 when k is 0). Update the expression that
sets precisionAtK (referencing precisionAtK, topK, k, hits) so hits / k is used
when k > 0, otherwise 0.
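To make the denominator change concrete, here is a minimal sketch of P@K with `k` as the denominator. The function signature is an assumption for illustration and is not copied from `eval/runner/score.ts`.

```typescript
// Precision at k: hits among the first k results, divided by the requested k.
function precisionAtK(ranked: string[], gold: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

// An adapter that returns only two results, both relevant, with k = 5:
console.log(precisionAtK(["a", "b"], new Set(["a", "b"]), 5)); // prints 0.4
```

Dividing by `topK.length` would score the same case as 1.0, which rewards an adapter for returning fewer results.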
In `@eval/scripts/sandbox.sh`:
- Line 31: The script unguardedly runs rm -rf "$SANDBOX_ROOT" which can
catastrophically delete important paths; before that command, validate
SANDBOX_ROOT by ensuring it is non-empty, not "/" (and not any other critical
root like ""), and resolves to an expected sandbox subtree (e.g. check realpath
"$SANDBOX_ROOT" and assert it starts with a known safe prefix or contains a
specific marker), and fail fast if the check fails; update the code around the
rm -rf invocation to perform these checks on the SANDBOX_ROOT variable and only
run rm -rf when the validation passes.
In `@test/eval-adapters.test.ts`:
- Around line 8-10: The test uses __dirname to build DATA_DIR which breaks in
ESM; update the path resolution to use import.meta.url (or
fileURLToPath(import.meta.url) + path.dirname) to compute DATA_DIR and keep JSON
reads for sessions and queries working; locate the DATA_DIR, sessions and
queries declarations in the file (symbols: DATA_DIR, sessions, queries) and
replace the __dirname-based resolution with an ESM-safe resolution using
import.meta.url so the module loads under ESM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8f441b2c-7e82-40a3-8477-55691b317757
📒 Files selected for processing (18)
- .gitignore
- README.md
- docs/benchmarks/2026-05-20-coding-agent-life-v1.md
- docs/benchmarks/TEMPLATE.md
- eval/README.md
- eval/data/coding-agent-life-v1/queries.json
- eval/data/coding-agent-life-v1/sessions.json
- eval/runner/adapters/agentmemory.ts
- eval/runner/adapters/grep.ts
- eval/runner/adapters/vector.ts
- eval/runner/coding-life.ts
- eval/runner/load.ts
- eval/runner/longmemeval.ts
- eval/runner/score.ts
- eval/runner/types.ts
- eval/scripts/sandbox.sh
- package.json
- test/eval-adapters.test.ts
const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify ESM mode and locate __dirname usage in TS tests.
set -euo pipefail
echo "Checking package.json module type:"
cat package.json | jq -r '.type // "<missing>"'
echo
echo "Searching for __dirname usage in test/*.test.ts files:"
rg -nP --type=ts -C2 '\b__dirname\b' test
Repository: rohitg00/agentmemory
Length of output: 1160
Replace __dirname usage with ESM-safe path resolution.
__dirname is undefined in ESM and will fail this suite at module load time.
Proposed fix
import { readFileSync } from "node:fs";
-import { resolve } from "node:path";
+import { dirname, resolve } from "node:path";
+import { fileURLToPath } from "node:url";
import { grepAdapter } from "../eval/runner/adapters/grep.js";
import { aggregate, scoreQuestion } from "../eval/runner/score.js";
import type { Question, Session } from "../eval/runner/types.js";
-const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = dirname(__filename);
+const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
- agentmemory adapter: prefer row.sessionId before observationToSession lookup
- vector adapter: validate embedBatch response (length, indexes, non-empty rows)
- coding-life: positive-int guard on --k; wrap query loop in try/finally so teardown runs
- longmemeval: positive-int guards on --k/--limit/--stratify; per-question try/finally
- load: throw on haystack_session_ids vs haystack_sessions length mismatch
- score: P@K denominator is k (requested cutoff) not topK.length
- sandbox.sh: guard rm -rf with non-empty + /tmp/ prefix check
- README: drop unsafe rm "$(which iii)"; instruct ~/.local/bin + PATH instead; add language tag to repo-layout fenced block
- sessions.json: fix "two-phase" -> "three-phase" wording mismatch
♻️ Duplicate comments (1)
eval/scripts/sandbox.sh (1)
31-35: ⚠️ Potential issue | 🔴 Critical
Canonicalize `SANDBOX_ROOT` before the wipe check.
The raw prefix check still accepts traversal values like `/tmp/../etc`, so `rm -rf "$SANDBOX_ROOT"` can still delete outside `/tmp`.
Suggested fix
-if [[ -z "${SANDBOX_ROOT:-}" || "$SANDBOX_ROOT" == "/" || "$SANDBOX_ROOT" != /tmp/* ]]; then
-  echo "refusing to wipe SANDBOX_ROOT='$SANDBOX_ROOT' — must be non-empty and under /tmp/" >&2
+SANDBOX_ROOT_REAL="$(python3 -c 'import os,sys; print(os.path.realpath(sys.argv[1]))' "$SANDBOX_ROOT")"
+if [[ -z "${SANDBOX_ROOT:-}" || "$SANDBOX_ROOT_REAL" == "/" || "$SANDBOX_ROOT_REAL" != /tmp/* ]]; then
+  echo "refusing to wipe SANDBOX_ROOT='$SANDBOX_ROOT' (resolved: '$SANDBOX_ROOT_REAL') — must resolve under /tmp/" >&2
   exit 1
 fi
-rm -rf "$SANDBOX_ROOT"
+rm -rf "$SANDBOX_ROOT_REAL"
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@eval/scripts/sandbox.sh`:
- Around line 31-35: Canonicalize SANDBOX_ROOT before validating and deleting:
resolve SANDBOX_ROOT to an absolute canonical path (e.g., using realpath -m or
readlink -f) into a new variable (like CANON_SANDBOX_ROOT), then perform the
existing non-empty/root and prefix check against CANON_SANDBOX_ROOT (ensure it
starts with /tmp/), and finally run rm -rf on the canonicalized variable instead
of the raw SANDBOX_ROOT; reference the SANDBOX_ROOT variable and replace its use
in the conditional and rm -rf with the canonicalized name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1be9230c-bb00-489a-923d-c49d19cd9abe
📒 Files selected for processing (10)
- eval/README.md
- eval/data/coding-agent-life-v1/sessions.json
- eval/runner/adapters/agentmemory.ts
- eval/runner/adapters/vector.ts
- eval/runner/coding-life.ts
- eval/runner/load.ts
- eval/runner/longmemeval.ts
- eval/runner/score.ts
- eval/scripts/sandbox.sh
- test/eval-adapters.test.ts
✅ Files skipped from review due to trivial changes (2)
- eval/data/coding-agent-life-v1/sessions.json
- eval/README.md
…uard (#588)

Three concerns surfaced when auditing PRs merged since v0.9.21:

1) Graph parser regex from #494 was correct for self-closing tags ONLY when they're the only entity in the document. With a self-closing entity followed by an explicit-close entity, the greedy `[^>]*` ate the trailing `/` of the self-close, the alternation fell through to the explicit-close branch, and `[\s\S]*?` ran ahead to the *next* `</entity>`, merging two entity declarations into one match and silently dropping a node. Switch to lazy `[^>]*?` so the self-closing alternation gets first chance. Test from #494 (which was failing on main but I didn't notice) now passes.

2) #386 shipped `${AGENTMEMORY_URL}` / `${AGENTMEMORY_SECRET}` placeholders in plugin/.mcp.json. Claude Code and Cursor expand those at MCP launch time; some hosts pass the literal string through. A truthy literal `"${AGENTMEMORY_URL}"` defeats the existing `|| DEFAULT_URL` fallback and would have the shim POST to `${AGENTMEMORY_URL}/agentmemory/...` (DNS failure). Add a defensive guard in src/mcp/rest-proxy.ts that strips any value of the form `${...}` so the fallback engages. Mirror in src/mcp/standalone.ts's mode-announce log line.

3) CHANGELOG entries for all PRs landed since v0.9.21 (#321, #325, #386, #454, #494, #526, #542, #560, #561, #562, #564, #567) plus the regex + env-guard hardening here.

Validation:
- npm test → 98 files, 1091 tests pass
- npm pack → 142 files, fresh install clean, iii-sdk@0.11.2 pinned, plugin/ shipped with hooks/scripts/skills/opencode/
- New test file covers 8 placeholder cases (unset, empty, `${VAR}` literal, mismatched braces, real values with $, unclosed `${`, no-braces `$VAR`).
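The greedy versus lazy behaviour described in the first concern can be reproduced in isolation. The two patterns below are hypothetical reconstructions written only to show the effect; the parser's actual regex is not quoted in this commit message.

```typescript
// A self-closing entity followed by an explicit-close entity.
const doc = '<entity name="a"/><entity name="b">text</entity>';

// Hypothetical patterns, differing only in the quantifier after `<entity`.
const greedy = /<entity\b[^>]*(?:\/>|>[\s\S]*?<\/entity>)/g;
const lazy = /<entity\b[^>]*?(?:\/>|>[\s\S]*?<\/entity>)/g;

// Greedy: `[^>]*` consumes the `/`, so only the `>` branch can match, and it
// runs on to the next `</entity>`, merging both entities into one match.
const greedyMatches = doc.match(greedy) ?? [];

// Lazy: the `/>` branch is tried at each step, so the self-closing tag
// ends the first match and the second entity is matched separately.
const lazyMatches = doc.match(lazy) ?? [];

console.log(greedyMatches.length, lazyMatches.length); // prints 1 2
```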
Summary
Adds an `eval/` tree with a pluggable Adapter contract, three reference adapters, two benchmarks, scoring math, and a sandbox script. Stays outside the `files` field so the npm tarball is unchanged.

What ships
- Adapter contract (`eval/runner/types.ts`): `init(sessions, config) -> State`, `query(q, state, k) -> RankedDoc[]`
- Three reference adapters:
  - `grep`: tokenized substring baseline, zero deps
  - `vector`: OpenAI `text-embedding-3-small` + cosine (`OPENAI_API_KEY`)
  - `agentmemory-hybrid`: `POST /agentmemory/{remember,smart-search}` (full production stack)
- Two benchmarks:
  - `coding-agent-life-v1`: committed 15-session fictional Rust CLI (`shipctl`) project with 15 hand-graded queries covering single-session, multi-session causal, preference, temporal
  - LongMemEval `_s`: public 500-Q benchmark, dataset fetched on demand (278MB)
- Scoring (`eval/runner/score.ts`): P@K, R@K, hit, top-gold-rank, NDJSON per-query rows + `summary.json` per-adapter and per-type aggregates
- Sandbox (`eval/scripts/sandbox.sh`): boots clean agentmemory + iii-engine on ports 3411/3412 with isolated data dir at `/tmp/agentmemory-eval-sandbox/`; tears down on exit. Required because running against a real `~/.agentmemory` pollutes both the eval and the user's real store.

Published scorecard
`docs/benchmarks/2026-05-20-coding-agent-life-v1.md`: first published scorecard, sandbox-reproducible.
`agentmemory-hybrid`: 100% top-5 hit rate. 2.2x better precision than the grep baseline on identical input. No LLM in the retrieval loop.

npm scripts
- `npm run eval:coding-life` (no download, no API key for grep)
- `npm run eval:longmemeval` (needs OpenAI key + 278MB download)

Reproduce locally
    source eval/scripts/sandbox.sh
    npm run eval:coding-life -- --adapters grep,agentmemory
Outputs land in `eval/reports/coding-life/`: `scores.ndjson` + `summary.json`.

Test plan
- `npm test`: 1072 / 1072 pass (96 files), including 5 new tests for the eval scaffold
- `npm run build`: 12 files, 2398 KB, no regression in bundle size
- `agentmemory-hybrid` hits 15/15 with isolated data dir
- `eval/` excluded from npm tarball (`files` field unchanged)
- `eval/data/coding-agent-life-v1/` whitelisted past `.gitignore`'s `data/` rule
- … (`gbrain`, `brainbench`, etc: clean grep)

Follow-up
- … LongMemEval `_s` numbers in main README with the new harness (needs `OPENAI_API_KEY` + 278MB dataset + ~$2 spend)
- … `coding-agent-life-v1` to 50 sessions with paraphrased queries + planted distractors
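For orientation, the Adapter contract named in this description (`init(sessions, config) -> State`, `query(q, state, k) -> RankedDoc[]`) might look roughly like the sketch below. The field names in `EvalSession` and `RankedDoc`, the async signatures, and the `toyGrep` adapter are all assumptions for illustration; the real definitions live in `eval/runner/types.ts` and may differ.

```typescript
// Assumed shapes; the real types in eval/runner/types.ts may differ.
interface EvalSession { id: string; content: string }
interface RankedDoc { sessionId: string; score: number }

interface EvalAdapter<State> {
  label: string;
  init(sessions: EvalSession[], config?: Record<string, unknown>): Promise<State>;
  query(q: string, state: State, k: number): Promise<RankedDoc[]>;
}

// Toy adapter: scores each session by how many query tokens it contains
// as substrings. A stand-in for illustration, not the PR's grep adapter.
const toyGrep: EvalAdapter<EvalSession[]> = {
  label: "toy-grep",
  async init(sessions) {
    return sessions;
  },
  async query(q, state, k) {
    const tokens = q.toLowerCase().split(/\W+/).filter(Boolean);
    return state
      .map((s) => ({
        sessionId: s.id,
        score: tokens.filter((t) => s.content.toLowerCase().includes(t)).length,
      }))
      .filter((d) => d.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  },
};
```

Keeping `State` opaque to the runner lets each adapter hold whatever it needs (raw sessions here, embeddings or a server handle elsewhere) while the runner only passes it back into `query`.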