feat(eval): pluggable benchmark harness with in-house coding-agent corpus by rohitg00 · Pull Request #562 · rohitg00/agentmemory

rohitg00 · 2026-05-20T11:54:23Z

Summary

Adds an eval/ tree with a pluggable Adapter contract, three reference adapters, two benchmarks, scoring math, and a sandbox script. Stays outside files field so the npm tarball is unchanged.

What ships

Adapter interface (eval/runner/types.ts):
- init(sessions, config) -> State
- query(q, state, k) -> RankedDoc[]
Three adapters:
- grep — tokenized substring baseline, zero deps
- vector — OpenAI text-embedding-3-small + cosine (OPENAI_API_KEY)
- agentmemory-hybrid — POST /agentmemory/{remember,smart-search} (full production stack)
Two benchmarks:
- coding-agent-life-v1 — committed 15-session fictional Rust CLI (shipctl) project with 15 hand-graded queries covering single-session, multi-session causal, preference, temporal
- LongMemEval _s — public 500-Q benchmark, dataset fetched on demand (278MB)
Scoring (eval/runner/score.ts): P@K, R@K, hit, top-gold-rank, NDJSON per-query rows + summary.json per-adapter and per-type aggregates
Sandbox script (eval/scripts/sandbox.sh): boots clean agentmemory + iii-engine on ports 3411/3412 with isolated data dir at /tmp/agentmemory-eval-sandbox/; tears down on exit. Required because running against a real ~/.agentmemory pollutes both the eval and the user's real store.

Published scorecard

docs/benchmarks/2026-05-20-coding-agent-life-v1.md — first published scorecard, sandbox-reproducible:

Adapter	P@5	R@5	Top-5 hit rate	p50 latency
`agentmemory-hybrid`	0.578	0.967	15 / 15	14 ms
grep baseline	0.267	0.967	15 / 15	0 ms

100% top-5 hit rate. 2.2x better precision than the grep baseline on identical input. No LLM in the retrieval loop.

npm scripts

npm run eval:coding-life     # in-house, no download
npm run eval:longmemeval     # public, needs OPENAI_API_KEY + dataset

Reproduce locally

source eval/scripts/sandbox.sh
npm run eval:coding-life -- --adapters grep,agentmemory

Outputs land in eval/reports/coding-life/: scores.ndjson + summary.json.

Test plan

npm test — 1072 / 1072 pass (96 files), including 5 new tests for the eval scaffold
npm run build — 12 files, 2398 KB, no regression in bundle size
Sandbox run on macOS Apple Silicon — agentmemory-hybrid hits 15/15 with isolated data dir
Polluted-vs-clean comparison confirms sandbox is required (precision recovers, recall hits ceiling)
eval/ excluded from npm tarball (files field unchanged)
eval/data/coding-agent-life-v1/ whitelisted past .gitignore's data/ rule
No external project names in diff (gbrain, brainbench, etc — clean grep)

Follow-up

Refresh LongMemEval _s numbers in main README with the new harness (needs OPENAI_API_KEY + 278MB dataset + ~$2 spend)
Add more adapters (BM25-only baseline, embedding-only baseline) to isolate which agentmemory layer is load-bearing
Expand coding-agent-life-v1 to 50 sessions with paraphrased queries + planted distractors

Summary by CodeRabbit

New Features
- Added an evaluation framework to compare grep, vector, and hybrid retrieval adapters and record per-adapter metrics.
Documentation
- Published benchmark reports and a reproducible template; added local reproduction and sandbox instructions.
Data
- Included a new in-house coding-agent-life-v1 dataset of sessions and queries.
Tests
- Added tests validating adapter behavior and scoring/aggregation logic.
Chores
- Updated ignore rules to exclude transient eval artifacts.

…rpus Adds eval/ tree (outside files field so npm tarball stays thin) with Adapter interface, three reference adapters (grep / vector / agentmemory-hybrid), two benchmarks (LongMemEval _s public, coding-agent-life-v1 in-house 15 sessions), scoring (P@K, R@K, hit, top-gold-rank), NDJSON output, sandbox script. coding-agent-life-v1 published scorecard at docs/benchmarks/2026-05-20-coding-agent-life-v1.md: agentmemory-hybrid R@5=0.967 P@5=0.578 (100% hit) vs grep R@5=0.967 P@5=0.267. 2.2x better precision on identical input, sandbox-reproducible. Adapter contract: init(sessions, config) -> State; query(q, state, k) -> RankedDoc[] npm scripts: npm run eval:coding-life (no download, no API key for grep) npm run eval:longmemeval (needs OPENAI key + 278MB download) eval/scripts/sandbox.sh boots clean agentmemory + iii-engine on ports 3411/3412 with isolated data dir; tears down on exit. README headline updated. 1072/1072 tests pass + 5 new eval tests.

vercel · 2026-05-20T11:54:26Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentmemory	Ready	Preview, Comment	May 20, 2026 1:06pm

coderabbitai · 2026-05-20T11:54:37Z

📝 Walkthrough

Walkthrough

Adds an evaluation framework for retrieval: core types, three adapters (grep, vector, agentmemory), datasets and loader, two CLI runners, scoring+aggregation, sandbox script, tests, benchmark docs, and package scripts.

Changes

Agentmemory Evaluation Framework

Layer / File(s)	Summary
Core evaluation type contracts `eval/runner/types.ts`	Defines `Session`, `Question`, `RankedDoc`, generic `Adapter<State>`, and `ScoreRow` interfaces used by adapters and runners.
Baseline retrieval adapters `eval/runner/adapters/grep.ts`, `eval/runner/adapters/vector.ts`	`grep` tokenizes queries and ranks by matched term counts. `vector` uses OpenAI embeddings, provides embed helpers, and ranks by cosine similarity.
Agentmemory API adapter `eval/runner/adapters/agentmemory.ts`	Registers sessions via `/agentmemory/remember` and queries `/agentmemory/smart-search`, mapping observation IDs back to sessions and deduplicating results.
Evaluation datasets and loader `eval/data/coding-agent-life-v1/queries.json`, `eval/data/coding-agent-life-v1/sessions.json`, `eval/runner/load.ts`	Adds `coding-agent-life-v1` corpus (15 sessions/queries) and `loadLongMemEval`/`stratifySample` to build `Question[]` with flattened session haystacks.
Scoring and aggregation `eval/runner/score.ts`	Implements `scoreQuestion` (precision@k/recall@k/hit/topGoldRank) and `aggregate` (per-adapter and per-type averages and latency P50).
CLI evaluation runners `eval/runner/coding-life.ts`, `eval/runner/longmemeval.ts`	`coding-life` runs adapters on the in-house corpus; `longmemeval` runs long-context datasets with optional stratification; both write NDJSON rows and summary JSON.
Sandbox environment `eval/scripts/sandbox.sh`	Bash helper to provision an isolated `iii` sandbox, start agentmemory, poll readiness, and export `AGENTMEMORY_BASE_URL`.
Evaluation tests `test/eval-adapters.test.ts`	Vitest suite verifying dataset integrity, grep top-5 hit-rate, scoring unit tests, and aggregation correctness.
Documentation and configuration `.gitignore`, `eval/README.md`, `docs/benchmarks/2026-05-20-coding-agent-life-v1.md`, `docs/benchmarks/TEMPLATE.md`, `README.md`, `package.json`	Adds eval README with quickstart and adapter guidance, benchmark template and sample report, `.gitignore` rules for eval artifacts, README benchmark entries, and npm scripts for evaluation runs.

Estimated code review effort:
🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs:

rohitg00/agentmemory#106: Updates README benchmark/retrieval methodology and reported recall metric descriptions.

"I'm a rabbit with a tiny pen, I hopped through logs and back again,
I stitched adapters, tests, and docs with care 🐇,
Sessions, queries, scores in tow—benchmarks ready to glow,
Sandbox up, runners run, now metrics bloom under the sun!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and specifically describes the main change: introducing a pluggable benchmark harness with an in-house coding-agent corpus for evaluation purposes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch eval/scaffold-longmemeval

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@eval/data/coding-agent-life-v1/sessions.json`:
- Line 40: The text in the session content incorrectly labels the migration as a
"two-phase migration" but then enumerates three phases; update the wording in
the string (the value of "content") to either "three-phase migration" or
"multi-phase migration" so it matches the listed phases (phase 1: add nullable
new column + dual-write; phase 2: backfill + flip reads; phase 3: drop old
column), ensuring the summary and the detailed phase list are consistent.

In `@eval/README.md`:
- Around line 31-34: Remove the unsafe rm "$(which iii)" removal and instead
install the pinned iii release into a user-local bin directory without deleting
existing binaries: ensure ~/.local/bin exists, extract the downloaded tarball
into that directory (the existing curl ... | tar -xz -C ~/.local/bin command is
fine) and instruct users to prepend ~/.local/bin to PATH (or add it to their
shell profile) so the pinned v0.11.2 binary is used; do not attempt to remove or
overwrite whatever which iii currently points to.
- Line 69: The README.md contains a fenced code block starting with ``` that
lacks a language tag (triggers markdownlint MD040); update that opening fence to
include a language identifier (for example change ``` to ```text or ```bash) so
the block is explicitly tagged, keeping the rest of the block content unchanged;
locate the fenced block in README.md (the tree-like snippet under the eval/
heading) and modify its opening fence accordingly.

In `@eval/runner/adapters/agentmemory.ts`:
- Around line 81-83: The code currently computes memId then looks up sessionId
via state.observationToSession (using memId) and ignores any direct
row.sessionId; change the logic in the block around memId/sessionId (variables
memId, sessionId and usage of state.observationToSession and seen) to prefer
row.sessionId if present: first set sessionId = row.sessionId (if defined),
otherwise derive memId = row.obsId ?? row.id ?? row.observationId and then set
sessionId = memId ? state.observationToSession.get(memId) : undefined; keep the
seen check and continue behavior the same so rows with an explicit sessionId are
not discarded.

In `@eval/runner/adapters/vector.ts`:
- Around line 40-46: In embedBatch, validate the embeddings response before
building the Float32Array array: ensure data.data length equals texts.length,
each item has a numeric index within [0, texts.length-1], indexes are unique,
and each embedding is a non-empty numeric array with expected dimensionality; if
any check fails, throw a descriptive error (including the offending
index/response snippet) instead of returning undefined entries so callers never
receive incomplete vectors. Use the function name embedBatch and the local
variables data, row.index, and row.embedding to locate and implement these
checks.

In `@eval/runner/coding-life.ts`:
- Line 37: The code currently does const k = Number(opts.k); which allows NaN, 0
or negative values; update the validation after that line to ensure k is a
positive integer (e.g., if (!Number.isInteger(k) || k <= 0) throw new Error or
return with a clear message) so downstream metric calculations and adapter calls
always receive a valid k; reference the variable k and the opts.k source to
locate where to add this check.
- Around line 65-79: The code currently calls adapter.init(sessions) and then
iterates queries with adapter.query, but if any query throws the
adapter.teardown(state) call is skipped; wrap the query loop and related
per-adapter logic in a try/finally so that adapter.teardown(state) always runs:
after obtaining state from adapter.init(sessions) run the for-loop and all
processing inside a try block and invoke await adapter.teardown(state) inside
the finally (guarding existence with if (adapter.teardown)) to guarantee
resource cleanup even on exceptions.

In `@eval/runner/load.ts`:
- Around line 23-26: The mapping silently injects empty content when
haystack_session_ids and haystack_sessions lengths differ; add an explicit
structural check before building haystack (compare r.haystack_session_ids.length
to r.haystack_sessions.length) and fail fast—throw or return an error with a
clear message that includes both lengths (and any relevant request id) instead
of using the null-coalescing fallback; then only construct haystack using the
validated arrays and continue to call flattenSession for each corresponding
element.

In `@eval/runner/longmemeval.ts`:
- Around line 46-48: The variables k, limit, and perType are parsed from opts
without validation and can become NaN or non-positive; update the parsing in
longmemeval.ts to explicitly parse and validate these CLI options (e.g., use
parseInt/Number, then check Number.isFinite(...) and value > 0 for k and
perType, and value > 0 if limit must be positive) and handle invalid values by
either throwing a clear Error or falling back to a safe default; locate the
parsing site where k, limit, and perType are assigned and replace it with
validated parsing that rejects NaN/non-positive inputs and surfaces a clear
message referencing opts.k, opts.limit and opts.stratify (or sets a documented
default) so downstream code using k, limit, and perType never receives invalid
numbers.
- Around line 74-80: Wrap per-question work in a try/finally so
adapter.teardown(state) always runs: after obtaining state from
adapter.init(q.haystack) call adapter.query(...) and the subsequent
scoring/appendFileSync/rows.push inside a try block, and call
adapter.teardown(state) in the finally block (guarding that state is defined).
Update the logic around adapter.init, adapter.query, scoreQuestion,
appendFileSync(ndjsonPath, ...), rows.push(row) and adapter.teardown to ensure
teardown executes even if adapter.query or scoreQuestion throws.

In `@eval/runner/score.ts`:
- Line 13: The precision calculation uses topK.length as the denominator which
inflates P@K when fewer than k results are returned; change the computation in
precisionAtK to divide by k (the requested cutoff) instead of topK.length, while
still handling k === 0 safely (return 0 when k is 0). Update the expression that
sets precisionAtK (referencing precisionAtK, topK, k, hits) so hits / k is used
when k > 0, otherwise 0.

In `@eval/scripts/sandbox.sh`:
- Line 31: The script unguardedly runs rm -rf "$SANDBOX_ROOT" which can
catastrophically delete important paths; before that command, validate
SANDBOX_ROOT by ensuring it is non-empty, not "/" (and not any other critical
root like ""), and resolves to an expected sandbox subtree (e.g. check realpath
"$SANDBOX_ROOT" and assert it starts with a known safe prefix or contains a
specific marker), and fail fast if the check fails; update the code around the
rm -rf invocation to perform these checks on the SANDBOX_ROOT variable and only
run rm -rf when the validation passes.

In `@test/eval-adapters.test.ts`:
- Around line 8-10: The test uses __dirname to build DATA_DIR which breaks in
ESM; update the path resolution to use import.meta.url (or
fileURLToPath(import.meta.url) + path.dirname) to compute DATA_DIR and keep JSON
reads for sessions and queries working; locate the DATA_DIR, sessions and
queries declarations in the file (symbols: DATA_DIR, sessions, queries) and
replace the __dirname-based resolution with an ESM-safe resolution using
import.meta.url so the module loads under ESM.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f441b2c-7e82-40a3-8477-55691b317757

📥 Commits

Reviewing files that changed from the base of the PR and between e9dc710 and d32abb2.

📒 Files selected for processing (18)

.gitignore
README.md
docs/benchmarks/2026-05-20-coding-agent-life-v1.md
docs/benchmarks/TEMPLATE.md
eval/README.md
eval/data/coding-agent-life-v1/queries.json
eval/data/coding-agent-life-v1/sessions.json
eval/runner/adapters/agentmemory.ts
eval/runner/adapters/grep.ts
eval/runner/adapters/vector.ts
eval/runner/coding-life.ts
eval/runner/load.ts
eval/runner/longmemeval.ts
eval/runner/score.ts
eval/runner/types.ts
eval/scripts/sandbox.sh
package.json
test/eval-adapters.test.ts

coderabbitai · 2026-05-20T11:59:07Z

+const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
+const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
+const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify ESM mode and locate __dirname usage in TS tests. set -euo pipefail echo "Checking package.json module type:" cat package.json | jq -r '.type // "<missing>"' echo echo "Searching for __dirname usage in test/*.test.ts files:" rg -nP --type=ts -C2 '\b__dirname\b' test

Repository: rohitg00/agentmemory

Length of output: 1160

Replace __dirname usage with ESM-safe path resolution.

__dirname is undefined in ESM and will fail this suite at module load time.

Proposed fix

import { readFileSync } from "node:fs"; -import { resolve } from "node:path"; +import { dirname, resolve } from "node:path"; +import { fileURLToPath } from "node:url"; import { grepAdapter } from "../eval/runner/adapters/grep.js"; import { aggregate, scoreQuestion } from "../eval/runner/score.js"; import type { Question, Session } from "../eval/runner/types.js"; -const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1"); +const __filename = fileURLToPath(import.meta.url); +const __dirname = dirname(__filename); +const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");

const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];

const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<

import { readFileSync } from "node:fs";

import { dirname, resolve } from "node:path";

import { fileURLToPath } from "node:url";

import { grepAdapter } from "../eval/runner/adapters/grep.js";

import { aggregate, scoreQuestion } from "../eval/runner/score.js";

import type { Question, Session } from "../eval/runner/types.js";

const __filename = fileURLToPath(import.meta.url);

const __dirname = dirname(__filename);

const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");

const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];

const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/eval-adapters.test.ts` around lines 8 - 10, The test uses __dirname to build DATA_DIR which breaks in ESM; update the path resolution to use import.meta.url (or fileURLToPath(import.meta.url) + path.dirname) to compute DATA_DIR and keep JSON reads for sessions and queries working; locate the DATA_DIR, sessions and queries declarations in the file (symbols: DATA_DIR, sessions, queries) and replace the __dirname-based resolution with an ESM-safe resolution using import.meta.url so the module loads under ESM.

- agentmemory adapter: prefer row.sessionId before observationToSession lookup - vector adapter: validate embedBatch response (length, indexes, non-empty rows) - coding-life: positive-int guard on --k; wrap query loop in try/finally so teardown runs - longmemeval: positive-int guards on --k/--limit/--stratify; per-question try/finally - load: throw on haystack_session_ids vs haystack_sessions length mismatch - score: P@K denominator is k (requested cutoff) not topK.length - sandbox.sh: guard rm -rf with non-empty + /tmp/ prefix check - README: drop unsafe rm "$(which iii)"; instruct ~/.local/bin + PATH instead; add language tag to repo-layout fenced block - sessions.json: fix "two-phase" -> "three-phase" wording mismatch

coderabbitai

♻️ Duplicate comments (1)

eval/scripts/sandbox.sh (1)

31-35: ⚠️ Potential issue | 🔴 Critical

Canonicalize SANDBOX_ROOT before the wipe check.

The raw prefix check still accepts traversal values like /tmp/../etc, so rm -rf "$SANDBOX_ROOT" can still delete outside /tmp.

Suggested fix

-if [[ -z "${SANDBOX_ROOT:-}" || "$SANDBOX_ROOT" == "/" || "$SANDBOX_ROOT" != /tmp/* ]]; then
-  echo "refusing to wipe SANDBOX_ROOT='$SANDBOX_ROOT' — must be non-empty and under /tmp/" >&2
+SANDBOX_ROOT_REAL="$(python3 -c 'import os,sys; print(os.path.realpath(sys.argv[1]))' "$SANDBOX_ROOT")"
+if [[ -z "${SANDBOX_ROOT:-}" || "$SANDBOX_ROOT_REAL" == "/" || "$SANDBOX_ROOT_REAL" != /tmp/* ]]; then
+  echo "refusing to wipe SANDBOX_ROOT='$SANDBOX_ROOT' (resolved: '$SANDBOX_ROOT_REAL') — must resolve under /tmp/" >&2
   exit 1
 fi
-rm -rf "$SANDBOX_ROOT"
+rm -rf "$SANDBOX_ROOT_REAL"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eval/scripts/sandbox.sh` around lines 31 - 35, Canonicalize SANDBOX_ROOT
before validating and deleting: resolve SANDBOX_ROOT to an absolute canonical
path (e.g., using realpath -m or readlink -f) into a new variable (like
CANON_SANDBOX_ROOT), then perform the existing non-empty/root and prefix check
against CANON_SANDBOX_ROOT (ensure it starts with /tmp/), and finally run rm -rf
on the canonicalized variable instead of the raw SANDBOX_ROOT; reference the
SANDBOX_ROOT variable and replace its use in the conditional and rm -rf with the
canonicalized name.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@eval/scripts/sandbox.sh`:
- Around line 31-35: Canonicalize SANDBOX_ROOT before validating and deleting:
resolve SANDBOX_ROOT to an absolute canonical path (e.g., using realpath -m or
readlink -f) into a new variable (like CANON_SANDBOX_ROOT), then perform the
existing non-empty/root and prefix check against CANON_SANDBOX_ROOT (ensure it
starts with /tmp/), and finally run rm -rf on the canonicalized variable instead
of the raw SANDBOX_ROOT; reference the SANDBOX_ROOT variable and replace its use
in the conditional and rm -rf with the canonicalized name.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1be9230c-bb00-489a-923d-c49d19cd9abe

📥 Commits

Reviewing files that changed from the base of the PR and between d32abb2 and 2d60560.

📒 Files selected for processing (10)

eval/README.md
eval/data/coding-agent-life-v1/sessions.json
eval/runner/adapters/agentmemory.ts
eval/runner/adapters/vector.ts
eval/runner/coding-life.ts
eval/runner/load.ts
eval/runner/longmemeval.ts
eval/runner/score.ts
eval/scripts/sandbox.sh
test/eval-adapters.test.ts

✅ Files skipped from review due to trivial changes (2)

eval/data/coding-agent-life-v1/sessions.json
eval/README.md

…uard (#588) Three concerns surfaced when auditing PRs merged since v0.9.21: 1) Graph parser regex from #494 was correct for self-closing tags ONLY when they're the only entity in the document. With a self-closing entity followed by an explicit-close entity, the greedy `[^>]*` ate the trailing `/` of the self-close, the alternation fell through to the explicit-close branch, and `[\s\S]*?` ran ahead to the *next* `</entity>` — merging two entity declarations into one match and silently dropping a node. Switch to lazy `[^>]*?` so the self-closing alternation gets first chance. Test from #494 (which was failing on main but I didn't notice) now passes. 2) #386 shipped `${AGENTMEMORY_URL}` / `${AGENTMEMORY_SECRET}` placeholders in plugin/.mcp.json. Claude Code and Cursor expand those at MCP launch time; some hosts pass the literal string through. A truthy literal `"${AGENTMEMORY_URL}"` defeats the existing `|| DEFAULT_URL` fallback and would have the shim POST to `${AGENTMEMORY_URL}/agentmemory/...` (DNS failure). Add a defensive guard in src/mcp/rest-proxy.ts that strips any value of the form `${...}` so the fallback engages. Mirror in src/mcp/standalone.ts's mode-announce log line. 3) CHANGELOG entries for all PRs landed since v0.9.21 (#321, #325, #386, #454, #494, #526, #542, #560, #561, #562, #564, #567) plus the regex + env-guard hardening here. Validation: - npm test → 98 files, 1091 tests pass - npm pack → 142 files, fresh install clean, iii-sdk@0.11.2 pinned, plugin/ shipped with hooks/scripts/skills/opencode/ - New test file covers 8 placeholder cases (unset, empty, `${VAR}` literal, mismatched braces, real values with $, unclosed `${`, no-braces `$VAR`).

vercel Bot deployed to Preview May 20, 2026 11:54 View deployment

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

vercel Bot deployed to Preview May 20, 2026 13:06 View deployment

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

rohitg00 merged commit 7fb72f4 into main May 20, 2026
7 checks passed

rohitg00 deleted the eval/scaffold-longmemeval branch May 20, 2026 13:11

rohitg00 mentioned this pull request May 21, 2026

fix: pre-release hardening for 0.9.22 — regex correctness + MCP env guard #588

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): pluggable benchmark harness with in-house coding-agent corpus#562

feat(eval): pluggable benchmark harness with in-house coding-agent corpus#562
rohitg00 merged 2 commits into
mainfrom
eval/scaffold-longmemeval

rohitg00 commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

vercel Bot commented May 20, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
-const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
-const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<
+import { readFileSync } from "node:fs";
+import { dirname, resolve } from "node:path";
+import { fileURLToPath } from "node:url";
+import { grepAdapter } from "../eval/runner/adapters/grep.js";
+import { aggregate, scoreQuestion } from "../eval/runner/score.js";
+import type { Question, Session } from "../eval/runner/types.js";
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = dirname(__filename);
+const DATA_DIR = resolve(__dirname, "..", "eval", "data", "coding-agent-life-v1");
+const sessions = JSON.parse(readFileSync(`${DATA_DIR}/sessions.json`, "utf8")) as Session[];
+const queries = JSON.parse(readFileSync(`${DATA_DIR}/queries.json`, "utf8")) as Array<

Conversation

rohitg00 commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What ships

Published scorecard

npm scripts

Reproduce locally

Test plan

Follow-up

Summary by CodeRabbit

Uh oh!

vercel Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rohitg00 commented May 20, 2026 •

edited by coderabbitai Bot

Loading

vercel Bot commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading