release: dev → main — pgserve storage, benchmarks, sessions, config CLI #45

namastex888 merged 24 commits into main from dev
Conversation
Caution: Review failed — pull request was closed or merged during review.

📝 Walkthrough

Adds PostgreSQL-backed context storage, observability and stats tooling, a benchmark framework (cost and Oolong modes) with a dataset and Python loader, session persistence, CLI hooks, many tests, config/extension normalization, and bumps the package version and build script to include benchmark data.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (rlmx)
    participant Bench as Benchmark module
    participant Venv as Python venv / Exec
    participant Py as python/load_dataset.py
    participant Direct as Direct LLM
    participant RLM as rlmLoop / REPL
    participant Pg as PgStorage (pgserve + pg)
    participant FS as Filesystem
    CLI->>Bench: runOolongBenchmark(samples, idx, mode)
    Bench->>Venv: ensure venv (~/.rlmx/.bench-venv)
    Bench->>Py: execFile python/load_dataset.py (samples, idx)
    Py-->>Bench: JSON dataset (questions)
    loop per question
        Bench->>Direct: llmComplete (direct)
        Direct-->>Bench: answer + tokens/cost/latency
        Bench->>RLM: rlmLoop(question, storageMode?)
        alt storageMode true
            RLM->>Pg: start() / ingest(context)
            RLM->>Pg: handle pg_* queries (via handleLLMRequest)
            Pg-->>RLM: query results
            RLM->>Bench: rlm answer + tokens/cost/iterations/latency
        else storageMode false
            RLM-->>Bench: rlm answer + tokens/cost/iterations/latency
        end
        Bench->>Bench: compute savings & aggregate
    end
    Bench->>FS: saveBenchmarkResults -> ~/.rlmx/benchmarks/<ts>.json
    CLI->>FS: saveSession(...) unless --no-session
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
🚥 Pre-merge checks: 2 passed ✅, 1 failed ❌ (1 warning)
Normalize extensions to always have a leading dot in both config.ts (YAML parsing) and context.ts (directory scanning). Previously, extensions written without dots in YAML (e.g. [mdx, json]) would silently fail to match any files because the filter compared "mdx" against ".mdx".
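The fix described above can be sketched as a small normalizer (function names here are hypothetical illustrations, not the actual identifiers in config.ts/context.ts): extensions from YAML may arrive with or without a leading dot, so normalize before comparing against a file's extension.

```typescript
// Hypothetical sketch of the leading-dot normalization described above.
function normalizeExt(ext: string): string {
  return ext.startsWith(".") ? ext : `.${ext}`;
}

// Directory-scan filter: compare the file's extension against the
// normalized allow-list, so YAML entries like [mdx, json] still match.
function matchesExtension(filePath: string, allowed: string[]): boolean {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return false;
  const fileExt = filePath.slice(dot);
  return allowed.map(normalizeExt).includes(fileExt);
}
```

With this in place, a config of `[mdx, json]` matches `notes.mdx`, which is exactly the case that previously failed silently.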
Every rlmx run now persists trajectory, config snapshot, usage stats, answer, and metadata to ~/.rlmx/sessions/<run_id>/. Add --no-session flag to disable. Session save failures are caught and warned on stderr without crashing the main run.
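The failure-isolation behavior described above can be sketched as follows (names are hypothetical; the real implementation lives in src/session.ts): the save is wrapped so any error becomes a stderr warning instead of crashing the main run.

```typescript
// Sketch, assuming an injected save function and warn sink for testability.
// The real code writes trajectory/config/usage/answer files under
// ~/.rlmx/sessions/<run_id>/.
async function saveSessionSafe(
  save: () => Promise<void>,
  warn: (msg: string) => void = (m) => process.stderr.write(m + "\n"),
): Promise<boolean> {
  try {
    await save();
    return true;
  } catch (err) {
    // Session persistence must never take down the run itself.
    warn(`warning: failed to save session: ${err instanceof Error ? err.message : String(err)}`);
    return false;
  }
}
```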
feat: Phase 1 — benchmarks, sessions, bug fix #28
Actionable comments posted: 1
🧹 Nitpick comments (7)
src/session.ts (1)

69-70: Consider explicit encoding for answer.txt.

Other writeFile calls (lines 64, 67, 74) include explicit content formatting. For consistency and to handle non-ASCII answers correctly, consider being explicit about encoding, though Node.js defaults to UTF-8:

```diff
- await writeFile(join(sessionsDir, "answer.txt"), data.answer);
+ await writeFile(join(sessionsDir, "answer.txt"), data.answer, "utf-8");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/session.ts` around lines 69-70, the writeFile call that writes "answer.txt" should specify encoding explicitly to match the other writes and ensure correct handling of non-ASCII text; update the writeFile call (referencing writeFile, join, sessionsDir and data.answer) to pass an options object or encoding argument (e.g., "utf8") so the file is written with explicit UTF-8 encoding.

python/load_dataset.py (2)
24-25: Inconsistent dict access — potential KeyError.

Lines 22-23 and 26 use .get() for safe access, but lines 24-25 use direct item["question"] and item["context"], which will raise KeyError if those fields are missing. For consistency and robustness:

🛡️ Proposed fix

```diff
 output.append({
     "id": f"oolong-{item.get('id', 'unknown')}",
     "name": item.get("question", "")[:50],
-    "question": item["question"],
-    "context": item["context"],
+    "question": item.get("question", ""),
+    "context": item.get("context", ""),
     "expected": item.get("answer", ""),
     "category": "oolong",
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/load_dataset.py` around lines 24-25, replace the direct dict lookups item["question"] and item["context"] with safe .get() calls to match the surrounding code (e.g., item.get("question") with an appropriate default) so missing keys won't raise KeyError, and update any downstream logic that expects a non-None value to handle the chosen default, keeping these consistent with the other .get() usages.
9-10: Missing input validation for CLI arguments.

int(sys.argv[...]) will raise ValueError if non-integer arguments are passed. Since this script is invoked by the benchmark runner, consider adding validation or a try/except:

🛡️ Proposed fix

```diff
 def main():
-    samples = int(sys.argv[1]) if len(sys.argv) > 1 else 5
-    idx = int(sys.argv[2]) if len(sys.argv) > 2 else -1
+    try:
+        samples = int(sys.argv[1]) if len(sys.argv) > 1 else 5
+        idx = int(sys.argv[2]) if len(sys.argv) > 2 else -1
+    except ValueError as e:
+        print(f"Error: invalid argument - {e}", file=sys.stderr)
+        sys.exit(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/load_dataset.py` around lines 9-10, the CLI parsing of samples and idx uses direct int(sys.argv[...]) calls which will raise ValueError on non-integer input; wrap the parsing in a try/except ValueError block (or validate with str.isdigit()/regex) and handle invalid input by printing a clear usage/error message and exiting non-zero, so invalid values fall back to the existing defaults (5 and -1) or force exit, logging the offending argv value for easier debugging.

src/cli.ts (2)
647-648: Manual argument parsing is fragile.

The --output flag is parsed by searching args.indexOf("--output"), which doesn't handle edge cases like --output=json or missing values. This differs from how other CLI options are parsed via parseArgs. Consider reusing the existing parsing infrastructure or documenting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli.ts` around lines 647-648, the manual parsing of the --output flag (variables outputIdx, outputFormat and args) is fragile; update the CLI to use the existing parseArgs infrastructure instead of indexOf or, at minimum, robustly handle --output=json, --output json and the missing-value case: define an --output option (allowed values "json"|"table") and fall back to "table" when not provided or invalid, verifying args[outputIdx+1] exists before using it.
658-661: parseInt returns NaN on invalid input — consider validation.

If a user passes --samples abc, parseInt("abc", 10) returns NaN, which would be passed to runOolongBenchmark. The Python script would then receive "NaN" as a string argument.

🛡️ Proposed fix

```diff
 } else if (mode === "oolong") {
   const samplesIdx = args.indexOf("--samples");
-  const samples = samplesIdx >= 0 ? parseInt(args[samplesIdx + 1], 10) : 5;
+  const samplesRaw = samplesIdx >= 0 ? parseInt(args[samplesIdx + 1], 10) : 5;
+  const samples = Number.isNaN(samplesRaw) ? 5 : samplesRaw;
   const idxArgIdx = args.indexOf("--idx");
-  const idx = idxArgIdx >= 0 ? parseInt(args[idxArgIdx + 1], 10) : undefined;
+  const idxRaw = idxArgIdx >= 0 ? parseInt(args[idxArgIdx + 1], 10) : undefined;
+  const idx = idxRaw !== undefined && Number.isNaN(idxRaw) ? undefined : idxRaw;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli.ts` around lines 658-661, the code uses parseInt for samples and idx but does not validate the results, so invalid inputs produce NaN and propagate to runOolongBenchmark; check for NaN after parseInt and handle it (default samples to 5, set idx to undefined, or reject with a clear error), and ensure runOolongBenchmark receives the sanitized values with a helpful error or usage hint when input is invalid.

tests/benchmark.test.ts (1)
21-26: Path resolution depends on build output structure.

The path join(__dirname, "..", "..", "src", "benchmark-data.json") assumes tests run from dist/tests/. This works but is fragile if the build structure changes. Consider documenting this dependency or using a more robust resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/benchmark.test.ts` around lines 21-26, the path resolution in loadDataset() is fragile because jsonPath assumes tests run from dist/tests; update loadDataset to resolve the JSON robustly (e.g., derive the file location from the test file location using import.meta.url + fileURLToPath, or use require.resolve/paths from process.cwd()) so it works regardless of build output layout, or add a short comment documenting the required build layout if you prefer the current approach.

src/benchmark.ts (1)
186-189: Venv creation commands have no timeout — could hang indefinitely.

If python3 -m venv or pip install datasets hangs (network issues, disk full), the benchmark command will block forever. Consider adding a timeout:

🛡️ Proposed fix

```diff
- await execFileAsync("python3", ["-m", "venv", venvDir]);
- await execFileAsync(join(venvDir, "bin", "pip"), ["install", "datasets"]);
+ await execFileAsync("python3", ["-m", "venv", venvDir], { timeout: 60_000 });
+ await execFileAsync(join(venvDir, "bin", "pip"), ["install", "datasets"], { timeout: 300_000 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/benchmark.ts` around lines 186-189, the two external commands that create the venv and run pip lack timeouts and can hang indefinitely; update both execFileAsync invocations to include a sane timeout option (e.g., { timeout: <ms> }) and handle the resulting timeout error (reject/throw or log and clean up) so the benchmark doesn't block forever, preserving the mkdir(...) call and surrounding error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d04fd3eb-33ea-4c04-97fa-4ae4229a3481
⛔ Files ignored due to path filters (1)

- package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)

- package.json
- python/load_dataset.py
- src/benchmark-data.json
- src/benchmark.ts
- src/cli.ts
- src/config.ts
- src/context.ts
- src/session.ts
- src/version.ts
- tests/benchmark.test.ts
- tests/context.test.ts
- tests/session.test.ts
✅ Files skipped from review due to trivial changes (2)
- src/version.ts
- package.json
```typescript
function formatPct(n: number): string {
  const sign = n > 0 ? "" : "";
  return `${sign}${n.toFixed(1)}%`;
}
```
Unused sign variable in formatPct.

The sign variable is always an empty string (both branches assign ""), making it dead code:

```typescript
const sign = n > 0 ? "" : "";
```

If the intent was to show + for positive savings:

♻️ Proposed fix

```diff
 function formatPct(n: number): string {
-  const sign = n > 0 ? "" : "";
-  return `${sign}${n.toFixed(1)}%`;
+  const sign = n > 0 ? "+" : "";
+  return `${sign}${n.toFixed(1)}%`;
 }
```

Or remove the unused variable entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/benchmark.ts` around lines 352 - 355, In function formatPct replace the
dead sign assignment with a meaningful one or remove it; either remove sign
entirely and return `${n.toFixed(1)}%`, or set sign = n > 0 ? "+" : "" so
positive values are prefixed with "+" (keep function name formatPct and its
single return). Ensure negative numbers still show their built-in "-" from
n.toFixed(1).
The benchmark data file was resolved relative to the source root (../../src/), which doesn't exist in npm installations. It now resolves relative to the compiled JS file (__dirname), and the build script copies the JSON to dist/.

Co-authored-by: Genie <genie@namastex.ai>
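The resolution strategy in this commit can be sketched as follows (the function name and file name here are illustrative, not the exact identifiers in src/benchmark.ts): derive the data file's location from the module's own URL, so the lookup works wherever the compiled file is installed.

```typescript
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";

// Resolve a data file that ships next to the compiled module. In the real
// code this would be called with import.meta.url; it takes the URL as a
// parameter here so the behavior is testable with a fake path.
function resolveDataFile(moduleUrl: string, file = "benchmark-data.json"): string {
  return join(dirname(fileURLToPath(moduleUrl)), file);
}
```

Because the path is anchored to the module rather than the repo layout, an npm install that ships only dist/ still finds the JSON, provided the build copies it there.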
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fa0e0882-12b0-4a1a-89f8-3ddd180788e3
📒 Files selected for processing (3)

- package.json
- src/benchmark.ts
- src/version.ts
✅ Files skipped from review due to trivial changes (1)
- src/version.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- package.json
```typescript
export interface BenchmarkQuestion {
  id: string;
  name: string;
  question: string;
  context: string;
  category: string;
  expected?: string;
}
```
oolong mode never computes an accuracy result.
runOolongBenchmark() never reads q.expected, BenchmarkResults has no place to store correctness, and Line 361 still labels the mode as "oolong accuracy". As written, this mode only reports token/cost/latency data, so it cannot answer the question it claims to benchmark.
Also applies to: 34-68, 202-294, 361-362
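One minimal way to close the gap this comment describes would be a normalized containment check against q.expected. This is a hypothetical sketch, not code from the PR: real Oolong grading may well need an LLM judge or dataset-specific matching, but even a crude scorer gives the mode something to report beyond tokens and cost.

```typescript
// Normalize case and whitespace so superficial formatting differences
// don't count as wrong answers.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns undefined when there is no ground truth for the sample,
// so aggregation can distinguish "ungraded" from "wrong".
function isCorrect(answer: string, expected?: string): boolean | undefined {
  if (!expected) return undefined;
  return normalize(answer).includes(normalize(expected));
}
```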
```typescript
for (const q of dataset) {
  const ctx: LoadedContext = {
    type: "string",
    content: q.context,
    metadata: `benchmark context for "${q.name}" (${q.context.length} chars)`,
  };

  // Direct LLM call
  const directMessages: ChatMessage[] = [
    {
      role: "user",
      content: `Context:\n${q.context}\n\nQuestion: ${q.question}`,
    },
  ];

  const directStart = Date.now();
  const directResp = await llmComplete(directMessages, config.model);
  const directLatency = Date.now() - directStart;

  // RLM call
  const rlmStart = Date.now();
  const rlmResult = await rlmLoop(q.question, ctx, config, {
    maxIterations: config.budget.maxTokens ? 5 : 10,
    timeout: 120_000,
    verbose: false,
    output: "text",
    cache: false,
  });
  const rlmLatency = Date.now() - rlmStart;

  const directTotalTokens = directResp.usage.inputTokens + directResp.usage.outputTokens;
  const rlmTotalTokens = rlmResult.usage.inputTokens + rlmResult.usage.outputTokens;

  runs.push({
    questionId: q.id,
    questionName: q.name,
    direct: {
      tokens_input: directResp.usage.inputTokens,
      tokens_output: directResp.usage.outputTokens,
      cost: directResp.usage.totalCost,
      latency_ms: directLatency,
      answer: directResp.text,
    },
    rlm: {
      tokens_input: rlmResult.usage.inputTokens,
      tokens_output: rlmResult.usage.outputTokens,
      cost: rlmResult.usage.totalCost,
      latency_ms: rlmLatency,
      iterations: rlmResult.iterations,
      answer: rlmResult.answer,
    },
    savings: {
      tokens_pct: calculateSavings(directTotalTokens, rlmTotalTokens),
      cost_pct: calculateCostSavings(directResp.usage.totalCost, rlmResult.usage.totalCost),
    },
  });

  if (options?.outputFormat !== "json") {
    process.stderr.write(`  completed: ${q.name}\n`);
  }
}
```
One failed sample aborts the entire benchmark run.
Each direct/RLM call is awaited inline with no per-question recovery. If a provider call fails near the end of a long run, the function rejects and the completed runs accumulated so far are lost to the caller. Please isolate failures per sample or persist partial results as the loop progresses.
Also applies to: 225-283
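The isolation this comment asks for can be sketched like so (types simplified, names hypothetical — the real loop builds the full direct/rlm/savings entries): each sample runs inside a try/catch, failures become error entries instead of rejecting the whole run, and partial results can be flushed after each sample.

```typescript
interface RunRecord {
  questionId: string;
  error?: string;
}

// Runs every question even if some fail; optionally persists partial
// progress after each sample via the injected flush callback.
async function runAll<Q extends { id: string }>(
  questions: Q[],
  runOne: (q: Q) => Promise<RunRecord>,
  flush?: (partial: RunRecord[]) => Promise<void>,
): Promise<RunRecord[]> {
  const runs: RunRecord[] = [];
  for (const q of questions) {
    try {
      runs.push(await runOne(q));
    } catch (err) {
      // Record the failure and keep going instead of losing completed runs.
      runs.push({ questionId: q.id, error: err instanceof Error ? err.message : String(err) });
    }
    // Persist partial results so a crash doesn't lose accumulated samples.
    await flush?.(runs);
  }
  return runs;
}
```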
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/benchmark.ts` around lines 100 - 160, The loop currently awaits
llmComplete and rlmLoop inline so a single provider error will reject the whole
benchmark and drop accumulated runs; wrap the per-question work (the calls to
llmComplete and rlmLoop and the runs.push) in a try/catch around the body that
builds ctx/directMessages/directStart/etc., on error push a fallback run entry
(referencing questionId/q.name and fields like direct, rlm, savings) containing
error details and null/zeros for metrics and/or an error field, and continue;
additionally after each successful runs.push persist or flush partial results
(e.g., call a savePartialRuns helper) so progress is not lost if the process
crashes.
```typescript
function findLoadDatasetScript(): string {
  // The script is at python/load_dataset.py relative to package root
  // From dist/src/benchmark.js, package root is ../../
  const thisDir = dirname(fileURLToPath(import.meta.url));
  return join(thisDir, "..", "..", "python", "load_dataset.py");
}
```
findLoadDatasetScript() only works from dist/src.
Line 199 walks up two directories, which is correct for dist/src/benchmark.js but overshoots the package root when this module is executed from src/benchmark.ts. In that layout, runOolongBenchmark() resolves python/load_dataset.py outside the repo and fails with ENOENT. Probe both ../python and ../../python, then throw a clear error if neither exists.
🛠️ One way to make the lookup work in both layouts

```diff
+import { existsSync } from "node:fs";
 ...
 function findLoadDatasetScript(): string {
   // The script is at python/load_dataset.py relative to package root
-  // From dist/src/benchmark.js, package root is ../../
   const thisDir = dirname(fileURLToPath(import.meta.url));
-  return join(thisDir, "..", "..", "python", "load_dataset.py");
+  const candidates = [
+    join(thisDir, "..", "python", "load_dataset.py"),
+    join(thisDir, "..", "..", "python", "load_dataset.py"),
+  ];
+  const scriptPath = candidates.find((candidate) => existsSync(candidate));
+  if (!scriptPath) {
+    throw new Error("Could not locate python/load_dataset.py");
+  }
+  return scriptPath;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/benchmark.ts` around lines 195 - 199, The findLoadDatasetScript()
function currently assumes the module lives in dist/src and always walks up two
directories, which fails when running from src/benchmark.ts; modify
findLoadDatasetScript() to probe both candidate paths (join(thisDir, "..",
"python", "load_dataset.py") and join(thisDir, "..", "..", "python",
"load_dataset.py")) and return the first that exists, and if neither exists
throw a clear Error mentioning both tried paths and that runOolongBenchmark()
could not locate python/load_dataset.py; use fs.existsSync or fs.promises.access
to check existence and keep the function name findLoadDatasetScript and its
return semantics unchanged so callers like runOolongBenchmark() continue to
work.
Systems without python3-venv installed would fail to create the benchmark Python environment. Now tries uv first (preferred), then falls back to python3 -m venv + pip. Co-authored-by: Genie <genie@namastex.ai>
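The fallback strategy this commit describes can be sketched with an injected exec function (the exec signature and helper name are hypothetical; the real code shells out via execFile): try uv first, and only on failure fall back to python3 -m venv plus pip.

```typescript
type Exec = (cmd: string, args: string[]) => Promise<void>;

// Returns which toolchain succeeded. The uv flags shown ("uv venv",
// "uv pip install --python ...") are an assumption about the exact
// invocation; the fallback path mirrors the original python3 -m venv flow.
async function createBenchVenv(venvDir: string, exec: Exec): Promise<"uv" | "venv"> {
  try {
    await exec("uv", ["venv", venvDir]);
    await exec("uv", ["pip", "install", "--python", `${venvDir}/bin/python`, "datasets"]);
    return "uv";
  } catch {
    await exec("python3", ["-m", "venv", venvDir]);
    await exec(`${venvDir}/bin/pip`, ["install", "datasets"]);
    return "venv";
  }
}
```

Injecting exec keeps the control flow testable without actually creating environments.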
The dataset uses "context_window_text" not "context", and "answer" can be a list. Also maps task_group for category. Co-authored-by: Genie <genie@namastex.ai>
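The normalization this commit describes (the fix itself lives in python/load_dataset.py) can be illustrated in TypeScript; the helper names here are hypothetical:

```typescript
// "answer" may be a string or a list in the Oolong records; flatten
// lists into one string for downstream comparison.
function normalizeAnswer(answer: unknown): string {
  if (Array.isArray(answer)) return answer.map(String).join("; ");
  return answer == null ? "" : String(answer);
}

// Map a raw dataset record using the actual field names the commit notes:
// "context_window_text" for the context, list-or-string "answer".
function mapRecord(item: Record<string, unknown>): { context: string; expected: string } {
  return {
    context: String(item["context_window_text"] ?? ""),
    expected: normalizeAnswer(item["answer"]),
  };
}
```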
Comprehensive QA Report — Phase 1 Features

All features have been tested end-to-end.

QA Results
Bugs Found & Fixed
Included Changes (dev → main)
Ready for human review and merge to main.
…ext handling Add pgserve as a project dependency for embedded PostgreSQL storage. Extend rlmx.yaml schema with a `storage` section supporting enabled mode (auto/always/never), persistent/memory mode, data directory, port, chunk sizing, and token estimation parameters. All fields have sensible defaults and descriptive validation errors.
…ontexts Wire validateContextSize() into runQuery() and runBatchCommand() execution paths. When context exceeds the model's token limit, cache mode is automatically disabled with a stderr warning. Storage mode is signaled when storage.enabled is 'auto' or 'always'. Add storageMode option to RLMOptions interface for future pgserve routing. Export PROVIDER_LIMITS from cache.ts for reuse by other modules.
…ingestion Create src/storage.ts with PgStorage class that:

- Spawns pgserve as a child process (memory or persistent mode)
- Creates records table with GIN-indexed tsvector for full-text search
- Ingests context as JSONL (auto-extracting timestamp/type fields) or plain text
- Provides search(), slice(), timeRange(), count(), query() methods
- Graceful shutdown with 3s SIGTERM timeout then SIGKILL
- Process cleanup on exit/SIGTERM/SIGINT/uncaughtException
- Adaptive chunk sizing via getChunkSize() using PROVIDER_LIMITS

Also adds pg and @types/pg dependencies.
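The schema this commit describes could look roughly like the following. This is a hypothetical sketch of the DDL and search query, not the actual statements in src/storage.ts; column names beyond those mentioned above are assumptions.

```typescript
// Records table with a generated tsvector column and GIN index,
// matching the "GIN-indexed tsvector for full-text search" described above.
const CREATE_RECORDS = `
  CREATE TABLE IF NOT EXISTS records (
    id BIGSERIAL PRIMARY KEY,
    ts TIMESTAMPTZ,
    type TEXT,
    content TEXT NOT NULL,
    tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
  );
  CREATE INDEX IF NOT EXISTS records_tsv_idx ON records USING GIN (tsv);
`;

// A ranked, parameterized full-text query the search() method might issue.
const SEARCH_SQL = `
  SELECT id, ts, type, content
  FROM records
  WHERE tsv @@ plainto_tsquery('english', $1)
  ORDER BY ts_rank(tsv, plainto_tsquery('english', $1)) DESC
  LIMIT $2;
`;
```

Parameterizing the query term ($1) rather than interpolating it matters for the SQL-injection fixes discussed later in this thread.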
…orage queries Create python/pg_batteries.py with pg_search(), pg_slice(), pg_time(), pg_count(), pg_query() functions that communicate via IPC to Node.js PgStorage. Results > 2000 chars are auto-truncated with stub format showing first 5 items. Add pg_* request types to IPC protocol (ipc.ts), route them through handleLLMRequest() to PgStorage methods (llm.ts), and add loadPgBatteries option to REPL start (repl.ts) for storage mode.
Add rlmx_sessions and rlmx_events tables to pgserve schema (created alongside records table on start). Create v_cost_breakdown and v_repl_usage aggregation views. New src/observe.ts with ObservabilityRecorder class that records LLM calls, REPL executions, and sub-calls as fire-and-forget inserts. Session lifecycle tracked from start through completion/failure. All recording errors logged to stderr but never block the run.
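The fire-and-forget guarantee described above ("all recording errors logged to stderr but never block the run") can be sketched as follows; the function name and injected parameters are hypothetical:

```typescript
// Detach the insert promise so recording never blocks the caller, and
// downgrade any failure to a stderr warning.
function recordEvent(
  insert: () => Promise<void>,
  warn: (msg: string) => void = (m) => process.stderr.write(m + "\n"),
): void {
  void insert().catch((err) =>
    warn(`observability insert failed: ${err instanceof Error ? err.message : String(err)}`),
  );
}
```

Because the returned promise is deliberately not awaited, a slow or dead database cannot stall an LLM call; the trade-off is that events may be lost on abrupt process exit.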
When storageMode is active in rlmLoop():

- Creates PgStorage, ingests context into Postgres
- Creates ObservabilityRecorder, records session start
- Passes storage to REPL (loadPgBatteries) and handleLLMRequest
- Skips raw context injection, uses storage metadata instead
- Appends pg_* tool instructions to system prompt with record count
- Records every LLM call, REPL execution, and sub-call to observer
- Records final answer or error on completion
- Cleans up storage on finalize and error paths

Batch engine (batch.ts) now supports a storageMode option, passing cache/storage flags through to each rlmLoop call. CLI (cli.ts) forces storageMode when storage.enabled is 'always', and passes storageMode to batch when context exceeds limits.
New src/stats.ts with query functions (listRuns, getRun, costBreakdown, toolUsage) and terminal table formatting. Connects to persistent pgserve data at ~/.rlmx/data temporarily to query rlmx_sessions and rlmx_events. CLI commands:

- rlmx stats — table of last 20 runs
- rlmx stats --run <id> — per-event detail for one run
- rlmx stats --costs — cost breakdown by model
- rlmx stats --tools — sub-call usage and error rates
- rlmx stats --since 24h — filter by time window
- rlmx stats --output json — JSON output mode

Graceful "No stats yet" message when ~/.rlmx/data doesn't exist.
FIX 1 (CRITICAL): Eliminate SQL injection in stats queries.

- PgStorage.query() now accepts an optional params array, passed through to client.query(sql, params) for parameterized execution.
- stats.ts getRun(): uses a $1 parameter for session_id instead of string interpolation with manual escaping.
- stats.ts listRuns(): parameterized LIMIT.
- stats.ts costBreakdown()/toolUsage(): compute the cutoff timestamp in JS via parseSinceCutoff(), pass it as a $1 parameter instead of interpolating SQL interval strings.

FIX 2 (HIGH): Proper port auto-assignment when config.port is 0.

- Added findFreePort() helper that binds a temporary TCP server to port 0, reads the OS-assigned port, then closes the server.
- When config.port is 0, uses findFreePort() instead of falling back to hardcoded 8432, preventing conflicts with other instances.
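The findFreePort() technique described in FIX 2 can be sketched as follows — a minimal standalone version of the bind-to-port-0 pattern; the actual helper in the PR may differ in detail:

```typescript
import { createServer } from "node:net";

// Bind a throwaway TCP server to port 0 so the OS assigns a free port,
// read the assigned port back, then close the server before returning.
function findFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const srv = createServer();
    srv.once("error", reject);
    srv.listen(0, () => {
      const addr = srv.address();
      if (addr && typeof addr === "object") {
        const port = addr.port;
        srv.close(() => resolve(port));
      } else {
        srv.close(() => reject(new Error("could not read assigned port")));
      }
    });
  });
}
```

Note the inherent race: the port is free when the probe server closes, but another process could grab it before pgserve binds it. That window is small and usually acceptable for local tooling.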
…olumn, binary resolution

- TRUNCATE records before ingest to prevent stale data in persistent mode
- Reject SQL with semicolons in query() to prevent injection via chained statements
- Add source TEXT column to records schema, populate from ContextItem.path
- Resolve pgserve binary relative to package install via import.meta.url
- Use config dataDir in stats.ts instead of hardcoded path
- Check process.exitCode in waitForReady loop for fast failure
- Add public send_request() to llm_bridge, update pg/gemini_batteries
- Remove semicolon check in pg_query (READ ONLY wrapper suffices)
- Stats loads actual rlmx.yaml config instead of defaults
- Expose source column in pg_search/pg_slice, add pg_sources()
- Use typeof exitCode === 'number' guard in waitForReady
- Add session_id column to records table and ingest()
feat: embedded pgserve storage for large context handling + observability
Summary
Promotes dev to main with all changes since last stable release.
Highlights
PRs Included
Test Plan
Summary by CodeRabbit