
release: dev → main — pgserve storage, benchmarks, sessions, config CLI#45

Merged
namastex888 merged 24 commits into main from dev
Mar 31, 2026

Conversation


@namastex888 namastex888 commented Mar 30, 2026

Summary

Promotes dev to main with all changes since last stable release.

Highlights

  • Embedded pgserve storage (PR #50) — PostgreSQL-backed context handling for 26MB+ files with adaptive chunking, full-text search, and run observability
  • rlmx stats CLI — query run history, cost breakdowns, tool usage across sessions
  • rlmx benchmark — cost and Oolong Synth benchmarking suite
  • rlmx config — global settings management (API keys, model defaults)
  • Session auto-save — trajectory artifacts persisted to ~/.rlmx/sessions/
  • Bug fixes — thinking-only response handling, context extensions, benchmark paths, type safety

PRs Included

Test Plan

  • npm run build passes
  • npm test — 191/191 tests pass
  • rlmx --help shows all new commands
  • rlmx stats works with persistent storage
  • Small context queries unchanged (no regression)

Summary by CodeRabbit

  • New Features
    • Added benchmark and stats commands with result formatting, saving and a bundled dataset + loader.
    • Added session persistence for saving run metadata, usage, answers and trajectories.
    • Added optional PostgreSQL-backed context mode and related observability and query tools; REPL can load DB helper utilities.
  • Behavior Changes
    • File-extension handling now normalizes extensions (accepts with/without leading dot).
    • Build now bundles benchmark data into the published output.
  • Chores
    • Bumped package version to 0.260331.1 and added PostgreSQL-related dependencies.
  • Tests
    • Added tests for benchmarks, sessions, context/extension handling, and caching.

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


coderabbitai bot commented Mar 30, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough


Adds PostgreSQL-backed context storage, observability and stats tooling, benchmark framework (cost and Oolong modes) with dataset and Python loader, session persistence, CLI hooks, many tests, config/extension normalization, and bumps package/version and build script to include benchmark data.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Packaging / Version**<br>`package.json`, `src/version.ts` | Bump package version; build script updated to copy src/benchmark-data.json into dist/src/; new runtime deps pg, pgserve and dev dep @types/pg. |
| **Benchmarking**<br>`src/benchmark.ts`, `src/benchmark-data.json`, `tests/benchmark.test.ts` | New benchmark module (types, cost & oolong runners, aggregation, table formatter, result saving) plus bundled dataset and comprehensive tests. |
| **CLI & Commands**<br>`src/cli.ts` | Added benchmark and stats commands, --no-session flag, conditional session saving, and wiring to run/save benchmark or stats flows. |
| **Storage & Postgres integration**<br>`src/storage.ts`, `src/ipc.ts`, `src/llm.ts`, `src/observe.ts`, `src/rlm.ts`, `src/repl.ts`, `src/batch.ts`, `src/cache.ts` | New PgStorage implementation (spawn pgserve, ingest/query/stop), IPC request types for pg_* operations, LLM handler branches for pg calls, observability recorder, storage-mode RLM orchestration (ingest, pass storage to handlers, telemetry), REPL loader for Python pg batteries, batch/rlm options to control storage/cache, and export of PROVIDER_LIMITS. |
| **Session persistence**<br>`src/session.ts`, `tests/session.test.ts` | New session save API and tests: persists meta/usage/answer/config/trajectory under ~/.rlmx/sessions/<runId>. |
| **Config & Context normalization**<br>`src/config.ts`, `src/context.ts`, `tests/context.test.ts` | Added StorageConfig and DEFAULT_STORAGE_CONFIG, extended RlmxConfig.storage parsing/validation; normalized context file extensions (ensure leading dot) and tests for regression. |
| **Python tooling**<br>`python/load_dataset.py`, `python/llm_bridge.py`, `python/pg_batteries.py`, `python/gemini_batteries.py` | New loader CLI for oolong dataset; added send_request bridge API; new Python pg_batteries IPC helpers; gemini_batteries updated to use public bridge call. |
| **Tests & test helpers**<br>`tests/*.test.ts`, `tests/cache.test.ts` | Multiple new/updated tests for benchmark, session, context, cache; added default storage config to test helper. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (rlmx)
    participant Bench as Benchmark module
    participant Venv as Python venv / Exec
    participant Py as python/load_dataset.py
    participant Direct as Direct LLM
    participant RLM as rlmLoop / REPL
    participant Pg as PgStorage (pgserve + pg)
    participant FS as Filesystem

    CLI->>Bench: runOolongBenchmark(samples, idx, mode)
    Bench->>Venv: ensure venv (~/.rlmx/.bench-venv)
    Bench->>Py: execFile python/load_dataset.py (samples, idx)
    Py-->>Bench: JSON dataset (questions)
    loop per question
        Bench->>Direct: llmComplete (direct)
        Direct-->>Bench: answer + tokens/cost/latency
        Bench->>RLM: rlmLoop(question, storageMode?)
        alt storageMode true
            RLM->>Pg: start() / ingest(context)
            RLM->>Pg: handle pg_* queries (via handleLLMRequest)
            Pg-->>RLM: query results
            RLM->>Bench: rlm answer + tokens/cost/iterations/latency
        else storageMode false
            RLM-->>Bench: rlm answer + tokens/cost/iterations/latency
        end
        Bench->>Bench: compute savings & aggregate
    end
    Bench->>FS: saveBenchmarkResults -> ~/.rlmx/benchmarks/<ts>.json
    CLI->>FS: saveSession(...) unless --no-session
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 I hopped through code with a tiny drum,
Benchmarks bundled, venvs spun, data come,
Sessions saved in folders snug and neat,
Pgserve hummed where context lines would meet,
A carrot crunch of metrics — hop, repeat! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 68.57%, which is below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: release promotion dev → main, PostgreSQL storage integration (pgserve), benchmark features, session persistence, and CLI configuration enhancements. |




automagik-genie and others added 5 commits March 30, 2026 02:55
Normalize extensions to always have a leading dot in both config.ts
(YAML parsing) and context.ts (directory scanning). Previously,
extensions written without dots in YAML (e.g. [mdx, json]) would
silently fail to match any files because the filter compared "mdx"
against ".mdx".
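The fix described above amounts to a one-line normalizer applied in both places. A minimal sketch (the actual helper name in config.ts/context.ts may differ):

```typescript
// Illustrative sketch of the extension normalization; the real helper
// in config.ts / context.ts may be named differently.
function normalizeExtension(ext: string): string {
  // "mdx" and ".mdx" both normalize to ".mdx"
  return ext.startsWith(".") ? ext : `.${ext}`;
}
```

With this applied to both the YAML-parsed list and the directory filter, `[mdx, json]` and `[.mdx, .json]` match the same files.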
Every rlmx run now persists trajectory, config snapshot, usage stats,
answer, and metadata to ~/.rlmx/sessions/<run_id>/. Add --no-session
flag to disable. Session save failures are caught and warned on stderr
without crashing the main run.
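The "catch and warn" behavior can be sketched with a small wrapper. Note the real saveSession in src/session.ts is async; a synchronous version is shown for brevity, and all names here are illustrative:

```typescript
// Run a best-effort side effect (such as a session save); on failure,
// emit a warning instead of propagating, so the main run never crashes.
function warnOnFailure<T>(fn: () => T, warn: (msg: string) => void): T | undefined {
  try {
    return fn();
  } catch (err) {
    warn(`session save failed: ${err}`);
    return undefined;
  }
}
```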
feat: Phase 1 — benchmarks, sessions, bug fix #28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (7)
src/session.ts (1)

69-70: Consider explicit encoding for answer.txt.

Other writeFile calls (lines 64, 67, 74) include explicit content formatting. For consistency and to handle non-ASCII answers correctly, consider being explicit about encoding, though Node.js defaults to UTF-8:

-  await writeFile(join(sessionsDir, "answer.txt"), data.answer);
+  await writeFile(join(sessionsDir, "answer.txt"), data.answer, "utf-8");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/session.ts` around lines 69 - 70, The writeFile call that writes
"answer.txt" should specify encoding explicitly to match the other writes and
ensure correct handling of non-ASCII text; update the writeFile using the same
pattern as the other calls (referencing writeFile, join, sessionsDir and
data.answer) to pass an options object or encoding argument (e.g., "utf8") so
the file is written with explicit UTF-8 encoding.
python/load_dataset.py (2)

24-25: Inconsistent dict access — potential KeyError.

Lines 22-23 and 26 use .get() for safe access, but lines 24-25 use direct item["question"] and item["context"] which will raise KeyError if those fields are missing. For consistency and robustness:

🛡️ Proposed fix
         output.append({
             "id": f"oolong-{item.get('id', 'unknown')}",
             "name": item.get("question", "")[:50],
-            "question": item["question"],
-            "context": item["context"],
+            "question": item.get("question", ""),
+            "context": item.get("context", ""),
             "expected": item.get("answer", ""),
             "category": "oolong",
         })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/load_dataset.py` around lines 24 - 25, Replace the direct dict lookups
item["question"] and item["context"] with safe .get() calls to match the
surrounding code (e.g., item.get("question") and item.get("context", ""/None) as
appropriate) so missing keys won't raise KeyError; update any downstream logic
that expects a non-None value to handle the chosen default. Target the
occurrences of item["question"] and item["context"] in python/load_dataset.py
and make them consistent with the other .get() usages.

9-10: Missing input validation for CLI arguments.

int(sys.argv[...]) will raise ValueError if non-integer arguments are passed. Since this script is invoked by the benchmark runner, consider adding validation or a try/except:

🛡️ Proposed fix
 def main():
-    samples = int(sys.argv[1]) if len(sys.argv) > 1 else 5
-    idx = int(sys.argv[2]) if len(sys.argv) > 2 else -1
+    try:
+        samples = int(sys.argv[1]) if len(sys.argv) > 1 else 5
+        idx = int(sys.argv[2]) if len(sys.argv) > 2 else -1
+    except ValueError as e:
+        print(f"Error: invalid argument - {e}", file=sys.stderr)
+        sys.exit(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/load_dataset.py` around lines 9 - 10, The CLI parsing of samples and
idx in load_dataset.py uses direct int(sys.argv[...]) calls which will raise
ValueError on non-integer input; wrap the parsing of samples and idx in a
try/except ValueError block (or validate with str.isdigit()/regex) and handle
invalid input by printing a clear usage/error message and exiting non-zero.
Update the logic around the variables samples and idx so invalid values fall
back to the existing defaults (5 and -1) or force exit, and ensure any exception
caught logs the offending argv value for easier debugging.
src/cli.ts (2)

647-648: Manual argument parsing is fragile.

The --output flag is parsed by searching args.indexOf("--output"), which doesn't handle edge cases like --output=json or missing values. This differs from how other CLI options are parsed via parseArgs. Consider reusing the existing parsing infrastructure or documenting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli.ts` around lines 647 - 648, The manual parsing of the --output flag
(variables outputIdx, outputFormat and args) is fragile; update the CLI to use
the existing parseArgs infrastructure instead of indexOf or, at minimum,
robustly handle --output=json, --output json and the missing-value case. Replace
the outputIdx/indexOf logic with a parseArgs definition for an --output option
(allowed values "json"|"table") and fall back to "table" when not provided or
invalid, or add parsing that checks for "--output=..." and verifies
args[outputIdx+1] exists before using it.
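The more robust parsing this comment asks for could look like the sketch below (parseArgs wiring omitted; names are illustrative, not the actual src/cli.ts code):

```typescript
// Handle "--output json", "--output=json", and the missing-value case,
// falling back to "table" for anything absent or unrecognized.
function parseOutputFlag(args: string[]): "json" | "table" {
  for (let i = 0; i < args.length; i++) {
    if (args[i] === "--output") {
      return args[i + 1] === "json" ? "json" : "table";
    }
    if (args[i].startsWith("--output=")) {
      return args[i].slice("--output=".length) === "json" ? "json" : "table";
    }
  }
  return "table";
}
```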

658-661: parseInt returns NaN on invalid input — consider validation.

If a user passes --samples abc, parseInt("abc", 10) returns NaN, which would be passed to runOolongBenchmark. The Python script would then receive "NaN" as a string argument.

🛡️ Proposed fix
   } else if (mode === "oolong") {
     const samplesIdx = args.indexOf("--samples");
-    const samples = samplesIdx >= 0 ? parseInt(args[samplesIdx + 1], 10) : 5;
+    const samplesRaw = samplesIdx >= 0 ? parseInt(args[samplesIdx + 1], 10) : 5;
+    const samples = Number.isNaN(samplesRaw) ? 5 : samplesRaw;
     const idxArgIdx = args.indexOf("--idx");
-    const idx = idxArgIdx >= 0 ? parseInt(args[idxArgIdx + 1], 10) : undefined;
+    const idxRaw = idxArgIdx >= 0 ? parseInt(args[idxArgIdx + 1], 10) : undefined;
+    const idx = idxRaw !== undefined && Number.isNaN(idxRaw) ? undefined : idxRaw;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli.ts` around lines 658 - 661, The code currently uses parseInt for
samples and idx (variables samplesIdx/samples and idxArgIdx/idx) but does not
validate the results, so invalid inputs produce NaN and propagate to
runOolongBenchmark; update the parsing to check for NaN after parseInt and
handle it: if samples is NaN use the default value (5) or reject the input with
a clear error, and for idx set it to undefined if NaN; ensure you update the
logic where runOolongBenchmark is called to rely on the sanitized samples and
idx values and provide a helpful error message or usage hint when user input is
invalid.
tests/benchmark.test.ts (1)

21-26: Path resolution depends on build output structure.

The path join(__dirname, "..", "..", "src", "benchmark-data.json") assumes tests run from dist/tests/. This works but is fragile if the build structure changes. Consider documenting this dependency or using a more robust resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmark.test.ts` around lines 21 - 26, The path resolution in
loadDataset() is fragile because jsonPath = join(__dirname, "..", "..", "src",
"benchmark-data.json") assumes tests run from dist/tests; update loadDataset to
resolve the JSON robustly (e.g., derive the file location from the test file
location using import.meta.url + fileURLToPath or use require.resolve/paths from
process.cwd()) so it works regardless of build output layout, and update the
jsonPath variable usage accordingly or add a short comment documenting the
required build layout if you prefer to keep the current approach.
src/benchmark.ts (1)

186-189: Venv creation commands have no timeout — could hang indefinitely.

If python3 -m venv or pip install datasets hangs (network issues, disk full), the benchmark command will block forever. Consider adding a timeout:

🛡️ Proposed fix
-    await execFileAsync("python3", ["-m", "venv", venvDir]);
-    await execFileAsync(join(venvDir, "bin", "pip"), ["install", "datasets"]);
+    await execFileAsync("python3", ["-m", "venv", venvDir], { timeout: 60_000 });
+    await execFileAsync(join(venvDir, "bin", "pip"), ["install", "datasets"], { timeout: 300_000 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/benchmark.ts` around lines 186 - 189, The two external commands that
create the venv and run pip (the calls to execFileAsync that invoke "python3 -m
venv" and join(venvDir, "bin", "pip") with ["install","datasets"]) lack timeouts
and can hang indefinitely; update both execFileAsync invocations to include a
sane timeout option (e.g., { timeout: <ms> }) and handle the resulting timeout
error (reject/throw or log and clean up) so the benchmark doesn't block forever
— ensure you reference the existing venvDir variable and preserve the mkdir(...)
call and error handling around these execFileAsync calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d04fd3eb-33ea-4c04-97fa-4ae4229a3481

📥 Commits

Reviewing files that changed from the base of the PR and between 70225f6 and 9d4a6be.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)
  • package.json
  • python/load_dataset.py
  • src/benchmark-data.json
  • src/benchmark.ts
  • src/cli.ts
  • src/config.ts
  • src/context.ts
  • src/session.ts
  • src/version.ts
  • tests/benchmark.test.ts
  • tests/context.test.ts
  • tests/session.test.ts
✅ Files skipped from review due to trivial changes (2)
  • src/version.ts
  • package.json

Comment thread src/benchmark.ts
Comment on lines +352 to +355
function formatPct(n: number): string {
  const sign = n > 0 ? "" : "";
  return `${sign}${n.toFixed(1)}%`;
}

⚠️ Potential issue | 🟡 Minor

Unused sign variable in formatPct.

The sign variable is always an empty string (both branches assign ""), making it dead code:

const sign = n > 0 ? "" : "";

If the intent was to show + for positive savings:

♻️ Proposed fix
 function formatPct(n: number): string {
-  const sign = n > 0 ? "" : "";
-  return `${sign}${n.toFixed(1)}%`;
+  const sign = n > 0 ? "+" : "";
+  return `${sign}${n.toFixed(1)}%`;
 }

Or remove the unused variable entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/benchmark.ts` around lines 352 - 355, In function formatPct replace the
dead sign assignment with a meaningful one or remove it; either remove sign
entirely and return `${n.toFixed(1)}%`, or set sign = n > 0 ? "+" : "" so
positive values are prefixed with "+" (keep function name formatPct and its
single return). Ensure negative numbers still show their built-in "-" from
n.toFixed(1).

namastex888 and others added 2 commits March 30, 2026 03:12
The benchmark data file was resolved relative to source root (../../src/),
which doesn't exist in npm installations. Now resolves relative to the
compiled JS file (__dirname) and the build script copies the JSON to dist/.

Co-authored-by: Genie <genie@namastex.ai>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa0e0882-12b0-4a1a-89f8-3ddd180788e3

📥 Commits

Reviewing files that changed from the base of the PR and between 9d4a6be and 8494b36.

📒 Files selected for processing (3)
  • package.json
  • src/benchmark.ts
  • src/version.ts
✅ Files skipped from review due to trivial changes (1)
  • src/version.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • package.json

Comment thread src/benchmark.ts
Comment on lines +25 to +31
export interface BenchmarkQuestion {
  id: string;
  name: string;
  question: string;
  context: string;
  category: string;
  expected?: string;

⚠️ Potential issue | 🟠 Major

oolong mode never computes an accuracy result.

runOolongBenchmark() never reads q.expected, BenchmarkResults has no place to store correctness, and Line 361 still labels the mode as "oolong accuracy". As written, this mode only reports token/cost/latency data, so it cannot answer the question it claims to benchmark.

Also applies to: 34-68, 202-294, 361-362
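A minimal correctness check could look like the sketch below. This is hypothetical, not the project's code; the real fix likely needs fuzzier matching, since the dataset's "answer" field can be a list:

```typescript
// Score 1 if the model answer contains the expected string (case-insensitive),
// 0 otherwise; null when the sample carries no ground truth.
function scoreAnswer(answer: string, expected?: string): number | null {
  if (!expected) return null;
  const norm = (s: string) => s.trim().toLowerCase();
  return norm(answer).includes(norm(expected)) ? 1 : 0;
}
```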

Comment thread src/benchmark.ts
Comment on lines +100 to +160
for (const q of dataset) {
  const ctx: LoadedContext = {
    type: "string",
    content: q.context,
    metadata: `benchmark context for "${q.name}" (${q.context.length} chars)`,
  };

  // Direct LLM call
  const directMessages: ChatMessage[] = [
    {
      role: "user",
      content: `Context:\n${q.context}\n\nQuestion: ${q.question}`,
    },
  ];

  const directStart = Date.now();
  const directResp = await llmComplete(directMessages, config.model);
  const directLatency = Date.now() - directStart;

  // RLM call
  const rlmStart = Date.now();
  const rlmResult = await rlmLoop(q.question, ctx, config, {
    maxIterations: config.budget.maxTokens ? 5 : 10,
    timeout: 120_000,
    verbose: false,
    output: "text",
    cache: false,
  });
  const rlmLatency = Date.now() - rlmStart;

  const directTotalTokens = directResp.usage.inputTokens + directResp.usage.outputTokens;
  const rlmTotalTokens = rlmResult.usage.inputTokens + rlmResult.usage.outputTokens;

  runs.push({
    questionId: q.id,
    questionName: q.name,
    direct: {
      tokens_input: directResp.usage.inputTokens,
      tokens_output: directResp.usage.outputTokens,
      cost: directResp.usage.totalCost,
      latency_ms: directLatency,
      answer: directResp.text,
    },
    rlm: {
      tokens_input: rlmResult.usage.inputTokens,
      tokens_output: rlmResult.usage.outputTokens,
      cost: rlmResult.usage.totalCost,
      latency_ms: rlmLatency,
      iterations: rlmResult.iterations,
      answer: rlmResult.answer,
    },
    savings: {
      tokens_pct: calculateSavings(directTotalTokens, rlmTotalTokens),
      cost_pct: calculateCostSavings(directResp.usage.totalCost, rlmResult.usage.totalCost),
    },
  });

  if (options?.outputFormat !== "json") {
    process.stderr.write(` completed: ${q.name}\n`);
  }
}

⚠️ Potential issue | 🟠 Major

One failed sample aborts the entire benchmark run.

Each direct/RLM call is awaited inline with no per-question recovery. If a provider call fails near the end of a long run, the function rejects and the completed runs accumulated so far are lost to the caller. Please isolate failures per sample or persist partial results as the loop progresses.

Also applies to: 225-283

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/benchmark.ts` around lines 100 - 160, The loop currently awaits
llmComplete and rlmLoop inline so a single provider error will reject the whole
benchmark and drop accumulated runs; wrap the per-question work (the calls to
llmComplete and rlmLoop and the runs.push) in a try/catch around the body that
builds ctx/directMessages/directStart/etc., on error push a fallback run entry
(referencing questionId/q.name and fields like direct, rlm, savings) containing
error details and null/zeros for metrics and/or an error field, and continue;
additionally after each successful runs.push persist or flush partial results
(e.g., call a savePartialRuns helper) so progress is not lost if the process
crashes.
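The isolation this review asks for can be sketched as a wrapper around the per-question body. Shown synchronously for brevity (the real loop awaits llmComplete/rlmLoop), and the error-entry shape here is an assumption, not the project's actual type:

```typescript
interface RunError { questionId: string; error: string; }

// Run every sample, recording failures instead of rejecting the whole batch.
function runAllIsolated<Q extends { id: string }, R>(
  dataset: Q[],
  runOne: (q: Q) => R,
): Array<R | RunError> {
  const runs: Array<R | RunError> = [];
  for (const q of dataset) {
    try {
      runs.push(runOne(q));
    } catch (err) {
      // One bad sample no longer aborts the run; the failure is preserved
      // in the results so it can be reported in the summary.
      runs.push({ questionId: q.id, error: String(err) });
    }
  }
  return runs;
}
```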

Comment thread src/benchmark.ts
Comment on lines +195 to +199
function findLoadDatasetScript(): string {
  // The script is at python/load_dataset.py relative to package root
  // From dist/src/benchmark.js, package root is ../../
  const thisDir = dirname(fileURLToPath(import.meta.url));
  return join(thisDir, "..", "..", "python", "load_dataset.py");

⚠️ Potential issue | 🟠 Major

findLoadDatasetScript() only works from dist/src.

Line 199 walks up two directories, which is correct for dist/src/benchmark.js but overshoots the package root when this module is executed from src/benchmark.ts. In that layout, runOolongBenchmark() resolves python/load_dataset.py outside the repo and fails with ENOENT. Probe both ../python and ../../python, then throw a clear error if neither exists.

🛠️ One way to make the lookup work in both layouts
+import { existsSync } from "node:fs";
...
 function findLoadDatasetScript(): string {
   // The script is at python/load_dataset.py relative to package root
-  // From dist/src/benchmark.js, package root is ../../
   const thisDir = dirname(fileURLToPath(import.meta.url));
-  return join(thisDir, "..", "..", "python", "load_dataset.py");
+  const candidates = [
+    join(thisDir, "..", "python", "load_dataset.py"),
+    join(thisDir, "..", "..", "python", "load_dataset.py"),
+  ];
+  const scriptPath = candidates.find((candidate) => existsSync(candidate));
+  if (!scriptPath) {
+    throw new Error("Could not locate python/load_dataset.py");
+  }
+  return scriptPath;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/benchmark.ts` around lines 195 - 199, The findLoadDatasetScript()
function currently assumes the module lives in dist/src and always walks up two
directories, which fails when running from src/benchmark.ts; modify
findLoadDatasetScript() to probe both candidate paths (join(thisDir, "..",
"python", "load_dataset.py") and join(thisDir, "..", "..", "python",
"load_dataset.py")) and return the first that exists, and if neither exists
throw a clear Error mentioning both tried paths and that runOolongBenchmark()
could not locate python/load_dataset.py; use fs.existsSync or fs.promises.access
to check existence and keep the function name findLoadDatasetScript and its
return semantics unchanged so callers like runOolongBenchmark() continue to
work.

namastex888 and others added 4 commits March 30, 2026 03:25
Systems without python3-venv installed would fail to create the
benchmark Python environment. Now tries uv first (preferred), then
falls back to python3 -m venv + pip.

Co-authored-by: Genie <genie@namastex.ai>
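The fallback order described above can be sketched as follows (the function name and spawn details are illustrative, not the actual implementation):

```typescript
import { spawnSync } from "node:child_process";

// Try uv first (preferred); fall back to python3 -m venv when uv is
// absent or fails. Returns which tool succeeded.
function createVenv(dir: string): string {
  const uv = spawnSync("uv", ["venv", dir], { stdio: "ignore" });
  if (uv.status === 0) return "uv";
  const py = spawnSync("python3", ["-m", "venv", dir], { stdio: "ignore" });
  if (py.status === 0) return "python3";
  throw new Error("Neither uv nor python3 -m venv could create the environment");
}
```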
The dataset uses "context_window_text" not "context", and "answer"
can be a list. Also maps task_group for category.

Co-authored-by: Genie <genie@namastex.ai>
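The field mapping the commit describes might look like this (the raw field names come from the commit message; the OolongSample target type is an assumption):

```typescript
// Illustrative target shape for one benchmark sample.
interface OolongSample {
  context: string;
  answer: string;
  category: string;
}

function mapOolongRecord(raw: Record<string, unknown>): OolongSample {
  const answer = raw["answer"];
  return {
    // The dataset stores the context under "context_window_text", not "context".
    context: String(raw["context_window_text"]),
    // "answer" can be a list; join it into a single comparable string.
    answer: Array.isArray(answer) ? answer.join(", ") : String(answer),
    // "task_group" maps onto the benchmark's category field.
    category: String(raw["task_group"] ?? "unknown"),
  };
}
```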
@namastex888
Collaborator Author

Comprehensive QA Report — Phase 1 Features

All features have been tested end-to-end on @next v0.260330.18. 4 bugs found and fixed during QA (PRs #47, #48, #49 — all merged to dev).

QA Results

| Test | Status | Notes |
| --- | --- | --- |
| rlmx --help | ✅ PASS | Shows benchmark command + --no-session flag |
| rlmx --version | ✅ PASS | v0.260330.18 |
| rlmx init | ✅ PASS | Scaffolds rlmx.yaml correctly |
| rlmx config set/get/list/delete/path | ✅ PASS | All subcommands work |
| Default extensions (.md only) | ✅ PASS | Loads only .md files from directory |
| --ext .mdx,.json flag | ✅ PASS | Loads exactly 2 files |
| YAML context.extensions: [.mdx, .json] | ✅ PASS | Loads exactly 2 files |
| YAML context.extensions: [mdx, json] (no dots) | ✅ PASS | Bug #28 fix — normalized automatically |
| Session auto-save | ✅ PASS | Creates all 5 files (meta.json, usage.json, answer.txt, config.yaml, trajectory.jsonl) |
| Session meta.json fields | ✅ PASS | runId, query, contextPath, timestamp, version |
| Session usage.json accuracy | ✅ PASS | Token counts match --stats output |
| Session --log + trajectory copy | ✅ PASS | JSONL events copied to trajectory.jsonl |
| --no-session flag | ✅ PASS | No session directory created |
| rlmx benchmark (help) | ✅ PASS | Shows cost + oolong modes |
| rlmx benchmark cost | ✅ PASS | Runs 6 questions, table renders, results saved to JSON |
| rlmx benchmark cost --output json | ✅ PASS | Clean JSON on stdout, progress on stderr |
| rlmx benchmark oolong --samples 1 | ✅ PASS | HF dataset loads, runs comparison, saves results |
| Query with --stats | ✅ PASS | JSON stats emitted to stderr |
| Query with --output json | ✅ PASS | Clean JSON response |
| Query with --output stream | ✅ PASS | JSONL stream events |
| Stdin piping | ✅ PASS | echo "query" \| rlmx works |
| Cache --estimate | ✅ PASS | Shows context stats |
| Cache warmup | ✅ PASS | Primes provider cache |
| tsc --noEmit | ✅ PASS | Clean |
| node --test | ✅ PASS | 191 tests, 0 failures |
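The normalization behind the two YAML extension rows above (the Bug #28 fix) can be sketched as follows; the function name is illustrative:

```typescript
// Accept extensions with or without a leading dot, so "mdx" and ".mdx"
// in rlmx.yaml both match .mdx files.
function normalizeExtensions(exts: string[]): string[] {
  return exts.map((ext) => {
    const trimmed = ext.trim().toLowerCase();
    return trimmed.startsWith(".") ? trimmed : `.${trimmed}`;
  });
}
```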

Bugs Found & Fixed

  1. PR fix: resolve benchmark-data.json path for npm installations #47 — benchmark-data.json path resolved to ../../src/ (source root), which doesn't exist in npm installs. Fixed to resolve relative to the compiled JS.
  2. PR fix: use uv for benchmark Python venv setup #48 — python3 -m venv fails on systems without python3-venv. Fixed to try uv first.
  3. PR fix: use correct Oolong Synth dataset field names #49 — Oolong Synth dataset uses context_window_text not context. Fixed field mapping.
  4. PR fix: resolve benchmark-data.json path for npm installations #47 — Unused totalW variable removed.

Included Changes (dev → main)

  • fix: respect context.extensions from rlmx.yaml (closes #28) — bug fix
  • feat: auto-save session artifacts to ~/.rlmx/sessions/ — new feature
  • feat: add rlmx benchmark command with cost and oolong modes — new feature
  • fix: resolve benchmark-data.json path for npm installations — bug fix
  • fix: use uv for benchmark Python venv setup — bug fix
  • fix: use correct Oolong Synth dataset field names — bug fix

Ready for human review and merge to main.

automagik-genie and others added 12 commits March 30, 2026 23:06
…ext handling

Add pgserve as a project dependency for embedded PostgreSQL storage.
Extend rlmx.yaml schema with a `storage` section supporting enabled mode
(auto/always/never), persistent/memory mode, data directory, port,
chunk sizing, and token estimation parameters. All fields have sensible
defaults and descriptive validation errors.
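A hedged sketch of what such a `storage` section in rlmx.yaml might look like; the exact key names beyond those listed in the commit message are guesses:

```yaml
storage:
  enabled: auto        # auto | always | never
  mode: memory         # memory | persistent
  dataDir: ~/.rlmx/data
  port: 0              # 0 = let the OS assign a free port
  chunkSize: 4000      # approximate characters per ingested chunk
```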
…ontexts

Wire validateContextSize() into runQuery() and runBatchCommand() execution
paths. When context exceeds the model's token limit, cache mode is
automatically disabled with a stderr warning. Storage mode is signaled
when storage.enabled is 'auto' or 'always'. Add storageMode option to
RLMOptions interface for future pgserve routing. Export PROVIDER_LIMITS
from cache.ts for reuse by other modules.
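The routing decision described above can be sketched as follows (PROVIDER_LIMITS and the option names come from the commit message; the signature and limit values are assumptions):

```typescript
// Illustrative per-provider token limits; real values live in cache.ts.
const PROVIDER_LIMITS: Record<string, number> = { openai: 200_000 };

interface RLMOptions {
  cacheMode: boolean;
  storageMode: boolean;
}

function validateContextSize(
  provider: string,
  estimatedTokens: number,
  storageEnabled: "auto" | "always" | "never",
  opts: RLMOptions,
): RLMOptions {
  const limit = PROVIDER_LIMITS[provider] ?? 128_000;
  if (estimatedTokens > limit) {
    // Context exceeds the model's window: caching is pointless, so
    // disable it and signal storage mode when the config allows it.
    console.error(`context (${estimatedTokens} tokens) exceeds ${provider} limit; disabling cache`);
    return {
      cacheMode: false,
      storageMode: storageEnabled === "auto" || storageEnabled === "always",
    };
  }
  return opts;
}
```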
…ingestion

Create src/storage.ts with PgStorage class that:
- Spawns pgserve as a child process (memory or persistent mode)
- Creates records table with GIN-indexed tsvector for full-text search
- Ingests context as JSONL (auto-extracting timestamp/type fields) or plain text
- Provides search(), slice(), timeRange(), count(), query() methods
- Graceful shutdown with 3s SIGTERM timeout then SIGKILL
- Process cleanup on exit/SIGTERM/SIGINT/uncaughtException
- Adaptive chunk sizing via getChunkSize() using PROVIDER_LIMITS

Also adds pg and @types/pg dependencies.
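The adaptive chunk sizing mentioned above might work along these lines (the limits, ratio, and clamp bounds are illustrative assumptions, not the actual getChunkSize implementation):

```typescript
// Illustrative per-provider token limits, standing in for PROVIDER_LIMITS.
const PROVIDER_LIMITS: Record<string, number> = {
  openai: 200_000,
  anthropic: 200_000,
  gemini: 1_000_000,
};

// Derive a chunk size from the provider's window so a page of search
// results fits comfortably in one prompt.
function getChunkSize(provider: string, charsPerToken = 4): number {
  const limit = PROVIDER_LIMITS[provider] ?? 128_000;
  // Aim for ~1% of the window per chunk, clamped to a sane range.
  const tokens = Math.min(Math.max(Math.floor(limit / 100), 500), 8_000);
  return tokens * charsPerToken;
}
```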
…orage queries

Create python/pg_batteries.py with pg_search(), pg_slice(), pg_time(),
pg_count(), pg_query() functions that communicate via IPC to Node.js
PgStorage. Results > 2000 chars are auto-truncated with stub format
showing first 5 items.

Add pg_* request types to IPC protocol (ipc.ts), route them through
handleLLMRequest() to PgStorage methods (llm.ts), and add
loadPgBatteries option to REPL start (repl.ts) for storage mode.
Add rlmx_sessions and rlmx_events tables to pgserve schema (created
alongside records table on start). Create v_cost_breakdown and
v_repl_usage aggregation views.

New src/observe.ts with ObservabilityRecorder class that records LLM
calls, REPL executions, and sub-calls as fire-and-forget inserts.
Session lifecycle tracked from start through completion/failure.
All recording errors logged to stderr but never block the run.
When storageMode is active in rlmLoop():
- Creates PgStorage, ingests context into Postgres
- Creates ObservabilityRecorder, records session start
- Passes storage to REPL (loadPgBatteries) and handleLLMRequest
- Skips raw context injection, uses storage metadata instead
- Appends pg_* tool instructions to system prompt with record count
- Records every LLM call, REPL execution, and sub-call to observer
- Records final answer or error on completion
- Cleans up storage on finalize and error paths

Batch engine (batch.ts) now supports storageMode option, passing
cache/storage flags through to each rlmLoop call.

CLI (cli.ts) forces storageMode when storage.enabled is 'always',
and passes storageMode to batch when context exceeds limits.
New src/stats.ts with query functions (listRuns, getRun, costBreakdown,
toolUsage) and terminal table formatting. Connects to persistent pgserve
data at ~/.rlmx/data temporarily to query rlmx_sessions and rlmx_events.

CLI commands:
  rlmx stats              — table of last 20 runs
  rlmx stats --run <id>   — per-event detail for one run
  rlmx stats --costs      — cost breakdown by model
  rlmx stats --tools      — sub-call usage and error rates
  rlmx stats --since 24h  — filter by time window
  rlmx stats --output json — JSON output mode

Graceful "No stats yet" message when ~/.rlmx/data doesn't exist.
FIX 1 (CRITICAL): Eliminate SQL injection in stats queries.
- PgStorage.query() now accepts optional params array, passed through
  to client.query(sql, params) for parameterized execution.
- stats.ts getRun(): uses $1 parameter for session_id instead of string
  interpolation with manual escaping.
- stats.ts listRuns(): parameterized LIMIT.
- stats.ts costBreakdown()/toolUsage(): compute cutoff timestamp in JS
  via parseSinceCutoff(), pass as $1 parameter instead of interpolating
  SQL interval strings.
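A parseSinceCutoff-style helper as described in FIX 1 might look like this: turn a "--since 24h" window into an absolute Date that is passed as a $1 parameter, instead of interpolating an SQL interval string. The supported suffixes here are an assumption:

```typescript
// Convert a relative window ("30m", "24h", "7d") into an absolute
// cutoff Date, computed in JS so it can be bound as a query parameter.
function parseSinceCutoff(since: string, now: Date = new Date()): Date {
  const match = /^(\d+)([mhd])$/.exec(since.trim());
  if (!match) throw new Error(`invalid --since value: ${since}`);
  const amount = Number(match[1]);
  const unitMs = { m: 60_000, h: 3_600_000, d: 86_400_000 }[match[2] as "m" | "h" | "d"];
  return new Date(now.getTime() - amount * unitMs);
}
```

The resulting Date goes into client.query(sql, [cutoff]), so no user-controlled text ever reaches the SQL string.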

FIX 2 (HIGH): Proper port auto-assignment when config.port is 0.
- Added findFreePort() helper that binds a temporary TCP server to
  port 0, reads the OS-assigned port, then closes the server.
- When config.port is 0, uses findFreePort() instead of falling back
  to hardcoded 8432, preventing conflicts with other instances.
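The findFreePort() approach in FIX 2 is the standard bind-to-zero trick; a sketch (details may differ from the actual helper):

```typescript
import { createServer } from "node:net";

// Bind a throwaway TCP server to port 0, let the OS assign a free
// port, read it back, and close the server before returning.
function findFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.once("error", reject);
    server.listen(0, () => {
      const address = server.address();
      if (address === null || typeof address === "string") {
        reject(new Error("could not determine assigned port"));
        return;
      }
      server.close(() => resolve(address.port));
    });
  });
}
```

There is a small race between closing the probe server and pgserve binding the port, but in practice it avoids the hardcoded-8432 collisions described above.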
…olumn, binary resolution

- TRUNCATE records before ingest to prevent stale data in persistent mode
- Reject SQL with semicolons in query() to prevent injection via chained statements
- Add source TEXT column to records schema, populate from ContextItem.path
- Resolve pgserve binary relative to package install via import.meta.url
- Use config dataDir in stats.ts instead of hardcoded path
- Check process.exitCode in waitForReady loop for fast failure
- Add public send_request() to llm_bridge, update pg/gemini_batteries
- Remove semicolon check in pg_query (READ ONLY wrapper suffices)
- Stats loads actual rlmx.yaml config instead of defaults
- Expose source column in pg_search/pg_slice, add pg_sources()
- Use typeof exitCode === 'number' guard in waitForReady
- Add session_id column to records table and ingest()
feat: embedded pgserve storage for large context handling + observability
@namastex888 namastex888 changed the title chore: rolling promotion dev -> main release: dev → main — pgserve storage, benchmarks, sessions, config CLI Mar 31, 2026
@namastex888 namastex888 merged commit 5f940a2 into main Mar 31, 2026
1 check passed