release: dev → main — pgserve storage, benchmarks, sessions, config CLI #45

namastex888 merged 24 commits into main from dev
Conversation
Caution: Review failed — pull request was closed or merged during review.

📝 Walkthrough

Adds PostgreSQL-backed context storage, observability and stats tooling, a benchmark framework (cost and Oolong modes) with a dataset and Python loader, session persistence, CLI hooks, many tests, config/extension normalization, and bumps the package version and build script to include benchmark data.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (rlmx)
    participant Bench as Benchmark module
    participant Venv as Python venv / Exec
    participant Py as python/load_dataset.py
    participant Direct as Direct LLM
    participant RLM as rlmLoop / REPL
    participant Pg as PgStorage (pgserve + pg)
    participant FS as Filesystem
    CLI->>Bench: runOolongBenchmark(samples, idx, mode)
    Bench->>Venv: ensure venv (~/.rlmx/.bench-venv)
    Bench->>Py: execFile python/load_dataset.py (samples, idx)
    Py-->>Bench: JSON dataset (questions)
    loop per question
        Bench->>Direct: llmComplete (direct)
        Direct-->>Bench: answer + tokens/cost/latency
        Bench->>RLM: rlmLoop(question, storageMode?)
        alt storageMode true
            RLM->>Pg: start() / ingest(context)
            RLM->>Pg: handle pg_* queries (via handleLLMRequest)
            Pg-->>RLM: query results
            RLM->>Bench: rlm answer + tokens/cost/iterations/latency
        else storageMode false
            RLM-->>Bench: rlm answer + tokens/cost/iterations/latency
        end
        Bench->>Bench: compute savings & aggregate
    end
    Bench->>FS: saveBenchmarkResults -> ~/.rlmx/benchmarks/<ts>.json
    CLI->>FS: saveSession(...) unless --no-session
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
🚥 Pre-merge checks: 2 passed ✅, 1 failed ❌ (1 warning)
Normalize extensions to always have a leading dot in both config.ts (YAML parsing) and context.ts (directory scanning). Previously, extensions written without dots in YAML (e.g. [mdx, json]) would silently fail to match any files because the filter compared "mdx" against ".mdx".
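The fix described above can be sketched as a small normalizer (function names here are hypothetical illustrations, not the actual identifiers in config.ts/context.ts): extensions from YAML may arrive with or without a leading dot, so normalize before comparing against a file's extension.

```typescript
// Hypothetical sketch of the leading-dot normalization described above.
function normalizeExt(ext: string): string {
  return ext.startsWith(".") ? ext : `.${ext}`;
}

// Directory-scan filter: compare the file's extension against the
// normalized allow-list, so YAML entries like [mdx, json] still match.
function matchesExtension(filePath: string, allowed: string[]): boolean {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return false;
  const fileExt = filePath.slice(dot);
  return allowed.map(normalizeExt).includes(fileExt);
}
```

With this in place, a config of `[mdx, json]` matches `notes.mdx`, which is exactly the case that previously failed silently.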
Every rlmx run now persists trajectory, config snapshot, usage stats, answer, and metadata to ~/.rlmx/sessions/<run_id>/. Add --no-session flag to disable. Session save failures are caught and warned on stderr without crashing the main run.
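The failure-isolation behavior described above can be sketched as follows (names are hypothetical; the real implementation lives in src/session.ts): the save is wrapped so any error becomes a stderr warning instead of crashing the main run.

```typescript
// Sketch, assuming an injected save function and warn sink for testability.
// The real code writes trajectory/config/usage/answer files under
// ~/.rlmx/sessions/<run_id>/.
async function saveSessionSafe(
  save: () => Promise<void>,
  warn: (msg: string) => void = (m) => process.stderr.write(m + "\n"),
): Promise<boolean> {
  try {
    await save();
    return true;
  } catch (err) {
    // Session persistence must never take down the run itself.
    warn(`warning: failed to save session: ${err instanceof Error ? err.message : String(err)}`);
    return false;
  }
}
```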
feat: Phase 1 — benchmarks, sessions, bug fix #28
Actionable comments posted: 1
🧹 Nitpick comments (7)
src/session.ts (1)

69-70: Consider explicit encoding for answer.txt.

Other writeFile calls (lines 64, 67, 74) include explicit content formatting. For consistency and to handle non-ASCII answers correctly, consider being explicit about encoding, though Node.js defaults to UTF-8:

```diff
- await writeFile(join(sessionsDir, "answer.txt"), data.answer);
+ await writeFile(join(sessionsDir, "answer.txt"), data.answer, "utf-8");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/session.ts` around lines 69-70, the writeFile call that writes "answer.txt" should specify encoding explicitly to match the other writes and ensure correct handling of non-ASCII text; update the writeFile call (referencing writeFile, join, sessionsDir and data.answer) to pass an options object or encoding argument (e.g., "utf8") so the file is written with explicit UTF-8 encoding.

python/load_dataset.py (2)
24-25: Inconsistent dict access — potential KeyError.

Lines 22-23 and 26 use .get() for safe access, but lines 24-25 use direct item["question"] and item["context"], which will raise KeyError if those fields are missing. For consistency and robustness:

🛡️ Proposed fix

```diff
 output.append({
     "id": f"oolong-{item.get('id', 'unknown')}",
     "name": item.get("question", "")[:50],
-    "question": item["question"],
-    "context": item["context"],
+    "question": item.get("question", ""),
+    "context": item.get("context", ""),
     "expected": item.get("answer", ""),
     "category": "oolong",
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/load_dataset.py` around lines 24-25, replace the direct dict lookups item["question"] and item["context"] with safe .get() calls to match the surrounding code (e.g., item.get("question") with an appropriate default) so missing keys won't raise KeyError, and update any downstream logic that expects a non-None value to handle the chosen default, keeping these consistent with the other .get() usages.
9-10: Missing input validation for CLI arguments.

int(sys.argv[...]) will raise ValueError if non-integer arguments are passed. Since this script is invoked by the benchmark runner, consider adding validation or a try/except:

🛡️ Proposed fix

```diff
 def main():
-    samples = int(sys.argv[1]) if len(sys.argv) > 1 else 5
-    idx = int(sys.argv[2]) if len(sys.argv) > 2 else -1
+    try:
+        samples = int(sys.argv[1]) if len(sys.argv) > 1 else 5
+        idx = int(sys.argv[2]) if len(sys.argv) > 2 else -1
+    except ValueError as e:
+        print(f"Error: invalid argument - {e}", file=sys.stderr)
+        sys.exit(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/load_dataset.py` around lines 9-10, the CLI parsing of samples and idx uses direct int(sys.argv[...]) calls which will raise ValueError on non-integer input; wrap the parsing in a try/except ValueError block (or validate with str.isdigit()/regex) and handle invalid input by printing a clear usage/error message and exiting non-zero, so invalid values fall back to the existing defaults (5 and -1) or force exit, logging the offending argv value for easier debugging.

src/cli.ts (2)
647-648: Manual argument parsing is fragile.

The --output flag is parsed by searching args.indexOf("--output"), which doesn't handle edge cases like --output=json or missing values. This differs from how other CLI options are parsed via parseArgs. Consider reusing the existing parsing infrastructure or documenting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli.ts` around lines 647-648, the manual parsing of the --output flag (variables outputIdx, outputFormat and args) is fragile; update the CLI to use the existing parseArgs infrastructure instead of indexOf or, at minimum, robustly handle --output=json, --output json and the missing-value case: define an --output option (allowed values "json"|"table") and fall back to "table" when not provided or invalid, verifying args[outputIdx+1] exists before using it.
658-661: parseInt returns NaN on invalid input — consider validation.

If a user passes --samples abc, parseInt("abc", 10) returns NaN, which would be passed to runOolongBenchmark. The Python script would then receive "NaN" as a string argument.

🛡️ Proposed fix

```diff
 } else if (mode === "oolong") {
   const samplesIdx = args.indexOf("--samples");
-  const samples = samplesIdx >= 0 ? parseInt(args[samplesIdx + 1], 10) : 5;
+  const samplesRaw = samplesIdx >= 0 ? parseInt(args[samplesIdx + 1], 10) : 5;
+  const samples = Number.isNaN(samplesRaw) ? 5 : samplesRaw;
   const idxArgIdx = args.indexOf("--idx");
-  const idx = idxArgIdx >= 0 ? parseInt(args[idxArgIdx + 1], 10) : undefined;
+  const idxRaw = idxArgIdx >= 0 ? parseInt(args[idxArgIdx + 1], 10) : undefined;
+  const idx = idxRaw !== undefined && Number.isNaN(idxRaw) ? undefined : idxRaw;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli.ts` around lines 658-661, the code uses parseInt for samples and idx but does not validate the results, so invalid inputs produce NaN and propagate to runOolongBenchmark; check for NaN after parseInt and handle it (default samples to 5, set idx to undefined, or reject with a clear error), and ensure runOolongBenchmark receives the sanitized values with a helpful error or usage hint when input is invalid.

tests/benchmark.test.ts (1)
21-26: Path resolution depends on build output structure.

The path join(__dirname, "..", "..", "src", "benchmark-data.json") assumes tests run from dist/tests/. This works but is fragile if the build structure changes. Consider documenting this dependency or using a more robust resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/benchmark.test.ts` around lines 21-26, the path resolution in loadDataset() is fragile because jsonPath assumes tests run from dist/tests; update loadDataset to resolve the JSON robustly (e.g., derive the file location from the test file location using import.meta.url + fileURLToPath, or use require.resolve/paths from process.cwd()) so it works regardless of build output layout, or add a short comment documenting the required build layout if you prefer the current approach.

src/benchmark.ts (1)
186-189: Venv creation commands have no timeout — could hang indefinitely.

If python3 -m venv or pip install datasets hangs (network issues, disk full), the benchmark command will block forever. Consider adding a timeout:

🛡️ Proposed fix

```diff
- await execFileAsync("python3", ["-m", "venv", venvDir]);
- await execFileAsync(join(venvDir, "bin", "pip"), ["install", "datasets"]);
+ await execFileAsync("python3", ["-m", "venv", venvDir], { timeout: 60_000 });
+ await execFileAsync(join(venvDir, "bin", "pip"), ["install", "datasets"], { timeout: 300_000 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/benchmark.ts` around lines 186-189, the two external commands that create the venv and run pip lack timeouts and can hang indefinitely; update both execFileAsync invocations to include a sane timeout option (e.g., { timeout: <ms> }) and handle the resulting timeout error (reject/throw or log and clean up) so the benchmark doesn't block forever, preserving the mkdir(...) call and surrounding error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d04fd3eb-33ea-4c04-97fa-4ae4229a3481
⛔ Files ignored due to path filters (1)

- package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)

- package.json
- python/load_dataset.py
- src/benchmark-data.json
- src/benchmark.ts
- src/cli.ts
- src/config.ts
- src/context.ts
- src/session.ts
- src/version.ts
- tests/benchmark.test.ts
- tests/context.test.ts
- tests/session.test.ts
✅ Files skipped from review due to trivial changes (2)
- src/version.ts
- package.json
```typescript
function formatPct(n: number): string {
  const sign = n > 0 ? "" : "";
  return `${sign}${n.toFixed(1)}%`;
}
```
Unused sign variable in formatPct.

The sign variable is always an empty string (both branches assign ""), making it dead code:

```typescript
const sign = n > 0 ? "" : "";
```

If the intent was to show + for positive savings:

♻️ Proposed fix

```diff
 function formatPct(n: number): string {
-  const sign = n > 0 ? "" : "";
-  return `${sign}${n.toFixed(1)}%`;
+  const sign = n > 0 ? "+" : "";
+  return `${sign}${n.toFixed(1)}%`;
 }
```

Or remove the unused variable entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/benchmark.ts` around lines 352 - 355, In function formatPct replace the
dead sign assignment with a meaningful one or remove it; either remove sign
entirely and return `${n.toFixed(1)}%`, or set sign = n > 0 ? "+" : "" so
positive values are prefixed with "+" (keep function name formatPct and its
single return). Ensure negative numbers still show their built-in "-" from
n.toFixed(1).
The benchmark data file was resolved relative to the source root (../../src/), which doesn't exist in npm installations. It now resolves relative to the compiled JS file (__dirname), and the build script copies the JSON to dist/.

Co-authored-by: Genie <genie@namastex.ai>
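The resolution strategy in this commit can be sketched as follows (the function name and file name here are illustrative, not the exact identifiers in src/benchmark.ts): derive the data file's location from the module's own URL, so the lookup works wherever the compiled file is installed.

```typescript
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";

// Resolve a data file that ships next to the compiled module. In the real
// code this would be called with import.meta.url; it takes the URL as a
// parameter here so the behavior is testable with a fake path.
function resolveDataFile(moduleUrl: string, file = "benchmark-data.json"): string {
  return join(dirname(fileURLToPath(moduleUrl)), file);
}
```

Because the path is anchored to the module rather than the repo layout, an npm install that ships only dist/ still finds the JSON, provided the build copies it there.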
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fa0e0882-12b0-4a1a-89f8-3ddd180788e3
📒 Files selected for processing (3)

- package.json
- src/benchmark.ts
- src/version.ts
✅ Files skipped from review due to trivial changes (1)
- src/version.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- package.json
```typescript
export interface BenchmarkQuestion {
  id: string;
  name: string;
  question: string;
  context: string;
  category: string;
  expected?: string;
}
```
oolong mode never computes an accuracy result.
runOolongBenchmark() never reads q.expected, BenchmarkResults has no place to store correctness, and Line 361 still labels the mode as "oolong accuracy". As written, this mode only reports token/cost/latency data, so it cannot answer the question it claims to benchmark.
Also applies to: 34-68, 202-294, 361-362
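One minimal way to close the gap this comment describes would be a normalized containment check against q.expected. This is a hypothetical sketch, not code from the PR: real Oolong grading may well need an LLM judge or dataset-specific matching, but even a crude scorer gives the mode something to report beyond tokens and cost.

```typescript
// Normalize case and whitespace so superficial formatting differences
// don't count as wrong answers.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns undefined when there is no ground truth for the sample,
// so aggregation can distinguish "ungraded" from "wrong".
function isCorrect(answer: string, expected?: string): boolean | undefined {
  if (!expected) return undefined;
  return normalize(answer).includes(normalize(expected));
}
```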
```typescript
for (const q of dataset) {
  const ctx: LoadedContext = {
    type: "string",
    content: q.context,
    metadata: `benchmark context for "${q.name}" (${q.context.length} chars)`,
  };

  // Direct LLM call
  const directMessages: ChatMessage[] = [
    {
      role: "user",
      content: `Context:\n${q.context}\n\nQuestion: ${q.question}`,
    },
  ];

  const directStart = Date.now();
  const directResp = await llmComplete(directMessages, config.model);
  const directLatency = Date.now() - directStart;

  // RLM call
  const rlmStart = Date.now();
  const rlmResult = await rlmLoop(q.question, ctx, config, {
    maxIterations: config.budget.maxTokens ? 5 : 10,
    timeout: 120_000,
    verbose: false,
    output: "text",
    cache: false,
  });
  const rlmLatency = Date.now() - rlmStart;

  const directTotalTokens = directResp.usage.inputTokens + directResp.usage.outputTokens;
  const rlmTotalTokens = rlmResult.usage.inputTokens + rlmResult.usage.outputTokens;

  runs.push({
    questionId: q.id,
    questionName: q.name,
    direct: {
      tokens_input: directResp.usage.inputTokens,
      tokens_output: directResp.usage.outputTokens,
      cost: directResp.usage.totalCost,
      latency_ms: directLatency,
      answer: directResp.text,
    },
    rlm: {
      tokens_input: rlmResult.usage.inputTokens,
      tokens_output: rlmResult.usage.outputTokens,
      cost: rlmResult.usage.totalCost,
      latency_ms: rlmLatency,
      iterations: rlmResult.iterations,
      answer: rlmResult.answer,
    },
    savings: {
      tokens_pct: calculateSavings(directTotalTokens, rlmTotalTokens),
      cost_pct: calculateCostSavings(directResp.usage.totalCost, rlmResult.usage.totalCost),
    },
  });

  if (options?.outputFormat !== "json") {
    process.stderr.write(`  completed: ${q.name}\n`);
  }
}
```
One failed sample aborts the entire benchmark run.
Each direct/RLM call is awaited inline with no per-question recovery. If a provider call fails near the end of a long run, the function rejects and the completed runs accumulated so far are lost to the caller. Please isolate failures per sample or persist partial results as the loop progresses.
Also applies to: 225-283
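The isolation this comment asks for can be sketched like so (types simplified, names hypothetical — the real loop builds the full direct/rlm/savings entries): each sample runs inside a try/catch, failures become error entries instead of rejecting the whole run, and partial results can be flushed after each sample.

```typescript
interface RunRecord {
  questionId: string;
  error?: string;
}

// Runs every question even if some fail; optionally persists partial
// progress after each sample via the injected flush callback.
async function runAll<Q extends { id: string }>(
  questions: Q[],
  runOne: (q: Q) => Promise<RunRecord>,
  flush?: (partial: RunRecord[]) => Promise<void>,
): Promise<RunRecord[]> {
  const runs: RunRecord[] = [];
  for (const q of questions) {
    try {
      runs.push(await runOne(q));
    } catch (err) {
      // Record the failure and keep going instead of losing completed runs.
      runs.push({ questionId: q.id, error: err instanceof Error ? err.message : String(err) });
    }
    // Persist partial results so a crash doesn't lose accumulated samples.
    await flush?.(runs);
  }
  return runs;
}
```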
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/benchmark.ts` around lines 100 - 160, The loop currently awaits
llmComplete and rlmLoop inline so a single provider error will reject the whole
benchmark and drop accumulated runs; wrap the per-question work (the calls to
llmComplete and rlmLoop and the runs.push) in a try/catch around the body that
builds ctx/directMessages/directStart/etc., on error push a fallback run entry
(referencing questionId/q.name and fields like direct, rlm, savings) containing
error details and null/zeros for metrics and/or an error field, and continue;
additionally after each successful runs.push persist or flush partial results
(e.g., call a savePartialRuns helper) so progress is not lost if the process
crashes.
```typescript
function findLoadDatasetScript(): string {
  // The script is at python/load_dataset.py relative to package root
  // From dist/src/benchmark.js, package root is ../../
  const thisDir = dirname(fileURLToPath(import.meta.url));
  return join(thisDir, "..", "..", "python", "load_dataset.py");
}
```
findLoadDatasetScript() only works from dist/src.
Line 199 walks up two directories, which is correct for dist/src/benchmark.js but overshoots the package root when this module is executed from src/benchmark.ts. In that layout, runOolongBenchmark() resolves python/load_dataset.py outside the repo and fails with ENOENT. Probe both ../python and ../../python, then throw a clear error if neither exists.
🛠️ One way to make the lookup work in both layouts

```diff
+import { existsSync } from "node:fs";
 ...
 function findLoadDatasetScript(): string {
   // The script is at python/load_dataset.py relative to package root
-  // From dist/src/benchmark.js, package root is ../../
   const thisDir = dirname(fileURLToPath(import.meta.url));
-  return join(thisDir, "..", "..", "python", "load_dataset.py");
+  const candidates = [
+    join(thisDir, "..", "python", "load_dataset.py"),
+    join(thisDir, "..", "..", "python", "load_dataset.py"),
+  ];
+  const scriptPath = candidates.find((candidate) => existsSync(candidate));
+  if (!scriptPath) {
+    throw new Error("Could not locate python/load_dataset.py");
+  }
+  return scriptPath;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/benchmark.ts` around lines 195 - 199, The findLoadDatasetScript()
function currently assumes the module lives in dist/src and always walks up two
directories, which fails when running from src/benchmark.ts; modify
findLoadDatasetScript() to probe both candidate paths (join(thisDir, "..",
"python", "load_dataset.py") and join(thisDir, "..", "..", "python",
"load_dataset.py")) and return the first that exists, and if neither exists
throw a clear Error mentioning both tried paths and that runOolongBenchmark()
could not locate python/load_dataset.py; use fs.existsSync or fs.promises.access
to check existence and keep the function name findLoadDatasetScript and its
return semantics unchanged so callers like runOolongBenchmark() continue to
work.
Systems without python3-venv installed would fail to create the benchmark Python environment. Now tries uv first (preferred), then falls back to python3 -m venv + pip. Co-authored-by: Genie <genie@namastex.ai>
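The fallback strategy this commit describes can be sketched with an injected exec function (the exec signature and helper name are hypothetical; the real code shells out via execFile): try uv first, and only on failure fall back to python3 -m venv plus pip.

```typescript
type Exec = (cmd: string, args: string[]) => Promise<void>;

// Returns which toolchain succeeded. The uv flags shown ("uv venv",
// "uv pip install --python ...") are an assumption about the exact
// invocation; the fallback path mirrors the original python3 -m venv flow.
async function createBenchVenv(venvDir: string, exec: Exec): Promise<"uv" | "venv"> {
  try {
    await exec("uv", ["venv", venvDir]);
    await exec("uv", ["pip", "install", "--python", `${venvDir}/bin/python`, "datasets"]);
    return "uv";
  } catch {
    await exec("python3", ["-m", "venv", venvDir]);
    await exec(`${venvDir}/bin/pip`, ["install", "datasets"]);
    return "venv";
  }
}
```

Injecting exec keeps the control flow testable without actually creating environments.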
The dataset uses "context_window_text" not "context", and "answer" can be a list. Also maps task_group for category. Co-authored-by: Genie <genie@namastex.ai>
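The normalization this commit describes (the fix itself lives in python/load_dataset.py) can be illustrated in TypeScript; the helper names here are hypothetical:

```typescript
// "answer" may be a string or a list in the Oolong records; flatten
// lists into one string for downstream comparison.
function normalizeAnswer(answer: unknown): string {
  if (Array.isArray(answer)) return answer.map(String).join("; ");
  return answer == null ? "" : String(answer);
}

// Map a raw dataset record using the actual field names the commit notes:
// "context_window_text" for the context, list-or-string "answer".
function mapRecord(item: Record<string, unknown>): { context: string; expected: string } {
  return {
    context: String(item["context_window_text"] ?? ""),
    expected: normalizeAnswer(item["answer"]),
  };
}
```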
Comprehensive QA Report — Phase 1 Features

All features have been tested end-to-end.

QA Results
Bugs Found & Fixed
Included Changes (dev → main)
Ready for human review and merge to main.
…ext handling Add pgserve as a project dependency for embedded PostgreSQL storage. Extend rlmx.yaml schema with a `storage` section supporting enabled mode (auto/always/never), persistent/memory mode, data directory, port, chunk sizing, and token estimation parameters. All fields have sensible defaults and descriptive validation errors.
…ontexts Wire validateContextSize() into runQuery() and runBatchCommand() execution paths. When context exceeds the model's token limit, cache mode is automatically disabled with a stderr warning. Storage mode is signaled when storage.enabled is 'auto' or 'always'. Add storageMode option to RLMOptions interface for future pgserve routing. Export PROVIDER_LIMITS from cache.ts for reuse by other modules.
…ingestion Create src/storage.ts with PgStorage class that:

- Spawns pgserve as a child process (memory or persistent mode)
- Creates records table with GIN-indexed tsvector for full-text search
- Ingests context as JSONL (auto-extracting timestamp/type fields) or plain text
- Provides search(), slice(), timeRange(), count(), query() methods
- Graceful shutdown with 3s SIGTERM timeout then SIGKILL
- Process cleanup on exit/SIGTERM/SIGINT/uncaughtException
- Adaptive chunk sizing via getChunkSize() using PROVIDER_LIMITS

Also adds pg and @types/pg dependencies.
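The schema this commit describes could look roughly like the following. This is a hypothetical sketch of the DDL and search query, not the actual statements in src/storage.ts; column names beyond those mentioned above are assumptions.

```typescript
// Records table with a generated tsvector column and GIN index,
// matching the "GIN-indexed tsvector for full-text search" described above.
const CREATE_RECORDS = `
  CREATE TABLE IF NOT EXISTS records (
    id BIGSERIAL PRIMARY KEY,
    ts TIMESTAMPTZ,
    type TEXT,
    content TEXT NOT NULL,
    tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
  );
  CREATE INDEX IF NOT EXISTS records_tsv_idx ON records USING GIN (tsv);
`;

// A ranked, parameterized full-text query the search() method might issue.
const SEARCH_SQL = `
  SELECT id, ts, type, content
  FROM records
  WHERE tsv @@ plainto_tsquery('english', $1)
  ORDER BY ts_rank(tsv, plainto_tsquery('english', $1)) DESC
  LIMIT $2;
`;
```

Parameterizing the query term ($1) rather than interpolating it matters for the SQL-injection fixes discussed later in this thread.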
…orage queries Create python/pg_batteries.py with pg_search(), pg_slice(), pg_time(), pg_count(), pg_query() functions that communicate via IPC to Node.js PgStorage. Results > 2000 chars are auto-truncated with stub format showing first 5 items. Add pg_* request types to IPC protocol (ipc.ts), route them through handleLLMRequest() to PgStorage methods (llm.ts), and add loadPgBatteries option to REPL start (repl.ts) for storage mode.
Add rlmx_sessions and rlmx_events tables to pgserve schema (created alongside records table on start). Create v_cost_breakdown and v_repl_usage aggregation views. New src/observe.ts with ObservabilityRecorder class that records LLM calls, REPL executions, and sub-calls as fire-and-forget inserts. Session lifecycle tracked from start through completion/failure. All recording errors logged to stderr but never block the run.
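The fire-and-forget guarantee described above ("all recording errors logged to stderr but never block the run") can be sketched as follows; the function name and injected parameters are hypothetical:

```typescript
// Detach the insert promise so recording never blocks the caller, and
// downgrade any failure to a stderr warning.
function recordEvent(
  insert: () => Promise<void>,
  warn: (msg: string) => void = (m) => process.stderr.write(m + "\n"),
): void {
  void insert().catch((err) =>
    warn(`observability insert failed: ${err instanceof Error ? err.message : String(err)}`),
  );
}
```

Because the returned promise is deliberately not awaited, a slow or dead database cannot stall an LLM call; the trade-off is that events may be lost on abrupt process exit.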
When storageMode is active in rlmLoop():

- Creates PgStorage, ingests context into Postgres
- Creates ObservabilityRecorder, records session start
- Passes storage to REPL (loadPgBatteries) and handleLLMRequest
- Skips raw context injection, uses storage metadata instead
- Appends pg_* tool instructions to system prompt with record count
- Records every LLM call, REPL execution, and sub-call to observer
- Records final answer or error on completion
- Cleans up storage on finalize and error paths

Batch engine (batch.ts) now supports a storageMode option, passing cache/storage flags through to each rlmLoop call. CLI (cli.ts) forces storageMode when storage.enabled is 'always', and passes storageMode to batch when context exceeds limits.
New src/stats.ts with query functions (listRuns, getRun, costBreakdown, toolUsage) and terminal table formatting. Connects to persistent pgserve data at ~/.rlmx/data temporarily to query rlmx_sessions and rlmx_events. CLI commands:

- rlmx stats — table of last 20 runs
- rlmx stats --run <id> — per-event detail for one run
- rlmx stats --costs — cost breakdown by model
- rlmx stats --tools — sub-call usage and error rates
- rlmx stats --since 24h — filter by time window
- rlmx stats --output json — JSON output mode

Graceful "No stats yet" message when ~/.rlmx/data doesn't exist.
FIX 1 (CRITICAL): Eliminate SQL injection in stats queries.

- PgStorage.query() now accepts an optional params array, passed through to client.query(sql, params) for parameterized execution.
- stats.ts getRun(): uses a $1 parameter for session_id instead of string interpolation with manual escaping.
- stats.ts listRuns(): parameterized LIMIT.
- stats.ts costBreakdown()/toolUsage(): compute the cutoff timestamp in JS via parseSinceCutoff(), pass it as a $1 parameter instead of interpolating SQL interval strings.

FIX 2 (HIGH): Proper port auto-assignment when config.port is 0.

- Added findFreePort() helper that binds a temporary TCP server to port 0, reads the OS-assigned port, then closes the server.
- When config.port is 0, uses findFreePort() instead of falling back to hardcoded 8432, preventing conflicts with other instances.
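The findFreePort() technique described in FIX 2 can be sketched as follows — a minimal standalone version of the bind-to-port-0 pattern; the actual helper in the PR may differ in detail:

```typescript
import { createServer } from "node:net";

// Bind a throwaway TCP server to port 0 so the OS assigns a free port,
// read the assigned port back, then close the server before returning.
function findFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const srv = createServer();
    srv.once("error", reject);
    srv.listen(0, () => {
      const addr = srv.address();
      if (addr && typeof addr === "object") {
        const port = addr.port;
        srv.close(() => resolve(port));
      } else {
        srv.close(() => reject(new Error("could not read assigned port")));
      }
    });
  });
}
```

Note the inherent race: the port is free when the probe server closes, but another process could grab it before pgserve binds it. That window is small and usually acceptable for local tooling.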
…olumn, binary resolution

- TRUNCATE records before ingest to prevent stale data in persistent mode
- Reject SQL with semicolons in query() to prevent injection via chained statements
- Add source TEXT column to records schema, populate from ContextItem.path
- Resolve pgserve binary relative to package install via import.meta.url
- Use config dataDir in stats.ts instead of hardcoded path
- Check process.exitCode in waitForReady loop for fast failure
- Add public send_request() to llm_bridge, update pg/gemini_batteries
- Remove semicolon check in pg_query (READ ONLY wrapper suffices)
- Stats loads actual rlmx.yaml config instead of defaults
- Expose source column in pg_search/pg_slice, add pg_sources()
- Use typeof exitCode === 'number' guard in waitForReady
- Add session_id column to records table and ingest()
feat: embedded pgserve storage for large context handling + observability
Summary
Promotes dev to main with all changes since last stable release.
Highlights
PRs Included
Test Plan
Summary by CodeRabbit