benchmark/locomo/mem0/README.md
# LoCoMo Benchmark — mem0 Evaluation

Evaluate mem0 memory retrieval on the [LoCoMo](https://github.com/snap-stanford/locomo) benchmark using OpenClaw as the agent.

## Overview

Two-phase pipeline:

1. **Ingest** — Import LoCoMo conversations into mem0 (one `user_id` per sample)
2. **Eval** — Send QA questions to OpenClaw agent (which recalls from mem0), then judge answers with an LLM

## Prerequisites

- [OpenClaw](https://openclaw.ai) installed and configured
- `openclaw-mem0` plugin installed (`~/.openclaw/extensions/openclaw-mem0`)
- `~/.openclaw/openclaw.json` with `plugins.slots.memory = "openclaw-mem0"`
- API keys in `~/.openviking_benchmark_env`:

```env
MEM0_API_KEY=m0-...
ARK_API_KEY=... # Volcengine ARK, used for judge LLM
```

- Python dependencies:

```bash
uv sync --frozen --extra dev
```
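
The plugin-slot setting listed above corresponds to a minimal `~/.openclaw/openclaw.json` fragment along these lines (a sketch — the nesting shown is inferred from the dotted path `plugins.slots.memory`, and a real config file will contain additional keys):

```json
{
  "plugins": {
    "slots": {
      "memory": "openclaw-mem0"
    }
  }
}
```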

## Data

LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`:

- 10 samples (conversations between two people)
- 1986 QA pairs across 5 categories:
- 1: single-hop
- 2: multi-hop
- 3: temporal
- 4: world-knowledge
- 5: adversarial (skipped by default)
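
The per-category counts can be verified with a short script. This is a sketch: it assumes each sample in `locomo10.json` carries a `qa` list whose items have an integer `category` field (`sample_id` is confirmed by `delete_user.py`; the `qa` key is an assumption).

```python
import json
from collections import Counter

# Map LoCoMo category codes to names (codes 1-5 as listed above).
CATEGORY_NAMES = {1: "single-hop", 2: "multi-hop", 3: "temporal",
                  4: "world-knowledge", 5: "adversarial"}

def count_categories(path: str) -> Counter:
    """Tally QA pairs per category across all samples in the dataset."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    counts: Counter = Counter()
    for sample in samples:
        # "qa" key is assumed; adjust if the dataset uses a different field.
        for qa in sample.get("qa", []):
            counts[CATEGORY_NAMES.get(qa["category"], "unknown")] += 1
    return counts
```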

## Step 1 — Ingest

Import conversations into mem0. Each sample is stored under `user_id = sample_id` (e.g. `conv-26`).

```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID (e.g. `conv-26`) or index (0-based). Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |

Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
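
The core of the per-session ingest step can be sketched as follows. This is not `ingest.py` itself: the turn keys (`speaker`, `text`) and the role mapping are assumptions — see `ingest.py` for the real field mapping. `client` is a `mem0.MemoryClient`.

```python
def ingest_session(client, sample_id: str, turns: list[dict]) -> None:
    """Store one conversation session in mem0 under user_id=sample_id.

    `client` is a mem0 MemoryClient. Turn keys ("speaker", "text") are an
    assumption about the LoCoMo session format, not a documented contract.
    """
    # mem0 accepts chat-style messages; prefix each utterance with its speaker
    # so both participants remain distinguishable in a single user's memory.
    messages = [
        {"role": "user", "content": f'{t["speaker"]}: {t["text"]}'}
        for t in turns
    ]
    client.add(messages, user_id=sample_id)  # one user_id per sample
```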

## Step 2 — Eval

Send QA questions to OpenClaw agent and optionally judge answers.

Before each sample, `eval.py` automatically:
1. Updates `~/.openclaw/openclaw.json` to set `openclaw-mem0.config.userId = sample_id`
2. Restarts the OpenClaw gateway to pick up the new config
3. Verifies the correct `userId` is active via a dummy request
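
Step 1 above amounts to a small JSON edit. A minimal sketch, assuming the dotted path `openclaw-mem0.config.userId` maps literally onto nested JSON keys (`eval.py` is authoritative for the real layout):

```python
import json
from pathlib import Path

def set_mem0_user(config_path: Path, sample_id: str) -> None:
    """Point the openclaw-mem0 plugin at a new user_id in openclaw.json."""
    cfg = json.loads(config_path.read_text(encoding="utf-8"))
    # Nesting assumed from the dotted path "openclaw-mem0.config.userId".
    plugin = cfg.setdefault("openclaw-mem0", {})
    plugin.setdefault("config", {})["userId"] = sample_id
    config_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
```

After this write, the gateway still has to be restarted (step 2) before the new `userId` takes effect.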

```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA, only grade ungraded rows in existing CSV |
| `--no-skip-adversarial` | Include category-5 adversarial questions |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |

Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.
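
The cleanup-and-retry behavior can be sketched like this (an illustration, not the code in `eval.py`; it assumes failed rows are marked by a `response` value starting with `[ERROR]`, and uses the column names from the Output table below):

```python
import csv
from pathlib import Path

def drop_error_rows(csv_path: Path) -> int:
    """Remove rows whose response starts with "[ERROR]" so they get retried.

    Returns the number of rows dropped.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fields = list(reader.fieldnames or [])
        rows = list(reader)
    kept = [r for r in rows if not (r.get("response") or "").startswith("[ERROR]")]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(kept)
    return len(rows) - len(kept)
```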

## Output

`result/qa_results.csv` columns:

| Column | Description |
|--------|-------------|
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage (all turns summed) |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
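
The per-category accuracy numbers in the summary can be recomputed directly from these columns — a sketch (`eval.py` prints the official figures):

```python
import csv
from collections import defaultdict

def accuracy_by_category(csv_path: str) -> dict[str, tuple[int, int]]:
    """Return {category_name: (correct, total)} from qa_results.csv."""
    stats: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["category_name"]
            stats[name][1] += 1
            if row["result"] == "CORRECT":
                stats[name][0] += 1
    return {name: (correct, total) for name, (correct, total) in stats.items()}
```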

## Summary Output

After eval completes:

```
=== Token & Latency Summary ===
Total input tokens : 123456
Avg time per query : 18.3s

=== Accuracy Summary ===
Overall: 512/1540 = 33.25%
By category:
multi-hop : 120/321 = 37.38%
single-hop : 98/282 = 34.75%
temporal : 28/96 = 29.17%
world-knowledge : 266/841 = 31.63%
```

## Delete mem0 Data

```bash
# Delete a specific sample
python delete_user.py conv-26

# Delete all samples from the dataset
python delete_user.py --from-data

# Delete first N samples
python delete_user.py --from-data --limit 3
```
benchmark/locomo/mem0/delete_user.py
"""
Delete all memories for one or more mem0 users.

Usage:
    # Delete a single user
    python delete_user.py conv-26

    # Delete multiple users
    python delete_user.py conv-26 conv-31 conv-45

    # Delete first N users from locomo10.json
    python delete_user.py --from-data --limit 2

    # Delete all users from locomo10.json
    python delete_user.py --from-data
"""

import argparse
import json
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(Path.home() / ".openviking_benchmark_env")

try:
    from mem0 import MemoryClient
except ImportError:
    print("Error: mem0 package not installed. Run: pip install mem0ai", file=sys.stderr)
    sys.exit(1)

SCRIPT_DIR = Path(__file__).parent.resolve()
DEFAULT_DATA_PATH = str(SCRIPT_DIR / ".." / "data" / "locomo10.json")


def delete_user(client: MemoryClient, user_id: str) -> bool:
    try:
        client.delete_all(user_id=user_id)
        print(f"  [OK] {user_id}")
        return True
    except Exception as e:
        print(f"  [ERROR] {user_id}: {e}")
        return False


def main() -> None:
    parser = argparse.ArgumentParser(description="Delete all mem0 memories for given user(s)")
    parser.add_argument("users", nargs="*", help="user_id(s) to delete (e.g. conv-26 conv-31)")
    parser.add_argument("--api-key", default=None, help="mem0 API key (or MEM0_API_KEY env var)")
    parser.add_argument("--from-data", action="store_true", help="load user_ids from locomo10.json")
    parser.add_argument("--input", default=DEFAULT_DATA_PATH, help="path to locomo10.json")
    parser.add_argument("--limit", type=int, default=None, help="max users to delete (with --from-data)")
    args = parser.parse_args()

    api_key = args.api_key or os.environ.get("MEM0_API_KEY", "")
    if not api_key:
        print("Error: mem0 API key required (--api-key or MEM0_API_KEY env var)", file=sys.stderr)
        sys.exit(1)

    # Collect user_ids from the command line; sample_ids double as mem0 user_ids
    user_ids: list[str] = list(args.users)

    if args.from_data:
        with open(args.input, "r", encoding="utf-8") as f:
            data = json.load(f)
        if args.limit:
            data = data[: args.limit]
        user_ids += [s["sample_id"] for s in data]

    if not user_ids:
        print("Error: no users specified. Pass user_ids or use --from-data", file=sys.stderr)
        sys.exit(1)

    user_ids = list(dict.fromkeys(user_ids))  # deduplicate, preserve order
    print(f"Deleting memories for {len(user_ids)} user(s)...")

    client = MemoryClient(api_key=api_key)
    ok = sum(delete_user(client, uid) for uid in user_ids)
    print(f"\nDone: {ok}/{len(user_ids)} succeeded")


if __name__ == "__main__":
    main()