benchmark/locomo/mem0/README.md
# LoCoMo Benchmark — mem0 Evaluation

Evaluate mem0 memory retrieval on the [LoCoMo](https://github.com/snap-stanford/locomo) benchmark using OpenClaw as the agent.

## Overview

Two-phase pipeline:

1. **Ingest** — Import LoCoMo conversations into mem0 (one `user_id` per sample)
2. **Eval** — Send QA questions to OpenClaw agent (which recalls from mem0), then judge answers with an LLM

## Prerequisites

- [OpenClaw](https://openclaw.ai) installed and configured
- `openclaw-mem0` plugin installed (`~/.openclaw/extensions/openclaw-mem0`)
- `~/.openclaw/openclaw.json` with `plugins.slots.memory = "openclaw-mem0"`
- API keys in `~/.openviking_benchmark_env`:

```env
MEM0_API_KEY=m0-...
ARK_API_KEY=... # Volcengine ARK, used for judge LLM
```

- Python dependencies:

```bash
uv sync --frozen --extra dev
```
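
The plugin-slot setting listed above corresponds to a minimal `~/.openclaw/openclaw.json` fragment along these lines (a sketch — the nesting shown is inferred from the dotted path `plugins.slots.memory`, and a real config file will contain additional keys):

```json
{
  "plugins": {
    "slots": {
      "memory": "openclaw-mem0"
    }
  }
}
```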

## Data

LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`:

- 10 samples (conversations between two people)
- 1986 QA pairs across 5 categories:
- 1: single-hop
- 2: multi-hop
- 3: temporal
- 4: world-knowledge
- 5: adversarial (skipped by default)
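
The per-category counts can be verified with a short script. This is a sketch: it assumes each sample in `locomo10.json` carries a `qa` list whose items have an integer `category` field (`sample_id` is confirmed by `delete_user.py`; the `qa` key is an assumption).

```python
import json
from collections import Counter

# Map LoCoMo category codes to names (codes 1-5 as listed above).
CATEGORY_NAMES = {1: "single-hop", 2: "multi-hop", 3: "temporal",
                  4: "world-knowledge", 5: "adversarial"}

def count_categories(path: str) -> Counter:
    """Tally QA pairs per category across all samples in the dataset."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    counts: Counter = Counter()
    for sample in samples:
        # "qa" key is assumed; adjust if the dataset uses a different field.
        for qa in sample.get("qa", []):
            counts[CATEGORY_NAMES.get(qa["category"], "unknown")] += 1
    return counts
```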

## Step 1 — Ingest

Import conversations into mem0. Each sample is stored under `user_id = sample_id` (e.g. `conv-26`).

```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID (e.g. `conv-26`) or index (0-based). Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |

Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
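
The core of the per-session ingest step can be sketched as follows. This is not `ingest.py` itself: the turn keys (`speaker`, `text`) and the role mapping are assumptions — see `ingest.py` for the real field mapping. `client` is a `mem0.MemoryClient`.

```python
def ingest_session(client, sample_id: str, turns: list[dict]) -> None:
    """Store one conversation session in mem0 under user_id=sample_id.

    `client` is a mem0 MemoryClient. Turn keys ("speaker", "text") are an
    assumption about the LoCoMo session format, not a documented contract.
    """
    # mem0 accepts chat-style messages; prefix each utterance with its speaker
    # so both participants remain distinguishable in a single user's memory.
    messages = [
        {"role": "user", "content": f'{t["speaker"]}: {t["text"]}'}
        for t in turns
    ]
    client.add(messages, user_id=sample_id)  # one user_id per sample
```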

## Step 2 — Eval

Send QA questions to OpenClaw agent and optionally judge answers.

Before each sample, `eval.py` automatically:
1. Updates `~/.openclaw/openclaw.json` to set `openclaw-mem0.config.userId = sample_id`
2. Restarts the OpenClaw gateway to pick up the new config
3. Verifies the correct `userId` is active via a dummy request
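
Step 1 above amounts to a small JSON edit. A minimal sketch, assuming the dotted path `openclaw-mem0.config.userId` maps literally onto nested JSON keys (`eval.py` is authoritative for the real layout):

```python
import json
from pathlib import Path

def set_mem0_user(config_path: Path, sample_id: str) -> None:
    """Point the openclaw-mem0 plugin at a new user_id in openclaw.json."""
    cfg = json.loads(config_path.read_text(encoding="utf-8"))
    # Nesting assumed from the dotted path "openclaw-mem0.config.userId".
    plugin = cfg.setdefault("openclaw-mem0", {})
    plugin.setdefault("config", {})["userId"] = sample_id
    config_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
```

After this write, the gateway still has to be restarted (step 2) before the new `userId` takes effect.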

```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA, only grade ungraded rows in existing CSV |
| `--no-skip-adversarial` | Include category-5 adversarial questions |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |

Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.
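
The cleanup-and-retry behavior can be sketched like this (an illustration, not the code in `eval.py`; it assumes failed rows are marked by a `response` value starting with `[ERROR]`, and uses the column names from the Output table below):

```python
import csv
from pathlib import Path

def drop_error_rows(csv_path: Path) -> int:
    """Remove rows whose response starts with "[ERROR]" so they get retried.

    Returns the number of rows dropped.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fields = list(reader.fieldnames or [])
        rows = list(reader)
    kept = [r for r in rows if not (r.get("response") or "").startswith("[ERROR]")]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(kept)
    return len(rows) - len(kept)
```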

## Output

`result/qa_results.csv` columns:

| Column | Description |
|--------|-------------|
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage (all turns summed) |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
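
The per-category accuracy numbers in the summary can be recomputed directly from these columns — a sketch (`eval.py` prints the official figures):

```python
import csv
from collections import defaultdict

def accuracy_by_category(csv_path: str) -> dict[str, tuple[int, int]]:
    """Return {category_name: (correct, total)} from qa_results.csv."""
    stats: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["category_name"]
            stats[name][1] += 1
            if row["result"] == "CORRECT":
                stats[name][0] += 1
    return {name: (correct, total) for name, (correct, total) in stats.items()}
```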

## Summary Output

After eval completes:

```
=== Token & Latency Summary ===
Total input tokens : 123456
Avg time per query : 18.3s

=== Accuracy Summary ===
Overall: 512/1540 = 33.25%
By category:
multi-hop : 120/321 = 37.38%
single-hop : 98/282 = 34.75%
temporal : 28/96 = 29.17%
world-knowledge : 266/841 = 31.63%
```

## Delete mem0 Data

```bash
# Delete a specific sample
python delete_user.py conv-26

# Delete all samples from the dataset
python delete_user.py --from-data

# Delete first N samples
python delete_user.py --from-data --limit 3
```
benchmark/locomo/mem0/delete_user.py
"""
Delete all memories for one or more mem0 users.

Usage:
    # Delete a single user
    python delete_user.py conv-26

    # Delete multiple users
    python delete_user.py conv-26 conv-31 conv-45

    # Delete first N users from locomo10.json
    python delete_user.py --from-data --limit 2

    # Delete all users from locomo10.json
    python delete_user.py --from-data
"""

import argparse
import json
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(Path.home() / ".openviking_benchmark_env")

try:
    from mem0 import MemoryClient
except ImportError:
    print("Error: mem0 package not installed. Run: pip install mem0ai", file=sys.stderr)
    sys.exit(1)

SCRIPT_DIR = Path(__file__).parent.resolve()
DEFAULT_DATA_PATH = str(SCRIPT_DIR / ".." / "data" / "locomo10.json")


def delete_user(client: MemoryClient, user_id: str) -> bool:
    try:
        client.delete_all(user_id=user_id)
        print(f"  [OK] {user_id}")
        return True
    except Exception as e:
        print(f"  [ERROR] {user_id}: {e}")
        return False


def main() -> None:
    parser = argparse.ArgumentParser(description="Delete all mem0 memories for given user(s)")
    parser.add_argument("users", nargs="*", help="user_id(s) to delete (e.g. conv-26 conv-31)")
    parser.add_argument("--api-key", default=None, help="mem0 API key (or MEM0_API_KEY env var)")
    parser.add_argument("--from-data", action="store_true", help="load user_ids from locomo10.json")
    parser.add_argument("--input", default=DEFAULT_DATA_PATH, help="path to locomo10.json")
    parser.add_argument("--limit", type=int, default=None, help="max users to delete (with --from-data)")
    args = parser.parse_args()

    api_key = args.api_key or os.environ.get("MEM0_API_KEY", "")
    if not api_key:
        print("Error: mem0 API key required (--api-key or MEM0_API_KEY env var)", file=sys.stderr)
        sys.exit(1)

    # Collect user_ids from the command line; sample_ids double as mem0 user_ids
    user_ids: list[str] = list(args.users)

    if args.from_data:
        with open(args.input, "r", encoding="utf-8") as f:
            data = json.load(f)
        if args.limit:
            data = data[: args.limit]
        user_ids += [s["sample_id"] for s in data]

    if not user_ids:
        print("Error: no users specified. Pass user_ids or use --from-data", file=sys.stderr)
        sys.exit(1)

    user_ids = list(dict.fromkeys(user_ids))  # deduplicate, preserve order
    print(f"Deleting memories for {len(user_ids)} user(s)...")

    client = MemoryClient(api_key=api_key)
    ok = sum(delete_user(client, uid) for uid in user_ids)
    print(f"\nDone: {ok}/{len(user_ids)} succeeded")


if __name__ == "__main__":
    main()