Add foundational support for durable storage #135

Draft
krisztianfekete wants to merge 1 commit into main from feature/add-durable-storage

krisztianfekete commented May 4, 2026

This PR is opt-in (AGENTEVALS_STORAGE_BACKEND=postgres), so the existing in-memory developer experience is unchanged: agentevals run trace.json keeps working, the React UI behaves identically, OTLP streaming is untouched.
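
As a rough illustration of the opt-in behaviour described above, backend selection could look something like the following. This is a minimal sketch, not the PR's code: `resolve_backend` is a hypothetical helper, and both the `"memory"` name for the default backend and the rejection of unknown values are assumptions. Only the `AGENTEVALS_STORAGE_BACKEND=postgres` setting comes from the description.

```python
import os

# Assumption: "memory" is the name of the default backend. The PR text only
# confirms that AGENTEVALS_STORAGE_BACKEND=postgres opts in to durable storage.
KNOWN_BACKENDS = {"memory", "postgres"}


def resolve_backend(env=None):
    """Return the storage backend name, defaulting to in-memory when unset."""
    env = os.environ if env is None else env
    value = env.get("AGENTEVALS_STORAGE_BACKEND", "").strip().lower()
    if not value:
        # Zero-config path: nothing set, so nothing changes for existing users.
        return "memory"
    if value not in KNOWN_BACKENDS:
        # Hypothetical behaviour: fail fast on typos instead of silently
        # falling back to the non-persistent backend.
        raise ValueError(f"unknown storage backend: {value!r}")
    return value
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching the real process environment.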

Setup

uv lock                        # picks up the new [postgres] extra
uv sync --extra postgres       # installs asyncpg
make pg-up                     # boots postgres:17-alpine, waits for pg_isready (idempotent)
make migrate                   # applies 000001_init; idempotent on replay
make dev-backend-pg            # serves with backend=postgres + worker pool

Look for these log lines on startup:

INFO:agentevals.api.app:Applying any pending migrations to schema 'agentevals'
INFO:agentevals.storage.postgres.pool:Creating asyncpg pool (min=4, max=12) for schema 'agentevals'
INFO:agentevals.run.worker:Started 4 run worker(s) (lease=30s, heartbeat=5s, deadline=300s)

The async run pipeline (POST /api/runs)

Submit a run, watch the worker pick it up, read the persisted results back:

RUN_ID=$(uuidgen | tr 'A-Z' 'a-z')
INLINE=$(cat samples/helm.json)
cat > /tmp/req.json <<EOF
{"runId": "$RUN_ID",
"spec": {"approach": "trace_replay",
"target": {"kind": "inline", "inline": $INLINE},
"evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}
EOF

curl -s -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req.json | jq .data.status
# expect: "queued"

sleep 3
curl -s "http://localhost:8001/api/runs/$RUN_ID" | jq .data.status
# expect: "succeeded"

curl -s "http://localhost:8001/api/runs/$RUN_ID/results" | jq '.data | length'
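
The fixed `sleep 3` above is fine for a small trace, but a scripted check may prefer to poll until the run reaches a terminal status. A minimal sketch in Python, assuming the `GET /api/runs/{id}` response shape shown above (`.data.status`): `wait_for_terminal` and `http_status_fetcher` are hypothetical helpers, and `"failed"` is an assumed terminal status (only `"succeeded"` and `"cancelled"` appear in this description).

```python
import json
import time
import urllib.request

# "succeeded" and "cancelled" appear in the PR description; "failed" is assumed.
TERMINAL = {"succeeded", "failed", "cancelled"}


def wait_for_terminal(fetch_status, timeout=30.0, interval=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout}s")
        sleep(interval)


def http_status_fetcher(base_url, run_id):
    """Build a fetch_status callable that reads .data.status from the API."""
    def fetch():
        with urllib.request.urlopen(f"{base_url}/api/runs/{run_id}") as resp:
            return json.load(resp)["data"]["status"]
    return fetch
```

Usage against a local server would be along the lines of `wait_for_terminal(http_status_fetcher("http://localhost:8001", run_id))`.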

Idempotency, 409, and cancel

# Idempotent re-submit (HTTP 202)
curl -s -i -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req.json | head -1

# Different spec, same id (HTTP 409)
sed 's/tool_trajectory_avg_score/response_match_score/' /tmp/req.json > /tmp/req2.json
curl -s -i -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req2.json | head -3

# Cancel (returns "cancelled" if you race the worker, otherwise the terminal status)
curl -s -X POST "http://localhost:8001/api/runs/$RUN_ID/cancel" | jq -r '.data.status'
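
The 202-versus-409 behaviour above can be thought of as comparing a fingerprint of the submitted spec against the one stored under the same run id. The following is an in-memory sketch of that idea, not the PR's implementation: `spec_fingerprint`, `submit`, and the dict-based `store` are hypothetical, and returning 202 for the first submission is an assumption (the description only states the status code for the re-submit).

```python
import hashlib
import json


def spec_fingerprint(spec):
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def submit(store, run_id, spec):
    """Return (http_status, run_status) for a submission against `store`."""
    fingerprint = spec_fingerprint(spec)
    existing = store.get(run_id)
    if existing is None:
        # First time this id is seen: enqueue it (status code assumed).
        store[run_id] = {"fingerprint": fingerprint, "status": "queued"}
        return 202, "queued"
    if existing["fingerprint"] == fingerprint:
        # Same id, same spec: idempotent replay, report the current status.
        return 202, existing["status"]
    # Same id, different spec: conflict.
    return 409, existing["status"]
```

Hashing a canonical encoding rather than comparing raw request bodies means that two requests differing only in whitespace or key order are treated as the same spec.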

Existing /api/evaluate flows persist when backend=postgres

UI uploads, multipart curl, SSE stream, and the JSON variant all now write a Run row plus Result rows. The response carries an extra runId field that wasn't there before. No UI changes required.

# Multipart (UI uses this)
curl -s -X POST http://localhost:8001/api/evaluate \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' | jq .data.runId

# SSE stream
curl -N -X POST http://localhost:8001/api/evaluate/stream \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' \
    | grep '"done": true' | head -1 | sed 's/^data: //' | jq .result.runId

# JSON body
.venv/bin/python -c 'import json; t=json.load(open("samples/helm.json"));
print(json.dumps({"traces":t,"config":{"metrics":["tool_trajectory_avg_score"]}}))' > /tmp/json_req.json
curl -s -X POST http://localhost:8001/api/evaluate/json -H 'content-type: application/json' -d @/tmp/json_req.json | jq .data.runId

Each call yields a new run row with target.kind = "uploaded". That's the OSS user-facing benefit of this PR: persistent run history for any eval that flows through the existing endpoints.
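
For scripts that consume the SSE variant without `grep`/`sed`/`jq`, the same extraction can be sketched in Python. The event shape used here (`"done": true` with a nested `result.runId`) is inferred from the shell pipeline above rather than from the PR's code, and `first_done_run_id` is a hypothetical helper.

```python
import json


def first_done_run_id(lines):
    """Return result.runId from the first SSE data event with done == true."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            # Skip blank separators, comments, and other SSE fields.
            continue
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            # Ignore data lines that are not JSON.
            continue
        if isinstance(event, dict) and event.get("done") is True:
            return (event.get("result") or {}).get("runId")
    return None
```

The function accepts any iterable of text lines, so it works on a decoded streaming HTTP response as well as on a captured log file.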

Inspecting the data in Postgres

alias aepsql='docker exec agentevals-pg psql -U agentevals -d agentevals'

# Run history, most recent first
aepsql -c "SELECT run_id, status, attempt, created_at FROM agentevals.run ORDER BY created_at DESC LIMIT 10"

# Counts by status
aepsql -c "SELECT status, COUNT(*) FROM agentevals.run GROUP BY status ORDER BY 2 DESC"

# Counts by submission path (uploaded vs inline POST /api/runs)
aepsql -c "SELECT spec->'target'->>'kind' AS target, status, COUNT(*) FROM agentevals.run GROUP BY 1, 2"

# Drill into the last run
RUN=$(aepsql -At -c "SELECT run_id FROM agentevals.run ORDER BY created_at DESC LIMIT 1")
aepsql -c "SELECT evaluator_name, evaluator_type, status, score, latency_ms FROM agentevals.result WHERE run_id = '$RUN'"

# Aggregate scores per evaluator across all runs
aepsql -c "SELECT evaluator_name, ROUND(AVG(score)::numeric, 3) AS avg_score, COUNT(*) FROM agentevals.result WHERE score IS NOT NULL GROUP BY 1 ORDER BY 1"

# Queue / worker state (snapshot during a hot queue)
aepsql -c "SELECT run_id, status, worker_id, attempt, lease_expires_at, cancel_requested FROM agentevals.run WHERE status IN ('queued','running')"

# Schema state
aepsql -c "SELECT version, dirty FROM agentevals.schema_migrations"

Live tail while exercising the worker:

watch -n 1 "docker exec agentevals-pg psql -U agentevals -d agentevals -c \"SELECT status, COUNT(*) FROM agentevals.run GROUP BY status ORDER BY 1\""

Crash recovery

Submit a slow run using a larger trace, then press Ctrl+C to stop the agentevals process. Wait roughly 35 seconds (one 30-second lease window plus slack), then restart with make dev-backend-pg. A new worker re-claims the previously claimed run via the claim query's SKIP LOCKED clause and completes it; the run row's attempt counter reads 2.
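
The timing in this scenario follows from lease expiry: a run becomes claimable again only once its lease has lapsed without a heartbeat. The in-memory simulation below is a sketch of that rule, not the PR's worker code. Field names mirror the columns queried earlier (`status`, `worker_id`, `attempt`, `lease_expires_at`), the 30-second lease comes from the startup log, and the assumption that `attempt` starts at 0 and increments on every claim is mine. The real implementation does this in Postgres with row locking, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 30  # lease=30s in the startup log line


@dataclass
class Run:
    run_id: str
    status: str = "queued"
    worker_id: Optional[str] = None
    attempt: int = 0  # assumption: incremented on each claim
    lease_expires_at: Optional[float] = None


def claimable(run, now):
    """A run can be claimed if queued, or running with an expired lease."""
    if run.status == "queued":
        return True
    return (run.status == "running"
            and run.lease_expires_at is not None
            and run.lease_expires_at <= now)


def claim_next(runs, worker_id, now):
    """Claim the first claimable run for worker_id, or return None."""
    for run in runs:
        if claimable(run, now):
            run.status = "running"
            run.worker_id = worker_id
            run.attempt += 1
            run.lease_expires_at = now + LEASE_SECONDS
            return run
    return None


def heartbeat(run, worker_id, now):
    """Extend the lease if worker_id still owns the run."""
    if run.status == "running" and run.worker_id == worker_id:
        run.lease_expires_at = now + LEASE_SECONDS
        return True
    return False
```

With a last heartbeat at t=5s the lease runs until t=35s, which is why waiting about 35 seconds after killing the process is enough for a fresh worker to pick the run up with `attempt` at 2.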

Memory backend regression (zero-config flow unchanged)

make pg-down
make dev-backend                  # default in-memory backend, no AGENTEVALS_STORAGE_BACKEND set

curl -s -i http://localhost:8001/api/runs | head -3       # expect: 503 with hint pointing at the env var
curl -s -X POST http://localhost:8001/api/evaluate \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' | jq .data.runId    # expect: null (no persistence configured)
curl -s http://localhost:8001/api/health | jq -r .data.status                  # expect: "ok"

Cleanup

make pg-down
rm /tmp/req.json /tmp/req2.json /tmp/json_req.json

krisztianfekete force-pushed the feature/add-durable-storage branch from 18785bd to 99247be on May 4, 2026 16:39
krisztianfekete force-pushed the feature/add-durable-storage branch from 99247be to 5c6d499 on May 4, 2026 16:43