
Add event-sourcing system benchmarks #2032

Open
simonrosenberg wants to merge 6 commits into main from benchmarks/event-sourcing-metrics

Conversation


simonrosenberg (Collaborator) commented Feb 13, 2026

Summary

  • Adds benchmarks/event_sourcing/ with three scripts measuring SDK-attributable systems metrics for the event-sourced state-management design (Section 4.2)
  • Scripts use real event payloads from SWE-Bench Verified evaluation traces, replayed through the production LocalFileStore I/O path
  • Reports persist latency per event/action cycle, replay time vs. log size, storage growth by event type, and time-to-recover via unmatched-action detection (a sketch of the first metric follows this list)
  • Results included in README.md with tables for all four metrics
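
To make the first metric concrete, here is a minimal sketch of a persist-latency loop. The `store.write(path, contents)` interface, file layout, and percentile reporting are illustrative assumptions, not the actual script:

```python
# Minimal sketch of a per-event persist-latency measurement. `store` is
# anything exposing write(path, contents), e.g. the SDK's LocalFileStore
# (exact constructor/API assumed; check the SDK).
import json
import statistics
import time


def bench_persist(store, events: list[dict]) -> dict:
    latencies = []
    for i, event in enumerate(events):
        payload = json.dumps(event)
        t0 = time.perf_counter()
        store.write(f"events/event-{i:05d}.json", payload)  # timed I/O path
        latencies.append(time.perf_counter() - t0)
    return {
        "count": len(latencies),
        "p50_ms": statistics.median(latencies) * 1e3,
        "p99_ms": statistics.quantiles(latencies, n=100)[98] * 1e3,
    }
```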

Test plan

  • Run `python bench_persist_latency.py --eval-dir <path>` against an evaluation output directory
  • Run `python bench_replay_and_recovery.py --eval-dir <path>` against an evaluation output directory
  • Run `python bench_storage_growth.py --eval-dir <path>` against an evaluation output directory
  • Verify the JSON output files are produced with the expected structure (see the sketch below)
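
For the last step, a hypothetical smoke check (the output filenames and top-level shape are assumptions):

```python
# Hypothetical smoke check: every bench_*.json output parses and is a
# non-empty JSON object. Specific keys are not asserted because the
# scripts' exact schema is not shown here.
import json
from pathlib import Path

for path in sorted(Path(".").glob("bench_*.json")):
    data = json.loads(path.read_text())
    assert isinstance(data, dict) and data, f"{path}: empty or non-object"
    print(path.name, "->", sorted(data))
```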

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image                                 | Docs / Tags |
|---------|---------------|--------------------------------------------|-------------|
| java    | amd64, arm64  | eclipse-temurin:17-jdk                     | Link        |
| python  | amd64, arm64  | nikolaik/python-nodejs:python3.12-nodejs22 | Link        |
| golang  | amd64, arm64  | golang:1.21-bookworm                       | Link        |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0bf300a-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0bf300a-python \
  ghcr.io/openhands/agent-server:0bf300a-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0bf300a-golang-amd64
ghcr.io/openhands/agent-server:0bf300a-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0bf300a-golang-arm64
ghcr.io/openhands/agent-server:0bf300a-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0bf300a-java-amd64
ghcr.io/openhands/agent-server:0bf300a-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0bf300a-java-arm64
ghcr.io/openhands/agent-server:0bf300a-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0bf300a-python-amd64
ghcr.io/openhands/agent-server:0bf300a-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:0bf300a-python-arm64
ghcr.io/openhands/agent-server:0bf300a-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:0bf300a-golang
ghcr.io/openhands/agent-server:0bf300a-java
ghcr.io/openhands/agent-server:0bf300a-python

About Multi-Architecture Support

  • Each variant tag (e.g., 0bf300a-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 0bf300a-python-amd64) are also available if needed

…nd recovery

Adds three benchmark scripts and a results README under benchmarks/event_sourcing/
to report SDK-attributable systems metrics for the event-sourced state management
design. Scripts accept --eval-dir pointing to SWE-Bench evaluation traces and measure:
- Persist latency per event and per action cycle
- Replay time vs. log size (index rebuild + full replay)
- Storage growth and composition by event type
- Time-to-recover via replay after failures (unmatched-action detection)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
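
One of the metrics above, storage growth by event type, reduces to a size tally over the persisted event files. A rough sketch, assuming one JSON file per event with a `kind` discriminator field (the real field name may differ):

```python
# Rough sketch of storage composition by event type; the "kind" field
# is an assumption about the event schema.
import json
from collections import Counter
from pathlib import Path


def storage_by_event_type(event_dir: str) -> Counter:
    sizes: Counter = Counter()
    for f in Path(event_dir).glob("*.json"):
        kind = json.loads(f.read_text()).get("kind", "unknown")
        sizes[kind] += f.stat().st_size  # on-disk bytes attributed to this type
    return sizes
```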

all-hands-bot (Collaborator) left a comment


Taste Rating: 🟡 Acceptable - Solid benchmark approach with real data, but has a critical logging bug and some duplication.

Key Insight: Using real SWE-Bench traces is pragmatic and valuable. The logging side effect is the only must-fix issue.

simonrosenberg and others added 4 commits February 13, 2026 12:54
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace module-level logging.disable(logging.WARNING) with a scoped
logging.getLogger("openhands").setLevel(logging.ERROR) inside main()
to avoid globally breaking logging for importers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
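
For context, the before/after of that logging fix looks roughly like this:

```python
import logging

# Before (module level): disables WARNING and below for *every* logger in
# the process, as a side effect of merely importing the benchmark module.
# logging.disable(logging.WARNING)


# After (scoped): only the SDK's logger hierarchy is quieted, and only
# when the script is actually executed as a program.
def main() -> None:
    logging.getLogger("openhands").setLevel(logging.ERROR)
    ...


if __name__ == "__main__":
    main()
```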
Break long f-strings to stay within 88-char line limit. Hoist
json_files assignment before loops to fix possibly-unbound warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
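
An illustrative reconstruction of the possibly-unbound fix (variable and function names are guesses from the commit message):

```python
from pathlib import Path

# Before (illustrative): json_files was only assigned inside a branch, so
# a later loop over it is flagged as possibly unbound by the type checker.
#
#     if eval_dir.exists():
#         json_files = sorted(eval_dir.glob("*.json"))
#     for f in json_files:  # <- possibly unbound
#         ...
#
# After: bind json_files once, before any loop that uses it.
def collect_json_files(eval_dir: Path) -> list[Path]:
    json_files: list[Path] = []
    if eval_dir.exists():
        json_files = sorted(eval_dir.glob("*.json"))
    return json_files
```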
- Extract duplicated extract_conversation() and read_event_files() into
  benchmark_utils.py to eliminate code duplication across scripts
- Replace hand-rolled unmatched-action detection with the real SDK's
  ConversationState.get_unmatched_actions() using full Pydantic
  Event.model_validate_json() deserialization
- Add register_tool_types() helper to import concrete tool classes
  needed for discriminated union deserialization
- Update README time-to-recover results with real SDK deserialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
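
Put together, the recovery path described in that commit looks roughly like the sketch below. The import paths and the exact `get_unmatched_actions` signature are assumptions inferred from names in the commit message, not verified SDK API:

```python
# Sketch of time-to-recover via replay. Import paths and the signature of
# get_unmatched_actions are ASSUMPTIONS based on the commit message.
import time
from pathlib import Path

from openhands.sdk.conversation.state import ConversationState  # assumed path
from openhands.sdk.event import Event  # assumed path


def bench_recovery(event_dir: str) -> dict:
    t0 = time.perf_counter()
    events = [
        Event.model_validate_json(f.read_text())  # full Pydantic deserialization
        for f in sorted(Path(event_dir).glob("*.json"))
    ]
    unmatched = ConversationState.get_unmatched_actions(events)
    return {
        "events": len(events),
        "unmatched_actions": len(unmatched),
        "time_to_recover_s": time.perf_counter() - t0,
    }
```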

all-hands-bot (Collaborator) left a comment


Taste Rating: 🟡 Acceptable - Solid benchmark implementation with real eval data.

All previous issues properly addressed:

  • ✅ Logging side effect fixed (scoped to main function)
  • ✅ Code duplication eliminated (extracted to benchmark_utils.py)
  • ✅ Now uses the SDK's actual `get_unmatched_actions()` method

Key Insight: Using real SWE-Bench traces instead of synthetic data gives meaningful performance metrics. The scripts exercise actual SDK code paths (LocalFileStore, ConversationState) rather than mocking - pragmatic and valuable.

Code is straightforward with no over-engineering. Proper resource cleanup, appropriate use of gc.disable() for accurate timing. 👍
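
The `gc.disable()` mention refers to the standard pattern of suspending the cyclic collector around timed sections so collection pauses don't inflate individual samples, roughly:

```python
import gc
import time


def run_workload() -> None:
    sum(i * i for i in range(1_000_000))  # stand-in for the timed benchmark body


gc.disable()  # suspend cyclic GC so collection pauses don't skew samples
try:
    t0 = time.perf_counter()
    run_workload()
    elapsed = time.perf_counter() - t0
finally:
    gc.enable()  # always restore the collector

print(f"elapsed: {elapsed:.4f}s")
```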


openhands-ai bot commented Feb 13, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Review Thread Gate

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2032 at branch `benchmarks/event-sourcing-metrics`

Feel free to include any additional details that might help me get this PR into a better state.



xingyaoww (Collaborator) left a comment


@simonrosenberg can you move all these under the scripts/ folder?

Otherwise LGTM

xingyaoww (Collaborator) commented

@OpenHands pls help me move these scripts


openhands-ai bot commented Feb 15, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Move benchmark scripts from benchmarks/event_sourcing/ to
scripts/event_sourcing_benchmarks/ as requested.

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Feb 15, 2026

Done! I've moved the event-sourcing benchmark scripts from benchmarks/event_sourcing/ to scripts/event_sourcing_benchmarks/ as requested by @xingyaoww in the PR review comment.

Changes made:

  • Moved all 5 files (3 benchmark scripts, 1 utils module, 1 README) from benchmarks/event_sourcing/ to scripts/event_sourcing_benchmarks/
  • Removed the now-empty benchmarks/ directory

The changes have been committed and pushed to the benchmarks/event-sourcing-metrics branch to update PR #2032.
