
Add event-sourcing system benchmarks #2032

Open
simonrosenberg wants to merge 6 commits into main from benchmarks/event-sourcing-metrics

Conversation


simonrosenberg (Collaborator) commented Feb 13, 2026

Summary

  • Adds benchmarks/event_sourcing/ with three scripts measuring SDK-attributable systems metrics for the event-sourced state-management design (Section 4.2)
  • Scripts use real event payloads from SWE-Bench Verified evaluation traces, replayed through the production LocalFileStore I/O path
  • Reports persist latency per event/action cycle, replay time vs. log size, storage growth by event type, and time-to-recover via unmatched-action detection (a sketch of the first metric follows this list)
  • Results included in README.md with tables for all four metrics
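
To make the first metric concrete, here is a minimal sketch of a persist-latency loop. The `store.write(path, contents)` interface, file layout, and percentile reporting are illustrative assumptions, not the actual script:

```python
# Minimal sketch of a per-event persist-latency measurement. `store` is
# anything exposing write(path, contents), e.g. the SDK's LocalFileStore
# (exact constructor/API assumed; check the SDK).
import json
import statistics
import time


def bench_persist(store, events: list[dict]) -> dict:
    latencies = []
    for i, event in enumerate(events):
        payload = json.dumps(event)
        t0 = time.perf_counter()
        store.write(f"events/event-{i:05d}.json", payload)  # timed I/O path
        latencies.append(time.perf_counter() - t0)
    return {
        "count": len(latencies),
        "p50_ms": statistics.median(latencies) * 1e3,
        "p99_ms": statistics.quantiles(latencies, n=100)[98] * 1e3,
    }
```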

Test plan

  • Run `python bench_persist_latency.py --eval-dir <path>` against an evaluation output directory
  • Run `python bench_replay_and_recovery.py --eval-dir <path>` against an evaluation output directory
  • Run `python bench_storage_growth.py --eval-dir <path>` against an evaluation output directory
  • Verify the JSON output files are produced with the expected structure (see the sketch below)
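
For the last step, a hypothetical smoke check (the output filenames and top-level shape are assumptions):

```python
# Hypothetical smoke check: every bench_*.json output parses and is a
# non-empty JSON object. Specific keys are not asserted because the
# scripts' exact schema is not shown here.
import json
from pathlib import Path

for path in sorted(Path(".").glob("bench_*.json")):
    data = json.loads(path.read_text())
    assert isinstance(data, dict) and data, f"{path}: empty or non-object"
    print(path.name, "->", sorted(data))
```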

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image                                 | Docs / Tags |
|---------|---------------|--------------------------------------------|-------------|
| java    | amd64, arm64  | eclipse-temurin:17-jdk                     | Link        |
| python  | amd64, arm64  | nikolaik/python-nodejs:python3.12-nodejs22 | Link        |
| golang  | amd64, arm64  | golang:1.21-bookworm                       | Link        |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0bf300a-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0bf300a-python \
  ghcr.io/openhands/agent-server:0bf300a-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0bf300a-golang-amd64
ghcr.io/openhands/agent-server:0bf300a-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0bf300a-golang-arm64
ghcr.io/openhands/agent-server:0bf300a-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0bf300a-java-amd64
ghcr.io/openhands/agent-server:0bf300a-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0bf300a-java-arm64
ghcr.io/openhands/agent-server:0bf300a-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0bf300a-python-amd64
ghcr.io/openhands/agent-server:0bf300a-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:0bf300a-python-arm64
ghcr.io/openhands/agent-server:0bf300a-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:0bf300a-golang
ghcr.io/openhands/agent-server:0bf300a-java
ghcr.io/openhands/agent-server:0bf300a-python

About Multi-Architecture Support

  • Each variant tag (e.g., 0bf300a-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 0bf300a-python-amd64) are also available if needed

…nd recovery

Adds three benchmark scripts and a results README under benchmarks/event_sourcing/
to report SDK-attributable systems metrics for the event-sourced state management
design. Scripts accept --eval-dir pointing to SWE-Bench evaluation traces and measure:
- Persist latency per event and per action cycle
- Replay time vs. log size (index rebuild + full replay)
- Storage growth and composition by event type
- Time-to-recover via replay after failures (unmatched-action detection)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
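
One of the metrics above, storage growth by event type, reduces to a size tally over the persisted event files. A rough sketch, assuming one JSON file per event with a `kind` discriminator field (the real field name may differ):

```python
# Rough sketch of storage composition by event type; the "kind" field
# is an assumption about the event schema.
import json
from collections import Counter
from pathlib import Path


def storage_by_event_type(event_dir: str) -> Counter:
    sizes: Counter = Counter()
    for f in Path(event_dir).glob("*.json"):
        kind = json.loads(f.read_text()).get("kind", "unknown")
        sizes[kind] += f.stat().st_size  # on-disk bytes attributed to this type
    return sizes
```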

all-hands-bot (Collaborator) left a comment


Taste Rating: 🟡 Acceptable - Solid benchmark approach with real data, but has a critical logging bug and some duplication.

Key Insight: Using real SWE-Bench traces is pragmatic and valuable. The logging side effect is the only must-fix issue.

simonrosenberg and others added 4 commits February 13, 2026 12:54
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace module-level logging.disable(logging.WARNING) with a scoped
logging.getLogger("openhands").setLevel(logging.ERROR) inside main()
to avoid globally breaking logging for importers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
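
For context, the before/after of that logging fix looks roughly like this:

```python
import logging

# Before (module level): disables WARNING and below for *every* logger in
# the process, as a side effect of merely importing the benchmark module.
# logging.disable(logging.WARNING)


# After (scoped): only the SDK's logger hierarchy is quieted, and only
# when the script is actually executed as a program.
def main() -> None:
    logging.getLogger("openhands").setLevel(logging.ERROR)
    ...


if __name__ == "__main__":
    main()
```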
Break long f-strings to stay within 88-char line limit. Hoist
json_files assignment before loops to fix possibly-unbound warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
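
An illustrative reconstruction of the possibly-unbound fix (variable and function names are guesses from the commit message):

```python
from pathlib import Path

# Before (illustrative): json_files was only assigned inside a branch, so
# a later loop over it is flagged as possibly unbound by the type checker.
#
#     if eval_dir.exists():
#         json_files = sorted(eval_dir.glob("*.json"))
#     for f in json_files:  # <- possibly unbound
#         ...
#
# After: bind json_files once, before any loop that uses it.
def collect_json_files(eval_dir: Path) -> list[Path]:
    json_files: list[Path] = []
    if eval_dir.exists():
        json_files = sorted(eval_dir.glob("*.json"))
    return json_files
```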
- Extract duplicated extract_conversation() and read_event_files() into
  benchmark_utils.py to eliminate code duplication across scripts
- Replace hand-rolled unmatched-action detection with the real SDK's
  ConversationState.get_unmatched_actions() using full Pydantic
  Event.model_validate_json() deserialization
- Add register_tool_types() helper to import concrete tool classes
  needed for discriminated union deserialization
- Update README time-to-recover results with real SDK deserialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
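
Put together, the recovery path described in that commit looks roughly like the sketch below. The import paths and the exact `get_unmatched_actions` signature are assumptions inferred from names in the commit message, not verified SDK API:

```python
# Sketch of time-to-recover via replay. Import paths and the signature of
# get_unmatched_actions are ASSUMPTIONS based on the commit message.
import time
from pathlib import Path

from openhands.sdk.conversation.state import ConversationState  # assumed path
from openhands.sdk.event import Event  # assumed path


def bench_recovery(event_dir: str) -> dict:
    t0 = time.perf_counter()
    events = [
        Event.model_validate_json(f.read_text())  # full Pydantic deserialization
        for f in sorted(Path(event_dir).glob("*.json"))
    ]
    unmatched = ConversationState.get_unmatched_actions(events)
    return {
        "events": len(events),
        "unmatched_actions": len(unmatched),
        "time_to_recover_s": time.perf_counter() - t0,
    }
```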

all-hands-bot (Collaborator) left a comment


Taste Rating: 🟡 Acceptable - Solid benchmark implementation with real eval data.

All previous issues properly addressed:

  • ✅ Logging side effect fixed (scoped to main function)
  • ✅ Code duplication eliminated (extracted to benchmark_utils.py)
  • ✅ Now uses the SDK's actual `get_unmatched_actions()` method

Key Insight: Using real SWE-Bench traces instead of synthetic data gives meaningful performance metrics. The scripts exercise actual SDK code paths (LocalFileStore, ConversationState) rather than mocking - pragmatic and valuable.

Code is straightforward with no over-engineering. Proper resource cleanup, appropriate use of gc.disable() for accurate timing. 👍
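
The `gc.disable()` mention refers to the standard pattern of suspending the cyclic collector around timed sections so collection pauses don't inflate individual samples, roughly:

```python
import gc
import time


def run_workload() -> None:
    sum(i * i for i in range(1_000_000))  # stand-in for the timed benchmark body


gc.disable()  # suspend cyclic GC so collection pauses don't skew samples
try:
    t0 = time.perf_counter()
    run_workload()
    elapsed = time.perf_counter() - t0
finally:
    gc.enable()  # always restore the collector

print(f"elapsed: {elapsed:.4f}s")
```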


openhands-ai bot commented Feb 13, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Review Thread Gate

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2032 at branch `benchmarks/event-sourcing-metrics`

Feel free to include any additional details that might help me get this PR into a better state.



xingyaoww (Collaborator) left a comment


@simonrosenberg can you move all these under the scripts/ folder?

Otherwise LGTM

xingyaoww (Collaborator) commented

@OpenHands pls help me move these scripts


openhands-ai bot commented Feb 15, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Move benchmark scripts from benchmarks/event_sourcing/ to
scripts/event_sourcing_benchmarks/ as requested.

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Feb 15, 2026

Done! I've moved the event-sourcing benchmark scripts from benchmarks/event_sourcing/ to scripts/event_sourcing_benchmarks/ as requested by @xingyaoww in the PR review comment.

Changes made:

  • Moved all 5 files (3 benchmark scripts, 1 utils module, 1 README) from benchmarks/event_sourcing/ to scripts/event_sourcing_benchmarks/
  • Removed the now-empty benchmarks/ directory

The changes have been committed and pushed to the benchmarks/event-sourcing-metrics branch to update PR #2032.
