Conversation
…nd recovery

Adds three benchmark scripts and a results README under `benchmarks/event_sourcing/` to report SDK-attributable systems metrics for the event-sourced state management design. Scripts accept `--eval-dir` pointing to SWE-Bench evaluation traces and measure:
- Persist latency per event and per action cycle
- Replay time vs. log size (index rebuild + full replay)
- Storage growth and composition by event type
- Time-to-recover via replay after failures (unmatched-action detection)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
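The per-event persist-latency measurement can be sketched roughly as below. This is a minimal illustration, not the actual script: the `events/*.json` per-trace layout, the `bench_out` destination, and the function name are all assumptions.

```python
import time
from pathlib import Path

def bench_persist_latency(eval_dir: str) -> list[float]:
    """Time each individual event write when re-persisting a trace.

    Assumption: each evaluation trace stores one JSON file per event
    under an events/ subdirectory (hypothetical layout).
    """
    latencies: list[float] = []
    out = Path("bench_out/events")
    out.mkdir(parents=True, exist_ok=True)
    for i, event_file in enumerate(sorted(Path(eval_dir).glob("**/events/*.json"))):
        payload = event_file.read_text()
        start = time.perf_counter()
        # The persist operation under test: one event, one file write.
        (out / f"{i:06d}.json").write_text(payload)
        latencies.append(time.perf_counter() - start)
    return latencies
```

The real scripts write through the SDK's `LocalFileStore` rather than `Path.write_text`, so absolute numbers will differ; the measurement loop shape is the same.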
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Solid benchmark approach with real data, but has a critical logging bug and some duplication.
Key Insight: Using real SWE-Bench traces is pragmatic and valuable. The logging side effect is the only must-fix issue.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the module-level `logging.disable(logging.WARNING)` with a scoped `logging.getLogger("openhands").setLevel(logging.ERROR)` inside `main()` to avoid globally breaking logging for importers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
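A minimal sketch of the scoped fix: only the `"openhands"` logger hierarchy is quieted, and only when the script runs as an entry point, so importers keep their logging intact. The `run_benchmarks` driver here is hypothetical.

```python
import logging

def run_benchmarks() -> None:
    # Hypothetical benchmark driver, standing in for the real work.
    logging.getLogger("openhands").warning("suppressed")   # below ERROR: dropped
    logging.getLogger(__name__).warning("still visible")   # other loggers unaffected

def main() -> None:
    # Scoped alternative to a module-level logging.disable(logging.WARNING),
    # which would silence warnings for every importer of this module.
    logging.getLogger("openhands").setLevel(logging.ERROR)
    run_benchmarks()

if __name__ == "__main__":
    main()
```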
Break long f-strings to stay within the 88-char line limit. Hoist the `json_files` assignment before the loops to fix a possibly-unbound warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract duplicated `extract_conversation()` and `read_event_files()` into `benchmark_utils.py` to eliminate code duplication across scripts
- Replace hand-rolled unmatched-action detection with the real SDK's `ConversationState.get_unmatched_actions()` using full Pydantic `Event.model_validate_json()` deserialization
- Add `register_tool_types()` helper to import concrete tool classes needed for discriminated union deserialization
- Update README time-to-recover results with real SDK deserialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
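The idea behind unmatched-action detection over a serialized event log can be illustrated with a self-contained Pydantic sketch. `ActionEvent`, `ObservationEvent`, and the `kind` discriminator below are stand-ins invented for this example; the real scripts use the SDK's event models via `Event.model_validate_json()` and call `ConversationState.get_unmatched_actions()`.

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class ActionEvent(BaseModel):
    kind: Literal["action"] = "action"
    id: str

class ObservationEvent(BaseModel):
    kind: Literal["observation"] = "observation"
    action_id: str  # which action this observation answers

# Discriminated union: the "kind" field picks the concrete class during
# deserialization, which is why concrete types must be registered/imported.
Event = TypeAdapter(
    Annotated[Union[ActionEvent, ObservationEvent], Field(discriminator="kind")]
)

def get_unmatched_actions(log_lines: list[str]) -> list[ActionEvent]:
    """Return actions that never received a matching observation."""
    events = [Event.validate_json(line) for line in log_lines]
    observed = {e.action_id for e in events if isinstance(e, ObservationEvent)}
    return [e for e in events if isinstance(e, ActionEvent) and e.id not in observed]
```

An action left in this list after a crash marks the point where replay-based recovery must resume work.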
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Solid benchmark implementation with real eval data.
All previous issues properly addressed:
- ✅ Logging side effect fixed (scoped to main function)
- ✅ Code duplication eliminated (extracted to benchmark_utils.py)
- ✅ Now uses the SDK's actual `get_unmatched_actions()` method
Key Insight: Using real SWE-Bench traces instead of synthetic data gives meaningful performance metrics. The scripts exercise actual SDK code paths (LocalFileStore, ConversationState) rather than mocking - pragmatic and valuable.
Code is straightforward with no over-engineering. Proper resource cleanup, appropriate use of gc.disable() for accurate timing. 👍
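The `gc.disable()` timing pattern praised above looks roughly like this generic sketch (not the benchmark's actual code): pausing the cyclic collector keeps an unlucky collection pass from inflating a single measurement.

```python
import gc
import time

def timed(fn, *args):
    """Run fn(*args) with the cyclic GC paused and return (result, seconds)."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
    finally:
        # Restore the collector only if it was running before, so nested
        # or repeated calls don't permanently disable GC.
        if was_enabled:
            gc.enable()
    return result, elapsed
```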
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
xingyaoww
left a comment
@simonrosenberg can you move all these under scripts/ folder?
Otherwise LGTM
@OpenHands pls help me move these scripts
I'm on it! xingyaoww can track my progress at all-hands.dev
Move benchmark scripts from benchmarks/event_sourcing/ to scripts/event_sourcing_benchmarks/ as requested. Co-authored-by: openhands <openhands@all-hands.dev>
Done! I've moved the event-sourcing benchmark scripts from `benchmarks/event_sourcing/` to `scripts/event_sourcing_benchmarks/`.

The changes have been committed and pushed to the PR branch.
Summary
- Adds `benchmarks/event_sourcing/` with three scripts measuring SDK-attributable systems metrics for the event-sourced state management (Section 4.2)
- Exercises the real `LocalFileStore` I/O path
- Adds `README.md` with tables for all four metrics

Test plan
- `python bench_persist_latency.py --eval-dir <path>` against an evaluation output directory
- `python bench_replay_and_recovery.py --eval-dir <path>` against an evaluation output directory
- `python bench_storage_growth.py --eval-dir <path>` against an evaluation output directory

🤖 Generated with Claude Code
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- java: `eclipse-temurin:17-jdk`
- python: `nikolaik/python-nodejs:python3.12-nodejs22`
- golang: `golang:1.21-bookworm`

Pull (multi-arch manifest)

```
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0bf300a-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
- Each version tag (e.g., `0bf300a-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., `0bf300a-python-amd64`) are also available if needed