SWE-bench Lite: 59.0% (177/300) with claude-sonnet-4-6 via Bedrock gateway #116

@harperreed

SWE-bench Lite Evaluation Report

Result: 177/300 resolved (59.0%)

| Metric | Value |
| --- | --- |
| Total instances | 300 |
| Resolved | 177 (59.0%) |
| Unresolved | 83 (27.7%) |
| Errors (build/patch) | 23 (7.7%) |
| Empty patches | 17 (5.7%) |
| Patch rate | 94.3% (283/300 non-empty) |
| Timeouts | 0 |
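As a quick sanity check, the percentages above can be recomputed from the raw counts. This is a minimal sketch, not part of the harness; the names `counts`, `pct`, and `patch_rate` are made up for illustration.

```python
# Outcome counts copied from the summary table above.
counts = {"resolved": 177, "unresolved": 83, "errors": 23, "empty_patches": 17}

# The four buckets are mutually exclusive, so they should add up to the dataset size.
total = sum(counts.values())


def pct(n: int, total: int) -> float:
    """Share of `total`, as a percentage rounded to one decimal place."""
    return round(100 * n / total, 1)


shares = {name: pct(n, total) for name, n in counts.items()}

# Patch rate = instances where the agent produced a non-empty diff.
patch_rate = pct(total - counts["empty_patches"], total)

for name, share in shares.items():
    print(f"{name:<14} {counts[name]:>3}  {share:>5.1f}%")
print(f"patch rate          {patch_rate:>5.1f}%")
```

Because each share is rounded independently, the four percentages need not sum to exactly 100.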

Run Configuration

| Setting | Value |
| --- | --- |
| Model | claude-sonnet-4-6 |
| Provider | anthropic (via Bedrock gateway) |
| Gateway | bedrock-gateway.2389-research-inc.workers.dev |
| Max turns | 50 |
| Timeout | 30m per instance |
| Commit | 76164b0a41f2 |
| Host | aibox02 |
| Harness | tracker-swebench (in-container agent-runner) |
| Evaluator | Official SWE-bench Docker harness |
| Dataset | princeton-nlp/SWE-bench_Lite |

Token Usage

| Metric | Value |
| --- | --- |
| Total input tokens | 189,446,826 (189.4M) |
| Total output tokens | 1,975,922 (1.98M) |
| Avg input per instance | 631,489 |
| Avg output per instance | 6,586 |
| Total turns | 10,365 |
| Avg turns per instance | 34.5 |
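The per-instance averages follow directly from the totals, and the same totals give two derived figures that are not in the table: roughly 18k input tokens per turn and roughly 96 input tokens for every output token. A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Totals copied from the token usage table above.
TOTAL_INPUT_TOKENS = 189_446_826
TOTAL_OUTPUT_TOKENS = 1_975_922
TOTAL_TURNS = 10_365
INSTANCES = 300

# Per-instance averages, rounded to whole tokens as in the table.
avg_input = round(TOTAL_INPUT_TOKENS / INSTANCES)
avg_output = round(TOTAL_OUTPUT_TOKENS / INSTANCES)

# Derived figures (not reported above): how much context each turn carries,
# and how input-heavy the run is overall.
input_per_turn = TOTAL_INPUT_TOKENS / TOTAL_TURNS
input_output_ratio = TOTAL_INPUT_TOKENS / TOTAL_OUTPUT_TOKENS

print(f"avg input/instance:  {avg_input:,}")
print(f"avg output/instance: {avg_output:,}")
print(f"input tokens/turn:   {input_per_turn:,.0f}")
print(f"input:output ratio:  {input_output_ratio:.1f}")
```

The high input-to-output ratio is what you would expect if each turn resends the accumulated conversation, though the report does not say how the harness builds context.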

Timing

| Metric | Value |
| --- | --- |
| Total agent time (sum) | 14.1 hours |
| Avg duration per instance | 2.8 min |
| Median duration | 2.7 min |
| Min / Max | 39s / 9m20s |

Per-Repository Breakdown

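| Repo | Total | Resolved | Unresolved | Error | Empty | Rate |
| --- | --- | --- | --- | --- | --- | --- |
| astropy/astropy | 6 | 4 | 1 | 1 | 0 | 66.7% |
| django/django | 114 | 84 | 19 | 8 | 3 | 73.7% |
| matplotlib/matplotlib | 23 | 9 | 11 | 0 | 3 | 39.1% |
| mwaskom/seaborn | 4 | 2 | 2 | 0 | 0 | 50.0% |
| pallets/flask | 3 | 1 | 2 | 0 | 0 | 33.3% |
| psf/requests | 6 | 6 | 0 | 0 | 0 | 100.0% |
| pydata/xarray | 5 | 2 | 3 | 0 | 0 | 40.0% |
| pylint-dev/pylint | 6 | 1 | 2 | 3 | 0 | 16.7% |
| pytest-dev/pytest | 17 | 11 | 4 | 0 | 2 | 64.7% |
| scikit-learn/scikit-learn | 23 | 11 | 4 | 5 | 3 | 47.8% |
| sphinx-doc/sphinx | 16 | 3 | 9 | 2 | 2 | 18.8% |
| sympy/sympy | 77 | 43 | 26 | 4 | 4 | 55.8% |
| Total | 300 | 177 | 83 | 23 | 17 | 59.0% |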
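The breakdown can be cross-checked mechanically: every row's outcome buckets should sum to its total, and the columns should sum to the Total row. A minimal sketch, assuming the rate is simply resolved divided by total; `ROWS` and `resolve_rates` are illustrative names, not part of tracker-swebench.

```python
# (repo, total, resolved, unresolved, error, empty), copied from the table above.
ROWS = [
    ("astropy/astropy", 6, 4, 1, 1, 0),
    ("django/django", 114, 84, 19, 8, 3),
    ("matplotlib/matplotlib", 23, 9, 11, 0, 3),
    ("mwaskom/seaborn", 4, 2, 2, 0, 0),
    ("pallets/flask", 3, 1, 2, 0, 0),
    ("psf/requests", 6, 6, 0, 0, 0),
    ("pydata/xarray", 5, 2, 3, 0, 0),
    ("pylint-dev/pylint", 6, 1, 2, 3, 0),
    ("pytest-dev/pytest", 17, 11, 4, 0, 2),
    ("scikit-learn/scikit-learn", 23, 11, 4, 5, 3),
    ("sphinx-doc/sphinx", 16, 3, 9, 2, 2),
    ("sympy/sympy", 77, 43, 26, 4, 4),
]


def resolve_rates(rows):
    """Return {repo: resolve rate in percent}, checking each row is self-consistent."""
    rates = {}
    for repo, total, resolved, unresolved, error, empty in rows:
        # Every instance must land in exactly one outcome bucket.
        assert resolved + unresolved + error + empty == total, repo
        rates[repo] = round(100 * resolved / total, 1)
    return rates


rates = resolve_rates(ROWS)

# Column sums in table order: total, resolved, unresolved, error, empty.
totals = [sum(col) for col in zip(*(row[1:] for row in ROWS))]

# Print repos from weakest to strongest resolve rate.
for repo, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{repo:<28}{rate:>6.1f}%")
```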

Observations

  • Django dominates — 114/300 instances (38%), with a strong 73.7% resolve rate
  • psf/requests: 100% — all 6 instances resolved
  • sphinx-doc/sphinx and pylint-dev/pylint are weak spots (18.8% and 16.7%)
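  • 23 errors are infrastructure-level (build failures, patch application errors), not agent failures. The causes seen include sympy setup failures (exit code 128) and malformed patches. Note that only 4 of the 23 errors are sympy instances; django accounts for 8 and scikit-learn for 5
  • 17 empty patches: the agent ran but produced no diff, likely on hard instances or where the agent gave up
  • 0 timeouts across all 300 instances with the 30m limit. The slowest instance took 9m20s, so the budget is generous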

Infrastructure Notes

  • The Bedrock gateway (bedrock-gateway.2389-research-inc.workers.dev) initially didn't support SSE streaming, which tracker's agent loop requires internally. Gateway-dev fixed this by synthesizing SSE events from non-streaming Bedrock responses ("fake stream"). This is transparent to the agent.
  • Evaluation ran on aibox02 using the Docker-based SWE-bench harness. Each instance gets its own container with the repo checked out at the correct commit.
  • The agent-runner binary runs inside containers, creating an agent.Session directly with the tracker SDK.
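
To illustrate the "fake stream" idea from the first note, here is a minimal sketch of replaying an already-complete response as a sequence of SSE frames. It is not the gateway's actual code: the function names, the chunk size, and the event names and payload shapes are all assumptions made for illustration. Only the frame layout (`event:` and `data:` fields terminated by a blank line) comes from the SSE format itself.

```python
import json
from typing import Iterator


def sse_frame(event: str, payload: dict) -> str:
    """Encode one Server-Sent Events frame: event name, JSON data, blank line."""
    # json.dumps escapes newlines, so the data field always stays on one line.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def synthesize_sse(text: str, chunk_size: int = 32) -> Iterator[str]:
    """Replay a complete, non-streamed response body as a stream of SSE frames.

    Event names and payload shapes are illustrative assumptions.
    """
    yield sse_frame("message_start", {"type": "message_start"})
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        yield sse_frame(
            "content_block_delta",
            {"type": "content_block_delta", "delta": {"type": "text_delta", "text": chunk}},
        )
    yield sse_frame("message_stop", {"type": "message_stop"})


if __name__ == "__main__":
    for frame in synthesize_sse("The full response arrives first, then is re-chunked."):
        print(frame, end="")
```

A client that only understands streaming sees ordinary incremental events, but nothing arrives until the upstream call has finished, so time to first token equals the full response latency.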

Context

This is tracker's first full SWE-bench Lite run. The 59.0% result with Sonnet 4.6 is competitive — for reference, the SWE-bench Lite leaderboard top entries are in the 40-50% range for unassisted single-agent systems (though methodology varies).

Resolved instance IDs (177)

astropy__astropy-12907
astropy__astropy-14365
astropy__astropy-14995
astropy__astropy-6938
django__django-10914
django__django-11001
django__django-11039
django__django-11049
django__django-11099
django__django-11133
django__django-11179
django__django-11422
django__django-11583
django__django-11620
django__django-11797
django__django-11815
django__django-11848
django__django-11905
django__django-11999
django__django-12125
django__django-12184
django__django-12286
django__django-12308
django__django-12453
django__django-12497
django__django-12589
django__django-12700
django__django-12708
django__django-12747
django__django-12856
django__django-12908
django__django-12915
django__django-12983
django__django-13028
django__django-13033
django__django-13158
django__django-13315
django__django-13401
django__django-13447
django__django-13448
django__django-13551
django__django-13590
django__django-13658
django__django-13710
django__django-13757
django__django-13768
django__django-13925
django__django-13933
django__django-13964
django__django-14238
django__django-14382
django__django-14411
django__django-14580
django__django-14667
django__django-14672
django__django-14752
django__django-14787
django__django-14855
django__django-14915
django__django-14997
django__django-14999
django__django-15061
django__django-15213
django__django-15320
django__django-15347
django__django-15388
django__django-15400
django__django-15498
django__django-15738
django__django-15781
django__django-15790
django__django-15814
django__django-15819
django__django-15851
django__django-15902
django__django-15996
django__django-16041
django__django-16046
django__django-16139
django__django-16255
django__django-16379
django__django-16400
django__django-16408
django__django-16527
django__django-16595
django__django-16873
django__django-16910
django__django-17087
matplotlib__matplotlib-22835
matplotlib__matplotlib-23562
matplotlib__matplotlib-23913
matplotlib__matplotlib-23964
matplotlib__matplotlib-24265
matplotlib__matplotlib-25311
matplotlib__matplotlib-25332
matplotlib__matplotlib-25442
matplotlib__matplotlib-26020
mwaskom__seaborn-3010
mwaskom__seaborn-3190
pallets__flask-4992
psf__requests-1963
psf__requests-2148
psf__requests-2317
psf__requests-2674
psf__requests-3362
psf__requests-863
pydata__xarray-4094
pydata__xarray-5131
pylint-dev__pylint-5859
pytest-dev__pytest-11143
pytest-dev__pytest-5227
pytest-dev__pytest-5413
pytest-dev__pytest-5692
pytest-dev__pytest-6116
pytest-dev__pytest-7168
pytest-dev__pytest-7373
pytest-dev__pytest-7432
pytest-dev__pytest-7490
pytest-dev__pytest-8906
pytest-dev__pytest-9359
scikit-learn__scikit-learn-10297
scikit-learn__scikit-learn-12471
scikit-learn__scikit-learn-13142
scikit-learn__scikit-learn-13241
scikit-learn__scikit-learn-13439
scikit-learn__scikit-learn-13584
scikit-learn__scikit-learn-13779
scikit-learn__scikit-learn-14092
scikit-learn__scikit-learn-14983
scikit-learn__scikit-learn-15512
scikit-learn__scikit-learn-15535
sphinx-doc__sphinx-11445
sphinx-doc__sphinx-8627
sphinx-doc__sphinx-8713
sympy__sympy-12419
sympy__sympy-12481
sympy__sympy-13471
sympy__sympy-13480
sympy__sympy-13647
sympy__sympy-13773
sympy__sympy-13971
sympy__sympy-14396
sympy__sympy-14774
sympy__sympy-15011
sympy__sympy-15345
sympy__sympy-15346
sympy__sympy-15609
sympy__sympy-15678
sympy__sympy-16503
sympy__sympy-16792
sympy__sympy-16988
sympy__sympy-17022
sympy__sympy-17655
sympy__sympy-18057
sympy__sympy-18087
sympy__sympy-18189
sympy__sympy-18532
sympy__sympy-18621
sympy__sympy-18698
sympy__sympy-19487
sympy__sympy-20049
sympy__sympy-20154
sympy__sympy-20212
sympy__sympy-20442
sympy__sympy-21055
sympy__sympy-21379
sympy__sympy-21612
sympy__sympy-21614
sympy__sympy-21627
sympy__sympy-21847
sympy__sympy-22005
sympy__sympy-22714
sympy__sympy-23117
sympy__sympy-23262
sympy__sympy-24066
sympy__sympy-24152
sympy__sympy-24213

Error instance IDs (23)

astropy__astropy-14182
django__django-11283
django__django-11910
django__django-11964
django__django-12113
django__django-12284
django__django-14016
django__django-14608
django__django-17051
pylint-dev__pylint-7114
pylint-dev__pylint-7228
pylint-dev__pylint-7993
scikit-learn__scikit-learn-13496
scikit-learn__scikit-learn-14894
scikit-learn__scikit-learn-25500
scikit-learn__scikit-learn-25638
scikit-learn__scikit-learn-25747
sphinx-doc__sphinx-10325
sphinx-doc__sphinx-8595
sympy__sympy-11870
sympy__sympy-12171
sympy__sympy-13031
sympy__sympy-20590

Empty patch instance IDs (17)

django__django-12470
django__django-16229
django__django-16820
matplotlib__matplotlib-25079
matplotlib__matplotlib-25433
matplotlib__matplotlib-25498
pytest-dev__pytest-11148
pytest-dev__pytest-5103
scikit-learn__scikit-learn-11040
scikit-learn__scikit-learn-11281
scikit-learn__scikit-learn-25570
sphinx-doc__sphinx-8435
sphinx-doc__sphinx-8474
sympy__sympy-13146
sympy__sympy-13915
sympy__sympy-17630
sympy__sympy-23191
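
The ID lists above can be grouped by repository to reproduce the per-repo counts. A minimal sketch, assuming every instance ID has the form `owner__repo-number`; `repo_of` is a helper written for this example and is not part of the SWE-bench tooling. It is shown here on the 23 error IDs.

```python
from collections import Counter

# The 23 error instance IDs listed above.
ERROR_IDS = """
astropy__astropy-14182 django__django-11283 django__django-11910
django__django-11964 django__django-12113 django__django-12284
django__django-14016 django__django-14608 django__django-17051
pylint-dev__pylint-7114 pylint-dev__pylint-7228 pylint-dev__pylint-7993
scikit-learn__scikit-learn-13496 scikit-learn__scikit-learn-14894
scikit-learn__scikit-learn-25500 scikit-learn__scikit-learn-25638
scikit-learn__scikit-learn-25747 sphinx-doc__sphinx-10325
sphinx-doc__sphinx-8595 sympy__sympy-11870 sympy__sympy-12171
sympy__sympy-13031 sympy__sympy-20590
""".split()


def repo_of(instance_id: str) -> str:
    """Map 'pylint-dev__pylint-7114' to 'pylint-dev/pylint'."""
    # Split on the LAST hyphen, since owner and repo names may contain hyphens.
    slug, _, _number = instance_id.rpartition("-")
    owner, _, name = slug.partition("__")
    return f"{owner}/{name}"


errors_by_repo = Counter(repo_of(i) for i in ERROR_IDS)

for repo, n in errors_by_repo.most_common():
    print(f"{repo:<28}{n:>3}")
```

Swapping in the resolved or empty-patch list gives the corresponding column of the per-repository table.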
