SWE-bench Lite Evaluation Report
Result: 177/300 resolved (59.0%)
| Metric |
Value |
| Total instances |
300 |
| Resolved |
177 (59.0%) |
| Unresolved |
83 (27.7%) |
| Errors (build/patch) |
23 (7.7%) |
| Empty patches |
17 (5.7%) |
| Patch rate |
94.3% (283/300 non-empty) |
| Timeouts |
0 |
Run Configuration
| Setting |
Value |
| Model |
claude-sonnet-4-6 |
| Provider |
anthropic (via Bedrock gateway) |
| Gateway |
bedrock-gateway.2389-research-inc.workers.dev |
| Max turns |
50 |
| Timeout |
30m per instance |
| Commit |
76164b0a41f2 |
| Host |
aibox02 |
| Harness |
tracker-swebench (in-container agent-runner) |
| Evaluator |
Official SWE-bench Docker harness |
| Dataset |
princeton-nlp/SWE-bench_Lite |
Token Usage
| Metric |
Value |
| Total input tokens |
189,446,826 (189.4M) |
| Total output tokens |
1,975,922 (1.98M) |
| Avg input per instance |
631,489 |
| Avg output per instance |
6,586 |
| Total turns |
10,365 |
| Avg turns per instance |
34.5 |
Timing
| Metric |
Value |
| Total agent time (sum) |
14.1 hours |
| Avg duration per instance |
2.8 min |
| Median duration |
2.7 min |
| Min / Max |
39s / 9m20s |
Per-Repository Breakdown
| Repo |
Total |
Resolved |
Unresolved |
Error |
Empty |
Rate |
| astropy/astropy |
6 |
4 |
1 |
1 |
0 |
66.7% |
| django/django |
114 |
84 |
19 |
8 |
3 |
73.7% |
| matplotlib/matplotlib |
23 |
9 |
11 |
0 |
3 |
39.1% |
| mwaskom/seaborn |
4 |
2 |
2 |
0 |
0 |
50.0% |
| pallets/flask |
3 |
1 |
2 |
0 |
0 |
33.3% |
| psf/requests |
6 |
6 |
0 |
0 |
0 |
100.0% |
| pydata/xarray |
5 |
2 |
3 |
0 |
0 |
40.0% |
| pylint-dev/pylint |
6 |
1 |
2 |
3 |
0 |
16.7% |
| pytest-dev/pytest |
17 |
11 |
4 |
0 |
2 |
64.7% |
| scikit-learn/scikit-learn |
23 |
11 |
4 |
5 |
3 |
47.8% |
| sphinx-doc/sphinx |
16 |
3 |
9 |
2 |
2 |
18.8% |
| sympy/sympy |
77 |
43 |
26 |
4 |
4 |
55.8% |
| Total |
300 |
177 |
83 |
23 |
17 |
59.0% |
Observations
- Django dominates — 114/300 instances (38%), with a strong 73.7% resolve rate
- psf/requests: 100% — all 6 instances resolved
- sphinx-doc/sphinx and pylint-dev/pylint are weak spots (18.8% and 16.7%)
- 23 errors are infrastructure-level (build failures, patch application errors), not agent failures — mostly sympy setup (exit code 128), malformed patches
- 17 empty patches — agent ran but produced no diff. Likely hard instances or agent giving up
- 0 timeouts across all 300 instances with 30m limit — budget is generous
Infrastructure Notes
- The Bedrock gateway (
bedrock-gateway.2389-research-inc.workers.dev) initially didn't support SSE streaming, which tracker's agent loop requires internally. Gateway-dev fixed this by synthesizing SSE events from non-streaming Bedrock responses ("fake stream"). This is transparent to the agent.
- Evaluation ran on aibox02 using Docker-based SWE-bench harness. Each instance gets its own container with the repo checked out at the correct commit.
- The
agent-runner binary runs inside containers, creating an agent.Session directly with the tracker SDK.
Context
This is tracker's first full SWE-bench Lite run. The 59.0% result with Sonnet 4.6 is competitive — for reference, the SWE-bench Lite leaderboard top entries are in the 40-50% range for unassisted single-agent systems (though methodology varies).
Resolved instance IDs (177)
astropy__astropy-12907
astropy__astropy-14365
astropy__astropy-14995
astropy__astropy-6938
django__django-10914
django__django-11001
django__django-11039
django__django-11049
django__django-11099
django__django-11133
django__django-11179
django__django-11422
django__django-11583
django__django-11620
django__django-11797
django__django-11815
django__django-11848
django__django-11905
django__django-11999
django__django-12125
django__django-12184
django__django-12286
django__django-12308
django__django-12453
django__django-12497
django__django-12589
django__django-12700
django__django-12708
django__django-12747
django__django-12856
django__django-12908
django__django-12915
django__django-12983
django__django-13028
django__django-13033
django__django-13158
django__django-13315
django__django-13401
django__django-13447
django__django-13448
django__django-13551
django__django-13590
django__django-13658
django__django-13710
django__django-13757
django__django-13768
django__django-13925
django__django-13933
django__django-13964
django__django-14238
django__django-14382
django__django-14411
django__django-14580
django__django-14667
django__django-14672
django__django-14752
django__django-14787
django__django-14855
django__django-14915
django__django-14997
django__django-14999
django__django-15061
django__django-15213
django__django-15320
django__django-15347
django__django-15388
django__django-15400
django__django-15498
django__django-15738
django__django-15781
django__django-15790
django__django-15814
django__django-15819
django__django-15851
django__django-15902
django__django-15996
django__django-16041
django__django-16046
django__django-16139
django__django-16255
django__django-16379
django__django-16400
django__django-16408
django__django-16527
django__django-16595
django__django-16873
django__django-16910
django__django-17087
matplotlib__matplotlib-22835
matplotlib__matplotlib-23562
matplotlib__matplotlib-23913
matplotlib__matplotlib-23964
matplotlib__matplotlib-24265
matplotlib__matplotlib-25311
matplotlib__matplotlib-25332
matplotlib__matplotlib-25442
matplotlib__matplotlib-26020
mwaskom__seaborn-3010
mwaskom__seaborn-3190
pallets__flask-4992
psf__requests-1963
psf__requests-2148
psf__requests-2317
psf__requests-2674
psf__requests-3362
psf__requests-863
pydata__xarray-4094
pydata__xarray-5131
pylint-dev__pylint-5859
pytest-dev__pytest-11143
pytest-dev__pytest-5227
pytest-dev__pytest-5413
pytest-dev__pytest-5692
pytest-dev__pytest-6116
pytest-dev__pytest-7168
pytest-dev__pytest-7373
pytest-dev__pytest-7432
pytest-dev__pytest-7490
pytest-dev__pytest-8906
pytest-dev__pytest-9359
scikit-learn__scikit-learn-10297
scikit-learn__scikit-learn-12471
scikit-learn__scikit-learn-13142
scikit-learn__scikit-learn-13241
scikit-learn__scikit-learn-13439
scikit-learn__scikit-learn-13584
scikit-learn__scikit-learn-13779
scikit-learn__scikit-learn-14092
scikit-learn__scikit-learn-14983
scikit-learn__scikit-learn-15512
scikit-learn__scikit-learn-15535
sphinx-doc__sphinx-11445
sphinx-doc__sphinx-8627
sphinx-doc__sphinx-8713
sympy__sympy-12419
sympy__sympy-12481
sympy__sympy-13471
sympy__sympy-13480
sympy__sympy-13647
sympy__sympy-13773
sympy__sympy-13971
sympy__sympy-14396
sympy__sympy-14774
sympy__sympy-15011
sympy__sympy-15345
sympy__sympy-15346
sympy__sympy-15609
sympy__sympy-15678
sympy__sympy-16503
sympy__sympy-16792
sympy__sympy-16988
sympy__sympy-17022
sympy__sympy-17655
sympy__sympy-18057
sympy__sympy-18087
sympy__sympy-18189
sympy__sympy-18532
sympy__sympy-18621
sympy__sympy-18698
sympy__sympy-19487
sympy__sympy-20049
sympy__sympy-20154
sympy__sympy-20212
sympy__sympy-20442
sympy__sympy-21055
sympy__sympy-21379
sympy__sympy-21612
sympy__sympy-21614
sympy__sympy-21627
sympy__sympy-21847
sympy__sympy-22005
sympy__sympy-22714
sympy__sympy-23117
sympy__sympy-23262
sympy__sympy-24066
sympy__sympy-24152
sympy__sympy-24213
Error instance IDs (23)
astropy__astropy-14182
django__django-11283
django__django-11910
django__django-11964
django__django-12113
django__django-12284
django__django-14016
django__django-14608
django__django-17051
pylint-dev__pylint-7114
pylint-dev__pylint-7228
pylint-dev__pylint-7993
scikit-learn__scikit-learn-13496
scikit-learn__scikit-learn-14894
scikit-learn__scikit-learn-25500
scikit-learn__scikit-learn-25638
scikit-learn__scikit-learn-25747
sphinx-doc__sphinx-10325
sphinx-doc__sphinx-8595
sympy__sympy-11870
sympy__sympy-12171
sympy__sympy-13031
sympy__sympy-20590
Empty patch instance IDs (17)
django__django-12470
django__django-16229
django__django-16820
matplotlib__matplotlib-25079
matplotlib__matplotlib-25433
matplotlib__matplotlib-25498
pytest-dev__pytest-11148
pytest-dev__pytest-5103
scikit-learn__scikit-learn-11040
scikit-learn__scikit-learn-11281
scikit-learn__scikit-learn-25570
sphinx-doc__sphinx-8435
sphinx-doc__sphinx-8474
sympy__sympy-13146
sympy__sympy-13915
sympy__sympy-17630
sympy__sympy-23191
SWE-bench Lite Evaluation Report
Result: 177/300 resolved (59.0%)
Run Configuration
claude-sonnet-4-6anthropic(via Bedrock gateway)bedrock-gateway.2389-research-inc.workers.dev76164b0a41f2tracker-swebench(in-containeragent-runner)princeton-nlp/SWE-bench_LiteToken Usage
Timing
Per-Repository Breakdown
Observations
Infrastructure Notes
bedrock-gateway.2389-research-inc.workers.dev) initially didn't support SSE streaming, which tracker's agent loop requires internally. Gateway-dev fixed this by synthesizing SSE events from non-streaming Bedrock responses ("fake stream"). This is transparent to the agent.agent-runnerbinary runs inside containers, creating anagent.Sessiondirectly with the tracker SDK.Context
This is tracker's first full SWE-bench Lite run. The 59.0% result with Sonnet 4.6 is competitive — for reference, the SWE-bench Lite leaderboard top entries are in the 40-50% range for unassisted single-agent systems (though methodology varies).
Resolved instance IDs (177)
Error instance IDs (23)
Empty patch instance IDs (17)