[#982][bugfix][eval] Fix cost calculation with dedup and cache-tier pricing by ayazhankadessova · Pull Request #983 · Synthesys-Lab/agentize

ayazhankadessova · 2026-03-24T04:48:51Z

Summary

- Fix cost calculation in the eval harness and usage module by adding message.id deduplication for JSONL session parsing and cache-tier-aware pricing for raw mode.
- JSONL streaming produces duplicate entries per API response (same message.id, different content blocks); these are now deduplicated per session file.
- _compute_cost now accepts cache_read/cache_write params and applies four-tier pricing (base input, cache read at 0.1x, cache write at 1.25x, output).
- _parse_claude_usage extracts cache token fields from claude -p JSON output.
- Both eval_harness._sum_jsonl_usage and usage.count_usage now deduplicate by message.id.
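
The four-tier pricing in the summary can be sketched as below. This is a minimal illustration, not the PR's `_compute_cost`: the function name `compute_cost_sketch` and the per-million-token rates are hypothetical placeholders; only the 0.1x and 1.25x multipliers and the zero defaults for the cache parameters come from the description.

```python
def compute_cost_sketch(input_tokens, output_tokens,
                        cache_read=0, cache_write=0,
                        input_rate=3.00, output_rate=15.00):
    """Estimate cost in USD with four pricing tiers.

    input_rate and output_rate are USD per million tokens and are
    placeholder values (assumptions), not real model prices.
    """
    per_input = input_rate / 1_000_000
    per_output = output_rate / 1_000_000
    return (
        input_tokens * per_input              # base input
        + cache_read * per_input * 0.1        # cache read at 0.1x input
        + cache_write * per_input * 1.25      # cache write at 1.25x input
        + output_tokens * per_output          # output
    )
```

Because the cache parameters default to 0, a call such as `compute_cost_sketch(1000, 500)` gives the same result as a two-tier calculation would, which is the backward-compatibility property the test plan mentions.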

Test plan

- 58/58 Python tests pass (python -m pytest python/tests/test_eval_harness.py)
- CLI usage shell tests pass (tests/cli/test-lol-usage.sh)
- New tests: cache-aware _compute_cost, cache extraction in _parse_claude_usage, message.id dedup in _sum_jsonl_usage
- Backward compatibility: _compute_cost defaults cache params to 0

Closes #982

🤖 Generated with Claude Code

- docs/cli/lol.md: Add dedup by message.id and cache-tier cost notes to lol usage
- python/agentize/eval/eval_harness.md: Add cost estimation section with JSONL dedup and raw mode cache-tier awareness
- python/agentize/usage.md: Document dedup and cache-tier behavior in count_usage
- python/tests/test_eval_harness.md: New test doc for eval harness test scope

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- python/tests/test_eval_harness.py: Add cache_read/cache_write cost tests for _compute_cost, cache token extraction tests for _parse_claude_usage, and message.id dedup tests for _sum_jsonl_usage
- tests/cli/test-lol-usage.sh: Add dedup fixture validation with duplicate message.id entries

Tests are expected to fail until implementation is complete.

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
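
The cache token extraction that these tests target might look roughly like the following sketch. The field names `cache_read_input_tokens` and `cache_creation_input_tokens` come from the commit messages in this PR; the top-level `usage` key, the `input_tokens`/`output_tokens` names, and the helper name `parse_usage_sketch` are assumptions made for illustration.

```python
import json


def parse_usage_sketch(raw_json):
    """Pull token counts out of a JSON document.

    Assumes (hypothetically) that counts live under a top-level
    "usage" object. Missing fields default to 0.
    """
    data = json.loads(raw_json)
    usage = data.get("usage") or {}
    return {
        "input": usage.get("input_tokens", 0),
        "output": usage.get("output_tokens", 0),
        "cache_read": usage.get("cache_read_input_tokens", 0),
        "cache_write": usage.get("cache_creation_input_tokens", 0),
    }
```

Defaulting absent cache fields to 0 keeps output without cache data parseable, in line with the backward-compatible defaults described for `_compute_cost`.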

…L dedup

- python/agentize/eval/eval_harness.py:
  - _compute_cost: Accept cache_read/cache_write params, apply four-tier pricing (base input, cache_read, cache_write, output)
  - _parse_claude_usage: Extract cache_read_input_tokens and cache_creation_input_tokens from claude -p JSON output
  - _sum_jsonl_usage: Deduplicate assistant entries by message.id per file to avoid counting streamed content blocks multiple times
- python/agentize/usage.py:
  - count_usage: Add message.id dedup per session file in JSONL parsing

All 58 Python tests pass. CLI usage tests pass.

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
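
A minimal sketch of per-file deduplication by message.id follows. The entry layout (`type`, `message.id`, `message.usage`), the `input_tokens`/`output_tokens` names, and the function name `sum_usage_dedup_sketch` are assumptions for illustration; the PR's actual `_sum_jsonl_usage` may differ, including in which duplicate it keeps.

```python
import json

# Maps output keys to the usage field names assumed in each entry.
FIELDS = {
    "input": "input_tokens",
    "output": "output_tokens",
    "cache_read": "cache_read_input_tokens",
    "cache_write": "cache_creation_input_tokens",
}


def sum_usage_dedup_sketch(lines):
    """Sum usage over the JSONL lines of one session file.

    Assistant entries sharing a message.id are counted once; this
    sketch keeps the first occurrence and skips later ones.
    """
    seen_ids = set()
    totals = {key: 0 for key in FIELDS}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or malformed lines
        if not isinstance(entry, dict) or entry.get("type") != "assistant":
            continue
        message = entry.get("message") or {}
        msg_id = message.get("id")
        if msg_id is not None:
            if msg_id in seen_ids:
                continue  # streamed duplicate of an already-counted response
            seen_ids.add(msg_id)
        usage = message.get("usage") or {}
        for key, field in FIELDS.items():
            totals[key] += usage.get(field, 0)
    return totals
```

Creating the `seen_ids` set inside the function scopes deduplication to a single file, matching the "per session file" wording above.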

- python/agentize/eval/eval_harness.py: Resolve base_dir to absolute path in _cmd_run so predictions/metrics paths survive os.chdir() into worktrees during full mode execution (line 815)

Fixes crash: FileNotFoundError on predictions.jsonl after first full-mode task

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
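
The failure mode behind this fix can be illustrated with a small sketch: a relative output directory stops pointing at the intended location once the process changes its working directory. The helper name `append_prediction_sketch` and its parameters are hypothetical; only the idea of resolving `base_dir` to an absolute path before `os.chdir()` is taken from the commit message.

```python
import os
from pathlib import Path


def append_prediction_sketch(base_dir, worktree, record):
    """Append one line to predictions.jsonl under base_dir.

    base_dir is resolved to an absolute path *before* changing
    directory, so the output path stays valid inside the worktree.
    """
    predictions = Path(base_dir).resolve() / "predictions.jsonl"
    previous = os.getcwd()
    os.chdir(worktree)
    try:
        # With a relative path this open() would target the worktree
        # (or fail) instead of the original output directory.
        with predictions.open("a", encoding="utf-8") as handle:
            handle.write(record + "\n")
    finally:
        os.chdir(previous)  # always restore the caller's directory
    return predictions
```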

…er errors

Fixes issue where duck-typed arrays with compatible but different units caused a ValueError instead of allowing numpy's reflected ufunc dispatch to proceed.

Changes:
- astropy/units/quantity.py: Wrap converter application loop in try/except (TypeError, ValueError) and return NotImplemented on failure, allowing reflected dispatch (e.g. DuckArray.__array_ufunc__) to handle compatible unit conversions.
- astropy/units/tests/test_quantity_ufuncs.py: Add TestQuantityDuckArrayUfunc class with a regression test for (1*u.m) + DuckArray(1*u.mm) returning a DuckArray result instead of raising ValueError.
- docs/changes/units/13977.bugfix.rst: Add changelog entry for this bugfix.

Closes #13977.
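
The "return NotImplemented on failure" pattern can be shown by analogy with Python's binary operator protocol. This is a toy sketch, not astropy or numpy code: the classes `Length` and `DuckLength` are hypothetical, and it uses `__add__`/`__radd__` rather than `__array_ufunc__`, but the dispatch idea is the same: when the first operand declines, the other operand gets a chance to handle the operation.

```python
class Length:
    """Toy unit-aware value stored in metres (hypothetical)."""

    def __init__(self, metres):
        self.metres = metres

    @staticmethod
    def _to_metres(value):
        if isinstance(value, Length):
            return value.metres
        raise TypeError(f"cannot convert {type(value).__name__} to metres")

    def __add__(self, other):
        try:
            other_metres = self._to_metres(other)
        except (TypeError, ValueError):
            # Decline instead of raising, so Python tries other.__radd__.
            return NotImplemented
        return Length(self.metres + other_metres)


class DuckLength:
    """Toy duck type that stores millimetres and knows about Length."""

    def __init__(self, millimetres):
        self.millimetres = millimetres

    def __radd__(self, other):
        if isinstance(other, Length):
            return DuckLength(other.metres * 1000 + self.millimetres)
        return NotImplemented


result = Length(1) + DuckLength(1)  # handled by DuckLength.__radd__
```

If `Length.__add__` raised the conversion error directly, `DuckLength.__radd__` would never run, which mirrors the ValueError described in the commit message.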

- cc.r: $1.47 → $4.64 (cache-tier pricing now included)
- cc.script: $195.30 → $70.62 (JSONL dedup removes ~1.7x inflation)
- Cost ratio: 133x → ~15x (accurate multi-agent overhead)
- Updated all derived metrics: $/task, $/resolved, marginal cost
- Added cost correction note explaining the two bugs fixed in #982

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
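
The reported cost ratios follow directly from the dollar figures in this commit message, as the short check below shows. The dictionary and function names are illustrative only.

```python
# Dollar totals quoted in the commit message, before and after the fix.
before = {"cc.r": 1.47, "cc.script": 195.30}
after = {"cc.r": 4.64, "cc.script": 70.62}


def cost_ratio(costs):
    """Ratio of cc.script cost to cc.r cost."""
    return costs["cc.script"] / costs["cc.r"]


print(round(cost_ratio(before)))    # 133
print(round(cost_ratio(after), 1))  # 15.2
```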

Ayazhan Kadessova and others added 3 commits March 24, 2026 04:48

ayazhankadessova added the agentize:pr PR created by agentize label Mar 24, 2026

Ayazhan Kadessova and others added 3 commits March 24, 2026 10:00

ayazhankadessova wants to merge 6 commits into main from issue-982


2 participants
