[#982][bugfix][eval] Fix cost calculation with dedup and cache-tier pricing#983

Open
ayazhankadessova wants to merge 6 commits into main from issue-982

@ayazhankadessova
Contributor

Summary

  • Fix cost calculation in eval harness and usage module by adding message.id deduplication for JSONL session parsing and cache-tier-aware pricing for raw mode
  • JSONL streaming produces duplicate entries per API response (same message.id, different content blocks) — these are now deduplicated per session file
  • _compute_cost now accepts cache_read/cache_write params and applies four-tier pricing (base input, cache read at 0.1x, cache write at 1.25x, output)
  • _parse_claude_usage extracts cache token fields from claude -p JSON output
  • Both eval_harness._sum_jsonl_usage and usage.count_usage now deduplicate by message.id
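The four-tier pricing described in the summary can be sketched roughly as follows. This is a minimal illustration, not the actual `_compute_cost` from the PR: the function name `compute_cost_sketch`, its signature, and the per-million-token rate parameters are hypothetical, and the sketch assumes `input_tokens` counts only uncached input. Only the 0.1x and 1.25x multipliers come from the summary above.

```python
# Multipliers taken from the PR summary; everything else here is an assumption.
CACHE_READ_MULTIPLIER = 0.1
CACHE_WRITE_MULTIPLIER = 1.25


def compute_cost_sketch(input_tokens, output_tokens, input_rate, output_rate,
                        cache_read=0, cache_write=0):
    """Estimate cost in USD from token counts.

    input_rate / output_rate are hypothetical prices in USD per million tokens.
    cache_read / cache_write default to 0, so callers that pass only input and
    output tokens get the same result as a two-tier calculation.
    """
    per_input_token = input_rate / 1_000_000
    per_output_token = output_rate / 1_000_000
    return (
        input_tokens * per_input_token                              # base input
        + cache_read * per_input_token * CACHE_READ_MULTIPLIER      # cache read
        + cache_write * per_input_token * CACHE_WRITE_MULTIPLIER    # cache write
        + output_tokens * per_output_token                          # output
    )
```

With placeholder rates of 3.0 and 15.0 USD per million tokens, one million input and one million output tokens cost about 18.0; adding one million cache-read and one million cache-write tokens raises that to about 22.05.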

Test plan

  • 58/58 Python tests pass (python -m pytest python/tests/test_eval_harness.py)
  • CLI usage shell tests pass (tests/cli/test-lol-usage.sh)
  • New tests: cache-aware _compute_cost, cache extraction in _parse_claude_usage, message.id dedup in _sum_jsonl_usage
  • Backward compatibility: _compute_cost defaults cache params to 0

Closes #982

🤖 Generated with Claude Code

Ayazhan Kadessova and others added 3 commits March 24, 2026 04:48
- docs/cli/lol.md: Add dedup by message.id and cache-tier cost notes to lol usage
- python/agentize/eval/eval_harness.md: Add cost estimation section with JSONL
  dedup and raw mode cache-tier awareness
- python/agentize/usage.md: Document dedup and cache-tier behavior in count_usage
- python/tests/test_eval_harness.md: New test doc for eval harness test scope

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- python/tests/test_eval_harness.py: Add cache_read/cache_write cost tests
  for _compute_cost, cache token extraction tests for _parse_claude_usage,
  and message.id dedup tests for _sum_jsonl_usage
- tests/cli/test-lol-usage.sh: Add dedup fixture validation with duplicate
  message.id entries

Tests are expected to fail until implementation is complete.

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…L dedup

- python/agentize/eval/eval_harness.py:
  - _compute_cost: Accept cache_read/cache_write params, apply four-tier
    pricing (base input, cache_read, cache_write, output)
  - _parse_claude_usage: Extract cache_read_input_tokens and
    cache_creation_input_tokens from claude -p JSON output
  - _sum_jsonl_usage: Deduplicate assistant entries by message.id per file
    to avoid counting streamed content blocks multiple times
- python/agentize/usage.py:
  - count_usage: Add message.id dedup per session file in JSONL parsing
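
The per-file dedup by message.id could look something like the sketch below. It is not the code from `_sum_jsonl_usage` or `count_usage`: the helper name `sum_usage_dedup` is hypothetical, and the entry layout (a `type` of `"assistant"` with usage nested under `message.usage`) is an assumption about the session JSONL format. The cache field names are the ones listed in the commit message above.

```python
import json


def sum_usage_dedup(lines):
    """Sum token usage over JSONL lines from one session file.

    Assistant entries sharing a message.id are counted once (first wins).
    Entries without an id are counted as-is, which is an assumption.
    """
    seen_ids = set()
    totals = {
        "input_tokens": 0,
        "output_tokens": 0,
        "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 0,
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole file
        if not isinstance(entry, dict) or entry.get("type") != "assistant":
            continue
        message = entry.get("message")
        if not isinstance(message, dict):
            continue
        msg_id = message.get("id")
        if msg_id is not None:
            if msg_id in seen_ids:
                continue  # same API response, different streamed content block
            seen_ids.add(msg_id)
        usage = message.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0) or 0)
    return totals
```

Creating `seen_ids` inside the function keeps the dedup scoped to a single file, matching the "per session file" wording in the commit message.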

All 58 Python tests pass. CLI usage tests pass.

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ayazhankadessova ayazhankadessova added the agentize:pr PR created by agentize label Mar 24, 2026
Ayazhan Kadessova and others added 3 commits March 24, 2026 10:00
- python/agentize/eval/eval_harness.py: Resolve base_dir to absolute path
  in _cmd_run so predictions/metrics paths survive os.chdir() into worktrees
  during full mode execution (line 815)

Fixes crash: FileNotFoundError on predictions.jsonl after first full-mode task
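
The failure mode and the fix can be illustrated with a small self-contained sketch. The helper `resolve_output_paths` and the `metrics.json` file name are hypothetical; only `predictions.jsonl`, the `os.chdir()` into a worktree, and the idea of resolving `base_dir` to an absolute path come from the commit message.

```python
import os
import tempfile
from pathlib import Path


def resolve_output_paths(base_dir):
    """Resolve base_dir once, up front, so later os.chdir() calls cannot
    change where the output files point. metrics.json is a placeholder name."""
    base = Path(base_dir).resolve()
    return base / "predictions.jsonl", base / "metrics.json"


def demo():
    """Write predictions after changing directory, as full mode would."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as root:
        root = Path(root).resolve()
        (root / "out").mkdir()
        (root / "worktree").mkdir()
        os.chdir(root)
        try:
            predictions, _metrics = resolve_output_paths("out")  # relative input
            os.chdir(root / "worktree")       # simulate entering a worktree
            predictions.write_text("{}\n")    # still lands in <root>/out
            return (predictions.is_absolute(),
                    (root / "out" / "predictions.jsonl").exists())
        finally:
            os.chdir(start)  # restore cwd before the temp dir is cleaned up
```

Had the relative path `out/predictions.jsonl` been kept instead, the write after `os.chdir()` would have targeted `worktree/out/`, which does not exist.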

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…er errors

Fixes issue where duck-typed arrays with compatible but different units caused
a ValueError instead of allowing numpy's reflected ufunc dispatch to proceed.

Changes:
- astropy/units/quantity.py: Wrap converter application loop in
  try/except (TypeError, ValueError) and return NotImplemented on failure,
  allowing reflected dispatch (e.g. DuckArray.__array_ufunc__) to handle
  compatible unit conversions.
- astropy/units/tests/test_quantity_ufuncs.py: Add TestQuantityDuckArrayUfunc
  class with a regression test for (1*u.m) + DuckArray(1*u.mm) returning a
  DuckArray result instead of raising ValueError.
- docs/changes/units/13977.bugfix.rst: Add changelog entry for this bugfix.

Closes #13977.
- cc.r: $1.47 → $4.64 (cache-tier pricing now included)
- cc.script: $195.30 → $70.62 (JSONL dedup removes ~1.7x inflation)
- Cost ratio: 133x → ~15x (accurate multi-agent overhead)
- Updated all derived metrics: $/task, $/resolved, marginal cost
- Added cost correction note explaining the two bugs fixed in #982
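
As a quick check, the cost ratios quoted above follow directly from the four dollar figures. The variable names are illustrative only.

```python
# Dollar figures copied from the commit message above.
old_cc_r, old_cc_script = 1.47, 195.30
new_cc_r, new_cc_script = 4.64, 70.62

old_ratio = old_cc_script / old_cc_r  # cost ratio before the fix
new_ratio = new_cc_script / new_cc_r  # cost ratio after the fix

print(f"before: {old_ratio:.1f}x, after: {new_ratio:.1f}x")
```

The first ratio rounds to 133 and the second to 15, matching the "133x → ~15x" line.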

Related: #982

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels

agentize:pr PR created by agentize

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[plan][eval] Fix cost calculation with session-oriented dedup and cache-aware pricing

2 participants