[#982][bugfix][eval] Fix cost calculation with dedup and cache-tier pricing #983
Open
ayazhankadessova wants to merge 6 commits into main
- docs/cli/lol.md: Add dedup by message.id and cache-tier cost notes to lol usage
- python/agentize/eval/eval_harness.md: Add cost estimation section with JSONL dedup and raw mode cache-tier awareness
- python/agentize/usage.md: Document dedup and cache-tier behavior in count_usage
- python/tests/test_eval_harness.md: New test doc for eval harness test scope
Related: #982
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- python/tests/test_eval_harness.py: Add cache_read/cache_write cost tests for _compute_cost, cache token extraction tests for _parse_claude_usage, and message.id dedup tests for _sum_jsonl_usage
- tests/cli/test-lol-usage.sh: Add dedup fixture validation with duplicate message.id entries
Tests are expected to fail until implementation is complete.
Related: #982
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…L dedup
- python/agentize/eval/eval_harness.py:
- _compute_cost: Accept cache_read/cache_write params, apply four-tier
pricing (base input, cache_read, cache_write, output)
- _parse_claude_usage: Extract cache_read_input_tokens and
cache_creation_input_tokens from claude -p JSON output
- _sum_jsonl_usage: Deduplicate assistant entries by message.id per file
to avoid counting streamed content blocks multiple times
- python/agentize/usage.py:
- count_usage: Add message.id dedup per session file in JSONL parsing
All 58 Python tests pass. CLI usage tests pass.
Related: #982
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
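The per-file dedup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `_sum_jsonl_usage` or `count_usage`: the function name `sum_usage_dedup`, the entry layout (`type`, `message.id`, `message.usage`), the `input_tokens`/`output_tokens` keys, and the choice to keep the first occurrence of each `message.id` are all assumptions.

```python
import json
from typing import Dict, Iterable

# Hypothetical usage keys; the two cache field names come from the commit
# message, the other two are assumed.
USAGE_KEYS = (
    "input_tokens",
    "output_tokens",
    "cache_read_input_tokens",
    "cache_creation_input_tokens",
)


def sum_usage_dedup(lines: Iterable[str]) -> Dict[str, int]:
    """Sum token usage for ONE session file, counting each message.id once."""
    totals = {key: 0 for key in USAGE_KEYS}
    seen_ids = set()  # reset per file, so dedup never crosses sessions
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole file
        if not isinstance(entry, dict) or entry.get("type") != "assistant":
            continue
        message = entry.get("message") or {}
        msg_id = message.get("id")
        if msg_id is not None:
            if msg_id in seen_ids:
                continue  # same message streamed again with another content block
            seen_ids.add(msg_id)
        usage = message.get("usage") or {}
        for key in USAGE_KEYS:
            totals[key] += int(usage.get(key) or 0)
    return totals
```

Entries without a `message.id` are still counted in this sketch, since there is nothing to deduplicate them against.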
- python/agentize/eval/eval_harness.py: Resolve base_dir to absolute path in _cmd_run so predictions/metrics paths survive os.chdir() into worktrees during full mode execution (line 815)
Fixes crash: FileNotFoundError on predictions.jsonl after first full-mode task
Related: #982
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
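The failure mode behind this commit is that a relative path is re-interpreted against the new working directory after `os.chdir()`. A small sketch of the idea, assuming a hypothetical helper `append_predictions` (the real change lives in `_cmd_run` and is not reproduced here):

```python
import os
from pathlib import Path
from typing import Iterable


def append_predictions(base_dir: str, worktrees: Iterable[str]) -> Path:
    """Write one line per worktree to base_dir/predictions.jsonl.

    Hypothetical helper: base_dir is resolved to an absolute path BEFORE any
    chdir, so the output path stays valid inside each worktree.
    """
    base = Path(base_dir).resolve()
    base.mkdir(parents=True, exist_ok=True)
    predictions = base / "predictions.jsonl"
    original_cwd = Path.cwd()
    for worktree in worktrees:
        os.chdir(worktree)  # simulates entering a task worktree
        try:
            with predictions.open("a", encoding="utf-8") as fh:
                fh.write('{"worktree": "%s"}\n' % Path.cwd().name)
        finally:
            os.chdir(original_cwd)  # always restore the caller's cwd
    return predictions
```

Without the `resolve()` call, the second iteration would look for `predictions.jsonl` relative to the worktree rather than the original directory.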
…er errors
Fixes issue where duck-typed arrays with compatible but different units caused a ValueError instead of allowing numpy's reflected ufunc dispatch to proceed.
Changes:
- astropy/units/quantity.py: Wrap converter application loop in try/except (TypeError, ValueError) and return NotImplemented on failure, allowing reflected dispatch (e.g. DuckArray.__array_ufunc__) to handle compatible unit conversions.
- astropy/units/tests/test_quantity_ufuncs.py: Add TestQuantityDuckArrayUfunc class with a regression test for (1*u.m) + DuckArray(1*u.mm) returning a DuckArray result instead of raising ValueError.
- docs/changes/units/13977.bugfix.rst: Add changelog entry for this bugfix.
Closes #13977.
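The "return NotImplemented so the other operand gets a turn" pattern can be illustrated with plain Python operators. This is an analogy only: it uses `__add__`/`__radd__` rather than numpy's `__array_ufunc__`, and the classes `Meters` and `DuckMillimeters` are hypothetical stand-ins, not astropy code.

```python
class Meters:
    """Toy quantity that only knows how to convert other Meters."""

    def __init__(self, value):
        self.value = value

    @staticmethod
    def _to_meters(other):
        if isinstance(other, Meters):
            return other.value
        raise TypeError("cannot convert %r to meters" % type(other).__name__)

    def __add__(self, other):
        try:
            return Meters(self.value + self._to_meters(other))
        except (TypeError, ValueError):
            # Defer instead of raising, so Python tries other.__radd__.
            return NotImplemented


class DuckMillimeters:
    """Toy duck type that knows how to combine itself with Meters."""

    def __init__(self, value):
        self.value = value

    def __radd__(self, other):
        if isinstance(other, Meters):
            return DuckMillimeters(other.value * 1000 + self.value)
        return NotImplemented


result = Meters(1) + DuckMillimeters(1)
print(type(result).__name__, result.value)
```

If `Meters.__add__` raised instead of returning `NotImplemented`, the reflected method would never run, which mirrors the ValueError described in the commit message.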
- cc.r: $1.47 → $4.64 (cache-tier pricing now included)
- cc.script: $195.30 → $70.62 (JSONL dedup removes ~1.7x inflation)
- Cost ratio: 133x → ~15x (accurate multi-agent overhead)
- Updated all derived metrics: $/task, $/resolved, marginal cost
- Added cost correction note explaining the two bugs fixed in #982
Related: #982
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
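The cost ratios quoted in this commit follow directly from the four dollar figures. A quick arithmetic check, where the variable names are illustrative and the figures are the ones stated above:

```python
# Dollar figures taken from the commit message above.
old_cc_r, new_cc_r = 1.47, 4.64
old_cc_script, new_cc_script = 195.30, 70.62

# Ratio of cc.script cost to cc.r cost, before and after the fixes.
old_ratio = old_cc_script / old_cc_r
new_ratio = new_cc_script / new_cc_r

print(f"old ratio: {old_ratio:.1f}x")
print(f"new ratio: {new_ratio:.1f}x")
```

Rounded to whole numbers, these come out at 133x and 15x, matching the reported change.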
Summary
- `message.id` deduplication for JSONL session parsing and cache-tier-aware pricing for raw mode
- Streamed assistant entries (same `message.id`, different content blocks) are now deduplicated per session file
- `_compute_cost` now accepts `cache_read`/`cache_write` params and applies four-tier pricing (base input, cache read at 0.1x, cache write at 1.25x, output)
- `_parse_claude_usage` extracts cache token fields from `claude -p` JSON output
- `eval_harness._sum_jsonl_usage` and `usage.count_usage` now deduplicate by `message.id`

Test plan
- Python tests pass (`python -m pytest python/tests/test_eval_harness.py`)
- CLI usage tests pass (`tests/cli/test-lol-usage.sh`)
- Tests cover cache-tier pricing in `_compute_cost`, cache extraction in `_parse_claude_usage`, and `message.id` dedup in `_sum_jsonl_usage`
- `_compute_cost` defaults cache params to 0

Closes #982
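The four-tier pricing from the summary can be sketched as below. Only the 0.1x and 1.25x multipliers and the defaulting of cache params to 0 come from this PR; the function name `compute_cost_sketch`, the per-million-token prices, and the assumption that `input_tokens` excludes cached tokens are hypothetical and do not reflect the actual `_compute_cost` signature.

```python
# Multipliers stated in the PR summary, relative to the base input price.
CACHE_READ_MULTIPLIER = 0.1
CACHE_WRITE_MULTIPLIER = 1.25


def compute_cost_sketch(
    input_tokens: int,
    output_tokens: int,
    cache_read: int = 0,   # defaults keep older two-argument callers working
    cache_write: int = 0,
    input_price_per_mtok: float = 3.0,    # hypothetical USD per 1M input tokens
    output_price_per_mtok: float = 15.0,  # hypothetical USD per 1M output tokens
) -> float:
    """Return an estimated cost in USD using four pricing tiers.

    Assumes input_tokens counts only uncached input, so the three input-side
    tiers do not overlap.
    """
    input_cost = input_tokens * input_price_per_mtok
    read_cost = cache_read * input_price_per_mtok * CACHE_READ_MULTIPLIER
    write_cost = cache_write * input_price_per_mtok * CACHE_WRITE_MULTIPLIER
    output_cost = output_tokens * output_price_per_mtok
    return (input_cost + read_cost + write_cost + output_cost) / 1_000_000
```

With the cache arguments omitted, the result reduces to plain input plus output pricing, which is the backward-compatible behavior the test plan calls out.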
🤖 Generated with Claude Code