
feat(token-usage): Add token usage tracking and cost calculation #40

Merged

hurshore merged 15 commits into main from ft/token-usage on Dec 30, 2025

Conversation

ayo6706 (Collaborator) commented Dec 18, 2025

This PR resolves #35.

This PR implements comprehensive token usage tracking and cost calculation for LLM evaluations. It enables users to monitor token consumption per run and estimate costs by configuring pricing rates via environment variables.

Key Changes

  • Core Types: Added TokenUsage, TokenUsageStats, and PricingConfig interfaces to standardize tracking across the system.
  • Provider Updates: Updated all LLM providers (Anthropic, OpenAI, Azure, Gemini) to extract and return input/output token counts in standard LLMResult format.
  • CLI Aggregation: The orchestrator now aggregates token counts across all evaluated files and prompts.
  • Reporting: Added a new summary section to the CLI output that displays total input/output tokens and estimated cost.
  • Configuration: Introduced INPUT_PRICE_PER_MILLION and OUTPUT_PRICE_PER_MILLION environment variables to enable cost calculation.
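
The core pieces above can be sketched in TypeScript. This is a simplified sketch based on the types and function named in this PR, not the exact source:

```typescript
// Simplified sketch of the core token-usage types and cost calculation.
// Shapes follow the PR description; the real source may differ in detail.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface PricingConfig {
  inputPricePerMillion?: number;
  outputPricePerMillion?: number;
}

// Returns undefined when pricing is missing or incomplete, so the reporter
// can skip the cost line gracefully.
function calculateCost(usage: TokenUsage, pricing?: PricingConfig): number | undefined {
  if (
    !pricing ||
    pricing.inputPricePerMillion === undefined ||
    pricing.outputPricePerMillion === undefined
  ) {
    return undefined;
  }
  const inputCost = (usage.inputTokens / 1_000_000) * pricing.inputPricePerMillion;
  const outputCost = (usage.outputTokens / 1_000_000) * pricing.outputPricePerMillion;
  return inputCost + outputCost;
}
```

Rates are expressed per one million tokens, matching the environment variables described below.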

Configuration

To enable cost estimation, add the following to your .env file (rates per 1 million tokens):

INPUT_PRICE_PER_MILLION=3.00   # e.g., $3.00/1M input tokens
OUTPUT_PRICE_PER_MILLION=15.00 # e.g., $15.00/1M output tokens
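
For illustration, turning these variables into a pricing config might look like the sketch below. The project actually validates them through its Zod env schema (with .positive()); readPricing here is a hypothetical helper:

```typescript
// Hypothetical helper: map the two env vars onto a pricing config.
// The real CLI validates these via its env schema; this only illustrates
// the expected semantics (positive numbers, otherwise treated as unset).
function readPricing(env: Record<string, string | undefined>): {
  inputPricePerMillion?: number;
  outputPricePerMillion?: number;
} {
  const parse = (v: string | undefined): number | undefined => {
    if (v === undefined) return undefined;
    const n = Number(v);
    return Number.isFinite(n) && n > 0 ? n : undefined;
  };
  return {
    inputPricePerMillion: parse(env.INPUT_PRICE_PER_MILLION),
    outputPricePerMillion: parse(env.OUTPUT_PRICE_PER_MILLION),
  };
}
```

When either variable is absent, the cost line is simply omitted from the output.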

Example Output

Token Usage:
  - Input tokens: 15,420
  - Output tokens: 4,210
  - Total cost: $0.1094
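
The totals above are consistent with the example rates: 15,420 input tokens at $3.00/1M plus 4,210 output tokens at $15.00/1M come to $0.1094:

```typescript
// Checking the example output against the configured rates.
const inputCost = (15_420 / 1_000_000) * 3.0;   // 0.04626
const outputCost = (4_210 / 1_000_000) * 15.0;  // 0.06315
const total = inputCost + outputCost;           // 0.10941
console.log(`$${total.toFixed(4)}`);            // prints "$0.1094"
```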

Summary by CodeRabbit

  • New Features

    • Evaluation reports now include token usage (input/output tokens) and optional total cost.
    • CLI prints a token usage summary; pricing can be provided via INPUT_PRICE_PER_MILLION and OUTPUT_PRICE_PER_MILLION env vars.
    • Provider responses now return a wrapped result with data + usage, enabling downstream usage reporting.
  • Tests

    • Added/updated tests for token usage reporting, cost calculation, and wrapped response shape.
  • Documentation / Config

    • Scan config runRules now defaults to an empty list.


- Add token usage tracking to LLM providers (Anthropic, Azure OpenAI, Gemini, OpenAI)
- Implement LLMResult wrapper type to return both data and token usage from provider calls
- Add TokenUsageStats type and calculateCost function for pricing calculations
- Add environment variables for input and output token pricing configuration
- Integrate token usage accumulation in orchestrator during file evaluation
- Add printTokenUsage function to display token usage and cost in Line output format
- Include token usage stats in EvaluateFileResult for downstream consumption
- Add comprehensive tests for token usage calculation and provider integration
- Update provider interfaces to return structured results with usage metadata
- Move token-usage.ts from src/types/ to src/providers/ for better architectural organization
- Update all import paths across codebase to reference new token-usage location
- Add pricing configuration parameter to EvaluationOptions interface
- Pass pricing config from CLI commands through orchestrator to evaluation functions
- Remove redundant environment variable parsing from orchestrator, use passed config instead
- Update PricingConfig type annotations for explicit undefined handling
- Change config schema runRules default from optional to empty array
- Consolidate token usage and pricing logic in providers module for improved separation of concerns
coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

Walkthrough

Adds token-usage collection and optional cost calculation: LLM providers now return usage with responses; evaluators and orchestrator aggregate usage and cost (when pricing provided); CLI reads pricing and prints token usage and total cost via reporter.

Changes

  • Provider interfaces & implementations (src/providers/llm-provider.ts, src/providers/anthropic-provider.ts, src/providers/azure-openai-provider.ts, src/providers/gemini-provider.ts, src/providers/openai-provider.ts): Introduce LLMResult<T> ({ data: T; usage?: TokenUsage }) and change runPromptStructured to return Promise<LLMResult<T>>; providers now wrap parsed data and attach input/output token usage.
  • Token usage & pricing utilities (src/providers/token-usage.ts): New types TokenUsage, TokenUsageStats, PricingConfig and calculateCost(usage, pricing?), which computes a cost or returns undefined if pricing is incomplete.
  • CLI types & orchestration (src/cli/types.ts, src/cli/orchestrator.ts, src/cli/commands.ts): Types extended with optional tokenUsage and pricing?; evaluateFiles aggregates usage and computes totalCost when pricing is provided; the CLI reads env pricing, validates the output format, and prints token usage via printTokenUsage.
  • Evaluators & prompt flows (src/evaluators/base-evaluator.ts, src/evaluators/accuracy-evaluator.ts, ...src/evaluators/*): Consumers of runPromptStructured now handle LLMResult and propagate/merge usage into Subjective/SemiObjective results and per-file/overall evaluation outputs; claim extraction includes usage.
  • Reporting (src/output/reporter.ts): New exported printTokenUsage(stats: TokenUsageStats) prints input tokens, output tokens, and optional total cost.
  • Schemas / configuration (src/schemas/env-schemas.ts, src/schemas/config-schemas.ts): Env schema merged with base pricing fields INPUT_PRICE_PER_MILLION and OUTPUT_PRICE_PER_MILLION; runRules in the config schema now defaults to [].
  • Prompt schema changes (src/prompts/schema.ts): SubjectiveResult and SemiObjectiveResult gain optional usage?: TokenUsage.
  • Tests updated / added (tests/*, e.g., tests/anthropic-e2e.test.ts, tests/anthropic-provider.test.ts, tests/openai-provider.test.ts, tests/scoring-types.test.ts, tests/token-usage.test.ts): Tests adapted to the LLMResult shape (result.data) and to assert usage values; new tests for calculateCost() verify pricing scenarios.
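
The LLMResult<T> wrapper described above is small; a sketch of the shape and of how a provider might populate it (the usage field names come from this PR, while the raw-response shape and wrapResponse helper are illustrative, using OpenAI-style field names):

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Wrapper returned by runPromptStructured: parsed payload plus optional usage.
interface LLMResult<T> {
  data: T;
  usage?: TokenUsage;
}

// Illustrative provider-side wrapping; usage is only attached when the raw
// response actually reports it.
function wrapResponse<T>(
  parsed: T,
  rawUsage?: { prompt_tokens: number; completion_tokens: number }
): LLMResult<T> {
  const result: LLMResult<T> = { data: parsed };
  if (rawUsage) {
    result.usage = {
      inputTokens: rawUsage.prompt_tokens,
      outputTokens: rawUsage.completion_tokens,
    };
  }
  return result;
}
```

Keeping usage optional is what lets downstream aggregation guard on its presence rather than assume every provider reports counts.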

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as CLI (commands.ts)
    participant Orch as Orchestrator (orchestrator.ts)
    participant Eval as Evaluator (base/accuracy)
    participant LLM as LLM Provider
    participant Reporter as Reporter (reporter.ts)

    User->>CLI: run evaluation (env pricing may be set)
    CLI->>CLI: read INPUT_PRICE_PER_MILLION / OUTPUT_PRICE_PER_MILLION
    CLI->>Orch: evaluateFiles(options { pricing, outputFormat })

    rect rgb(240,248,255)
      Orch->>Orch: iterate files & prompts
      Orch->>Eval: runPromptEvaluation(...)
      Eval->>LLM: runPromptStructured(prompt)
      LLM-->>Eval: LLMResult { data, usage{ inputTokens, outputTokens } }
      Eval-->>Orch: Prompt result (includes usage)
      Orch->>Orch: aggregate usage per-file & total
    end

    rect rgb(255,250,240)
      Orch->>Orch: if pricing present → calculateCost(totalUsage, pricing)
      Orch-->>CLI: EvaluationResult { tokenUsage:{ totalInputTokens, totalOutputTokens, totalCost? } }
      CLI->>Reporter: printTokenUsage(tokenUsage)
      Reporter-->>User: display token counts and optional cost
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The pull request title accurately summarizes the main objective: adding token usage tracking and cost calculation across the LLM providers and CLI output.
  • Linked Issues check ✅ Passed: The pull request implements all core requirements from issue #35: token tracking infrastructure (TokenUsage, TokenUsageStats, PricingConfig types), per-provider token extraction, orchestrator aggregation, and CLI cost display including input tokens, output tokens, and total cost.
  • Out of Scope Changes check ✅ Passed: The pull request includes scope-appropriate changes: token-usage provider implementation, LLM provider updates to extract tokens, orchestrator aggregation, CLI reporting, schema updates, and related tests. Anthropic provider refactoring (type-safety improvements) is complementary to token tracking. The config schema change to default runRules to an empty array is a minor ancillary improvement unrelated to the core objectives but acceptable.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2b1b1d and 4240bae.

📒 Files selected for processing (2)
  • src/evaluators/base-evaluator.ts
  • src/providers/anthropic-provider.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/providers/anthropic-provider.ts
  • src/evaluators/base-evaluator.ts


coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cli/orchestrator.ts (1)

786-837: Add tokenUsage aggregation to evaluateFiles or update EvaluationResult.

The evaluateFile function returns tokenUsage per file, but evaluateFiles does not aggregate it. All other per-file metrics (errors, warnings, requestFailures, and status flags) follow a consistent aggregation pattern, yet tokenUsage is excluded from both the aggregation loop and the EvaluationResult return type. Either aggregate token usage across files (tracking cumulative input/output tokens and cost) to match the pattern of other metrics, or document why token usage is intentionally excluded from multi-file results.

🧹 Nitpick comments (6)
src/output/reporter.ts (1)

203-210: Consider dynamic precision for cost display.

The cost is formatted with 4 decimal places, which works well for small amounts (e.g., $0.0001) but may be excessive for larger costs (e.g., $123.4567). Consider using dynamic precision based on the cost magnitude, or at least 2 decimal places for costs above $1.

🔎 View suggested refactor
 export function printTokenUsage(stats: TokenUsageStats) {
   console.log(chalk.bold('\nToken Usage:'));
   console.log(`  - Input tokens: ${stats.totalInputTokens.toLocaleString()}`);
   console.log(`  - Output tokens: ${stats.totalOutputTokens.toLocaleString()}`);
   if (stats.totalCost !== undefined) {
-    console.log(`  - Total cost: $${stats.totalCost.toFixed(4)}`);
+    const decimals = stats.totalCost >= 1 ? 2 : 4;
+    console.log(`  - Total cost: $${stats.totalCost.toFixed(decimals)}`);
   }
 }
src/providers/token-usage.ts (2)

12-15: Remove redundant | undefined type annotation.

The ? operator already makes the fields type | undefined, so explicitly adding | undefined is redundant.

🔎 Apply this diff
 export interface PricingConfig {
-    inputPricePerMillion?: number | undefined;
-    outputPricePerMillion?: number | undefined;
+    inputPricePerMillion?: number;
+    outputPricePerMillion?: number;
 }

21-30: Consider validating non-negative token counts.

The calculateCost function doesn't validate that token counts are non-negative. While the upstream data is likely valid, defensive validation could prevent unexpected negative costs from malformed usage data.

🔎 View suggested validation
 export function calculateCost(usage: TokenUsage, pricing?: PricingConfig): number | undefined {
     if (!pricing || pricing.inputPricePerMillion === undefined || pricing.outputPricePerMillion === undefined) {
         return undefined;
     }
+    
+    if (usage.inputTokens < 0 || usage.outputTokens < 0) {
+        return undefined;
+    }
 
     const inputCost = (usage.inputTokens / 1_000_000) * pricing.inputPricePerMillion;
     const outputCost = (usage.outputTokens / 1_000_000) * pricing.outputPricePerMillion;
 
     return inputCost + outputCost;
 }
tests/anthropic-provider.test.ts (1)

339-343: Error mock pattern is valid but differs from OpenAI tests.

This file uses an explicit type cast pattern for error construction:

const mockApiError = anthropic.APIError as unknown as new (params: MockAPIErrorParams) => Error;

The OpenAI tests, by contrast, use @ts-expect-error comments. Both approaches work, but consider standardizing on one across test files for consistency.

src/cli/orchestrator.ts (2)

604-612: Type cast may fail for non-BaseEvaluator implementations.

The cast (evaluator as BaseEvaluator).getLastUsage?.() assumes all evaluators extend BaseEvaluator. If createEvaluator can return a different evaluator type that doesn't have getLastUsage, the optional chaining protects at runtime, but the explicit cast is misleading. Consider checking if the evaluator is an instance of BaseEvaluator first, or ensure the interface LLMProvider includes getLastUsage.

🔎 Suggested improvement:
-    const usage = (evaluator as BaseEvaluator).getLastUsage?.();
+    const usage = evaluator instanceof BaseEvaluator ? evaluator.getLastUsage?.() : undefined;

745-762: Consider initializing pricing to avoid passing an empty object.

When options.pricing is undefined, pricing becomes {}. The calculateCost function handles missing pricing properties by returning undefined, so this is functionally correct. However, explicitly passing options.pricing (which may be undefined) rather than an empty object is clearer.

🔎 Suggested improvement:
-  const pricing = options.pricing || {};
-
-  const tokenUsageStats: TokenUsageStats = {
-    totalInputTokens,
-    totalOutputTokens,
-  };
-
-  const cost = calculateCost(
-    {
-      inputTokens: totalInputTokens,
-      outputTokens: totalOutputTokens
-    },
-    pricing
-  );
+  const tokenUsageStats: TokenUsageStats = {
+    totalInputTokens,
+    totalOutputTokens,
+  };
+
+  const cost = calculateCost(
+    {
+      inputTokens: totalInputTokens,
+      outputTokens: totalOutputTokens
+    },
+    options.pricing
+  );
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 659ea2e and f43f4ff.

📒 Files selected for processing (18)
  • src/cli/commands.ts (3 hunks)
  • src/cli/orchestrator.ts (6 hunks)
  • src/cli/types.ts (4 hunks)
  • src/evaluators/base-evaluator.ts (5 hunks)
  • src/output/reporter.ts (2 hunks)
  • src/providers/anthropic-provider.ts (4 hunks)
  • src/providers/azure-openai-provider.ts (3 hunks)
  • src/providers/gemini-provider.ts (3 hunks)
  • src/providers/llm-provider.ts (1 hunks)
  • src/providers/openai-provider.ts (3 hunks)
  • src/providers/token-usage.ts (1 hunks)
  • src/schemas/config-schemas.ts (1 hunks)
  • src/schemas/env-schemas.ts (2 hunks)
  • tests/anthropic-e2e.test.ts (7 hunks)
  • tests/anthropic-provider.test.ts (8 hunks)
  • tests/openai-provider.test.ts (15 hunks)
  • tests/scoring-types.test.ts (6 hunks)
  • tests/token-usage.test.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (13)
src/output/reporter.ts (1)
src/providers/token-usage.ts (1)
  • TokenUsageStats (6-10)
tests/token-usage.test.ts (1)
src/providers/token-usage.ts (3)
  • TokenUsage (1-4)
  • PricingConfig (12-15)
  • calculateCost (21-30)
src/providers/openai-provider.ts (1)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
tests/openai-provider.test.ts (1)
tests/schemas/mock-schemas.ts (1)
  • MockOpenAIClient (55-64)
src/cli/types.ts (1)
src/providers/token-usage.ts (3)
  • PricingConfig (12-15)
  • TokenUsage (1-4)
  • TokenUsageStats (6-10)
src/providers/llm-provider.ts (1)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/providers/azure-openai-provider.ts (1)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/providers/gemini-provider.ts (1)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/providers/anthropic-provider.ts (1)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/evaluators/base-evaluator.ts (2)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/prompts/schema.ts (2)
  • SubjectiveLLMResult (74-82)
  • SemiObjectiveLLMResult (84-92)
src/cli/orchestrator.ts (4)
src/evaluators/index.ts (1)
  • BaseEvaluator (16-16)
src/cli/types.ts (1)
  • RunPromptEvaluationResultSuccess (130-134)
src/providers/token-usage.ts (2)
  • TokenUsageStats (6-10)
  • calculateCost (21-30)
src/output/reporter.ts (2)
  • printEvaluationSummaries (138-175)
  • printTokenUsage (203-210)
tests/anthropic-provider.test.ts (1)
tests/schemas/mock-schemas.ts (4)
  • MockAPIErrorParams (46-46)
  • MockRateLimitErrorParams (48-48)
  • MockAuthenticationErrorParams (47-47)
  • MockBadRequestErrorParams (49-49)
tests/scoring-types.test.ts (2)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/prompts/schema.ts (2)
  • SubjectiveLLMResult (74-82)
  • SemiObjectiveLLMResult (84-92)
🔇 Additional comments (27)
src/schemas/config-schemas.ts (1)

11-11: The runRules default change is backward compatible and safe.

Using .default([]) makes the input optional (accepts undefined) while ensuring the output is never undefined, which is actually an improvement over .optional(). Existing code patterns in the codebase work correctly with this change:

  • The helper function in tests/utils.ts that checks if (runRules !== undefined) continues to work because Zod's .default() still accepts undefined input during parsing
  • The truthiness check in scan-path-resolver.ts (if (match.runRules)) works correctly—empty arrays are truthy but iterate zero times
  • The FilePatternConfig interface's optional runRules property accommodates the now-guaranteed string array type

This change aligns the type system to reflect the actual parsed output: runRules will always be present as either a user-provided array or an empty array default, eliminating the undefined case.

src/schemas/env-schemas.ts (1)

32-49: LGTM! Clean environment schema extension.

The BASE_ENV_SCHEMA with pricing fields is properly defined and consistently merged across all provider configurations. The use of .positive() ensures valid pricing when provided.

tests/token-usage.test.ts (1)

1-55: LGTM! Comprehensive test coverage.

The test suite thoroughly covers all scenarios including correct calculations, partial millions, missing pricing configurations, and edge cases like zero tokens.

tests/anthropic-e2e.test.ts (1)

144-152: LGTM! Test updates correctly reflect the LLMResult wrapper.

All test assertions have been properly updated to access result.data for the response payload and result.usage for token tracking, aligning with the new structured response format.

Also applies to: 238-242, 480-500, 558-565, 617-621

src/cli/commands.ts (1)

17-17: LGTM! Clean integration of pricing configuration.

The OutputFormat type cast is safe given the validated CLI options, and the pricing configuration is correctly passed through from environment variables to the orchestrator.

Also applies to: 160-176

src/evaluators/base-evaluator.ts (2)

28-48: LGTM! Clean token usage tracking integration.

The protected lastUsage field and public getLastUsage() accessor provide a clean API for external access to token usage without coupling the evaluator to specific consumers.


68-75: LGTM! Consistent usage tracking across evaluation paths.

Both subjective and semi-objective evaluation paths correctly destructure the LLMResult wrapper and conditionally store usage data when present.

Also applies to: 122-129

src/providers/llm-provider.ts (1)

1-10: LGTM! Excellent abstraction for structured LLM responses.

The LLMResult<T> wrapper provides a clean, type-safe interface for returning both the response data and optional usage metrics. The generic type parameter ensures type safety is preserved for the data payload while standardizing usage reporting across all providers.

src/providers/azure-openai-provider.ts (2)

135-145: LGTM! Token usage extraction implemented correctly.

The implementation correctly wraps the parsed data in LLMResult<T> and populates usage metadata when available. The mapping from prompt_tokens to inputTokens and from completion_tokens to outputTokens is consistent with the OpenAI provider.


51-51: Method signature correctly updated for LLMResult wrapper.

The return type change from Promise<T> to Promise<LLMResult<T>> aligns with the interface contract in llm-provider.ts.

tests/scoring-types.test.ts (2)

34-55: LGTM! Mock response correctly structured with LLMResult wrapper.

The test mock now properly uses the LLMResult<SubjectiveLLMResult> structure with nested data field, aligning with the provider's new return type.


172-172: Type cast is pragmatic for testing mock data.

The as unknown as LLMResult<any> cast is acceptable here since this test mocks claim extraction which uses a different schema than the typed evaluation results.

src/providers/openai-provider.ts (1)

168-179: LGTM! Token usage extraction follows consistent pattern.

The implementation correctly:

  1. Wraps parsed JSON data in LLMResult<T>
  2. Conditionally populates usage only when present in response
  3. Maps OpenAI field names (prompt_tokens, completion_tokens) to standardized names (inputTokens, outputTokens)
src/providers/anthropic-provider.ts (2)

95-95: Good addition of explicit stream: false.

Explicitly setting stream: false ensures the response includes complete usage metadata, which is required for token tracking.


166-172: Verify that usage is always present in Anthropic responses.

Unlike OpenAI/Azure providers which conditionally set usage, this implementation always accesses validatedResponse.usage.input_tokens and output_tokens. This assumes Anthropic always returns usage data.

Based on the ANTHROPIC_RESPONSE_SCHEMA validation, verify that usage is a required field. If not, this could throw when accessing properties on undefined.

#!/bin/bash
# Check Anthropic response schema to verify if usage is required
rg -n -A 20 'ANTHROPIC_RESPONSE_SCHEMA' --type ts
tests/openai-provider.test.ts (1)

200-209: LGTM! Test assertions correctly validate the new response structure.

The test properly verifies:

  1. result.data contains the expected parsed JSON
  2. result.usage is defined when the mock includes usage data
  3. Token counts are correctly mapped from OpenAI field names

The conditional check if (result.usage) is appropriate given usage is optional.

tests/anthropic-provider.test.ts (1)

175-183: LGTM! Test assertions properly validate LLMResult wrapper and usage.

The test correctly:

  1. Asserts result.data contains expected tool output
  2. Validates result.usage is defined
  3. Verifies inputTokens and outputTokens match mock values
src/cli/types.ts (3)

130-138: LGTM! Clean type definitions for token usage support.

The new RunPromptEvaluationResultSuccess interface cleanly separates the success case with optional usage field, and the discriminated union RunPromptEvaluationResult maintains type safety for error handling.


12-12: Imports correctly reference token-usage types.

The import from ../providers/token-usage aligns with the type definitions shown in the relevant code snippets.


146-148: EvaluateFileResult properly extended for aggregated token usage.

Adding tokenUsage?: TokenUsageStats enables propagating aggregated token counts through the evaluation pipeline for final cost calculation.

src/providers/gemini-provider.ts (3)

2-2: LGTM!

Import of LLMResult aligns with the updated return type.


44-48: LGTM!

Return type correctly updated to Promise<LLMResult<T>> to match the interface contract.


77-93: Property names are correct.

The promptTokenCount and candidatesTokenCount properties are the correct property names for usageMetadata in the Gemini API, and the nullish coalescing operator provides appropriate safety for optional properties.
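
The mapping with nullish-coalescing safety might look like the sketch below (the usageMetadata field names are as confirmed above; the surrounding helper is illustrative, not the provider's actual code):

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Gemini responses expose usage under usageMetadata with these field names.
interface GeminiUsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
}

function geminiUsage(meta?: GeminiUsageMetadata): TokenUsage {
  return {
    inputTokens: meta?.promptTokenCount ?? 0,    // ?? guards the optional fields
    outputTokens: meta?.candidatesTokenCount ?? 0,
  };
}
```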

src/cli/orchestrator.ts (4)

8-8: LGTM!

Import of printTokenUsage is consistent with the new token usage reporting feature.


13-13: LGTM!

Import of BaseEvaluator needed for accessing getLastUsage() method.


631-632: LGTM!

Token accumulators properly initialized to zero.


720-724: LGTM!

Token usage accumulation correctly guards against undefined r.usage before adding.

- Import and call printTokenUsage function in main command handler
- Aggregate token usage statistics across all evaluated files
- Calculate total input and output tokens from file results
- Compute aggregated cost based on configured pricing
- Add tokenUsage field to EvaluationResult interface
- Improve cost formatting in reporter with dynamic precision (4 decimals for costs < $1, 2 for larger amounts)
- Token usage stats now displayed after global summary when available
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/cli/orchestrator.ts (1)

604-612: Type assertion could be made more defensive.

The (evaluator as BaseEvaluator).getLastUsage?.() pattern works due to optional chaining but relies on knowing that all evaluators extend BaseEvaluator. Consider adding a type guard or checking the method existence more explicitly if the evaluator hierarchy might change.

That said, the current implementation is functionally safe with optional chaining.

🔎 Alternative approach using explicit check:
-    const usage = (evaluator as BaseEvaluator).getLastUsage?.();
+    const usage = 'getLastUsage' in evaluator
+      ? (evaluator as BaseEvaluator).getLastUsage()
+      : undefined;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f43f4ff and 244e3ae.

📒 Files selected for processing (4)
  • src/cli/commands.ts (4 hunks)
  • src/cli/orchestrator.ts (9 hunks)
  • src/cli/types.ts (5 hunks)
  • src/output/reporter.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cli/types.ts
🧰 Additional context used
🧬 Code graph analysis (3)
src/output/reporter.ts (1)
src/providers/token-usage.ts (1)
  • TokenUsageStats (6-10)
src/cli/commands.ts (1)
src/output/reporter.ts (1)
  • printTokenUsage (203-212)
src/cli/orchestrator.ts (4)
src/evaluators/base-evaluator.ts (1)
  • BaseEvaluator (27-212)
src/cli/types.ts (1)
  • RunPromptEvaluationResultSuccess (131-135)
src/providers/token-usage.ts (2)
  • TokenUsageStats (6-10)
  • calculateCost (21-30)
src/output/reporter.ts (1)
  • printTokenUsage (203-212)
🔇 Additional comments (5)
src/output/reporter.ts (1)

203-212: LGTM!

The implementation is clean and handles the optional totalCost appropriately. The dynamic precision (4 decimals for sub-dollar costs, 2 for larger amounts) provides good UX for typical LLM API pricing.

src/cli/commands.ts (2)

172-175: Pricing configuration integration looks correct.

The pricing object correctly passes the environment variables to the orchestrator. The calculateCost function in token-usage.ts handles undefined values gracefully, so this is safe even when pricing is not configured.


186-190: LGTM!

The token usage is printed after the global summary as specified in the requirements, and the guard on result.tokenUsage existence is appropriate.

src/cli/orchestrator.ts (2)

720-724: Token usage aggregation logic is correct.

The accumulation of input/output tokens both at the prompt level within a file and across files is implemented correctly. The cost is recalculated from aggregated tokens at the end rather than summing per-file costs, which is mathematically equivalent and avoids potential floating-point accumulation errors.

Also applies to: 819-824
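
The equivalence noted above follows from cost being linear in token counts; a small illustrative check (numbers made up):

```typescript
// Cost is linear in tokens, so recomputing from aggregated totals equals
// summing per-file costs (up to floating-point rounding).
const perFile = [
  { inputTokens: 1_200, outputTokens: 300 },
  { inputTokens: 4_500, outputTokens: 900 },
];
const rateIn = 3 / 1_000_000;   // $3 per 1M input tokens
const rateOut = 15 / 1_000_000; // $15 per 1M output tokens

const summed = perFile
  .map((u) => u.inputTokens * rateIn + u.outputTokens * rateOut)
  .reduce((a, b) => a + b, 0);

const totalIn = perFile.reduce((a, u) => a + u.inputTokens, 0);
const totalOut = perFile.reduce((a, u) => a + u.outputTokens, 0);
const recomputed = totalIn * rateIn + totalOut * rateOut;
```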


838-849: LGTM!

The final token usage stats calculation correctly aggregates totals and only sets totalCost when pricing is configured. This is consistent with the type definition where totalCost is optional.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/cli/orchestrator.ts (3)

607-615: Consider clarifying the type safety pattern.

The code casts evaluator to BaseEvaluator and then uses optional chaining on getLastUsage, which is somewhat redundant. If all evaluators extend BaseEvaluator, the optional chaining is unnecessary; if some don't, the cast is misleading. The current code is safe and functional, but could be clearer.

💡 Alternative approaches

Option 1: If all evaluators extend BaseEvaluator, remove optional chaining:

-const usage = (evaluator as BaseEvaluator).getLastUsage?.();
+const usage = (evaluator as BaseEvaluator).getLastUsage();

Option 2: If some evaluators don't have getLastUsage, use type guard instead of cast:

-const usage = (evaluator as BaseEvaluator).getLastUsage?.();
+const usage = 'getLastUsage' in evaluator && typeof evaluator.getLastUsage === 'function'
+  ? evaluator.getLastUsage()
+  : undefined;

748-765: Minor clarity improvements.

Two small issues:

  1. The comment on Line 748 says "Calculate costs if output format is Line" but the cost is calculated unconditionally for all formats (which is correct). The comment is misleading.

  2. Line 749 uses options.pricing || {} as a fallback, but calculateCost already handles undefined pricing by returning undefined. The fallback to {} is unnecessary defensive code.

🔎 Suggested improvements
-  // Calculate costs if output format is Line
-  const pricing = options.pricing || {};
+  // Calculate token usage stats and cost
+  const pricing = options.pricing;

839-850: Same minor improvement as per-file calculation.

Line 846 uses options.pricing || {} as a fallback, but calculateCost already handles undefined pricing by returning undefined. The fallback is unnecessary.

🔎 Suggested improvement
-  const pricing = options.pricing || {};
+  const pricing = options.pricing;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 244e3ae and cee544f.

📒 Files selected for processing (3)
  • src/cli/commands.ts (3 hunks)
  • src/cli/orchestrator.ts (9 hunks)
  • src/output/reporter.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cli/commands.ts
🧰 Additional context used
🧬 Code graph analysis (2)
src/output/reporter.ts (1)
src/providers/token-usage.ts (1)
  • TokenUsageStats (6-10)
src/cli/orchestrator.ts (4)
src/evaluators/base-evaluator.ts (1)
  • BaseEvaluator (27-212)
src/evaluators/index.ts (1)
  • BaseEvaluator (16-16)
src/cli/types.ts (1)
  • RunPromptEvaluationResultSuccess (131-135)
src/providers/token-usage.ts (2)
  • TokenUsageStats (6-10)
  • calculateCost (21-30)
🔇 Additional comments (8)
src/output/reporter.ts (2)

5-5: LGTM: Import is correct.

The import of TokenUsageStats is properly structured and the relative path is correct.


203-213: LGTM: Well-implemented token usage display.

The function correctly implements the requirements from the PR objectives:

  • Uses .toLocaleString() for readable token counts (e.g., "1,250")
  • Conditionally displays cost only when available (line 207)
  • Applies sensible dynamic precision (4 decimals for costs < $1, 2 decimals otherwise)
  • Formatting and structure are consistent with other reporter functions

The implementation handles edge cases appropriately and matches the example output format specified in the PR.
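
The described behavior can be sketched as a pure formatting helper (a hedged stand-in for `printTokenUsage`; the real function writes to the console, and its exact wording may differ):

```typescript
// Hedged sketch of the token-usage display logic described above.
interface TokenUsageStats {
  totalInputTokens: number;
  totalOutputTokens: number;
  totalCost?: number;
}

function formatTokenUsage(stats: TokenUsageStats): string[] {
  const lines = [
    'Token Usage:',
    // The real code calls .toLocaleString() without an explicit locale;
    // 'en-US' is pinned here only to make the output deterministic.
    `  - Input tokens: ${stats.totalInputTokens.toLocaleString('en-US')}`,
    `  - Output tokens: ${stats.totalOutputTokens.toLocaleString('en-US')}`,
  ];
  if (stats.totalCost !== undefined) {
    // Dynamic precision: 4 decimals for sub-dollar costs, 2 otherwise.
    const precision = stats.totalCost < 1 ? 4 : 2;
    lines.push(`  - Total cost: $${stats.totalCost.toFixed(precision)}`);
  }
  return lines;
}

const out = formatTokenUsage({
  totalInputTokens: 15420,
  totalOutputTokens: 4210,
  totalCost: 0.10941,
});
```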

src/cli/orchestrator.ts (6)

13-13: LGTM!

The new imports for token usage tracking are appropriate and all are utilized in the implementation below.

Also applies to: 20-26


634-635: LGTM!

The per-file token accumulation logic correctly aggregates usage only from successful prompt evaluations, with appropriate defensive checks for r.usage existence.

Also applies to: 723-727
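
The guarded accumulation pattern can be sketched in isolation (the result-record shape is an assumption based on the types referenced in this review, simplified to the two fields the loop reads):

```typescript
// Hedged sketch of per-file token accumulation: only successful prompt
// evaluations that actually report usage contribute to the totals.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}
interface PromptResult {
  ok: boolean;
  usage?: TokenUsage;
}

function accumulate(results: PromptResult[]): { totalInputTokens: number; totalOutputTokens: number } {
  let totalInputTokens = 0;
  let totalOutputTokens = 0;
  for (const r of results) {
    if (r.ok && r.usage) {
      totalInputTokens += r.usage.inputTokens;
      totalOutputTokens += r.usage.outputTokens;
    }
  }
  return { totalInputTokens, totalOutputTokens };
}

const totals = accumulate([
  { ok: true, usage: { inputTokens: 100, outputTokens: 40 } },
  { ok: false },                                       // failed run: skipped
  { ok: true },                                        // no usage reported: skipped
  { ok: true, usage: { inputTokens: 50, outputTokens: 10 } },
]);
```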


767-770: Past review issue appears resolved.

The previous review comment flagged duplicate token usage printing at this location. The current code only prints evaluation summaries here and does not call printTokenUsage, so the duplication concern has been addressed.


778-778: LGTM!

Correctly includes token usage statistics in the file evaluation result.


799-800: LGTM!

The global token aggregation correctly accumulates usage across all evaluated files with appropriate defensive checks.

Also applies to: 820-825


859-859: LGTM!

Correctly includes aggregated token usage statistics in the final evaluation result.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/cli/commands.ts (1)

161-161: Unsafe cast from string to OutputFormat enum.

The CLI option --output accepts any string, but it's cast directly to OutputFormat without validation. If a user passes an invalid value (e.g., --output xml), the code will proceed with an invalid format value instead of failing fast.

Consider validating the output format before use:

🔎 Proposed fix
-      const outputFormat = cliOptions.output as OutputFormat;
+      const outputFormat = cliOptions.output as OutputFormat;
+      if (!Object.values(OutputFormat).includes(outputFormat)) {
+        console.error(`Error: Invalid output format '${cliOptions.output}'. Valid options: line, json, vale-json, rdjson`);
+        process.exit(1);
+      }
src/cli/orchestrator.ts (2)

701-709: Type assertion for getLastUsage access.

The cast (evaluator as BaseEvaluator).getLastUsage?.() works because BaseEvaluator defines getLastUsage(), but this couples the orchestrator to the base implementation. If a custom evaluator doesn't extend BaseEvaluator, the usage won't be captured.

This is acceptable for now since all evaluators extend BaseEvaluator, but consider adding getLastUsage() to the Evaluator interface in the future for type safety.
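
Promoting the method to the interface might look like the following hedged sketch (other `Evaluator` members are omitted; names follow the signature already quoted in this review):

```typescript
// Hedged sketch: adding getLastUsage to the Evaluator interface so the
// orchestrator no longer needs a cast to BaseEvaluator.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Evaluator {
  getLastUsage?(): TokenUsage | undefined;
}

// Evaluators that don't track usage simply omit the optional method.
class CustomEvaluator implements Evaluator {}

class TrackingEvaluator implements Evaluator {
  private lastUsage?: TokenUsage;
  recordUsage(u: TokenUsage): void {
    this.lastUsage = u;
  }
  getLastUsage(): TokenUsage | undefined {
    return this.lastUsage;
  }
}

// Optional chaining handles both cases without any cast:
const tracking = new TrackingEvaluator();
tracking.recordUsage({ inputTokens: 5, outputTokens: 2 });
const evaluators: Evaluator[] = [new CustomEvaluator(), tracking];
const usages = evaluators.map((e) => e.getLastUsage?.());
```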


851-868: Per-file cost calculation is redundant with aggregate calculation.

Cost is calculated per-file at lines 859-868, but this per-file cost is never used for display or reporting—only the aggregate cost in evaluateFiles is shown. The per-file tokenUsageStats.totalCost is included in EvaluateFileResult.tokenUsage, but the aggregate function at lines 926-929 only sums totalInputTokens and totalOutputTokens, ignoring the per-file costs and recalculating at line 955.

This is technically correct (due to floating-point precision, recalculating from totals is more accurate), but the per-file cost calculation is dead code. Consider removing it to simplify:

🔎 Simplify by removing per-file cost calculation
  const tokenUsageStats: TokenUsageStats = {
    totalInputTokens,
    totalOutputTokens,
  };

-  const cost = calculateCost(
-    {
-      inputTokens: totalInputTokens,
-      outputTokens: totalOutputTokens
-    },
-    pricing
-  );
-  if (cost !== undefined) {
-    tokenUsageStats.totalCost = cost;
-  }
src/cli/types.ts (1)

12-12: Consider using import type for type-only imports.

The import on line 12 brings in TokenUsage, TokenUsageStats, and PricingConfig which are all interfaces (types only). Per coding guidelines ("Use TypeScript ESM with explicit imports and narrow types"), consider using import type for consistency with other type imports in this file (lines 1-6).

🔎 Proposed fix
-import { TokenUsage, TokenUsageStats, PricingConfig } from '../providers/token-usage';
+import type { TokenUsage, TokenUsageStats, PricingConfig } from '../providers/token-usage';
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cee544f and d0c425a.

📒 Files selected for processing (5)
  • src/cli/commands.ts
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/evaluators/base-evaluator.ts
  • src/schemas/config-schemas.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/evaluators/base-evaluator.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/cli/commands.ts
  • src/schemas/config-schemas.ts
  • src/cli/types.ts
  • src/cli/orchestrator.ts
🧠 Learnings (4)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries

Applied to files:

  • src/schemas/config-schemas.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data

Applied to files:

  • src/schemas/config-schemas.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output

Applied to files:

  • src/cli/types.ts
  • src/cli/orchestrator.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to evals/**/*.md : Eval YAML should use `name` (human), `id` (PascalCase), and criteria `id` (PascalCase)

Applied to files:

  • src/cli/orchestrator.ts
🧬 Code graph analysis (3)
src/cli/commands.ts (1)
src/output/reporter.ts (1)
  • printTokenUsage (203-213)
src/cli/types.ts (1)
src/providers/token-usage.ts (3)
  • PricingConfig (12-15)
  • TokenUsageStats (6-10)
  • TokenUsage (1-4)
src/cli/orchestrator.ts (4)
src/evaluators/base-evaluator.ts (1)
  • BaseEvaluator (28-236)
src/evaluators/index.ts (1)
  • BaseEvaluator (16-16)
src/cli/types.ts (1)
  • RunPromptEvaluationResultSuccess (120-124)
src/providers/token-usage.ts (2)
  • TokenUsageStats (6-10)
  • calculateCost (21-30)
🔇 Additional comments (9)
src/schemas/config-schemas.ts (1)

11-11: Type safety improvement is correct and consistent.

The change from .optional() to .default([]) improves type safety by guaranteeing that runRules is always a defined array. This is consistent with the concurrency field pattern on line 6.

Verification shows all usages of runRules handle this change correctly. Code iterating over runRules works without issues since the schema now guarantees an empty array when not provided. Existing undefined checks in file-section-parser.ts and tests/utils.ts remain harmless though now always-true.

src/cli/commands.ts (2)

173-177: LGTM!

The pricing configuration is correctly passed through from environment variables. The PricingConfig interface properly allows undefined values, and downstream calculateCost handles missing pricing gracefully by returning undefined for cost.


180-190: Consider the display order of token usage vs. global summary.

Based on the PR objective ("Display token usage and cost information in the terminal after each completed evaluation, shown after the calculation summary"), the token usage should be displayed after the global summary, not before. Currently, printTokenUsage is called before printGlobalSummary.

If the current order is intentional for better UX, this is fine. Otherwise, consider swapping the order:

🔎 Suggested reorder (if after summary was intended)
      if (cliOptions.output === 'line') {
-        if (result.tokenUsage) {
-          printTokenUsage(result.tokenUsage);
-        }
        printGlobalSummary(
          result.totalFiles,
          result.totalErrors,
          result.totalWarnings,
          result.requestFailures
        );
+        if (result.tokenUsage) {
+          printTokenUsage(result.tokenUsage);
+        }
      }
src/cli/orchestrator.ts (3)

22-25: LGTM!

The token usage imports are correctly added and align with the types defined in src/providers/token-usage.ts.


947-958: LGTM!

The aggregate token usage and cost calculation is correctly implemented. Cost is calculated once from the total summed tokens, which is more accurate than summing per-file costs due to floating-point precision.


262-268: No functional change.

This is a formatting adjustment in the criterionId ternary expression. Logic remains unchanged.

src/cli/types.ts (3)

14-19: LGTM!

The OutputFormat enum properly defines all supported output formats with string values matching the CLI option strings.


120-128: LGTM!

The RunPromptEvaluationResultSuccess interface and updated RunPromptEvaluationResult union type follow a clean discriminated union pattern. The optional usage field appropriately handles cases where token usage isn't available.


30-30: LGTM!

The optional pricing and tokenUsage fields are correctly added to their respective interfaces, maintaining backward compatibility while enabling the new token tracking feature.

Also applies to: 40-40, 138-138

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/evaluators/evaluator.ts (1)

10-10: LGTM: Optional method enables token usage tracking.

The optional method maintains backward compatibility while enabling token usage tracking as intended by the PR objectives.

Optional refinements:

  1. The return type could be simplified from TokenUsage | undefined to just TokenUsage, since the ? already signals the method may not exist. However, the explicit undefined makes it clear the method can return undefined even when implemented (e.g., before first evaluation).

  2. Consider adding JSDoc to document the method's purpose:

    /**
     * Returns token usage from the last evaluation, if available.
     */
    getLastUsage?(): TokenUsage | undefined;
src/cli/orchestrator.ts (1)

940-941: Consider passing options.pricing directly.

The || {} fallback is unnecessary since calculateCost already handles undefined pricing gracefully (returns undefined when pricing config is missing). This simplifies the code and removes an unneeded intermediate variable.

🔎 Suggested simplification
-  // Calculate cost if pricing is configured
-  const pricing = options.pricing || {};
-  const cost = calculateCost({ inputTokens: totalInputTokens, outputTokens: totalOutputTokens }, pricing);
+  // Calculate cost if pricing is configured
+  const cost = calculateCost({ inputTokens: totalInputTokens, outputTokens: totalOutputTokens }, options.pricing);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0c425a and 708e2c7.

📒 Files selected for processing (4)
  • src/cli/commands.ts
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/evaluators/evaluator.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/cli/types.ts
  • src/cli/commands.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/evaluators/evaluator.ts
  • src/cli/orchestrator.ts
🧠 Learnings (2)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output

Applied to files:

  • src/cli/orchestrator.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to evals/**/*.md : Eval YAML should use `name` (human), `id` (PascalCase), and criteria `id` (PascalCase)

Applied to files:

  • src/cli/orchestrator.ts
🧬 Code graph analysis (2)
src/evaluators/evaluator.ts (4)
src/evaluators/index.ts (1)
  • Evaluator (13-13)
src/cli/types.ts (1)
  • EvaluationResult (33-41)
src/prompts/schema.ts (1)
  • EvaluationResult (171-171)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/cli/orchestrator.ts (2)
src/cli/types.ts (1)
  • RunPromptEvaluationResultSuccess (120-124)
src/providers/token-usage.ts (2)
  • TokenUsageStats (6-10)
  • calculateCost (21-30)
🔇 Additional comments (4)
src/evaluators/evaluator.ts (1)

2-2: LGTM: Type import follows best practices.

The type-only import is correctly specified and follows TypeScript ESM guidelines.

src/cli/orchestrator.ts (3)

701-709: LGTM!

Clean use of optional chaining for getLastUsage?.() and conditional attachment of usage data to the result object. This correctly handles evaluators that may not support token usage tracking.


824-829: LGTM!

Token usage accumulation is correctly guarded and only processes successful evaluations that return usage data.


910-915: LGTM!

Aggregation of token usage across files is correctly guarded.

@hurshore
Collaborator

There are TS errors in the following files: anthropic-provider.ts, anthropic-e2e.test.ts, openai-provider.test.ts, scoring-types.test.ts

@hurshore
Collaborator

Have you been able to test this feature with all supported providers?

- Remove `getLastUsage()` method and `lastUsage` field from BaseEvaluator
- Move token usage from orchestrator result wrapper to nested evaluation result
- Update `RunPromptEvaluationResultSuccess` to remove top-level usage field
- Modify accuracy evaluator to aggregate token usage from claim extraction and base evaluation
- Update orchestrator to access usage from `result.usage` instead of `r.usage`
- Create `ClaimExtractionResult` interface to return claims with optional usage data
- Ensure token usage is properly accumulated and returned within evaluation results
- Simplifies the result structure by keeping usage data closer to the actual evaluation data
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/evaluators/accuracy-evaluator.ts (1)

64-68: Consider unconditionally assigning usage to avoid mutation.

The pattern of conditionally mutating result after creation works, but could be cleaner. Since calculateSemiObjectiveResult returns a mutable object, this is fine, but an alternative would be to spread usage into a new object.

🔎 Optional refactor for immutability
    if (claims.length === 0) {
      const wordCount = content.trim().split(/\s+/).length || 1;
-      const result = this.calculateSemiObjectiveResult([], wordCount);
-      if (claimUsage) result.usage = claimUsage;
-      return result;
+      return {
+        ...this.calculateSemiObjectiveResult([], wordCount),
+        ...(claimUsage && { usage: claimUsage }),
+      };
    }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da2ffe5 and 705d914.

📒 Files selected for processing (5)
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/evaluators/accuracy-evaluator.ts
  • src/evaluators/base-evaluator.ts
  • src/prompts/schema.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/evaluators/base-evaluator.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/prompts/schema.ts
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/evaluators/accuracy-evaluator.ts
🧠 Learnings (3)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output

Applied to files:

  • src/cli/orchestrator.ts
  • src/cli/types.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to evals/**/*.md : Eval YAML should use `name` (human), `id` (PascalCase), and criteria `id` (PascalCase)

Applied to files:

  • src/cli/orchestrator.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/providers/**/*.ts : Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)

Applied to files:

  • src/evaluators/accuracy-evaluator.ts
🧬 Code graph analysis (4)
src/prompts/schema.ts (2)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/evaluators/types.ts (1)
  • EvaluationType (14-17)
src/cli/orchestrator.ts (2)
src/cli/types.ts (1)
  • RunPromptEvaluationResultSuccess (120-123)
src/providers/token-usage.ts (2)
  • TokenUsageStats (6-10)
  • calculateCost (21-30)
src/cli/types.ts (9)
src/prompts/prompt-loader.ts (3)
  • PromptFile (8-8)
  • PromptCriterionSpec (8-8)
  • PromptMeta (8-8)
src/schemas/prompt-schemas.ts (3)
  • PromptFile (64-64)
  • PromptCriterionSpec (62-62)
  • PromptMeta (63-63)
src/providers/llm-provider.ts (1)
  • LLMProvider (8-10)
src/providers/search-provider.ts (1)
  • SearchProvider (5-7)
src/boundaries/file-section-parser.ts (1)
  • FilePatternConfig (2-6)
src/providers/token-usage.ts (2)
  • PricingConfig (12-15)
  • TokenUsageStats (6-10)
src/prompts/schema.ts (2)
  • EvaluationResult (174-174)
  • SubjectiveResult (123-143)
src/output/reporter.ts (1)
  • EvaluationSummary (7-11)
src/output/json-formatter.ts (2)
  • JsonFormatter (54-94)
  • ScoreComponent (10-18)
src/evaluators/accuracy-evaluator.ts (1)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
🔇 Additional comments (14)
src/prompts/schema.ts (1)

2-2: LGTM!

The TokenUsage type is correctly imported and consistently added as an optional field to both SubjectiveResult and SemiObjectiveResult. This maintains type safety while allowing evaluation results to optionally carry token usage metadata from LLM providers.

Also applies to: 142-142, 171-171

src/evaluators/accuracy-evaluator.ts (3)

28-31: LGTM!

The ClaimExtractionResult interface cleanly encapsulates the return type of extractClaims, making the optional usage field explicit in the contract.


95-101: LGTM on token aggregation logic.

The aggregation correctly combines claim extraction usage with the base evaluator's usage, using || 0 to handle cases where result.usage might be undefined. This ensures accurate total token counts across the multi-step evaluation pipeline.


129-139: LGTM on claim extraction with usage propagation.

The destructuring correctly extracts both data and usage from the LLM provider's structured response. The conditional spread ...(usage && { usage }) is an idiomatic way to optionally include usage in the return object.
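
The conditional-spread idiom can be seen in isolation — `...(usage && { usage })` adds the key only when `usage` is truthy, and spreading the resulting `undefined` is a no-op (the function name below is illustrative, not from the repository):

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical helper demonstrating the idiom: the `usage` key is present
// only when a usage value was supplied.
function makeResult(claims: string[], usage?: TokenUsage) {
  return { claims, ...(usage && { usage }) };
}

const withUsage = makeResult(['a'], { inputTokens: 10, outputTokens: 3 });
const withoutUsage = makeResult(['a']);
```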

src/cli/orchestrator.ts (5)

1-27: LGTM on imports.

The imports are well-organized, bringing in the necessary types (TokenUsageStats) and utilities (calculateCost) for token usage tracking.


701-704: LGTM on typed success result.

Using the RunPromptEvaluationResultSuccess interface provides better type safety and aligns with the discriminated union pattern used for RunPromptEvaluationResult.


819-823: LGTM on per-prompt usage accumulation.

The accumulation correctly extracts token counts from each successful prompt evaluation result. The optional chaining on r.result.usage safely handles cases where usage data isn't present.


846-862: LGTM on file-level token usage aggregation.

The TokenUsageStats object correctly captures per-file totals and is properly included in the EvaluateFileResult.


928-941: LGTM on cost calculation integration.

The cost calculation is correctly applied only when pricing config is available, and the result is conditionally added to tokenUsage.totalCost. This aligns with the calculateCost function signature which returns undefined when pricing isn't configured.

src/cli/types.ts (5)

12-12: LGTM on type imports.

Using type imports for TokenUsageStats and PricingConfig correctly indicates these are type-only imports, which helps with tree-shaking and ensures no runtime code is generated.


30-30: LGTM on pricing config option.

Adding pricing?: PricingConfig to EvaluationOptions enables cost calculation when users provide pricing information via environment variables.


40-40: LGTM on token usage in EvaluationResult.

The optional tokenUsage field properly surfaces aggregated token statistics to callers of evaluateFiles.


120-127: LGTM on discriminated union refactor.

Extracting RunPromptEvaluationResultSuccess as a named interface improves code clarity and enables the orchestrator to reference this type explicitly when constructing success results. The union at lines 125-127 maintains the discriminated union pattern with ok as the discriminant.


135-138: LGTM on file result token usage.

Adding tokenUsage?: TokenUsageStats to EvaluateFileResult enables per-file token tracking, which is then aggregated in evaluateFiles for the overall total.

@ayo6706
Collaborator Author

ayo6706 commented Dec 29, 2025


There are ts errors in the following files: anthropic-provider.ts, anthropic-e2e.test.ts, openai-provider.test.ts, scoring-types.test.ts

The errors in these files aren't related to this PR. I remember fixing them, but I can't remember in which PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/providers/anthropic-provider.ts (2)

95-95: Clarify the necessity of setting stream: false in both locations.

The stream: false parameter is set on both line 95 (in params) and line 131 (in anthropicParams). Unlike other E2E compatibility aliases (e.g., maxTokens, toolChoice), this uses the same snake_case name in both places.

Since anthropicParams is constructed independently and sent to the Anthropic API, is the stream: false on line 95 necessary for E2E mocks? If not, it could be removed from the initial params object and only set in anthropicParams.

This addresses the previous review question: "stream: false" ensures the API returns a complete response rather than a streaming response.

🔎 Optional: Remove redundant stream parameter if not needed for E2E

If E2E mocks don't require stream in the params object, you could simplify:

      max_tokens: this.config.maxTokens!,
      tools: [toolSchema],
      tool_choice: { type: 'tool', name: schema.name },
-      stream: false,
      maxTokens: this.config.maxTokens!,
      toolChoice: { type: 'tool', name: schema.name },

Also applies to: 131-131


132-134: Consider simplifying conditional spreads for always-defined fields.

The conditional spreads for system, tools, and tool_choice check for undefined, but these fields are always set earlier in the code:

  • system is set on line 85 from systemPrompt (line 76)
  • tools is always set on line 93
  • tool_choice is always set on line 94

Unless there are edge cases or future scenarios where these might be undefined, you could simplify to unconditional spreads or direct assignment.

🔎 Simplified approach
    const anthropicParams: Anthropic.Messages.MessageCreateParams = {
      model: params.model,
      messages: params.messages,
      max_tokens: params.max_tokens,
      stream: false,
-      ...(params.system !== undefined && { system: params.system }),
-      ...(params.tools !== undefined && { tools: params.tools }),
-      ...(params.tool_choice !== undefined && { tool_choice: params.tool_choice }),
+      system: params.system,
+      tools: params.tools,
+      tool_choice: params.tool_choice,
      ...(params.temperature !== undefined && { temperature: params.temperature }),
    };
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45889c6 and e2b1b1d.

📒 Files selected for processing (3)
  • src/evaluators/base-evaluator.ts
  • src/providers/anthropic-provider.ts
  • tests/scoring-types.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/scoring-types.test.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/providers/anthropic-provider.ts
  • src/evaluators/base-evaluator.ts
src/providers/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/providers/**/*.ts: Depend on LLMProvider and SearchProvider interfaces; keep providers thin (transport only)
Inject RequestBuilder via provider constructor to avoid coupling

Files:

  • src/providers/anthropic-provider.ts
🧠 Learnings (1)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/providers/**/*.ts : Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)

Applied to files:

  • src/providers/anthropic-provider.ts
🧬 Code graph analysis (2)
src/providers/anthropic-provider.ts (1)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/evaluators/base-evaluator.ts (1)
src/prompts/schema.ts (2)
  • SubjectiveLLMResult (96-110)
  • SemiObjectiveLLMResult (112-121)
🔇 Additional comments (4)
src/evaluators/base-evaluator.ts (1)

71-75: No changes needed—usage field is already properly typed.

The usage field in both SubjectiveResult and SemiObjectiveResult is already typed as usage?: TokenUsage;, where TokenUsage is a well-defined interface with inputTokens and outputTokens properties. The code already complies with the coding guidelines requiring strict TypeScript without any.

Likely an incorrect or invalid review comment.

src/providers/anthropic-provider.ts (3)

3-3: LGTM! Return type updated to support token usage tracking.

The import of LLMResult and the updated return type correctly align with the standardized interface for tracking token usage across providers.

Also applies to: 75-75


231-233: LGTM! Improved error handling for text blocks.

The safer extraction of firstTextBlock with an explicit existence check before accessing its properties is a good improvement that prevents potential undefined access errors.


163-171: No changes required. The ANTHROPIC_RESPONSE_SCHEMA defines usage as a required field (line 33 in src/schemas/anthropic-responses.ts), and ANTHROPIC_USAGE_SCHEMA requires both input_tokens and output_tokens as non-optional numbers. After schema validation via ANTHROPIC_RESPONSE_SCHEMA.parse(), TypeScript guarantees validatedResponse.usage is present, making direct access to validatedResponse.usage.input_tokens and validatedResponse.usage.output_tokens type-safe with no null checks needed.

Likely an incorrect or invalid review comment.
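
The guarantee being described — that post-validation access needs no null checks — can be illustrated with a dependency-free stand-in (the real code uses Zod's `ANTHROPIC_RESPONSE_SCHEMA.parse()`; this sketch mimics only the usage portion of that contract):

```typescript
// Hedged, dependency-free stand-in for the schema-validation guarantee: after
// a successful parse, input_tokens and output_tokens are present and numeric,
// so downstream code can read them directly.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
}

function parseUsage(value: unknown): AnthropicUsage {
  if (typeof value === 'object' && value !== null) {
    const v = value as { input_tokens?: unknown; output_tokens?: unknown };
    if (typeof v.input_tokens === 'number' && typeof v.output_tokens === 'number') {
      return { input_tokens: v.input_tokens, output_tokens: v.output_tokens };
    }
  }
  // Mirrors Zod's behavior of throwing on invalid input.
  throw new Error('Invalid Anthropic usage payload');
}

const validated = parseUsage({ input_tokens: 10, output_tokens: 5 });
```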

@hurshore hurshore merged commit 2f5fe6c into main Dec 30, 2025
3 checks passed


Development

Successfully merging this pull request may close these issues.

[Feature]: Evaluation Cost Tracking
