
feat: implement content chunking and dedicated scoring logic for evaluators#39

Merged
oshorefueled merged 11 commits into main from feat/chunking
Jan 2, 2026
Conversation


@hurshore hurshore commented Dec 18, 2025

Summary

This PR introduces automatic content chunking for large documents and extracts scoring logic into a dedicated module, improving evaluation accuracy for long-form content.

Changes

New Chunking Module (src/chunking/)

  • RecursiveChunker: Splits content using hierarchical separators (`\n\n`, `\n`, `". "`, `" "`) to preserve semantic boundaries
  • Automatic chunking for documents exceeding 600 words (max chunk size: 500 words)
  • Violation merging with deduplication across chunks
  • Type definitions for Chunk, ChunkingOptions, and ChunkingStrategy
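The hierarchical splitting idea can be sketched roughly as follows. This is illustrative only; the actual RecursiveChunker lives in src/chunking/chunker.ts, and its chunk bookkeeping (offsets, Chunk objects) is omitted here:

```typescript
// Separator levels, coarsest first: paragraphs, lines, sentences, words.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function splitRecursively(text: string, maxWords: number, level = 0): string[] {
  // Base case: the piece already fits.
  if (countWords(text) <= maxWords) return [text];
  if (level >= SEPARATORS.length) {
    // Last resort: force-split on word boundaries.
    const words = text.trim().split(/\s+/);
    const out: string[] = [];
    for (let i = 0; i < words.length; i += maxWords) {
      out.push(words.slice(i, i + maxWords).join(" "));
    }
    return out;
  }
  // Split on the current separator and recurse into any oversized parts.
  return text
    .split(SEPARATORS[level])
    .filter((part) => part.trim().length > 0)
    .flatMap((part) => splitRecursively(part, maxWords, level + 1));
}
```

Each separator level is only consulted when the previous one leaves an oversized piece, which is what preserves paragraph and sentence boundaries where possible.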

New Scoring Module (src/scoring/)

  • Extracted calculateSemiObjectiveScore() for density-based scoring
  • Extracted calculateSubjectiveScore() for weighted criteria scoring
  • Added averageSubjectiveScores() for chunk aggregation with word-count weighting
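The word-count weighting can be illustrated with a minimal sketch. The real averageSubjectiveScores aggregates full result objects; this reduces the idea to bare numbers:

```typescript
// Weighted average of per-chunk scores, weighted by each chunk's word count,
// so that a short trailing chunk cannot dominate the document-level score.
function weightedAverage(scores: number[], wordCounts: number[]): number {
  const totalWords = wordCounts.reduce((sum, w) => sum + w, 0);
  if (totalWords === 0) return 1; // degenerate input: fall back to minimum score
  const weighted = scores.reduce(
    (sum, score, i) => sum + score * (wordCounts[i] ?? 0),
    0
  );
  return weighted / totalWords;
}
```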

Evaluator Updates

  • BaseEvaluator: Now automatically chunks content >600 words before evaluation
    • Semi-objective: Collects all violations across chunks, then scores once
    • Subjective: Averages scores across chunks weighted by chunk size
  • TechnicalAccuracyEvaluator: Migrated to use new scoring module

Tests

  • Updated scoring-types.test.ts with unique violation analysis strings to avoid deduplication issues

Version

Bumped to 2.1.1

Summary by CodeRabbit

  • New Features

    • Recursive chunking for large content and new scoring utilities for semi‑objective and subjective evaluations.
  • Improvements

    • Merges/deduplicates violations across chunks, averages subjective scores by segment size, aggregates token usage, and adds a prompt meta flag to control document vs. chunk evaluation.
  • Breaking Changes

    • Semi‑objective result schema now uses a single violation_count field instead of passed/total counts.
  • Tests

    • Evaluation tests modernized, strongly typed and reorganized.
  • Documentation

    • Editorial formatting updates to guidance docs.



coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

Walkthrough

Adds a chunking subsystem, scoring utilities, and evaluator integration: text is recursively chunked, per-chunk evaluations run and deduplicated, scores are aggregated (subjective and semi-objective), and LLM token usage is combined. New types, utils, and barrel exports are introduced.

Changes

  • Chunking core (src/chunking/types.ts, src/chunking/chunker.ts, src/chunking/utils.ts, src/chunking/merger.ts, src/chunking/index.ts): New chunking types and RecursiveChunker (hierarchical separator-based splitting with word-level fallback); splitIntoWords/countWords utilities; mergeViolations deduplicates flattened violations; barrel exports added.
  • Scoring (src/scoring/scorer.ts, src/scoring/index.ts): New scoring option types and functions calculateSemiObjectiveScore, calculateSubjectiveScore, and averageSubjectiveScores; barrel export for scoring utilities.
  • Evaluator integration (src/evaluators/base-evaluator.ts, src/evaluators/accuracy-evaluator.ts): BaseEvaluator gains a chunking threshold, chunkContent, and aggregateUsage; evaluation flows now run per-chunk LLM calls, merge violations, aggregate scores, and sum token usage; the accuracy evaluator uses the semi-objective scorer when no claims are found and uses countWords.
  • Schemas / API shape / tests (src/schemas/prompt-schemas.ts, src/prompts/schema.ts, tests/scoring-types.test.ts): PROMPT_META_SCHEMA gains an optional evaluateAs enum ("document" | "chunk") controlling document vs. chunk evaluation.
  • Output utilities (src/output/line-numbering.ts): Removed exports stripLineNumbers, getLineContent, and getLineStartIndex; only prependLineNumbers remains exported.
  • Docs / Misc (CREATING_RULES.md): Formatting and presentation edits only (no behavioral changes).

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Evaluator as Evaluator (Base / Accuracy)
    participant Chunker as RecursiveChunker
    participant LLM as LLM Provider
    participant Merger as mergeViolations
    participant Scorer as Scoring (semi/subjective)

    Evaluator->>Chunker: chunkContent(text)
    Chunker-->>Evaluator: chunks[]
    alt multiple chunks
        loop per chunk
            Evaluator->>LLM: evaluate(chunk)
            LLM-->>Evaluator: chunkResult + usage
        end
        Evaluator->>Merger: mergeViolations(all chunk violations)
        Merger-->>Evaluator: mergedViolations
        Evaluator->>Scorer: calculateSemiObjectiveScore(mergedViolations, totalWords)
        Scorer-->>Evaluator: semi-objective result
        Evaluator->>Scorer: averageSubjectiveScores(per-chunk results, chunkWordCounts)
        Scorer-->>Evaluator: subjective result
        Evaluator->>Evaluator: aggregateUsage(all usages)
    else single chunk or chunking disabled
        Evaluator->>LLM: evaluate(full content)
        LLM-->>Evaluator: result + usage
        Evaluator->>Scorer: calculate...Score(result)
        Scorer-->>Evaluator: final result
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • ayo6706

Poem

🐇 I hop through text and split with care,
I stitch the chunks and hunt what's there,
I count the words and merge the clues,
I score the faults and share the news,
A carrot-coded cheer — I bound on air! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.86%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The PR title clearly and concisely summarizes the main changes: implementing content chunking and dedicated scoring logic for evaluators, the primary objectives of this changeset.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
src/evaluators/accuracy-evaluator.ts (1)

57-66: Consider using consistent word counting logic across the codebase.

Line 60 uses content.trim().split(/\s+/).length || 1 for word counting, while the new src/chunking/utils.ts provides countWords() with slightly different normalization logic. For consistency and edge-case handling (empty strings, multiple whitespace types), consider using the shared utility.

🔎 Apply this diff for consistency:
+import { countWords } from "../chunking/utils";
+
 async evaluate(_file: string, content: string): Promise<EvaluationResult> {
   // Step 1: Extract factual claims from the content
   const claims = await this.extractClaims(content);
 
   // If no claims found, return success (empty items array, perfect score)
   // Use the scoring module to calculate result
   if (claims.length === 0) {
-    const wordCount = content.trim().split(/\s+/).length || 1;
+    const wordCount = Math.max(countWords(content), 1);
     return calculateSemiObjectiveScore([], wordCount, {
       strictness: this.prompt.meta.strictness,
       defaultSeverity: this.defaultSeverity,
       promptSeverity: this.prompt.meta.severity,
     });
   }
src/chunking/chunker.ts (1)

4-8: Document or implement unused chunking options.

DEFAULT_OPTIONS includes overlapFraction and preserveSentences, but these options aren't used in the chunking logic. This is consistent with the concern raised for src/chunking/types.ts. Either implement these features or remove them from the defaults until they're ready.

src/evaluators/base-evaluator.ts (2)

66-83: Consider extracting word count calculation to avoid duplication.

The word count logic content.trim().split(/\s+/).length is repeated here (line 67), in runSubjectiveEvaluation (line 128), and in runSemiObjectiveEvaluation (line 147). A small helper would improve consistency and maintainability.

🔎 Suggested helper
function getWordCount(text: string): number {
  return text.trim().split(/\s+/).length || 1;
}

115-129: Consider parallelizing chunk evaluations for performance.

The sequential for...of loop processes chunks one at a time. If rate limits and ordering aren't concerns, Promise.all could reduce latency for large documents with multiple chunks.

🔎 Parallel approach
-    for (const chunk of chunks) {
-      const llmResult =
-        await this.llmProvider.runPromptStructured<SubjectiveLLMResult>(
-          chunk.content,
-          this.prompt.body,
-          schema
-        );
-
-      const result = calculateSubjectiveScore(llmResult.criteria, {
-        promptCriteria: this.prompt.meta.criteria,
-      });
-
-      chunkResults.push(result);
-      chunkWordCounts.push(chunk.content.trim().split(/\s+/).length);
-    }
+    const evaluations = await Promise.all(
+      chunks.map(async (chunk) => {
+        const llmResult =
+          await this.llmProvider.runPromptStructured<SubjectiveLLMResult>(
+            chunk.content,
+            this.prompt.body,
+            schema
+          );
+        return {
+          result: calculateSubjectiveScore(llmResult.criteria, {
+            promptCriteria: this.prompt.meta.criteria,
+          }),
+          wordCount: chunk.content.trim().split(/\s+/).length,
+        };
+      })
+    );
+    for (const { result, wordCount } of evaluations) {
+      chunkResults.push(result);
+      chunkWordCounts.push(wordCount);
+    }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 659ea2e and 6dae98f.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (10)
  • src/chunking/chunker.ts (1 hunks)
  • src/chunking/index.ts (1 hunks)
  • src/chunking/merger.ts (1 hunks)
  • src/chunking/types.ts (1 hunks)
  • src/chunking/utils.ts (1 hunks)
  • src/evaluators/accuracy-evaluator.ts (4 hunks)
  • src/evaluators/base-evaluator.ts (4 hunks)
  • src/scoring/index.ts (1 hunks)
  • src/scoring/scorer.ts (1 hunks)
  • tests/scoring-types.test.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/scoring-types.test.ts (5)
src/schemas/prompt-schemas.ts (1)
  • PromptFile (64-64)
src/evaluators/base-evaluator.ts (1)
  • BaseEvaluator (37-172)
src/prompts/schema.ts (2)
  • SubjectiveLLMResult (74-82)
  • SemiObjectiveLLMResult (84-92)
src/providers/search-provider.ts (1)
  • SearchProvider (5-7)
src/evaluators/accuracy-evaluator.ts (1)
  • TechnicalAccuracyEvaluator (41-216)
src/chunking/merger.ts (2)
src/chunking/index.ts (1)
  • mergeViolations (3-3)
src/prompts/schema.ts (1)
  • SemiObjectiveItem (109-115)
src/scoring/scorer.ts (1)
src/prompts/schema.ts (4)
  • SemiObjectiveItem (109-115)
  • SemiObjectiveResult (117-133)
  • SubjectiveLLMResult (74-82)
  • SubjectiveResult (94-107)
src/chunking/chunker.ts (3)
src/chunking/types.ts (3)
  • ChunkingOptions (9-13)
  • ChunkingStrategy (15-18)
  • Chunk (1-7)
src/evaluators/base-evaluator.ts (1)
  • chunkContent (66-83)
src/chunking/utils.ts (2)
  • countWords (5-7)
  • splitIntoWords (1-3)
🔇 Additional comments (23)
src/chunking/utils.ts (2)

1-3: LGTM!

The word splitting logic is correct and handles edge cases well (empty strings, whitespace-only input).


5-7: LGTM!

Clean delegation to splitIntoWords for consistent word counting logic.

src/chunking/index.ts (1)

1-3: LGTM!

Clean barrel exports for the chunking module API surface.

src/scoring/index.ts (1)

1-7: LGTM!

Clean barrel exports for the scoring module API surface.

src/evaluators/accuracy-evaluator.ts (3)

12-12: LGTM!

Appropriate import of the new scoring function to support the updated evaluation flow.


89-94: LGTM!

Formatting change improves readability; no functional change.


219-236: LGTM!

The explicit search provider validation and clear error message improve the developer experience when configuration is missing.

tests/scoring-types.test.ts (5)

1-11: LGTM!

Imports updated appropriately to align with the new module structure.


34-74: LGTM!

The subjective evaluation test correctly validates weighted scoring logic with clear expectations.


90-128: LGTM!

Good test coverage for semi-objective scoring. Note that the unique analysis strings ("First issue found", "Second issue found") correctly prevent deduplication issues that would arise from the merger logic.


130-151: LGTM!

Appropriate test for the perfect score scenario (empty violations).


153-198: LGTM!

The test appropriately validates the no-claims scenario returns a perfect score.

src/chunking/types.ts (2)

1-7: LGTM!

The Chunk interface provides clear structure for representing content segments with proper offset tracking.


15-18: LGTM!

Clean interface definition for implementing different chunking strategies.

src/chunking/chunker.ts (3)

15-44: LGTM with minor observation.

The chunk offset calculation logic is sound. The fallback handling (lines 33-35, 39-40) ensures chunks are tracked even when indexOf fails, which is good defensive coding.


46-105: LGTM!

The recursive chunking algorithm correctly falls back through the separator hierarchy and recursively re-processes oversized chunks (lines 92-102). The logic at line 83 (setting currentChunk = part even when the part is large) is then handled correctly by the recursive pass.


107-117: LGTM!

Clean fallback implementation for force-splitting when separator-based chunking fails.

src/evaluators/base-evaluator.ts (3)

16-24: LGTM!

The new imports for chunking and scoring modules are well-organized, and the constants for chunking thresholds (600 words to trigger chunking, 500 words max per chunk) provide sensible defaults.


142-171: LGTM!

The approach of calculating the total word count from the original content and merging all chunk violations before scoring is correct for density-based scoring. The same parallelization opportunity applies here as noted for subjective evaluation.


176-181: LGTM!

The evaluator registration correctly handles the factory signature and properly ignores the unused searchProvider parameter.

src/scoring/scorer.ts (3)

9-41: LGTM!

The interfaces are well-defined with appropriate optional types, and resolveStrictness provides a clean mapping from named presets to numeric values.


109-145: LGTM!

The normalization formula correctly maps 1-4 scores to the 1-10 scale, and the weighted average calculation is sound. Minor note: if criteria is empty, totalWeight is 0, returning final_score: 1 which is a reasonable default.


250-268: LGTM!

The aggregated criteria construction and final score calculation correctly apply weighted averaging across the normalized scores.

@hurshore hurshore changed the title from "feat: implement content chunking and dedicated scoring logic for evaluators" to "draft: implement content chunking and dedicated scoring logic for evaluators" Dec 28, 2025
@hurshore hurshore changed the title from "draft: implement content chunking and dedicated scoring logic for evaluators" to "feat: implement content chunking and dedicated scoring logic for evaluators" Dec 28, 2025
@hurshore hurshore marked this pull request as draft December 28, 2025 18:34
- Add line numbering before chunking for accurate LLM line reporting
- Add token usage tracking and aggregation across chunks
- Add 'chunking' frontmatter option to disable chunking per-rule
@hurshore hurshore marked this pull request as ready for review December 30, 2025 23:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
src/scoring/scorer.ts (3)

91-101: Field naming (passed_count, total_count) may be misleading.

As noted in a previous review, passed_count is hardcoded to 0 and total_count equals violation count. The naming suggests "items passed vs total checked" but only violations are counted. Consider renaming or documenting these fields for clarity.


182-188: Array length mismatch silently gives zero weight.

As noted in a previous review, if chunkWordCounts[i] is undefined due to array length mismatch, the fallback to 0 gives that chunk zero weight, effectively ignoring its results. Consider adding validation or a warning.
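One hedged way to surface the mismatch instead of silently zero-weighting a chunk; the function name and shape here are illustrative, not the module's actual API:

```typescript
// Warn when results and word counts are out of sync, so a chunk that would
// silently receive zero weight is at least visible in the logs.
function checkWeights(results: unknown[], chunkWordCounts: number[]): void {
  if (results.length !== chunkWordCounts.length) {
    console.warn(
      `averageSubjectiveScores: ${results.length} results vs ` +
        `${chunkWordCounts.length} word counts; missing counts get zero weight`
    );
  }
}
```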


244-251: Violations with empty analysis will incorrectly deduplicate.

As noted in a previous review, the deduplication key falls back to "" for empty/undefined analysis strings, causing all such violations to collapse into one. This could silently drop valid violations.

🧹 Nitpick comments (4)
src/schemas/prompt-schemas.ts (1)

49-50: Consider using .optional().default(true) instead of .default(true).optional().

The current order is semantically valid but unconventional. In Zod, .default(true) already provides a fallback when the field is absent, making the trailing .optional() redundant. The standard pattern is .optional().default(true) which clearly expresses "field is optional, defaults to true if missing."

🔎 Suggested fix
-  chunking: z.boolean().default(true).optional(),
+  chunking: z.boolean().optional().default(true),
tests/scoring-types.test.ts (1)

168-174: Move vi.mock to the module level for reliable hoisting.

vi.mock calls are hoisted to the top of the file by Vitest, but placing them inside it() blocks can cause confusing behavior across tests. The mock may persist beyond this test or not apply correctly.

🔎 Suggested approach

Move the mock to the top of the file or use vi.doMock for test-scoped mocking:

// At file level, before describe blocks
vi.mock("../src/evaluators/prompt-loader", () => ({
  getPrompt: vi.fn().mockReturnValue({ body: "Extract claims" }),
}));

Or use vi.doMock with dynamic import for test-scoped mocking:

it("should return perfect score when no claims are found", async () => {
  vi.doMock("../src/evaluators/prompt-loader", () => ({
    getPrompt: vi.fn().mockReturnValue({ body: "Extract claims" }),
  }));
  
  const { TechnicalAccuracyEvaluator } = await import(
    "../src/evaluators/accuracy-evaluator"
  );
  // ...
});
src/evaluators/base-evaluator.ts (1)

146-162: Minor: Chunk word counts include line number prefixes.

At line 161, chunk.content.trim().split(/\s+/).length counts words including the prepended line numbers (e.g., "1: ", "2: "), while totalWordCount at line 190 uses the original content without line numbers. This creates a slight inconsistency in chunk weighting. Consider whether this affects the accuracy of weighted averaging.
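An illustrative fix would strip the line-number prefixes before counting, so chunk weights match the un-numbered totalWordCount. The "N: " prefix format is assumed here; the real format is defined by prependLineNumbers in src/output/line-numbering.ts:

```typescript
// Count words after removing a leading "N: " line-number prefix from each line.
function wordCountWithoutLineNumbers(chunk: string): number {
  const stripped = chunk.replace(/^\s*\d+:\s?/gm, "");
  return stripped.trim().split(/\s+/).filter(Boolean).length;
}
```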

src/scoring/scorer.ts (1)

259-260: Consider using newlines or markers when joining chunk summaries/reasonings.

Joining multiple chunk summaries with a single space may create hard-to-read text. Consider using " | " or "\n" as separators for better readability.

🔎 Suggested change
-      summary: entry.summaries.join(" "),
-      reasoning: entry.reasonings.join(" "),
+      summary: entry.summaries.join(" | "),
+      reasoning: entry.reasonings.join(" | "),
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dae98f and 89c0310.

📒 Files selected for processing (5)
  • src/evaluators/accuracy-evaluator.ts
  • src/evaluators/base-evaluator.ts
  • src/schemas/prompt-schemas.ts
  • src/scoring/scorer.ts
  • tests/scoring-types.test.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/evaluators/accuracy-evaluator.ts
  • src/schemas/prompt-schemas.ts
  • src/evaluators/base-evaluator.ts
  • src/scoring/scorer.ts
tests/**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.test.ts: Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Use dependency injection in tests: mock providers; do not hit network in unit tests

Files:

  • tests/scoring-types.test.ts
🧠 Learnings (5)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries

Applied to files:

  • src/evaluators/accuracy-evaluator.ts
  • src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to tests/**/*.test.ts : Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator

Applied to files:

  • tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Use TypeScript ESM with explicit imports and narrow types

Applied to files:

  • tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output

Applied to files:

  • tests/scoring-types.test.ts
  • src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/env-parser.ts : All environment variables must be validated via Zod schemas in `src/boundaries/env-parser.ts`

Applied to files:

  • src/schemas/prompt-schemas.ts
🧬 Code graph analysis (2)
src/evaluators/accuracy-evaluator.ts (3)
src/scoring/scorer.ts (1)
  • calculateSemiObjectiveScore (48-102)
src/evaluators/index.ts (1)
  • registerEvaluator (20-20)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/scoring/scorer.ts (1)
src/prompts/schema.ts (4)
  • SemiObjectiveItem (145-152)
  • SemiObjectiveResult (154-172)
  • SubjectiveLLMResult (96-110)
  • SubjectiveResult (123-143)
🔇 Additional comments (7)
src/evaluators/accuracy-evaluator.ts (2)

64-76: LGTM!

The integration with calculateSemiObjectiveScore is correct. The word count fallback to 1 prevents division-by-zero, and the scoring options are properly passed. The spread operator correctly handles the optional claimUsage.


241-258: LGTM!

The evaluator registration is correctly structured with proper dependency validation for the search provider.

tests/scoring-types.test.ts (1)

98-116: Good: Unique analysis strings prevent deduplication collisions.

Using distinct analysis strings ("First issue found", "Second issue found") ensures these violations are not incorrectly deduplicated, addressing a potential issue with the deduplication logic.

src/evaluators/base-evaluator.ts (2)

67-86: LGTM!

The chunkContent method correctly checks the chunking flag from prompt metadata and applies the word count threshold. The single-chunk fallback for disabled/small content is properly structured.


91-104: LGTM!

The aggregateUsage method correctly filters undefined values with a type predicate and aggregates token counts using reduce.
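The described pattern looks roughly like this sketch; the TokenUsage field names are assumed from src/providers/token-usage.ts and may differ:

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

// Filter out undefined entries with a type predicate, then sum with reduce.
function aggregateUsage(usages: Array<TokenUsage | undefined>): TokenUsage {
  return usages
    .filter((u): u is TokenUsage => u !== undefined)
    .reduce(
      (acc, u) => ({
        inputTokens: acc.inputTokens + u.inputTokens,
        outputTokens: acc.outputTokens + u.outputTokens,
        totalTokens: acc.totalTokens + u.totalTokens,
      }),
      { inputTokens: 0, outputTokens: 0, totalTokens: 0 }
    );
}
```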

src/scoring/scorer.ts (2)

9-18: LGTM!

The SemiObjectiveScoringOptions interface correctly types the strictness union and severity options.


110-146: LGTM!

The calculateSubjectiveScore function correctly normalizes 1-4 scores to 1-10 scale and computes weighted averages. The weight lookup by criterion name is straightforward.

- Use composite key for violation deduplication (quoted_text + description + analysis)
- Remove unused chunking options (overlapFraction, preserveSentences)
- Rename total_count to violation_count, remove meaningless passed_count
- Add warning for array length mismatch in averageSubjectiveScores
- Fix Zod schema ordering (.optional().default(true))

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/scoring/scorer.ts (1)

252-258: Deduplication may collapse violations with empty analysis.

Line 254 uses || "" as a fallback, meaning all violations with missing or empty analysis will share the same deduplication key and collapse to a single violation. While the subjective schema requires the analysis field, defensive programming suggests not relying solely on schema enforcement.

🔎 Suggested defensive fix

Option 1: Generate unique keys for empty analysis:

     // Deduplicate violations
     const seen = new Set<string>();
+    let emptyAnalysisCounter = 0;
     const uniqueViolations = entry.violations.filter((v) => {
-      const key = v.analysis?.toLowerCase().trim() || "";
+      const analysisText = v.analysis?.toLowerCase().trim();
+      const key = analysisText || `__empty_${emptyAnalysisCounter++}`;
       if (seen.has(key)) return false;
       seen.add(key);
       return true;
     });

Option 2: Use composite key (analysis + quoted_text):

     const uniqueViolations = entry.violations.filter((v) => {
-      const key = v.analysis?.toLowerCase().trim() || "";
+      const key = [
+        v.analysis?.toLowerCase().trim() || "",
+        v.quoted_text?.toLowerCase().trim() || "",
+      ].join("|");
       if (seen.has(key)) return false;
       seen.add(key);
       return true;
     });
🧹 Nitpick comments (1)
src/evaluators/base-evaluator.ts (1)

67-86: Consider using consistent word counting across modules.

Line 68 uses a simple split(/\s+/) for word counting, while the chunking utilities use splitIntoWords which filters empty strings. This could lead to minor discrepancies in word counts between the threshold check and actual chunking logic.

🔎 Suggested refactor for consistency
+import { countWords } from "../chunking/utils";
+
 protected chunkContent(content: string): Chunk[] {
-  const wordCount = content.trim().split(/\s+/).length || 1;
+  const wordCount = countWords(content) || 1;

   const chunkingEnabled = this.prompt.meta.chunking !== false;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89c0310 and 3309645.

📒 Files selected for processing (8)
  • src/chunking/chunker.ts
  • src/chunking/merger.ts
  • src/chunking/types.ts
  • src/evaluators/base-evaluator.ts
  • src/prompts/schema.ts
  • src/schemas/prompt-schemas.ts
  • src/scoring/scorer.ts
  • tests/scoring-types.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/schemas/prompt-schemas.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/chunking/chunker.ts
  • src/scoring/scorer.ts
  • src/prompts/schema.ts
  • src/chunking/types.ts
  • src/chunking/merger.ts
  • src/evaluators/base-evaluator.ts
tests/**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.test.ts: Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Use dependency injection in tests: mock providers; do not hit network in unit tests

Files:

  • tests/scoring-types.test.ts
🧠 Learnings (3)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to tests/**/*.test.ts : Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator

Applied to files:

  • tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Use TypeScript ESM with explicit imports and narrow types

Applied to files:

  • tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data

Applied to files:

  • tests/scoring-types.test.ts
🧬 Code graph analysis (3)
src/chunking/chunker.ts (3)
src/chunking/types.ts (3)
  • ChunkingOptions (9-11)
  • ChunkingStrategy (13-16)
  • Chunk (1-7)
src/evaluators/base-evaluator.ts (1)
  • chunkContent (67-86)
src/chunking/utils.ts (2)
  • countWords (5-7)
  • splitIntoWords (1-3)
src/scoring/scorer.ts (2)
src/scoring/index.ts (5)
  • SemiObjectiveScoringOptions (2-2)
  • SubjectiveScoringOptions (3-3)
  • calculateSemiObjectiveScore (4-4)
  • calculateSubjectiveScore (5-5)
  • averageSubjectiveScores (6-6)
src/prompts/schema.ts (4)
  • SemiObjectiveItem (145-152)
  • SemiObjectiveResult (154-171)
  • SubjectiveLLMResult (96-110)
  • SubjectiveResult (123-143)
src/chunking/merger.ts (2)
src/chunking/index.ts (1)
  • mergeViolations (3-3)
src/prompts/schema.ts (1)
  • SemiObjectiveItem (145-152)
🔇 Additional comments (9)
src/chunking/merger.ts (1)

3-20: LGTM! Composite key properly addresses deduplication concerns.

The implementation correctly uses a composite key combining quoted_text, description, and analysis fields. This ensures that only violations matching across all three fields are treated as duplicates, preventing the false positive deduplication issue raised in previous reviews.
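The composite-key scheme described above can be sketched as follows; field names follow the review, while the exact SemiObjectiveItem shape lives in src/prompts/schema.ts:

```typescript
interface Violation {
  quoted_text?: string;
  description?: string;
  analysis?: string;
}

// Flatten per-chunk violation lists and drop entries whose normalized
// (quoted_text, description, analysis) triple has already been seen.
function mergeViolations(chunkViolations: Violation[][]): Violation[] {
  const seen = new Set<string>();
  return chunkViolations.flat().filter((v) => {
    const key = [v.quoted_text, v.description, v.analysis]
      .map((s) => (s ?? "").toLowerCase().trim())
      .join("|");
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Because all three fields enter the key, two violations that merely share an analysis string but quote different text are kept apart.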

src/chunking/types.ts (1)

1-16: LGTM! Clean interface design.

The interface definitions are well-structured and minimal. The removal of unused overlapFraction and preserveSentences options (mentioned in past reviews) has resulted in a cleaner, more maintainable design.

tests/scoring-types.test.ts (1)

1-205: LGTM! Comprehensive test coverage.

The tests properly validate the new scoring logic with:

  • Weighted average calculations for subjective evaluation
  • Violation-based scoring for semi-objective evaluation
  • Edge cases (empty violations, perfect scores)
  • Unique analysis strings in test data to avoid unintended deduplication

Based on learnings, tests use Vitest framework appropriately with proper mocking.

src/prompts/schema.ts (1)

158-158: LGTM! Clearer field naming.

The change from passed_count/total_count to violation_count significantly improves API clarity and directly addresses the semantic confusion raised in past reviews.

src/chunking/chunker.ts (1)

44-103: LGTM! Sound recursive chunking logic.

The hierarchical separator approach with recursive refinement is well-designed. The method properly handles:

  • Base case for small-enough text
  • Fallback to next separator level
  • Recursive processing of oversized chunks
  • Final fallback to word-based splitting
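The hierarchy above can be sketched roughly like this. The separator list comes from the PR description; the function names, greedy packing, and word-based fallback are illustrative assumptions rather than the actual `RecursiveChunker`.

```typescript
// Try paragraph breaks first, then lines, then sentences, finally raw words.
const SEPARATORS = ["\n\n", "\n", ". "];

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function splitRecursively(text: string, maxWords: number, level = 0): string[] {
  if (countWords(text) <= maxWords) return [text]; // base case: small enough
  if (level >= SEPARATORS.length) {
    // Final fallback: hard split on word boundaries.
    const words = text.trim().split(/\s+/);
    const out: string[] = [];
    for (let i = 0; i < words.length; i += maxWords) {
      out.push(words.slice(i, i + maxWords).join(" "));
    }
    return out;
  }
  const parts = text.split(SEPARATORS[level]).filter((p) => p.trim().length > 0);
  if (parts.length <= 1) return splitRecursively(text, maxWords, level + 1);
  // Greedily pack parts into chunks, recursing into oversized parts.
  const chunks: string[] = [];
  let current = "";
  for (const part of parts) {
    const candidate = current ? current + SEPARATORS[level] + part : part;
    if (countWords(candidate) <= maxWords) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = "";
      if (countWords(part) > maxWords) {
        chunks.push(...splitRecursively(part, maxWords, level + 1));
      } else {
        current = part;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on the coarsest separator that produces multiple parts keeps paragraphs and sentences intact wherever possible, which is what "preserve semantic boundaries" means here.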
src/evaluators/base-evaluator.ts (1)

113-230: LGTM! Well-structured chunked evaluation flow.

The implementation properly handles both single and multiple chunk scenarios:

  • Subjective evaluation averages per-chunk scores weighted by chunk word count
  • Semi-objective evaluation collects and merges violations, then scores once
  • Token usage is correctly aggregated across multiple LLM calls

This aligns perfectly with the PR objectives for chunk-aware evaluation.
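The word-count weighting for subjective chunks can be illustrated with a small sketch. The real `averageSubjectiveScores` also merges per-criterion entries; here each chunk is reduced to a single score, and all names are assumptions.

```typescript
// Sketch only: weight each chunk's score by how many words it covered.
function averageByWordCount(scores: number[], wordCounts: number[]): number {
  if (scores.length !== wordCounts.length) {
    // Mirrors the defensive warning added for array length mismatches.
    console.warn("[sketch] scores/wordCounts length mismatch");
  }
  const n = Math.min(scores.length, wordCounts.length);
  let weighted = 0;
  let total = 0;
  for (let i = 0; i < n; i++) {
    weighted += scores[i] * wordCounts[i];
    total += wordCounts[i];
  }
  return total > 0 ? weighted / total : 0;
}
```

A 300-word chunk scoring 10 and a 100-word chunk scoring 4 average to 8.5 rather than 7, so longer chunks dominate the result as intended.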

src/scoring/scorer.ts (3)

48-101: LGTM! Clear scoring logic with improved field naming.

The function properly implements density-based scoring with the new violation_count field that addresses past review concerns about unclear semantics. The formula is well-documented and the severity determination logic is appropriate.
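As a rough illustration of density-based scoring: the actual formula in `calculateSemiObjectiveScore` is not reproduced here, and the penalty constant is an arbitrary illustrative choice. The sketch only assumes that more violations per word lowers the score.

```typescript
// Hedged sketch: score falls as violation density (violations per word) rises.
function densityScore(
  violationCount: number,
  wordCount: number,
  penaltyPerDensityUnit = 100 // illustrative constant, not the real one
): number {
  const density = violationCount / Math.max(wordCount, 1);
  const raw = 10 - density * penaltyPerDensityUnit;
  return Math.min(10, Math.max(1, raw)); // clamp to the 1-10 scale
}
```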


109-145: LGTM! Correct weighted scoring calculation.

The subjective scoring properly:

  • Normalizes 1-4 scores to 1-10 scale using the correct formula
  • Applies criterion weights from prompt configuration
  • Handles missing weights with sensible default of 1
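A hedged sketch of that calculation: the 1-4 to 1-10 mapping and the default weight of 1 come from this review, while the signature and field names are illustrative.

```typescript
// Illustrative shapes; not the repository's actual types.
interface CriterionScore {
  id: string;
  score: number; // LLM score on the 1-4 scale
}

function calculateWeightedSubjective(
  criteria: CriterionScore[],
  weights: Record<string, number>
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const c of criteria) {
    // Map 1-4 onto 1-10 linearly: 1 -> 1, 2 -> 4, 3 -> 7, 4 -> 10.
    const normalized = 1 + ((c.score - 1) * 9) / 3;
    const weight = weights[c.id] ?? 1; // missing weights default to 1
    weightedSum += normalized * weight;
    totalWeight += weight;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : 0;
}
```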

161-166: LGTM! Defensive check properly added.

The warning for array length mismatches addresses the past review concern and provides helpful debugging information if the arrays are mismatched due to a programming error.

- Use composite key (quoted_text + analysis) for subjective violation deduplication
- Use countWords utility instead of inline split for word counting
- Export countWords from chunking module
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/evaluators/base-evaluator.ts (2)

166-169: Use countWords utility for consistent word counting.

The inline word counting trim().split(/\s+/).length differs from the imported countWords utility used in chunkContent. This inconsistency could cause subtle discrepancies since countWords uses splitIntoWords which may handle edge cases (e.g., multiple spaces, empty strings) differently.

🔎 Proposed fix
-      chunkWordCounts.push(
-        stripLineNumbers(chunk.content).trim().split(/\s+/).length
-      );
+      chunkWordCounts.push(countWords(stripLineNumbers(chunk.content)));

198-198: Use countWords utility for consistent word counting.

Same issue as in runSubjectiveEvaluation — use the imported countWords utility instead of inline splitting for consistency.

🔎 Proposed fix
-    const totalWordCount = content.trim().split(/\s+/).length || 1;
+    const totalWordCount = countWords(content) || 1;
src/scoring/scorer.ts (2)

9-18: Consider narrowing promptSeverity type.

The | string in promptSeverity type weakens type safety. If this accepts prompt frontmatter values, consider using a union of expected string literals (e.g., "warning" | "error") or document why arbitrary strings are accepted.


269-270: Consider using a proper sentence separator when joining summaries/reasonings.

Joining with " " may produce awkward text if chunks' summaries don't end with punctuation. Consider using " | " or "\n" as a separator for clearer delineation.

🔎 Proposed fix
-      summary: entry.summaries.join(" "),
-      reasoning: entry.reasonings.join(" "),
+      summary: entry.summaries.join("\n"),
+      reasoning: entry.reasonings.join("\n"),

Or filter empty strings and use a cleaner separator:

summary: entry.summaries.filter(Boolean).join(" | "),
reasoning: entry.reasonings.filter(Boolean).join(" | "),
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3309645 and c3b8d42.

📒 Files selected for processing (3)
  • src/chunking/index.ts
  • src/evaluators/base-evaluator.ts
  • src/scoring/scorer.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/chunking/index.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/scoring/scorer.ts
  • src/evaluators/base-evaluator.ts
🧬 Code graph analysis (2)
src/scoring/scorer.ts (1)
src/prompts/schema.ts (4)
  • SemiObjectiveItem (145-152)
  • SemiObjectiveResult (154-171)
  • SubjectiveLLMResult (96-110)
  • SubjectiveResult (123-143)
src/evaluators/base-evaluator.ts (10)
src/chunking/index.ts (4)
  • Chunk (1-1)
  • countWords (4-4)
  • RecursiveChunker (2-2)
  • mergeViolations (3-3)
src/chunking/types.ts (1)
  • Chunk (1-7)
src/chunking/utils.ts (1)
  • countWords (5-7)
src/chunking/chunker.ts (2)
  • RecursiveChunker (8-116)
  • chunk (13-42)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/prompts/schema.ts (3)
  • SubjectiveLLMResult (96-110)
  • SubjectiveResult (123-143)
  • SemiObjectiveLLMResult (112-121)
src/scoring/scorer.ts (3)
  • calculateSubjectiveScore (109-145)
  • averageSubjectiveScores (148-282)
  • calculateSemiObjectiveScore (48-101)
src/scoring/index.ts (3)
  • calculateSubjectiveScore (5-5)
  • averageSubjectiveScores (6-6)
  • calculateSemiObjectiveScore (4-4)
src/output/line-numbering.ts (1)
  • stripLineNumbers (27-32)
src/chunking/merger.ts (1)
  • mergeViolations (3-20)
🔇 Additional comments (9)
src/evaluators/base-evaluator.ts (4)

16-30: LGTM!

The new imports are well-organized and the chunking constants are reasonable values (600-word threshold, 500-word max chunk size).


72-91: LGTM!

The chunkContent method correctly uses the countWords utility and provides a clean opt-out mechanism via prompt.meta.chunking. The defensive || 1 handles edge cases well.


96-109: LGTM!

Clean token aggregation with proper type narrowing via the type guard filter.
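The pattern being praised here, a type-guard filter followed by a reduce, looks roughly like this. The `TokenUsage` field names are guesses, not the actual definition in `src/providers/token-usage.ts`.

```typescript
// Field names are assumptions; see src/providers/token-usage.ts for the real shape.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function aggregateUsage(
  usages: (TokenUsage | undefined)[]
): TokenUsage | undefined {
  // The predicate narrows (TokenUsage | undefined)[] to TokenUsage[].
  const present = usages.filter((u): u is TokenUsage => u !== undefined);
  if (present.length === 0) return undefined;
  return present.reduce((acc, u) => ({
    inputTokens: acc.inputTokens + u.inputTokens,
    outputTokens: acc.outputTokens + u.outputTokens,
  }));
}
```

Providers that omit usage data simply drop out of the sum instead of poisoning it with `undefined`.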


200-227: LGTM!

The chunked evaluation flow correctly collects violations per chunk, merges with deduplication via mergeViolations, and calculates the final score from the merged set. Token usage aggregation is properly handled.

src/scoring/scorer.ts (5)

26-41: LGTM!

Clean strictness resolution with sensible default values.


48-101: LGTM!

The scoring logic is correct. Using violation_count is clearer than the previous field names mentioned in past reviews.


109-145: LGTM!

The normalization formula correctly maps the 1-4 scale to 1-10, and the weighted average calculation is properly implemented.


160-166: LGTM!

Good defensive logging for array length mismatches. This addresses the past review concern about silent data loss.


251-261: LGTM!

The composite key approach using both quoted_text and analysis addresses the prior deduplication concern. Violations with both fields empty colliding is an acceptable edge case since such violations would provide minimal diagnostic value anyway.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/evaluators/base-evaluator.ts (2)

29-30: Consider making chunking parameters configurable.

The hardcoded constants CHUNKING_THRESHOLD = 600 and MAX_CHUNK_SIZE = 500 work for general use, but different evaluation types or content domains might benefit from different thresholds. Consider exposing these via prompt.meta (e.g., prompt.meta.chunkingThreshold, prompt.meta.maxChunkSize) with these values as defaults.


72-91: Consider adding error handling for chunking operations.

The RecursiveChunker could potentially throw errors during chunking (e.g., invalid input, configuration issues). Wrapping lines 89-90 in a try-catch with a fallback to a single-chunk strategy would improve robustness.

🔎 Suggested error handling
   const chunker = new RecursiveChunker();
-  return chunker.chunk(content, { maxChunkSize: MAX_CHUNK_SIZE });
+  try {
+    return chunker.chunk(content, { maxChunkSize: MAX_CHUNK_SIZE });
+  } catch (e: unknown) {
+    const err = e instanceof Error ? e : new Error(String(e));
+    console.warn(`[vectorlint] Chunking failed: ${err.message}`);
+    // Fallback to single chunk
+    return [
+      {
+        content,
+        startOffset: 0,
+        endOffset: content.length,
+        index: 0,
+      },
+    ];
+  }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3b8d42 and 6c8e69b.

📒 Files selected for processing (2)
  • src/evaluators/accuracy-evaluator.ts
  • src/evaluators/base-evaluator.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/evaluators/accuracy-evaluator.ts
  • src/evaluators/base-evaluator.ts
🧠 Learnings (1)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries

Applied to files:

  • src/evaluators/accuracy-evaluator.ts
🔇 Additional comments (7)
src/evaluators/accuracy-evaluator.ts (3)

13-14: LGTM! Clean integration of scoring module.

The no-claims path now correctly uses calculateSemiObjectiveScore from the scoring module for consistency with the base evaluator. The countWords utility integration aligns with the standardization goal mentioned in the PR objectives.

Also applies to: 65-77


107-115: Usage aggregation is correctly implemented.

Token usage from claim extraction and evaluation phases are properly summed, and the optional nature of usage fields is handled correctly.


242-259: Registration pattern is correct.

The factory function properly guards against missing searchProvider with a clear error message, and the multiline formatting improves readability.

src/evaluators/base-evaluator.ts (4)

16-27: Chunking integration is well-architected.

The approach of prepending line numbers before chunking (lines 124, 195) ensures that line references reported by the LLM remain globally accurate across all chunks. Stripping line numbers before word counting (line 167, 196) correctly excludes line-number artifacts from the word count.

Also applies to: 72-91


93-109: Token usage aggregation is correctly implemented.

The type guard on line 99 properly filters undefined values, and the reduce operation correctly sums token counts across multiple LLM calls.


113-177: Subjective evaluation chunking flow is correct.

The implementation correctly:

  • Handles the single-chunk optimization (lines 129-145)
  • Strips line numbers before counting words for chunk weights (line 167)
  • Uses averageSubjectiveScores with chunk word counts to produce a weighted average
  • Aggregates token usage across all chunk evaluations

180-233: Semi-objective evaluation chunking flow is correct.

The implementation correctly:

  • Collects violations from all chunks (lines 202-211)
  • Merges and deduplicates violations via mergeViolations (line 214)
  • Calculates the final score once using total word count from the original content (line 196)
  • Aggregates token usage across all chunk evaluations

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/evaluators/base-evaluator.ts (1)

145-166: Consider extracting chunk word counting logic.

Line 165 strips line numbers and counts words for each chunk. This pattern could be extracted to reduce duplication and clarify intent, though the current implementation is functional.

🔎 Optional refactor
+  protected getActualWordCount(content: string): number {
+    return countWords(stripLineNumbers(content));
+  }

   for (const chunk of chunks) {
     const { data: llmResult, usage } =
       await this.llmProvider.runPromptStructured<SubjectiveLLMResult>(
         chunk.content,
         this.prompt.body,
         schema
       );

     usages.push(usage);

     const result = calculateSubjectiveScore(llmResult.criteria, {
       promptCriteria: this.prompt.meta.criteria,
     });

     chunkResults.push(result);
-    // Strip line numbers before counting to get accurate word count
-    chunkWordCounts.push(countWords(stripLineNumbers(chunk.content)));
+    chunkWordCounts.push(this.getActualWordCount(chunk.content));
   }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c8e69b and 59b3d1d.

📒 Files selected for processing (3)
  • src/chunking/chunker.ts
  • src/chunking/types.ts
  • src/evaluators/base-evaluator.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/chunking/types.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/evaluators/base-evaluator.ts
  • src/chunking/chunker.ts
🧬 Code graph analysis (1)
src/evaluators/base-evaluator.ts (9)
src/chunking/types.ts (1)
  • Chunk (1-5)
src/chunking/index.ts (4)
  • Chunk (1-1)
  • countWords (4-4)
  • RecursiveChunker (2-2)
  • mergeViolations (3-3)
src/chunking/utils.ts (1)
  • countWords (5-7)
src/chunking/chunker.ts (2)
  • RecursiveChunker (8-98)
  • chunk (13-24)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/prompts/schema.ts (3)
  • SubjectiveLLMResult (96-110)
  • SubjectiveResult (123-143)
  • SemiObjectiveLLMResult (112-121)
src/scoring/index.ts (3)
  • calculateSubjectiveScore (5-5)
  • averageSubjectiveScores (6-6)
  • calculateSemiObjectiveScore (4-4)
src/output/line-numbering.ts (1)
  • stripLineNumbers (27-32)
src/chunking/merger.ts (1)
  • mergeViolations (3-20)
🔇 Additional comments (4)
src/chunking/chunker.ts (2)

13-24: LGTM! Past review concern addressed.

The simplified implementation only tracks content and index for chunks, avoiding the offset calculation issues flagged in the previous review. The filtering and mapping logic is correct.


26-97: Solid recursive chunking implementation.

The hierarchical splitting with automatic fallback to the next separator and recursive refinement for oversized chunks is well-designed. The logic correctly handles edge cases, including parts that exceed maxSize (refined recursively at lines 75-78).

src/evaluators/base-evaluator.ts (2)

94-107: Correct usage aggregation.

The type guard filtering, empty check, and reduce logic properly aggregate token usage across multiple LLM calls.


186-231: Correct semi-objective evaluation flow.

The implementation properly collects violations across chunks, merges with deduplication, and calculates density-based scoring. Line 194 correctly uses the original content word count (not numbered) for accurate density calculation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/evaluators/base-evaluator.ts (1)

72-89: Word counting on numbered content causes incorrect chunking decisions.

This method receives numberedContent (with line number prefixes) from both evaluation methods (lines 123, 192). Line 73 counts words on this numbered content, inflating the word count and causing:

  1. Premature chunking — content below 600 actual words may exceed the threshold due to line numbers
  2. Incorrect chunk sizesRecursiveChunker bases chunk sizes on inflated word counts (line 88)

For example, 500 actual words might count as 600+ with line numbers, incorrectly triggering chunking.

🔎 Proposed fix: Strip line numbers before word counting

Option 1 (Recommended): Strip line numbers in chunkContent before counting

Add a utility to strip line numbers and count words on actual content:

 protected chunkContent(content: string): Chunk[] {
-  const wordCount = countWords(content) || 1;
+  // Strip line numbers for accurate word counting
+  const actualContent = content.replace(/^\s*\d+[:.)]\s*/gm, '');
+  const wordCount = countWords(actualContent) || 1;

   const chunkingEnabled = this.prompt.meta.evaluateAs !== "document";

   if (!chunkingEnabled || wordCount <= CHUNKING_THRESHOLD) {
     // Chunking disabled or content is small enough - return as single chunk
     return [
       {
         content,
         index: 0,
       },
     ];
   }

   const chunker = new RecursiveChunker();
-  return chunker.chunk(content, { maxChunkSize: MAX_CHUNK_SIZE });
+  return chunker.chunk(actualContent, { maxChunkSize: MAX_CHUNK_SIZE });
 }

Note: After stripping, you'll need to restore line numbers to chunk content before returning, or pass original numbered content to the chunker and only strip for counting. The latter approach preserves line numbers in chunks for LLM reporting.

Option 2: Pass original content for chunking decisions

Refactor callers to pass both original and numbered content, using original for decisions and numbered for LLM processing. This avoids repeated stripping but requires signature changes.
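A third option, matching the later commit "make countWords automatically strip line number prefixes", is to fold the stripping into `countWords` itself. The prefix regex is the one proposed above; whether the real utility uses exactly this pattern is an assumption.

```typescript
// Counting ignores line-number prefixes, so numbered and raw content
// produce the same word count for the same underlying text.
function countWords(text: string): number {
  const stripped = text.replace(/^\s*\d+[:.)]\s*/gm, "");
  return stripped.trim().split(/\s+/).filter(Boolean).length;
}
```

With stripping built in, the chunking threshold check and the per-chunk weights count the same thing regardless of whether the caller passes numbered or raw content.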

🧹 Nitpick comments (1)
src/schemas/prompt-schemas.ts (1)

49-50: Consider clarifying default behavior in comment.

The comment indicates "chunk (default)" but the schema doesn't enforce a default value via .default(). The default behavior is implicit in the consuming code (base-evaluator.ts line 75: evaluateAs !== "document"), which treats undefined as enabling chunking.

For better clarity, consider either:

  1. Adding .default("chunk") to the schema
  2. Or updating the comment to: // Determines how content is evaluated: 'document' for full document, 'chunk' for chunked processing (chunking enabled when unspecified)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfda6d2 and 23aaa1a.

📒 Files selected for processing (3)
  • CREATING_RULES.md
  • src/evaluators/base-evaluator.ts
  • src/schemas/prompt-schemas.ts
✅ Files skipped from review due to trivial changes (1)
  • CREATING_RULES.md
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/evaluators/base-evaluator.ts
  • src/schemas/prompt-schemas.ts
🧠 Learnings (4)
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output

Applied to files:

  • src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/**/*.ts : Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data

Applied to files:

  • src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries

Applied to files:

  • src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/boundaries/env-parser.ts : All environment variables must be validated via Zod schemas in `src/boundaries/env-parser.ts`

Applied to files:

  • src/schemas/prompt-schemas.ts
🧬 Code graph analysis (1)
src/evaluators/base-evaluator.ts (7)
src/chunking/types.ts (1)
  • Chunk (1-5)
src/chunking/index.ts (4)
  • Chunk (1-1)
  • countWords (4-4)
  • RecursiveChunker (2-2)
  • mergeViolations (3-3)
src/chunking/chunker.ts (2)
  • RecursiveChunker (8-98)
  • chunk (13-24)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/prompts/schema.ts (3)
  • SubjectiveLLMResult (96-110)
  • SubjectiveResult (123-143)
  • SemiObjectiveLLMResult (112-121)
src/scoring/index.ts (3)
  • calculateSubjectiveScore (5-5)
  • averageSubjectiveScores (6-6)
  • calculateSemiObjectiveScore (4-4)
src/chunking/merger.ts (1)
  • mergeViolations (3-20)
🔇 Additional comments (5)
src/schemas/prompt-schemas.ts (2)

5-13: Good addition of .strict() for schema validation.

Adding .strict() to TARGET_SPEC_SCHEMA prevents unknown properties from being accepted, improving validation rigor.


39-46: LGTM!

Using z.enum() for evaluator and type fields provides better type checking and inference compared to string literals. The expanded strictness union allows both numeric and named presets, providing flexibility while maintaining type safety.

src/evaluators/base-evaluator.ts (3)

16-30: LGTM!

The new imports for chunking and scoring utilities are properly typed, and the chunking constants (600-word threshold, 500-word max chunk size) are reasonable for managing large documents.


91-107: LGTM!

The aggregateUsage method correctly handles optional TokenUsage values with proper type guards and aggregates token counts across multiple LLM calls.


185-230: Semi-objective evaluation flow is well-structured.

The evaluation correctly:

  1. Collects violations from all chunks
  2. Merges and deduplicates violations (line 211)
  3. Calculates the final score once using total violations and accurate word count from original content (lines 193, 214-222)

Note: Line 193 correctly uses the original content parameter for total word count, avoiding the numbered content issue. However, the chunking decision at line 192 is still affected by the word counting bug in chunkContent.

@oshorefueled oshorefueled merged commit 2cd9bbe into main Jan 2, 2026
3 checks passed
oshorefueled pushed a commit that referenced this pull request Mar 2, 2026
…uators (#39)

* feat: implement content chunking and dedicated scoring logic for evaluators

* refactor: replace `pre` and `post` fields with `quoted_text`, `context_before`, and `context_after`

* feat: add line numbering and per-rule chunking control

- Add line numbering before chunking for accurate LLM line reporting
- Add token usage tracking and aggregation across chunks
- Add 'chunking' frontmatter option to disable chunking per-rule

* refactor: improve chunking and scoring code quality

- Use composite key for violation deduplication (quoted_text + description + analysis)
- Remove unused chunking options (overlapFraction, preserveSentences)
- Rename total_count to violation_count, remove meaningless passed_count
- Add warning for array length mismatch in averageSubjectiveScores
- Fix Zod schema ordering (.optional().default(true))

* fix: correct word counting and improve test mocking

* refactor: use consistent deduplication and word counting

- Use composite key (quoted_text + analysis) for subjective violation deduplication
- Use countWords utility instead of inline split for word counting
- Export countWords from chunking module

* refactor: use countWords utility consistently in evaluators

* refactor: remove unused chunk offset calculation

* refactor: make countWords automatically strip line number prefixes

* refactor: replace chunking with evaluateAs property