feat: implement content chunking and dedicated scoring logic for evaluators #39
oshorefueled merged 11 commits into main from
Conversation
📝 Walkthrough
Adds a chunking subsystem, scoring utilities, and evaluator integration: text is recursively chunked, per-chunk evaluations run and are deduplicated, scores are aggregated (subjective and semi-objective), and LLM token usage is combined. New types, utils, and barrel exports are introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Evaluator as Evaluator (Base / Accuracy)
  participant Chunker as RecursiveChunker
  participant LLM as LLM Provider
  participant Merger as mergeViolations
  participant Scorer as Scoring (semi/subjective)
  Evaluator->>Chunker: chunkContent(text)
  Chunker-->>Evaluator: chunks[]
  alt multiple chunks
    loop per chunk
      Evaluator->>LLM: evaluate(chunk)
      LLM-->>Evaluator: chunkResult + usage
    end
    Evaluator->>Merger: mergeViolations(all chunk violations)
    Merger-->>Evaluator: mergedViolations
    Evaluator->>Scorer: calculateSemiObjectiveScore(mergedViolations, totalWords)
    Scorer-->>Evaluator: semi-objective result
    Evaluator->>Scorer: averageSubjectiveScores(per-chunk results, chunkWordCounts)
    Scorer-->>Evaluator: subjective result
    Evaluator->>Evaluator: aggregateUsage(all usages)
  else single chunk or chunking disabled
    Evaluator->>LLM: evaluate(full content)
    LLM-->>Evaluator: result + usage
    Evaluator->>Scorer: calculate...Score(result)
    Scorer-->>Evaluator: final result
  end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 5
🧹 Nitpick comments (4)
src/evaluators/accuracy-evaluator.ts (1)
57-66: Consider using consistent word counting logic across the codebase. Line 60 uses `content.trim().split(/\s+/).length || 1` for word counting, while the new `src/chunking/utils.ts` provides `countWords()` with slightly different normalization logic. For consistency and edge-case handling (empty strings, multiple whitespace types), consider using the shared utility.

🔎 Apply this diff for consistency:

```diff
+import { countWords } from "../chunking/utils";
+
 async evaluate(_file: string, content: string): Promise<EvaluationResult> {
   // Step 1: Extract factual claims from the content
   const claims = await this.extractClaims(content);

   // If no claims found, return success (empty items array, perfect score)
   // Use the scoring module to calculate result
   if (claims.length === 0) {
-    const wordCount = content.trim().split(/\s+/).length || 1;
+    const wordCount = Math.max(countWords(content), 1);
     return calculateSemiObjectiveScore([], wordCount, {
       strictness: this.prompt.meta.strictness,
       defaultSeverity: this.defaultSeverity,
       promptSeverity: this.prompt.meta.severity,
     });
   }
```

src/chunking/chunker.ts (1)
4-8: Document or implement unused chunking options. `DEFAULT_OPTIONS` includes `overlapFraction` and `preserveSentences`, but these options aren't used in the chunking logic. This is consistent with the concern raised for `src/chunking/types.ts`. Either implement these features or remove them from the defaults until they're ready.

src/evaluators/base-evaluator.ts (2)
66-83: Consider extracting word count calculation to avoid duplication. The word count logic `content.trim().split(/\s+/).length` is repeated here (line 67), in `runSubjectiveEvaluation` (line 128), and in `runSemiObjectiveEvaluation` (line 147). A small helper would improve consistency and maintainability.

🔎 Suggested helper

```typescript
function getWordCount(text: string): number {
  return text.trim().split(/\s+/).length || 1;
}
```
115-129: Consider parallelizing chunk evaluations for performance. The sequential `for...of` loop processes chunks one at a time. If rate limits and ordering aren't concerns, `Promise.all` could reduce latency for large documents with multiple chunks.

🔎 Parallel approach

```diff
- for (const chunk of chunks) {
-   const llmResult =
-     await this.llmProvider.runPromptStructured<SubjectiveLLMResult>(
-       chunk.content,
-       this.prompt.body,
-       schema
-     );
-
-   const result = calculateSubjectiveScore(llmResult.criteria, {
-     promptCriteria: this.prompt.meta.criteria,
-   });
-
-   chunkResults.push(result);
-   chunkWordCounts.push(chunk.content.trim().split(/\s+/).length);
- }
+ const evaluations = await Promise.all(
+   chunks.map(async (chunk) => {
+     const llmResult =
+       await this.llmProvider.runPromptStructured<SubjectiveLLMResult>(
+         chunk.content,
+         this.prompt.body,
+         schema
+       );
+     return {
+       result: calculateSubjectiveScore(llmResult.criteria, {
+         promptCriteria: this.prompt.meta.criteria,
+       }),
+       wordCount: chunk.content.trim().split(/\s+/).length,
+     };
+   })
+ );
+ for (const { result, wordCount } of evaluations) {
+   chunkResults.push(result);
+   chunkWordCounts.push(wordCount);
+ }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (10)
- src/chunking/chunker.ts (1 hunks)
- src/chunking/index.ts (1 hunks)
- src/chunking/merger.ts (1 hunks)
- src/chunking/types.ts (1 hunks)
- src/chunking/utils.ts (1 hunks)
- src/evaluators/accuracy-evaluator.ts (4 hunks)
- src/evaluators/base-evaluator.ts (4 hunks)
- src/scoring/index.ts (1 hunks)
- src/scoring/scorer.ts (1 hunks)
- tests/scoring-types.test.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/scoring-types.test.ts (5)
src/schemas/prompt-schemas.ts (1)
- PromptFile (64-64)

src/evaluators/base-evaluator.ts (1)
- BaseEvaluator (37-172)

src/prompts/schema.ts (2)
- SubjectiveLLMResult (74-82)
- SemiObjectiveLLMResult (84-92)

src/providers/search-provider.ts (1)
- SearchProvider (5-7)

src/evaluators/accuracy-evaluator.ts (1)
- TechnicalAccuracyEvaluator (41-216)
src/chunking/merger.ts (2)
src/chunking/index.ts (1)
- mergeViolations (3-3)

src/prompts/schema.ts (1)
- SemiObjectiveItem (109-115)
src/scoring/scorer.ts (1)
src/prompts/schema.ts (4)
- SemiObjectiveItem (109-115)
- SemiObjectiveResult (117-133)
- SubjectiveLLMResult (74-82)
- SubjectiveResult (94-107)
src/chunking/chunker.ts (3)
src/chunking/types.ts (3)
- ChunkingOptions (9-13)
- ChunkingStrategy (15-18)
- Chunk (1-7)

src/evaluators/base-evaluator.ts (1)
- chunkContent (66-83)

src/chunking/utils.ts (2)
- countWords (5-7)
- splitIntoWords (1-3)
🔇 Additional comments (23)
src/chunking/utils.ts (2)
1-3: LGTM! The word splitting logic is correct and handles edge cases well (empty strings, whitespace-only input).
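A minimal sketch of what these utilities likely look like; the actual implementations live in `src/chunking/utils.ts`, so treat the exact bodies here as assumptions based on the review's description:

```typescript
// Sketch of the word utilities discussed above (assumed implementation).
// Filtering out empty strings is what handles leading/trailing and
// repeated whitespace correctly.
function splitIntoWords(text: string): string[] {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

function countWords(text: string): number {
  return splitIntoWords(text).length;
}
```

The filter step is why `countWords("")` is 0 rather than 1, unlike a bare `split(/\s+/).length`.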
5-7: LGTM! Clean delegation to `splitIntoWords` for consistent word counting logic.

src/chunking/index.ts (1)
1-3: LGTM! Clean barrel exports for the chunking module API surface.

src/scoring/index.ts (1)

1-7: LGTM! Clean barrel exports for the scoring module API surface.

src/evaluators/accuracy-evaluator.ts (3)
src/evaluators/accuracy-evaluator.ts (3)
12-12: LGTM! Appropriate import of the new scoring function to support the updated evaluation flow.

89-94: LGTM! Formatting change improves readability; no functional change.

219-236: LGTM! The explicit search provider validation and clear error message improve the developer experience when configuration is missing.
tests/scoring-types.test.ts (5)
1-11: LGTM! Imports updated appropriately to align with the new module structure.

34-74: LGTM! The subjective evaluation test correctly validates weighted scoring logic with clear expectations.

90-128: LGTM! Good test coverage for semi-objective scoring. Note that the unique analysis strings ("First issue found", "Second issue found") correctly prevent deduplication issues that would arise from the merger logic.

130-151: LGTM! Appropriate test for the perfect score scenario (empty violations).

153-198: LGTM! The test appropriately validates that the no-claims scenario returns a perfect score.
src/chunking/types.ts (2)
1-7: LGTM! The Chunk interface provides clear structure for representing content segments with proper offset tracking.

15-18: LGTM! Clean interface definition for implementing different chunking strategies.
src/chunking/chunker.ts (3)
15-44: LGTM with minor observation. The chunk offset calculation logic is sound. The fallback handling (lines 33-35, 39-40) ensures chunks are tracked even when `indexOf` fails, which is good defensive coding.

46-105: LGTM! The recursive chunking algorithm correctly handles hierarchical separator fallback and properly recursively processes oversized chunks (lines 92-102). The logic at line 83 (setting `currentChunk = part` even if large) is correctly handled by the recursive pass.

107-117: LGTM! Clean fallback implementation for force-splitting when separator-based chunking fails.
src/evaluators/base-evaluator.ts (3)
16-24: LGTM! The new imports for chunking and scoring modules are well-organized, and the constants for chunking thresholds (600 words to trigger chunking, 500 words max per chunk) provide sensible defaults.

142-171: LGTM! The approach of calculating the total word count from the original content and merging all chunk violations before scoring is correct for density-based scoring. The same parallelization opportunity applies here as noted for subjective evaluation.

176-181: LGTM! The evaluator registration correctly handles the factory signature and properly ignores the unused `searchProvider` parameter.

src/scoring/scorer.ts (3)
9-41: LGTM! The interfaces are well-defined with appropriate optional types, and `resolveStrictness` provides a clean mapping from named presets to numeric values.

109-145: LGTM! The normalization formula correctly maps 1-4 scores to the 1-10 scale, and the weighted average calculation is sound. Minor note: if `criteria` is empty, `totalWeight` is 0, returning `final_score: 1`, which is a reasonable default.
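The normalization and weighting described here can be sketched as follows; the linear 1-4 to 1-10 mapping and the weight handling are assumptions modeled on the review, not the verbatim code from `src/scoring/scorer.ts`:

```typescript
// Illustrative sketch of the 1-4 -> 1-10 normalization and weighted average.
interface Criterion {
  name: string;
  score: number;   // raw LLM score on a 1-4 scale
  weight?: number; // from prompt config; missing weights default to 1
}

function normalize(raw: number): number {
  // Linear map: 1 -> 1, 4 -> 10
  return 1 + (raw - 1) * 3;
}

function weightedFinalScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((s, c) => s + (c.weight ?? 1), 0);
  if (totalWeight === 0) return 1; // empty criteria fall back to 1
  const sum = criteria.reduce(
    (s, c) => s + normalize(c.score) * (c.weight ?? 1),
    0
  );
  return sum / totalWeight;
}
```

Under this mapping a raw 4 becomes a 10 and a raw 1 stays a 1, matching the behavior the review attributes to the scorer.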
250-268: LGTM! The aggregated criteria construction and final score calculation correctly apply weighted averaging across the normalized scores.
…t_before`, and `context_after`
- Add line numbering before chunking for accurate LLM line reporting
- Add token usage tracking and aggregation across chunks
- Add 'chunking' frontmatter option to disable chunking per-rule
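The line-numbering step mentioned above can be sketched like this; the `"N: "` prefix format and the helper names are assumptions (the review later references `stripLineNumbers` in `src/output/line-numbering.ts`):

```typescript
// Sketch of line-number prefixing/stripping (assumed "N: " format).
// Numbering before chunking lets the LLM report positions in the
// original document rather than chunk-relative lines.
function addLineNumbers(text: string, startLine = 1): string {
  return text
    .split("\n")
    .map((line, i) => `${startLine + i}: ${line}`)
    .join("\n");
}

function stripLineNumbers(text: string): string {
  return text
    .split("\n")
    .map((line) => line.replace(/^\d+:\s?/, ""))
    .join("\n");
}
```

Stripping is needed anywhere word counts are taken, since the prefixes would otherwise inflate them.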
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/scoring/scorer.ts (3)
91-101: Field naming (`passed_count`, `total_count`) may be misleading. As noted in a previous review, `passed_count` is hardcoded to `0` and `total_count` equals the violation count. The naming suggests "items passed vs total checked" but only violations are counted. Consider renaming or documenting these fields for clarity.
182-188: Array length mismatch silently gives zero weight. As noted in a previous review, if `chunkWordCounts[i]` is undefined due to an array length mismatch, the fallback to `0` gives that chunk zero weight, effectively ignoring its results. Consider adding validation or a warning.
244-251: Violations with empty `analysis` will incorrectly deduplicate. As noted in a previous review, the deduplication key falls back to `""` for empty/undefined analysis strings, causing all such violations to collapse into one. This could silently drop valid violations.
🧹 Nitpick comments (4)
src/schemas/prompt-schemas.ts (1)
49-50: Consider using `.optional().default(true)` instead of `.default(true).optional()`. The current order is semantically valid but unconventional. In Zod, `.default(true)` already provides a fallback when the field is absent, making the trailing `.optional()` redundant. The standard pattern is `.optional().default(true)`, which clearly expresses "field is optional, defaults to true if missing."

🔎 Suggested fix

```diff
- chunking: z.boolean().default(true).optional(),
+ chunking: z.boolean().optional().default(true),
```
168-174: Move `vi.mock` to the module level for reliable hoisting. `vi.mock` calls are hoisted to the top of the file by Vitest, but placing them inside `it()` blocks can cause confusing behavior across tests. The mock may persist beyond this test or not apply correctly.

🔎 Suggested approach

Move the mock to the top of the file:

```typescript
// At file level, before describe blocks
vi.mock("../src/evaluators/prompt-loader", () => ({
  getPrompt: vi.fn().mockReturnValue({ body: "Extract claims" }),
}));
```

Or use `vi.doMock` with dynamic import for test-scoped mocking:

```typescript
it("should return perfect score when no claims are found", async () => {
  vi.doMock("../src/evaluators/prompt-loader", () => ({
    getPrompt: vi.fn().mockReturnValue({ body: "Extract claims" }),
  }));
  const { TechnicalAccuracyEvaluator } = await import(
    "../src/evaluators/accuracy-evaluator"
  );
  // ...
});
```
146-162: Minor: Chunk word counts include line number prefixes. At line 161, `chunk.content.trim().split(/\s+/).length` counts words including the prepended line numbers (e.g., "1: ", "2: "), while `totalWordCount` at line 190 uses the original content without line numbers. This creates a slight inconsistency in chunk weighting. Consider whether this affects the accuracy of weighted averaging.
259-260: Consider using newlines or markers when joining chunk summaries/reasonings. Joining multiple chunk summaries with a single space may create hard-to-read text. Consider using `" | "` or `"\n"` as a separator for better readability.

🔎 Suggested change

```diff
- summary: entry.summaries.join(" "),
- reasoning: entry.reasonings.join(" "),
+ summary: entry.summaries.join(" | "),
+ reasoning: entry.reasonings.join(" | "),
```
📜 Review details
📒 Files selected for processing (5)
- src/evaluators/accuracy-evaluator.ts
- src/evaluators/base-evaluator.ts
- src/schemas/prompt-schemas.ts
- src/scoring/scorer.ts
- tests/scoring-types.test.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
- src/evaluators/accuracy-evaluator.ts
- src/schemas/prompt-schemas.ts
- src/evaluators/base-evaluator.ts
- src/scoring/scorer.ts
tests/**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.test.ts: Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Use dependency injection in tests: mock providers; do not hit network in unit tests
Files:
tests/scoring-types.test.ts
🧠 Learnings (5)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries
Applied to files:
- src/evaluators/accuracy-evaluator.ts
- src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to tests/**/*.test.ts : Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Applied to files:
tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Use TypeScript ESM with explicit imports and narrow types
Applied to files:
tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output
Applied to files:
- tests/scoring-types.test.ts
- src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/env-parser.ts : All environment variables must be validated via Zod schemas in `src/boundaries/env-parser.ts`
Applied to files:
src/schemas/prompt-schemas.ts
🧬 Code graph analysis (2)
src/evaluators/accuracy-evaluator.ts (3)
src/scoring/scorer.ts (1)
- calculateSemiObjectiveScore (48-102)

src/evaluators/index.ts (1)
- registerEvaluator (20-20)

src/providers/token-usage.ts (1)
- TokenUsage (1-4)
src/scoring/scorer.ts (1)
src/prompts/schema.ts (4)
- SemiObjectiveItem (145-152)
- SemiObjectiveResult (154-172)
- SubjectiveLLMResult (96-110)
- SubjectiveResult (123-143)
🔇 Additional comments (7)
src/evaluators/accuracy-evaluator.ts (2)
64-76: LGTM! The integration with `calculateSemiObjectiveScore` is correct. The word count fallback to 1 prevents division-by-zero, and the scoring options are properly passed. The spread operator correctly handles the optional `claimUsage`.
241-258: LGTM! The evaluator registration is correctly structured with proper dependency validation for the search provider.
tests/scoring-types.test.ts (1)
98-116: Good: Unique analysis strings prevent deduplication collisions. Using distinct analysis strings ("First issue found", "Second issue found") ensures these violations are not incorrectly deduplicated, addressing a potential issue with the deduplication logic.
src/evaluators/base-evaluator.ts (2)
67-86: LGTM! The `chunkContent` method correctly checks the `chunking` flag from prompt metadata and applies the word count threshold. The single-chunk fallback for disabled/small content is properly structured.
91-104: LGTM! The `aggregateUsage` method correctly filters undefined values with a type predicate and aggregates token counts using reduce.

src/scoring/scorer.ts (2)
9-18: LGTM! The `SemiObjectiveScoringOptions` interface correctly types the strictness union and severity options.
110-146: LGTM! The `calculateSubjectiveScore` function correctly normalizes 1-4 scores to the 1-10 scale and computes weighted averages. The weight lookup by criterion name is straightforward.
- Use composite key for violation deduplication (quoted_text + description + analysis)
- Remove unused chunking options (overlapFraction, preserveSentences)
- Rename total_count to violation_count, remove meaningless passed_count
- Add warning for array length mismatch in averageSubjectiveScores
- Fix Zod schema ordering (.optional().default(true))
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/scoring/scorer.ts (1)
252-258: Deduplication may collapse violations with empty analysis. Line 254 uses `|| ""` as a fallback, meaning all violations with missing or empty `analysis` will share the same deduplication key and collapse to a single violation. While the subjective schema requires the `analysis` field, defensive programming suggests not relying solely on schema enforcement.

🔎 Suggested defensive fix

Option 1: Generate unique keys for empty analysis:

```diff
 // Deduplicate violations
 const seen = new Set<string>();
+let emptyAnalysisCounter = 0;
 const uniqueViolations = entry.violations.filter((v) => {
-  const key = v.analysis?.toLowerCase().trim() || "";
+  const analysisText = v.analysis?.toLowerCase().trim();
+  const key = analysisText || `__empty_${emptyAnalysisCounter++}`;
   if (seen.has(key)) return false;
   seen.add(key);
   return true;
 });
```

Option 2: Use a composite key (analysis + quoted_text):

```diff
 const uniqueViolations = entry.violations.filter((v) => {
-  const key = v.analysis?.toLowerCase().trim() || "";
+  const key = [
+    v.analysis?.toLowerCase().trim() || "",
+    v.quoted_text?.toLowerCase().trim() || "",
+  ].join("|");
   if (seen.has(key)) return false;
   seen.add(key);
   return true;
 });
```
🧹 Nitpick comments (1)
src/evaluators/base-evaluator.ts (1)
67-86: Consider using consistent word counting across modules. Line 68 uses a simple `split(/\s+/)` for word counting, while the chunking utilities use `splitIntoWords`, which filters empty strings. This could lead to minor discrepancies in word counts between the threshold check and the actual chunking logic.

🔎 Suggested refactor for consistency

```diff
+import { countWords } from "../chunking/utils";
+
 protected chunkContent(content: string): Chunk[] {
-  const wordCount = content.trim().split(/\s+/).length || 1;
+  const wordCount = countWords(content) || 1;
   const chunkingEnabled = this.prompt.meta.chunking !== false;
```
📜 Review details
📒 Files selected for processing (8)
- src/chunking/chunker.ts
- src/chunking/merger.ts
- src/chunking/types.ts
- src/evaluators/base-evaluator.ts
- src/prompts/schema.ts
- src/schemas/prompt-schemas.ts
- src/scoring/scorer.ts
- tests/scoring-types.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/schemas/prompt-schemas.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
- src/chunking/chunker.ts
- src/scoring/scorer.ts
- src/prompts/schema.ts
- src/chunking/types.ts
- src/chunking/merger.ts
- src/evaluators/base-evaluator.ts
tests/**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.test.ts: Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Use dependency injection in tests: mock providers; do not hit network in unit tests
Files:
tests/scoring-types.test.ts
🧠 Learnings (3)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to tests/**/*.test.ts : Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Applied to files:
tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Use TypeScript ESM with explicit imports and narrow types
Applied to files:
tests/scoring-types.test.ts
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/**/*.ts : Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Applied to files:
tests/scoring-types.test.ts
🧬 Code graph analysis (3)
src/chunking/chunker.ts (3)
src/chunking/types.ts (3)
- ChunkingOptions (9-11)
- ChunkingStrategy (13-16)
- Chunk (1-7)

src/evaluators/base-evaluator.ts (1)
- chunkContent (67-86)

src/chunking/utils.ts (2)
- countWords (5-7)
- splitIntoWords (1-3)
src/scoring/scorer.ts (2)
src/scoring/index.ts (5)
SemiObjectiveScoringOptions(2-2)SubjectiveScoringOptions(3-3)calculateSemiObjectiveScore(4-4)calculateSubjectiveScore(5-5)averageSubjectiveScores(6-6)src/prompts/schema.ts (4)
SemiObjectiveItem(145-152)SemiObjectiveResult(154-171)SubjectiveLLMResult(96-110)SubjectiveResult(123-143)
src/chunking/merger.ts (2)
src/chunking/index.ts (1)
- mergeViolations (3-3)

src/prompts/schema.ts (1)
- SemiObjectiveItem (145-152)
🔇 Additional comments (9)
src/chunking/merger.ts (1)
3-20: LGTM! Composite key properly addresses deduplication concerns. The implementation correctly uses a composite key combining the `quoted_text`, `description`, and `analysis` fields. This ensures that only violations matching across all three fields are treated as duplicates, preventing the false-positive deduplication issue raised in previous reviews.

src/chunking/types.ts (1)
1-16: LGTM! Clean interface design. The interface definitions are well-structured and minimal. The removal of the unused `overlapFraction` and `preserveSentences` options (mentioned in past reviews) has resulted in a cleaner, more maintainable design.

tests/scoring-types.test.ts (1)
1-205: LGTM! Comprehensive test coverage. The tests properly validate the new scoring logic with:
- Weighted average calculations for subjective evaluation
- Violation-based scoring for semi-objective evaluation
- Edge cases (empty violations, perfect scores)
- Unique analysis strings in test data to avoid unintended deduplication
Based on learnings, tests use Vitest framework appropriately with proper mocking.
src/prompts/schema.ts (1)
158-158: LGTM! Clearer field naming. The change from `passed_count`/`total_count` to `violation_count` significantly improves API clarity and directly addresses the semantic confusion raised in past reviews.
44-103: LGTM! Sound recursive chunking logic. The hierarchical separator approach with recursive refinement is well-designed. The method properly handles:
- Base case for small-enough text
- Fallback to next separator level
- Recursive processing of oversized chunks
- Final fallback to word-based splitting
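The hierarchical fallback described above can be illustrated with a simplified sketch; the separator order and the word-split fallback are assumptions based on this review, and the real `RecursiveChunker` additionally tracks offsets and accumulates parts into sized chunks:

```typescript
// Simplified sketch of hierarchical separator fallback with recursive
// refinement. Each level tries a coarser-to-finer separator; parts that
// are still too large recurse into the next level.
const SEPARATORS = ["\n\n", "\n", " "];

function countWords(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

function recursiveSplit(text: string, maxWords: number, level = 0): string[] {
  if (countWords(text) <= maxWords) return [text];
  if (level >= SEPARATORS.length) {
    // Final fallback: force-split on word boundaries.
    const words = text.split(/\s+/).filter((w) => w.length > 0);
    const out: string[] = [];
    for (let i = 0; i < words.length; i += maxWords) {
      out.push(words.slice(i, i + maxWords).join(" "));
    }
    return out;
  }
  const parts = text
    .split(SEPARATORS[level])
    .filter((p) => p.trim().length > 0);
  // Recursively refine any part that is still too large.
  return parts.flatMap((p) => recursiveSplit(p, maxWords, level + 1));
}
```

The invariant this preserves is that no returned chunk exceeds `maxWords`, while paragraph and line boundaries are respected whenever possible.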
src/evaluators/base-evaluator.ts (1)
113-230: LGTM! Well-structured chunked evaluation flow. The implementation properly handles both single and multiple chunk scenarios:
- Subjective evaluation averages per-chunk scores weighted by chunk word count
- Semi-objective evaluation collects and merges violations, then scores once
- Token usage is correctly aggregated across multiple LLM calls
This aligns perfectly with the PR objectives for chunk-aware evaluation.
src/scoring/scorer.ts (3)
48-101: LGTM! Clear scoring logic with improved field naming. The function properly implements density-based scoring with the new `violation_count` field, which addresses past review concerns about unclear semantics. The formula is well-documented and the severity determination logic is appropriate.
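The review does not reproduce the actual formula, but a density-based scorer in this spirit might look like the following; the severity weights, per-100-words scaling, and clamping are illustrative assumptions only:

```typescript
// Illustrative only: density-based semi-objective scoring. The actual
// formula and strictness handling in src/scoring/scorer.ts are not
// reproduced here.
interface Violation {
  severity: "minor" | "major";
}

function densityScore(
  violations: Violation[],
  totalWords: number,
  strictness = 1 // higher = harsher penalty per unit of density
): { final_score: number; violation_count: number } {
  const weight = (v: Violation) => (v.severity === "major" ? 2 : 1);
  const penalty = violations.reduce((s, v) => s + weight(v), 0);
  // Penalize weighted violations per 100 words, clamped to the 1-10 scale.
  const density = (penalty / Math.max(totalWords, 1)) * 100;
  const score = Math.max(1, Math.min(10, 10 - strictness * density));
  return { final_score: score, violation_count: violations.length };
}
```

The key property shared with the real implementation is that the score depends on violation density relative to total word count, which is why merged chunk violations are scored against the full document's word count.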
109-145: LGTM! Correct weighted scoring calculation. The subjective scoring properly:
- Normalizes 1-4 scores to 1-10 scale using the correct formula
- Applies criterion weights from prompt configuration
- Handles missing weights with sensible default of 1
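The word-count-weighted averaging across chunks that pairs with this can be sketched as follows; the function name and shapes are assumptions based on the review's description of `averageSubjectiveScores`, not its exact signature:

```typescript
// Sketch of word-count-weighted averaging of per-chunk scores.
// Longer chunks contribute proportionally more to the final score.
function averageChunkScores(
  chunkScores: number[],
  chunkWordCounts: number[]
): number {
  if (chunkScores.length !== chunkWordCounts.length) {
    // Mirrors the defensive warning discussed in the review.
    console.warn("chunk score/word-count length mismatch");
  }
  let weightedSum = 0;
  let totalWords = 0;
  for (let i = 0; i < chunkScores.length; i++) {
    const words = chunkWordCounts[i] ?? 0; // mismatched entries get zero weight
    weightedSum += chunkScores[i] * words;
    totalWords += words;
  }
  return totalWords > 0 ? weightedSum / totalWords : 1;
}
```

A 300-word chunk scoring 10 and a 100-word chunk scoring 5 average to 8.75 rather than the unweighted 7.5, which is the point of weighting by chunk size.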
161-166: LGTM! Defensive check properly added. The warning for array length mismatches addresses the past review concern and provides helpful debugging information if the arrays are mismatched due to a programming error.
- Use composite key (quoted_text + analysis) for subjective violation deduplication
- Use countWords utility instead of inline split for word counting
- Export countWords from chunking module
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/evaluators/base-evaluator.ts (2)
166-169: Use the `countWords` utility for consistent word counting. The inline word counting `trim().split(/\s+/).length` differs from the imported `countWords` utility used in `chunkContent`. This inconsistency could cause subtle discrepancies since `countWords` uses `splitIntoWords`, which may handle edge cases (e.g., multiple spaces, empty strings) differently.

🔎 Proposed fix

```diff
- chunkWordCounts.push(
-   stripLineNumbers(chunk.content).trim().split(/\s+/).length
- );
+ chunkWordCounts.push(countWords(stripLineNumbers(chunk.content)));
```
198-198: Use the `countWords` utility for consistent word counting. Same issue as in `runSubjectiveEvaluation`: use the imported `countWords` utility instead of inline splitting for consistency.

🔎 Proposed fix

```diff
- const totalWordCount = content.trim().split(/\s+/).length || 1;
+ const totalWordCount = countWords(content) || 1;
```

src/scoring/scorer.ts (2)
9-18: Consider narrowing the `promptSeverity` type. The `| string` in the `promptSeverity` type weakens type safety. If this accepts prompt frontmatter values, consider using a union of expected string literals (e.g., `"warning" | "error"`) or document why arbitrary strings are accepted.
269-270: Consider using a proper sentence separator when joining summaries/reasonings. Joining with `" "` may produce awkward text if chunks' summaries don't end with punctuation. Consider using `" | "` or `"\n"` as a separator for clearer delineation.

🔎 Proposed fix

Filter empty strings and use a cleaner separator:

```diff
- summary: entry.summaries.join(" "),
- reasoning: entry.reasonings.join(" "),
+ summary: entry.summaries.filter(Boolean).join(" | "),
+ reasoning: entry.reasonings.filter(Boolean).join(" | "),
```
📜 Review details
📒 Files selected for processing (3)
- src/chunking/index.ts
- src/evaluators/base-evaluator.ts
- src/scoring/scorer.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/chunking/index.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
- src/scoring/scorer.ts
- src/evaluators/base-evaluator.ts
🧬 Code graph analysis (2)
src/scoring/scorer.ts (1)
src/prompts/schema.ts (4)
- SemiObjectiveItem (145-152)
- SemiObjectiveResult (154-171)
- SubjectiveLLMResult (96-110)
- SubjectiveResult (123-143)
src/evaluators/base-evaluator.ts (10)
src/chunking/index.ts (4)
- Chunk (1-1)
- countWords (4-4)
- RecursiveChunker (2-2)
- mergeViolations (3-3)

src/chunking/types.ts (1)
- Chunk (1-7)

src/chunking/utils.ts (1)
- countWords (5-7)

src/chunking/chunker.ts (2)
- RecursiveChunker (8-116)
- chunk (13-42)

src/providers/token-usage.ts (1)
- TokenUsage (1-4)

src/prompts/schema.ts (3)
- SubjectiveLLMResult (96-110)
- SubjectiveResult (123-143)
- SemiObjectiveLLMResult (112-121)

src/scoring/scorer.ts (3)
- calculateSubjectiveScore (109-145)
- averageSubjectiveScores (148-282)
- calculateSemiObjectiveScore (48-101)

src/scoring/index.ts (3)
- calculateSubjectiveScore (5-5)
- averageSubjectiveScores (6-6)
- calculateSemiObjectiveScore (4-4)

src/output/line-numbering.ts (1)
- stripLineNumbers (27-32)

src/chunking/merger.ts (1)
- mergeViolations (3-20)
🔇 Additional comments (9)
src/evaluators/base-evaluator.ts (4)
16-30: LGTM! The new imports are well-organized and the chunking constants are reasonable values (600-word threshold, 500-word max chunk size).
72-91: LGTM! The `chunkContent` method correctly uses the `countWords` utility and provides a clean opt-out mechanism via `prompt.meta.chunking`. The defensive `|| 1` handles edge cases well.
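The gating logic being praised here can be sketched as follows; the constants mirror those mentioned in the review (600-word trigger, 500-word max chunk), while the `Chunk` shape and the injected chunker callback are assumptions for a self-contained example:

```typescript
// Sketch of the chunking gate: chunk only when chunking is enabled and
// the content exceeds the threshold; otherwise fall back to one chunk.
interface Chunk {
  content: string;
  startOffset: number;
  endOffset: number;
  index: number;
}

const CHUNKING_THRESHOLD = 600; // words needed to trigger chunking
const MAX_CHUNK_SIZE = 500;     // max words per chunk

function countWords(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

function chunkContent(
  content: string,
  chunkingEnabled: boolean,
  chunk: (text: string, maxWords: number) => Chunk[]
): Chunk[] {
  const wordCount = countWords(content) || 1; // defensive floor
  if (!chunkingEnabled || wordCount <= CHUNKING_THRESHOLD) {
    // Single-chunk fallback for disabled or small content.
    return [{ content, startOffset: 0, endOffset: content.length, index: 0 }];
  }
  return chunk(content, MAX_CHUNK_SIZE);
}
```

Injecting the chunker as a parameter is only for the sketch; in the real class the `RecursiveChunker` is constructed inside the method.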
96-109: LGTM! Clean token aggregation with proper type narrowing via the type guard filter.
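The type-guard-plus-reduce pattern referred to here can be sketched like this; the `TokenUsage` field names are assumptions modeled on common provider usage objects:

```typescript
// Sketch of token-usage aggregation with a type predicate filter.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

function aggregateUsage(
  usages: Array<TokenUsage | undefined>
): TokenUsage | undefined {
  // The type predicate narrows (TokenUsage | undefined)[] to TokenUsage[].
  const defined = usages.filter((u): u is TokenUsage => u !== undefined);
  if (defined.length === 0) return undefined;
  return defined.reduce((acc, u) => ({
    promptTokens: acc.promptTokens + u.promptTokens,
    completionTokens: acc.completionTokens + u.completionTokens,
    totalTokens: acc.totalTokens + u.totalTokens,
  }));
}
```

Returning `undefined` when no chunk reported usage keeps the aggregate honest instead of fabricating a zero-usage record.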
200-227: LGTM! The chunked evaluation flow correctly collects violations per chunk, merges with deduplication via `mergeViolations`, and calculates the final score from the merged set. Token usage aggregation is properly handled.

src/scoring/scorer.ts (5)
26-41: LGTM! Clean strictness resolution with sensible default values.
48-101: LGTM! The scoring logic is correct. Using `violation_count` is clearer than the previous field names mentioned in past reviews.
109-145: LGTM! The normalization formula correctly maps the 1-4 scale to 1-10, and the weighted average calculation is properly implemented.
160-166: LGTM! Good defensive logging for array length mismatches. This addresses the past review concern about silent data loss.
251-261: LGTM! The composite key approach using both `quoted_text` and `analysis` addresses the prior deduplication concern. Violations with both fields empty colliding is an acceptable edge case, since such violations would provide minimal diagnostic value anyway.
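The composite-key scheme being approved here can be sketched as follows; the field names follow the review's discussion, and two violations are considered duplicates only when both `quoted_text` and `analysis` match after case- and whitespace-normalization:

```typescript
// Sketch of composite-key deduplication across chunk results.
interface SubjectiveViolation {
  quoted_text?: string;
  analysis?: string;
}

function dedupeViolations(
  violations: SubjectiveViolation[]
): SubjectiveViolation[] {
  const seen = new Set<string>();
  return violations.filter((v) => {
    // Both fields participate in the key, so two violations collapse
    // only when quoted_text AND analysis both match.
    const key = [
      v.quoted_text?.toLowerCase().trim() ?? "",
      v.analysis?.toLowerCase().trim() ?? "",
    ].join("|");
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

This is why two violations quoting the same text but with different analyses both survive, while re-detections of the same issue across overlapping chunks collapse to one.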
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/evaluators/base-evaluator.ts (2)
29-30: Consider making chunking parameters configurable. The hardcoded constants `CHUNKING_THRESHOLD = 600` and `MAX_CHUNK_SIZE = 500` work for general use, but different evaluation types or content domains might benefit from different thresholds. Consider exposing these via `prompt.meta` (e.g., `prompt.meta.chunkingThreshold`, `prompt.meta.maxChunkSize`) with these values as defaults.
72-91: Consider adding error handling for chunking operations. The `RecursiveChunker` could potentially throw errors during chunking (e.g., invalid input, configuration issues). Wrapping lines 89-90 in a try-catch with a fallback to a single-chunk strategy would improve robustness.
🔎 Suggested error handling
```diff
  const chunker = new RecursiveChunker();
- return chunker.chunk(content, { maxChunkSize: MAX_CHUNK_SIZE });
+ try {
+   return chunker.chunk(content, { maxChunkSize: MAX_CHUNK_SIZE });
+ } catch (e: unknown) {
+   const err = e instanceof Error ? e : new Error(String(e));
+   console.warn(`[vectorlint] Chunking failed: ${err.message}`);
+   // Fallback to single chunk
+   return [
+     {
+       content,
+       startOffset: 0,
+       endOffset: content.length,
+       index: 0,
+     },
+   ];
+ }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/evaluators/accuracy-evaluator.ts
- src/evaluators/base-evaluator.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
- src/evaluators/accuracy-evaluator.ts
- src/evaluators/base-evaluator.ts
🧠 Learnings (1)
📚 Learning: 2025-12-28T19:43:51.176Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.176Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries
Applied to files:
src/evaluators/accuracy-evaluator.ts
🔇 Additional comments (7)
src/evaluators/accuracy-evaluator.ts (3)
13-14: LGTM! Clean integration of scoring module. The no-claims path now correctly uses `calculateSemiObjectiveScore` from the scoring module for consistency with the base evaluator. The `countWords` utility integration aligns with the standardization goal mentioned in the PR objectives.
Also applies to: 65-77
107-115: Usage aggregation is correctly implemented. Token usage from the claim extraction and evaluation phases is properly summed, and the optional nature of `usage` fields is handled correctly.
242-259: Registration pattern is correct. The factory function properly guards against a missing `searchProvider` with a clear error message, and the multiline formatting improves readability.
src/evaluators/base-evaluator.ts (4)
16-27: Chunking integration is well-architected. The approach of prepending line numbers before chunking (lines 124, 195) ensures that line references reported by the LLM remain globally accurate across all chunks. Stripping line numbers before word counting (lines 167, 196) correctly excludes line-number artifacts from the word count.
Also applies to: 72-91
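The strip-then-count pattern can be sketched like this; the exact `N: ` prefix format is an assumption about what the numbering utility emits, not the repository's exact implementation.

```typescript
// Remove "N: " style line-number prefixes that were prepended for LLM
// reporting. The prefix format here is an illustrative assumption.
function stripLineNumbers(text: string): string {
  return text.replace(/^\s*\d+:\s?/gm, "");
}

// Count whitespace-separated words, ignoring empty fragments.
function countWords(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}
```

Counting on the stripped text keeps chunk weights proportional to actual prose length rather than inflating them with numbering artifacts.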
93-109: Token usage aggregation is correctly implemented. The type guard on line 99 properly filters `undefined` values, and the reduce operation correctly sums token counts across multiple LLM calls.
113-177: Subjective evaluation chunking flow is correct. The implementation correctly:
- Handles the single-chunk optimization (lines 129-145)
- Strips line numbers before counting words for chunk weights (line 167)
- Uses `averageSubjectiveScores` with chunk word counts to produce a weighted average
- Aggregates token usage across all chunk evaluations
180-233: Semi-objective evaluation chunking flow is correct. The implementation correctly:
- Collects violations from all chunks (lines 202-211)
- Merges and deduplicates violations via `mergeViolations` (line 214)
- Calculates the final score once using total word count from the original content (line 196)
- Aggregates token usage across all chunk evaluations
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/evaluators/base-evaluator.ts (1)
145-166: Consider extracting chunk word counting logic.Line 165 strips line numbers and counts words for each chunk. This pattern could be extracted to reduce duplication and clarify intent, though the current implementation is functional.
🔎 Optional refactor
```diff
+ protected getActualWordCount(content: string): number {
+   return countWords(stripLineNumbers(content));
+ }

  for (const chunk of chunks) {
    const { data: llmResult, usage } =
      await this.llmProvider.runPromptStructured<SubjectiveLLMResult>(
        chunk.content,
        this.prompt.body,
        schema
      );
    usages.push(usage);
    const result = calculateSubjectiveScore(llmResult.criteria, {
      promptCriteria: this.prompt.meta.criteria,
    });
    chunkResults.push(result);
-   // Strip line numbers before counting to get accurate word count
-   chunkWordCounts.push(countWords(stripLineNumbers(chunk.content)));
+   chunkWordCounts.push(this.getActualWordCount(chunk.content));
  }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/chunking/chunker.ts
- src/chunking/types.ts
- src/evaluators/base-evaluator.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/chunking/types.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
- src/evaluators/base-evaluator.ts
- src/chunking/chunker.ts
🧬 Code graph analysis (1)
src/evaluators/base-evaluator.ts (9)
src/chunking/types.ts (1)
- Chunk (1-5)
src/chunking/index.ts (4)
- Chunk (1-1), RecursiveChunker (2-2), mergeViolations (3-3), countWords (4-4)
src/chunking/utils.ts (1)
- countWords (5-7)
src/chunking/chunker.ts (2)
- RecursiveChunker (8-98), chunk (13-24)
src/providers/token-usage.ts (1)
- TokenUsage (1-4)
src/prompts/schema.ts (3)
- SubjectiveLLMResult (96-110), SemiObjectiveLLMResult (112-121), SubjectiveResult (123-143)
src/scoring/index.ts (3)
- calculateSemiObjectiveScore (4-4), calculateSubjectiveScore (5-5), averageSubjectiveScores (6-6)
src/output/line-numbering.ts (1)
- stripLineNumbers (27-32)
src/chunking/merger.ts (1)
- mergeViolations (3-20)
🔇 Additional comments (4)
src/chunking/chunker.ts (2)
13-24: LGTM! Past review concern addressed. The simplified implementation only tracks `content` and `index` for chunks, avoiding the offset calculation issues flagged in the previous review. The filtering and mapping logic is correct.
26-97: Solid recursive chunking implementation. The hierarchical splitting with automatic fallback to the next separator and recursive refinement for oversized chunks is well-designed. The logic correctly handles edge cases, including parts that exceed `maxSize` (refined recursively at lines 75-78).
src/evaluators/base-evaluator.ts (2)
94-107: Correct usage aggregation. The type guard filtering, empty check, and reduce logic properly aggregate token usage across multiple LLM calls.
186-231: Correct semi-objective evaluation flow. The implementation properly collects violations across chunks, merges with deduplication, and calculates density-based scoring. Line 194 correctly uses the original content word count (not numbered) for accurate density calculation.
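The split-with-fallback strategy reviewed above can be sketched as follows. Word-based sizing and the separator order are assumptions taken from the PR description; the real `RecursiveChunker` returns `Chunk` objects rather than bare strings, and also merges small adjacent parts, which this sketch omits.

```typescript
// Recursively split text on a hierarchy of separators, falling back to
// the next separator whenever a part still exceeds maxWords.
function recursiveChunk(
  text: string,
  maxWords: number,
  separators: string[] = ["\n\n", "\n", "."]
): string[] {
  const words = (s: string) => s.split(/\s+/).filter(Boolean).length;
  // Base case: small enough, or no finer separator left to try.
  if (words(text) <= maxWords || separators.length === 0) {
    return text.trim() ? [text.trim()] : [];
  }
  const [sep, ...rest] = separators;
  const parts = text.split(sep).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  for (const part of parts) {
    if (words(part) > maxWords) {
      // Oversized part: refine with the next separator in the hierarchy.
      chunks.push(...recursiveChunk(part, maxWords, rest));
    } else {
      chunks.push(part.trim());
    }
  }
  return chunks;
}
```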
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/evaluators/base-evaluator.ts (1)
72-89: Word counting on numbered content causes incorrect chunking decisions. This method receives `numberedContent` (with line number prefixes) from both evaluation methods (lines 123, 192). Line 73 counts words on this numbered content, inflating the word count and causing:
- Premature chunking — content below 600 actual words may exceed the threshold due to line numbers
- Incorrect chunk sizes — `RecursiveChunker` bases chunk sizes on inflated word counts (line 88)
For example, 500 actual words might count as 600+ with line numbers, incorrectly triggering chunking.
🔎 Proposed fix: Strip line numbers before word counting
Option 1 (Recommended): Strip line numbers in `chunkContent` before counting
Add a utility to strip line numbers and count words on actual content:
```diff
  protected chunkContent(content: string): Chunk[] {
-   const wordCount = countWords(content) || 1;
+   // Strip line numbers for accurate word counting
+   const actualContent = content.replace(/^\s*\d+[:.)]\s*/gm, '');
+   const wordCount = countWords(actualContent) || 1;
    const chunkingEnabled = this.prompt.meta.evaluateAs !== "document";
    if (!chunkingEnabled || wordCount <= CHUNKING_THRESHOLD) {
      // Chunking disabled or content is small enough - return as single chunk
      return [
        {
          content,
          index: 0,
        },
      ];
    }
    const chunker = new RecursiveChunker();
-   return chunker.chunk(content, { maxChunkSize: MAX_CHUNK_SIZE });
+   return chunker.chunk(actualContent, { maxChunkSize: MAX_CHUNK_SIZE });
  }
```
Note: After stripping, you'll need to restore line numbers to chunk content before returning, or pass the original numbered content to the chunker and only strip for counting. The latter approach preserves line numbers in chunks for LLM reporting.
Option 2: Pass original content for chunking decisions
Refactor callers to pass both original and numbered content, using original for decisions and numbered for LLM processing. This avoids repeated stripping but requires signature changes.
🧹 Nitpick comments (1)
src/schemas/prompt-schemas.ts (1)
49-50: Consider clarifying default behavior in comment. The comment indicates "chunk (default)" but the schema doesn't enforce a default value via `.default()`. The default behavior is implicit in the consuming code (`base-evaluator.ts` line 75: `evaluateAs !== "document"`), which treats `undefined` as enabling chunking. For better clarity, consider either:
- Adding `.default("chunk")` to the schema
- Or updating the comment to:
// Determines how content is evaluated: 'document' for full document, 'chunk' for chunked processing (chunking enabled when unspecified)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- CREATING_RULES.md
- src/evaluators/base-evaluator.ts
- src/schemas/prompt-schemas.ts
✅ Files skipped from review due to trivial changes (1)
- CREATING_RULES.md
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Use custom error types with proper inheritance; catch blocks use `unknown` type
Files:
- src/evaluators/base-evaluator.ts
- src/schemas/prompt-schemas.ts
🧠 Learnings (4)
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/output/**/*.ts : IDs should be shown as `PromptId.CriterionId` in output
Applied to files:
src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/**/*.ts : Maintain strict TypeScript with no `any`; use `unknown` + schema validation for external data
Applied to files:
src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries
Applied to files:
src/schemas/prompt-schemas.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/boundaries/env-parser.ts : All environment variables must be validated via Zod schemas in `src/boundaries/env-parser.ts`
Applied to files:
src/schemas/prompt-schemas.ts
🧬 Code graph analysis (1)
src/evaluators/base-evaluator.ts (7)
src/chunking/types.ts (1)
- Chunk (1-5)
src/chunking/index.ts (4)
- Chunk (1-1), RecursiveChunker (2-2), mergeViolations (3-3), countWords (4-4)
src/chunking/chunker.ts (2)
- RecursiveChunker (8-98), chunk (13-24)
src/providers/token-usage.ts (1)
- TokenUsage (1-4)
src/prompts/schema.ts (3)
- SubjectiveLLMResult (96-110), SemiObjectiveLLMResult (112-121), SubjectiveResult (123-143)
src/scoring/index.ts (3)
- calculateSemiObjectiveScore (4-4), calculateSubjectiveScore (5-5), averageSubjectiveScores (6-6)
src/chunking/merger.ts (1)
- mergeViolations (3-20)
🔇 Additional comments (5)
src/schemas/prompt-schemas.ts (2)
5-13: Good addition of `.strict()` for schema validation. Adding `.strict()` to `TARGET_SPEC_SCHEMA` prevents unknown properties from being accepted, improving validation rigor.
39-46: LGTM! Using `z.enum()` for the `evaluator` and `type` fields provides better type checking and inference compared to string literals. The expanded `strictness` union allows both numeric and named presets, providing flexibility while maintaining type safety.
src/evaluators/base-evaluator.ts (3)
16-30: LGTM! The new imports for chunking and scoring utilities are properly typed, and the chunking constants (600-word threshold, 500-word max chunk size) are reasonable for managing large documents.
91-107: LGTM! The `aggregateUsage` method correctly handles optional `TokenUsage` values with proper type guards and aggregates token counts across multiple LLM calls.
185-230: Semi-objective evaluation flow is well-structured. The evaluation correctly:
- Collects violations from all chunks
- Merges and deduplicates violations (line 211)
- Calculates the final score once using total violations and accurate word count from original content (lines 193, 214-222)
Note: Line 193 correctly uses the original `content` parameter for total word count, avoiding the numbered content issue. However, the chunking decision at line 192 is still affected by the word counting bug in `chunkContent`.
…uators (#39)
- feat: implement content chunking and dedicated scoring logic for evaluators
- refactor: replace `pre` and `post` fields with `quoted_text`, `context_before`, and `context_after`
- feat: add line numbering and per-rule chunking control
  - Add line numbering before chunking for accurate LLM line reporting
  - Add token usage tracking and aggregation across chunks
  - Add 'chunking' frontmatter option to disable chunking per-rule
- refactor: improve chunking and scoring code quality
  - Use composite key for violation deduplication (quoted_text + description + analysis)
  - Remove unused chunking options (overlapFraction, preserveSentences)
  - Rename total_count to violation_count, remove meaningless passed_count
  - Add warning for array length mismatch in averageSubjectiveScores
  - Fix Zod schema ordering (.optional().default(true))
- fix: correct word counting and improve test mocking
- refactor: use consistent deduplication and word counting
  - Use composite key (quoted_text + analysis) for subjective violation deduplication
  - Use countWords utility instead of inline split for word counting
  - Export countWords from chunking module
- refactor: use countWords utility consistently in evaluators
- refactor: remove unused chunk offset calculation
- refactor: make countWords automatically strip line number prefixes
- refactor: replace chunking with evaluateAs property
Summary
This PR introduces automatic content chunking for large documents and extracts scoring logic into a dedicated module, improving evaluation accuracy for long-form content.
Changes
New Chunking Module (`src/chunking/`)
- Recursive chunker that splits on separators (`\n\n`, `\n`, `.`) to preserve semantic boundaries
- Types: `Chunk`, `ChunkingOptions`, and `ChunkingStrategy`

New Scoring Module (`src/scoring/`)
- `calculateSemiObjectiveScore()` for density-based scoring
- `calculateSubjectiveScore()` for weighted criteria scoring
- `averageSubjectiveScores()` for chunk aggregation with word-count weighting

Evaluator Updates
Tests
- `scoring-types.test.ts` updated with unique violation analysis strings to avoid deduplication issues

Version
Bumped to `2.1.1`

Summary by CodeRabbit
New Features
Improvements
Breaking Changes
Tests
Documentation