feat: improve Issue Location Accuracy with Line Numbering and Fuzzy Matching#46
feat: improve Issue Location Accuracy with Line Numbering and Fuzzy Matching#46oshorefueled merged 6 commits intomainfrom
Conversation
📝 WalkthroughWalkthroughOverhauls evidence location from pre/post context to quoted-text fuzzy matching with multiple fallbacks, adds line-numbering utilities, updates violation shapes to use Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI Orchestrator
participant Locator as locateQuotedText
participant Evaluator as BaseEvaluator / LLM
participant Reporter as reportIssue
CLI->>Locator: request locate(quoted_text, context_before, context_after, content)
alt Found
Locator-->>CLI: { line, column, match, confidence, strategy }
else Not found
Locator-->>CLI: null
end
CLI->>CLI: filter verified matches, dedupe by (quoted_text,line)
CLI->>Evaluator: send numberedContent + verified violations
Evaluator-->>CLI: evaluation result (scores, messages, updated violations)
CLI->>Reporter: reportIssue for each unique verified violation
Reporter-->>CLI: ACK
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used📓 Path-based instructions (1)src/**/*.ts📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (1)📚 Learning: 2025-12-27T17:28:16.346ZApplied to files:
🧬 Code graph analysis (1)src/cli/orchestrator.ts (2)
🔇 Additional comments (6)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cli/orchestrator.ts (2)
460-466: Update type cast to match new schema structure.The inline type definition still references
pre?andpost?fields from the old evidence location approach. However, the function now usesquoted_text,context_before, andcontext_afterfields (as seen in thelocateQuotedTextcall at lines 236-245).🔧 Proposed fix
const violationResult = locateAndReportViolations({ violations: violations as Array<{ - pre?: string; - post?: string; + quoted_text?: string; + context_before?: string; + context_after?: string; + line?: number; analysis?: string; suggestion?: string; }>,Alternatively, define a proper type in
src/cli/types.tsand import it to avoid inline type casts.
695-705: Add missing verbose parameter.The function signature of
extractAndReportCriterion(line 317-331) includes averboseparameter, and it's used internally (line 474). However, this call site doesn't pass theverboseparameter, which means verbose logging won't work for these violations.🔧 Proposed fix
const criterionResult = extractAndReportCriterion({ exp, result, content, relFile, promptId, promptFilename: promptFile.filename, meta, outputFormat, jsonFormatter, + verbose, });
🧹 Nitpick comments (2)
tests/fuzzy-matching.test.ts (1)
24-33: Verify expected strategy for context disambiguation.Per the
locateQuotedTextfunction documentation, when context is used to disambiguate multiple exact matches, the strategy should be"context", not"exact". However, in this test case there's only one occurrence of "quick brown fox" in the text, so context disambiguation isn't needed and"exact"is correct.Consider adding a test with actual duplicate text to properly test the context disambiguation path returning
"context"strategy.src/cli/orchestrator.ts (1)
144-194: Consider simplifying or removing this legacy logic.The
extractMatchTextfunction attempts to extract quoted text from the analysis message using regex patterns, which was useful in the old pre/post approach. However, with the new quoted-text-based strategy, violations already have an explicitquoted_textfield, andlocateQuotedTextreturns the actual matched text. This additional extraction and refinement step may be redundant and could cause confusion.💡 Consider simplifying to just return the location from locateQuotedText
Since
locateQuotedTextalready returns line, column, and match with confidence scoring, you might be able to simplify or remove this function entirely. If quote refinement from the analysis message is still needed for specific cases, document why and add tests for those scenarios.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (9)
package.jsonsrc/cli/orchestrator.tssrc/cli/types.tssrc/evaluators/base-evaluator.tssrc/output/line-numbering.tssrc/output/location.tssrc/prompts/directive-loader.tssrc/prompts/schema.tstests/fuzzy-matching.test.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Use strict TypeScript with noanytypes; useunknown+ schema validation for external data
Use custom error types with proper inheritance; catch blocks should useunknowntype
Files:
src/cli/types.tssrc/output/line-numbering.tssrc/prompts/directive-loader.tssrc/evaluators/base-evaluator.tssrc/output/location.tssrc/prompts/schema.tssrc/cli/orchestrator.ts
tests/**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.test.ts: Use Vitest for testing framework; locate tests undertests/with*.test.tsnaming
Use dependency injection in tests: mock providers and do not hit network in unit tests
Files:
tests/fuzzy-matching.test.ts
🧠 Learnings (4)
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Applies to tests/**/*.test.ts : Use Vitest for testing framework; locate tests under `tests/` with `*.test.ts` naming
Applied to files:
tests/fuzzy-matching.test.ts
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Applies to src/providers/**/*.ts : Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)
Applied to files:
src/evaluators/base-evaluator.ts
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Separate concerns: evals define rubric; schemas enforce structure; CLI orchestrates; evaluators process; reporters format
Applied to files:
src/evaluators/base-evaluator.tssrc/prompts/schema.ts
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Add providers by implementing `LLMProvider` or `SearchProvider` interfaces; add evaluators via registry pattern
Applied to files:
src/evaluators/base-evaluator.ts
🧬 Code graph analysis (3)
tests/fuzzy-matching.test.ts (1)
src/output/location.ts (1)
locateQuotedText(148-345)
src/evaluators/base-evaluator.ts (3)
src/prompts/schema.ts (6)
buildSubjectiveLLMSchema(3-55)SubjectiveLLMResult(95-109)SemiObjectiveResult(152-169)buildSemiObjectiveLLMSchema(57-93)SemiObjectiveLLMResult(111-120)SemiObjectiveItem(143-150)src/output/line-numbering.ts (1)
prependLineNumbers(13-18)src/evaluators/evaluator-registry.ts (1)
registerEvaluator(65-67)
src/cli/orchestrator.ts (2)
src/output/location.ts (1)
locateQuotedText(148-345)src/errors/index.ts (1)
handleUnknownError(46-51)
🪛 GitHub Actions: Lint
src/output/location.ts
[error] 172-172: 'lineStartIdx' is assigned a value but never used @typescript-eslint/no-unused-vars
🪛 GitHub Check: ESLint
tests/fuzzy-matching.test.ts
[failure] 4-4:
Variable name originalText must match one of the following formats: UPPER_CASE
src/output/location.ts
[failure] 172-172:
'lineStartIdx' is assigned a value but never used
🔇 Additional comments (32)
src/cli/types.ts (2)
59-59: LGTM!The optional
verboseflag is appropriately added toEvaluationContextfor controlling logging behavior in the evaluation workflow.
91-97: LGTM!The violation structure is updated to use the new quoted-text-based approach with
line,quoted_text,context_before, andcontext_after. The fields are appropriately optional to handle cases where the LLM might not provide all evidence.tests/fuzzy-matching.test.ts (2)
58-73: LGTM!Fuzzy matching tests appropriately verify that imperfect quotes (missing words, reordered words) are still matched with reasonable confidence. The minimum confidence threshold of 80 aligns with the default
minConfidenceparameter.
76-90: LGTM!Edge cases are well covered: unrelated text returns null, and empty
quoted_textalso returns null as expected.src/output/line-numbering.ts (4)
13-18: LGTM!Clean implementation using split/map/join pattern. The 1-based line numbering with tab separator is clear and deterministic.
27-32: LGTM!The regex
^\d+\tcorrectly strips the line number prefix added byprependLineNumbers.
41-47: LGTM!Proper bounds checking with 1-based indexing. The defensive
|| ""on line 46 handles potential edge cases.
56-63: LGTM!Correctly accumulates character indices accounting for newline characters. The loop bounds are safe with the
i < lines.lengthcheck.src/prompts/directive-loader.ts (2)
11-35: LGTM!The updated directive is well-structured with clear instructions for the LLM:
- Explicit format requirements for
line,quoted_text, and context fields- Strong anti-hallucination rules (CRITICAL RULES 2-5)
- Requirement for step-by-step reasoning before reporting
- Clear guidance on verbatim copy-paste and verification
This aligns well with the new quoted-text-based location strategy.
37-49: LGTM!The override loading logic is unchanged; only formatting adjustments.
src/evaluators/base-evaluator.ts (4)
16-16: LGTM!Correctly imports the new line-numbering utility.
67-76: LGTM!Line numbers are prepended before sending content to the LLM, enabling deterministic line reporting. The numbered content is correctly passed to
runPromptStructuredwhile preserving the original content reference for other calculations.
125-134: LGTM!Consistent with the subjective path—line numbers are prepended before LLM evaluation for semi-objective mode as well.
152-159: LGTM!The violation mapping correctly uses the new field names (
quoted_text,context_before,context_after) with conditional spreading to handle optional values.src/prompts/schema.ts (5)
17-48: LGTM!The subjective schema is well-designed:
reasoningmoved to the start of the schema to encourage step-by-step thinking before scoring (chain-of-thought pattern)lineis optional (not inrequired) since LLM-provided line numbers are hints, not guaranteesquoted_text,context_before,context_afterare required to ensure the fuzzy matching system has evidence to work with
65-91: LGTM!The semi-objective schema mirrors the subjective structure with required quoted-text evidence fields. This ensures consistent violation data across both evaluation paths.
101-119: LGTM!Type definitions correctly reflect the updated violation structure for both
SubjectiveLLMResultandSemiObjectiveLLMResult.
133-149: LGTM!The runtime result types (
SubjectiveResult,SemiObjectiveItem) are updated to match the new violation shape with quoted-text fields.
161-168: LGTM!
SemiObjectiveResult.violationscorrectly includes the optionalcriterionNamefield for downstream reporting.src/output/location.ts (8)
1-1: LGTM!Correctly imports the necessary fuzzy matching functions from fuzzball.
3-32: LGTM!Well-defined interfaces:
QuotedTextEvidencecaptures the LLM-provided evidenceLocationWithMatchextends location with match metadata and strategy for debugging/confidence reportingFuzzyMatchis appropriately scoped as internal
52-89: LGTM!
findBestLineMatchefficiently scores each line using multiple fuzzball strategies (partial_ratio,token_sort_ratio,ratio) and takes the maximum. Skipping empty lines is a sensible optimization.
95-131: LGTM!The sliding window approach with 50%-150% size variation and step-by-5 granularity provides a good balance between accuracy and performance for multi-line quote matching.
219-278: LGTM!Phase 2 exact matching is well-implemented:
- Collects all exact matches
- Single match returns immediately with 100% confidence
- Multiple matches use context for disambiguation
- Falls back to first match if context doesn't help
280-300: LGTM!Progressive substring matching is a clever approach to handle LLM word additions/removals. Starting from
words.length - 1down to 3 finds the longest valid substring first, and confidence scales proportionally.
302-341: LGTM!Phases 4-6 provide robust fallbacks:
- Case-insensitive at 95% confidence (reasonable penalty for case mismatch)
- Line-based fuzzy matching for fast handling of typos
- Window-based fuzzy matching as the last resort for multi-line issues
351-357: LGTM!Simple and clean batch processing helper that applies
locateQuotedTextto each evidence item.package.json (1)
61-61: LGTM!The
fuzzballdependency is correctly added to runtime dependencies and is a well-established library for fuzzy string matching. It's properly imported and used insrc/output/location.tsfor fuzzy matching functionality. The caret version specifier allows compatible updates.src/cli/orchestrator.ts (4)
21-21: LGTM!The import of
locateQuotedTextcorrectly supports the new quoted-text-based location strategy.
608-616: LGTM!The
verboseparameter is correctly destructured and propagated tolocateAndReportViolations, enabling conditional logging throughout the violation reporting workflow.Also applies to: 641-641
796-796: LGTM!The
verboseflag is correctly added to the function parameters and properly propagated toroutePromptResult, ensuring consistent verbose behavior across the evaluation pipeline.Also applies to: 894-894
219-256: Solid implementation of verification and deduplication workflow.The new approach correctly:
- Attempts to locate each violation using
locateQuotedTextwith fuzzy matching- Skips unverifiable quotes with optional verbose warnings
- Tracks operational errors without blocking other violations
- Separates verification from reporting for cleaner flow
The multi-phase location strategy (exact, substring, fuzzy) with confidence thresholds is a robust improvement over the old pre/post approach.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/output/location.ts (1)
275-295: Consider making the minimum word count configurable.Phase 3's progressive substring matching requires at least 3 words (line 278), which means 1-2 word quotes skip this phase entirely. While this might prevent false positives, it could also miss legitimate short-quote matches that have minor edits.
This is likely intentional, but consider documenting this threshold or making it configurable if short quotes are common in your use case.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/output/location.tstests/fuzzy-matching.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/fuzzy-matching.test.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Use strict TypeScript with noanytypes; useunknown+ schema validation for external data
Use custom error types with proper inheritance; catch blocks should useunknowntype
Files:
src/output/location.ts
🔇 Additional comments (5)
src/output/location.ts (5)
1-1: LGTM: Fuzzy matching imports are appropriate.The fuzzball library imports are correctly structured and all three functions are utilized in the multi-phase matching strategy.
3-32: Well-structured type definitions.The interfaces clearly define the quoted-text evidence model and location results with rich metadata (confidence, strategy). The 1-based indexing is properly documented.
34-46: Line/column computation is correct.The function properly converts absolute indices to 1-based line/column positions. The use of
charCodeAt(i) === 10correctly identifies newline characters.
148-340: Well-architected multi-phase matching strategy.The six-phase approach provides excellent fallback coverage, moving from fast exact matches to progressively more expensive fuzzy strategies. The confidence scoring and strategy labeling enable downstream consumers to make informed decisions about match quality.
The phase ordering (line hint → exact → context → substring → case-insensitive → fuzzy) is logical and well-documented.
346-352: LGTM: Batch processing helper is clean and correct.The function appropriately maps over the evidences array, and the return type correctly represents that some quotes might not be located (null values).
- Update violation type cast to use new schema fields (line, quoted_text, context_before, context_after) instead of legacy pre/post fields - Add missing verbose parameter to extractAndReportCriterion call - Fix deduplication to skip when quoted_text is empty to avoid false collisions - Log caught errors during evidence location when verbose is enabled - Add test case with duplicate text for context disambiguation - Remove legacy extractMatchText function - use locateQuotedText results directly - Remove unused ExtractMatchTextParams and LocationMatch types
- Fix index/match mismatch in findBestLineMatch and findBestWindowMatch where trimmed match text didn't align with stored index position - Add leadingWhitespace offset to ensure columns point to actual content start rather than leading whitespace - Fix nested loop bug in lineHint fuzzy matching that could overwrite longer substring matches with shorter ones using labeled break
…atching (#46) * feat: improve issue location using quoted text * feat: implement fuzzy text matching for LLM output * feat: implement line numbering for content analysis * fix: resolve eslint errors * refactor: improve violation processing and remove legacy code - Update violation type cast to use new schema fields (line, quoted_text, context_before, context_after) instead of legacy pre/post fields - Add missing verbose parameter to extractAndReportCriterion call - Fix deduplication to skip when quoted_text is empty to avoid false collisions - Log caught errors during evidence location when verbose is enabled - Add test case with duplicate text for context disambiguation - Remove legacy extractMatchText function - use locateQuotedText results directly - Remove unused ExtractMatchTextParams and LocationMatch types * fix: correct column alignment in fuzzy matching functions - Fix index/match mismatch in findBestLineMatch and findBestWindowMatch where trimmed match text didn't align with stored index position - Add leadingWhitespace offset to ensure columns point to actual content start rather than leading whitespace - Fix nested loop bug in lineHint fuzzy matching that could overwrite longer substring matches with shorter ones using labeled break
Summary
This PR significantly improves the accuracy of issue location detection in VectorLint by implementing a multi-layered approach: line number hints from the LLM, quoted text matching, and robust fuzzy matching as fallback.
Changes
Line Numbering for Content Analysis
123\ttext)linefield to violation schema so the LLM can report which line the issue appears onFuzzy Text Matching for LLM Output
fuzzballdependency for fuzzy string matchingImproved Issue Location Using Quoted Text
pre/postanchor fields with more reliablequoted_text,context_before,context_afterreasoningfield to appear first in schema (encouraging LLM to think before answering)Files Changed
quoted_text,line, and context fieldspackage.jsonfuzzballdependencyWhy This Matters
LLMs often paraphrase, truncate, or slightly modify quotes when reporting issues. This makes locating the exact issue in the original text challenging. This PR addresses this by:
Testing
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.