feat: improve Issue Location Accuracy with Line Numbering and Fuzzy Matching #46

Merged
oshorefueled merged 6 commits into main from feat/issue-location on Dec 27, 2025

hurshore (Collaborator) commented Dec 27, 2025

Summary

This PR significantly improves the accuracy of issue location detection in VectorLint by implementing a multi-layered approach: line number hints from the LLM, quoted text matching, and robust fuzzy matching as fallback.

Changes

  1. Line Numbering for Content Analysis

    • Prepends line numbers to input content before sending to the LLM (format: 123\ttext)
    • Adds line field to violation schema so the LLM can report which line the issue appears on
    • Uses LLM-provided line numbers as hints for faster, more accurate location resolution
    • Falls back to fuzzy matching when line hints don't resolve
  2. Fuzzy Text Matching for LLM Output

    • Adds fuzzball dependency for fuzzy string matching
    • Implements multi-phase location strategy:
      • Phase 1: Exact matching (fastest)
      • Phase 2: Progressive substring matching (handles LLM adding/removing words)
      • Phase 3: Case-insensitive exact matching
      • Phase 4: Fuzzy line-by-line matching (handles typos)
      • Phase 5: Sliding window fuzzy matching (handles multi-line quotes)
    • Returns confidence scores and strategy used for each match
    • Adds tests for fuzzy matching
  3. Improved Issue Location Using Quoted Text

    • Replaces pre/post anchor fields with more reliable quoted_text, context_before, context_after
    • Updates LLM prompt directives to emphasize verbatim quoting:
      • Instructions to COPY-PASTE exact phrases (5-50 chars)
      • Critical rules against fabricating quotes
      • Requirement to verify quotes exist in input before reporting
    • Moves reasoning field to appear first in schema (encouraging LLM to think before answering)
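
The line-hint idea from item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in `src/output/location.ts`: the function name `locateWithLineHint` and the `HintedLocation` shape are hypothetical, and only exact matching is shown.

```typescript
// Hypothetical result shape for this sketch; the PR's LocationWithMatch differs.
interface HintedLocation {
  line: number; // 1-based
  column: number; // 1-based
  usedHint: boolean;
}

function locateWithLineHint(
  content: string,
  quotedText: string,
  lineHint?: number
): HintedLocation | null {
  if (quotedText.length === 0) return null;
  const lines = content.split("\n");

  // Fast path: look only at the line the LLM pointed to.
  if (lineHint !== undefined && lineHint >= 1 && lineHint <= lines.length) {
    const col = (lines[lineHint - 1] ?? "").indexOf(quotedText);
    if (col !== -1) {
      return { line: lineHint, column: col + 1, usedHint: true };
    }
  }

  // Fallback: scan every line for the first exact occurrence.
  for (let i = 0; i < lines.length; i++) {
    const col = (lines[i] ?? "").indexOf(quotedText);
    if (col !== -1) {
      return { line: i + 1, column: col + 1, usedHint: false };
    }
  }
  return null;
}
```

When the hint is wrong or missing, the sketch degrades to a plain scan, which is where the fuzzy phases from item 2 would take over in a fuller implementation.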

Files Changed

| File | Description |
| --- | --- |
| [src/output/line-numbering.ts](src/output/line-numbering.ts) | [NEW] Line numbering utilities |
| [src/output/location.ts](src/output/location.ts) | Multi-phase fuzzy matching implementation |
| [src/prompts/schema.ts](src/prompts/schema.ts) | Updated schema with quoted_text, line, and context fields |
| [src/prompts/directive-loader.ts](src/prompts/directive-loader.ts) | Enhanced LLM instructions for accurate quoting |
| [src/evaluators/base-evaluator.ts](src/evaluators/base-evaluator.ts) | Integration with line numbering |
| [src/cli/orchestrator.ts](src/cli/orchestrator.ts) | Line number prepending before LLM calls |
| [src/cli/types.ts](src/cli/types.ts) | Type updates for new fields |
| [tests/fuzzy-matching.test.ts](tests/fuzzy-matching.test.ts) | [NEW] Fuzzy matching tests |
| package.json | Added fuzzball dependency |

Why This Matters

LLMs often paraphrase, truncate, or slightly modify quotes when reporting issues. This makes locating the exact issue in the original text challenging. This PR addresses this by:

  • Giving the LLM explicit line numbers to reference
  • Accepting fuzzy matches when exact matching fails
  • Providing confidence scores so downstream consumers know match quality

Testing

  • Added comprehensive tests for fuzzy matching covering exact, case-insensitive, substring, and fuzzy strategies
  • Tests verify confidence scores and match strategies are correctly reported

Summary by CodeRabbit

  • New Features

    • Multi‑strategy quote verification with fuzzy matching and standardized report fields (line, quoted_text, context_before/context_after)
    • Optional verbose logging to surface warnings and diagnostics
  • Bug Fixes

    • Line-numbered content handling and de-duplication to avoid duplicate or unverifiable reports
    • Graceful handling and reporting of unverifiable quotes
  • Chores

    • Added fuzzball dependency for similarity scoring
  • Tests

    • New tests for exact, case-insensitive, substring, fuzzy, and no-match quote location scenarios

hurshore requested a review from ayo6706 on December 27, 2025 17:49

coderabbitai bot commented Dec 27, 2025

Walkthrough

Overhauls evidence location from pre/post context to quoted-text fuzzy matching with multiple fallbacks, adds line-numbering utilities, updates violation shapes to use quoted_text and contextual fields, propagates an optional verbose flag through evaluation flows, updates prompts/schema, and adds tests and a fuzzball dependency.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Dependency Management | package.json | Added fuzzball ^2.2.3 dependency for fuzzy string matching |
| Type System Updates | src/cli/types.ts, src/prompts/schema.ts | Replaced pre/post with quoted_text, context_before, context_after in violation shapes; added optional verbose?: boolean to evaluation/context types; updated LLM and evaluation result types |
| Evidence Location Refactor | src/output/location.ts | Replaced pre/post locating with multi-strategy quoted-text pipeline (exact → context → substring → case-insensitive → fuzzy-line → fuzzy-window); added QuotedTextEvidence, enriched LocationWithMatch (match, confidence, strategy), locateQuotedText, locateMultipleQuotes; removed legacy locate/extract functions |
| Line Numbering Utilities | src/output/line-numbering.ts | New helpers: prependLineNumbers, stripLineNumbers, getLineContent, getLineStartIndex for deterministic line handling |
| Orchestration & Reporting | src/cli/orchestrator.ts | Now uses locateQuotedText, verifies matches, deduplicates violations by quoted_text+line, reports only verified unique violations, propagates verbose flag, and logs unverifiable quotes when verbose |
| Evaluator Base Changes | src/evaluators/base-evaluator.ts | Prepend line numbers to content before LLM calls; mapping updated to new violation fields (quoted_text, context_before, context_after); minor signature/formatting changes |
| Prompt Directives | src/prompts/directive-loader.ts | DEFAULT_DIRECTIVE rewritten to require reported fields line, quoted_text, context_before, context_after, analysis, suggestion; various string quoting standardized |
| Prompt Schema Updates | src/prompts/schema.ts | Schemas and result types updated to the new violation shape and required fields |
| Tests | tests/fuzzy-matching.test.ts | New test suite for locateQuotedText covering exact, context, case-insensitive, substring, fuzzy-line/window matching and no-match cases |

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI Orchestrator
  participant Locator as locateQuotedText
  participant Evaluator as BaseEvaluator / LLM
  participant Reporter as reportIssue

  CLI->>Locator: request locate(quoted_text, context_before, context_after, content)
  alt Found
    Locator-->>CLI: { line, column, match, confidence, strategy }
  else Not found
    Locator-->>CLI: null
  end

  CLI->>CLI: filter verified matches, dedupe by (quoted_text,line)
  CLI->>Evaluator: send numberedContent + verified violations
  Evaluator-->>CLI: evaluation result (scores, messages, updated violations)

  CLI->>Reporter: reportIssue for each unique verified violation
  Reporter-->>CLI: ACK

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 Fuzzy whiskers trace the quoted line,
Numbers march tidy, context in a row—
Verified quotes hop up, duplicates decline,
Reports skip the shadows, confidence aglow.
nibbles carrot, cheerful thump

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 68.42% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately summarizes the main change: improving issue location accuracy through line numbering and fuzzy matching, which is the core objective of this substantial PR. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a5addc and 3e946ae.

📒 Files selected for processing (4)
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/output/location.ts
  • tests/fuzzy-matching.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fuzzy-matching.test.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Use strict TypeScript with no any types; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks should use unknown type

Files:

  • src/cli/types.ts
  • src/cli/orchestrator.ts
  • src/output/location.ts
🧠 Learnings (1)
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Applies to src/**/*.ts : Use custom error types with proper inheritance; catch blocks should use `unknown` type

Applied to files:

  • src/cli/orchestrator.ts
🧬 Code graph analysis (1)
src/cli/orchestrator.ts (2)
src/output/location.ts (1)
  • locateQuotedText (150-342)
src/errors/index.ts (1)
  • handleUnknownError (46-51)
🔇 Additional comments (6)
src/cli/types.ts (1)

59-59: LGTM! Type updates align with the new quoted-text location strategy.

The addition of verbose to EvaluationContext and the new violation fields (line, quoted_text, context_before, context_after) properly support the PR's shift from pre/post anchors to fuzzy-matched quoted text with contextual evidence.

Also applies to: 78-84

src/output/location.ts (2)

135-342: Excellent multi-phase fuzzy matching implementation.

The six-phase algorithm is well-structured and documented:

  1. Line hint optimization for LLM-provided line numbers
  2. Exact matching with context-based disambiguation
  3. Progressive substring matching for partial text
  4. Case-insensitive fallback
  5. Fast fuzzy line-by-line matching
  6. Slower sliding-window matching for multi-line quotes

The phase ordering optimizes for performance (fast paths first) while maintaining robust fallbacks. Confidence scoring is appropriate for each strategy.


1-1: The fuzzball library is already configured at the latest stable version (2.2.3) with no known security vulnerabilities. No action needed.

src/cli/orchestrator.ts (3)

162-222: LGTM! Verification-first approach with proper deduplication.

The refactored logic correctly:

  • Calls locateQuotedText with the new quoted-text fields (lines 179-188)
  • Skips unverifiable quotes and logs warnings when verbose (lines 190-199)
  • Handles empty quoted_text in deduplication by creating a key only when quoted_text exists (lines 206-211), addressing the past review concern
  • Logs location errors when verbose (lines 217-219), addressing another past review concern
  • Collects only verified violations before reporting (lines 214, 224-245)

This ensures that only violations with successfully located evidence are reported to users.
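
A rough sketch of the deduplication rule described above, for readers following along. The `Violation` shape and the function name `dedupeViolations` are assumptions made for illustration; the orchestrator's actual code may be structured differently.

```typescript
// Hypothetical, reduced violation shape for this sketch.
interface Violation {
  line?: number;
  quoted_text?: string;
  analysis?: string;
}

function dedupeViolations(violations: Violation[]): Violation[] {
  const seen = new Set<string>();
  const unique: Violation[] = [];
  for (const v of violations) {
    // Build a key only when quoted_text is non-empty. Otherwise every
    // violation without a quote would collapse onto the same key.
    if (v.quoted_text) {
      const key = JSON.stringify([v.quoted_text, v.line ?? null]);
      if (seen.has(key)) continue;
      seen.add(key);
    }
    unique.push(v);
  }
  return unique;
}
```

Using `JSON.stringify` on a tuple avoids ambiguous keys that a plain string join could produce when the quote itself contains the separator.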


157-157: Verbose flag properly propagated through evaluation workflows.

The verbose parameter is correctly threaded through the entire evaluation pipeline:

  • Added to function parameters (locateAndReportViolations, extractAndReportCriterion, routePromptResult)
  • Extracted from options in evaluateFile (line 737)
  • Consistently passed to downstream functions
  • Used to gate debug logging (e.g., lines 192-196, 217-219)

This enables conditional verbose output throughout the evaluation flow without cluttering normal output.

Also applies to: 268-268, 414-414, 548-556, 581-581, 645-645, 737-737, 835-835


399-406: Violation type correctly updated to use quoted-text fields.

The type assertion properly reflects the new violation shape with line, quoted_text, context_before, and context_after fields, replacing the legacy pre/post anchors. This aligns with the broader refactor to fuzzy-matched quoted text.


coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/cli/orchestrator.ts (2)

460-466: Update type cast to match new schema structure.

The inline type definition still references pre? and post? fields from the old evidence location approach. However, the function now uses quoted_text, context_before, and context_after fields (as seen in the locateQuotedText call at lines 236-245).

🔧 Proposed fix
     const violationResult = locateAndReportViolations({
       violations: violations as Array<{
-        pre?: string;
-        post?: string;
+        quoted_text?: string;
+        context_before?: string;
+        context_after?: string;
+        line?: number;
         analysis?: string;
         suggestion?: string;
       }>,

Alternatively, define a proper type in src/cli/types.ts and import it to avoid inline type casts.
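
One way the named-type alternative could look, combined with a runtime guard so that no cast is needed. The names `ViolationInput` and `isViolationInput` are hypothetical and do not come from the PR; the field list mirrors the proposed fix above.

```typescript
// Hypothetical shared type; field names follow the proposed fix above.
interface ViolationInput {
  line?: number;
  quoted_text?: string;
  context_before?: string;
  context_after?: string;
  analysis?: string;
  suggestion?: string;
}

function isOptionalString(value: unknown): boolean {
  return value === undefined || typeof value === "string";
}

// Type guard: narrows unknown LLM output to ViolationInput without `as`.
function isViolationInput(value: unknown): value is ViolationInput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    (v["line"] === undefined || typeof v["line"] === "number") &&
    isOptionalString(v["quoted_text"]) &&
    isOptionalString(v["context_before"]) &&
    isOptionalString(v["context_after"]) &&
    isOptionalString(v["analysis"]) &&
    isOptionalString(v["suggestion"])
  );
}
```

A call site could then filter with `violations.filter(isViolationInput)` instead of asserting the array type inline.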


695-705: Add missing verbose parameter.

The signature of extractAndReportCriterion (lines 317-331) includes a verbose parameter, which the function uses internally (line 474). This call site does not pass verbose, so verbose logging will not work for these violations.

🔧 Proposed fix
     const criterionResult = extractAndReportCriterion({
       exp,
       result,
       content,
       relFile,
       promptId,
       promptFilename: promptFile.filename,
       meta,
       outputFormat,
       jsonFormatter,
+      verbose,
     });
🧹 Nitpick comments (2)
tests/fuzzy-matching.test.ts (1)

24-33: Verify expected strategy for context disambiguation.

Per the locateQuotedText function documentation, when context is used to disambiguate multiple exact matches, the strategy should be "context", not "exact". However, in this test case there's only one occurrence of "quick brown fox" in the text, so context disambiguation isn't needed and "exact" is correct.

Consider adding a test with actual duplicate text to properly test the context disambiguation path returning "context" strategy.

src/cli/orchestrator.ts (1)

144-194: Consider simplifying or removing this legacy logic.

The extractMatchText function attempts to extract quoted text from the analysis message using regex patterns, which was useful in the old pre/post approach. However, with the new quoted-text-based strategy, violations already have an explicit quoted_text field, and locateQuotedText returns the actual matched text. This additional extraction and refinement step may be redundant and could cause confusion.

💡 Consider simplifying to just return the location from locateQuotedText

Since locateQuotedText already returns line, column, and match with confidence scoring, you might be able to simplify or remove this function entirely. If quote refinement from the analysis message is still needed for specific cases, document why and add tests for those scenarios.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efb0218 and c3f759b.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (9)
  • package.json
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/evaluators/base-evaluator.ts
  • src/output/line-numbering.ts
  • src/output/location.ts
  • src/prompts/directive-loader.ts
  • src/prompts/schema.ts
  • tests/fuzzy-matching.test.ts
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Use strict TypeScript with no any types; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks should use unknown type

Files:

  • src/cli/types.ts
  • src/output/line-numbering.ts
  • src/prompts/directive-loader.ts
  • src/evaluators/base-evaluator.ts
  • src/output/location.ts
  • src/prompts/schema.ts
  • src/cli/orchestrator.ts
tests/**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.test.ts: Use Vitest for testing framework; locate tests under tests/ with *.test.ts naming
Use dependency injection in tests: mock providers and do not hit network in unit tests

Files:

  • tests/fuzzy-matching.test.ts
🧠 Learnings (4)
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Applies to tests/**/*.test.ts : Use Vitest for testing framework; locate tests under `tests/` with `*.test.ts` naming

Applied to files:

  • tests/fuzzy-matching.test.ts
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Applies to src/providers/**/*.ts : Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)

Applied to files:

  • src/evaluators/base-evaluator.ts
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Separate concerns: evals define rubric; schemas enforce structure; CLI orchestrates; evaluators process; reporters format

Applied to files:

  • src/evaluators/base-evaluator.ts
  • src/prompts/schema.ts
📚 Learning: 2025-12-27T17:28:16.346Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T17:28:16.346Z
Learning: Add providers by implementing `LLMProvider` or `SearchProvider` interfaces; add evaluators via registry pattern

Applied to files:

  • src/evaluators/base-evaluator.ts
🧬 Code graph analysis (3)
tests/fuzzy-matching.test.ts (1)
src/output/location.ts (1)
  • locateQuotedText (148-345)
src/evaluators/base-evaluator.ts (3)
src/prompts/schema.ts (6)
  • buildSubjectiveLLMSchema (3-55)
  • SubjectiveLLMResult (95-109)
  • SemiObjectiveResult (152-169)
  • buildSemiObjectiveLLMSchema (57-93)
  • SemiObjectiveLLMResult (111-120)
  • SemiObjectiveItem (143-150)
src/output/line-numbering.ts (1)
  • prependLineNumbers (13-18)
src/evaluators/evaluator-registry.ts (1)
  • registerEvaluator (65-67)
src/cli/orchestrator.ts (2)
src/output/location.ts (1)
  • locateQuotedText (148-345)
src/errors/index.ts (1)
  • handleUnknownError (46-51)
🪛 GitHub Actions: Lint
src/output/location.ts

[error] 172-172: 'lineStartIdx' is assigned a value but never used @typescript-eslint/no-unused-vars

🪛 GitHub Check: ESLint
tests/fuzzy-matching.test.ts

[failure] 4-4:
Variable name originalText must match one of the following formats: UPPER_CASE

src/output/location.ts

[failure] 172-172:
'lineStartIdx' is assigned a value but never used

🔇 Additional comments (32)
src/cli/types.ts (2)

59-59: LGTM!

The optional verbose flag is appropriately added to EvaluationContext for controlling logging behavior in the evaluation workflow.


91-97: LGTM!

The violation structure is updated to use the new quoted-text-based approach with line, quoted_text, context_before, and context_after. The fields are appropriately optional to handle cases where the LLM might not provide all evidence.

tests/fuzzy-matching.test.ts (2)

58-73: LGTM!

Fuzzy matching tests appropriately verify that imperfect quotes (missing words, reordered words) are still matched with reasonable confidence. The minimum confidence threshold of 80 aligns with the default minConfidence parameter.


76-90: LGTM!

Edge cases are well covered: unrelated text returns null, and empty quoted_text also returns null as expected.

src/output/line-numbering.ts (4)

13-18: LGTM!

Clean implementation using split/map/join pattern. The 1-based line numbering with tab separator is clear and deterministic.


27-32: LGTM!

The regex ^\d+\t correctly strips the line number prefix added by prependLineNumbers.
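
For reference, a minimal sketch of the two helpers as described in these comments (split/map/join with a 1-based number and a tab, then a `^\d+\t` strip). It is reconstructed from the review text and may not match `src/output/line-numbering.ts` exactly.

```typescript
// Prefix every line with its 1-based number and a tab: "1\tfirst line".
function prependLineNumbers(content: string): string {
  return content
    .split("\n")
    .map((line, i) => `${i + 1}\t${line}`)
    .join("\n");
}

// Remove a leading "<digits><tab>" prefix from every line.
function stripLineNumbers(numbered: string): string {
  return numbered
    .split("\n")
    .map((line) => line.replace(/^\d+\t/, ""))
    .join("\n");
}
```

Because the regex is anchored and not global, only the first prefix on each line is removed, so a source line that itself begins with digits and a tab survives a prepend/strip round trip.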


41-47: LGTM!

Proper bounds checking with 1-based indexing. The defensive || "" on line 46 handles potential edge cases.


56-63: LGTM!

Correctly accumulates character indices accounting for newline characters. The loop bounds are safe with the i < lines.length check.
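
The two lookups discussed above might look roughly like this. It is a sketch under the assumption that lines are separated by a single `\n`; the real helpers may handle other cases.

```typescript
// Return the text of a 1-based line, or "" when the line is out of range.
function getLineContent(content: string, lineNumber: number): string {
  const lines = content.split("\n");
  if (lineNumber < 1 || lineNumber > lines.length) return "";
  return lines[lineNumber - 1] ?? "";
}

// Return the 0-based character index at which a 1-based line starts.
function getLineStartIndex(content: string, lineNumber: number): number {
  const lines = content.split("\n");
  let index = 0;
  for (let i = 0; i < lineNumber - 1 && i < lines.length; i++) {
    index += (lines[i] ?? "").length + 1; // +1 accounts for the "\n"
  }
  return index;
}
```

For the content `"ab\ncde\nf"`, line 2 starts at index 3 and line 3 at index 7, since each preceding line contributes its length plus one newline.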

src/prompts/directive-loader.ts (2)

11-35: LGTM!

The updated directive is well-structured with clear instructions for the LLM:

  • Explicit format requirements for line, quoted_text, and context fields
  • Strong anti-hallucination rules (CRITICAL RULES 2-5)
  • Requirement for step-by-step reasoning before reporting
  • Clear guidance on verbatim copy-paste and verification

This aligns well with the new quoted-text-based location strategy.


37-49: LGTM!

The override loading logic is unchanged; only formatting adjustments.

src/evaluators/base-evaluator.ts (4)

16-16: LGTM!

Correctly imports the new line-numbering utility.


67-76: LGTM!

Line numbers are prepended before sending content to the LLM, enabling deterministic line reporting. The numbered content is correctly passed to runPromptStructured while preserving the original content reference for other calculations.


125-134: LGTM!

Consistent with the subjective path—line numbers are prepended before LLM evaluation for semi-objective mode as well.


152-159: LGTM!

The violation mapping correctly uses the new field names (quoted_text, context_before, context_after) with conditional spreading to handle optional values.
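
The conditional-spread pattern mentioned here can be illustrated as below. `RawViolation`, `MappedViolation`, and `mapViolation` are hypothetical names, and the exact field set is assumed from the PR description.

```typescript
// Hypothetical input and output shapes for this sketch.
interface RawViolation {
  line?: number;
  quoted_text?: string;
  context_before?: string;
  context_after?: string;
  analysis: string;
}

interface MappedViolation {
  analysis: string;
  line?: number;
  quoted_text?: string;
  context_before?: string;
  context_after?: string;
}

// Copy optional fields only when present, so absent values do not
// appear as explicit `undefined` properties on the result.
function mapViolation(raw: RawViolation): MappedViolation {
  return {
    analysis: raw.analysis,
    ...(raw.line !== undefined ? { line: raw.line } : {}),
    ...(raw.quoted_text !== undefined ? { quoted_text: raw.quoted_text } : {}),
    ...(raw.context_before !== undefined ? { context_before: raw.context_before } : {}),
    ...(raw.context_after !== undefined ? { context_after: raw.context_after } : {}),
  };
}
```

Omitting the keys entirely keeps serialized output compact and plays well with `exactOptionalPropertyTypes`, if that compiler option is enabled.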

src/prompts/schema.ts (5)

17-48: LGTM!

The subjective schema is well-designed:

  • reasoning moved to the start of the schema to encourage step-by-step thinking before scoring (chain-of-thought pattern)
  • line is optional (not in required) since LLM-provided line numbers are hints, not guarantees
  • quoted_text, context_before, context_after are required to ensure the fuzzy matching system has evidence to work with
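
A schema shaped along these lines might look as follows. This is an assumed JSON Schema fragment written as a TypeScript constant, not a copy of `buildSubjectiveLLMSchema`; in particular, placing `reasoning` at the top level and marking `analysis` and `suggestion` as required are guesses.

```typescript
// Assumed violation item schema: `line` is a hint and therefore optional.
const violationSchema = {
  type: "object",
  properties: {
    line: { type: "number" },
    quoted_text: { type: "string" },
    context_before: { type: "string" },
    context_after: { type: "string" },
    analysis: { type: "string" },
    suggestion: { type: "string" },
  },
  required: ["quoted_text", "context_before", "context_after", "analysis", "suggestion"],
} as const;

// Assumed result schema: `reasoning` is declared first so the model
// writes its reasoning before it lists violations.
const resultSchema = {
  type: "object",
  properties: {
    reasoning: { type: "string" },
    violations: { type: "array", items: violationSchema },
  },
  required: ["reasoning", "violations"],
} as const;
```

JavaScript objects keep insertion order for string keys, so serializing `resultSchema` preserves the reasoning-first ordering.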

65-91: LGTM!

The semi-objective schema mirrors the subjective structure with required quoted-text evidence fields. This ensures consistent violation data across both evaluation paths.


101-119: LGTM!

Type definitions correctly reflect the updated violation structure for both SubjectiveLLMResult and SemiObjectiveLLMResult.


133-149: LGTM!

The runtime result types (SubjectiveResult, SemiObjectiveItem) are updated to match the new violation shape with quoted-text fields.


161-168: LGTM!

SemiObjectiveResult.violations correctly includes the optional criterionName field for downstream reporting.

src/output/location.ts (8)

1-1: LGTM!

Correctly imports the necessary fuzzy matching functions from fuzzball.


3-32: LGTM!

Well-defined interfaces:

  • QuotedTextEvidence captures the LLM-provided evidence
  • LocationWithMatch extends location with match metadata and strategy for debugging/confidence reporting
  • FuzzyMatch is appropriately scoped as internal

52-89: LGTM!

findBestLineMatch efficiently scores each line using multiple fuzzball strategies (partial_ratio, token_sort_ratio, ratio) and takes the maximum. Skipping empty lines is a sensible optimization.
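
To make the line-by-line idea concrete without depending on fuzzball, the sketch below substitutes a plain Levenshtein-based similarity (0 to 100) for fuzzball's scorers. The names `similarity` and `findBestLineSketch` are hypothetical, and the scores will differ from what `partial_ratio` or `token_sort_ratio` would return.

```typescript
// Stand-in scorer: 100 * (1 - editDistance / longerLength), rounded.
function similarity(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 100;
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        (prev[j] ?? 0) + 1, // deletion
        (curr[j - 1] ?? 0) + 1, // insertion
        (prev[j - 1] ?? 0) + cost // substitution
      );
    }
    prev = curr;
  }
  const distance = prev[b.length] ?? 0;
  return Math.round((1 - distance / Math.max(a.length, b.length)) * 100);
}

interface LineMatch {
  line: number; // 1-based
  text: string;
  score: number;
}

// Score every non-empty line and keep the best one at or above minScore.
function findBestLineSketch(content: string, quote: string, minScore = 80): LineMatch | null {
  const lines = content.split("\n");
  let best: LineMatch | null = null;
  for (let i = 0; i < lines.length; i++) {
    const text = (lines[i] ?? "").trim();
    if (text.length === 0) continue; // skip empty lines
    const score = similarity(text.toLowerCase(), quote.toLowerCase());
    if (score >= minScore && (best === null || score > best.score)) {
      best = { line: i + 1, text, score };
    }
  }
  return best;
}
```

A quote with a single typo, such as "The quick brwn fox" against the line "The quick brown fox", clears the default threshold of 80, while unrelated text returns `null`.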


95-131: LGTM!

The sliding window approach with 50%-150% size variation and step-by-5 granularity provides a good balance between accuracy and performance for multi-line quote matching.


219-278: LGTM!

Phase 2 exact matching is well-implemented:

  • Collects all exact matches
  • Single match returns immediately with 100% confidence
  • Multiple matches use context for disambiguation
  • Falls back to first match if context doesn't help
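
The exact-match phase with context disambiguation could be sketched like this. All names are hypothetical, and the confidence value used for the context and fallback cases is an assumption; only the single-match value of 100 comes from the comment above.

```typescript
interface ExactMatch {
  index: number; // 0-based offset into content
  strategy: "exact" | "context";
  confidence: number;
}

// Collect the start offsets of every occurrence of needle.
function findAllIndices(content: string, needle: string): number[] {
  const out: number[] = [];
  if (needle.length === 0) return out;
  for (let idx = content.indexOf(needle); idx !== -1; idx = content.indexOf(needle, idx + 1)) {
    out.push(idx);
  }
  return out;
}

function locateExact(
  content: string,
  quoted: string,
  contextBefore = "",
  contextAfter = ""
): ExactMatch | null {
  const indices = findAllIndices(content, quoted);
  const first = indices[0];
  if (first === undefined) return null;
  if (indices.length === 1) return { index: first, strategy: "exact", confidence: 100 };

  const before = contextBefore.trim();
  const after = contextAfter.trim();
  for (const idx of indices) {
    const left = content.slice(0, idx).trimEnd();
    const right = content.slice(idx + quoted.length).trimStart();
    const beforeOk = before !== "" && left.endsWith(before);
    const afterOk = after !== "" && right.startsWith(after);
    // Assumed confidence: the quote itself still matched verbatim.
    if (beforeOk || afterOk) return { index: idx, strategy: "context", confidence: 100 };
  }
  // Context did not disambiguate: fall back to the first occurrence.
  return { index: first, strategy: "exact", confidence: 100 };
}
```

With the text "the cat sat. later the cat ran." and the quote "the cat", passing "ran" as the trailing context selects the second occurrence and reports the "context" strategy.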

280-300: LGTM!

Progressive substring matching is a clever approach to handle LLM word additions/removals. Starting from words.length - 1 down to 3 finds the longest valid substring first, and confidence scales proportionally.
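
A compact sketch of the progressive substring idea follows. The function name, the `minWords` parameter, and the proportional confidence formula are assumptions made for illustration.

```typescript
interface SubstringMatch {
  index: number; // 0-based offset into content
  match: string;
  confidence: number;
}

// Try ever-shorter runs of consecutive words from the quote, longest first,
// and return as soon as one run appears verbatim in the content.
function locateSubstring(content: string, quoted: string, minWords = 3): SubstringMatch | null {
  const words = quoted.trim().split(/\s+/);
  for (let size = words.length - 1; size >= minWords; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      const candidate = words.slice(start, start + size).join(" ");
      const index = content.indexOf(candidate);
      if (index !== -1) {
        // Assumed scoring: share of the quote's words that matched.
        const confidence = Math.round((size / words.length) * 100);
        return { index, match: candidate, confidence };
      }
    }
  }
  return null;
}
```

Returning from inside the nested loops guarantees that a shorter run can never overwrite a longer one found earlier. Quotes with fewer than `minWords + 1` words never enter the loop and yield `null`.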


302-341: LGTM!

Phases 4-6 provide robust fallbacks:

  • Case-insensitive at 95% confidence (reasonable penalty for case mismatch)
  • Line-based fuzzy matching for fast handling of typos
  • Window-based fuzzy matching as the last resort for multi-line issues
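
The case-insensitive fallback is the simplest of these phases. A sketch, with a hypothetical function name and the 95 confidence value taken from the comment above:

```typescript
interface CaseInsensitiveMatch {
  index: number; // 0-based offset into content
  match: string; // text as it appears in the original content
  confidence: number;
  strategy: "case-insensitive";
}

function locateCaseInsensitive(content: string, quoted: string): CaseInsensitiveMatch | null {
  if (quoted.length === 0) return null;
  const index = content.toLowerCase().indexOf(quoted.toLowerCase());
  if (index === -1) return null;
  return {
    index,
    // Slice from the original so the reported match keeps its casing.
    match: content.slice(index, index + quoted.length),
    confidence: 95,
    strategy: "case-insensitive",
  };
}
```

One caveat of this sketch: `toLowerCase()` can change string length for a few Unicode characters, in which case the offsets into the original content may drift.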

351-357: LGTM!

Simple and clean batch processing helper that applies locateQuotedText to each evidence item.

package.json (1)

61-61: LGTM!

The fuzzball dependency is correctly added to runtime dependencies and is a well-established library for fuzzy string matching. It's properly imported and used in src/output/location.ts for fuzzy matching functionality. The caret version specifier allows compatible updates.

src/cli/orchestrator.ts (4)

21-21: LGTM!

The import of locateQuotedText correctly supports the new quoted-text-based location strategy.


608-616: LGTM!

The verbose parameter is correctly destructured and propagated to locateAndReportViolations, enabling conditional logging throughout the violation reporting workflow.

Also applies to: 641-641


796-796: LGTM!

The verbose flag is correctly added to the function parameters and properly propagated to routePromptResult, ensuring consistent verbose behavior across the evaluation pipeline.

Also applies to: 894-894


219-256: Solid implementation of verification and deduplication workflow.

The new approach correctly:

  • Attempts to locate each violation using locateQuotedText with fuzzy matching
  • Skips unverifiable quotes with optional verbose warnings
  • Tracks operational errors without blocking other violations
  • Separates verification from reporting for cleaner flow

The multi-phase location strategy (exact, substring, fuzzy) with confidence thresholds is a robust improvement over the old pre/post approach.

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/output/location.ts (1)

275-295: Consider making the minimum word count configurable.

Phase 3's progressive substring matching requires at least 3 words (line 278), which means 1-2 word quotes skip this phase entirely. While this might prevent false positives, it could also miss legitimate short-quote matches that have minor edits.

This is likely intentional, but consider documenting this threshold or making it configurable if short quotes are common in your use case.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3f759b and 4a5addc.

📒 Files selected for processing (2)
  • src/output/location.ts
  • tests/fuzzy-matching.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fuzzy-matching.test.ts
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Use strict TypeScript with no any types; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks should use unknown type

Files:

  • src/output/location.ts
🔇 Additional comments (5)
src/output/location.ts (5)

1-1: LGTM: Fuzzy matching imports are appropriate.

The fuzzball library imports are correctly structured and all three functions are utilized in the multi-phase matching strategy.


3-32: Well-structured type definitions.

The interfaces clearly define the quoted-text evidence model and location results with rich metadata (confidence, strategy). The 1-based indexing is properly documented.


34-46: Line/column computation is correct.

The function properly converts absolute indices to 1-based line/column positions. The use of charCodeAt(i) === 10 correctly identifies newline characters.
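
The index-to-position conversion described here can be sketched as below. The function name is hypothetical; the newline test via `charCodeAt(i) === 10` follows the comment above.

```typescript
// Convert a 0-based character index into 1-based line and column numbers.
function indexToLineColumn(content: string, index: number): { line: number; column: number } {
  const end = Math.min(Math.max(index, 0), content.length);
  let line = 1;
  let lastNewline = -1; // offset of the most recent "\n" before `end`
  for (let i = 0; i < end; i++) {
    if (content.charCodeAt(i) === 10) {
      line++;
      lastNewline = i;
    }
  }
  return { line, column: end - lastNewline };
}
```

For `"ab\ncde"`, index 3 (the "c") maps to line 2, column 1, and index 5 (the "e") maps to line 2, column 3.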


148-340: Well-architected multi-phase matching strategy.

The six-phase approach provides excellent fallback coverage, moving from fast exact matches to progressively more expensive fuzzy strategies. The confidence scoring and strategy labeling enable downstream consumers to make informed decisions about match quality.

The phase ordering (line hint → exact → context → substring → case-insensitive → fuzzy) is logical and well-documented.


346-352: LGTM: Batch processing helper is clean and correct.

The function appropriately maps over the evidences array, and the return type correctly represents that some quotes might not be located (null values).

- Update violation type cast to use new schema fields (line, quoted_text,
  context_before, context_after) instead of legacy pre/post fields
- Add missing verbose parameter to extractAndReportCriterion call
- Fix deduplication to skip when quoted_text is empty to avoid false collisions
- Log caught errors during evidence location when verbose is enabled
- Add test case with duplicate text for context disambiguation
- Remove legacy extractMatchText function - use locateQuotedText results directly
- Remove unused ExtractMatchTextParams and LocationMatch types
- Fix index/match mismatch in findBestLineMatch and findBestWindowMatch
  where trimmed match text didn't align with stored index position
- Add leadingWhitespace offset to ensure columns point to actual content
  start rather than leading whitespace
- Fix nested loop bug in lineHint fuzzy matching that could overwrite
  longer substring matches with shorter ones using labeled break
oshorefueled merged commit 1287051 into main on Dec 27, 2025
3 checks passed
oshorefueled pushed a commit that referenced this pull request Mar 2, 2026
…atching (#46)

* feat: improve issue location using quoted text

* feat: implement fuzzy text matching for LLM output

* feat: implement line numbering for content analysis

* fix: resolve eslint errors

* refactor: improve violation processing and remove legacy code

- Update violation type cast to use new schema fields (line, quoted_text,
  context_before, context_after) instead of legacy pre/post fields
- Add missing verbose parameter to extractAndReportCriterion call
- Fix deduplication to skip when quoted_text is empty to avoid false collisions
- Log caught errors during evidence location when verbose is enabled
- Add test case with duplicate text for context disambiguation
- Remove legacy extractMatchText function - use locateQuotedText results directly
- Remove unused ExtractMatchTextParams and LocationMatch types

* fix: correct column alignment in fuzzy matching functions

- Fix index/match mismatch in findBestLineMatch and findBestWindowMatch
  where trimmed match text didn't align with stored index position
- Add leadingWhitespace offset to ensure columns point to actual content
  start rather than leading whitespace
- Fix nested loop bug in lineHint fuzzy matching that could overwrite
  longer substring matches with shorter ones using labeled break