fix: sort diff matches by similarity before picking overall suggestion #26
Merged
KeywordSearch orders candidates by token overlap score, which can diverge from the Jaccard-based Similarity computed in Diff. A candidate that contains all of new's tokens (keyword score = 1.0) but has many additional tokens (Jaccard ≈ 0.36 → ADD) would rank first, masking a later candidate with lower keyword score but higher Jaccard (→ UPDATE or DUPLICATE). Fix: sort matches by Similarity descending after all candidates are scored, so matches[0] is always the most similar candidate and drives the overall suggestion correctly. Adds regression test TestDiff_LowerKeywordScoreUpdateNotMasked that fails on the pre-fix code and passes after.
Member
LGTM. This addresses the Jaccard follow-up cleanly, and the regression test covers the masking case we discussed.
Follow-up to #25 based on author review feedback.
Diff picks its overall suggestion from matches[0], but matches is ordered by KeywordSearch token overlap score — not by the final Jaccard Similarity. A candidate with high keyword score (all of new's tokens present) but low Jaccard (many additional tokens, classified as ADD) would rank first, silently masking a lower-keyword-score candidate with higher Jaccard that should have been UPDATE or DUPLICATE.
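To make the divergence concrete, here is a minimal sketch of the two scores on whitespace-separated tokens. The functions `tokenSet`, `keywordOverlap`, and `jaccard` and the sample strings are illustrative assumptions, not the actual KeywordSearch or Diff code, whose tokenization and scoring may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet splits s on whitespace and returns its lowercase tokens as a set.
// (Assumed tokenization; the real code may tokenize differently.)
func tokenSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, t := range strings.Fields(strings.ToLower(s)) {
		set[t] = struct{}{}
	}
	return set
}

// intersectionSize counts the tokens present in both sets.
func intersectionSize(a, b map[string]struct{}) int {
	n := 0
	for t := range a {
		if _, ok := b[t]; ok {
			n++
		}
	}
	return n
}

// keywordOverlap is the fraction of query tokens that appear in the candidate.
// Extra tokens in the candidate do not lower this score.
func keywordOverlap(query, candidate string) float64 {
	q, c := tokenSet(query), tokenSet(candidate)
	if len(q) == 0 {
		return 0
	}
	return float64(intersectionSize(q, c)) / float64(len(q))
}

// jaccard is |A ∩ B| / |A ∪ B|, so extra tokens on either side lower it.
func jaccard(a, b string) float64 {
	x, y := tokenSet(a), tokenSet(b)
	inter := intersectionSize(x, y)
	union := len(x) + len(y) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	query := "retry failed uploads automatically"
	// Contains every query token plus seven extra ones (11 tokens total).
	long := "retry failed uploads automatically with backoff and log every attempt made"
	// Shares three of four tokens with the query and adds only one.
	closeMatch := "retry failed uploads silently"

	fmt.Printf("long:  keyword=%.2f jaccard=%.2f\n", keywordOverlap(query, long), jaccard(query, long))
	fmt.Printf("close: keyword=%.2f jaccard=%.2f\n", keywordOverlap(query, closeMatch), jaccard(query, closeMatch))
	// long:  keyword=1.00 jaccard=0.36
	// close: keyword=0.75 jaccard=0.60
}
```

The long candidate wins on keyword score (1.00 vs 0.75) but loses on Jaccard (4/11 vs 3/5), which is the ordering mismatch described above.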
Fix: sort matches by Similarity descending after all candidates are scored, so matches[0] is always the most similar candidate and drives the overall suggestion correctly.
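The reordering can be sketched as follows. The `Match` struct, the `classify` thresholds, the `overallSuggestion` helper, and the empty-input default are hypothetical stand-ins chosen for illustration; only the `sort.Slice` call by `Similarity` descending reflects what this PR describes.

```go
package main

import (
	"fmt"
	"sort"
)

// Match is a hypothetical stand-in for a scored candidate in Diff.
type Match struct {
	ID         string
	Similarity float64 // Jaccard similarity to the new entry
	Suggestion string  // per-candidate classification
}

// classify maps a similarity to a suggestion.
// The thresholds are illustrative assumptions, not the project's values.
func classify(sim float64) string {
	switch {
	case sim >= 0.9:
		return "DUPLICATE"
	case sim >= 0.5:
		return "UPDATE"
	default:
		return "ADD"
	}
}

// overallSuggestion sorts matches by Similarity descending (in place) and
// returns the suggestion of the most similar candidate.
// Returning "ADD" for an empty slice is an assumption made for this sketch.
func overallSuggestion(matches []Match) string {
	if len(matches) == 0 {
		return "ADD"
	}
	sort.Slice(matches, func(i, j int) bool {
		return matches[i].Similarity > matches[j].Similarity
	})
	return matches[0].Suggestion
}

func main() {
	// Candidates arrive in keyword-score order: the long, low-Jaccard one first.
	matches := []Match{
		{ID: "long", Similarity: 0.36, Suggestion: classify(0.36)},
		{ID: "close", Similarity: 0.60, Suggestion: classify(0.60)},
	}

	fmt.Println("before sort:", matches[0].ID, matches[0].Suggestion)
	fmt.Println("overall:", overallSuggestion(matches))
	fmt.Println("after sort:", matches[0].ID, matches[0].Suggestion)
}
```

Without the sort, reading `matches[0]` directly would yield ADD from the long candidate; with it, the closer candidate moves to the front and the overall suggestion becomes UPDATE.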
Adds regression test TestDiff_LowerKeywordScoreUpdateNotMasked that fails on the pre-fix code and passes after.
Changes
internal/search/diff.go — sort.Slice by Similarity descending before the overall-suggestion block
internal/search/diff_test.go — TestDiff_LowerKeywordScoreUpdateNotMasked regression test (fails before fix, passes after)
CHANGELOG.md — [Unreleased] entry