Commit 3e0dd25
fix: normalize queryWords with same punctuation/split pipeline as tokenize
`queryWords` was built by splitting the raw lowercased query on whitespace
only, skipping the punctuation stripping and the hyphen/underscore/slash
splitting that `tokenize()` applies. This caused dirty tokens like
`pattern?` or `whats` to form bigrams that never matched against clean
document text, silently disabling the partial phrase boost for punctuated
queries. Apply the same replace/split pipeline as `tokenize()` (minus
stemming) so bigram matching works correctly.

1 parent 44aa004 · commit 3e0dd25
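The actual `tokenize()` and `queryWords` code is not captured in this extract, so the following is a hedged sketch of the pipeline the commit message describes; the function names and exact regexes are assumptions, not the repository's code.

```javascript
// Sketch of the shared normalization the commit message describes:
// lowercase, split hyphen/underscore/slash into separate words, strip
// remaining punctuation, then split on whitespace. Stemming is omitted,
// matching the "minus stemming" note for queryWords.
function normalizeWords(text) {
  return text
    .toLowerCase()
    .replace(/[-_/]/g, " ")       // treat -, _, / as word separators
    .replace(/[^a-z0-9\s]/g, "")  // strip punctuation such as "?" and "'"
    .split(/\s+/)
    .filter(Boolean);
}

// Bigrams built from the cleaned words can now match bigrams built from
// cleaned document text, re-enabling the partial phrase boost.
function bigrams(words) {
  const out = [];
  for (let i = 0; i < words.length - 1; i++) {
    out.push(words[i] + " " + words[i + 1]);
  }
  return out;
}

// Before the fix: a raw whitespace split keeps dirty tokens intact.
"what's the observer-pattern?".toLowerCase().split(/\s+/);
// → ["what's", "the", "observer-pattern?"]  (never matches clean doc tokens)

// After the fix:
normalizeWords("what's the observer-pattern?");
// → ["whats", "the", "observer", "pattern"]
bigrams(normalizeWords("what's the observer-pattern?"));
// → ["whats the", "the observer", "observer pattern"]
```

Keeping one normalization function shared between indexing and query handling is what prevents this class of bug: any divergence between the two pipelines makes exact-token comparisons silently fail.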
2 files changed
Lines changed: 2 additions & 2 deletions
First file: line 84 replaced (1 deletion, 1 addition); the diff content and filename were not captured in this extract.
Second file: line 105 replaced (1 deletion, 1 addition); the diff content and filename were not captured in this extract.