fix(search): add simple dictionary for multilingual FTS (#42) #146
Code Review
This pull request implements multilingual Full-Text Search (FTS) for observations by combining 'english' and 'simple' PostgreSQL dictionaries. It adds a migration to recreate the search_vector generated column and updates the ObservationStore search methods to query both dictionaries. Feedback was provided regarding an inconsistency in index naming between the migration and its rollback logic.
    || to_tsvector('simple', COALESCE(title, '') || ' ' || COALESCE(subtitle, '') || ' ' || COALESCE(narrative, ''))
) STORED`,
// Recreate GIN index (DROP above removed the old idx_observations_fts).
`CREATE INDEX IF NOT EXISTS idx_observations_search_vector ON observations USING GIN (search_vector)`,
The index name has been changed from idx_observations_fts to idx_observations_search_vector. While this is not a functional bug, maintaining consistent naming across migrations and rollbacks is better for maintainability. The rollback function still refers to the original name idx_observations_fts (line 2238), which could lead to confusion or orphaned indexes if not handled carefully.
Suggested change:
- `CREATE INDEX IF NOT EXISTS idx_observations_search_vector ON observations USING GIN (search_vector)`,
+ `CREATE INDEX IF NOT EXISTS idx_observations_fts ON observations USING GIN (search_vector)`,
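One way to keep the index name from drifting between the migration and its rollback is to define it once and build the CREATE INDEX statement from that single definition. The sketch below is only an illustration: `ftsIndexName`, `createFTSIndexSQL`, and `withIndex` are hypothetical names that do not appear in the PR, and the column DDL for each direction is left as a placeholder because the rollback SQL is not shown in this thread.

```go
package main

import "fmt"

// ftsIndexName is a hypothetical shared constant; the PR uses string literals.
const ftsIndexName = "idx_observations_fts"

// createFTSIndexSQL returns the GIN index statement built from the shared name.
func createFTSIndexSQL() string {
	return fmt.Sprintf(
		"CREATE INDEX IF NOT EXISTS %s ON observations USING GIN (search_vector)",
		ftsIndexName,
	)
}

// withIndex copies a direction's column DDL and appends the shared index
// statement, so migrate and rollback always end with the same index name.
func withIndex(columnDDL ...string) []string {
	out := append([]string{}, columnDDL...)
	return append(out, createFTSIndexSQL())
}

func main() {
	// Placeholder comments stand in for the real column statements.
	migrate := withIndex("-- forward: drop and re-add search_vector")
	rollback := withIndex("-- rollback: drop and re-add the english-only search_vector")
	for _, s := range append(migrate, rollback...) {
		fmt.Println(s)
	}
}
```

Either index name works with this layout; the point is only that both directions read it from one place.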
The 'english' dictionary drops Cyrillic tokens and mishandles product
names like "SocratiCode". Combining it with 'simple' (no stemming, no
stopword removal) ensures all tokens are indexed verbatim, so mixed-
language queries like "socraticode сломался" correctly find memories.
- Migration 075: drops and recreates search_vector generated column
as to_tsvector('english',...) || to_tsvector('simple',...); recreates
GIN index as idx_observations_search_vector.
- SearchObservationsFTS, SearchObservationsFTSFiltered,
SearchObservationsFTSScored: all @@ predicates and ts_rank calls
now use (websearch_to_tsquery('english', ?) || websearch_to_tsquery('simple', ?));
query param passed twice to match the two placeholders.
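The combined generated-column expression described above can be sketched as a small string builder. This is a hedged illustration, not the PR's code: `documentExpr`, `searchVectorExpr`, and `addColumnSQL` are hypothetical helpers, and the exact ALTER TABLE statement used by Migration 075 is only partially visible in the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// documentExpr concatenates text columns into one document, treating NULL
// columns as empty strings, as the diff snippet does.
func documentExpr(cols []string) string {
	parts := make([]string, len(cols))
	for i, c := range cols {
		parts[i] = fmt.Sprintf("COALESCE(%s, '')", c)
	}
	return strings.Join(parts, " || ' ' || ")
}

// searchVectorExpr concatenates the English-stemmed tsvector with the
// unstemmed 'simple' tsvector of the same document.
func searchVectorExpr(cols []string) string {
	doc := documentExpr(cols)
	return fmt.Sprintf("to_tsvector('english', %s) || to_tsvector('simple', %s)", doc, doc)
}

// addColumnSQL is an assumed form of the statement that re-adds the column
// after it has been dropped.
func addColumnSQL(cols []string) string {
	return fmt.Sprintf(
		"ALTER TABLE observations ADD COLUMN search_vector tsvector GENERATED ALWAYS AS (%s) STORED",
		searchVectorExpr(cols),
	)
}

func main() {
	fmt.Println(addColumnSQL([]string{"title", "subtitle", "narrative"}))
}
```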
154473b to a8ca055
Summary
- Full-text search used to_tsvector('english', ...) exclusively, which drops Cyrillic tokens entirely and mishandles product names like "SocratiCode" (stemming corrupts them). Mixed-language queries like "socraticode сломался" returned zero results.
- Recreate search_vector as to_tsvector('english',...) || to_tsvector('simple',...) so the GIN index contains both stemmed English lexemes and verbatim tokens for all languages.
- Match on (websearch_to_tsquery('english', ?) || websearch_to_tsquery('simple', ?)) and rank against the combined tsquery.

Changes
- internal/db/gorm/migrations.go: Migration 075_observations_fts_multilang:
  - Drops and recreates the search_vector generated column (PostgreSQL cannot ALTER a generated column in-place).
  - Combines the english and simple dictionaries via ||.
  - Recreates the GIN index as idx_observations_search_vector.
  - Rollback uses the original index name, idx_observations_fts.
- internal/db/gorm/observation_store.go: three functions updated:
  - SearchObservationsFTS: $1/$2 = query×2, $3 = project, $4 = limit
  - SearchObservationsFTSFiltered: $1/$2 = query×2, $3 = project, $4 = agentID, $5 = limit
  - SearchObservationsFTSScored: $1/$2 = query×2, $3 = project, $4 = limit

Test plan
- go build ./... passes (verified locally)
- go test ./internal/db/gorm/... -count=1 passes (verified locally)
- Run SELECT to_tsvector('english','SocratiCode') || to_tsvector('simple','SocratiCode') and confirm both socraticode (simple) and socraticode (english) lexemes appear
- Searching for "socraticode сломался" returns memories containing either word
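The parameter layout listed for SearchObservationsFTS ($1/$2 = query×2, $3 = project, $4 = limit) can be sketched roughly as follows. This is an assumption-laden illustration rather than the store's actual query: `buildSearch` is a hypothetical helper, and the `id` and `project` column names and the ordering are guesses. Only `observations`, `search_vector`, and the combined tsquery expression come from the PR text.

```go
package main

import "fmt"

// combinedTSQuery ORs the English-stemmed and the unstemmed 'simple' reading
// of the same user query; $1 and $2 both receive the query text.
const combinedTSQuery = "(websearch_to_tsquery('english', $1) || websearch_to_tsquery('simple', $2))"

// buildSearch returns the SQL text and positional arguments for a
// project-scoped search. Numbered placeholders let the combined tsquery be
// reused in both the @@ predicate and ts_rank without extra arguments.
func buildSearch(query, project string, limit int) (string, []any) {
	sql := "SELECT id, ts_rank(search_vector, " + combinedTSQuery + ") AS rank" +
		" FROM observations" +
		" WHERE project = $3 AND search_vector @@ " + combinedTSQuery +
		" ORDER BY rank DESC LIMIT $4"
	// The query string is passed twice, once per dictionary.
	return sql, []any{query, query, project, limit}
}

func main() {
	sql, args := buildSearch("socraticode сломался", "demo", 10)
	fmt.Println(sql)
	fmt.Println(len(args), "arguments")
}
```

With `?`-style placeholders, as in the commit message, each occurrence of the tsquery expression needs its own pair of arguments, so the argument list has to grow with every reuse.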