
fix(search): add simple dictionary for multilingual FTS (#42) #146

Merged
thebtf merged 1 commit into main from fix/fts-multilang
Apr 12, 2026
@thebtf thebtf commented Apr 12, 2026

Summary

  • FTS used to_tsvector('english', ...) exclusively, which drops Cyrillic tokens entirely and mishandles product names like "SocratiCode" (stemming corrupts them). Mixed-language queries like "socraticode сломался" returned zero results.
  • Fix: generate search_vector as to_tsvector('english',...) || to_tsvector('simple',...) so the GIN index contains both stemmed English lexemes and verbatim tokens for all languages.
  • All three FTS query functions updated to match with (websearch_to_tsquery('english', ?) || websearch_to_tsquery('simple', ?)) and rank against the combined tsquery.

Changes

internal/db/gorm/migrations.go — Migration 075_observations_fts_multilang:

  • Drops the existing search_vector generated column (PostgreSQL cannot ALTER a generated column in-place).
  • Recreates it combining both english and simple dictionaries via ||.
  • Recreates the GIN index as idx_observations_search_vector.
  • Rollback restores the original english-only column and idx_observations_fts.
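The drop-and-recreate sequence described above can be sketched as two ordered statement lists. This is a minimal sketch, not the migration code from this PR: `upStatements`, `downStatements`, and `ftsDoc` are hypothetical names, the `tsvector` column type is an assumption, and the three source columns are taken from the diff excerpt quoted in the review.

```go
package main

import "fmt"

// ftsDoc is the text expression fed to to_tsvector. The three columns come
// from the diff excerpt in this PR; everything else here is illustrative.
const ftsDoc = `COALESCE(title, '') || ' ' || COALESCE(subtitle, '') || ' ' || COALESCE(narrative, '')`

// upStatements (hypothetical helper) lists the forward migration in order.
// Dropping the column also drops the GIN index that depends on it.
func upStatements() []string {
	return []string{
		`ALTER TABLE observations DROP COLUMN IF EXISTS search_vector`,
		fmt.Sprintf(`ALTER TABLE observations ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', %[1]s) || to_tsvector('simple', %[1]s)) STORED`, ftsDoc),
		`CREATE INDEX IF NOT EXISTS idx_observations_search_vector ON observations USING GIN (search_vector)`,
	}
}

// downStatements (hypothetical helper) restores the english-only column
// and the original index name.
func downStatements() []string {
	return []string{
		`ALTER TABLE observations DROP COLUMN IF EXISTS search_vector`,
		fmt.Sprintf(`ALTER TABLE observations ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', %s)) STORED`, ftsDoc),
		`CREATE INDEX IF NOT EXISTS idx_observations_fts ON observations USING GIN (search_vector)`,
	}
}

func main() {
	// Print the statements so they can be inspected or pasted into psql.
	for _, s := range upStatements() {
		fmt.Println(s + ";")
	}
	for _, s := range downStatements() {
		fmt.Println(s + ";")
	}
}
```

Because both directions drop the column first, neither index name can be left orphaned as long as the index was built on `search_vector`.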

internal/db/gorm/observation_store.go — three functions updated:

  • SearchObservationsFTS: $1/$2 = query×2, $3 = project, $4 = limit
  • SearchObservationsFTSFiltered: $1/$2 = query×2, $3 = project, $4 = agentID, $5 = limit
  • SearchObservationsFTSScored: $1/$2 = query×2, $3 = project, $4 = limit
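The placeholder layout for the unfiltered search can be illustrated with a small builder. This is a sketch under assumptions rather than the code in observation_store.go: `buildFTSQuery` is a hypothetical name, and the `id` and `project` column names are guesses. It only builds the SQL string and argument slice; it does not touch a database.

```go
package main

import "fmt"

// combinedTSQuery ORs the stemmed English tsquery with the verbatim
// "simple" tsquery. $1 and $2 both receive the same user query string.
const combinedTSQuery = `(websearch_to_tsquery('english', $1) || websearch_to_tsquery('simple', $2))`

// buildFTSQuery (hypothetical helper) returns the SQL text and the argument
// list in placeholder order: $1/$2 = query, $3 = project, $4 = limit.
// Column names id and project are assumptions for illustration.
func buildFTSQuery(query, project string, limit int) (string, []any) {
	stmt := `SELECT id, ts_rank(search_vector, ` + combinedTSQuery + `) AS rank
FROM observations
WHERE search_vector @@ ` + combinedTSQuery + `
  AND project = $3
ORDER BY rank DESC
LIMIT $4`
	// The query string is passed twice, once per dictionary.
	return stmt, []any{query, query, project, limit}
}

func main() {
	stmt, args := buildFTSQuery("socraticode сломался", "demo", 10)
	fmt.Println(stmt)
	fmt.Println(len(args), "args")
}
```

Numbered placeholders let the `@@` predicate and `ts_rank` share the same two arguments, so the argument count stays at four even though the combined tsquery appears twice in the statement.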

Test plan

  • go build ./... passes (verified locally)
  • go test ./internal/db/gorm/... -count=1 passes (verified locally)
  • Deploy to staging; run SELECT to_tsvector('english','SocratiCode') || to_tsvector('simple','SocratiCode') and confirm the result contains both the verbatim socraticode token (simple) and the stemmed English lexeme
  • Verify query "socraticode сломался" returns memories containing either word
  • Verify existing English-only queries continue to return the same results

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements multilingual Full-Text Search (FTS) for observations by combining 'english' and 'simple' PostgreSQL dictionaries. It adds a migration to recreate the search_vector generated column and updates the ObservationStore search methods to query both dictionaries. Feedback was provided regarding an inconsistency in index naming between the migration and its rollback logic.

|| to_tsvector('simple', COALESCE(title, '') || ' ' || COALESCE(subtitle, '') || ' ' || COALESCE(narrative, ''))
) STORED`,
// Recreate GIN index (DROP above removed the old idx_observations_fts).
`CREATE INDEX IF NOT EXISTS idx_observations_search_vector ON observations USING GIN (search_vector)`,
medium

The index name has been changed from idx_observations_fts to idx_observations_search_vector. While this is not a functional bug, maintaining consistent naming across migrations and rollbacks is better for maintainability. The rollback function still refers to the original name idx_observations_fts (line 2238), which could lead to confusion or orphaned indexes if not handled carefully.

Suggested change
- `CREATE INDEX IF NOT EXISTS idx_observations_search_vector ON observations USING GIN (search_vector)`,
+ `CREATE INDEX IF NOT EXISTS idx_observations_fts ON observations USING GIN (search_vector)`,

The 'english' dictionary drops Cyrillic tokens and mishandles product
names like "SocratiCode". Combining it with 'simple' (no stemming, no
stopword removal) ensures all tokens are indexed verbatim, so mixed-
language queries like "socraticode сломался" correctly find memories.

- Migration 075: drops and recreates search_vector generated column
  as to_tsvector('english',...) || to_tsvector('simple',...); recreates
  GIN index as idx_observations_search_vector.
- SearchObservationsFTS, SearchObservationsFTSFiltered,
  SearchObservationsFTSScored: all @@ predicates and ts_rank calls
  now use (websearch_to_tsquery('english', ?) || websearch_to_tsquery('simple', ?));
  query param passed twice to match the two placeholders.
@thebtf thebtf force-pushed the fix/fts-multilang branch from 154473b to a8ca055 Compare April 12, 2026 20:25
@thebtf thebtf merged commit 5f29183 into main Apr 12, 2026
2 checks passed
@thebtf thebtf deleted the fix/fts-multilang branch April 12, 2026 20:25