Skip to content
1 change: 1 addition & 0 deletions _partials/_since_0_1_0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<Tag variant="hollow">Since [pg_textsearch v0.1.0](https://github.com/timescale/pg_textsearch/releases/tag/v0.1.0)</Tag>
140 changes: 107 additions & 33 deletions use-timescale/extensions/pg-textsearch.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ products: [cloud, self_hosted]
---

import EA1125 from "versionContent/_partials/_early_access_11_25.mdx";
import SINCE010 from "versionContent/_partials/_since_0_1_0.mdx";
import IntegrationPrereqs from "versionContent/_partials/_integration-prereqs.mdx";

# Optimize full text search with BM25
Expand All @@ -27,13 +28,12 @@ matches. `pg_textsearch` implements the following:
This page shows you how to install `pg_textsearch`, configure BM25 indexes, and optimize your search capabilities using
the following best practice:

* **Memory planning**: size your `index_memory_limit` based on corpus vocabulary and document count
* **Language configuration**: choose appropriate text search configurations for your data language
* **Hybrid search**: combine with pgvector or pgvectorscale for applications requiring both semantic and keyword search
* **Query optimization**: use score thresholds to filter low-relevance results
* **Index monitoring**: regularly check index usage and memory consumption

<EA1125 /> this preview release is designed for development and staging environments. It is not recommended for use with hypertables.
<EA1125 /> this preview release is designed for development and staging environments.

## Prerequisites

Expand Down Expand Up @@ -124,42 +124,76 @@ Use efficient query patterns to leverage BM25 ranking and optimize search perfor
1. **Perform ranked searches using the distance operator**

```sql
SELECT name, description,
description <@> to_bm25query('ergonomic work', 'products_search_idx') as score
SELECT name, description, description <@> to_bm25query('ergonomic work', 'products_search_idx') as score
FROM products
ORDER BY description <@> to_bm25query('ergonomic work', 'products_search_idx')
ORDER BY score
LIMIT 3;
```

You see something like:

```sql
name | description | score
----------------------------+-----------------------------------------------------------------------------------+---------------------
Ergonomic Mouse | Wireless mouse with ergonomic design to reduce wrist strain during long work sessions | -1.8132977485656738
Mechanical Keyboard | Durable mechanical switches with RGB backlighting for gaming and productivity | 0
Standing Desk | Adjustable height desk for better posture and productivity throughout the workday | 0
```

1. **Filter results by score threshold**

```sql
SELECT name,
description <@> to_bm25query('wireless', 'products_search_idx') as score
SELECT name, description <@> to_bm25query('wireless', 'products_search_idx') as score
FROM products
WHERE description <@> to_bm25query('wireless', 'products_search_idx') < -2.0;
WHERE description <@> to_bm25query('wireless', 'products_search_idx') < -0.5;
```

You see something like:

```sql
name | score
----------------+---------------------
Ergonomic Mouse | -0.9066488742828369
```

1. **Combine with standard SQL operations**

```sql
SELECT category, name,
description <@> to_bm25query('ergonomic', 'products_search_idx') as score
SELECT category, name, description <@> to_bm25query('ergonomic', 'products_search_idx') as score
FROM products
WHERE price < 500
AND description <@> to_bm25query('ergonomic', 'products_search_idx') < -1.0
AND description <@> to_bm25query('ergonomic', 'products_search_idx') < -0.5
ORDER BY description <@> to_bm25query('ergonomic', 'products_search_idx')
LIMIT 5;
```

You see something like:

```sql
category | name | score
-------------+-----------------+---------------------
Electronics | Ergonomic Mouse | -0.9066488742828369
```

1. **Verify index usage with EXPLAIN**

```sql
EXPLAIN SELECT * FROM products
ORDER BY description <@> to_bm25query('wireless keyboard', 'products_search_idx')
ORDER BY description <@> to_bm25query('ergonomic', 'products_search_idx')
LIMIT 5;
```

You see something like:

```sql
QUERY PLAN
--------------------------------------------------------------------------------------------
Limit (cost=8.55..8.56 rows=3 width=140)
-> Sort (cost=8.55..8.56 rows=3 width=140)
Sort Key: ((description <@> 'products_search_idx:ergonomic'::bm25query))
-> Seq Scan on products (cost=0.00..8.53 rows=3 width=140)
```

</Procedure>

You have optimized your search queries for BM25 ranking.
Expand All @@ -181,10 +215,21 @@ Combine `pg_textsearch` with `pgvector` or `pgvectorscale` to build powerful hyb
id serial PRIMARY KEY,
title text,
content text,
embedding vector(1536) -- OpenAI ada-002 embedding dimension
embedding vector(3) -- Using 3 dimensions for this example; use 1536 for OpenAI ada-002
);
```

1. **Insert sample data**

```sql
INSERT INTO articles (title, content, embedding) VALUES
('Database Query Optimization', 'Learn how to optimize database query performance using indexes and query planning', '[0.1, 0.15, 0.2]'),
('Performance Tuning Guide', 'A comprehensive guide to performance tuning in distributed systems and databases', '[0.12, 0.18, 0.25]'),
('Introduction to Indexing', 'Understanding how database indexes improve query performance and data retrieval', '[0.09, 0.14, 0.19]'),
('Advanced SQL Techniques', 'Master advanced SQL techniques for complex data analysis and reporting', '[0.5, 0.6, 0.7]'),
('Data Warehousing Basics', 'Getting started with data warehousing and analytical query processing', '[0.8, 0.9, 0.85]');
```

1. **Create indexes for both search types**

```sql
Expand Down Expand Up @@ -223,7 +268,19 @@ Combine `pg_textsearch` with `pgvector` or `pgvectorscale` to build powerful hyb
LEFT JOIN keyword_search k ON a.id = k.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY combined_score DESC
LIMIT 10;
LIMIT 10;
```

You see something like:

```sql
id | title | combined_score
----+----------------------------+--------------------
3 | Introduction to Indexing | 0.0325224748810153
1 | Database Query Optimization| 0.0322664584959667
2 | Performance Tuning Guide | 0.0320020481310804
5 | Data Warehousing Basics | 0.0310096153846154
4 | Advanced SQL Techniques | 0.0310096153846154
```

1. **Adjust relative weights for different search types**
Expand Down Expand Up @@ -257,6 +314,18 @@ Combine `pg_textsearch` with `pgvector` or `pgvectorscale` to build powerful hyb
LIMIT 10;
```

You see something like:

```sql
id | title | combined_score
----+----------------------------+--------------------
3 | Introduction to Indexing | 0.0163141195134849
2 | Performance Tuning Guide | 0.0160522273425499
1 | Database Query Optimization| 0.0160291438979964
4 | Advanced SQL Techniques | 0.0155528846153846
5 | Data Warehousing Basics | 0.0154567307692308
```

</Procedure>

You have implemented hybrid search combining semantic and keyword search.
Expand All @@ -267,27 +336,37 @@ Customize `pg_textsearch` behavior for your specific use case and data character

<Procedure>

1. **Configure the memory limit**
1. **Configure memory and performance settings**

To manage memory usage, you control when the in-memory index spills to disk segments. When the memtable reaches the
threshold, it automatically flushes to a segment at transaction commit.

The size of the memtable depends primarily on the number of distinct terms in your corpus. A corpus with longer
documents or more varied vocabulary requires more memory per document.
```sql
-- Set memory limit per index (default 64MB)
SET pg_textsearch.index_memory_limit = '128MB';
-- Set memtable spill threshold (default 800000 posting entries, ~8MB segments)
SET pg_textsearch.memtable_spill_threshold = 1000000;

-- Set bulk load spill threshold (default 100000 terms per transaction)
SET pg_textsearch.bulk_load_threshold = 150000;

-- Set default query limit when no LIMIT clause is present (default 1000)
SET pg_textsearch.default_limit = 5000;
```
<SINCE010 />

1. **Configure language-specific text processing**

```sql
-- French language configuration
CREATE INDEX products_fr_idx ON products_fr
USING pg_textsearch(description)
WITH (text_config='french');
You can create multiple BM25 indexes on the same column with different language configurations:

-- Simple tokenization without stemming
```sql
-- Create an additional index with simple tokenization (no stemming)
CREATE INDEX products_simple_idx ON products
USING pg_textsearch(description)
USING bm25(description)
WITH (text_config='simple');

-- Example: French language configuration for a French products table
-- CREATE INDEX products_fr_idx ON products_fr
-- USING bm25(description)
-- WITH (text_config='french');
```

1. **Tune BM25 parameters**
Expand All @@ -310,7 +389,7 @@ Customize `pg_textsearch` behavior for your specific use case and data character

- View detailed index information
```sql
SELECT bm25_debug_dump_index('products_search_idx');
SELECT bm25_dump_index('products_search_idx');
```

</Procedure>
Expand All @@ -320,12 +399,7 @@ caching and pagination to improve user experience with large result sets.

## Current limitations

This preview release focuses on core BM25 functionality. It has the following limitations:

* **Memory-only storage**: indexes are limited by `pg_textsearch.index_memory_limit` (default 64MB)
* **No phrase queries**: cannot search for exact multi-word phrases yet

These limitations will be addressed in upcoming releases with disk-based segments and expanded query capabilities.
This preview release focuses on core BM25 functionality. In this release, you cannot search for exact multi-word phrases.

[bm25-wiki]: https://en.wikipedia.org/wiki/Okapi_BM25
[connect-using-psql]: /integrations/:currentVersion:/psql/#connect-to-your-service
Expand Down