feat: Integrate pg_textsearch Extension #7

brandonzhou2002 · 2025-11-16T05:18:01Z

This PR adds support for pg_textsearch by Timescale to provide BM25-based sparse indexing and retrieval, and to enable hybrid search alongside vector similarity.

brandonzhou2002 · 2025-11-16T20:18:44Z

Hey @lilyjge, please take a look at this PR when you get a chance. It is independent of PR #6. Thanks!

cc: @lintool

lilyjge · 2025-12-24T21:48:36Z

config/__init__.py

+
+if not os.getenv("_ENV_LOADED"):
+    load_dotenv()
+    os.environ["_ENV_LOADED"] = "1"


is it necessary that this is separate from https://github.com/castorini/quackir/blob/main/quackir/_base.py#L55


lilyjge · 2025-12-24T21:49:20Z

docs/usage-pg-textsearch.md

@@ -0,0 +1,174 @@
+# Using Timescale pg_textsearch (BM25) with QuackIR


let's name the file usage-pg-textsearch.md so we don't get the ugly mix of underscores and dashes?


lilyjge · 2025-12-24T21:50:51Z

docs/usage-pg_textsearch.md

+3. At the root of this repo, create a `.env` file with your Timescale DSN (service URL):
+
+```
+# .env


is this line necessary?

lilyjge · 2025-12-24T21:52:14Z

docs/usage-pg_textsearch.md

+indexer = PostgresIndexer(use_pg_textsearch=True)
+indexer.init_table(table_name, index_type)
+indexer.load_table(table_name, corpus_file, pretokenized=True)
+indexer.fts_index(table_name, k1=1.5, b=0.8)


let's use k1=0.9 and b=0.4 to align with anserini defaults
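To make the effect of the two parameters concrete, here is a minimal sketch of a textbook BM25 term score. It assumes the classic Robertson-style formula with a smoothed IDF; pg_textsearch's exact variant may differ, and `bm25_term_score` is a hypothetical helper, not part of QuackIR.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 score for one query term in one document (illustrative only)."""
    # Smoothed IDF; always positive because of the "1 +" inside the log.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly longer-than-average documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated term occurrences saturate.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same document, twice the average length, scored under both settings.
anserini_style = bm25_term_score(3, 200, 100, 1000, 50, k1=0.9, b=0.4)
doc_example = bm25_term_score(3, 200, 100, 1000, 50, k1=1.5, b=0.8)
print(anserini_style, doc_example)
```

With the same `k1`, a larger `b` lowers the score of a longer-than-average document, and `b` has no effect when the document is exactly average length.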

lilyjge · 2025-12-24T21:53:12Z

docs/usage-pg-textsearch.md

+
+with pathlib.Path("runs/run.quackir.postgres.sparse.pg_textsearch.nfcorpus.txt").open(
+    "w"
+) as out:


can we make the line spacing less awkward?


lilyjge · 2025-12-24T21:58:04Z

docs/usage-pg_textsearch.md

+which should yield:
+
+```
+ndcg_cut_10             all     0.3098


briefly compare with normal bm25 and postgres fts results? like a table

lilyjge · 2025-12-24T22:00:21Z

docs/usage-pg-textsearch.md

+```
+
+Then, evaluate the hybrid results:
+


add the actual results with comparison?

lilyjge · 2025-12-24T22:05:37Z

quackir/index/_postgres.py

+            )
+        self.conn.commit()
+
+    def vector_index(


since we call it embedding_search, let's call it embedding_index for consistency?

lilyjge · 2025-12-24T22:09:10Z

quackir/search/_postgres.py

+            else table_names[0]
+        )
+        cur = self.conn.cursor()
+        if self.use_pg_textsearch:


could we reduce the replication between the two cases a bit? the only part that changes is the keyword_search right? if that's too annoying it's also fine to leave as is


lilyjge · 2025-12-24T22:09:43Z

quackir/utils/constants.py

@@ -0,0 +1,19 @@
+#


can we put this here https://github.com/castorini/quackir/blob/main/quackir/_base.py

also, is this necessary? or can users use whatever table names they want?


lilyjge · 2025-12-24T22:21:16Z

hey @brandonzhou2002 great work! I noticed that you added a vector_index method for postgres with hnsw support as well, in addition to pg_textsearch. is that part of pg_textsearch? if not, let's try to keep PRs incremental so we know what each commit is for? what you have is fine since you already added it, but could you create an issue to document the hnsw usage and add CLI parameters at some point?
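As a starting point for that follow-up issue, here is a rough sketch of how an HNSW index statement for pgvector could be built from Python. The function name `hnsw_index_sql`, the `_hnsw_idx` suffix, and the identifier check are illustrative assumptions, not QuackIR's actual API. The `USING hnsw ... WITH (m, ef_construction)` clause and the `vector_cosine_ops` operator class are pgvector syntax, and the defaults shown match pgvector's documented defaults.

```python
def hnsw_index_sql(table_name, column="embedding", m=16, ef_construction=64):
    """Return a CREATE INDEX statement for a pgvector HNSW index (hypothetical helper)."""
    # Identifiers cannot be bound as query parameters, so validate them first.
    for name in (table_name, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = f"{table_name}_hnsw_idx"  # naming scheme is an assumption
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table_name} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)})"
    )


print(hnsw_index_sql("corpus"))
```

`vector_cosine_ops` is chosen here because the search query orders by `<=>`, which is pgvector's cosine-distance operator; an index built with a different operator class would not be used for that ordering.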

lintool · 2026-01-03T19:50:16Z

I think this would be a good addition since it's "core" pg.

brandonzhou2002 · 2026-01-06T20:33:47Z

docs/usage-pg-textsearch.md

@@ -0,0 +1,174 @@
+# Using Timescale pg_textsearch (BM25) with QuackIR


Right didn't notice this, thanks for catching!

brandonzhou2002 · 2026-01-06T20:36:43Z

docs/usage-pg_textsearch.md

+3. At the root of this repo, create a `.env` file with your Timescale DSN (service URL):
+
+```
+# .env


brandonzhou2002 · 2026-01-06T20:44:08Z

docs/usage-pg_textsearch.md

+## Searching with BM25
+
+- When `use_pg_textsearch=True`, QuackIR runs pg_textsearch BM25 queries via the distance operator `<@>` and `to_bm25query(...)`.
+- BM25 scores from pg_textsearch are negative; more negative means better match. We often reverse the sign for evaluation tooling that expects higher-is-better.


Was thinking it could work as a quick summary/heads-up, but the content is basically the same as the "why reverse the score" section. I can just remove it.
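For reference, the sign convention described in this hunk could be handled along these lines when writing a TREC-style run file. This is a minimal sketch: `to_trec_run_lines` and the `pg_textsearch` run tag are hypothetical, and the input is assumed to be `(docid, score)` pairs with negative scores where more negative means a better match.

```python
def to_trec_run_lines(qid, rows, tag="pg_textsearch"):
    """Format (docid, negative_score) pairs as TREC run lines with flipped scores."""
    # Ascending order puts the most negative (best) score first.
    ordered = sorted(rows, key=lambda row: row[1])
    return [
        # Negate the score so that evaluation tools see higher-is-better.
        f"{qid} Q0 {docid} {rank} {-score:.4f} {tag}"
        for rank, (docid, score) in enumerate(ordered, start=1)
    ]


for line in to_trec_run_lines("q1", [("d2", -1.1), ("d1", -3.2)]):
    print(line)
```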

brandonzhou2002 · 2026-01-06T20:46:07Z

docs/usage-pg-textsearch.md

+
+with pathlib.Path("runs/run.quackir.postgres.sparse.pg_textsearch.nfcorpus.txt").open(
+    "w"
+) as out:


All Python code here is formatted with the Black formatter. I think we'd better stick with it? The point is that no line gets too long.

brandonzhou2002 · 2026-01-06T20:51:41Z

quackir/utils/constants.py

@@ -0,0 +1,19 @@
+#


Not strictly necessary, but the index name is derived from the user's table_name, so users don't need to manage it themselves; it is mainly for internal use.

Not 100% sure what _base.py is for. Does it serve as a home for all utility functions and constants?

brandonzhou2002 · 2026-01-07T01:09:40Z

config/__init__.py

+
+if not os.getenv("_ENV_LOADED"):
+    load_dotenv()
+    os.environ["_ENV_LOADED"] = "1"


I'm not quite sure, as the function _load_env is never called. What I'm thinking is that in the future we may (or may not) add more configuration to this file, which can then be shared across multiple places. For example, we only need to load the env here, and then anywhere that needs it can just import config instead of calling load_dotenv multiple times. But I can def reuse the _load_env function here.
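A minimal sketch of the load-once idea, for discussion. `load_env_once` is a hypothetical helper; in practice the `loader` argument would be `load_dotenv` from python-dotenv, which is passed in here so the snippet runs without that dependency.

```python
import os


def load_env_once(loader, flag="_ENV_LOADED"):
    """Run loader() only if the flag variable is not already set in the environment."""
    if os.getenv(flag):
        return False  # already loaded, in this process or in a parent process
    loader()
    os.environ[flag] = "1"
    return True


# Usage inside config/__init__.py would look roughly like:
#     from dotenv import load_dotenv
#     load_env_once(load_dotenv)
```

Because the flag lives in `os.environ`, child processes inherit it along with the variables that were loaded, so they skip the reload as well.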

brandonzhou2002 · 2026-01-07T01:14:24Z

docs/usage-pg_textsearch.md

@@ -0,0 +1,174 @@
+# Using Timescale pg_textsearch (BM25) with QuackIR
+
+This guide shows how to run BM25 keyword search in PostgreSQL/Timescale using the pg_textsearch extension, integrated into QuackIR’s Postgres back end.


brandonzhou2002 · 2026-01-07T01:33:07Z

quackir/index/_postgres.py

+            )
+        self.conn.commit()
+
+    def vector_index(


brandonzhou2002 · 2026-01-07T02:31:25Z

quackir/search/_postgres.py

+            else table_names[0]
+        )
+        cur = self.conn.cursor()
+        if self.use_pg_textsearch:


Definitely. I had thought about doing it earlier but hadn't gotten around to it; it's done now.

brandonzhou2002 · 2026-01-07T02:36:16Z

quackir/search/_postgres.py

-        WITH semantic_search AS (
-            SELECT id, RANK () OVER (ORDER BY embedding <=> %(vector)s::vector) AS rank
-            FROM {dense_table}
-            LIMIT %(n)s


Also, I noticed that there is no outer ORDER BY for each search, which may produce incorrect results: we do LIMIT %(n)s, and without ORDER BY the query can return an arbitrary n rows instead of the top n. I fixed it here. Lemme know if I'm wrong. Thanks
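To illustrate the shape of the fix, here is a small self-contained sketch using Python's built-in sqlite3 as a stand-in for Postgres. The table, column names, and distances are made up, and a plain `dist` column replaces the pgvector `<=>` expression. The point carries over: SQL leaves row order unspecified without an outer ORDER BY, so LIMIT is only guaranteed to return the top n when the ordering is explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT, dist REAL)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("a", 0.9), ("b", 0.5), ("c", 0.1), ("d", 0.3)],
)


def top_n(conn, n):
    """Return the n nearest rows with their rank, ordered explicitly."""
    sql = """
        SELECT id, RANK() OVER (ORDER BY dist) AS rnk
        FROM docs
        ORDER BY dist  -- the outer ORDER BY makes LIMIT pick the top n
        LIMIT ?
    """
    return conn.execute(sql, (n,)).fetchall()


print(top_n(conn, 2))
```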

brandonzhou2002 · 2026-01-07T02:54:17Z

> hey @brandonzhou2002 great work! I noticed that you added a vector_index method for postgres with hnsw support as well, in addition to pg_textsearch. is that part of pg_textsearch? if not, let's try to keep PRs incremental so we know what each commit is for? what you have is fine since you already added it, but could you create an issue to document the hnsw usage and add CLI parameters at some point?

Thank you for reviewing @lilyjge! All comments should have been resolved.

The vector_index (now embedding_index) method is mainly there to speed up hybrid search; its use case is covered in the md file in this PR. Will create an issue.
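For readers unfamiliar with how hybrid search combines the sparse and dense rankings, here is a minimal sketch of reciprocal rank fusion (RRF) in plain Python. It assumes the common constant `k=60` and operates on lists of document IDs; `reciprocal_rank_fusion` is a hypothetical helper and not the SQL that this PR actually runs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


sparse = ["d1", "d2", "d3"]  # e.g. BM25 ranking
dense = ["d2", "d4", "d1"]   # e.g. embedding ranking
print(reciprocal_rank_fusion([sparse, dense]))
```

Documents that appear in both lists accumulate two contributions, so they tend to rise above documents found by only one retriever.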

brandonzhou2002 added 2 commits November 16, 2025 05:06

integrate pg_textsearch

03395bf

add create extension

9e3b906

brandonzhou2002 changed the title from "Integrate pg_textsearch Extension" to "feat: Integrate pg_textsearch Extension" on Dec 9, 2025

brandonzhou2002 mentioned this pull request Dec 9, 2025

feat: Integrate RagDB #8

Open

lilyjge requested changes Dec 24, 2025


resolve comments

f9376a2

brandonzhou2002 commented Jan 7, 2026


brandonzhou2002 requested a review from lilyjge January 7, 2026 02:48

update rrf

94e39fb

