@brandonzhou2002
Contributor

This PR adds support for pg_textsearch by Timescale to provide BM25-based sparse indexing and retrieval, and to enable hybrid search alongside vector similarity.

@brandonzhou2002
Contributor Author

Hey @lilyjge, please take a look at this PR when you get a chance. It is independent of PR #6. Thanks!

cc: @lintool

@brandonzhou2002 changed the title from "Integrate pg_textsearch Extension" to "feat: Integrate pg_textsearch Extension" on Dec 9, 2025

if not os.getenv("_ENV_LOADED"):
    load_dotenv()
    os.environ["_ENV_LOADED"] = "1"
Contributor Author

I'm not quite sure, since the function _load_env is never called. My thinking is that in the future we may (or may not) add more configuration to this file, which can then be shared across multiple places. For example, we only need to load the env here, and anything that needs it can just import config instead of calling load_dotenv multiple times. But I can def reuse the _load_env function here.
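For illustration, the load-once idea described above could look roughly like the sketch below. The helper name `load_env_once` and the injectable `loader` argument are assumptions made for this example, not code from the PR; in the PR's setting the loader would presumably be `load_dotenv` from python-dotenv.

```python
import os
from typing import Callable

_FLAG = "_ENV_LOADED"  # same guard variable as in the snippet under review


def load_env_once(loader: Callable[[], object]) -> bool:
    """Run `loader` only if the guard flag is not set yet (hypothetical helper).

    Returns True when the loader actually ran, False when it was skipped.
    """
    if os.getenv(_FLAG):
        return False
    loader()
    # Mark the environment as loaded so later imports skip the work.
    os.environ[_FLAG] = "1"
    return True
```

With something like this in a shared config module, other modules would only need to import config rather than call the loader themselves.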

@@ -0,0 +1,174 @@
# Using Timescale pg_textsearch (BM25) with QuackIR
Member

let's name the file usage-pg-textsearch.md so we don't get the ugly mix of underscores and dashes?

Contributor Author

Right, didn't notice this, thanks for catching!

3. At the root of this repo, create a `.env` file with your Timescale DSN (service URL):

```
# .env
```

Member

is this line necessary?

Contributor Author

removed!

indexer = PostgresIndexer(use_pg_textsearch=True)
indexer.init_table(table_name, index_type)
indexer.load_table(table_name, corpus_file, pretokenized=True)
indexer.fts_index(table_name, k1=1.5, b=0.8)
Member

let's use k1=0.9 and b=0.4 to align with Anserini defaults
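To make the effect of these two parameters concrete, here is a minimal sketch of a textbook BM25 term score with a non-negative IDF variant. It is not taken from pg_textsearch or Anserini, whose exact formulas may differ; the function name and arguments are illustrative assumptions.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Score one query term in one document with a textbook BM25 formula.

    k1 controls how quickly repeated terms saturate; b controls how strongly
    document length is normalised (0 = no normalisation, 1 = full).
    """
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

Under this formula, a document longer than average is penalised less with b=0.4 than with b=0.8, which is one reason scores are only comparable across systems when both use the same k1 and b.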


with pathlib.Path("runs/run.quackir.postgres.sparse.pg_textsearch.nfcorpus.txt").open(
    "w"
) as out:
Member

can we make the line spacing less awkward?

Contributor Author

All the Python code here is formatted with the Black formatter. I think we'd better stick with it? The point is that no line gets too long.

which should yield:

```
ndcg_cut_10 all 0.3098
```

Member

briefly compare with normal BM25 and Postgres FTS results? like a table

Then, evaluate the hybrid results:

Member

add the actual results with comparison?

)
self.conn.commit()

def vector_index(
Member

since we call it embedding_search, let's call it embedding_index for consistency?

Contributor Author

yea sure!

else table_names[0]
)
cur = self.conn.cursor()
if self.use_pg_textsearch:
Member

could we reduce the replication between the two cases a bit? the only part that changes is the keyword_search right? if that's too annoying it's also fine to leave as is

Contributor Author

Definitely, I thought about doing it back then but hadn't gotten around to it. Done now.

@@ -0,0 +1,19 @@
#
Member

also, is this necessary? or can users use whatever table names they want?

Contributor Author

Not strictly necessary, but the index name is derived from the user's table_name. Users don't need to manage the index name, as it is mainly for internal use.

Not 100% sure what _base.py is for; does it serve as a file to contain all utility functions and constants?
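As a rough illustration of deriving an index name from a table name, something like the following would work. The helper name `bm25_index_name`, the `bm25_idx` suffix, and the identifier check are all assumptions for this sketch; the naming scheme the PR actually uses is not shown in this thread.

```python
import re

# Conservative pattern for unquoted SQL identifiers (an assumption for this sketch).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def bm25_index_name(table_name: str, suffix: str = "bm25_idx") -> str:
    """Build an internal index name from a user-chosen table name (hypothetical)."""
    if not _IDENT.match(table_name):
        # Reject anything that could not be safely interpolated into SQL.
        raise ValueError(f"unsafe table name: {table_name!r}")
    return f"{table_name}_{suffix}"
```

Because the name is computed, users can pick any valid table name and never have to refer to the index directly.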

@lilyjge
Member

lilyjge commented Dec 24, 2025

hey @brandonzhou2002 great work! I noticed that you added a vector_index method for postgres with hnsw support as well, in addition to pg_textsearch. is that part of pg_textsearch? if not, let's try to keep PRs incremental so we know what each commit is for? what you have is fine since you already added it, but could you create an issue to document the hnsw usage and add CLI parameters at some point?

@lintool
Member

lintool commented Jan 3, 2026

I think this would be a good addition since it's "core" pg.

## Searching with BM25

- When `use_pg_textsearch=True`, QuackIR runs pg_textsearch BM25 queries via the distance operator `<@>` and `to_bm25query(...)`.
- BM25 scores from pg_textsearch are negative; more negative means better match. We often reverse the sign for evaluation tooling that expects higher-is-better.
Contributor Author

Was thinking it could work as a quick summary/heads-up, but the content is basically the same as the section "why reverse the score". I can just remove it.
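The sign reversal mentioned in the bullets above can be sketched as follows, assuming the search returns `(docid, score)` pairs where more negative means a better match. The function name `to_trec_run` and the run tag are hypothetical; the output follows the standard six-column TREC run format.

```python
def to_trec_run(qid, rows, tag="pg_textsearch"):
    """Turn (docid, raw_score) pairs into TREC run lines.

    raw_score is assumed to be negative-is-better, so it is negated to give
    evaluation tools the higher-is-better score they expect.
    """
    flipped = [(docid, -score) for docid, score in rows]
    # Best match first after the sign flip.
    flipped.sort(key=lambda pair: pair[1], reverse=True)
    return [
        f"{qid} Q0 {docid} {rank} {score:.6f} {tag}"
        for rank, (docid, score) in enumerate(flipped, start=1)
    ]
```

Negating the score keeps the ranking identical; only the direction of the scale changes.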


@@ -0,0 +1,174 @@
# Using Timescale pg_textsearch (BM25) with QuackIR

This guide shows how to run BM25 keyword search in PostgreSQL/Timescale using the pg_textsearch extension, integrated into QuackIR’s Postgres back end.
Contributor Author

Done

WITH semantic_search AS (
    SELECT id, RANK () OVER (ORDER BY embedding <=> %(vector)s::vector) AS rank
    FROM {dense_table}
    LIMIT %(n)s
Contributor Author

Also, I noticed that there was no outer ORDER BY for each search, which may produce incorrect results: since we apply LIMIT %(n)s, without ORDER BY the query returns an arbitrary n rows instead of the top n. I fixed it here. Lemme know if I'm wrong. Thanks
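The point about ORDER BY and LIMIT can be checked with a small stand-in. The sketch below uses an in-memory SQLite table instead of Postgres, and a plain `distance` column in place of the `embedding <=> vector` expression; the table name, column names, and `top_n` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT, distance REAL)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("d1", 0.9), ("d2", 0.1), ("d3", 0.5), ("d4", 0.3)],
)


def top_n(conn, n):
    """Return the n nearest ids; the ORDER BY is what makes LIMIT pick the top n."""
    rows = conn.execute(
        "SELECT id FROM docs ORDER BY distance ASC LIMIT ?", (n,)
    ).fetchall()
    return [row[0] for row in rows]
```

A window function such as RANK() OVER (ORDER BY ...) assigns ranks but does not order the rows the query returns, so a LIMIT without an outer ORDER BY is free to keep any n rows.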

@brandonzhou2002
Contributor Author

> hey @brandonzhou2002 great work! I noticed that you added a vector_index method for postgres with hnsw support as well, in addition to pg_textsearch. is that part of pg_textsearch? if not, let's try to keep PRs incremental so we know what each commit is for? what you have is fine since you already added it, but could you create an issue to document the hnsw usage and add CLI parameters at some point?

Thank you for reviewing @lilyjge! All comments should have been resolved.

The vector_index (now embedding_index) is mainly used to speed up hybrid search, and its use case is documented in the md file in this PR. Will create an issue.
