-
Notifications
You must be signed in to change notification settings - Fork 1
feat: Integrate pg_textsearch Extension #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Integrate pg_textsearch Extension #7
Conversation
config/__init__.py
Outdated
|
|
||
| if not os.getenv("_ENV_LOADED"): | ||
| load_dotenv() | ||
| os.environ["_ENV_LOADED"] = "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it necessary that this is separate from https://github.com/castorini/quackir/blob/main/quackir/_base.py#L55
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure as the function _load_env is never being called. What I'm thinking is in the future, we may (or may not) add more configurations to this file, which then can be shared across multiple places. For example, we only need to load env here then anywhere that needs env can just import config, instead of calling load_dotenv multiple times. But I can def reuse the _load_env function here.
| @@ -0,0 +1,174 @@ | |||
| # Using Timescale pg_textsearch (BM25) with QuackIR | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's name the file usage-pg-textsearch.md so we don't get the ugly mix of underscores and dashes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right didn't notice this, thanks for catching!
docs/usage-pg_textsearch.md
Outdated
| 3. At the root of this repo, create a `.env` file with your Timescale DSN (service URL): | ||
|
|
||
| ``` | ||
| # .env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this line necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed!
docs/usage-pg_textsearch.md
Outdated
| indexer = PostgresIndexer(use_pg_textsearch=True) | ||
| indexer.init_table(table_name, index_type) | ||
| indexer.load_table(table_name, corpus_file, pretokenized=True) | ||
| indexer.fts_index(table_name, k1=1.5, b=0.8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use k1=0.9 and b=0.4 to align with anserini defaults
|
|
||
| with pathlib.Path("runs/run.quackir.postgres.sparse.pg_textsearch.nfcorpus.txt").open( | ||
| "w" | ||
| ) as out: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make the line spacing less awkward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All python codes here are formatted with Black formatter. I think we'd better stick with it? The point is that each line won't get too long.
docs/usage-pg_textsearch.md
Outdated
| which should yield: | ||
|
|
||
| ``` | ||
| ndcg_cut_10 all 0.3098 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
briefly compare with normal bm25 and postgres fts results? like a table
| ``` | ||
|
|
||
| Then, evaluate the hybrid results: | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the actual results with comparison?
quackir/index/_postgres.py
Outdated
| ) | ||
| self.conn.commit() | ||
|
|
||
| def vector_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we call it embedding_search, let's call it embedding_index for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea sure!
| else table_names[0] | ||
| ) | ||
| cur = self.conn.cursor() | ||
| if self.use_pg_textsearch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we reduce the replication between the two cases a bit? the only part that changes is the keyword_search right? if that's too annoying it's also fine to leave as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, I thought about doing it back then but haven’t gotten around to it yet. But done now.
| @@ -0,0 +1,19 @@ | |||
| # | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put this here https://github.com/castorini/quackir/blob/main/quackir/_base.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, is this necessary? or can users use whatever table names they want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary but the index name will be created based on the users' table_name. They don't need to take care of the index name as it is mainly for internal use.
Not 100% sure what _base.py is for, does it serve as a file to contain all utility functions and constants?
|
hey @brandonzhou2002 great work! I noticed that you added a |
|
I think this would be a good addition since it's "core" pg. |
| @@ -0,0 +1,174 @@ | |||
| # Using Timescale pg_textsearch (BM25) with QuackIR | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right didn't notice this, thanks for catching!
docs/usage-pg_textsearch.md
Outdated
| 3. At the root of this repo, create a `.env` file with your Timescale DSN (service URL): | ||
|
|
||
| ``` | ||
| # .env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed!
docs/usage-pg_textsearch.md
Outdated
| ## Searching with BM25 | ||
|
|
||
| - When `use_pg_textsearch=True`, QuackIR runs pg_textsearch BM25 queries via the distance operator `<@>` and `to_bm25query(...)`. | ||
| - BM25 scores from pg_textsearch are negative; more negative means better match. We often reverse the sign for evaluation tooling that expects higher-is-better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking it could work as a quick summary/heads-up but the content is basically same as the section "why reverse the score". I can just remove it
|
|
||
| with pathlib.Path("runs/run.quackir.postgres.sparse.pg_textsearch.nfcorpus.txt").open( | ||
| "w" | ||
| ) as out: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All python codes here are formatted with Black formatter. I think we'd better stick with it? The point is that each line won't get too long.
| @@ -0,0 +1,19 @@ | |||
| # | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary but the index name will be created based on the users' table_name. They don't need to take care of the index name as it is mainly for internal use.
Not 100% sure what _base.py is for, does it serve as a file to contain all utility functions and constants?
config/__init__.py
Outdated
|
|
||
| if not os.getenv("_ENV_LOADED"): | ||
| load_dotenv() | ||
| os.environ["_ENV_LOADED"] = "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure as the function _load_env is never being called. What I'm thinking is in the future, we may (or may not) add more configurations to this file, which then can be shared across multiple places. For example, we only need to load env here then anywhere that needs env can just import config, instead of calling load_dotenv multiple times. But I can def reuse the _load_env function here.
docs/usage-pg_textsearch.md
Outdated
| @@ -0,0 +1,174 @@ | |||
| # Using Timescale pg_textsearch (BM25) with QuackIR | |||
|
|
|||
| This guide shows how to run BM25 keyword search in PostgreSQL/Timescale using the pg_textsearch extension, integrated into QuackIR’s Postgres back end. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
quackir/index/_postgres.py
Outdated
| ) | ||
| self.conn.commit() | ||
|
|
||
| def vector_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea sure!
| else table_names[0] | ||
| ) | ||
| cur = self.conn.cursor() | ||
| if self.use_pg_textsearch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, I thought about doing it back then but haven’t gotten around to it yet. But done now.
| WITH semantic_search AS ( | ||
| SELECT id, RANK () OVER (ORDER BY embedding <=> %(vector)s::vector) AS rank | ||
| FROM {dense_table} | ||
| LIMIT %(n)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I noticed that there is no outer ORDER BY for each search, which may produce incorrect results since we do LIMIT %(n)s and without ORDER BY, the query picks random n rows instead of top n. I fixed it here. Lemme know if I'm wrong. Thanks
Thank you for reviewing @lilyjge! All comments should have been resolved. The |
This PR adds support for pg_textsearch by Timescale to provide BM25-based sparse indexing and retrieval, and to enable hybrid search alongside vector similarity.