Collaborator

@lgrz lgrz commented Aug 4, 2025

Add CLI option to customize scorer params for BM25 and beyond.
Fixes #392

Changes proposed in this pull request:

  • Add --scorer option to tools/compute_intersection.cpp
  • Update tests/test_intersection.cpp
  • Fix logic error when filtering queries by min/max number of terms

@lgrz lgrz changed the title 392 scorer [PR] Add scorer option to compute_intersection Aug 4, 2025
@lgrz lgrz changed the title [PR] Add scorer option to compute_intersection Scorer option for compute_intersection Aug 4, 2025
  auto filtered_queries = ranges::views::filter(queries, [&](auto&& query) {
      auto size = query.terms().size();
-     return size < min_query_len || size > max_query_len;
+     return size > min_query_len || size < max_query_len;
Member

Shouldn't it be min_query_len <= size && size <= max_query_len? 🤔

Collaborator Author

Yes, thanks. That is a more readable way to put it, and it should indeed include the min/max boundaries.

Member

Well, that plus your initial fix still used ||. Just confirming: this function should return true if size is in [min_query_len, max_query_len] range (inclusive), right? I'm double checking because it's weird it was the opposite before, seems like a very big oversight, so I'm second guessing myself on this... But yeah, it should be right now I think.
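
To make the difference between the three predicates concrete, here is a small standalone sketch. The function names are hypothetical and not part of PISA; only the three boolean expressions are taken from the diff and the review suggestion above.

```cpp
#include <cstddef>

// Original predicate from the diff: true only for sizes OUTSIDE [min_len, max_len].
constexpr bool original_predicate(std::size_t size, std::size_t min_len, std::size_t max_len) {
    return size < min_len || size > max_len;
}

// First attempted fix from the diff: true whenever size > min_len OR size < max_len,
// which holds for almost every size, so very little is filtered out.
constexpr bool first_fix_predicate(std::size_t size, std::size_t min_len, std::size_t max_len) {
    return size > min_len || size < max_len;
}

// Suggested predicate: true exactly for sizes in the inclusive range [min_len, max_len].
constexpr bool in_inclusive_range(std::size_t size, std::size_t min_len, std::size_t max_len) {
    return min_len <= size && size <= max_len;
}
```

With `min_len = 2` and `max_len = 4`, `in_inclusive_range` accepts sizes 2, 3, and 4 only, while `first_fix_predicate` also accepts 1 and 5, and `original_predicate` rejects 3.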

Collaborator Author

Yes, it was odd that it was the opposite before. I second-guessed myself a couple of times too, but then verified that the change in this PR was the only way to get the intersections printed out when running the tool. We don't have tests in place for this tool similar to test/cli/*, but maybe we (or I) should add some there.

Member

Ideally, this wouldn't be implemented directly in the tool source file but rather in the library and then unit tested. Perhaps subtype arg::Query and have it deal with filtering queries? Not sure to be honest. I also wouldn't say it's a hard requirement.

If you have a spare moment to write CLI tests, it would be great. But I won't block this for that, it's good to have this fix in.

Collaborator Author

Yes, I agree. While working on this PR I did try to look at moving the
min/max query length filter into include/pisa/intersect.hpp. But I didn't
get to a good solution because of the way for_all_subsets works for the
--combinations flag and the --max-term-count filter.

To me a good solution would result in a reasonable design that works well with
for_all_subsets and for Intersection::compute, where either of those
intersection modes may have the min/max query length filter.
Then in tools/compute_intersection there would be no need to pass a
function argument print_intersection which does more than only printing
the intersection (it does the intersection as well as print to stdout).

Also, I did not consider other ways of solving this outside of
pisa/include/intersection.hpp. Thanks, this is a good suggestion, and
arg::Query sounds like a good place to start looking (because it is
more along the lines of a TermPolicy but at the query level).

This could make sense for generating sub-queries too, as it is a query
filter in the opposite direction. But then again, if it is not used outside
of compute_intersection, then it is probably best to keep that local to the
intersection implementation, as it may bring additional complications, for
example the query ID and so forth.

I would prefer to try and resolve the min/max query filter with unit
tests rather than the CLI tests. So I won't add the CLI tests for now.

Collaborator Author

lgrz commented Aug 18, 2025

Also, just to add some output for reference. This output is with the new min/max change (in this PR):

== without min, max arguments
$ ./build/debug/bin/compute_intersection --encoding block_simdbp --index 100k_block_simdbp.idx --wand 100kwand_fixed40_bm25 --tokenizer english --token-filters lowercase porter2 --terms 100kfwd.termlex --queries dl20.txt --header --scorer dph
qid     length  max_score
1108939 0       0
1112389 0       0
792752  4000    6.984559
...

== with arguments min: 0, max: 3
$ ./build/debug/bin/compute_intersection --encoding block_simdbp --index 100k_block_simdbp.idx --wand 100kwand_fixed40_bm25 --tokenizer english --token-filters lowercase porter2 --terms 100kfwd.termlex --queries dl20.txt --header --scorer dph --min-query-len 0 --max-query-len 3
qid     length  max_score
792752  4000    6.984559
1128373 2402    5.70508
1124979 1       7.8003674
...

== with arguments min: 3, max: 3
$ ./build/debug/bin/compute_intersection --encoding block_simdbp --index 100k_block_simdbp.idx --wand 100kwand_fixed40_bm25 --tokenizer english --token-filters lowercase porter2 --terms 100kfwd.termlex --queries dl20.txt --header --scorer dph --min-query-len 3 --max-query-len 3
qid     length  max_score
1124979 1       7.8003674
156493  1       13.767251
1124464 0       0
...

== with invalid arguments min: 1, max: 0
$ ./build/debug/bin/compute_intersection --encoding block_simdbp --index 100k_block_simdbp.idx --wand 100kwand_fixed40_bm25 --tokenizer english --token-filters lowercase porter2 --terms 100kfwd.termlex --queries dl20.txt --header --scorer dph --min-query-len 1 --max-query-len 0
qid     length  max_score
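
The last run prints only the header because no query length can satisfy an inclusive range whose lower bound exceeds its upper bound. A minimal sketch of that reasoning, assuming the inclusive predicate agreed on in the review (the helper names are hypothetical, not PISA APIs):

```cpp
#include <cstddef>

// Inclusive range check, as agreed in the review discussion.
constexpr bool in_inclusive_range(std::size_t size, std::size_t min_len, std::size_t max_len) {
    return min_len <= size && size <= max_len;
}

// Counts how many query lengths in [0, limit] pass the filter.
constexpr std::size_t count_accepted_lengths(std::size_t min_len,
                                             std::size_t max_len,
                                             std::size_t limit) {
    std::size_t accepted = 0;
    for (std::size_t size = 0; size <= limit; ++size) {
        if (in_inclusive_range(size, min_len, max_len)) {
            ++accepted;
        }
    }
    return accepted;
}
```

For lengths 0 through 10, the bounds (0, 3) accept four lengths, (3, 3) accept exactly one, and (1, 0) accept none.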

Collaborator Author

lgrz commented Sep 26, 2025

Updated the PR and am looking for feedback on the new changes introduced.
To summarize, added arg::QueryFilter to handle the filtering of queries
for compute_intersection; opted to not inherit from arg::Query since
it is already a child of arg::Analyzer. This is why when using
QueryFilter the "user" must explicitly call query_filter_add_options
and then apply the filter using the output of app.queries.
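
As a rough illustration of the shape such a filter might take, the sketch below keeps the length bounds in a small standalone type and applies them to a list of queries. It is not the `arg::QueryFilter` from this PR: `LengthFilter`, `Query`, and `count_matching` are hypothetical names, and the option registration step (`query_filter_add_options`) is left out so the example compiles without the CLI library. The default lower bound of 1 mirrors the `m_min_length = 1` default visible in the review of `tools/app.hpp`.

```cpp
#include <array>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for a parsed query; only the term count matters here.
struct Query {
    int id;
    std::size_t term_count;
};

// Hypothetical filter holding inclusive bounds on the number of query terms.
struct LengthFilter {
    std::size_t min_length = 1;
    std::size_t max_length = std::numeric_limits<std::size_t>::max();

    constexpr bool operator()(Query const& query) const {
        return min_length <= query.term_count && query.term_count <= max_length;
    }
};

// Counts the queries accepted by the filter.
template <std::size_t N>
constexpr std::size_t count_matching(std::array<Query, N> const& queries, LengthFilter filter) {
    std::size_t matching = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (filter(queries[i])) {
            ++matching;
        }
    }
    return matching;
}
```

Keeping the predicate in a plain type like this is what makes it unit-testable independently of the tool binary, which is the motivation stated in the thread.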

Added tests in tools/tests/test_app.cpp, but I realized that they are
outdated and not used in builds or gh actions.

Any suggestions on whether to continue along this path using
tools/tests/test_app.cpp?

Another option, if they are useful, might be to rework them under the main
test suite.

@lgrz lgrz requested a review from elshize September 26, 2025 22:03
Member

@elshize elshize left a comment

I think the app tests could be implemented as part of the regular test directory, but they need to be still separate to an extent from the rest. We probably don't want to link app to the lib tests and vice versa. It doesn't need to be done here though.

tools/app.hpp Outdated
private:
int m_min_length = 1;
int m_max_length = std::numeric_limits<int>::max();
CLI::App* m_app = nullptr;
Member

I think I'd rather not set it to nullptr. The reason is that it may suggest that it may not have a set value, but it's not true because the only constructor sets it.

Are there any consequences of not having one variable set here? I don't think so, but I might be wrong. For consistency, we can also not set any of the values here and set the min/max in the constructor as well. WDYT?

Collaborator Author

Thanks for the feedback, sure can move these to the constructor.

The member initializer approach is sort of a habit, since code gets changed later on (see, for example, C.48 in the C++ Core Guidelines). But then again, the use case here is quite specific and arg doesn't change that often. And probably more importantly, moving it to the constructor will make it consistent with existing code. (Perhaps I will also move query_filter_add_options to the constructor; then there is no need for a member m_app.)

Member

I have no strong preference as to where to initialize, I just prefer not to initialize anything with a value if it's always overridden in the constructor, because I would want to avoid a situation where we accidentally remove the initialization from the constructor and we end up with explicit nullptr, and static analysis won't detect an issue there.
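
The trade-off discussed here can be sketched with two hypothetical classes (the names and the `App` stand-in are illustrative only, not the code in `tools/app.hpp`). In the first, a default member initializer would mask a missing constructor initialization. In the second, dropping the constructor initialization would leave the member uninitialized, which compiler warnings or static analysis tools have a chance to flag.

```cpp
#include <limits>

// Hypothetical stand-in for the CLI application type; illustrative only.
struct App {};

// Variant A: default member initializers. If `m_app(app)` were ever removed
// from the constructor, m_app would silently remain an explicit nullptr.
class WithDefaultInitializers {
  public:
    constexpr explicit WithDefaultInitializers(App* app) : m_app(app) {}
    constexpr int min_length() const { return m_min_length; }
    constexpr int max_length() const { return m_max_length; }
    constexpr App* app() const { return m_app; }

  private:
    int m_min_length = 1;
    int m_max_length = std::numeric_limits<int>::max();
    App* m_app = nullptr;  // always overwritten by the only constructor
};

// Variant B: every member is set in the constructor's initializer list.
// Removing `m_app(app)` here would leave m_app uninitialized rather than
// giving it a plausible-looking default.
class ConstructorInitialized {
  public:
    constexpr explicit ConstructorInitialized(App* app)
        : m_min_length(1), m_max_length(std::numeric_limits<int>::max()), m_app(app) {}
    constexpr int min_length() const { return m_min_length; }
    constexpr int max_length() const { return m_max_length; }
    constexpr App* app() const { return m_app; }

  private:
    int m_min_length;
    int m_max_length;
    App* m_app;
};
```

Both variants behave identically as written; they differ only in how a later accidental edit to the constructor would surface.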

Collaborator Author

Sure, thanks for the context on asan; will update the patch

Collaborator Author

lgrz commented Sep 28, 2025

I think the app tests could be implemented as part of the regular test
directory, but they need to be still separate to an extent from the rest.
We probably don't want to link app to the lib tests and vice versa. It
doesn't need to be done here though.

Ok yes, thanks for the tips on how it should work if it ends up in /tests.

The reason I was asking the above is because currently there is no CI/Build
that has both PISA_BUILD_TOOLS=ON and PISA_ENABLE_TESTING=ON, which
at present is required for the tests in tools/tests to be automated
on build. And I think that possibly the tests under tools only depend on
CLI.hpp, app.cpp, app.h and do not depend on the individual tools under
tools directory. So it may be useful to lift the app tests out and into
the main test directory.

I will try leaving the app tests where they are for now and maybe it is
just a matter of having an extra build config in the CI for the app tests.
Does that sound OK?

Member

elshize commented Sep 28, 2025

This sounds good. I'm fine moving it to test directory as well. Whatever is easier here, and we can address it in a later PR. We don't have to fix everything here.

This moves the query length filter out of the main
`compute_intersection` binary for automated testing.

- Enable `tools/tests` in cmake build
- Fix old tests in `tools/tests/test_app.cpp`
- Add `arg::QueryFilter` tests for `compute_intersection`

The `tools/tests` are not yet configured for CI builds as they require
both options `PISA_BUILD_TOOLS=ON` and `PISA_ENABLE_TESTING=ON`,
although they can be run locally. This will be resolved in a later PR
(see pisa-engine#620).

See pisa-engine#392
Collaborator Author

lgrz commented Sep 29, 2025

This sounds good. I'm fine moving it to test directory as well. Whatever is easier here, and we can address it in a later PR. We don't have to fix everything here.

Thanks, I have left the tests under tools/tests for now and noted that this can be resolved at a later stage.

Member

elshize commented Oct 9, 2025

Sorry for the late reply, I've been traveling. Just double-checking: this is ready to merge, correct? Or did you plan any more changes?

Collaborator Author

lgrz commented Oct 9, 2025

Yes, this should be good to merge now. No further changes for this one. Thanks!

@elshize elshize merged commit 95f95be into pisa-engine:main Oct 9, 2025
7 checks passed
Development

Successfully merging this pull request may close these issues.

Use --scorer for compute_intersection
