Conversation

@aruggero
Contributor

@aruggero aruggero commented Jul 21, 2025

https://issues.apache.org/jira/browse/SOLR-16667

Description

The definition and usage of the QUERY_DOC_FV feature cache have been modified to support both reranking and logging.

Solution

  • The cache is now defined in SolrConfig, in the same way as the filter cache, the query result cache, etc.
  • Cache lookups and insertions have been integrated into both the reranking and logging phases.
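
To illustrate the lookup-then-insert pattern described above, here is a minimal sketch in plain Java. It is not the code in this PR: the class `FeatureVectorCacheSketch`, the method `getOrCompute`, and the string key layout are all hypothetical, and a simple `HashMap` stands in for the real Solr cache. The point is only that a vector computed during reranking can be found again during logging instead of being extracted twice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a feature vector cache; NOT the Solr implementation. */
final class FeatureVectorCacheSketch {
  // Plain HashMap for illustration only: no eviction and not thread-safe.
  private final Map<String, float[]> entries = new HashMap<>();
  private long lookups;
  private long hits;
  private long inserts;

  /** Returns the cached vector, or runs the extractor and inserts its result on a miss. */
  float[] getOrCompute(String key, Supplier<float[]> extractor) {
    lookups++;
    float[] cached = entries.get(key);
    if (cached != null) {
      hits++;
      return cached;
    }
    float[] computed = extractor.get();
    entries.put(key, computed);
    inserts++;
    return computed;
  }

  long lookups() { return lookups; }
  long hits() { return hits; }
  long inserts() { return inserts; }

  public static void main(String[] args) {
    FeatureVectorCacheSketch cache = new FeatureVectorCacheSketch();
    String key = "store=demo|efi=none|doc=42"; // assumed key layout, for illustration

    // Reranking phase: miss, so the features are extracted and inserted.
    float[] forRerank = cache.getOrCompute(key, () -> new float[] {1.0f, 0.5f});

    // Logging phase: hit, so the extractor must not run again.
    float[] forLogging = cache.getOrCompute(key, () -> {
      throw new IllegalStateException("should not recompute on a hit");
    });

    System.out.println("same vector reused: " + (forRerank == forLogging));
    System.out.println("lookups=" + cache.lookups()
        + " hits=" + cache.hits() + " inserts=" + cache.inserts());
  }
}
```

The counters mirror the kind of statistics (lookups, hits, inserts) that a cache typically exposes, which is what makes the miss and hit cases distinguishable in a test.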

Tests

Tests have been added in the solr/modules/ltr/src/test/org/apache/solr/ltr/TestFeatureVectorCache.java file.
The tests check for correct cache usage and responses in different scenarios, covering the LTR parameters (logAll, store, efis...) and their defaults.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

@aruggero aruggero force-pushed the feature/SOLR-16667 branch 2 times, most recently from a7e6fe7 to ab63d05 Compare July 24, 2025 07:47
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 24, 2025
Contributor

@alessandrobenedetti alessandrobenedetti left a comment

There are some changes to make and some points to discuss; then I'll review the tests!

@aruggero aruggero force-pushed the feature/SOLR-16667 branch 2 times, most recently from 8994985 to a2ed4f4 Compare July 31, 2025 09:22
@aruggero aruggero force-pushed the feature/SOLR-16667 branch from e4ebcc0 to ca914a8 Compare August 5, 2025 09:25
@github-actions github-actions bot added dependencies Dependency upgrades tool:build labels Aug 5, 2025
@aruggero aruggero marked this pull request as ready for review August 7, 2025 07:27
@github-actions github-actions bot added configs and removed dependencies Dependency upgrades labels Aug 11, 2025
@aruggero
Contributor Author

We have finished the review iterations, and the code is open to further suggestions and revisions. In the meantime, I am running a benchmark to report the performance of the new contribution compared to the current implementation.

@github-actions github-actions bot added the dependencies Dependency upgrades label Oct 15, 2025
Contributor

@alessandrobenedetti alessandrobenedetti left a comment

Aside from some minor points, we are ready to merge; we are just waiting for main to be stable.

@alessandrobenedetti alessandrobenedetti merged commit aeb9063 into apache:main Oct 21, 2025
2 of 4 checks passed
alessandrobenedetti pushed a commit that referenced this pull request Oct 21, 2025
by Anna and Alessandro

(cherry picked from commit aeb9063)
@aruggero
Contributor Author

aruggero commented Oct 21, 2025

As mentioned, here are the results of the benchmark:
[Image: Screenshot 2025-10-21 at 14 08 20 (benchmark results table)]

We compared the current main branch, which contains the FV_cache for LTR, with our contributed cache. In both setups we also ran a test with all caches set to size zero, to make sure query execution is not slowed down when caches are not used at all.

The first column shows the difference in milliseconds between the first query (no cache hit: a miss, then an insert) and the second query (cache hit). Higher is better, since it means the cache has a greater impact on reducing query execution time.

The second column shows the average time in milliseconds taken to execute the first query (the one that does not find any entry: a miss, then an insert).

The third column represents the average time taken to execute the second query in milliseconds (the one that finds an entry - hit).

Finally, the last columns are the cache statistics from the Solr UI.

We executed 10 pairs of queries (the first having a miss, the second a hit in the cache), each retrieving 10000 documents.
We uploaded a feature store of 200 features and used a linear model that exploits these 200 features.
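
For readers who want to reproduce the timing columns, the sketch below shows one way they could be derived from paired measurements. It is an assumption about the procedure, not the actual benchmark harness: the class `CacheBenchmarkSummary` is hypothetical, and the timings in `main` are invented placeholders, not the results reported here.

```java
import java.util.Arrays;

/** Hypothetical helper for summarising miss/hit query pairs; not the real benchmark code. */
final class CacheBenchmarkSummary {

  /** Average of the given timings in milliseconds (NaN for an empty array). */
  static double average(long[] timingsMs) {
    return Arrays.stream(timingsMs).average().orElse(Double.NaN);
  }

  /** Average time saved by a cache hit: mean miss time minus mean hit time. */
  static double averageGain(long[] missMs, long[] hitMs) {
    if (missMs.length != hitMs.length) {
      throw new IllegalArgumentException("each miss query needs a paired hit query");
    }
    return average(missMs) - average(hitMs);
  }

  public static void main(String[] args) {
    // Invented placeholder timings for three query pairs (NOT measured values).
    long[] missMs = {120, 100, 110}; // first query of each pair: miss, then insert
    long[] hitMs = {20, 10, 30};     // second query of each pair: cache hit

    System.out.println("difference (ms): " + averageGain(missMs, hitMs));
    System.out.println("avg miss (ms):   " + average(missMs));
    System.out.println("avg hit (ms):    " + average(hitMs));
  }
}
```

Because the pairs are equal in number, the difference of the two averages equals the average of the per-pair differences, so either reading of the first column gives the same figure.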

All the results are positive. The "miss" query is slightly slower during pure reranking, but its execution time remains comparable.

A detailed blog post about this will follow on sease.io.

alessandrobenedetti pushed a commit that referenced this pull request Oct 21, 2025
by Anna and Alessandro

(cherry picked from commit aeb9063)

Labels

cat:search configs dependencies Dependency upgrades documentation Improvements or additions to documentation module:ltr tests tool:build
