SOLR-7632: Tika module to replace extraction module #3361
…tion

I've implemented a new request handler, TikaServerRequestHandler, that delegates rich document parsing to an external Tika Server instance. This provides an alternative to the existing in-process ExtractingRequestHandler (Solr Cell), offering you better resource isolation and deployment flexibility.

The handler communicates with a configured Tika Server (typically via its /rmeta endpoint) using the Jetty HttpClient. It processes the extracted text and metadata to construct Solr documents.

Key features:
- Configurable Tika Server URL, connection timeouts, and content/metadata field mapping.
- Uses Jetty HttpClient for communication, managed within the Solr core lifecycle.
- Comprehensive unit tests for the handler and document loader.
- New documentation page in the Solr Reference Guide.

This work is based on the proposal in SOLR-7632 to provide an extraction mechanism that relies on an external Tika Server. The module is named 'tika' and the handler class is 'org.apache.solr.handler.tika.TikaServerRequestHandler'.

The implementation initially used Apache HttpClient 5.x and I changed it to use Jetty HttpClient based on your feedback to align better with Solr's existing HTTP client usage.
It got some of the project-specific stuff wrong, so I'll prompt it to fix Gradle, use version catalogs, add license headers, use v12 instead of v11 of jetty-client, etc.
Honestly, this looks like what I would expect... We are just trying to do a bit of redirection: accept a binary doc, forward it to a specific endpoint, take the response from the endpoint, and then index that or return it to the caller. It looks like you have much of what is needed... I've contemplated taking this task on, and it always seemed like a "big deal", and I'm almost embarrassed to see how little code doing this "redirection" takes! While you could tweak your prompt to get to perfection, maybe this is enough? Get it to compile and see if
Adds the standard Apache License header to:
- solr/modules/tika/build.gradle
- solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika-server.adoc

This ensures all new files in the 'tika' module have the required license information.
Fix some deprecations
Suppress forbidden reflection
dsmiley
left a comment
Impressive results!
I'd much prefer to see testcontainers. I don't trust mocks in the slightest when they stand in for an external system. Who knows whether what we mock matches the real thing?
// Apply request-specific timeout using connectionTimeout.
// The HttpClient instance itself is configured with an idleTimeout (derived from
// socketTimeout).
jettyRequest.timeout(this.connectionTimeout, TimeUnit.MILLISECONDS);
It's weird to use the connection timeout as the request timeout. If we want one universal timeout for everything that can time out, then this.connectionTimeout should just be this.timeout.
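One way to read this comment is to keep the two timeouts as separate, clearly named settings. The sketch below illustrates that split using the JDK's built-in java.net.http client rather than the Jetty HttpClient the PR uses, so none of it is the PR's code. The class name TikaTimeouts and its fields are hypothetical, and sending the document with PUT is an assumption about how the Tika Server endpoint is called.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical holder for the two timeouts; the names are illustrative, not from the PR.
final class TikaTimeouts {
  final Duration connectTimeout; // time allowed to establish the connection
  final Duration requestTimeout; // time allowed for the whole request/response exchange

  TikaTimeouts(Duration connectTimeout, Duration requestTimeout) {
    this.connectTimeout = connectTimeout;
    this.requestTimeout = requestTimeout;
  }

  // The connect timeout is a property of the client.
  HttpClient newClient() {
    return HttpClient.newBuilder().connectTimeout(connectTimeout).build();
  }

  // The request timeout is set per request, independently of the connect timeout.
  HttpRequest newRequest(URI tikaUri, byte[] document) {
    return HttpRequest.newBuilder(tikaUri)
        .timeout(requestTimeout)
        .PUT(HttpRequest.BodyPublishers.ofByteArray(document)) // assumption: document is sent via PUT
        .build();
  }
}
```

With this shape, a slow extraction of a large document can be given a generous request timeout while an unreachable server still fails quickly on connect.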
I ran the existing
There currently is NOT a testcontainer for Tika, as far as I know. What do you say @tballison?
Also, I wonder: instead of a whole new module, what if we just folded this into ExtractingRequestHandler.java and kept it in /extraction? Or do we think it's cleaner to remove /extraction and add /tika?
By all means. We could aim for having two "extraction backends" inside the extraction handler, i.e. a pluggable loader, and reuse all the logic already in place for request parsing, arguments etc. And tell people in 9.x that a new experimental tika-server backend is available; just add it to the request handler config.
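A minimal sketch of what such a pluggable backend could look like, assuming a small interface plus a name-based lookup driven by the handler config. The names ExtractionBackend, ExtractionBackends, and the backend keys are all hypothetical; nothing here is taken from the PR or from Solr's existing classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: neither the name nor the method comes from the PR or from Solr.
interface ExtractionBackend {
  /** Returns the extracted fields (metadata plus content) for one binary document. */
  Map<String, String> extract(byte[] document);
}

// Hypothetical registry that picks a backend by the name given in the handler config.
final class ExtractionBackends {
  private final Map<String, ExtractionBackend> byName = new LinkedHashMap<>();

  void register(String name, ExtractionBackend backend) {
    byName.put(name, backend);
  }

  ExtractionBackend forName(String name) {
    ExtractionBackend backend = byName.get(name);
    if (backend == null) {
      throw new IllegalArgumentException(
          "Unknown extraction backend: " + name + ", known: " + byName.keySet());
    }
    return backend;
  }
}
```

The request parsing and argument handling would stay in the handler; only the call that turns bytes into fields would go through the selected backend.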
This PR is generated entirely by the new Google Jules AI coder https://jules.google
Prompt used:
Below is the PR text generated by Jules, referencing the code branch it generated and pushed to my "janhoy" repository. I have not touched that branch.
I've implemented a new request handler, TikaServerRequestHandler, that delegates rich document parsing to an external Tika Server instance. This provides an alternative to the existing in-process ExtractingRequestHandler (Solr Cell), offering you better resource isolation and deployment flexibility.
The handler communicates with a configured Tika Server (typically via its /rmeta endpoint) using the Jetty HttpClient. It processes the extracted text and metadata to construct Solr documents.
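Key features:
- Configurable Tika Server URL, connection timeouts, and content/metadata field mapping.
- Uses Jetty HttpClient for communication, managed within the Solr core lifecycle.
- Comprehensive unit tests for the handler and document loader.
- New documentation page in the Solr Reference Guide.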
This work is based on the proposal in SOLR-7632 to provide an extraction mechanism that relies on an external Tika Server. The module is named 'tika' and the handler class is 'org.apache.solr.handler.tika.TikaServerRequestHandler'.
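The step described above, turning extracted text and metadata into a Solr document, can be sketched roughly as follows. This is not the PR's document loader: the class RmetaFieldMapper and its method are hypothetical, the input is assumed to be one already-parsed metadata object from the /rmeta response, and dropping unmapped metadata keys is an assumption made for brevity. The one detail taken from Tika itself is that /rmeta carries the extracted body text under the "X-TIKA:content" key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not the document loader from the PR.
final class RmetaFieldMapper {
  // Key under which Tika's /rmeta output carries the extracted body text.
  static final String TIKA_CONTENT_KEY = "X-TIKA:content";

  /**
   * Maps one parsed /rmeta metadata object to Solr field names.
   *
   * @param tikaMetadata one metadata object from the /rmeta JSON array, already parsed
   * @param contentField Solr field that receives the extracted text
   * @param fieldMapping Tika metadata key to Solr field name; unmapped keys are dropped (assumption)
   */
  static Map<String, Object> toSolrFields(
      Map<String, Object> tikaMetadata, String contentField, Map<String, String> fieldMapping) {
    Map<String, Object> doc = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : tikaMetadata.entrySet()) {
      if (TIKA_CONTENT_KEY.equals(entry.getKey())) {
        doc.put(contentField, entry.getValue());
      } else {
        String target = fieldMapping.get(entry.getKey());
        if (target != null) {
          doc.put(target, entry.getValue());
        }
      }
    }
    return doc;
  }
}
```

For example, with a mapping of dc:title to title and a content field named content, a metadata object holding dc:title, Content-Type, and X-TIKA:content would yield a document with only the title and content fields.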
Checklist
Please review the following and check all that apply:
main branch. ./gradlew check.