SOLR-7632: Tika module to replace extraction module #3361

janhoy · 2025-05-22T10:18:14Z

This PR is generated entirely by the new Google Jules AI coder https://jules.google

Prompt used:

Read https://issues.apache.org/jira/browse/SOLR-7632 which proposes to deprecate the old "extraction" module, and replace it with an api-compatible new module that instead of parsing rich text documents in-process with Tika, will delegate to an externally running Tika-Server. More discussion can be found in https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd.

Do not focus on deprecating the old module yet. Just make a working PR for the new module. You can assume that the user has provisioned a TikaServer on some URL. When writing tests for the module, a good idea could be to look at the existing tests for "extraction" handler. You can choose whether you mock TikaServer API in thests or spin up a TIkaServer using TestContainers.

The PullRequest should also add reference guide documentation for the new feature.

Please don't use Apache httpClient. Use Jetty httpclient instead, or JDK httpclient. Please name the module "tika"

Below is the PR text generated by Jules, referencing the code branch it generated and pushed to my "janhoy" repository. I have not touched that branch.

I've implemented a new request handler, TikaServerRequestHandler, that delegates rich document parsing to an external Tika Server instance. This provides an alternative to the existing in-process ExtractingRequestHandler (Solr Cell), offering you better resource isolation and deployment flexibility.

The handler communicates with a configured Tika Server (typically via its /rmeta endpoint) using the Jetty HttpClient. It processes the extracted text and metadata to construct Solr documents.

Key features:

Configurable Tika Server URL, connection timeouts, and content/metadata field mapping.
Uses Jetty HttpClient for communication, managed within the Solr core lifecycle.
Comprehensive unit tests for the handler and document loader.
New documentation page in the Solr Reference Guide.

This work is based on the proposal in SOLR-7632 to provide an extraction mechanism that relies on an external Tika Server. The module is named 'tika' and the handler class is 'org.apache.solr.handler.tika.TikaServerRequestHandler'.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

…tion I've implemented a new request handler, TikaServerRequestHandler, that delegates rich document parsing to an external Tika Server instance. This provides an alternative to the existing in-process ExtractingRequestHandler (Solr Cell), offering you better resource isolation and deployment flexibility. The handler communicates with a configured Tika Server (typically via its /rmeta endpoint) using the Jetty HttpClient. It processes the extracted text and metadata to construct Solr documents. Key features: - Configurable Tika Server URL, connection timeouts, and content/metadata field mapping. - Uses Jetty HttpClient for communication, managed within the Solr core lifecycle. - Comprehensive unit tests for the handler and document loader. - New documentation page in the Solr Reference Guide. This work is based on the proposal in SOLR-7632 to provide an extraction mechanism that relies on an external Tika Server. The module is named 'tika' and the handler class is 'org.apache.solr.handler.tika.TikaServerRequestHandler'. The implementation initially used Apache HttpClient 5.x and I changed it to use Jetty HttpClient based on your feedback to align better with Solr's existing HTTP client usage.

janhoy · 2025-05-22T11:06:13Z

It got some of the project-specific stuff wrong, so I'll prompt it to fix gradle, use version catalogs, add license headers, use v12 instead of v11 of jetty-clent etc.

epugh · 2025-05-22T13:54:38Z

Honestly, this looks like what I would expect... We are just trying to do a bit of redirection... Accept a binary doc, forward it to a specific endpoint, take the response from the end point, and then index that/return it to the caller....

It looks like you have much of what is needed... I've contemplated taking this task on, and it always seemed like a "big deal", and I'm almost embarrased to see how little code doing this "redirection" takes!

While you could tweak your prompt to get to perfection, maybe this is enough? Get it to compile and see if test_extraction.bats passes?

Adds the standard Apache License header to: - solr/modules/tika/build.gradle - solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika-server.adoc This ensures all new files in the 'tika' module have the required license information.

Fix some deprecations Supress forbidden reflection

dsmiley

Impressive results!

I'd much prefer to see testcontainers. I don't trust mocks in the slightest mocking some external system. Who knows if what we mock is trustworthy or not?

dsmiley · 2025-06-05T02:59:16Z

solr/modules/tika/src/java/org/apache/solr/handler/tika/TikaServerDocumentLoader.java

+    // Apply request-specific timeout using connectionTimeout.
+    // The HttpClient instance itself is configured with an idleTimeout (derived from
+    // socketTimeout).
+    jettyRequest.timeout(this.connectionTimeout, TimeUnit.MILLISECONDS);


weird to use the connection timeout as the request timeout. If we want one universal timeout for all the things that timeout, then this.connectionTimeout should just be this.timeout.

epugh · 2025-09-18T20:37:39Z

I ran the existing test_extraction.bats just to make sure the old stuff still worked. And fixed the tests. Though they are still very mock!

epugh · 2025-09-18T20:42:48Z

Impressive results!

I'd much prefer to see testcontainers. I don't trust mocks in the slightest mocking some external system. Who knows if what we mock is trustworthy or not?

There currently is NOT a testcontainer for Tika, as far as I know. What do you say @tballison?

epugh · 2025-09-18T20:43:40Z

Also, I wonder if instead of a whole new module, what if we just forced this into the ExtractingRequestHandler.java, and kept it in /extraction? Or do we think it's cleaner to remove /extraction and add /tika?

janhoy · 2025-09-19T11:55:41Z

Also, I wonder if instead of a whole new module, what if we just forced this into the ExtractingRequestHandler.java, and kept it in /extraction? Or do we think it's cleaner to remove /extraction and add /tika?

By all means. We could aim for having to "extraction backends" inside extraction handler, i.e. pluggable loader, and reuse all the logic in place for request parsing, arguments etc. And tell people in 9.x that a new experimental tika-server backend is available, just add to the request handler config loader=org.apache.solr.extraction.loaders.TikaServerDocumentLoader or similar. And in 10.0 it becomes the default as we remove the in-process loader?

github-actions bot added documentation Improvements or additions to documentation tool:build tests labels May 22, 2025

janhoy marked this pull request as draft May 22, 2025 10:18

github-actions bot added the dependencies Dependency upgrades label May 23, 2025

janhoy added 4 commits May 28, 2025 13:47

Manual cleanup of test directory structure and gradle build file

d2a58d1

Package-info and licenses

e81cd00

Unchecked cast and wrong type

9a0066e

Unchecked cast and wrong type

1b2b01b

Fix some deprecations Supress forbidden reflection

github-actions bot added client:solrj cat:api labels May 28, 2025

dsmiley reviewed Jun 5, 2025

View reviewed changes

epugh added 3 commits September 18, 2025 15:36

Merge remote-tracking branch 'upstream/main' into pr/3361

268d5f2

Regenerate lock file to match main.

3acff38

Tweak to make mocks all work.

bb93e30

Suggestions from Intellij on code style

adb2e47

janhoy mentioned this pull request Sep 19, 2025

SOLR-7632 TikaServer as pluggable backend to existing extraction handler #3670

Merged

9 tasks

janhoy closed this Oct 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLR-7632: Tika module to replace extraction module #3361

SOLR-7632: Tika module to replace extraction module #3361

Uh oh!

janhoy commented May 22, 2025

Uh oh!

janhoy commented May 22, 2025

Uh oh!

epugh commented May 22, 2025

Uh oh!

dsmiley left a comment

Uh oh!

dsmiley Jun 5, 2025

Uh oh!

epugh commented Sep 18, 2025

Uh oh!

epugh commented Sep 18, 2025

Uh oh!

epugh commented Sep 18, 2025

Uh oh!

janhoy commented Sep 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

SOLR-7632: Tika module to replace extraction module #3361

SOLR-7632: Tika module to replace extraction module #3361

Uh oh!

Conversation

janhoy commented May 22, 2025

Checklist

Uh oh!

janhoy commented May 22, 2025

Uh oh!

epugh commented May 22, 2025

Uh oh!

dsmiley left a comment

Choose a reason for hiding this comment

Uh oh!

dsmiley Jun 5, 2025

Choose a reason for hiding this comment

Uh oh!

epugh commented Sep 18, 2025

Uh oh!

epugh commented Sep 18, 2025

Uh oh!

epugh commented Sep 18, 2025

Uh oh!

janhoy commented Sep 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants