SOLR-7632 TikaServer as pluggable backend to existing extraction handler #3670

janhoy · 2025-09-19T15:14:00Z

https://issues.apache.org/jira/browse/SOLR-7632

This work builds on the one in #3361, but instead of creating a new module, we add it as a capability of the existing extraction handler, enabled by specifying extraction.backend=tikaserver.

This first required refactoring the extraction handler to detach it from the Tika v1 API. There is a new interface, ExtractionBackend, that takes a generic ExtractionRequest object and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tika v1 API handling. This implementation can be deprecated, and in Solr 10 the tikaserver one can be made the default.
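
To make the shape of that refactoring concrete, here is a minimal sketch of what such a backend contract could look like. It is not the code from this PR: apart from the type names ExtractionBackend, ExtractionRequest and ExtractionResult and the returnType field mentioned in the commits, every field, method name and the EchoExtractionBackend class are assumptions made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/** Hypothetical request bean; the field names are illustrative, not taken from the PR. */
record ExtractionRequest(String resourceName, String contentType, String returnType) {}

/** Hypothetical result bean: extracted content plus multi-valued metadata. */
record ExtractionResult(String content, Map<String, List<String>> metadata) {}

/** Backend contract. Note that no org.apache.tika type appears in the signatures. */
interface ExtractionBackend {
  String name();

  ExtractionResult extract(InputStream input, ExtractionRequest request);
}

/** Toy backend (an assumption) that returns the stream as UTF-8 text, standing in for a real parser. */
final class EchoExtractionBackend implements ExtractionBackend {
  @Override
  public String name() {
    return "echo";
  }

  @Override
  public ExtractionResult extract(InputStream input, ExtractionRequest request) {
    try {
      String text = new String(input.readAllBytes(), StandardCharsets.UTF_8);
      return new ExtractionResult(text, Map.of("resourceName", List.of(request.resourceName())));
    } catch (IOException e) {
      // Kept unchecked only to keep the sketch short; a real backend would likely declare exceptions.
      throw new UncheckedIOException(e);
    }
  }
}
```

With a contract along these lines, the handler only talks to ExtractionBackend, so a local Tika implementation and a TikaServer implementation can be swapped by configuration.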

All existing tests pass, and most of the existing extraction tests now also pass when running the tikaserver backend (running in TestContainers). Unfortunately Docker is not available in Crave, so a new GitHub workflow was added to run only the extraction tests.

TODOs:

- Lots of code from GenAI, which needs review and rewrite / simplification.
- There may be debug prints and TODOs left here and there.
- The metadata back-compat map is AI generated and by no means complete or even correct 🤣
- JavaDoc is lacking everywhere.
- Harden exception handling, retrying, timeout values etc.
- Should probably use the Jetty HTTP client instead of the JDK one.
- Optimize throughput with async methods? The request thread is blocked while waiting for the Tika response.
- Explicit HTTPS / self-signed cert support?
- Complete the RefGuide docs.
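
As an illustration of the timeout and async items above, the following is a rough sketch using the JDK HTTP client. It is not the PR's implementation. The class name TikaServerClientSketch, the method names and the timeout values are assumptions; the `tika` resource path and the Accept values reflect my understanding of Tika Server's REST interface and should be checked against the Tika documentation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

/** Hypothetical client sketch; names and defaults are assumptions, not the PR's code. */
final class TikaServerClientSketch {
  private final HttpClient client;
  private final URI baseUri;
  private final Duration requestTimeout;

  TikaServerClientSketch(String baseUrl, Duration connectTimeout, Duration requestTimeout) {
    // Ensure a trailing slash so that resolve("tika") appends rather than replaces a path segment.
    this.baseUri = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    this.requestTimeout = requestTimeout;
    this.client = HttpClient.newBuilder().connectTimeout(connectTimeout).build();
  }

  /** Builds a PUT request carrying the document bytes, with a per-request timeout. */
  HttpRequest buildRequest(byte[] document, boolean asText) {
    return HttpRequest.newBuilder(baseUri.resolve("tika"))
        .timeout(requestTimeout)
        .header("Accept", asText ? "text/plain" : "text/html")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
        .build();
  }

  /** Sends the request without blocking the calling thread; completes exceptionally on non-200. */
  CompletableFuture<String> extractAsync(byte[] document, boolean asText) {
    return client
        .sendAsync(buildRequest(document, asText), HttpResponse.BodyHandlers.ofString())
        .thenApply(
            response -> {
              if (response.statusCode() != 200) {
                throw new IllegalStateException("Tika Server returned HTTP " + response.statusCode());
              }
              return response.body();
            });
  }
}
```

A caller could chain further work onto the returned future, or call `join()` on it to get the blocking behavior described in the TODO item. Whether async handling actually pays off inside a Solr request thread is an open question that the sketch does not answer.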

epugh · 2025-09-19T17:18:56Z

Exciting!

janhoy · 2025-09-20T01:16:13Z

Status:

- Parses docs using TikaServer
- Can switch between XML (HTML) and text format for the content field
- The choice of backend is randomized for the main test class
- ExtractOnly is not fully implemented for tikaserver; some tests fail

TBD:

- The whole XPath / SAX parsing of the XML response is missing
- We use the JDK HTTP client, but could perhaps use the Jetty client. See the other POC for an example, including making timeouts configurable
- Must make sure that tikaserver.url is only configurable in the request handler config in solrconfig, not as a request parameter (security)
- RefGuide docs, especially how to start TikaServer etc.
- Remove the DummyExtractionBackend
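
The tikaserver.url point matters because a URL taken from a request parameter would let any client make Solr send documents to a host of the client's choosing. One way to enforce the restriction is sketched below. Plain maps stand in for Solr's handler configuration and request parameters, and the class BackendParamGuard and its methods are hypothetical; the PR may solve this differently.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical guard; plain maps stand in for Solr's config and request parameter classes. */
final class BackendParamGuard {
  /** Parameters that may only come from the request handler configuration. */
  static final Set<String> CONFIG_ONLY = Set.of("tikaserver.url");

  /** Throws if the incoming request tries to set a config-only parameter. */
  static void rejectConfigOnlyParams(Map<String, String[]> requestParams) {
    for (String name : CONFIG_ONLY) {
      if (requestParams.containsKey(name)) {
        throw new IllegalArgumentException(
            "Parameter '" + name + "' may only be set in the request handler configuration");
      }
    }
  }

  /** Resolves the Tika Server URL from handler config only, after validating the request. */
  static String resolveUrl(Map<String, String> handlerConfig, Map<String, String[]> requestParams) {
    rejectConfigOnlyParams(requestParams);
    String url = handlerConfig.get("tikaserver.url");
    if (url == null || url.isBlank()) {
      throw new IllegalStateException("tikaserver.url is not configured");
    }
    return url;
  }
}
```

Rejecting the request outright, rather than silently ignoring the parameter, makes a misconfigured or malicious client visible in the response. In real Solr code the handler defaults and request parameters are merged, so the check would need to run against the raw request parameters before that merge.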

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?

epugh · 2025-09-21T00:18:41Z

Any luck with security manager?? I had many difficulties

janhoy · 2025-09-22T14:26:11Z

Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?

iamsanjay · 2025-09-22T14:27:19Z

I had a similar experience when I was upgrading Kafka. And then I stopped.

epugh · 2025-09-22T14:48:38Z

When I first saw DummyExtractionBackend my first thought was that it should be in the test class hierarchy. However, would there be value in keeping it? If you wanted to test your set up in Solr (and not worry about the Tika side), could it be useful for that? "I send a doc and I get something back"....

solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml

janhoy · 2025-10-14T11:54:21Z

Did a series of final improvements based on review feedback from Claude Code: added a new test, hardened input parameter validation, etc.

Planning to merge this to main Wednesday morning CET, hopefully incorporating feedback from @epugh taking it for a spin.

If any of you have a review pending, let me know so I can incorporate that as well.

...ules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java

janhoy · 2025-10-14T13:01:46Z

> Any luck with security manager?? I had many difficulties

Skipping JSM for extraction tests was the easy way out, and given that JSM is going away as soon as you upgrade to JRE 24+, I'm not too worried about this shortcut.

janhoy

To make an even cleaner separation, let's not import any org.apache.tika classes in the base classes that will survive the removal of local backend.

...modules/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java

...dules/extraction/src/java/org/apache/solr/handler/extraction/RegexRulesPasswordProvider.java

.../extraction/src/test/org/apache/solr/handler/extraction/TikaServerExtractionBackendTest.java

...ules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java

epugh · 2025-10-14T19:42:00Z

> To make an even cleaner separation, let's not import any org.apache.tika classes in the base classes that will survive the removal of local backend.

Very astute comment! I am excited about adding tika-pipes as a replacement for local in the future...

janhoy · 2025-10-15T00:07:54Z

...ules/extraction/src/java/org/apache/solr/handler/extraction/fromtika/BodyContentHandler.java

+import org.apache.tika.sax.XHTMLContentHandler;
+import org.apache.tika.sax.xpath.Matcher;
+import org.apache.tika.sax.xpath.MatchingContentHandler;
+import org.apache.tika.sax.xpath.XPathParser;


Here are more imports that depend on Tika. These classes can also be copied into our project, but it starts to get deep, with a total of 18 Java source files if we want to bring it all in.

Seems a bit much for ONE usage in TikaServerExtractionBackend, so I'll leave it in there and defer to later to find / write a replacement for the BodyContentHandler.

…ler (#3670) Co-authored-by: Eric Pugh <epugh@opensourceconnections.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> (cherry picked from commit cca45c7)

…ler (apache#3670) Co-authored-by: Eric Pugh <epugh@opensourceconnections.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> (cherry picked from commit cca45c7) Signed-off-by: Jan Høydahl <jan.git@cominvent.com>

janhoy added 6 commits September 19, 2025 15:15

Introduce ExtractionBackend interface

26bde10

Move some tika tests to new test file

57d8d4e

ExtractingRequestHandler and ExtractingDocumentLoader not depend on T…

dc151c5

…ika API Refactor some tests to LocalTikaExtractionBackendTest

Use a factory to create the backend to keep it DRY

5a19251

Add TikaServerExtractionBackend

35fef11

Change testing to use TestContainers

196dcdc

janhoy marked this pull request as draft September 19, 2025 15:14

github-actions bot added the dependencies, module:extraction, tool:build, and tests labels Sep 19, 2025

janhoy requested a review from epugh September 19, 2025 15:14

Draft docs

11ea400

github-actions bot added the documentation label Sep 19, 2025

Use json response from Tika

a3794ce

Cleanup TestContainer Refactor ExtractionMetadata Add returnType to ExtractionRequest Remove static initializers

janhoy force-pushed the refactor-extraction-handler branch from cc3d43f to a3794ce Compare September 20, 2025 01:24

malliaridis mentioned this pull request Sep 20, 2025

SOLR-17888: Upgrade Apache Tika to 3.2.3 #3674

Closed

10 tasks

Allow testcontainers to read config

cf97169

epugh added 3 commits September 22, 2025 10:34

Disable JSM

87cb45c

Java Security Manager and Testcontainers do not play nicely together. We prefer Testcontainers, so disable JSM

IntelliJ prompted me.. and I couldn't resist.

7ebed82

lint

f25631d

Split test in two sub classes

5aa381f

Add common metadata Adjust some tests with dc:title instead of title Support passwords in TikaServer backend

epugh reviewed Sep 23, 2025

solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml

janhoy added 3 commits October 14, 2025 13:22

Use builder pattern in ExtractionRequest

b2452a3

New test testXPathWithTikaServer()

975ecba

Supress deprecated warnings for now

2443997

janhoy added 2 commits October 14, 2025 13:57

Suppress deprecation warnings in LocalTikaExtractionBackendTest

473be7e

Use Solr's RefCounted class for keeping track of users of shared reso…

60d5e2d

…urces

janhoy commented Oct 14, 2025

...ules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java

Merge branch 'refs/heads/main' into refactor-extraction-handler-clone

62bff71

# Conflicts: # solr/test-framework/src/java/org/apache/solr/SolrIgnoredThreadsFilter.java

janhoy commented Oct 14, 2025

Core extraction classes should not import any Tika classes

23b2359

janhoy added 4 commits October 14, 2025 21:49

Remove jna 5.12 sha file

075c363

Merge branch 'main' into refactor-extraction-handler-clone

acaa6a7

Make TikaServerExtractionBackendTest.java not import tika classes

a8d30c0

Remove yet a tika import, from RegexRulesPasswordProvider

9d5a17a

janhoy commented Oct 15, 2025

janhoy added 5 commits October 15, 2025 02:14

Add doc comments to the copied TIKA files

059b80a

Merge branch 'refs/heads/main' into refactor-extraction-handler-clone

ff452fd

Libs update

ec013f0

Revert BoduContentHandler changes

2235647

Merge branch 'main' into refactor-extraction-handler

99d95ab

janhoy requested a review from anshumg October 15, 2025 21:25

janhoy mentioned this pull request Oct 16, 2025

Rolling upgrade test (BATS, Docker) #3706

Open

Merge branch 'main' into refactor-extraction-handler

eacc932

janhoy merged commit cca45c7 into apache:main Oct 16, 2025
5 checks passed

janhoy deleted the refactor-extraction-handler branch October 16, 2025 14:10
