@janhoy
Contributor

@janhoy janhoy commented Sep 19, 2025

https://issues.apache.org/jira/browse/SOLR-7632

This work builds on #3361, but instead of creating a new module, it adds the capability to the existing extraction handler, enabled by specifying extraction.backend=tikaserver.

This first required refactoring the extraction handler to detach it from the Tika v1 API. There is a new interface, ExtractionBackend, that takes a generic ExtractionRequest object and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tika v1 API handling. This implementation can be deprecated, and in Solr 10 the tikaserver one can be made the default.
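As a rough illustration of the separation described above, here is a minimal sketch of what such a backend abstraction could look like. Only the type names ExtractionBackend, ExtractionRequest and ExtractionResult come from this PR description. The fields, the method signatures and the PlainTextBackend class are assumptions made for the example and are not the actual code in the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class ExtractionBackendSketch {

  /** Hypothetical request bean; the real ExtractionRequest may carry different fields. */
  public record ExtractionRequest(String resourceName, String contentType, boolean wantXml) {}

  /** Hypothetical result bean: extracted content plus flat metadata. */
  public record ExtractionResult(String content, Map<String, String> metadata) {}

  /** The handler codes against this interface and never sees Tika types. */
  public interface ExtractionBackend {
    String name();

    ExtractionResult extract(InputStream input, ExtractionRequest request) throws IOException;
  }

  /** Hypothetical stand-in backend that treats the input as UTF-8 text. */
  public static class PlainTextBackend implements ExtractionBackend {
    @Override
    public String name() {
      return "plaintext";
    }

    @Override
    public ExtractionResult extract(InputStream input, ExtractionRequest request)
        throws IOException {
      String text = new String(input.readAllBytes(), StandardCharsets.UTF_8);
      return new ExtractionResult(text, Map.of("resourceName", request.resourceName()));
    }
  }

  public static void main(String[] args) throws IOException {
    ExtractionBackend backend = new PlainTextBackend();
    InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
    ExtractionResult result =
        backend.extract(in, new ExtractionRequest("a.txt", "text/plain", false));
    System.out.println(backend.name() + ": " + result.content() + " " + result.metadata());
  }
}
```

With a shape like this, a local Tika implementation and a Tika Server implementation could be swapped behind the same interface without the handler importing any Tika classes.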

All existing tests pass, and most of the existing extraction tests now also pass when running the tikaserver backend (running in Testcontainers). Unfortunately, Docker is not available in Crave, so a new GitHub workflow was added that runs only the extraction tests.

TODOs:

  • Lots of code from GenAI, which needs review and rewrite / simplification.
  • There may be debug prints and TODOs left here and there
  • The metadata back-compat map is AI generated and by no means complete or even correct 🤣
  • Lack of JavaDoc everywhere
  • Harden exception handling, retrying, timeout values etc
  • Should probably use Jetty HTTP client instead of JDK one.
  • Optimize throughput with async methods? The request thread is blocked on Tika response
  • Explicit HTTPS / self-signed cert support?
  • Complete the RefGuide docs

@janhoy janhoy marked this pull request as draft September 19, 2025 15:14
@janhoy janhoy requested a review from epugh September 19, 2025 15:14
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 19, 2025
@epugh
Contributor

epugh commented Sep 19, 2025

Exciting!

@janhoy
Contributor Author

janhoy commented Sep 20, 2025

Status:

  • Parses docs using TikaServer
  • Can switch between xml (html) and text format of the content field
  • Randomized the choice of backend for the main test class
  • ExtractOnly not fully implemented for tikaserver, some tests fail
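To make the first two status items concrete, here is a minimal sketch of how a request to Tika Server could be built with the JDK HTTP client, choosing between XHTML and plain-text output through the Accept header. It assumes Tika Server's PUT /tika endpoint and its common default port 9998. The class name, method name and parameters are hypothetical and do not reflect the code in this PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class TikaServerRequestSketch {

  /**
   * Builds a PUT request against the /tika endpoint of a Tika Server.
   * The Accept header selects XHTML (text/html) or plain text (text/plain).
   */
  public static HttpRequest buildRequest(
      URI baseUrl, byte[] document, boolean wantXml, Duration timeout) {
    String base = baseUrl.toString();
    URI target = URI.create(base.endsWith("/") ? base + "tika" : base + "/tika");
    return HttpRequest.newBuilder(target)
        .timeout(timeout) // per-request timeout; the connect timeout is set on the client
        .header("Accept", wantXml ? "text/html" : "text/plain")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
        .build();
  }

  public static void main(String[] args) {
    // Assumed URL; nothing is sent here, the request is only built and printed.
    HttpRequest request =
        buildRequest(
            URI.create("http://localhost:9998"),
            "hello".getBytes(StandardCharsets.UTF_8),
            false,
            Duration.ofSeconds(30));
    System.out.println(request.method() + " " + request.uri());
    // To actually send it against a running Tika Server:
    // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```

Keeping the request construction separate from the sending makes it possible to unit test header and timeout handling without a running container.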

TBD:

  • The whole xpath / SAX parsing of XML response is missing
  • We use the JDK HTTP client but could perhaps use the Jetty client. See the other POC for an example, including how to make timeouts configurable
  • Must make sure that tikaserver.url is only configurable in the request handler config in solrconfig, not as a request parameter (security)
  • RefGuide docs, especially how to start TikaServer etc
  • Remove the DummyExtractionBackend
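The security item above (tikaserver.url must come from the handler configuration, never from the request) could be enforced along these lines. This is a minimal sketch using plain maps in place of Solr's init args and request parameters. The class and method names are hypothetical, and rejecting the request outright, rather than silently ignoring the parameter, is a design choice made for the example.

```java
import java.net.URI;
import java.util.Map;

public class TikaUrlConfigSketch {

  static final String URL_PARAM = "tikaserver.url";

  /**
   * Resolves the Tika Server URL from handler init args only.
   * A request that tries to supply the URL itself is rejected.
   */
  public static URI resolveTikaUrl(
      Map<String, String> initArgs, Map<String, String> requestParams) {
    if (requestParams.containsKey(URL_PARAM)) {
      throw new IllegalArgumentException(
          URL_PARAM + " may only be set in the request handler configuration");
    }
    String configured = initArgs.get(URL_PARAM);
    if (configured == null || configured.isBlank()) {
      throw new IllegalStateException(URL_PARAM + " is not configured");
    }
    URI uri = URI.create(configured.trim());
    String scheme = uri.getScheme();
    if (scheme == null
        || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
      throw new IllegalArgumentException(
          "Unsupported scheme in " + URL_PARAM + ": " + configured);
    }
    return uri;
  }

  public static void main(String[] args) {
    // Assumed example value for the configured URL.
    Map<String, String> initArgs = Map.of(URL_PARAM, "http://localhost:9998");
    System.out.println(resolveTikaUrl(initArgs, Map.of()));
  }
}
```

If the URL were accepted as a request parameter, any client able to reach the handler could make Solr send documents to, or fetch responses from, an arbitrary host.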

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it be valuable to isolate the refactoring in one PR and add the tikaserver implementation in a second one?

Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
@janhoy janhoy force-pushed the refactor-extraction-handler branch from cc3d43f to a3794ce Compare September 20, 2025 01:24
@epugh
Contributor

epugh commented Sep 21, 2025

Any luck with security manager?? I had many difficulties

@epugh

This comment was marked as outdated.

@janhoy
Contributor Author

janhoy commented Sep 22, 2025

Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?

@iamsanjay
Contributor

I had a similar experience when I was upgrading Kafka, and then I stopped.

Java Security Manager and Testcontainers do not play nicely together.  We prefer Testcontainers, so disable JSM
@epugh
Contributor

epugh commented Sep 22, 2025

When I first saw DummyExtractionBackend, my first thought was that it should be in the test class hierarchy. However, would there be value in keeping it? If you wanted to test your setup in Solr (and not worry about the Tika side), could it be useful for that? "I send a doc and I get something back"...

Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
@janhoy
Contributor Author

janhoy commented Oct 14, 2025

Did a series of final improvements based on review feedback from Claude Code: added a new test, hardened input parameter validation, etc.

Planning to merge this to main Wednesday morning CET, hopefully incorporating feedback from @epugh, who is taking it for a spin.

If any of you have a review pending, let me know so I can incorporate that as well.

@janhoy
Contributor Author

janhoy commented Oct 14, 2025

> Any luck with security manager?? I had many difficulties

Skipping JSM for extraction tests was the easy way out, and given that JSM is going away as soon as you upgrade to JRE 24+, I'm not too worried about this shortcut.

# Conflicts:
#	solr/test-framework/src/java/org/apache/solr/SolrIgnoredThreadsFilter.java
Contributor Author

@janhoy janhoy left a comment

To make an even cleaner separation, let's not import any org.apache.tika classes in the base classes that will survive the removal of local backend.

@epugh
Contributor

epugh commented Oct 14, 2025

> To make an even cleaner separation, let's not import any org.apache.tika classes in the base classes that will survive the removal of local backend.

Very astute comment! I am excited about adding tika-pipes as a replacement for local in the future...

import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
Contributor Author

Here are more imports depending on Tika. These classes can also be copied into our project, but it starts to get deep, with a total of 18 Java source files if we want to bring it all in.

[Screenshot 2025-10-15 at 01:59:33]

Seems a bit much for ONE usage in TikaServerExtractionBackend, so I'll leave it in there and defer to later to find / write a replacement for the BodyContentHandler.
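For reference, one possible direction for such a replacement is a small SAX handler built only on JDK classes that collects the character data inside the body element of the XHTML response. The sketch below is an assumption about what a replacement might look like. It is not part of this PR and does not reproduce the XPath matching or other behavior of Tika's BodyContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextHandlerSketch extends DefaultHandler {

  private final StringBuilder text = new StringBuilder();
  private int bodyDepth = 0; // greater than zero while inside <body>

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    // Start counting at <body>, then track nesting of every descendant element.
    if (bodyDepth > 0 || "body".equals(localName)) {
      bodyDepth++;
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (bodyDepth > 0) {
      bodyDepth--;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Text in <head> (for example the title) is skipped.
    if (bodyDepth > 0) {
      text.append(ch, start, length);
    }
  }

  public String getText() {
    return text.toString();
  }

  /** Parses an XHTML string and returns the concatenated text found inside body. */
  public static String extractBody(String xhtml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // so localName is populated
    // Refuse DOCTYPE declarations to avoid external entity resolution.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    BodyTextHandlerSketch handler = new BodyTextHandlerSketch();
    factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
    return handler.getText();
  }

  public static void main(String[] args) throws Exception {
    String xhtml =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>T</title></head>"
            + "<body><p>Hello</p><p>World</p></body></html>";
    System.out.println(extractBody(xhtml));
  }
}
```

This version matches any element named body regardless of namespace and concatenates text without inserting whitespace between elements, so it would need refinement before it could stand in for the Tika class.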

@janhoy janhoy merged commit cca45c7 into apache:main Oct 16, 2025
5 checks passed
@janhoy janhoy deleted the refactor-extraction-handler branch October 16, 2025 14:10
janhoy added a commit that referenced this pull request Oct 16, 2025
…ler (#3670)

Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit cca45c7)
janhoy added a commit to janhoy/solr that referenced this pull request Oct 16, 2025
…ler (apache#3670)

Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

(cherry picked from commit cca45c7)
Signed-off-by: Jan Høydahl <jan.git@cominvent.com>