SOLR-7632 TikaServer as pluggable backend to existing extraction handler #3670
…ika API Refactor some tests to LocalTikaExtractionBackendTest
Exciting!
Status:
TBD:
Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch. Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?
Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
cc3d43f to a3794ce
Any luck with the security manager? I had many difficulties.
Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?
I had a similar experience when I was upgrading Kafka, and then I stopped.
Java Security Manager and Testcontainers do not play nicely together. We prefer Testcontainers, so disable JSM
When I first saw
Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml
Did a series of final improvements based on review feedback from Claude Code: added a new test, hardened input parameter validation, etc. Planning to merge this to main Wednesday morning CET, hopefully incorporating feedback from @epugh taking it for a spin. If any of you have a review pending, let me know so I can incorporate that as well.
...ules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java
Skipping JSM for extraction tests was the easy way out, and given that JSM is going away as soon as you upgrade to JRE 24+, I'm not too worried about this shortcut.
# Conflicts: # solr/test-framework/src/java/org/apache/solr/SolrIgnoredThreadsFilter.java
janhoy
left a comment
To make an even cleaner separation, let's not import any org.apache.tika classes in the base classes that will survive the removal of local backend.
...modules/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java
...dules/extraction/src/java/org/apache/solr/handler/extraction/RegexRulesPasswordProvider.java
.../extraction/src/test/org/apache/solr/handler/extraction/TikaServerExtractionBackendTest.java
...ules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java
Very astute comment! I am excited about adding
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
Here are more imports depending on Tika. These classes can also be copied into our project, but it starts to get deep, with a total of 18 Java source files if we want to bring it all in.
That seems a bit much for ONE usage in TikaServerExtractionBackend, so I'll leave it in there and defer finding or writing a replacement for the BodyContentHandler until later.
…ler (apache#3670)
Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit cca45c7)
Signed-off-by: Jan Høydahl <jan.git@cominvent.com>
https://issues.apache.org/jira/browse/SOLR-7632
This work builds on the one in #3361 but instead of making a new module, we add it as a capability to the existing extraction handler through specifying extraction.backend=tikaserver.

This first required refactoring extraction handler to detach it from the Tika-v1 API. There is a new interface ExtractionBackend that takes generic ExtractionRequest object in and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tika v1 API handling. This implementation can be deprecated, and in Solr 10, the tikaserver one can be made default.

All existing tests pass, and most of the existing extraction tests now also pass when running the tikaserver backend (running in TestContainers). Unfortunately Docker is not available in Crave, so a new GH workflow is made to run only the extraction tests.

TODO's:
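
For readers skimming the description above, the backend split can be pictured roughly as follows. This is a minimal sketch, not the PR's code: only the names ExtractionBackend, ExtractionRequest, ExtractionResult and the extraction.backend=tikaserver value come from the description. Every field, method signature, the stub implementation, and the "local" value (including its use as the default) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: all fields and signatures below are hypothetical,
// not copied from the PR.
class ExtractionBackendSketch {

  /** Generic request: raw document bytes plus a content-type hint (assumed fields). */
  static final class ExtractionRequest {
    final byte[] content;
    final String contentType;

    ExtractionRequest(byte[] content, String contentType) {
      this.content = content;
      this.contentType = contentType;
    }
  }

  /** Result bean: extracted text and flat metadata, with no Tika types exposed. */
  static final class ExtractionResult {
    final String text;
    final Map<String, String> metadata;

    ExtractionResult(String text, Map<String, String> metadata) {
      this.text = text;
      this.metadata = metadata;
    }
  }

  /** The handler would depend only on this interface, never on a parser library. */
  interface ExtractionBackend {
    String name();

    ExtractionResult extract(ExtractionRequest request);
  }

  /** Stand-in backend: decodes the bytes as UTF-8 instead of calling a real parser. */
  static final class StubBackend implements ExtractionBackend {
    private final String name;

    StubBackend(String name) {
      this.name = name;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public ExtractionResult extract(ExtractionRequest request) {
      Map<String, String> metadata = new LinkedHashMap<>();
      metadata.put("Content-Type", request.contentType);
      metadata.put("backend", name);
      String text = new String(request.content, StandardCharsets.UTF_8);
      return new ExtractionResult(text, metadata);
    }
  }

  /** Picks a backend from the extraction.backend value; "local" as default is an assumption. */
  static ExtractionBackend select(String backendParam) {
    String value = (backendParam == null || backendParam.isEmpty()) ? "local" : backendParam;
    switch (value) {
      case "local":
      case "tikaserver":
        return new StubBackend(value);
      default:
        throw new IllegalArgumentException("Unknown extraction.backend: " + value);
    }
  }

  public static void main(String[] args) {
    ExtractionBackend backend = select("tikaserver");
    byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
    ExtractionResult result = backend.extract(new ExtractionRequest(body, "text/plain"));
    System.out.println(backend.name() + " -> " + result.text);
  }
}
```

The point of the shape is that the calling code sees only the interface and the two beans, so swapping the in-process implementation for a remote one would not touch the handler.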