-
Notifications
You must be signed in to change notification settings - Fork 809
SOLR-17888: Upgrade Apache Tika to 3.2.3 #3674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
epugh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the upgrade for 9.... however have you seen the work for yen to use a separate Tika server instead? Move tika "out of process".
The main motivation for SOLR-7632 is to free Solr from all those bloated dependencies. And frankly it is more flexible for users to be able to install, configure and upgrade Tika separately from Solr. It is also safer, less resource hungry Solr, less crash prone, fewer vulnerabilities, easier to scale extraction compute separately from search etc. So my hope is actually to remove the old embedded Tika from 10.0. Which would leave the option of upgrading in 9.x, which breaks back-compat? The Tika-server approach will probably also have some compat breaks though. |
|
@janhoy I agree with and support the motivation of SOLR-7632.
It may be more flexible, but we may not have a good enough adoption rate here. Looking at Zookeeper in Cloud mode, I fear that this may be problematic. I believe (from what I read) that we do not provide an easy and quick enough Solr deployment option that comes with all the "external" systems shipped, especially for production.
I agree with all these points 👍
I believe no matter what action we take, it will in all cases have breaking changes, which is fine for Solr 10, problematic for Solr 9 (but maybe reasonable and acceptable for security reasons?). |
|
Hi, @janhoy: see issue policeman-tools/forbidden-apis#273 and PR policeman-tools/forbidden-apis#274. |
|
@epugh had an idea to make an extraction backend for tika-pipes. It would take a path to some |
|
Update: From 10.0 only tika-core remains used and only in one location. Also it is upgraded to latest. On 9x the old fun with is still there. You may want to er-target this on branch_9x. Tika langid will only need tila-core, not sure what tika-pipes needs. There is a base test class for extraction module that can test that existing unit tests work. |
I tackled that in the I'd not be opposed to announce that a "necessary" breaking change will happen in, say 9.11, due to security risks, and then prepare users for the change. I kept the mapping option hidden, un-documented, since I don't want us to have to support it. But one could offer a user-supplied map |
I think this is reasonable. Upgrading 9x to using Tika 2 or 3 is a huge effort, and the payoff I don't think is there. We have a better path forward with the new pluggable backends, and that is a better route forward. Anyone using Tika needs to anticipate upgrading their codebase anyway for Solr 10. I think documenting these either or both of the alternative approaches is fine. I suspect the vast majority of users of Tika will either NOT upgrade, or jump to Solr 10 directly, which is IMO what they should do! Just the fact that we are moving from Tika 1 to Tika 3 means usrs will want to revalidate everythign anyway, so they won't be able to easily move Solr 9 versions anyway, because we all know that Tika 3 is going to handle documents slightly differently than Tika 1 did, and users will need to test/validate/understand that. |
|
I agree. I think this PR is not relevant anymore and Jan's implementation makes much more sense to follow. Even for Solr 9 it would introduce changes likely unwanted in a minor release update. What makes more sense to me is to provide a transition implementation to make the upgrade to Solr 10's implementation easier, but I don't have the time to prepare something like that. Therefore, I am closing this PR for now as "won't fix". |
|
Thanks @malliaridis .. Looking at this PR was helpful to me in understanding the potential paths forward, so this work was valuable to inform the final decision! Definitely not wasted effort. |
|
Can this PR be re-purposed to upgrading tika to v3 on branch_9x? If not we’ll likely remove local tika from 9.x as well… |
I think my previous statement above still holds. I wouldn't want anyone to go from 9.10 to 9.11 and all of a sudden they get jumped from Tika 1 to Tika 3... I just don't see that big jump being drop in/backwards compatible. They would need to revalidate everything. So either A) stay on 9.10. B) Move to 10 and the modern version. Or C) Get them to add/sponsor/do the tika-pipes version. But for B and C, you would still consider that a major project, not just a small bump to Solr... |
|
Sure. But now that we may end up with removing local tika that is a bigger back company break than upgrading with some metadata incompatibility. So I guess my comment above still holds:
But I don’t know how close this PR is to adapt for 9x, to be a viable alternative to full removal. |
https://issues.apache.org/jira/browse/SOLR-17888
Description
With Apache Tika being strongly outdated, we have several CVEs reported in the extraction and langid modules.
Solution
This PR upgrade Apache Tika to 3.2.3 and some depencies that were included as transitive dependencies with Tika (log4j and commons-io).
Please note that forbidden-api is currently missing the commons-io 2.20.0 signatures and therefore a bypass is added to this PR. Therefore two additional tasks were added (see pending changes below).
The PR introduces breaking changes (therefore backporting should probably be avoided). Apache Tika 2 and 3 standardized the metadata fields, which affect the returned fields. You can see some of the fields that are affected in the changed tests. More can be found in the migration guide of Apache Tika.
Tests
Tests were only updated to work with new Tika version.
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.Pending Changes