@malliaridis
Contributor

https://issues.apache.org/jira/browse/SOLR-17888

Description

With Apache Tika being badly outdated, we have several CVEs reported in the extraction and langid modules.

Solution

This PR upgrades Apache Tika to 3.2.3, along with some dependencies that were pulled in transitively with Tika (log4j and commons-io).

Please note that forbidden-apis currently lacks signatures for commons-io 2.20.0, so this PR adds a bypass. Two additional tasks were added to track this (see Pending Changes below).

The PR introduces breaking changes (therefore backporting should probably be avoided). Apache Tika 2 and 3 standardized the metadata fields, which affect the returned fields. You can see some of the fields that are affected in the changed tests. More can be found in the migration guide of Apache Tika.

Tests

Tests were only updated to work with the new Tika version.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Pending Changes

  • Update transitive dependencies log4j and commons-io before merging this PR
  • Update forbidden-apis before merging
  • Remove suppression rules for forbidden-apis from gradle script

Contributor

@epugh epugh left a comment

Love the upgrade for 9.... however have you seen the work for yen to use a separate Tika server instead? Move tika "out of process".

@malliaridis
Contributor Author

"have you seen the work for yen to use a separate Tika server instead?"

@epugh not sure what you mean. Are you talking about Jan's PR #3670?

@janhoy
Contributor

janhoy commented Sep 20, 2025

"have you seen the work for yen to use a separate Tika server instead?"

@epugh not sure what you mean. Are you talking about Jan's PR #3670?

The main motivation for SOLR-7632 is to free Solr from all those bloated dependencies. And frankly it is more flexible for users to be able to install, configure and upgrade Tika separately from Solr. It is also safer, less resource hungry Solr, less crash prone, fewer vulnerabilities, easier to scale extraction compute separately from search etc. So my hope is actually to remove the old embedded Tika from 10.0. Which would leave the option of upgrading in 9.x, which breaks back-compat? The Tika-server approach will probably also have some compat breaks though.

@malliaridis
Contributor Author

@janhoy I agree with and support the motivation of SOLR-7632.

And frankly it is more flexible for users to be able to install, configure and upgrade Tika separately from Solr

It may be more flexible, but we may not have a good enough adoption rate here. Looking at Zookeeper in Cloud mode, I fear that this may be problematic. I believe (from what I read) that we do not provide an easy and quick enough Solr deployment option that comes with all the "external" systems shipped, especially for production.

It is also safer, less resource hungry Solr, less crash prone, fewer vulnerabilities, easier to scale extraction compute separately from search etc.

I agree with all these points 👍

The Tika-server approach will probably also have some compat breaks though.

I believe no matter what action we take, it will in all cases have breaking changes, which is fine for Solr 10, problematic for Solr 9 (but maybe reasonable and acceptable for security reasons?).

@uschindler
Contributor

Hi,
the next version of forbiddenapis, 3.10, will have support for commons-io v2.20.0. It will also support newer versions automatically by downgrading the signature version to the latest available one. This will prevent build failures like this one, but it will also log a warning when forbiddenapis is executed.

@janhoy: see issue policeman-tools/forbidden-apis#273 and PR policeman-tools/forbidden-apis#274.
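
The fallback described above can be pictured with a small sketch. This is only an illustration of the idea (pick the newest bundled signature version that is not newer than the requested one, and warn when that is not an exact match); it is not forbiddenapis code, and the function names `parse_version` and `resolve_signatures` are hypothetical.

```python
import warnings


def parse_version(text):
    """Turn a dotted version such as '2.20.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def resolve_signatures(requested, available):
    """Pick the newest available signature version not newer than `requested`.

    Hypothetical helper: emits a warning when it has to fall back to an
    older signature version than the one requested.
    """
    wanted = parse_version(requested)
    candidates = [v for v in available if parse_version(v) <= wanted]
    if not candidates:
        raise ValueError(f"no signatures at or below {requested}")
    chosen = max(candidates, key=parse_version)
    if chosen != requested:
        warnings.warn(f"no signatures for {requested}; falling back to {chosen}")
    return chosen
```

With bundled signatures for 2.18.0 and 2.20.0, a request for a newer 2.21.0 would resolve to 2.20.0 and warn, instead of failing the build.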

@janhoy
Contributor

janhoy commented Oct 6, 2025

@epugh had an idea to make an extraction backend for tika-pipes. It would take a path to some tika-uber.jar and be able to spawn new Java processes for each extraction job, and not pollute Solr's class-path. That could be an option for supporting Tika 3 in Solr, also Solr 9. The extra Jar file does not need to be shipped with solr but must be provided by user? It could be an alternative approach here, for those who want a simpler single-node way of running Tika?

@janhoy
Contributor

janhoy commented Oct 23, 2025

Update: From 10.0 only tika-core remains used and only in one location. Also it is upgraded to latest. On 9x the old fun with is still there.

You may want to re-target this on branch_9x. Tika langid will only need tika-core; not sure what tika-pipes needs. There is a base test class for the extraction module that can check that existing unit tests still work.

@janhoy
Contributor

janhoy commented Nov 3, 2025

The PR introduces breaking changes (therefore backporting should probably be avoided). Apache Tika 2 and 3 standardized the metadata fields, which affect the returned fields.

I tackled that in the tikaserver backend by adding a Metadata mapper that, if enabled, will map from e.g. dc.author to Author to please what users might have come to expect in Tika1.x. If you intend to pursue some upgrade in the 9.x line, re-using that class could perhaps make the upgrade somewhat more compatible. But if it is compatible enough to warrant this breaking change in 9.x I don't know.

I'd not be opposed to announce that a "necessary" breaking change will happen in, say 9.11, due to security risks, and then prepare users for the change. I kept the mapping option hidden, un-documented, since I don't want us to have to support it. But one could offer a user-supplied map {"from": "to", "from2": "to2"} where she could tailor this. Or, perhaps that would not be needed since we already have the fmap feature able to map fields, e.g. fmap.dc.author=Author.
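
To make the two options concrete, here is a minimal sketch of a user-supplied rename map and of the same renames expressed as fmap request parameters. The helper names (`map_metadata`, `fmap_params`) and the sample metadata are hypothetical; the only mapping taken from this thread is dc.author to Author, and the sketch does not reflect the actual mapper class in the tikaserver backend.

```python
from urllib.parse import urlencode

# Hypothetical user-supplied map from new metadata names to legacy ones.
LEGACY_NAMES = {"dc.author": "Author"}


def map_metadata(metadata, name_map):
    """Return a copy of `metadata` with keys renamed per `name_map`.

    Keys without an entry in the map pass through unchanged.
    """
    return {name_map.get(key, key): value for key, value in metadata.items()}


def fmap_params(name_map):
    """Express the same renames as fmap.<source>=<target> request parameters."""
    return {f"fmap.{source}": target for source, target in name_map.items()}


# Sample extracted metadata (made up for illustration).
extracted = {"dc.author": "Jane Doe", "Content-Type": "application/pdf"}
mapped = map_metadata(extracted, LEGACY_NAMES)
query = urlencode(fmap_params(LEGACY_NAMES))
```

Here `mapped` carries the author under the legacy name Author, and `query` is the string fmap.dc.author=Author, which a client could append to an extraction request instead of relying on a server-side mapper.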

@epugh
Contributor

epugh commented Nov 3, 2025

The PR introduces breaking changes (therefore backporting should probably be avoided). Apache Tika 2 and 3 standardized the metadata fields, which affect the returned fields.

I tackled that in the tikaserver backend by adding a Metadata mapper that, if enabled, will map from e.g. dc.author to Author to please what users might have come to expect in Tika1.x. If you intend to pursue some upgrade in the 9.x line, re-using that class could perhaps make the upgrade somewhat more compatible. But if it is compatible enough to warrant this breaking change in 9.x I don't know.

I'd not be opposed to announce that a "necessary" breaking change will happen in, say 9.11, due to security risks, and then prepare users for the change. I kept the mapping option hidden, un-documented, since I don't want us to have to support it. But one could offer a user-supplied map {"from": "to", "from2": "to2"} where she could tailor this. Or, perhaps that would not be needed since we already have the fmap feature able to map fields, e.g. fmap.dc.author=Author.

I think this is reasonable. Upgrading 9x to Tika 2 or 3 is a huge effort, and I don't think the payoff is there. The new pluggable backends give us a better path forward.

Anyone using Tika needs to anticipate upgrading their codebase anyway for Solr 10.

I think documenting either or both of the alternative approaches is fine. I suspect the vast majority of Tika users will either NOT upgrade, or jump to Solr 10 directly, which is IMO what they should do! Just the fact that we are moving from Tika 1 to Tika 3 means users will want to revalidate everything anyway. They won't be able to move easily between Solr 9 versions, because we all know that Tika 3 is going to handle documents slightly differently than Tika 1 did, and users will need to test/validate/understand that.

@malliaridis
Contributor Author

I agree. I think this PR is not relevant anymore and Jan's implementation makes much more sense to follow. Even for Solr 9 it would introduce changes likely unwanted in a minor release update. What makes more sense to me is to provide a transition implementation to make the upgrade to Solr 10's implementation easier, but I don't have the time to prepare something like that.

Therefore, I am closing this PR for now as "won't fix".

@malliaridis malliaridis closed this Nov 3, 2025
@epugh
Contributor

epugh commented Nov 3, 2025

Thanks @malliaridis .. Looking at this PR was helpful to me in understanding the potential paths forward, so this work was valuable to inform the final decision! Definitely not wasted effort.

@janhoy
Contributor

janhoy commented Dec 12, 2025

Can this PR be re-purposed to upgrading tika to v3 on branch_9x? If not we’ll likely remove local tika from 9.x as well…

@epugh
Contributor

epugh commented Dec 12, 2025

Can this PR be re-purposed to upgrading tika to v3 on branch_9x? If not we’ll likely remove local tika from 9.x as well…

I think my previous statement above still holds. I wouldn't want anyone to go from 9.10 to 9.11 and all of a sudden they get jumped from Tika 1 to Tika 3... I just don't see that big jump being drop in/backwards compatible. They would need to revalidate everything. So either A) stay on 9.10. B) Move to 10 and the modern version. Or C) Get them to add/sponsor/do the tika-pipes version. But for B and C, you would still consider that a major project, not just a small bump to Solr...

@janhoy
Contributor

janhoy commented Dec 12, 2025

Sure. But now that we may end up removing local tika, that is a bigger back-compat break than upgrading with some metadata incompatibility. So I guess my comment above still holds:

I'd not be opposed to announce that a "necessary" breaking change will happen in, say 9.11, due to security risks, and then prepare users for the change.

But I don’t know how close this PR is to being adapted for 9x, to be a viable alternative to full removal.
