[SPARK-29175][SQL] Make additional remote maven repository in IsolatedClientLoader configurable #25849
Conversation
Review thread on core/src/main/scala/org/apache/spark/internal/config/package.scala (outdated, resolved).
Test build #110989 has finished for PR 25849 at commit
What other repo can you connect to for these dependencies? Are they already in Maven Central too?
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved).
```scala
.doc("The default central repository used for downloading Hive jars " +
  "in IsolatedClientLoader.")
.stringConf
.createWithDefault("https://repo1.maven.org/maven2")
```
We do not need to provide a default value here. Without any value, it will use Maven Central, I think.
In the Hive tests, we can set the conf to some default repo in TestHiveContext? See this link https://www.deps.co/guides/public-maven-repositories/ We can use the Google mirror to avoid being blocked by Maven Central.
If so, then shouldn't this be called something like "additional repo" here and below?
If provided, I think it will try the provided mirror first and then central?
In general, it is weird to hardcode the default mirror.
Yes you are right; see SparkSubmitUtils.buildIvySettings. It's an additional "remote repo", which is used in addition to central. That's why I'm saying this should not be called "central repo", as it won't (and shouldn't) override resolving against Maven Central -- almost nothing would work without that. This looks good if we can clarify the semantics in the naming and description. The default should be "None" here.
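To illustrate the semantics discussed here -- an additional remote repo is consulted in addition to Maven Central, never instead of it -- here is a minimal, self-contained Scala model. This is not Spark's actual resolver code (that lives in SparkSubmitUtils.buildIvySettings); the object and method names are illustrative only.

```scala
// Toy model of the resolver chain discussed above: any additional repo
// is tried first, then Maven Central is always consulted as a fallback.
object ResolverOrderSketch {
  val central = "https://repo1.maven.org/maven2"

  // With no additional repo, only central is used; with one, it is
  // prepended to the chain rather than replacing central.
  def resolverChain(additional: Option[String]): Seq[String] =
    additional.toSeq :+ central

  def main(args: Array[String]): Unit = {
    assert(resolverChain(None) == Seq(central))
    assert(resolverChain(Some("https://mirror.example.org")) ==
      Seq("https://mirror.example.org", central))
    println("resolver chain order ok")
  }
}
```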
That's right. I also tested locally without setting any additional remote repo, and it passes.
Changed the default value and set the config to the Google mirror for the Hive tests in 49ea1cd.
Test build #110996 has finished for PR 25849 at commit
```scala
SparkSubmitUtils.resolveMavenCoordinates(
  hiveArtifacts.mkString(","),
  SparkSubmitUtils.buildIvySettings(
    Some("http://www.datanucleus.org/downloads/maven2"),
```
Interesting. So, with this PR, the side-effect benefit is the removal of the flakiness by default, @xuanyuanking?
@xuanyuanking, if so, could you make a separate JIRA and PR for this line change with the following description?
The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is no longer maintained. This will sometimes cause downloading failure and make hive test cases flaky. End users can also set this config to the central repository they want to access.
Then we can backport your new PR to branch-2.4, too. After that, we can proceed with this PR on top of it. That will be very helpful for our LTS branch, branch-2.4.
Yes, as discussed above, the flakiness occurs when Jenkins is blocked by the Maven Central repo and the additional datanucleus remote repo also doesn't work.
I updated this PR to set the Google mirror as an additional remote repo for the Hive tests in 49ea1cd.
No. What I meant was another PR containing only the following two-line change (excluding all the other stuff in this PR) for master and branch-2.4.

```diff
- Some("http://www.datanucleus.org/downloads/maven2"),
+ Some("https://maven-central.storage-download.googleapis.com/repos/central/data/"),
```

In short, we had better split the new configuration and the datanucleus removal into different PRs.
Maybe not this line; TestHive.scala will be a better place. But we should not have `private[spark] val ADDITIONAL_REMOTE_REPOSITORIES = ...` in that PR.
Thanks for the explanation; the split into two pieces of work is done.
Removing datanucleus via the one-line change is probably the most straightforward way; done in #25915.
```scala
.doc("A comma-delimited string config of the optional additional remote maven mirror " +
  "repositories, this can be used for downloading Hive jars in IsolatedClientLoader.")
.stringConf
.createWithDefault("")
```
Since we're not going to set a default value, can we use createOptional instead?
This is kind of an array option; it's OK to use an empty string as the default, like DISABLED_V2_STREAMING_WRITERS.
Alright, it's at least consistent with other instances. But I think we should strictly use createOptional when an option is optional, to make the code easier to read and consistent.
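As a side note on the empty-string default discussed above: a comma-delimited config created with `createWithDefault("")` is typically read back by splitting on commas and dropping empty entries, so the empty default behaves like an empty list. A self-contained sketch (the object and method names are illustrative, not Spark's actual helpers):

```scala
object RepoListSketch {
  // Parse a comma-delimited repo config. An empty string yields no repos,
  // which is why createWithDefault("") acts like "no additional repos".
  def parseRepoList(conf: String): Seq[String] =
    conf.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    assert(parseRepoList("") == Seq.empty)
    assert(parseRepoList("https://a.example/m2, https://b.example/m2") ==
      Seq("https://a.example/m2", "https://b.example/m2"))
    println("repo list parsing ok")
  }
}
```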
Test build #111046 has finished for PR 25849 at commit
Test build #4881 has finished for PR 25849 at commit
Force-pushed from 49ea1cd to 130def0.
Test build #111285 has finished for PR 25849 at commit
```scala
// Add an additional remote maven mirror repo here to avoid Jenkins being
// blocked by maven central.
.set(SQLConf.ADDITIONAL_REMOTE_REPOSITORIES.key,
  "https://maven-central.storage-download.googleapis.com/repos/central/data/")))
```
shall we make this as the default value of the new config?
Yep, done in 7cc0607.
Force-pushed from 130def0 to 7cc0607.
Two review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved).
```scala
private[spark] val ADDITIONAL_REMOTE_REPOSITORIES =
  ConfigBuilder("spark.sql.additionalRemoteRepositories")
    .doc("A comma-delimited string config of the optional additional remote maven mirror " +
      "repositories, this can be used for downloading Hive jars in IsolatedClientLoader " +
```
this can be used -> this is only used?
I'm slightly worried that this might be confused with --repositories.
Changed in 5bd630c.
Review thread on sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (outdated, resolved).
Review thread on sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (resolved).
Test build #111324 has finished for PR 25849 at commit
Thank you for updating, @xuanyuanking.
Test build #111347 has finished for PR 25849 at commit
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved).
Test build #111384 has finished for PR 25849 at commit
retest this please.
Test build #111419 has finished for PR 25849 at commit
|
dongjoon-hyun left a comment:
+1, LGTM. Merged to master.
Thank you all!
Follow-up commit: …en.additionalRemoteRepositories

What changes were proposed in this pull request?
Rename the config added in #25849 to `spark.sql.maven.additionalRemoteRepositories`.

Why are the changes needed?
Following the advice in [SPARK-29175](https://issues.apache.org/jira/browse/SPARK-29175?focusedCommentId=17021586&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021586), the new name is clearer.

Does this PR introduce any user-facing change?
Yes, the config name changed.

How was this patch tested?
Existing test.

Closes #27339 from xuanyuanking/SPARK-29175.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Added a new config, "spark.sql.additionalRemoteRepositories": a comma-delimited string config of optional additional remote Maven mirror repositories.
Why are the changes needed?
We need to connect to Maven repositories in IsolatedClientLoader for downloading Hive jars; end users can set this config if the default Maven Central repo is unreachable.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UT.
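For reference, a hedged usage sketch of the new config from an end user's perspective. This assumes a Spark build containing this PR; the mirror URL is the Google mirror used in the Hive tests above, and note that a follow-up PR (#27339) later renamed the key to `spark.sql.maven.additionalRemoteRepositories`.

```scala
import org.apache.spark.sql.SparkSession

// Point the Hive-metastore jar download in IsolatedClientLoader at an
// extra mirror, for environments where Maven Central is unreachable.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.additionalRemoteRepositories",
    "https://maven-central.storage-download.googleapis.com/repos/central/data/")
  .enableHiveSupport()
  .getOrCreate()
```

Because the config is comma-delimited, several mirrors can be supplied at once; Maven Central is still consulted regardless.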