[SPARK-29175][SQL] Make additional remote maven repository in IsolatedClientLoader configurable #25849
Conversation
Review thread on core/src/main/scala/org/apache/spark/internal/config/package.scala (outdated, resolved).
Test build #110989 has finished for PR 25849 at commit
What other repo can you connect to for these dependencies? Are they already in Maven Central too?
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved).
```scala
.doc("The default central repository used for downloading Hive jars " +
  "in IsolatedClientLoader.")
.stringConf
.createWithDefault("https://repo1.maven.org/maven2")
```
We do not need to provide a default value here. Without any value, it will use Maven Central, I think.
In the Hive tests, we can set the conf to some default repo in TestHiveContext? See this link https://www.deps.co/guides/public-maven-repositories/ We can use the Google mirror to avoid being blocked by Maven Central.
If so, then shouldn't this be called something like "additional repo" here and below?
If provided, I think it will try the provided mirror first and then central?
In general, it is weird to hardcode the default mirror.
Yes you are right; see SparkSubmitUtils.buildIvySettings. It's an additional "remote repo", which is used in addition to central. That's why I'm saying this should not be called "central repo", as it won't (and shouldn't) override resolving against Maven Central -- almost nothing would work without that. This looks good if we can clarify the semantics in the naming and description. The default should be "None" here.
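To illustrate the semantics discussed here -- an additional remote repo is consulted in addition to Maven Central, never instead of it -- here is a minimal, self-contained Scala model. This is not Spark's actual resolver code (that lives in SparkSubmitUtils.buildIvySettings); the object and method names are illustrative only.

```scala
// Toy model of the resolver chain discussed above: any additional repo
// is tried first, then Maven Central is always consulted as a fallback.
object ResolverOrderSketch {
  val central = "https://repo1.maven.org/maven2"

  // With no additional repo, only central is used; with one, it is
  // prepended to the chain rather than replacing central.
  def resolverChain(additional: Option[String]): Seq[String] =
    additional.toSeq :+ central

  def main(args: Array[String]): Unit = {
    assert(resolverChain(None) == Seq(central))
    assert(resolverChain(Some("https://mirror.example.org")) ==
      Seq("https://mirror.example.org", central))
    println("resolver chain order ok")
  }
}
```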
That's right. I also tested locally without setting any additional remote repo, and it passes.
Changed the default value and set the config to the Google mirror for the Hive tests in 49ea1cd.
Test build #110996 has finished for PR 25849 at commit
```scala
SparkSubmitUtils.resolveMavenCoordinates(
  hiveArtifacts.mkString(","),
  SparkSubmitUtils.buildIvySettings(
    Some("http://www.datanucleus.org/downloads/maven2"),
```
Interesting. So, with this PR, the side-effect benefit is the removal of the flakiness by default, @xuanyuanking?
@xuanyuanking, if so, could you make a separate JIRA and PR for this line change with the following description?
The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is no longer maintained. This will sometimes cause downloading failure and make hive test cases flaky. End users can also set this config to the central repository they want to access.
Then we can backport your new PR to branch-2.4, too. After that, we can proceed with this PR on top of it. That will be very helpful for our LTS branch, branch-2.4.
Yes, as discussed above, the flakiness occurs when Jenkins is blocked by the Maven Central repo and the additional datanucleus remote repo also doesn't work.
I updated this PR to set the Google mirror as an additional remote repo for the Hive tests in 49ea1cd.
No. What I meant was another PR containing only the following two-line change (excluding all the other stuff in this PR) for master and branch-2.4.

```diff
- Some("http://www.datanucleus.org/downloads/maven2"),
+ Some("https://maven-central.storage-download.googleapis.com/repos/central/data/"),
```

In short, we had better split the new configuration and the datanucleus removal into different PRs.
Maybe not this line; TestHive.scala will be a better place. But we should not have `private[spark] val ADDITIONAL_REMOTE_REPOSITORIES = ...` in that PR.
Thanks for the explanation; the split into two pieces of work is done.
Removing datanucleus via the one-line change is probably the most straightforward way; done in #25915.
```scala
.doc("A comma-delimited string config of the optional additional remote maven mirror " +
  "repositories, this can be used for downloading Hive jars in IsolatedClientLoader.")
.stringConf
.createWithDefault("")
```
Since we're not going to set a default value, can we use createOptional instead?
This is kind of an array option; it's OK to use an empty string as the default, like DISABLED_V2_STREAMING_WRITERS.
Alright, it's at least consistent with other instances. But I think we should strictly use createOptional when an option is optional, to make the code easier to read and consistent.
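As a side note on the empty-string default discussed above: a comma-delimited config created with `createWithDefault("")` is typically read back by splitting on commas and dropping empty entries, so the empty default behaves like an empty list. A self-contained sketch (the object and method names are illustrative, not Spark's actual helpers):

```scala
object RepoListSketch {
  // Parse a comma-delimited repo config. An empty string yields no repos,
  // which is why createWithDefault("") acts like "no additional repos".
  def parseRepoList(conf: String): Seq[String] =
    conf.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    assert(parseRepoList("") == Seq.empty)
    assert(parseRepoList("https://a.example/m2, https://b.example/m2") ==
      Seq("https://a.example/m2", "https://b.example/m2"))
    println("repo list parsing ok")
  }
}
```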
Test build #111046 has finished for PR 25849 at commit
Test build #4881 has finished for PR 25849 at commit
Force-pushed from 49ea1cd to 130def0.
Test build #111285 has finished for PR 25849 at commit
```scala
// Add an additional remote maven mirror repo here to avoid Jenkins being
// blocked by maven central.
.set(SQLConf.ADDITIONAL_REMOTE_REPOSITORIES.key,
  "https://maven-central.storage-download.googleapis.com/repos/central/data/")))
```
shall we make this as the default value of the new config?
Yep, done in 7cc0607.
Force-pushed from 130def0 to 7cc0607.
Two review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved).
```scala
private[spark] val ADDITIONAL_REMOTE_REPOSITORIES =
  ConfigBuilder("spark.sql.additionalRemoteRepositories")
    .doc("A comma-delimited string config of the optional additional remote maven mirror " +
      "repositories, this can be used for downloading Hive jars in IsolatedClientLoader " +
```
this can be used -> this is only used?
I'm slightly worried that this might be confused with --repositories.
Changed in 5bd630c.
Review thread on sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (outdated, resolved).
Review thread on sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (resolved).
Test build #111324 has finished for PR 25849 at commit
Thank you for updating, @xuanyuanking.
Test build #111347 has finished for PR 25849 at commit
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved).
Test build #111384 has finished for PR 25849 at commit
retest this please.
Test build #111419 has finished for PR 25849 at commit
|
dongjoon-hyun left a comment:
+1, LGTM. Merged to master.
Thank you all!
Follow-up commit: …en.additionalRemoteRepositories

What changes were proposed in this pull request?
Rename the config added in #25849 to `spark.sql.maven.additionalRemoteRepositories`.

Why are the changes needed?
Following the advice in [SPARK-29175](https://issues.apache.org/jira/browse/SPARK-29175?focusedCommentId=17021586&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021586), the new name is clearer.

Does this PR introduce any user-facing change?
Yes, the config name changed.

How was this patch tested?
Existing test.

Closes #27339 from xuanyuanking/SPARK-29175.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Added a new config, "spark.sql.additionalRemoteRepositories": a comma-delimited string config of optional additional remote Maven mirror repositories.
Why are the changes needed?
We need to connect to Maven repositories in IsolatedClientLoader for downloading Hive jars; end users can set this config if the default Maven Central repo is unreachable.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UT.
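For reference, a hedged usage sketch of the new config from an end user's perspective. This assumes a Spark build containing this PR; the mirror URL is the Google mirror used in the Hive tests above, and note that a follow-up PR (#27339) later renamed the key to `spark.sql.maven.additionalRemoteRepositories`.

```scala
import org.apache.spark.sql.SparkSession

// Point the Hive-metastore jar download in IsolatedClientLoader at an
// extra mirror, for environments where Maven Central is unreachable.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.additionalRemoteRepositories",
    "https://maven-central.storage-download.googleapis.com/repos/central/data/")
  .enableHiveSupport()
  .getOrCreate()
```

Because the config is comma-delimited, several mirrors can be supplied at once; Maven Central is still consulted regardless.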