@xuanyuanking
Member

@xuanyuanking xuanyuanking commented Sep 19, 2019

What changes were proposed in this pull request?

Added a new config, "spark.sql.additionalRemoteRepositories": a comma-delimited string of optional additional remote Maven mirror repositories.

Why are the changes needed?

IsolatedClientLoader needs to connect to Maven repositories to download Hive jars.
End users can set this config if the default Maven Central repo is unreachable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UT.

@xuanyuanking
Member Author

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110989 has finished for PR 25849 at commit 8eac037.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Sep 19, 2019

What other repo can you connect to for these dependencies? Are they already in Maven Central too?

.doc("The default central repository used for downloading Hive jars " +
"in IsolatedClientLoader.")
.stringConf
.createWithDefault("https://repo1.maven.org/maven2")
Member

We do not need to provide a default value here. Without any value, it will use maven central, I think.

In the Hive tests, can we set the conf to some default repo in TestHiveContext? See https://www.deps.co/guides/public-maven-repositories/. We can use the Google mirror to avoid being blocked by Maven Central.

Member

If so, then shouldn't this be called something like "additional repo" here and below?

Member

If provided, I think it will try the provided mirror first and then central?

In general, it is weird to hardcode the default mirror.

Member

Yes you are right; see SparkSubmitUtils.buildIvySettings. It's an additional "remote repo", which is used in addition to central. That's why I'm saying this should not be called "central repo", as it won't (and shouldn't) override resolving against Maven Central -- almost nothing would work without that. This looks good if we can clarify the semantics in the naming and description. The default should be "None" here
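The semantics described above (a comma-delimited list of extra repositories that supplements Maven Central rather than replacing it) can be sketched in plain Scala. This is only an illustration, not the code in SparkSubmitUtils: the object and method names are hypothetical, and the ordering (additional repos first, then central) follows the reviewers' understanding in this thread.

```scala
object RepoListSketch {
  // Maven Central URL as quoted earlier in this thread.
  val MavenCentral = "https://repo1.maven.org/maven2"

  /** Split a comma-delimited config value into trimmed, non-empty URLs. */
  def parseRepos(confValue: String): Seq[String] =
    confValue.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  /**
   * Hypothetical resolution order: additional repos are tried first,
   * and Maven Central is always kept at the end of the list.
   */
  def resolutionOrder(confValue: String): Seq[String] =
    (parseRepos(confValue) :+ MavenCentral).distinct
}
```

With an empty config value the list contains only Maven Central, which matches the point that the new config must never override resolution against central.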

Member Author

@xuanyuanking xuanyuanking Sep 20, 2019

That's right. I also tested locally without setting any additional remote repo, and it passes.
I changed the default value and set the config to the Google mirror for the Hive tests in 49ea1cd.

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110996 has finished for PR 25849 at commit 96b3fe9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkSubmitUtils.resolveMavenCoordinates(
hiveArtifacts.mkString(","),
SparkSubmitUtils.buildIvySettings(
Some("http://www.datanucleus.org/downloads/maven2"),
Member

Interesting. So, with this PR, the side-effect benefit is the removal of the flakiness by default, @xuanyuanking?

Member

@dongjoon-hyun dongjoon-hyun Sep 20, 2019

@xuanyuanking, if so, could you make a separate JIRA and PR for this line change with the following description?

The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is no longer maintained. This sometimes causes download failures and makes the Hive test cases flaky. End users can also set this config to the central repository they want to access.

Then we can backport your new PR to branch-2.4, too. After that, we can proceed with this PR on top of it. That will be very helpful for our LTS branch, branch-2.4.

Member Author

Yes, as discussed above, the flakiness occurs when Jenkins is blocked by the Maven Central repo and the additional datanucleus remote repo does not work either.
I updated this PR to set the Google mirror as an additional remote repo for the Hive tests in 49ea1cd.

Member

@dongjoon-hyun dongjoon-hyun Sep 20, 2019

No. What I meant was another PR like the following two lines (excluding all the other stuff in this PR) for master and branch-2.4.

- Some("http://www.datanucleus.org/downloads/maven2"),
+ Some("https://maven-central.storage-download.googleapis.com/repos/central/data/"),

In short, we had better split the new configuration and the datanucleus removal into different PRs.

Member

Maybe, not this line. TestHive.scala will be a better place. But, we should not have private[spark] val ADDITIONAL_REMOTE_REPOSITORIES =... in that PR.

Member Author

Thanks for the explanation; I have split the work into two PRs.
Removing datanucleus with the one-line change may be the most straightforward way; that is done in #25915.

@xuanyuanking xuanyuanking changed the title from "[SPARK-29175][SQL] Make maven central repository in IsolatedClientLoader configurable" to "[SPARK-29175][SQL] Make additional remote maven repository in IsolatedClientLoader configurable" on Sep 20, 2019

.doc("A comma-delimited string config of the optional additional remote maven mirror " +
"repositories, this can be used for downloading Hive jars in IsolatedClientLoader.")
.stringConf
.createWithDefault("")
Member

If we're not going to set a default, can we use createOptional instead?

Contributor

This is a kind of array option; it's OK to use an empty string as the default, like DISABLED_V2_STREAMING_WRITERS.

Member

Alright, it's at least consistent with other instances. But I think we should strictly use createOptional if the option is optional, at least to make it easier to read and consistent.
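The trade-off between the two styles discussed here can be sketched in plain Scala. This does not use Spark's ConfigBuilder; the object and method names are hypothetical and only model how each style would be turned into the Option[String] that buildIvySettings takes as its first argument in the snippet quoted earlier.

```scala
object OptionalConfSketch {
  // Style A: the value always exists; "" means "no additional repositories".
  def fromDefaulted(value: String): Option[String] =
    Some(value.trim).filter(_.nonEmpty)

  // Style B: absence is modelled directly, as an optional config would do.
  def fromOptional(value: Option[String]): Option[String] =
    value.map(_.trim).filter(_.nonEmpty)
}
```

Both styles end up at the same Option; the difference is whether "unset" is visible in the config's type or encoded as an empty string by convention.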

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111046 has finished for PR 25849 at commit 49ea1cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2019

Test build #4881 has finished for PR 25849 at commit 49ea1cd.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111285 has finished for PR 25849 at commit 130def0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Add additional remote maven mirror repo here for avoiding the jenkins is blocked
// by maven central.
.set(SQLConf.ADDITIONAL_REMOTE_REPOSITORIES.key,
"https://maven-central.storage-download.googleapis.com/repos/central/data/")))
Contributor

Shall we make this the default value of the new config?

Member Author

Yep, done in 7cc0607.

private[spark] val ADDITIONAL_REMOTE_REPOSITORIES =
ConfigBuilder("spark.sql.additionalRemoteRepositories")
.doc("A comma-delimited string config of the optional additional remote maven mirror " +
"repositories, this can be used for downloading Hive jars in IsolatedClientLoader " +
Member

"this can be used" -> "this is only used"?

I'm slightly worried that this might be confused with --repositories.

Member Author

Changed in 5bd630c.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111324 has finished for PR 25849 at commit 7cc0607.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you for updating, @xuanyuanking.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111347 has finished for PR 25849 at commit 5bd630c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111384 has finished for PR 25849 at commit 6e6b87c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111419 has finished for PR 25849 at commit 6e6b87c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master.
Thank you all!

@xuanyuanking xuanyuanking deleted the SPARK-29175 branch September 27, 2019 05:00
dongjoon-hyun pushed a commit that referenced this pull request Jan 23, 2020
…en.additionalRemoteRepositories

### What changes were proposed in this pull request?
Rename the config added in #25849 to `spark.sql.maven.additionalRemoteRepositories`.

### Why are the changes needed?
This follows the advice in [SPARK-29175](https://issues.apache.org/jira/browse/SPARK-29175?focusedCommentId=17021586&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021586); the new name is clearer.

### Does this PR introduce any user-facing change?
Yes, the config name changed.

### How was this patch tested?
Existing test.

Closes #27339 from xuanyuanking/SPARK-29175.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>