[SPARK-16759][CORE] Add a configuration property to pass caller contexts of upstream applications into Spark #15563
Conversation
Can't this just be a normal SparkConf property? Why do we need to specialize the spark-submit arguments for this?
The spark.hadoop prefix will be treated as Hadoop configuration and set into Configuration, so it would probably be better to change to another name.
Thanks for this info. I have changed the property name to "spark.upstreamApp.callerContext".
We could simplify with getOrElse here and below.
upstreamCallerContextStr needs to combine some extra characters with upstreamCallerContext.get, so plain getOrElse is not used here.
`upstreamCallerContext.map("_" + _).getOrElse("")`?
Yes, I have updated the PR to use this.
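For readers following along, a minimal sketch of the Option combination being discussed (the literal values below are made up for illustration):
```
// Illustrative only: combine an optional upstream caller context with Spark's own context.
val upstreamCallerContext: Option[String] = Some("oozie_workflow_123")

// map adds the separator only when a value is present; getOrElse("") covers the None case.
val context = "SPARK_APPID_application_001" +
  upstreamCallerContext.map("_" + _).getOrElse("")
// Some(...) => "SPARK_APPID_application_001_oozie_workflow_123"; None => no suffix at all.
```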
It would be better to define "spark.hadoop.callerContext" in internal/config and use that instead.
Yes. Done.
@rxin Thanks for reviewing this. I have updated the PR so that this is no longer a spark-submit argument.
Force-pushed from 2cdbe4f to 45fccf8.
Hadoop verifies caller contexts passed in here via isContextValid(), and it limits the length.
Hi, @tgravescs Could you please review this PR? Thanks.
I'd prefer to see new parameters added as the last param to the class.
I agree that new parameters should normally be added as the last parameters so that no client code is broken. In this case, though, there are only three callers in total and they are all under my control. The new optional parameter will be used much more frequently than the other optional parameters, so I think I should change the parameter order before the class is used more widely. If I put the new parameter last, callers (including the existing three) would have to pass many “None”s as parameters.
ok
It is possible to use named parameters, but I agree that the Spark-private scope plus the frequency of use here is a mitigating factor.
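A small illustration of the trade-off (the constructor below is hypothetical, not the actual CallerContext signature): with the new parameter last, positional callers pad with Nones, while named arguments avoid the padding.
```
// Hypothetical constructor, for illustration only.
class CtxSketch(
    from: String,
    appId: Option[String] = None,
    appAttemptId: Option[String] = None,
    upstreamCallerContext: Option[String] = None)

// Positional style would force padding with Nones:
val positional = new CtxSketch("TASK", None, None, Some("oozie_123"))
// Named arguments avoid that regardless of where the parameter sits:
val named = new CtxSketch("TASK", upstreamCallerContext = Some("oozie_123"))
```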
Can we log a warning here saying that we truncated, and what the string was truncated from and to?
Yes. Done.
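Roughly, the kind of truncation warning being asked for could look like the sketch below (standalone Scala; the 128-character cap and the use of println instead of Spark's logWarning are assumptions for illustration):
```
object CallerContextTruncation {
  // Assumed cap for illustration; HDFS's audit log enforces its own configurable limit.
  private val MaxLength = 128

  def truncateWithWarning(context: String): String = {
    if (context.length > MaxLength) {
      val truncated = context.take(MaxLength)
      // In Spark this would go through logWarning; println keeps the sketch self-contained.
      println(s"Truncated caller context from ${context.length} to $MaxLength characters: " +
        s"'$context' -> '$truncated'")
      truncated
    } else {
      context
    }
  }
}
```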
docs/running-on-yarn.md (outdated)
This doesn't match the actual config above, spark.upstreamApp.callerContext. How about spark.log.callerContext?
If I'm running Spark in standalone mode with master/worker and reading from HDFS, the caller context would still work on the HDFS side, right? So this isn't just a Spark-on-YARN config; it should move to the general configuration section, with a mention that it applies to YARN/HDFS.
Yes. I have changed the config to spark.log.callerContext and moved the documentation to spark/docs/configuration.md.
`upstreamCallerContext.map("_" + _).getOrElse("")`?
Should prepareContext be done once as part of initialization of the context, or does it need to be done for each invocation of setCurrentContext?
prepareContext needs to be done for each invocation of setCurrentContext.
Why? We set the actual context when we create it and there is no way to change it; set is just calling the Hadoop routine. I don't know that it matters too much right now since we always create and set, but it's better to do.
The reason prepareContext is called in each setCallerContext is that currently every client only needs to set the caller context once, and each CallerContext object only calls setCallerContext once. That said, I agree with your point; preparing the context once will make the implementation better and benefit future invocations.
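A rough sketch of the build-once approach being discussed (illustrative only, not the actual Spark CallerContext class; the println stands in for the call into Hadoop's API):
```
// The context string is prepared once at construction time,
// so setCurrentContext merely hands the finished value to Hadoop.
class CallerContextSketch(from: String, upstreamCallerContext: Option[String] = None) {
  private val context: String =
    "SPARK_" + from + upstreamCallerContext.map("_" + _).getOrElse("")

  def setCurrentContext(): Unit = {
    // The real code would invoke Hadoop's CallerContext API when it is available;
    // this placeholder just prints the value it would set.
    println(s"Would set Hadoop caller context to: $context")
  }
}

// Example: new CallerContextSketch("APPMASTER", Some("oozie_123")).setCurrentContext()
```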
private
Yes. Done.
We can inline upstreamCallerContextStr and remove the variable?
Yes. I have updated the PR to inline the variables.
…ntexts of upstream applications into Spark
Jenkins, test this please
Test build #68340 has finished for PR 15563 at commit
val context = "SPARK_" + from + appIdStr + appAttemptIdStr +
  jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr
val context = "SPARK_" +
make this private
Done.
Please look at the Jenkins failures.
Jenkins, test this please
1 similar comment
Jenkins, test this please
@mridulm @tgravescs It seems Jenkins isn't working?
jenkins, ok to test
Test build #68437 has finished for PR 15563 at commit
+1. @mridulm, do you have any further comments?
Looks good to me.
Merging into master.
Thanks a lot for the review, @tgravescs @mridulm
val context = "SPARK_" + from + appIdStr + appAttemptIdStr +
  jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr
from: String,
this is off?
What do you mean? Could you please elaborate? Thanks.
.booleanConf
  .createWithDefault(false)
private[spark] val APP_CALLER_CONTEXT = ConfigBuilder("spark.log.callerContext")
shouldn't this option be called spark.yarn.log.callerContext?
This is not just for YARN: if Spark apps run in standalone mode with master and workers and read/write from/to HDFS, the caller context would still work on the HDFS side. (PS. We also can't use the spark.hadoop prefix, since that will be treated as Hadoop configuration and set into Configuration.)
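For reference, a rough sketch of why the spark.hadoop. prefix is special (this mirrors the behavior described above; the helper is illustrative, not Spark's actual code):
```
import org.apache.hadoop.conf.Configuration

// Every "spark.hadoop.*" property is copied into the Hadoop Configuration with the prefix
// stripped, which is why reusing that prefix for this feature would also inject the value
// into Hadoop's own configuration.
def copySparkHadoopProps(sparkProps: Map[String, String], hadoopConf: Configuration): Unit = {
  sparkProps.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
  }
}
```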
Thanks for the comment. Would it be a problem if we use spark.hadoop.log.callerContext? I know it gets passed into Configuration, but why would that be a problem? Is it overriding some common configuration in Hadoop?
It shouldn't hurt anything, as I don't think there is a Hadoop config by that name, but if there were, it would conflict. In general I think it is a bad idea for us to do this on purpose; there is no reason to also put it into the Hadoop configuration. I saw this as something that could apply to things other than YARN/HDFS if they supplied the API. For instance, if AWS S3 or any other filesystem provided a similar API, you could use the same config. But I don't know whether any does now.
What is your concern with the current name?
My concern is that this is a very Hadoop-specific thing at the moment, and it is unclear whether other environments will support it, so it'd make more sense to have 'hadoop' in the name. A lot of Spark users don't run Hadoop.
Personally I don't like using spark.hadoop. for this when we know that prefix is already being used, and these configs would then always be added to the Hadoop config when not needed. I guess perhaps we messed up in naming that prefix for putting things into the Hadoop configuration.
I would say we should change spark.hadoop being applied to the Hadoop conf (to something like spark.hadoopConf.), but even though I don't see it documented anywhere, I think it would be too painful to change since I know people are using it.
What about spark.log.hadoop.callerContext? Although perhaps we should set a policy for this in general.
Hi, @rxin If you think spark.log.hadoop.callerContext is ok, I can submit a follow-up PR to rename spark.log.callerContext.
…xts of upstream applications into Spark
## What changes were proposed in this pull request?
Many applications take Spark as a computing engine and run on it. This PR adds a configuration property `spark.log.callerContext` that can be used by Spark's upstream applications (e.g. Oozie) to set up their caller contexts into Spark. In the end, Spark will combine its own caller context with the caller contexts of its upstream applications, and write them into Yarn RM log and HDFS audit log.
The audit log has a config to truncate the caller contexts passed in (default 128 characters). The caller contexts will be sent over RPC, so they should be concise. The caller context written into the HDFS audit log and YARN RM log consists of two parts: the information `A` specified by Spark itself and the value `B` of the `spark.log.callerContext` property. Currently `A` typically takes 64 to 74 characters, so `B` can have up to 50 characters (mentioned in the doc `running-on-yarn.md`).
## How was this patch tested?
Manual tests. I have run some Spark applications with `spark.log.callerContext` configuration in Yarn client/cluster mode, and verified that the caller contexts were written into Yarn RM log and HDFS audit log correctly.
The ways to configure `spark.log.callerContext` property:
- In spark-defaults.conf:
```
spark.log.callerContext infoSpecifiedByUpstreamApp
```
- In app's source code:
```
val spark = SparkSession
.builder
.appName("SparkKMeans")
.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
.getOrCreate()
```
When running in Spark YARN cluster mode, the driver is unable to pass `spark.log.callerContext` to the YARN client and AM, since the YARN client and AM have already started before the driver performs `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`.
The following example shows the command line used to submit a SparkKMeans application and the corresponding records in Yarn RM log and HDFS audit log.
Command:
```
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
```
Yarn RM log:
<img width="1440" alt="screen shot 2016-10-19 at 9 12 03 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547050/7d2f278c-9649-11e6-9df8-8d5ff12609f0.png">
HDFS audit log:
<img width="1400" alt="screen shot 2016-10-19 at 10 18 14 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547102/096060ae-964a-11e6-981a-cb28efd5a058.png">
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes apache#15563 from weiqingy/SPARK-16759.