
Conversation

Contributor

@weiqingy weiqingy commented Oct 20, 2016

What changes were proposed in this pull request?

Many applications use Spark as their computing engine and run on top of it. This PR adds a configuration property, spark.log.callerContext, that Spark's upstream applications (e.g. Oozie) can use to pass their caller contexts into Spark. Spark then combines its own caller context with the caller contexts of its upstream applications and writes the result into the Yarn RM log and the HDFS audit log.

The audit log has a config to truncate caller contexts passed in (default 128). Caller contexts are sent over RPC, so they should be concise. The caller context written into the HDFS audit log and Yarn RM log consists of two parts: the information A specified by Spark itself and the value B of the spark.log.callerContext property. Currently A typically takes 64 to 74 characters, so B can have up to 50 characters (as mentioned in the doc running-on-yarn.md).
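For illustration only, the combination of A and B described above can be sketched as follows (this is not Spark's actual internals; the helper name and the "SPARK_CLIENT_" part-A format are assumptions for the example):

```scala
// Sketch only: Spark builds part A from its own identifiers and appends
// part B (the value of spark.log.callerContext). The "SPARK_CLIENT_" prefix
// is an assumed example, not the exact format Spark emits.
object CombinedContextSketch {
  def combined(appId: String, upstreamContext: Option[String]): String = {
    val a = s"SPARK_CLIENT_$appId"                      // part A (Spark-specified)
    val b = upstreamContext.map("_" + _).getOrElse("")  // part B (upstream app)
    a + b
  }

  def main(args: Array[String]): Unit =
    // → SPARK_CLIENT_application_1478126400000_0001_infoSpecifiedByUpstreamApp
    println(combined("application_1478126400000_0001", Some("infoSpecifiedByUpstreamApp")))
}
```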

How was this patch tested?

Manual tests. I have run some Spark applications with spark.log.callerContext configuration in Yarn client/cluster mode, and verified that the caller contexts were written into Yarn RM log and HDFS audit log correctly.

The ways to configure spark.log.callerContext property:

  • In spark-defaults.conf:
spark.log.callerContext  infoSpecifiedByUpstreamApp
  • In app's source code:
val spark = SparkSession
      .builder
      .appName("SparkKMeans")
      .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
      .getOrCreate()

When running in Yarn cluster mode, the driver cannot pass 'spark.log.callerContext' to the Yarn client and AM, since the Yarn client and AM have already started before the driver executes .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp").
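In that case the property can instead be supplied at submit time, so the Yarn client and AM see it before the driver starts. An illustrative invocation using spark-submit's standard --conf flag (jar path, class, and arguments mirror the example below and are placeholders):

```shell
# Set the caller context at submit time (cluster mode); paths are illustrative.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.log.callerContext=infoSpecifiedByUpstreamApp \
  --class org.apache.spark.examples.SparkKMeans \
  examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  hdfs://localhost:9000/lr_big.txt 2 5
```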

The following example shows the command line used to submit a SparkKMeans application and the corresponding records in Yarn RM log and HDFS audit log.

Command:

./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5

Yarn RM log:

(screenshot, 2016-10-19: Yarn RM log record showing the caller context)

HDFS audit log:

(screenshot, 2016-10-19: HDFS audit log record showing the caller context)

@weiqingy weiqingy changed the title [SPARK-16759][CORE] Add a configuration property to pass in caller contexts of upstream applications into Spark [SPARK-16759][CORE] Add a configuration property to pass in caller contexts of upstream applications to Spark Oct 20, 2016
@weiqingy weiqingy changed the title [SPARK-16759][CORE] Add a configuration property to pass in caller contexts of upstream applications to Spark [SPARK-16759][CORE] Add a configuration property to pass caller contexts of upstream applications into Spark Oct 20, 2016
Contributor

rxin commented Oct 20, 2016

Can't this just be a normal SparkConf setting? Why do we need to specialize the spark-submit arguments for this?

Contributor

The spark.hadoop prefix will be treated as Hadoop configuration and set into Configuration; it would probably be better to change to another name.

Contributor Author

@weiqingy weiqingy Oct 27, 2016

Thanks for this info. I have changed the property name to "spark.upstreamApp.callerContext".

Contributor

We could simplify with getOrElse here and below.

Contributor Author

@weiqingy weiqingy Oct 27, 2016

upstreamCallerContextStr needs to combine some characters/words with upstreamCallerContext.get, so getOrElse is not used here.

Contributor

upstreamCallerContext.map("_" + _).getOrElse("") ?
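The suggested idiom prepends the "_" separator only when the Option is defined, and falls back to the empty string otherwise. A minimal, self-contained sketch ("oozie-wf-123" is an assumed example value):

```scala
object OptionSuffixExample {
  // Prepends "_" only when the upstream caller context is present;
  // yields the empty string otherwise.
  def suffix(upstream: Option[String]): String =
    upstream.map("_" + _).getOrElse("")

  def main(args: Array[String]): Unit = {
    println("SPARK_appId" + suffix(Some("oozie-wf-123"))) // SPARK_appId_oozie-wf-123
    println("SPARK_appId" + suffix(None))                 // SPARK_appId
  }
}
```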

Contributor Author

Yes, I have updated the PR to use this.

Contributor

It would be better to define "spark.hadoop.callerContext" in internal/config and use that instead.

Contributor Author

Yes. Done.

Contributor Author

weiqingy commented Oct 27, 2016

@rxin Thanks for reviewing this. I have updated the PR so that this is no longer a spark-submit argument.

@weiqingy weiqingy force-pushed the SPARK-16759 branch 2 times, most recently from 2cdbe4f to 45fccf8 Compare November 1, 2016 17:47
Contributor Author

weiqingy commented Nov 1, 2016

Hadoop validates caller contexts when they are passed in (see isContextValid()), and it limits their length.
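Since HDFS truncates audit-log caller contexts to hadoop.caller.context.max.size bytes (default 128), a client-side truncation with a warning might look like the following sketch (illustrative only, not Spark's actual code):

```scala
object TruncateSketch {
  // Sketch only: truncate the combined context to the audit-log limit
  // (hadoop.caller.context.max.size, default 128) and note the change.
  def truncate(context: String, maxSize: Int = 128): String =
    if (context.length > maxSize) {
      val truncated = context.substring(0, maxSize)
      println(s"Truncated caller context from ${context.length} to $maxSize characters")
      truncated
    } else {
      context
    }

  def main(args: Array[String]): Unit =
    println(truncate("SPARK_" + "x" * 200).length)
}
```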

Contributor Author

weiqingy commented Nov 1, 2016

Hi, @tgravescs Could you please review this PR? Thanks.

Contributor

I'd prefer to see new parameters added as the last param to the class.

Contributor Author

I agree that new parameters should be added as the last parameters so that no client code is broken. It's just that in this case there are three callers in total, and they are all under my control. The new optional parameter will be used much more frequently than the other optional parameters, so I think I should change the parameter order before the class is used more widely. If I put the new parameter last, users (including the existing three callers) would have to pass many "None"s as parameters.

Contributor

ok

Contributor

@mridulm mridulm Nov 8, 2016

It is possible to use named parameters, but I agree that the Spark-private scope plus the frequency of use here is a mitigating factor.

Contributor

Can we log a warning here saying we truncated, and what the string was truncated from and to?

Contributor Author

Yes. Done.

Contributor

This doesn't match the actual config above, spark.upstreamApp.callerContext. How about: spark.log.callerContext?

If I'm running Spark in standalone mode with master/worker and reading from HDFS, the caller context would still work on the HDFS side, right? So this isn't just a Spark-on-Yarn config; it should move to the general configuration section, but mention that it applies to Yarn/HDFS.

Contributor Author

Yes. I have changed the config to spark.log.callerContext and moved the documentation to spark/docs/configuration.md.

Contributor

upstreamCallerContext.map("_" + _).getOrElse("") ?

Contributor

Should prepareContext be done once as part of initialization of context or should it need to be done for each invocation of setCurrentContext ?

Contributor Author

prepareContext needs to be done for each invocation of setCurrentContext.

Contributor

Why? We set the actual context when we create it, and there is no way to change it; set just calls the Hadoop routine. I don't know that it matters too much right now, since we always create and set, but it's better to do.

Contributor Author

The reason prepareContext is called in each setCurrentContext is that currently every client only needs to set the caller context once, and each CallerContext object only calls setCurrentContext once. Yes, I agree with your point; that will make the implementation better and benefit future invocations.

Contributor

private

Contributor Author

Yes. Done.

Contributor

We can inline upstreamCallerContextStr and remove the variable ?

Contributor Author

Yes. I have updated the PR to inline the variables.

@tgravescs
Contributor

Jenkins, test this please


SparkQA commented Nov 8, 2016

Test build #68340 has finished for PR 15563 at commit fc40cf3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


    val context = "SPARK_" + from + appIdStr + appAttemptIdStr +
      jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr
Contributor

make this private

Contributor Author

Done.

@tgravescs
Contributor

please look at jenkins failures

Contributor

mridulm commented Nov 9, 2016

Jenkins, test this please

@tgravescs
Contributor

Jenkins, test this please

Contributor Author

weiqingy commented Nov 9, 2016

@mridulm @tgravescs It seems Jenkins isn't working?

Contributor

tdas commented Nov 10, 2016

jenkins, ok to test


SparkQA commented Nov 10, 2016

Test build #68437 has finished for PR 15563 at commit e525284.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

+1. @mridulm do you have any further comments?

Contributor

mridulm commented Nov 12, 2016

Looks good to me

Contributor

mridulm commented Nov 12, 2016

Merging into master

@asfgit asfgit closed this in 3af8945 Nov 12, 2016
@weiqingy
Contributor Author

Thanks a lot for the review, @tgravescs @mridulm


    val context = "SPARK_" + from + appIdStr + appAttemptIdStr +
      jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr
    from: String,
Contributor

this is off?

Contributor Author

@weiqingy weiqingy Nov 12, 2016

What do you mean? Could you please elaborate? Thanks.

      .booleanConf
      .createWithDefault(false)

    private[spark] val APP_CALLER_CONTEXT = ConfigBuilder("spark.log.callerContext")
Contributor

shouldn't this option be called spark.yarn.log.callerContext?

Contributor Author

This is not just for Yarn: if Spark apps run in standalone mode with a master and workers and read/write from/to HDFS, the caller context would still work on the HDFS side. (P.S. we also cannot use the spark.hadoop prefix, which would be treated as Hadoop configuration and set into Configuration.)

Contributor

Thanks for the comment. Would it be a problem if we used spark.hadoop.log.callerContext? I know it gets passed into Configuration, but why would that be a problem? Is it overriding some common configuration in Hadoop?

Contributor

It shouldn't hurt anything, as I don't think there is a Hadoop config with that name, but if there were, it would conflict. In general I think it is a bad idea for us to do this purposely; there is no reason to also put it into the Hadoop configuration. I saw this as something that could apply to things other than Yarn/HDFS if they supplied the API. For instance, if AWS S3 or any other filesystem provided a similar API, you could use the same config. But I don't know if any does now.

What is your concern with the current name?

Contributor

My concern is that this is a very Hadoop-specific thing at the moment, and it is unclear whether other environments will support it, so it would make more sense to have hadoop in the name. A lot of Spark users don't run Hadoop.

Contributor

@tgravescs tgravescs Nov 14, 2016

Personally I don't like using the spark.hadoop. prefix for this when we know it's already being used, and these configs would always be added to the Hadoop config when not needed. I guess perhaps we messed up in naming that prefix to put things into the Hadoop configuration.

I would say we should change spark.hadoop being applied to the Hadoop conf (to something like spark.hadoopConf.), but even though I don't see it documented anywhere, I think it would be too painful to change, as I know people are using it.

What about spark.log.hadoop.callerContext? Although perhaps we should set a policy for this in general.

Contributor Author

Hi, @rxin If you think spark.log.hadoop.callerContext is ok, I can submit a follow-up PR to rename spark.log.callerContext.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…xts of upstream applications into Spark


Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes apache#15563 from weiqingy/SPARK-16759.