[SPARK-17802] Improved caller context logging. #15377
Conversation
cc @weiqingy (who worked on SPARK-16757).
Why not just Boolean?
Makes sense. I have simplified the logic.
This was also superfluous; just end the method with the right return value.
done
Just put false at the end of the method like before. Do you really want to log the exception at info level? It looks like a problem when it's not. Nits: Builder -> builder, hadoop -> Hadoop. Don't put a space before the colons, and the new callerContextSupported flag should be private.
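The "end the method with the right return value" nit is about ordinary Scala style: the last expression of a method is its value, so explicit `return`s (or a mutable result variable) are unnecessary. A hedged sketch, using the Hadoop class name from this PR but illustrative method names:

```scala
// Verbose form with explicit returns, as flagged by the reviewer.
def supportedVerbose(): Boolean = {
  try {
    Class.forName("org.apache.hadoop.ipc.CallerContext")
    return true
  } catch {
    case _: ClassNotFoundException => return false
  }
}

// Idiomatic form: the try/catch is itself an expression ending in the
// right value, no `return` needed.
def supportedIdiomatic(): Boolean =
  try {
    Class.forName("org.apache.hadoop.ipc.CallerContext")
    true
  } catch {
    case _: ClassNotFoundException => false
  }
```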
done
Force-pushed from 489295d to 112ddcb.
Test build #66440 has finished for PR 15377 at commit
srowen
left a comment
Also, in the case of another non-fatal error, would you not also consider the context unsupported? It seems more like a warning than an error, because you can complete without it.
I still think you perhaps should not log the exception but just log the message (so that the caller sees what class was not found), like logInfo(s"... or later: ${e.getMessage}").
done
Test build #66443 has finished for PR 15377 at commit
Test build #66447 has finished for PR 15377 at commit
I guess I mean: set callerContextSupported = false here too, right? Or else you're still going to fail over and over and log it if something besides ClassNotFoundException happens. Which is probably a pretty fatal error actually, but hey.
Emm, I prefer not to suppress the non-ClassNotFound errors, because they are real errors (either on the Spark side or on the Hadoop side). The ClassNotFound error, by contrast, just reflects a conditional feature that depends on the Hadoop environment.
There are a few different deployment situations here.
- You are running on Hadoop 2.8 and want the caller context. A failure here is something to mention.
- You are running on Hadoop <= 2.7 and don't want the context or care about it. Here another stack trace is going to be a distraction; if it's not a support call then it gets added to the list of "error messages you learn to ignore". (This is my current state, BTW.)
- You want the caller context, but are running on an incompatible version of Hadoop. Again, here, logging the CNFE makes sense.
The question is: do you need anything if the caller context is disabled? I don't see that you do. And there's a Hadoop config option, hadoop.caller.context.enabled (default false), which controls that.
What about looking for the config option: if it is set, go through the introspection work, reporting problems with stack traces; if unset, don't even bother with the introspection?
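The gating described above can be sketched as follows. This is a hedged illustration, not the merged code: `getBoolean` stands in for Hadoop's `Configuration#getBoolean` (in Spark it would come from `SparkHadoopUtil.get.conf`), and passing it as a function keeps the sketch self-contained:

```scala
import scala.util.Try

// Only when the feature flag is on do we pay for (and potentially log
// about) the reflection probe; with the default (false) nothing happens.
def callerContextSupported(getBoolean: (String, Boolean) => Boolean): Boolean =
  getBoolean("hadoop.caller.context.enabled", false) &&
    Try(Class.forName("org.apache.hadoop.ipc.CallerContext")).isSuccess
```

With the flag unset, `&&` short-circuits and `Class.forName` is never reached, so no ClassNotFoundException can pollute the logs.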
Thanks for the tip about hadoop.caller.context.enabled and the suggestion, I'll do that.
Test build #66449 has finished for PR 15377 at commit
srowen
left a comment
OK, it's a corner case anyway.
Thanks for proposing this PR. LGTM.
@srowen I updated the code to make callerContextSupported a static variable, because the usage pattern of CallerContext is new CallerContext(...).setCurrentContext().
OK, could it even be private[util]?
Makes sense, done.
Test build #66507 has finished for PR 15377 at commit
Test build #66506 has finished for PR 15377 at commit
Test build #66556 has finished for PR 15377 at commit
Test build #66715 has finished for PR 15377 at commit
Test build #66719 has finished for PR 15377 at commit
What is the usage of this flag? I don't see any other place that uses it; they're all just setters.
It's used below, and you have commented on it :)
IIUC, the code will never fall into this branch, since you initialize it as true.
Nope, when there is a ClassNotFound exception, it is set to false.
The usage of CallerContext.callerContextSupported is quite weird to me; it looks like you're implementing a singleton-factory-like pattern to avoid re-calling this method if the class is not found.
Will this introduce a threading issue for the state of CallerContext.callerContextSupported?
Also, is there a side effect to calling setCurrentContext multiple times in one JVM?
I would be in favor of avoiding such usage of a global flag.
Here, it won't matter if it's called multiple times. It just means some extra work is done but the result is the same. The flag is at least local to the class here. I don't know of a better way to record this state because it is global, and properly so, as the things it depends on can't vary within one JVM.
Alright, so the main purpose of this flag is to avoid re-executing the code below. AFAIK, unless we can guarantee sequential calls to this method, callerContextSupported may introduce a concurrency issue, especially in task-running code. Tasks that start running simultaneously may all see the flag as true and execute the code below, which makes the flag useless.
Also, in the AM/Client it will only be called once, so the flag is not used in those JVMs.
Yes, but re-executing the code an extra time doesn't do anything bad except log an additional message. We're trying to prevent it from executing a bunch of times. The flag can only go from true to false, and it keeps executing the same deterministic path until something stops it. Unless you mean this really never executes more than once anyway, I don't think there's a problem right here.
If you cannot fully prevent re-executing this code, and in the worst case all threads will execute the same logic again, is it necessary to add such a flag? To me it seems nondeterministic in a way that will confuse the user (some tasks print the log while others don't).
The point is to avoid a bunch of exception stack traces in the log over and over. I don't think thread-safety is an issue here. If it prints twice instead of once, no big deal. Otherwise this error/warning prints on every task execution. That's not great.
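The pattern under discussion is a benign race: the flag only ever flips from true to false, so the worst case is a handful of duplicate log lines rather than one per task. A hedged sketch (names are illustrative, not the PR's exact code):

```scala
object CallerContextSketch {
  // @volatile so the false value becomes visible to other threads; a race
  // can only cause the probe (and its log line) to run a few extra times.
  @volatile private var supported = true

  def trySetContext(probe: () => Unit): Unit = {
    if (supported) {
      try probe() catch {
        case _: ClassNotFoundException =>
          supported = false // recorded once; later calls skip the probe
      }
    }
  }
}
```

Once any thread records the failure, subsequent calls skip the probe entirely, which is exactly the "log it once, not on every task" behavior being argued for.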
Please use SparkHadoopUtil#conf.
done
Please use Utils#classForName.
done
Nit: the indent is not correct; use 2 spaces.
done
Force-pushed from 884de64 to c42019b.
Remove this comment.
Good catch, done.
Another thing: did you verify it locally? There's no unit test to cover it.
Test build #66886 has finished for PR 15377 at commit
Test build #66885 has finished for PR 15377 at commit
@jerryshao Yeah, I did test it locally to ensure the error is only logged once.
Test build #66891 has finished for PR 15377 at commit
These two values are not used at all; I think you could remove them.
LGTM, just some minor things.
@weiqingy Emm, then we would also need to add the logic of checking "hadoop.caller.context.enabled" in the test code, which would make the test code simply duplicate the code path it is testing. On the other hand, I do agree we should test the code path, but it seems not easy to do. @srowen @jerryshao What do you think?
srowen
left a comment
I see the issue with the tests, yeah, and it's a fair point. My personal opinion is that the test here is sufficient: one would have to reimplement the code in question to test it, we'll soon drop support for versions that don't support this anyway, and the failure this wouldn't catch is not bad.
I don't think this needs to be lazy (?), and it doesn't need to be broken out into methods. It can just be one val initialized directly upfront in one expression.
Test build #67201 has finished for PR 15377 at commit
The above code can be simplified with Try { xxxx }.isSuccess; please check scala.util.Try. Then these two methods can be merged into one expression.
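The Try-based simplification being suggested could look roughly like this. A hedged sketch: the Builder inner-class name is taken from the PR discussion and may not match the final code exactly:

```scala
import scala.util.Try

// A Try is a Failure if any lookup inside it throws, so .isSuccess
// collapses "does this class exist on the classpath?" into one expression.
def classPresent(name: String): Boolean = Try(Class.forName(name)).isSuccess

// Both reflection probes merged into a single boolean expression.
val callerContextSupported: Boolean =
  classPresent("org.apache.hadoop.ipc.CallerContext") &&
    classPresent("org.apache.hadoop.ipc.CallerContext$Builder")
```

Compared with nested try/catch blocks, this keeps the "supported?" computation a single expression suitable for initializing a val.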
Test build #67239 has finished for PR 15377 at commit
srowen
left a comment
The rest seems OK to me.
Let's find a different way to wrap this; it's non-standard and the condition is hard to read. Maybe break out conf as a local val to make it simpler.
I see why you're using Try, though I wonder if we are suppressing potentially important errors this way. For example, ClassNotFoundException obviously means 'not supported' and that's normal. A fatal error should probably just propagate. But a NonFatal error could be logged with a warning, because that would be unexpected. It would still mean the caller context isn't supported rather than cause a failure.
Test build #67368 has finished for PR 15377 at commit
srowen
left a comment
The rest is looking good. Let's keep this simple.
This doesn't need to be a separate member; it can be a local val.
Still, why have more than one member here? Initializing callerContextSupported in one block of code is sufficient and simpler.
Test build #67394 has finished for PR 15377 at commit
Force-pushed from c93bed3 to b57f009.
private[util] object CallerContext extends Logging {
  val callerContextSupported: Boolean = {
    val conf = SparkHadoopUtil.get.conf
    conf.getBoolean("hadoop.caller.context.enabled", false) && {
Although this construct looks funny to me, a more conventional structure would just be a little longer. I'm OK with it.
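The construct in question is the `cond && { ... }` idiom: the block after `&&` is only evaluated when the left side is true, so with the feature flag off the reflection work never runs at all. A stand-alone illustration (all names here are illustrative):

```scala
// Track whether the right-hand block ever executed.
var probed = false
def probe(): Boolean = { probed = true; true }

// With the flag off, && short-circuits and probe() is never called.
val enabled = false
val supported = enabled && probe()
```

This short-circuiting is why the whole check can live in one val initializer without doing wasted work in the common (disabled) case.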
Right, I removed the local conf variable.
Test build #67445 has finished for PR 15377 at commit
Test build #67447 has finished for PR 15377 at commit
Test build #3372 has finished for PR 15377 at commit
@srowen Could we get this merged since the tests are now green? Not sure why it failed previously; it just turned green without me doing anything.
Merged to master.
## What changes were proposed in this pull request? [SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the Hadoop `CallerContext` when calling hadoop/hdfs APIs to make Spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class is only added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially released yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g. [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the Spark logs when there are lots of tasks. This patch improves this behaviour by only logging the `ClassNotFoundException` once. ## How was this patch tested? Existing tests. Author: Shuai Lin <linshuai2012@gmail.com> Closes apache#15377 from lins05/spark-17802-improve-callercontext-logging.
…op.caller.context.enabled` ### What changes were proposed in this pull request? This PR aims to fix `CallerContext` test by enabling `hadoop.caller.context.enabled` during tests. ### Why are the changes needed? This test case was disabled since Apache Spark 2.1.0 because `CallerContext` was supported since Apache Hadoop 2.8+. - #15377 We need to recover `CallerContext` test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #49893 from dongjoon-hyun/SPARK-51164. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>