[SPARK-51095][CORE][SQL] Include caller context for hdfs audit logs for calls from driver #49814
Conversation
Force-pushed from ac288d1 to 7ba89a3
@attilapiros Could you take a look at this PR? There are some test failures, apparently due to Scala style violations, but I couldn't quite figure out what they are, and the modules that seem to be failing don't appear to have anything to do with this PR.
This must be:
The '__' in the PR description suggests the example was created with an older version of the PR, right? I.e. the
cnauroth left a comment
+1 (non-binding) overall. I agree with other comments about removing the debug log line. Thank you, @sririshindra ! This will be useful!
Yes, it was from an older version. I reran the test with the latest version and updated the PR description with the new results.
Resolved review threads (outdated) on sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
pan3793 left a comment
LGTM
Note that due to the recent changes in #49893 and #49898, it should be possible to write a reliable unit test for this. Otherwise, I am +1 (non-binding).
Thank you, @sririshindra .
Force-pushed from 5a85e9a to 2949bc2
@attilapiros, @dongjoon-hyun, @sunchao Can you please review this PR when you get a chance? Thanks.
attilapiros left a comment
lgtm
What changes were proposed in this pull request?
Add the caller context for calls from DRIVER to HDFS.
Why are the changes needed?
HDFS audit logs include the ability to record a "caller context". Spark already leverages this to set the YARN application id, job id, task id, etc., but only on executors; the caller context is left empty on the Spark driver. With the introduction of Iceberg, we have seen multiple scenarios in which files in HDFS are accessed from the driver, and because the caller context is empty there, our ability to forensically analyze any issues has diminished. This PR sets the caller context from the driver as well.
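For context, HDFS exposes this mechanism through Hadoop's org.apache.hadoop.ipc.CallerContext API. The sketch below shows the general mechanism only; the string format ("SPARK_DRIVER_..."), the identifiers, and where the call happens are illustrative assumptions, not the exact code in this PR:

```scala
import org.apache.hadoop.ipc.CallerContext

// Hypothetical identifiers used only for illustration; Spark derives the real
// values from the running application (application id, attempt id, etc.).
val appId = "application_1700000000000_0042"
val attemptId = "1"

// Build a caller context string and attach it to the current thread. HDFS RPCs
// issued afterwards from this thread carry the context, and the NameNode
// records it in the audit log as callerContext=...
val context = new CallerContext.Builder(s"SPARK_DRIVER_${appId}_${attemptId}").build()
CallerContext.setCurrent(context)
```

Spark's existing executor-side code builds a similar string from the application, job, stage, and task ids; this change applies the same idea on the driver.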
Does this PR introduce any user-facing change?
Yes, HDFS audit logs will now include the caller context for calls made from the driver.
How was this patch tested?
This patch was tested manually. After this change, the HDFS audit logs contain the caller context for calls made from the driver.
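For illustration only, a driver-side HDFS access would then show up in the NameNode audit log roughly as below (the user, IP, path, and exact callerContext string are placeholders, and the real entry is a single line):

```
2025-02-05 10:15:42,123 INFO FSNamesystem.audit: allowed=true ugi=spark (auth:SIMPLE)
    ip=/10.0.0.12 cmd=getfileinfo src=/warehouse/db/tbl/metadata/v1.metadata.json
    dst=null perm=null proto=rpc callerContext=SPARK_DRIVER_application_1700000000000_0042
```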
Was this patch authored or co-authored using generative AI tooling?
No