@pan3793
Member

@pan3793 pan3793 commented Oct 23, 2022

What changes were proposed in this pull request?

Provide a flexible way, on K8s, to configure external log service links (patterns) and attributes for the Driver and Executors through env vars, on both the live Spark UI and the Spark History Server (SHS).

The full design doc is https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing

  1. Expose general attributes on K8s, for both Driver and Executor, which can be referenced in log URL patterns and will be persisted into the event log. My proposed generic attributes are
  • APP_ID
  • KUBERNETES_POD_NAME
  • KUBERNETES_NAMESPACE
  2. Allow using env vars to add custom log URLs and attributes, for both Driver and Executor.
  • Driver log URL: env vars w/ prefix SPARK_DRIVER_LOG_URL_
  • Driver attribute: env vars w/ prefix SPARK_DRIVER_ATTRIBUTE_
  • Executor log URL: env vars w/ prefix SPARK_LOG_URL_
  • Executor attribute: env vars w/ prefix SPARK_EXECUTOR_ATTRIBUTE_
  3. Always do log URL replacement for the Driver before sending SparkListenerApplicationStart into the LiveListenerBus, so that the Driver has the same log URL replacement ability on the live UI as the Executor does.
  4. Always do log URL replacement for the Executor:
  • if spark.ui.custom.executor.log.url is provided, use it as-is;
  • otherwise, use the value of the log URL as the pattern, in case the user-provided log URL refers to the attributes.
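
The log URL replacement described in the list above can be illustrated with a minimal sketch. This is not the PR's implementation: `LogUrlPatternSketch` and `resolve` are hypothetical names, and leaving unknown `{{...}}` placeholders untouched is an assumption made for the example. The attribute values are taken from the local-mode test further down.

```scala
import scala.util.matching.Regex

// Hypothetical helper, for illustration only (not code from this PR).
object LogUrlPatternSketch {
  // Matches placeholders such as {{KUBERNETES_POD_NAME}}.
  private val Placeholder: Regex = """\{\{([A-Za-z0-9_]+)\}\}""".r

  /** Replace each {{NAME}} in `pattern` with attributes(NAME).
    * Assumption: placeholders without a matching attribute are left unchanged. */
  def resolve(pattern: String, attributes: Map[String, String]): String =
    Placeholder.replaceAllIn(pattern, m =>
      // quoteReplacement keeps '$' and '\' in attribute values literal.
      Regex.quoteReplacement(attributes.getOrElse(m.group(1), m.matched)))

  def main(args: Array[String]): Unit = {
    val attributes = Map(
      "KIBANA_SVC" -> "https://kibana-svc/spark",
      "KUBERNETES_POD_NAME" -> "spark-pod-123")
    // Prints: https://kibana-svc/spark/spark-pod-123
    println(resolve("{{KIBANA_SVC}}/{{KUBERNETES_POD_NAME}}", attributes))
  }
}
```

Because a URL that contains no placeholders passes through unchanged, the same function can be applied unconditionally, which is the spirit of "always do log URL replacement".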

Why are the changes needed?

Currently, there is no out-of-the-box log solution for Spark on K8s.

For Spark on Yarn, Spark provides stdout/stderr log links on the Spark UI for the Driver and each Executor, which redirect to the Yarn log pages. For resource managers that do not provide out-of-the-box log services, like K8s, Spark has no log links on the Spark UI.

Does this PR introduce any user-facing change?

Yes, users who deploy Spark on K8s can add custom log links to the Spark UI through configuration.

How was this patch tested?

Local mode

build/sbt clean package
export SPARK_DRIVER_ATTRIBUTE_KIBANA_SVC=https://kibana-svc/spark
export SPARK_DRIVER_ATTRIBUTE_S3_ARCHIVE_SVC=https://log-archive
export SPARK_DRIVER_ATTRIBUTE_CORE_DUMP_SVC=https://core_dump
export SPARK_DRIVER_ATTRIBUTE_KUBERNETES_POD_NAME=spark-pod-123
export SPARK_DRIVER_LOG_URL_kibana={{KIBANA_SVC}}/{{KUBERNETES_POD_NAME}}
export SPARK_DRIVER_LOG_URL_archive={{S3_ARCHIVE_SVC}}/{{KUBERNETES_POD_NAME}}
export SPARK_DRIVER_LOG_URL_core_dump={{CORE_DUMP_SVC}}/{{KUBERNETES_POD_NAME}}
SPARK_PREPEND_CLASSES=true bin/spark-shell

K8s mode

An online Spark on K8s application using Kibana as the external log service w/ the following configurations.

spark.kubernetes.driverEnv.SPARK_DRIVER_LOG_URL_kibana      http://logsearch.xxxxxxxxxxxxxxxxxxxxxxxxxxx-log-es-online/app/discover#/?_a=(index:'{{ES_INDEX}}',query:(language:kuery,query:'podName:%22{{KUBERNETES_POD_NAME}}%22%20'))&_g=(time:(from:now-1d,to:now))
spark.kubernetes.driverEnv.SPARK_DRIVER_ATTRIBUTE_ES_INDEX  667e2c80-46c5-xxxxxxxxxxxxxxxxxxx
spark.executorEnv.SPARK_LOG_URL_kibana                      http://logsearch.xxxxxxxxxxxxxxxxxxxxxxxxxxx-log-es-online/app/discover#/?_a=(index:'{{ES_INDEX}}',query:(language:kuery,query:'podName:%22{{KUBERNETES_POD_NAME}}%22%20'))&_g=(time:(from:now-1d,to:now))
spark.executorEnv.SPARK_EXECUTOR_ATTRIBUTE_ES_INDEX         667e2c80-46c5-xxxxxxxxxxxxxxxxxxx

@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Oct 24, 2022

+CC @tgravescs Since you have more context on this from yarn pov than I do

@pan3793
Member Author

pan3793 commented Oct 25, 2022

@tgravescs this one is the real solution based on the feedback of #38205, would you please take a look? And should I extend the solution to Yarn or not?

Member

This logic looks duplicated in both getDriverLogUrls and getDriverAttributes, apart from the variable names. Could you try to refactor more?

Member Author

@pan3793 pan3793 Dec 21, 2022

There are a few differences, e.g. toLowerCase vs. toUpperCase; this follows CoarseGrainedExecutorBackend#extractLogUrls and CoarseGrainedExecutorBackend#extractAttributes.
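
To make the difference concrete, here is a minimal sketch of how the two extraction paths could share one helper while keeping their different key normalization. The `EnvExtractionSketch` object and `extractByPrefix` helper are hypothetical and not part of this PR. The environment is passed in as a plain map (an assumption, so the example runs standalone), and the env var prefixes are the ones listed in the PR description.

```scala
import java.util.Locale

// Hypothetical refactoring sketch, for illustration only (not code from this PR).
object EnvExtractionSketch {
  /** Select env vars starting with `prefix`, strip the prefix,
    * and normalize the remaining part of the key. */
  def extractByPrefix(
      env: Map[String, String],
      prefix: String,
      normalizeKey: String => String): Map[String, String] =
    env.collect {
      case (key, value) if key.startsWith(prefix) =>
        normalizeKey(key.substring(prefix.length)) -> value
    }

  // Log URL names are lower-cased.
  def driverLogUrls(env: Map[String, String]): Map[String, String] =
    extractByPrefix(env, "SPARK_DRIVER_LOG_URL_", _.toLowerCase(Locale.ROOT))

  // Attribute names are upper-cased.
  def driverAttributes(env: Map[String, String]): Map[String, String] =
    extractByPrefix(env, "SPARK_DRIVER_ATTRIBUTE_", _.toUpperCase(Locale.ROOT))
}
```

Calling `EnvExtractionSketch.driverLogUrls(sys.env)` would apply the helper to the real process environment. Whether such a helper is worth the extra indirection is a judgment call, since the two existing methods are short.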

Member

@dongjoon-hyun dongjoon-hyun Dec 17, 2022

Instead of creating a new class KubernetesCoarseGrainedExecutorBackend, can we simply do the following?

         new CoarseGrainedExecutorBackend(rpcEnv, arguments.driverUrl, execId,
         arguments.bindAddress, arguments.hostname, arguments.cores,
-        env, arguments.resourcesFileOpt, resourceProfile)
+        env, arguments.resourcesFileOpt, resourceProfile) {
+          override def getDriverAttributes: Option[Map[String, String]] = Some(
+            super.getDriverAttributes.getOrElse(Map.empty) ++ Map(
+              "APP_ID" -> System.getenv(ENV_APPLICATION_ID),
+              "KUBERNETES_NAMESPACE" -> conf.get(KUBERNETES_NAMESPACE),
+              "KUBERNETES_POD_NAME" -> System.getenv(ENV_DRIVER_POD_NAME)))
+        }

Then, we can remove the following file from this PR.

  • resource-managers/kubernetes/core/src/main/scala/org/apache/spark/executor/KubernetesCoarseGrainedExecutorBackend.scala

Member

@dongjoon-hyun dongjoon-hyun Dec 17, 2022

We had better put this line before ENV_DRIVER_BIND_ADDRESS.

BTW, could you spin off the KUBERNETES_POD_NAME and SPARK_DRIVER_POD_NAME addition to a new PR, @pan3793?

Member Author

Sure, opened #39160 for this.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 17, 2022

Hi, @pan3793. I'm interested in this PR for Apache Spark 3.4.0. Although we have one month until Feature Freeze, the holiday season will block us heavily. To make progress on this topic, I believe we need to split this PR in order to discuss and test more. I propose the following first. Please let me know if you can revise.

  • Spin off the pod env variable contributions (SPARK_DRIVER_POD_NAME).
  • Reduce the code change (e.g. avoid a whole new class like KubernetesCoarseGrainedExecutorBackend).

@pan3793
Member Author

pan3793 commented Dec 21, 2022

Thanks @dongjoon-hyun, I updated the code based on your comments. Sorry for the late reply; my team is suffering from COVID-19 😢

@pan3793
Member Author

pan3793 commented Mar 24, 2023

Update PR state.

Currently, the PR is stuck on the question "Is it good to let a 3rd-party log service use the POD NAME to access the Driver log?"

In #39160 (review), @dongjoon-hyun raised a concern:

While reviewing this design again in detail, I have a concern. Currently, Apache Spark uses K8s Service entity via DriverServiceFeatureStep to access Spark driver pod in K8s environment. The proposed design is a kind of exception. Do you think you can revise this log service design to use Driver Service instead?

Yes, it's an exception, but I think it may be the right direction, because:

  1. My vision is to expose both the driver and executors to the log service in a unified way, and aggregating logs by Pod is much more straightforward, just as Yarn does by container. So my first candidate is Pod Name, and the second is Pod IP.
  2. I found that apple/batch-processing-gateway uses Pod Name to fetch the log [1]
  3. GoogleCloudPlatform/spark-on-k8s-operator also uses Pod Name to fetch driver and executor logs [2]
  4. The design does not force the use of POD_NAME or SVC_NAME as the criterion for accessing driver/executor logs; it depends entirely on how the external log service aggregates logs

Would like to hear more thoughts from the community, cc @holdenk @mridulm @yaooqinn @Yikun @attilapiros @jzhuge @LuciferYang @cxzl25, thanks!

[1] https://github.com/apple/batch-processing-gateway/blob/main/src/main/java/com/apple/spark/rest/ApplicationGetLogRest.java#L327
[2] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/sparkctl/cmd/log.go#L89

@github-actions

github-actions bot commented Jul 3, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
