
Create Kubernetes peon lifecycle task log persist timeout#18444

Merged
capistrant merged 4 commits into apache:master from capistrant:KubernetesPeonLifecycle-saveLogsTimeout
Sep 3, 2025

Conversation

@capistrant (Contributor) commented Aug 27, 2025

Description

Prevents saveLogs() from hanging indefinitely when there are fabric8 issues. See fabric8io/kubernetes-client#7163 for extra context on how LogWatch processing can block indefinitely. Situations like this are suspected causes of overlord graceful shutdowns failing to complete. The upstream kubernetes-client fix associated with the linked issue is preferable to my implementation, but since that fix is unreleased, it is not clear when we could integrate it into Druid.

Release note

Introduces a task log saving timeout for the kubernetes-overlord-extensions mm-less ingestion framework. Persisting task logs no longer has the potential to block indefinitely. Instead, there is a time limit for persisting logs; if it is breached, the persist is abandoned. The default timeout is 5 minutes, and it can be configured by overriding druid.indexer.runner.logSaveTimeout with a valid Duration (e.g. PT60S).
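For illustration, Druid Duration-typed configs like the one above take ISO-8601-style strings. A minimal sketch using java.time.Duration (the class name here is hypothetical and not part of the PR; Druid's own config parsing may differ in detail):

```java
import java.time.Duration;

public class LogSaveTimeoutConfig
{
  public static void main(String[] args)
  {
    // druid.indexer.runner.logSaveTimeout takes an ISO-8601 duration string.
    long defaultMs = Duration.parse("PT5M").toMillis();   // default: 5 minutes
    long overrideMs = Duration.parse("PT60S").toMillis(); // example override from the release note
    System.out.println(defaultMs + " " + overrideMs);     // prints: 300000 60000
  }
}
```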


Key changed/added classes in this PR
  • KubernetesPeonLifecycle.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


protected void saveLogs()
{
ExecutorService executor = Executors.newSingleThreadExecutor();
Contributor:

Does it make sense to expose the timeouts here
https://github.com/fabric8io/kubernetes-client?tab=readme-ov-file#configuring-the-client
rather than doing a timeout via an exec service?

Contributor Author:

> Does it make sense to expose the timeouts here
> https://github.com/fabric8io/kubernetes-client?tab=readme-ov-file#configuring-the-client
> rather than doing a timeout via an exec service?

fabric8io/kubernetes-client#7163

From talking with @kgyrtkirk, I was under the impression that the reason this log read can hang is the above bug. I don't think any existing timeouts can fix that. But we could also expose all the timeouts that fabric8 supplies as part of this PR?

Contributor:

I see, thanks for clarifying.

Member:

this seems to create a new executor for every call?
that looks a bit odd

I don't know how many additional patches will be needed to make this fabric8+vertx stuff work correctly.
Would it be hard to add a knob which could restore the okhttp usage?
If we are about to handle the timeouts ourselves - is fabric8 the right choice of library?
I was taking a quick glance at https://github.com/kubernetes-client/java and it seemed more straightforward...

Contributor Author:

@kgyrtkirk, you raise good points. The only reason I'm proposing this is that the fabric8 bug fix is not available in any release. I'd like to remove this workaround ASAP once fabric8 publishes a release or patch release with the fix. Maybe we need to take a step back and consider making the client pluggable, allowing operators to migrate between clients. vertx has solved a lot of problems for some use cases but caused other problems.

Contributor Author (@capistrant, Sep 3, 2025):

> this seems to create a new executor for every call? that looks a bit odd

It's a new executor per task; saveLogs is called once, when the task completes.
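The pattern under discussion can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are mine, not from the PR) of bounding a blocking operation with a single-thread executor and Future.get, including the named thread requested later in review:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LogSaveTimeoutSketch
{
  // Runs work on a dedicated thread; returns true if it finished within
  // timeoutMs, false if the wait timed out and the work was abandoned.
  static boolean runWithTimeout(Runnable work, long timeoutMs)
  {
    ExecutorService executor = Executors.newSingleThreadExecutor(
        r -> new Thread(r, "k8s-task-log-save") // named thread, per review feedback
    );
    try {
      Future<?> future = executor.submit(work);
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    }
    catch (TimeoutException e) {
      return false; // give up on persisting logs rather than block shutdown
    }
    catch (Exception e) {
      throw new RuntimeException(e);
    }
    finally {
      executor.shutdownNow(); // interrupt a hung read if it is still running
    }
  }

  public static void main(String[] args)
  {
    boolean fast = runWithTimeout(() -> { }, 1000);
    boolean slow = runWithTimeout(() -> {
      try {
        Thread.sleep(5000); // stand-in for a LogWatch read that never returns
      }
      catch (InterruptedException ignored) {
      }
    }, 200);
    System.out.println(fast + " " + slow); // prints: true false
  }
}
```

Since the executor lives only for the duration of one saveLogs call, creating it per task is cheap relative to the log persist itself.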

@capistrant capistrant changed the title from "Creat Kubernetes peon lifecycle task log persist timeout" to "Create Kubernetes peon lifecycle task log persist timeout" Sep 2, 2025
@capistrant capistrant requested a review from kgyrtkirk September 2, 2025 22:43
@cryptoe (Contributor) left a comment:

Minor comments. LGTM otherwise.


protected void saveLogs()
{
ExecutorService executor = Executors.newSingleThreadExecutor();
Contributor:

Let's name the thread.

log.warn("saveLogs() timed out after %d ms for task [%s]", logSaveTimeoutMs, taskId.getOriginalTaskId());
}
catch (Exception e) {
log.error(e, "saveLogs() failed for task [%s]", taskId.getOriginalTaskId());
Contributor:

Since task logs would not be visible, this has to be phrased in a manner that the cluster admin can understand. Something like: "Unable to save logs for the task[s] at location [details]. This does not have any impact on the work done by the task. If this continues to happen, check the Kubernetes server logs for potential errors."

@capistrant capistrant merged commit 90be682 into apache:master Sep 3, 2025
63 checks passed
capistrant added a commit to capistrant/incubator-druid that referenced this pull request Sep 30, 2025
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025