
Create Kubernetes peon lifecycle task log persist timeout#18444

Merged
capistrant merged 4 commits into apache:master from capistrant:KubernetesPeonLifecycle-saveLogsTimeout
Sep 3, 2025

Conversation

@capistrant (Contributor) commented Aug 27, 2025

Description

Prevents saveLogs() from hanging indefinitely when there are fabric8 issues. See fabric8io/kubernetes-client#7163 for extra context on how LogWatch processing can block indefinitely. Situations like this are suspected causes of overlord graceful shutdowns failing to complete. The upstream kubernetes-client fix associated with the linked issue is preferable to my implementation, but since that fix is unreleased, it is not clear when we could integrate it into Druid.

Release note

Introduces a task log saving timeout for the kubernetes-overlord-extensions mm-less ingestion framework. Persisting task logs no longer has the potential to block indefinitely. Instead, there is a time limit for persisting logs; if it is breached, the persist is abandoned. The default timeout is 5 minutes, and it can be configured by overriding druid.indexer.runner.logSaveTimeout with a valid Duration (e.g. PT60S).
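For illustration, Druid Duration-typed configs like the one above take ISO-8601-style strings. A minimal sketch using java.time.Duration (the class name here is hypothetical and not part of the PR; Druid's own config parsing may differ in detail):

```java
import java.time.Duration;

public class LogSaveTimeoutConfig
{
  public static void main(String[] args)
  {
    // druid.indexer.runner.logSaveTimeout takes an ISO-8601 duration string.
    long defaultMs = Duration.parse("PT5M").toMillis();   // default: 5 minutes
    long overrideMs = Duration.parse("PT60S").toMillis(); // example override from the release note
    System.out.println(defaultMs + " " + overrideMs);     // prints: 300000 60000
  }
}
```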


Key changed/added classes in this PR
  • KubernetesPeonLifecycle.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


protected void saveLogs()
{
ExecutorService executor = Executors.newSingleThreadExecutor();
Contributor:

Does it make sense to expose the timeouts here
https://github.com/fabric8io/kubernetes-client?tab=readme-ov-file#configuring-the-client
rather than doing a timeout via an exec service?

Contributor Author:

> Does it make sense to expose the timeouts here
> https://github.com/fabric8io/kubernetes-client?tab=readme-ov-file#configuring-the-client
> rather than doing a timeout via an exec service?

fabric8io/kubernetes-client#7163

From talking with @kgyrtkirk, I was under the impression that the reason this log read can hang is the above bug. I don't think any existing timeouts can fix that. But we could also expose all the timeouts that fabric8 supplies as part of this PR?

Contributor:

I see, thanks for clarifying.

Member:

this seems to create a new executor for every call?
that looks a bit odd

I don't know how many additional patches will be needed to make this fabric8+vertx stuff work correctly.
Would it be hard to add a knob which could restore the okhttp usage?
If we are about to handle the timeouts ourselves - is fabric8 the right choice of library?
I was taking a quick glance at https://github.com/kubernetes-client/java and it seemed more straightforward...

Contributor Author:

@kgyrtkirk, you raise good points. The only reason I'm proposing this is that the fabric8 bug fix is not available in any release. I'd like to remove this workaround ASAP once fabric8 publishes a release or patch release with the fix. Maybe we need to take a step back and consider making the client pluggable, allowing operators to migrate between clients. vertx has solved a lot of problems for some use cases but caused other problems.

Contributor Author (@capistrant, Sep 3, 2025):

> this seems to create a new executor for every call? that looks a bit odd

It's a new executor per task; saveLogs is called once, when the task completes.
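The pattern under discussion can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are mine, not from the PR) of bounding a blocking operation with a single-thread executor and Future.get, including the named thread requested later in review:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LogSaveTimeoutSketch
{
  // Runs work on a dedicated thread; returns true if it finished within
  // timeoutMs, false if the wait timed out and the work was abandoned.
  static boolean runWithTimeout(Runnable work, long timeoutMs)
  {
    ExecutorService executor = Executors.newSingleThreadExecutor(
        r -> new Thread(r, "k8s-task-log-save") // named thread, per review feedback
    );
    try {
      Future<?> future = executor.submit(work);
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    }
    catch (TimeoutException e) {
      return false; // give up on persisting logs rather than block shutdown
    }
    catch (Exception e) {
      throw new RuntimeException(e);
    }
    finally {
      executor.shutdownNow(); // interrupt a hung read if it is still running
    }
  }

  public static void main(String[] args)
  {
    boolean fast = runWithTimeout(() -> { }, 1000);
    boolean slow = runWithTimeout(() -> {
      try {
        Thread.sleep(5000); // stand-in for a LogWatch read that never returns
      }
      catch (InterruptedException ignored) {
      }
    }, 200);
    System.out.println(fast + " " + slow); // prints: true false
  }
}
```

Since the executor lives only for the duration of one saveLogs call, creating it per task is cheap relative to the log persist itself.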

@capistrant capistrant changed the title from "Creat Kubernetes peon lifecycle task log persist timeout" to "Create Kubernetes peon lifecycle task log persist timeout" Sep 2, 2025
@capistrant capistrant requested a review from kgyrtkirk September 2, 2025 22:43
@cryptoe (Contributor) left a comment:

Minor comments. LGTM otherwise.


protected void saveLogs()
{
ExecutorService executor = Executors.newSingleThreadExecutor();
Contributor:

Let's name the thread.

log.warn("saveLogs() timed out after %d ms for task [%s]", logSaveTimeoutMs, taskId.getOriginalTaskId());
}
catch (Exception e) {
log.error(e, "saveLogs() failed for task [%s]", taskId.getOriginalTaskId());
Contributor:

Since task logs would not be visible, this has to be phrased in a manner that the cluster admin can understand. Something like: "Unable to save logs for the task[s] at location [details]. This does not have any impact on the work done by the task. If this continues to happen, check the Kubernetes server logs for potential errors."

@capistrant capistrant merged commit 90be682 into apache:master Sep 3, 2025
63 checks passed
capistrant added a commit to capistrant/incubator-druid that referenced this pull request Sep 30, 2025
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025