Make KubernetesWorkItem.shutdown idempotent #18576

Merged
kfaraz merged 4 commits into apache:master from kfaraz:k8s_work_item_idem_shutdown on Sep 26, 2025

Conversation

@kfaraz (Contributor) commented on Sep 26, 2025

Description

This patch addresses the following condition, caused by a bug in the fabric8 client.

The callback executor in TaskQueue can get stuck in the following scenario.

  • TaskQueue tries to shut down a task via notifyStatus() and queues up a TaskRunner.shutdown() call on the callback executor.
  • TaskQueue then removes the entry for this task from its in-memory data structures.
  • The TaskRunner.shutdown() call remains stuck due to the fabric8 bug.
  • The TaskRunner still has the task in its memory.
  • When the TaskQueue periodically manages itself via startPendingTasksOnRunner(),
    it sees that the TaskRunner holds this unknown task and tries to shut it down again.

Fix

Subsequent calls to KubernetesWorkItem.shutdown() now return immediately instead of blocking.
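
To illustrate the fix, here is a minimal, self-contained sketch of the idempotency guard; the class and method names below are illustrative only and are not the actual KubernetesWorkItem code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the guard: only the first caller performs the
// (potentially slow) cleanup; later callers return immediately.
class IdempotentShutdownSketch
{
  private final AtomicBoolean isShutdown = new AtomicBoolean(false);

  void shutdown()
  {
    // compareAndSet succeeds exactly once, so repeated shutdown requests
    // cannot pile up behind a first attempt that is stuck.
    if (isShutdown.compareAndSet(false, true)) {
      doCleanup();
    }
  }

  private void doCleanup()
  {
    // e.g. stop the Kubernetes peon lifecycle, close the log watch, etc.
  }

  public static void main(String[] args)
  {
    IdempotentShutdownSketch item = new IdempotentShutdownSketch();
    item.shutdown(); // performs cleanup
    item.shutdown(); // no-op, returns immediately
  }
}
```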


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@capistrant (Contributor) left a comment


Seems like a logical solution to me. I think this might be a good time to take a second and consider whether the log watch in shutdown is even necessary. What do you think?

(On the diff in KubernetesWorkItem.shutdown():)

    this.kubernetesPeonLifecycle.shutdown();
    if (isShutdown.compareAndSet(false, true)) {
      synchronized (this) {
        this.kubernetesPeonLifecycle.startWatchingLogs();
@capistrant (Contributor) commented on the diff:

do we even need to be calling startWatchingLogs() here? Kind of off topic for the PR. But part of the reason shutdown is slow and can even hang, is because of the LogWatch being unhealthy. #18444 was implemented to put a time limit on log persist for saving off task logs. In that code path we are actually writing the logs out somewhere. I guess I don't see how the call here is useful at all if we are trying to just shut the task down?

@kfaraz (Contributor, Author) replied:

Yeah, I agree. The only apparent effect of calling startWatchingLogs() here is to initialize the KubernetesPeonLifecycle.logWatch field, which would be initialized anyway via join() -> finally -> saveLogs() -> doSaveLogs().

We can explore that in a separate PR.
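
For context, a rough sketch of the call chain mentioned above, assuming the method and field names from the comment; the bodies are placeholders and not the actual KubernetesPeonLifecycle code:

```java
// Toy illustration of why logWatch gets initialized even without the
// explicit startWatchingLogs() call in shutdown(): the join() path always
// saves logs in a finally block, which opens the log watch itself.
class PeonLifecycleSketch
{
  private Object logWatch; // stands in for the fabric8 LogWatch

  String join()
  {
    try {
      return "SUCCESS"; // wait for the peon pod to complete (elided)
    }
    finally {
      saveLogs();
    }
  }

  private void saveLogs()
  {
    doSaveLogs();
  }

  private void doSaveLogs()
  {
    startWatchingLogs(); // initializes logWatch lazily
    // ... persist the watched log output to the task log store (elided) ...
  }

  private void startWatchingLogs()
  {
    if (logWatch == null) {
      logWatch = new Object(); // placeholder for opening the pod log stream
    }
  }
}
```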

@capistrant (Contributor) commented:

Opened #18579 to propose dropping the LogWatch init in shutdown.

@kfaraz merged commit 1b220b2 into apache:master on Sep 26, 2025. 90 of 92 checks passed.
@kfaraz deleted the k8s_work_item_idem_shutdown branch on September 26, 2025, 16:49.
@cecemei added this to the 35.0.0 milestone on Oct 21, 2025.
RonShub pushed a commit to singular-labs/druid that referenced this pull request on Nov 30, 2025:
  Subsequent calls to KubernetesWorkItem.shutdown() should not block and return immediately.