Fix race condition in KubernetesTaskRunner between shutdown and getKnownTasks#14030
clintropolis merged 12 commits into apache:master
Conversation
…a/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
  List<TaskRunnerWorkItem> result = new ArrayList<>();
- for (Pod existingTask : client.listPeonPods(Sets.newHashSet(PeonPhase.RUNNING))) {
+ for (Job existingTask : client.listAllPeonJobs().stream()
+     .filter(job -> job.getStatus() != null && job.getStatus().getActive() != null && job.getStatus().getActive() > 0).collect(Collectors.toSet())
Instead of defining the logic for whether a job is active, succeeded, or failed in various places, can you create a class with a method for each state? For example:
JobStatus.isActive(Job job)
JobStatus.isSucceeded(Job job)
JobStatus.isFailed(Job job)
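A minimal sketch of what such a helper could look like, assuming the fabric8 batch/v1 Job model; the class that actually landed in this PR may differ in naming and null handling:

import io.fabric8.kubernetes.api.model.batch.v1.Job;

// Hypothetical helper that centralizes the job-state checks suggested above,
// so callers don't repeat the null-safe getActive()/getSucceeded()/getFailed() logic.
public class JobStatus
{
  public static boolean isActive(Job job)
  {
    return job != null && job.getStatus() != null
           && job.getStatus().getActive() != null && job.getStatus().getActive() > 0;
  }

  public static boolean isSucceeded(Job job)
  {
    return job != null && job.getStatus() != null
           && job.getStatus().getSucceeded() != null && job.getStatus().getSucceeded() > 0;
  }

  public static boolean isFailed(Job job)
  {
    return job != null && job.getStatus() != null
           && job.getStatus().getFailed() != null && job.getStatus().getFailed() > 0;
  }
}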
I added a class to do this. I think it may change slightly once #14028 (review) is merged, so I'll have to publish a new version once that's merged.
  {
    List<TaskRunnerWorkItem> result = new ArrayList<>();
-   for (Pod existingTask : client.listPeonPods()) {
+   for (Job existingTask : client.listAllPeonJobs()) {
The listPeonPods() and listPeonPods(Set<PeonPhase> phases) methods will be unused if you make this change, right? If so, please remove them.
  client.pods().inNamespace("test").create(pod);
  PodList podList = client.pods().inNamespace("test").list();
  assertEquals(1, podList.getItems().size());
  client.batch().v1().jobs().inNamespace("test").create(jobFromSpec);
Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
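The deprecation CodeQL flags here is presumably the item-taking create(...) overload; in newer fabric8 clients (6.x, an assumption about the version on the classpath) the replacement routes through resource(...), roughly:

// Hedged alternative to the deprecated create(jobFromSpec) call; verify against
// the fabric8 kubernetes-client version actually used by the extension.
client.batch().v1().jobs().inNamespace("test").resource(jobFromSpec).create();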
@nlippis I got this deployed into my testing environment and it looks good to go with the new fabric8 client, other than the missing dependency from #14052
…ownTasks (apache#14030)
* Fix issues with null pointers on jobResponse
* fix unit tests
* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java
  Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* nullable
* fix error message
* Use jobs for known tasks instead of pods
* Remove log lines
* remove log lines
* PR change requests
* revert wait change
---------
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
…ownTasks (#14030) (#14057)
* Fix issues with null pointers on jobResponse
* fix unit tests
* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java
* nullable
* fix error message
* Use jobs for known tasks instead of pods
* Remove log lines
* remove log lines
* PR change requests
* revert wait change
---------
Co-authored-by: George Shiqi Wu <george.wu@imply.io>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Description
Discovered a race condition in the shutdown code while doing some additional testing around f60f377.
If a task is shut down, has its k8s job deleted, and is then removed from the tasks map at https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L218,
it is possible for the pod to keep running in a terminating state for a little while afterwards. This is because k8s deletes the controller (the job) first and then lets the pod clean itself up.
If getKnownTasks (https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L338) is called by the TaskQueue main loop between the time the tasks map is cleaned up and the time the peon pod finishes deleting, another run() will be called by getKnownTasks.
Normally that run() call sees the existing future in the tasks map and returns it, but since the task id will already have been removed from the map, the task runner will actually start another job with the same name as the one that was just deleted.
This job will quickly be deleted again (getKnownTasks will return it, the TaskQueue will see that the task returned by the task runner is not in TaskStorage, and it will submit another shutdown request to delete the job), but this is a poor user experience.
Additionally, DuplicateKeyErrors may be thrown because there are multiple pods with the same job name running, which can cause ingestion to be briefly unresponsive.
It is also possible to get these DuplicateKeyErrors if a pod (not a job) is manually deleted in K8s. Normally, since we set backoffLimit to 0, the job doesn't retry creating a pod when one fails; it just fails out. But k8s seems to treat manual termination differently from a pod failure, so when a pod is manually terminated, the job actually starts another pod.
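For context, the backoffLimit behaviour described above lives in the Job spec itself. A sketch of how such a spec can be built with the fabric8 builder API (the job name and image below are illustrative, not the adapter's actual values):

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;

// Illustrative peon Job spec: backoffLimit 0 means a failed pod is not retried and the
// job simply fails, but (as noted above) k8s treats a manually deleted pod differently.
Job job = new JobBuilder()
    .withNewMetadata()
      .withName("example-peon-task")
    .endMetadata()
    .withNewSpec()
      .withBackoffLimit(0)
      .withNewTemplate()
        .withNewSpec()
          .withRestartPolicy("Never")
          .addNewContainer()
            .withName("peon")
            .withImage("example/druid-peon")
          .endContainer()
        .endSpec()
      .endTemplate()
    .endSpec()
    .build();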
IMO it makes more sense for getKnownTasks to rehydrate state from Kubernetes via the list of jobs rather than the list of pods. This solves both of the above problems: we can be sure the K8s job has been deleted before deleting the task future from the tasks map, and there will never be duplicate jobs with the same name.
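A rough sketch of that idea, not the PR's exact code (workItemFromJob, deletePeonJob, and taskFutures are hypothetical stand-ins for the runner's real fields and helpers):

// Known tasks are rehydrated from Jobs, so a task is either still backed by a Job
// or fully gone -- it can no longer show up via a pod that is merely terminating.
public Collection<? extends TaskRunnerWorkItem> getKnownTasks()
{
  List<TaskRunnerWorkItem> result = new ArrayList<>();
  for (Job job : client.listAllPeonJobs()) {
    result.add(workItemFromJob(job));  // hypothetical helper mapping a Job to a work item
  }
  return result;
}

// Shutdown deletes the Job first and only then drops the in-memory future, so a
// concurrent getKnownTasks() call cannot resurrect a job that is mid-termination.
public void shutdown(String taskId, String reason)
{
  client.deletePeonJob(new K8sTaskId(taskId));  // assumed to return only once the Job is gone
  taskFutures.remove(taskId);
}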
Release note
Bug fixes to the Kubernetes overlord extension
Key changed/added classes in this PR
I explored some other options for fixing this issue, such as excluding terminating pods from getKnownTasks and getRunningTasks, or having the main run loop wait for pods to delete if the job has been deleted, but this seemed like the cleanest solution.
This PR has:
I have tested more thoroughly with the PodTemplateTaskAdapter and done some smoke tests with the MultiContainerTaskAdapter.