Fix race condition in KubernetesTaskRunner between shutdown and getKnownTasks#14030
clintropolis merged 12 commits into apache:master
Conversation
…a/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
  List<TaskRunnerWorkItem> result = new ArrayList<>();
- for (Pod existingTask : client.listPeonPods(Sets.newHashSet(PeonPhase.RUNNING))) {
+ for (Job existingTask : client.listAllPeonJobs().stream()
+     .filter(job -> job.getStatus() != null && job.getStatus().getActive() != null && job.getStatus().getActive() > 0).collect(Collectors.toSet())
Instead of defining the logic for whether a job is active, succeeded, or failed in various places, can you create a class with a method for each state? For example:
JobStatus.isActive(Job job)
JobStatus.isSucceeded(Job job)
JobStatus.isFailed(Job job)
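A minimal sketch of what such a helper could look like, assuming the fabric8 batch/v1 Job model; the class that actually landed in this PR may differ in naming and null handling:

import io.fabric8.kubernetes.api.model.batch.v1.Job;

// Hypothetical helper that centralizes the job-state checks suggested above,
// so callers don't repeat the null-safe getActive()/getSucceeded()/getFailed() logic.
public class JobStatus
{
  public static boolean isActive(Job job)
  {
    return job != null && job.getStatus() != null
           && job.getStatus().getActive() != null && job.getStatus().getActive() > 0;
  }

  public static boolean isSucceeded(Job job)
  {
    return job != null && job.getStatus() != null
           && job.getStatus().getSucceeded() != null && job.getStatus().getSucceeded() > 0;
  }

  public static boolean isFailed(Job job)
  {
    return job != null && job.getStatus() != null
           && job.getStatus().getFailed() != null && job.getStatus().getFailed() > 0;
  }
}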
I added a class to do this. I think it may change slightly once #14028 (review) is merged, so I'll have to publish a new version once that's merged.
  {
    List<TaskRunnerWorkItem> result = new ArrayList<>();
-   for (Pod existingTask : client.listPeonPods()) {
+   for (Job existingTask : client.listAllPeonJobs()) {
The listPeonPods() and listPeonPods(Set<PeonPhase> phases) methods will be unused if you make this change, right? If so, please remove them.
  client.pods().inNamespace("test").create(pod);
  PodList podList = client.pods().inNamespace("test").list();
  assertEquals(1, podList.getItems().size());
  client.batch().v1().jobs().inNamespace("test").create(jobFromSpec);
Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
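The deprecation CodeQL flags here is presumably the item-taking create(...) overload; in newer fabric8 clients (6.x, an assumption about the version on the classpath) the replacement routes through resource(...), roughly:

// Hedged alternative to the deprecated create(jobFromSpec) call; verify against
// the fabric8 kubernetes-client version actually used by the extension.
client.batch().v1().jobs().inNamespace("test").resource(jobFromSpec).create();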
@nlippis I got this deployed into my testing environment and it looks good to go with the new fabric8 client, other than the missing dependency from #14052
…ownTasks (apache#14030)
* Fix issues with null pointers on jobResponse
* fix unit tests
* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java
  Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* nullable
* fix error message
* Use jobs for known tasks instead of pods
* Remove log lines
* remove log lines
* PR change requests
* revert wait change
---------
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
…ownTasks (#14030) (#14057)
* Fix issues with null pointers on jobResponse
* fix unit tests
* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java
* nullable
* fix error message
* Use jobs for known tasks instead of pods
* Remove log lines
* remove log lines
* PR change requests
* revert wait change
---------
Co-authored-by: George Shiqi Wu <george.wu@imply.io>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Description
Discovered a race condition in the shutdown code while doing some additional testing around f60f377.
If a task is shut down, has its k8s job deleted, and is then removed from the tasks map at https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L218,
it is possible for the pod to keep running in a terminating state for a little while afterwards. This is because k8s deletes the controller (the job) first and then lets the pod clean itself up.
If getKnownTasks (https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L338) is called by the TaskQueue main loop between the time the tasks map is cleaned up and the time the peon pod finishes deleting, another run() will be called by getKnownTasks.
Normally that run() call sees the existing future in the tasks map and returns it, but since the task id will already have been removed from the map, the task runner will actually start another job with the same name as the one that was just deleted.
This job will quickly be deleted again (getKnownTasks will return it, the TaskQueue will see that the task returned by the task runner is not in TaskStorage, and it will submit another shutdown request to delete the job), but this is a poor user experience.
Additionally, DuplicateKeyErrors may be thrown because there are multiple pods with the same job name running, which can cause ingestion to be briefly unresponsive.
It is also possible to get these DuplicateKeyErrors if a pod (not a job) is manually deleted in K8s. Normally, since we set backoffLimit to 0, the job doesn't retry creating a pod when one fails; it just fails out. But k8s seems to treat manual termination differently from a pod failure, so when a pod is manually terminated, the job actually starts another pod.
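For context, the backoffLimit behaviour described above lives in the Job spec itself. A sketch of how such a spec can be built with the fabric8 builder API (the job name and image below are illustrative, not the adapter's actual values):

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;

// Illustrative peon Job spec: backoffLimit 0 means a failed pod is not retried and the
// job simply fails, but (as noted above) k8s treats a manually deleted pod differently.
Job job = new JobBuilder()
    .withNewMetadata()
      .withName("example-peon-task")
    .endMetadata()
    .withNewSpec()
      .withBackoffLimit(0)
      .withNewTemplate()
        .withNewSpec()
          .withRestartPolicy("Never")
          .addNewContainer()
            .withName("peon")
            .withImage("example/druid-peon")
          .endContainer()
        .endSpec()
      .endTemplate()
    .endSpec()
    .build();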
IMO it makes more sense for getKnownTasks to rehydrate state from Kubernetes via the list of jobs rather than the list of pods. This solves both of the above problems: we can be sure the K8s job has been deleted before deleting the task future from the tasks map, and there will never be duplicate jobs with the same name.
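A rough sketch of that idea, not the PR's exact code (workItemFromJob, deletePeonJob, and taskFutures are hypothetical stand-ins for the runner's real fields and helpers):

// Known tasks are rehydrated from Jobs, so a task is either still backed by a Job
// or fully gone -- it can no longer show up via a pod that is merely terminating.
public Collection<? extends TaskRunnerWorkItem> getKnownTasks()
{
  List<TaskRunnerWorkItem> result = new ArrayList<>();
  for (Job job : client.listAllPeonJobs()) {
    result.add(workItemFromJob(job));  // hypothetical helper mapping a Job to a work item
  }
  return result;
}

// Shutdown deletes the Job first and only then drops the in-memory future, so a
// concurrent getKnownTasks() call cannot resurrect a job that is mid-termination.
public void shutdown(String taskId, String reason)
{
  client.deletePeonJob(new K8sTaskId(taskId));  // assumed to return only once the Job is gone
  taskFutures.remove(taskId);
}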
Release note
Bug fixes to the Kubernetes overlord extension
Key changed/added classes in this PR
I explored some other options for fixing this issue, such as excluding terminating pods from getKnownTasks and getRunningTasks, or having the main run loop wait for pods to delete if the job has been deleted, but this seemed like the cleanest solution.
This PR has:
I have tested more thoroughly with the PodTemplateTaskAdapter and done some smoke tests with the MultiContainerTaskAdapter.