Fix Overlord leader election when task lock re-acquisition fails #13172
Conversation
@abhishekagarwal87 thank you for the review!
@kfaraz thank you for the review!
kfaraz left a comment
Thanks for the quick fix, @AmatyaAvadhanula!
I have added some more minor feedback. I would also request you to test this out thoroughly (in case you haven't already) on a Druid cluster as this changes task queue startup, which affects all task executions.
    running.clear();
    activeTasks.clear();
    activeTasks.addAll(storedActiveTasks);
    // Set of task groups in which at least one task failed to re-acquire a lock
        task.getId(),
        task.getGroupId()
    );
    continue;
Nit: probably not needed as we are already at the end of the loop.
    for (Task task : taskStorage.getActiveTasks()) {
      if (failedToReacquireLockTaskGroups.contains(task.getGroupId())) {
        tasksToFail.add(task);
        activeTasks.remove(task.getId());
Nit: Style: You could choose to remove all of them in one go, thus retaining the sense of atomic update to activeTasks.
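A minimal sketch of that suggestion, assuming activeTasks holds task ids and reusing the names from the quoted diff (the id set is hypothetical):

    // Collect first, then apply one bulk update so readers of activeTasks
    // never observe a partially-removed state.
    Set<String> taskIdsToRemove = new HashSet<>();
    for (Task task : storedActiveTasks) {
      if (failedToReacquireLockTaskGroups.contains(task.getGroupId())) {
        tasksToFail.add(task);
        taskIdsToRemove.add(task.getId());
      }
    }
    activeTasks.removeAll(taskIdsToRemove);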
    }

    Set<Task> tasksToFail = new HashSet<>();
    for (Task task : taskStorage.getActiveTasks()) {
We don't want to make another call to the storage. Use an iterator over the activeTasks or storedActiveTasks. Either should be fine.
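For instance, a sketch of that suggestion, reusing the already-fetched storedActiveTasks instead of hitting the metadata store again (names taken from the quoted diffs; requires java.util.stream.Collectors):

    // Reuse the snapshot already synced from storage rather than issuing a
    // second metadata-store call.
    Set<Task> tasksToFail = storedActiveTasks
        .stream()
        .filter(task -> failedToReacquireLockTaskGroups.contains(task.getGroupId()))
        .collect(Collectors.toSet());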
    * groupId, dataSource, and priority.
    */
    @VisibleForTesting
    private TaskLockPosse verifyAndCreateOrFindLockPosse(Task task, TaskLock taskLock)
Nit: is there a way to avoid this and still be able to test it? (without too much hassle)
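One possible way to avoid the annotation, sketched here as a guess rather than what the PR actually does, is to drive the private method through the public sync path and assert on its reported result:

    // Sketch: exercise lock re-acquisition via the public syncFromStorage()
    // entry point instead of calling the private method directly.
    // getTasksToFail() comes from the diff quoted below; variable names are
    // illustrative.
    Set<Task> tasksToFail = taskLockbox.syncFromStorage().getTasksToFail();
    Assert.assertTrue(tasksToFail.contains(taskWithFailingLockAcquisition));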
    // Clean up needs to happen after tasks have been synced from storage
    Set<Task> tasksToFail = taskLockbox.syncFromStorage().getTasksToFail();
    for (Task task : tasksToFail) {
      shutdown(task.getId(), "Failed to reacquire lock.");
    - shutdown(task.getId(), "Failed to reacquire lock.");
    + shutdown(task.getId(), "Shutting down forcefully as failed to reacquire lock after becoming leader.");
It's a little verbose but paints a clearer picture of what happened.
Thanks for the suggestion. Shouldn't it be "while becoming leader"?
    // should return number of tasks which are not in running state
    response = overlordResource.getCompleteTasks(null, req);
    - Assert.assertEquals(2, (((List) response.getEntity()).size()));
    + Assert.assertEquals(4, (((List) response.getEntity()).size()));
Why did we need to change an existing test?
I would advise adding a separate test for verifying the behaviour of such tasks where we failed to reacquire locks.
    @Test
    public void testFailedToReacquireTaskLock() throws Exception
    {
      final Task badTask0 = NoopTask.withGroupId("BadTask");
Probably a better name explaining why it's bad or good?
I added a TestLockbox class which returns null for tasks whose group contains "BadTask". I'm not sure it's the best approach to test these changes.
I'm hoping you could take a look and suggest a better approach.
The testing approach itself is fine, but the names can be made a little more self-explanatory, like "TaskWithFailingLockAcquisition" or something. Name the Task instances also accordingly.
    testLockbox.add(badTask1);
    testLockbox.add(goodTask0);

    testLockbox.tryLock(badTask0, new TimeChunkLockRequest(TaskLockType.EXCLUSIVE,
Nit: Style: Put each arg on a separate line (when necessary) for consistency with the rest of Druid code.
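A sketch of the style being asked for; the interval and version arguments are guessed for illustration only:

    // One argument per line, matching Druid's prevailing formatting.
    testLockbox.tryLock(
        badTask0,
        new TimeChunkLockRequest(
            TaskLockType.EXCLUSIVE,
            badTask0,
            Intervals.of("2022-10-01/2022-10-02"),  // hypothetical interval
            null
        )
    );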
      }
    );
    requestManagement();
    // Remove any unacquired locks from storage
Thanks for adding the comment!
I am a little unclear on why there would be unacquired locks left behind. I would imagine that shutdown() would take care of this. If not, please rephrase the comments to clarify that point.
shutdown() only clears task locks for which a TaskLockPosse is present. The error here is that a lock could not be reacquired, so a TaskLockPosse doesn't exist for the conflicting task, which is why these "unacquired" entries must be removed in this manner.
Okay, please include this info in the comments.
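For instance, the inline comment could be expanded along these lines (a sketch based on the explanation above, not the exact wording that was committed):

    // Remove any locks that could not be reacquired from storage.
    // shutdown() only clears task locks that have a TaskLockPosse; locks that
    // failed re-acquisition never got a posse, so they would otherwise be left
    // behind in the metadata store and must be deleted here explicitly.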
@kfaraz, the changes have been tested on a Druid cluster by introducing a bad entry in the metadata store to simulate this error. I monitored the changes for a while and have ticked the corresponding box in the checklist. Thank you for the additional feedback; I'll be sure to test the changes on a cluster after addressing these comments as well.
Fix Overlord leader election when task lock re-acquisition fails (apache#13172). Overlord leader election can sometimes fail due to task lock re-acquisition issues. This commit solves the issue by failing such tasks and clearing all their locks.
Fixes Overlord leader election when task lock re-acquisition fails
Description
#11653 describes an issue where Overlord leader election fails due to lock re-acquisition issues.
This PR aims to solve the issue by failing such tasks and clearing all their locks when re-acquisition doesn't succeed, so that Overlord leader election is not blocked.
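At a high level, the flow looks roughly like this (reconstructed from the diffs quoted in the review above; method and variable names come from those diffs, the surrounding scaffolding is illustrative):

    // In TaskLockbox.syncFromStorage(): track groups where any task failed to
    // reacquire a lock, mark all tasks in those groups for failure, and report
    // them to the caller.
    Set<Task> tasksToFail = new HashSet<>();
    for (Task task : storedActiveTasks) {
      if (failedToReacquireLockTaskGroups.contains(task.getGroupId())) {
        tasksToFail.add(task);
      }
    }

    // In TaskQueue, after becoming leader: clean up happens only after tasks
    // have been synced from storage.
    for (Task task : taskLockbox.syncFromStorage().getTasksToFail()) {
      shutdown(task.getId(), "Failed to reacquire lock.");
    }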
Key changed/added classes in this PR
TaskLockbox
TaskQueue

This PR has: