
Fix Overlord leader election when task lock re-acquisition fails#13172

Merged
kfaraz merged 12 commits into apache:master from AmatyaAvadhanula:feature-lockReacquisition_leaderElectionFailure
Oct 17, 2022

Conversation

Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Oct 3, 2022

Fixes Overlord leader election when task lock re-acquisition fails

Description

#11653 describes an issue where Overlord leader election fails due to lock re-acquisition issues.

This PR fixes the issue by failing such tasks and clearing all of their locks when re-acquisition does not succeed, so that Overlord leader election is not blocked.


Key changed/added classes in this PR
  • TaskLockbox
  • TaskQueue
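
For illustration, the approach can be sketched with a minimal, self-contained model. This is a hypothetical sketch with stand-in types and method names (`getTasksToFail`, plain `String` ids), not the actual `TaskLockbox`/`TaskQueue` code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the fix: on leader startup, a task whose
// lock cannot be re-acquired no longer blocks leader election; instead, every
// task in its group is marked to be failed and have its locks cleared.
class LockReacquisitionSketch
{
  static Set<String> getTasksToFail(Map<String, String> taskIdToGroupId, Set<String> reacquirableTaskIds)
  {
    // Groups in which at least one task failed to re-acquire a lock
    final Set<String> failedGroups = new HashSet<>();
    for (Map.Entry<String, String> entry : taskIdToGroupId.entrySet()) {
      if (!reacquirableTaskIds.contains(entry.getKey())) {
        failedGroups.add(entry.getValue());
      }
    }
    // Fail all tasks belonging to an affected group, not just the one task
    final Set<String> tasksToFail = new HashSet<>();
    for (Map.Entry<String, String> entry : taskIdToGroupId.entrySet()) {
      if (failedGroups.contains(entry.getValue())) {
        tasksToFail.add(entry.getKey());
      }
    }
    return tasksToFail;
  }

  public static void main(String[] args)
  {
    final Map<String, String> tasks = new HashMap<>();
    tasks.put("task1", "groupA");
    tasks.put("task2", "groupA");
    tasks.put("task3", "groupB");
    // task1 cannot re-acquire its lock, so both tasks of groupA are failed
    System.out.println(getTasksToFail(tasks, new HashSet<>(Arrays.asList("task2", "task3"))));
  }
}
```

Failing the whole group mirrors the behaviour described above: once one task of a group cannot re-acquire its lock, its sibling tasks are failed as well rather than blocking the sync.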

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula
Contributor Author

@abhishekagarwal87 thank you for the review!
I think there were a few other gaps, such as cleaning up the task's existing TaskLockPosse and shutting the task down if it's running, which I've fixed.

Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java Outdated
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as draft October 8, 2022 09:56
Contributor

@kfaraz kfaraz left a comment

Added some comments.

Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/SyncResult.java Outdated
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review October 14, 2022 06:08
@AmatyaAvadhanula
Contributor Author

@kfaraz thank you for the review!

Contributor

@kfaraz kfaraz left a comment

Thanks for the quick fix, @AmatyaAvadhanula !
I have added some more minor feedback. I would also request that you test this out thoroughly (in case you haven't already) on a Druid cluster, as this changes task queue startup, which affects all task executions.

running.clear();
activeTasks.clear();
activeTasks.addAll(storedActiveTasks);
// Set of task groups in which at least one task failed to re-acquire a lock
Contributor

Thanks for the comments!

task.getId(),
task.getGroupId()
);
continue;
Contributor

Nit: probably not needed as we are already at the end of the loop.

for (Task task : taskStorage.getActiveTasks()) {
if (failedToReacquireLockTaskGroups.contains(task.getGroupId())) {
tasksToFail.add(task);
activeTasks.remove(task.getId());
Contributor

Nit: Style: You could choose to remove all of them in one go, thus retaining the sense of atomic update to activeTasks.
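
The suggested one-go removal could look something like the following. This is a hypothetical, simplified sketch using stand-in types (a plain id-to-group map), not the actual TaskQueue code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the suggested style: collect everything to drop
// first, then mutate the active-task state in a single bulk call rather than
// removing entries one by one inside the loop.
class AtomicRemovalSketch
{
  static Set<String> removeFailedGroups(Map<String, String> activeTaskIdToGroup, Set<String> failedGroupIds)
  {
    final Set<String> idsToFail = new HashSet<>();
    for (Map.Entry<String, String> entry : activeTaskIdToGroup.entrySet()) {
      if (failedGroupIds.contains(entry.getValue())) {
        idsToFail.add(entry.getKey());
      }
    }
    // Single bulk update, retaining the sense of an atomic change
    activeTaskIdToGroup.keySet().removeAll(idsToFail);
    return idsToFail;
  }
}
```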

}

Set<Task> tasksToFail = new HashSet<>();
for (Task task : taskStorage.getActiveTasks()) {
Contributor

We don't want to make another call to the storage. Use an iterator over the activeTasks or storedActiveTasks. Either should be fine.

Contributor Author

Fixed

* groupId, dataSource, and priority.
*/
private TaskLockPosse verifyAndCreateOrFindLockPosse(Task task, TaskLock taskLock)
@VisibleForTesting
Contributor

Nit: is there a way to avoid this and still be able to test it? (without too much hassle)

// Clean up needs to happen after tasks have been synced from storage
Set<Task> tasksToFail = taskLockbox.syncFromStorage().getTasksToFail();
for (Task task : tasksToFail) {
shutdown(task.getId(), "Failed to reacquire lock.");
Contributor

Suggested change
shutdown(task.getId(), "Failed to reacquire lock.");
shutdown(task.getId(), "Shutting down forcefully as failed to reacquire lock after becoming leader.");

It's a little verbose but paints a clearer picture of what happened.

Contributor Author

Thanks for the suggestion. Shouldn't it be "while becoming leader"?

Contributor

Sure, that works too.

Contributor Author

Done

// should return number of tasks which are not in running state
response = overlordResource.getCompleteTasks(null, req);
Assert.assertEquals(2, (((List) response.getEntity()).size()));
Assert.assertEquals(4, (((List) response.getEntity()).size()));
Contributor

Why did we need to change an existing test?
I would advise adding a separate test for verifying the behaviour of such tasks where we failed to reacquire locks.

@Test
public void testFailedToReacquireTaskLock() throws Exception
{
final Task badTask0 = NoopTask.withGroupId("BadTask");
Contributor

Probably a better name explaining why it's bad or good?

Contributor Author

I added a TestLockbox class which returns null for tasks whose group contains "BadTask". I'm not sure it's the best approach to test these changes.
I'm hoping you could take a look and suggest a better approach.

Contributor

The testing approach itself is fine but the names can be made a little more self-explanatory,
like "TaskWithFailingLockAcquisition" or something. Name the Task instances also accordingly.

Contributor Author

Done
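
The test fixture described in this thread could be sketched as follows. This is a hypothetical, self-contained simplification (plain `String` ids, a static `tryLock` stand-in), not the actual test lockbox class:

```java
// Hypothetical sketch of the fixture: a lockbox stub whose lock acquisition
// deliberately fails (returns null) for any task whose group id carries a
// marker, so the lock re-acquisition failure path can be exercised in a test.
class FailingLockboxSketch
{
  static final String FAILING_GROUP_MARKER = "TaskWithFailingLockAcquisition";

  // Stand-in for a tryLock-style method: null means the lock was not granted
  static String tryLock(String taskId, String groupId)
  {
    if (groupId.contains(FAILING_GROUP_MARKER)) {
      return null;
    }
    return "lock-" + taskId;
  }
}
```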

testLockbox.add(badTask1);
testLockbox.add(goodTask0);

testLockbox.tryLock(badTask0, new TimeChunkLockRequest(TaskLockType.EXCLUSIVE,
Contributor

Nit: Style: Put each arg on a separate line (when necessary) for consistency with the rest of Druid code.

Contributor Author

Done

}
);
requestManagement();
// Remove any unacquired locks from storage
Contributor

Thanks for adding the comment!
I am a little unclear on why there would be unacquired locks left behind. I would imagine that shutdown() would take care of this. If not, please rephrase the comments to clarify that point.

Contributor Author

shutdown() only clears task locks for which a TaskLockPosse is present. The error here is that a lock could not be reacquired, so no TaskLockPosse exists for the conflicting task, which is why these "unacquired" entries must be removed in this manner.

Contributor

Okay, please include this info in the comments.

Contributor Author

Included
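
The distinction discussed in this thread can be illustrated with a minimal model. This is a hypothetical sketch with stand-in types (plain lock ids), not the actual TaskLockbox code:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: a posse-based shutdown() can only release locks that
// have an in-memory TaskLockPosse, so lock entries that were never
// re-acquired survive it and must be deleted from storage directly.
class UnacquiredLockCleanupSketch
{
  // Returns the stored lock entries that a posse-based shutdown leaves behind
  static Set<String> locksLeftAfterShutdown(Set<String> storedLockIds, Set<String> posseBackedLockIds)
  {
    final Set<String> remaining = new HashSet<>(storedLockIds);
    remaining.removeAll(posseBackedLockIds); // shutdown releases only posse-backed locks
    return remaining; // "unacquired" entries needing an explicit storage delete
  }
}
```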

@AmatyaAvadhanula
Contributor Author

@kfaraz, the changes have been tested on a Druid cluster by introducing a bad entry in the metadata store to simulate this error. I monitored the cluster for a while and have ticked the corresponding box in the checklist.

Thank you for the additional feedback. I'll be sure to test the changes on a cluster after addressing these comments as well.

Contributor

@kfaraz kfaraz left a comment

+1 after CI passes.

@kfaraz kfaraz merged commit b88e1c2 into apache:master Oct 17, 2022
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Oct 18, 2022
…che#13172)

Overlord leader election can sometimes fail due to task lock re-acquisition issues.
This commit solves the issue by failing such tasks and clearing all their locks.

3 participants