
Fix Overlord leader election when task lock re-acquisition fails#13172

Merged
kfaraz merged 12 commits into apache:master from AmatyaAvadhanula:feature-lockReacquisition_leaderElectionFailure
Oct 17, 2022

Conversation

Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Oct 3, 2022

Fixes Overlord leader election when task lock re-acquisition fails

Description

#11653 describes an issue where Overlord leader election fails due to lock re-acquisition issues.

This PR fixes the issue by failing such tasks and clearing all of their locks when re-acquisition does not succeed, so that Overlord leader election is not blocked.


Key changed/added classes in this PR
  • TaskLockbox
  • TaskQueue
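
For illustration, the approach can be sketched with a minimal, self-contained model. This is a hypothetical sketch with stand-in types and method names (`getTasksToFail`, plain `String` ids), not the actual `TaskLockbox`/`TaskQueue` code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the fix: on leader startup, a task whose
// lock cannot be re-acquired no longer blocks leader election; instead, every
// task in its group is marked to be failed and have its locks cleared.
class LockReacquisitionSketch
{
  static Set<String> getTasksToFail(Map<String, String> taskIdToGroupId, Set<String> reacquirableTaskIds)
  {
    // Groups in which at least one task failed to re-acquire a lock
    final Set<String> failedGroups = new HashSet<>();
    for (Map.Entry<String, String> entry : taskIdToGroupId.entrySet()) {
      if (!reacquirableTaskIds.contains(entry.getKey())) {
        failedGroups.add(entry.getValue());
      }
    }
    // Fail all tasks belonging to an affected group, not just the one task
    final Set<String> tasksToFail = new HashSet<>();
    for (Map.Entry<String, String> entry : taskIdToGroupId.entrySet()) {
      if (failedGroups.contains(entry.getValue())) {
        tasksToFail.add(entry.getKey());
      }
    }
    return tasksToFail;
  }

  public static void main(String[] args)
  {
    final Map<String, String> tasks = new HashMap<>();
    tasks.put("task1", "groupA");
    tasks.put("task2", "groupA");
    tasks.put("task3", "groupB");
    // task1 cannot re-acquire its lock, so both tasks of groupA are failed
    System.out.println(getTasksToFail(tasks, new HashSet<>(Arrays.asList("task2", "task3"))));
  }
}
```

Failing the whole group mirrors the behaviour described above: once one task of a group cannot re-acquire its lock, its sibling tasks are failed as well rather than blocking the sync.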

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula
Contributor Author

@abhishekagarwal87 thank you for the review!
I think there were a few other gaps, such as cleaning up the task's existing TaskLockPosse and shutting the task down if it's running, which I've fixed.

Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java Outdated
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as draft October 8, 2022 09:56
Contributor

@kfaraz kfaraz left a comment

Added some comments.

Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/overlord/SyncResult.java Outdated
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review October 14, 2022 06:08
@AmatyaAvadhanula
Contributor Author

@kfaraz thank you for the review!

Contributor

@kfaraz kfaraz left a comment

Thanks for the quick fix, @AmatyaAvadhanula !
I have added some more minor feedback. I would also request that you test this out thoroughly (in case you haven't already) on a Druid cluster, as this changes task queue startup, which affects all task executions.

running.clear();
activeTasks.clear();
activeTasks.addAll(storedActiveTasks);
// Set of task groups in which at least one task failed to re-acquire a lock
Contributor

Thanks for the comments!

task.getId(),
task.getGroupId()
);
continue;
Contributor

Nit: probably not needed as we are already at the end of the loop.

for (Task task : taskStorage.getActiveTasks()) {
if (failedToReacquireLockTaskGroups.contains(task.getGroupId())) {
tasksToFail.add(task);
activeTasks.remove(task.getId());
Contributor

Nit: Style: You could choose to remove all of them in one go, thus retaining the sense of atomic update to activeTasks.
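
The suggested one-go removal could look something like the following. This is a hypothetical, simplified sketch using stand-in types (a plain id-to-group map), not the actual TaskQueue code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the suggested style: collect everything to drop
// first, then mutate the active-task state in a single bulk call rather than
// removing entries one by one inside the loop.
class AtomicRemovalSketch
{
  static Set<String> removeFailedGroups(Map<String, String> activeTaskIdToGroup, Set<String> failedGroupIds)
  {
    final Set<String> idsToFail = new HashSet<>();
    for (Map.Entry<String, String> entry : activeTaskIdToGroup.entrySet()) {
      if (failedGroupIds.contains(entry.getValue())) {
        idsToFail.add(entry.getKey());
      }
    }
    // Single bulk update, retaining the sense of an atomic change
    activeTaskIdToGroup.keySet().removeAll(idsToFail);
    return idsToFail;
  }
}
```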

}

Set<Task> tasksToFail = new HashSet<>();
for (Task task : taskStorage.getActiveTasks()) {
Contributor

We don't want to make another call to the storage. Use an iterator over the activeTasks or storedActiveTasks. Either should be fine.

Contributor Author

Fixed

* groupId, dataSource, and priority.
*/
private TaskLockPosse verifyAndCreateOrFindLockPosse(Task task, TaskLock taskLock)
@VisibleForTesting
Contributor

Nit: is there a way to avoid this and still be able to test it? (without too much hassle)

// Clean up needs to happen after tasks have been synced from storage
Set<Task> tasksToFail = taskLockbox.syncFromStorage().getTasksToFail();
for (Task task : tasksToFail) {
shutdown(task.getId(), "Failed to reacquire lock.");
Contributor

Suggested change
shutdown(task.getId(), "Failed to reacquire lock.");
shutdown(task.getId(), "Shutting down forcefully as failed to reacquire lock after becoming leader.");

It's a little verbose but paints a clearer picture of what happened.

Contributor Author

Thanks for the suggestion. Shouldn't it be "while becoming leader"?

Contributor

Sure, that works too.

Contributor Author

Done

// should return number of tasks which are not in running state
response = overlordResource.getCompleteTasks(null, req);
Assert.assertEquals(2, (((List) response.getEntity()).size()));
Assert.assertEquals(4, (((List) response.getEntity()).size()));
Contributor

Why did we need to change an existing test?
I would advise adding a separate test for verifying the behaviour of such tasks where we failed to reacquire locks.

@Test
public void testFailedToReacquireTaskLock() throws Exception
{
final Task badTask0 = NoopTask.withGroupId("BadTask");
Contributor

Probably a better name explaining why it's bad or good?

Contributor Author

I added a TestLockbox class which returns null for tasks whose group contains "BadTask". I'm not sure it's the best approach to test these changes.
I'm hoping you could take a look and suggest a better approach.

Contributor

The testing approach itself is fine but the names can be made a little more self-explanatory,
like "TaskWithFailingLockAcquisition" or something. Name the Task instances also accordingly.

Contributor Author

Done
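
The test fixture described in this thread could be sketched as follows. This is a hypothetical, self-contained simplification (plain `String` ids, a static `tryLock` stand-in), not the actual test lockbox class:

```java
// Hypothetical sketch of the fixture: a lockbox stub whose lock acquisition
// deliberately fails (returns null) for any task whose group id carries a
// marker, so the lock re-acquisition failure path can be exercised in a test.
class FailingLockboxSketch
{
  static final String FAILING_GROUP_MARKER = "TaskWithFailingLockAcquisition";

  // Stand-in for a tryLock-style method: null means the lock was not granted
  static String tryLock(String taskId, String groupId)
  {
    if (groupId.contains(FAILING_GROUP_MARKER)) {
      return null;
    }
    return "lock-" + taskId;
  }
}
```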

testLockbox.add(badTask1);
testLockbox.add(goodTask0);

testLockbox.tryLock(badTask0, new TimeChunkLockRequest(TaskLockType.EXCLUSIVE,
Contributor

Nit: Style: Put each arg on a separate line (when necessary) for consistency with the rest of Druid code.

Contributor Author

Done

}
);
requestManagement();
// Remove any unacquired locks from storage
Contributor

Thanks for adding the comment!
I am a little unclear on why there would be unacquired locks left behind. I would imagine that shutdown() would take care of this. If not, please rephrase the comments to clarify that point.

Contributor Author

shutdown() only clears task locks for which a TaskLockPosse is present. The error here is that a lock could not be reacquired, so no TaskLockPosse exists for the conflicting task, which is why these "unacquired" entries must be removed in this manner.

Contributor

Okay, please include this info in the comments.

Contributor Author

Included
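
The distinction discussed in this thread can be illustrated with a minimal model. This is a hypothetical sketch with stand-in types (plain lock ids), not the actual TaskLockbox code:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: a posse-based shutdown() can only release locks that
// have an in-memory TaskLockPosse, so lock entries that were never
// re-acquired survive it and must be deleted from storage directly.
class UnacquiredLockCleanupSketch
{
  // Returns the stored lock entries that a posse-based shutdown leaves behind
  static Set<String> locksLeftAfterShutdown(Set<String> storedLockIds, Set<String> posseBackedLockIds)
  {
    final Set<String> remaining = new HashSet<>(storedLockIds);
    remaining.removeAll(posseBackedLockIds); // shutdown releases only posse-backed locks
    return remaining; // "unacquired" entries needing an explicit storage delete
  }
}
```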

@AmatyaAvadhanula
Contributor Author

@kfaraz, the changes have been tested on a Druid cluster by introducing a bad entry in the metadata store to simulate this error. I monitored the cluster for a while and have ticked the corresponding box in the checklist.

Thank you for the additional feedback. I'll be sure to test the changes on a cluster after addressing these comments as well.

Contributor

@kfaraz kfaraz left a comment

+1 after CI passes.

@kfaraz kfaraz merged commit b88e1c2 into apache:master Oct 17, 2022
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Oct 18, 2022
…che#13172)

Overlord leader election can sometimes fail due to task lock re-acquisition issues.
This commit solves the issue by failing such tasks and clearing all their locks.

3 participants