Fixing overlord unable to become a leader when syncing the lock from metadata store #14038

Merged
cryptoe merged 1 commit into apache:master from cryptoe:overlord_fix_while_taking_locks
Apr 10, 2023

@cryptoe cryptoe commented Apr 6, 2023

The Overlord fails to become the leader if it encounters an exception while getting the task locks from the metadata store.

I saw this problem when one of the clusters was running an MSQ job from before patch #13282, which changes task priority.
During the upgrade, the Overlord failed to start due to:

2023-03-28T15:59:53,571 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.lang.reflect.InvocationTargetException}
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:179) 
	at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:98) 
	at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) 
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.reflect.InvocationTargetException
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) 
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) 
	at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:176) 
Caused by: java.lang.IllegalArgumentException: lock priority[0] is different from task priority[50]
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148) ~[guava-16.0.1.jar:?]
	at org.apache.druid.indexing.overlord.TaskLockbox.verifyAndCreateOrFindLockPosse(TaskLockbox.java:260) 
	at org.apache.druid.indexing.overlord.TaskLockbox.syncFromStorage(TaskLockbox.java:169)
	at org.apache.druid.indexing.overlord.TaskQueue.start(TaskQueue.java:179)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) 
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) 
	at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:176) 
	... 5 more
	

I fixed the DB sync so that any task groups whose task locks cannot be synced with the DB are identified first.
The Overlord then kills all the tasks in those task groups, piggybacking on the logic introduced in #13172.
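
As a rough illustration of the approach described above, here is a minimal, self-contained sketch. It is not the actual `TaskLockbox` code: `LockSyncSketch`, `Task`, `StoredLock`, `SyncResult`, `verify` and `tasksToKill` are hypothetical, simplified stand-ins, and the priority check only imitates the precondition seen in the stack trace. The idea is that a lock that fails verification no longer aborts the whole sync; instead its task group is recorded so that every task in that group can be killed afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LockSyncSketch {
    /** Hypothetical, simplified task: only the fields this sketch needs. */
    static final class Task {
        final String id;
        final String groupId;
        final int priority;

        Task(String id, String groupId, int priority) {
            this.id = id;
            this.groupId = groupId;
            this.priority = priority;
        }
    }

    /** Hypothetical lock row as persisted in the metadata store. */
    static final class StoredLock {
        final String taskId;
        final int priority;

        StoredLock(String taskId, int priority) {
            this.taskId = taskId;
            this.priority = priority;
        }
    }

    /** Outcome of a sync: locks that were restored, and groups that could not be synced. */
    static final class SyncResult {
        final List<StoredLock> restored = new ArrayList<>();
        final Set<String> failedGroupIds = new LinkedHashSet<>();
    }

    /** Imitates the precondition that failed in the stack trace (assumed shape). */
    static void verify(Task task, StoredLock lock) {
        if (lock.priority != task.priority) {
            throw new IllegalArgumentException(
                "lock priority[" + lock.priority + "] is different from task priority[" + task.priority + "]");
        }
    }

    /** Restores what it can; a bad lock marks its task group as failed instead of aborting the sync. */
    static SyncResult syncFromStorage(Map<String, Task> activeTasks, List<StoredLock> storedLocks) {
        SyncResult result = new SyncResult();
        for (StoredLock lock : storedLocks) {
            Task task = activeTasks.get(lock.taskId);
            if (task == null) {
                continue; // lock belongs to a task that is no longer active
            }
            try {
                verify(task, lock);
                result.restored.add(lock);
            } catch (IllegalArgumentException e) {
                result.failedGroupIds.add(task.groupId);
            }
        }
        return result;
    }

    /** Collects the ids of every active task that belongs to a failed group. */
    static List<String> tasksToKill(Map<String, Task> activeTasks, Set<String> failedGroupIds) {
        List<String> ids = new ArrayList<>();
        for (Task task : activeTasks.values()) {
            if (failedGroupIds.contains(task.groupId)) {
                ids.add(task.id);
            }
        }
        return ids;
    }
}
```

In this sketch the caller would invoke `syncFromStorage` while becoming leader and then pass `failedGroupIds` to `tasksToKill`, so a single inconsistent lock costs one task group rather than leadership.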

Key changed/added classes in this PR
  • TaskLockbox

This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@cryptoe cryptoe requested a review from AmatyaAvadhanula April 6, 2023 16:51
@cryptoe cryptoe added this to the 26.0 milestone Apr 6, 2023
@cryptoe cryptoe added the Bug and Area - MSQ labels and removed the Area - MSQ label Apr 6, 2023
@cryptoe cryptoe merged commit 8712098 into apache:master Apr 10, 2023
cryptoe added a commit to cryptoe/druid that referenced this pull request Apr 14, 2023