Fix IllegalArgumentException in TaskLockBox.syncFromStorage() when updating from 0.12.x to 0.12.2#6086
Conversation
| // The priority of task and taskLock can be different when updating cluster if the task doesn't have the | ||
| // priority in taskContext while taskLock has. In this case, we simply ignores the task priority and uses the | ||
| // taskLock priority. | ||
| if (task.getContextValue(Tasks.PRIORITY_KEY) == null && task.getDefaultPriority() == taskLock.getPriority()) { |
There was a problem hiding this comment.
I don't understand this line. If we have a task lock saved from an earlier version of Druid with missing priority (so it becomes 0 when read from the DB) and a task running with no priority specified, then:
task.getContextValue(Tasks.PRIORITY_KEY)is nulltask.getDefaultPriority()is some default priority like 75taskLock.getPriority()is 0
So this expression is overall false, and we will go to the else block and throw an exception. It seems like the problem mentioned in the PR description is not fixed? But if the unit tests pass, how can that be?
There was a problem hiding this comment.
I looked into it a bit more and I think the issue is that Task.java has:
default int getPriority()
{
return getContextValue(Tasks.PRIORITY_KEY, Tasks.DEFAULT_TASK_PRIORITY);
}
And NoopTask does not override this. Since Tasks.DEFAULT_TASK_PRIORITY is 0, NoopTask has 0 priority, and so the tests are too easy to pass.
I think Task should change to:
default int getPriority()
{
return getContextValue(Tasks.PRIORITY_KEY, getDefaultPriority());
}
That way, the behavior of getPriority() is consistent with getDefaultPriority(). It seems like the only reason that priorities work at all today is that the OverlordResource injects them with this code:
// Set default priority if needed
final Integer priority = task.getContextValue(Tasks.PRIORITY_KEY);
if (priority == null) {
task.addToContext(Tasks.PRIORITY_KEY, task.getDefaultPriority());
}
I think this won't be necessary if Task.getPriority() is fixed.
There was a problem hiding this comment.
@gianm here is the thing.
Before 0.12, both task and taskLock don't have priority. And before #6050, the task works as you described above while the taskLock returns 0 if priority doesn't exist. So, when taskLockBox.syncFromStorage() is called, they work like below.
| priority container | value |
|---|---|
| task | no priority |
| taskLock | no priority |
| task.getPriority() | default priority depending on task type |
| taskLock.getPriority() | 0 |
This causes the priority mismatch between task and taskLock, so #6050 was to fix this issue. In that PR, I've changed to inject the default priority to submitted tasks if they don't have priority. So, now task.getPriority() returns the priority what it has or 0 which is the same with how taskLock.getPriority() works.
However, after this patch, taskLockBox.syncFromStorage() can introduce the below situation.
| priority container | value |
|---|---|
| task | no priority |
| taskLock | default priority depending on task type |
| task.getPriority() | 0 |
| taskLock.getPriority() | default priority depending on task type |
This can happen because the priority of taskLock is initialized with task.getPriority() which returns the default priority if priority doesn't exist in taskContext before #6050. So, if the task was created before #6050, the priority of task and its taskLock would be different.
|
LGTM after CI |
This bug was introduced in #6050. When updating the druid cluster from 0.12.x to 0.12.2, taskLocks in metastore should have the proper priority while tasks might not have it in taskContext. If priority isn't in taskContext, they return the default priority before #6050. But after #6050, tasks return 0 priority if priority doesn't exist in taskContext which causes this bug.
In this patch, I changed the sanity check to pass if task doesn't have the priority and taskLock priority is same with task's default priority. Also added unit tests.