Fix Kafka Indexing task pause forever if no events in taskDuration (#5656) #5899
gianm merged 3 commits into apache:master
* Fix NullPointerException in the overlord if taskGroups does not contain the groupId
* If the endOffset is the same as the startOffset, still let the task resume instead of returning endOffsets early, which causes the tasks to pause forever and ultimately fail on timeout
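The resume behaviour described above can be pictured with a small toy model. This is only a sketch under stated assumptions: `ResumeSketch`, `ToyTask`, and both `setEndOffsets...` methods are hypothetical names invented for illustration and are not the code in this patch or Druid's actual task API. It contrasts an early return when the end offsets equal the start offsets (the task stays paused) with a path that always resumes.

```java
import java.util.HashMap;
import java.util.Map;

public class ResumeSketch {

    /** Hypothetical stand-in for a paused indexing task; not Druid's real task class. */
    public static final class ToyTask {
        private final Map<Integer, Long> startOffsets;
        private Map<Integer, Long> endOffsets = new HashMap<>();
        private boolean paused = true; // the supervisor paused the task before setting offsets

        public ToyTask(Map<Integer, Long> startOffsets) {
            this.startOffsets = new HashMap<>(startOffsets);
        }

        public boolean isPaused() {
            return paused;
        }

        // Assumed shape of the problem: when no events arrived, end == start,
        // the method returns early and the resume step is never reached.
        public Map<Integer, Long> setEndOffsetsReturningEarly(Map<Integer, Long> offsets) {
            if (offsets.equals(startOffsets)) {
                return offsets; // task is still paused here
            }
            endOffsets = new HashMap<>(offsets);
            paused = false;
            return endOffsets;
        }

        // Assumed shape of the fix: record the offsets and resume even if end == start.
        public Map<Integer, Long> setEndOffsetsAlwaysResuming(Map<Integer, Long> offsets) {
            endOffsets = new HashMap<>(offsets);
            paused = false;
            return endOffsets;
        }
    }

    public static void main(String[] args) {
        Map<Integer, Long> start = new HashMap<>();
        start.put(0, 100L); // partition 0 at offset 100, no new events consumed

        ToyTask stuck = new ToyTask(start);
        stuck.setEndOffsetsReturningEarly(start);
        System.out.println("early return, paused = " + stuck.isPaused());

        ToyTask resumed = new ToyTask(start);
        resumed.setEndOffsetsAlwaysResuming(start);
        System.out.println("always resume, paused = " + resumed.isPaused());
    }
}
```

In the toy model, a task that stays paused would eventually hit whatever timeout the caller enforces, which matches the "pause forever and ultimately fail on timeout" symptom in the description.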
    @VisibleForTesting
    String generateSequenceName(int groupId)
    {
      if (taskGroups.get(groupId) == null) {
It looks like this should never happen. Could you elaborate on when this can happen?
This happens if no events were passed to the Kafka tasks; this is the issue from #5666. checkTaskDuration() removes the groupId with taskGroups.remove(groupId), but later checkPendingCompletionTasks() tries to look up the same groupId in sequenceTaskGroup.remove(generateSequenceName(groupId)).
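The call order described here can be reproduced in a toy model. This is a minimal sketch, not the supervisor's real code: the class `SequenceNameNpeSketch`, the `ToyTaskGroup` type, the `addGroup` helper, and the way the sequence name is built are all assumptions made for illustration. Only the method names checkTaskDuration, checkPendingCompletionTasks, and generateSequenceName and the two maps come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

public class SequenceNameNpeSketch {

    /** Hypothetical task group holding only what the name needs. */
    static final class ToyTaskGroup {
        final String baseName;

        ToyTaskGroup(String baseName) {
            this.baseName = baseName;
        }
    }

    private final Map<Integer, ToyTaskGroup> taskGroups = new HashMap<>();
    private final Map<String, Integer> sequenceTaskGroup = new HashMap<>();

    // Hypothetical helper that registers a group in both maps.
    public void addGroup(int groupId, String baseName) {
        taskGroups.put(groupId, new ToyTaskGroup(baseName));
        sequenceTaskGroup.put(generateSequenceName(groupId), groupId);
    }

    // Dereferences the map entry without a null check, as discussed in the thread.
    public String generateSequenceName(int groupId) {
        return taskGroups.get(groupId).baseName + "_" + groupId;
    }

    // Step 1 from the comment: the group is removed from taskGroups.
    public void checkTaskDuration(int groupId) {
        taskGroups.remove(groupId);
    }

    // Step 2 from the comment: the name is regenerated for a group that is gone.
    public void checkPendingCompletionTasks(int groupId) {
        sequenceTaskGroup.remove(generateSequenceName(groupId));
    }

    public static void main(String[] args) {
        SequenceNameNpeSketch supervisor = new SequenceNameNpeSketch();
        supervisor.addGroup(0, "index_kafka_toy");
        supervisor.checkTaskDuration(0);
        try {
            supervisor.checkPendingCompletionTasks(0);
        } catch (NullPointerException e) {
            System.out.println("NPE after the group was removed");
        }
    }
}
```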
@jihoonson I tried again with the null check removed, and I cannot reproduce the NPE now that the task resume fix is in, but I think this null check couldn't hurt.
@surekhasaharan thanks. A null check is always good, but I'm not sure about returning null in this method. If groupId is never expected to be null, we should throw an exception. Otherwise, this method can return null, but all callers should check whether the returned sequenceName is null. What do you think?
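The two contracts being weighed here could look roughly like the following. This is a hedged sketch, not a proposal for the actual method: `SequenceNameContractSketch`, `putGroup`, `generateSequenceNameStrict`, and `findSequenceName` are hypothetical names, the map of groupId to a base name is a simplification, and `Optional` is swapped in for a bare null return so that callers are forced to handle the missing case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SequenceNameContractSketch {

    // Simplified stand-in for taskGroups: groupId -> base name.
    private final Map<Integer, String> taskGroups = new HashMap<>();

    public void putGroup(int groupId, String baseName) {
        taskGroups.put(groupId, baseName);
    }

    // Option 1: a missing group is a bug, so fail fast with a descriptive exception.
    public String generateSequenceNameStrict(int groupId) {
        String baseName = taskGroups.get(groupId);
        if (baseName == null) {
            throw new IllegalStateException("No task group for groupId [" + groupId + "]");
        }
        return baseName + "_" + groupId;
    }

    // Option 2: a missing group is legal, so the return type makes callers check for it.
    public Optional<String> findSequenceName(int groupId) {
        return Optional.ofNullable(taskGroups.get(groupId))
                       .map(baseName -> baseName + "_" + groupId);
    }

    public static void main(String[] args) {
        SequenceNameContractSketch sketch = new SequenceNameContractSketch();
        sketch.putGroup(0, "index_kafka_toy");
        System.out.println(sketch.generateSequenceNameStrict(0));
        System.out.println(sketch.findSequenceName(1).isPresent());
    }
}
```

With the strict variant a stale groupId surfaces immediately with a clear message; with the optional variant every call site has to decide what a missing group means.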
I agree with @jihoonson -- let's fix this in a different patch, since it's really a different bug from the main one we're fixing. The main one being:
If the endOffset is same as startOffset, still let the task resume instead of returning
endOffsets early which causes the tasks to pause forever and ultimately fail on timeout
I raised a new issue for this NPE: #5900
@jihoonson Agreed that all the callers will need to handle null in this case. I will address the possible NPE in the separate issue raised by @gianm. Removing the null check here.
* Remove the null check and do not return null from generateSequenceName
@surekhasaharan The test failure seems legit:
All tests in
@surekhasaharan Ah okay cool - I restarted it and it passed.
…pache#5656) (apache#5899)
* Fix Kafka Indexing task pause forever (apache#5656)
* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning endOffsets early which causes the tasks to pause forever and ultimately fail on timeout
* Address PR comment
* Remove the null check and do not return null from generateSequenceName
…5656) (#5899) (#5971)
* Fix Kafka Indexing task pause forever (#5656)
* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning endOffsets early which causes the tasks to pause forever and ultimately fail on timeout
* Address PR comment
* Remove the null check and do not return null from generateSequenceName