Skip to content

Fix Kafka Indexing task pause forever if no events in taskDuration (#5656)#5899

Merged
gianm merged 3 commits intoapache:masterfrom
surekhasaharan:issue-5656
Jun 26, 2018
Merged

Fix Kafka Indexing task pause forever if no events in taskDuration (#5656)#5899
gianm merged 3 commits intoapache:masterfrom
surekhasaharan:issue-5656

Conversation

@surekhasaharan
Copy link
Copy Markdown

@surekhasaharan surekhasaharan commented Jun 22, 2018

If the endOffset is same as startOffset, still let the task resume instead of returning
endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout
@VisibleForTesting
String generateSequenceName(int groupId)
{
if (taskGroups.get(groupId) == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks that this should never happen. Would you elaborate more on when this can happen?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens if no events were passed to kafka tasks, this is the issue from #5666. So checkTaskDuration() removes the groupId taskGroups.remove(groupId); but later checkPendingCompletionTasks() tries to get the groupId in sequenceTaskGroup.remove(generateSequenceName(groupId));

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson I tried again, removing the null check, and I cannot reproduce the NPE now with the task resume fix in, but I think, this null check couldn't hurt.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@surekhasaharan thanks. Null check is always good, but I'm not sure about returning null in this method. If groupId is never expected to be null, we should throw an exception. Otherwise, this method can return null, but all callers should check the returned sequenceName is null or not. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jihoonson -- let's fix this in a different patch, since it's really a different bug from the main one we're fixing. The main one being:

If the endOffset is same as startOffset, still let the task resume instead of returning
endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

I raised a new issue for this NPE: #5900

Copy link
Copy Markdown
Author

@surekhasaharan surekhasaharan Jun 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson Agree that all the callers will need to handle null in this case. Will address the possible NPE in separate issue raised by @gianm . Removing the null check here.

Surekha Saharan added 2 commits June 22, 2018 22:54
*Remove the null check and do not return null from generateSequenceName
@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jun 25, 2018

@surekhasaharan The test failure seems legit:

testIncrementalHandOff[isIncrementalHandoffSupported = true](io.druid.indexing.kafka.KafkaIndexTaskTest)  Time elapsed: 2.147 sec  <<< FAILURE!
java.lang.AssertionError: expected:<SUCCESS> but was:<FAILED>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at io.druid.indexing.kafka.KafkaIndexTaskTest.testIncrementalHandOff(KafkaIndexTaskTest.java:517)

@surekhasaharan
Copy link
Copy Markdown
Author

All tests in KafkaIndexTaskTest seem to pass locally.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jun 26, 2018

@surekhasaharan Ah okay cool - I restarted it and it passed.

@gianm gianm merged commit 0f42929 into apache:master Jun 26, 2018
@surekhasaharan surekhasaharan deleted the issue-5656 branch June 27, 2018 16:56
@jihoonson jihoonson added this to the 0.12.2 milestone Jul 5, 2018
jihoonson pushed a commit to implydata/druid-public that referenced this pull request Jul 5, 2018
…pache#5656) (apache#5899)

* Fix Kafka Indexing task pause forever (apache#5656)

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Address PR comment

*Remove the null check and do not return null from generateSequenceName
jihoonson pushed a commit to jihoonson/druid that referenced this pull request Jul 7, 2018
…pache#5656) (apache#5899)

* Fix Kafka Indexing task pause forever (apache#5656)

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Address PR comment

*Remove the null check and do not return null from generateSequenceName
gianm pushed a commit that referenced this pull request Jul 9, 2018
…5656) (#5899) (#5971)

* Fix Kafka Indexing task pause forever (#5656)

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Address PR comment

*Remove the null check and do not return null from generateSequenceName
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants