Skip to content

Fix timeout in KafkaSupervisorTest.testCheckpointForInactiveTaskGroup#6207

Merged
fjy merged 4 commits intoapache:masterfrom
jihoonson:fix-timeout-kafka-supervisor-test
Aug 27, 2018
Merged

Fix timeout in KafkaSupervisorTest.testCheckpointForInactiveTaskGroup#6207
fjy merged 4 commits intoapache:masterfrom
jihoonson:fix-timeout-kafka-supervisor-test

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

Let's see what is the error message.


Assert.assertNull(serviceEmitter.getStackTrace());
Assert.assertNull(serviceEmitter.getExceptionMessage());
Assert.assertNull(serviceEmitter.getStackTrace(), serviceEmitter.getStackTrace());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a fix or just temp debugging arrangement to see what is in the stacktrace so that later a fix can be done ?

test times out because serviceEmitter.getStackTrace() is non-null , so this assertion is gonna fail in those cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this doesn't fix the root cause, but fixes the infinite loop and print the stack trace so that we can fix it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR title misled me :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, do you want this PR to be merged or you're trying builds on this PR to see things whenever this test fails ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the title looks somewhat confusing :(. Do you have a better idea?

I thought it would be great if the CI for this PR shows something interesting, but unfortunately it succeeded. I can restart it until I find something, or check other PRs if they fail with some stack traces. I don't have a strong opinion here, but I would say this PR gives us a small benefit by making CI jobs fail faster.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, "Add debug information to expose transient failure cause in KafkaSupervisorTest.testCheckpointForInactiveTaskGroup", if this PR is merged without actually fixing.

I'm approving it so feel free to merge whenever you want after trying out the builds here. I triggered one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It catches the exception. I'll check it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception was

  KafkaSupervisorTest.testCheckpointForInactiveTaskGroup:2118 java.lang.AssertionError: 
  Unexpected method call TaskRunner.getRunningTasks():
	at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:44)
	at org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
	at com.sun.proxy.$Proxy43.getRunningTasks(Unknown Source)
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor$1.getTaskLocation(KafkaSupervisor.java:296)
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor.checkpointTaskGroup(KafkaSupervisor.java:1503)
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor.checkTaskDuration(KafkaSupervisor.java:1433)
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor.runInternal(KafkaSupervisor.java:879)
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor$RunNotice.handle(KafkaSupervisor.java:595)
	at io.druid.indexing.kafka.supervisor.KafkaSupervisor$2.run(KafkaSupervisor.java:369)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

So, I added expect(taskRunner.getRunningTasks()).andReturn(workItems).anyTimes();. Hopefully this fixes this issue.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leventov thanks. I've also noticed that before. The problem is that the supervisor periodically executes runNotice and this can potentially call every method of the supervisor which in turn requires to mock everything using EasyMock.

I think it's better to refactor the whole KafkaSupervisorTest to be based on a sort of more controllable mockup environment by creating taskRunner, taskMaster, taskStorage, and taskClient for test purpose instead of making them using EasyMock.

Raised #6296.

@fjy fjy merged commit 64d33ee into apache:master Aug 27, 2018
@dclim dclim added this to the 0.13.0 milestone Oct 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants