@anmolnar
Contributor

@lvfangmin @hanm @eolivelli

You were working on #1130. Please check this fix; I'm not sure it's appropriate.
When running the unit test, the thread executing tearDown() gets interrupted, which causes the entire test runner to shut down. This affects both Maven and Ant builds. I have no idea why we can't see it in Maven jobs.

@eolivelli
Contributor

If we call System.exit, the forked Maven process will exit, so we may be running fewer tests than expected.
I have already seen this kind of problem in other projects.
I can try to port a 'System.exit' shield.

@eolivelli eolivelli left a comment

The fix I have on other projects for System.exit is to add a SecurityManager that prevents calls like System.exit, so that System.exit ends in a SecurityException.

But the best way is to not call System.exit in tests.

LOG.warn("CommitProcessor does not shutdown gracefully after "
+ "waiting for {} ms, exit to avoid potential "
+ "inconsistency issue", workerShutdownTimeoutMS);
System.exit(ExitCode.SHUTDOWN_UNGRACEFULLY.getValue());

Catching InterruptedException won't help with System.exit.

Maybe the best idea is to have a system property that disables the real 'System.exit', and to set that property in tests (Maven Surefire plugin config).
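
A minimal sketch of the system-property idea, not the project's actual code. The class name `GuardedExit` and the property name `test.disableSystemExit` are hypothetical, chosen only for illustration. When the property is set to `true`, the helper throws instead of terminating the JVM, so a test fails visibly rather than killing the forked process.

```java
/** Hypothetical helper: production code would call GuardedExit.exit(...) instead of System.exit(...). */
final class GuardedExit {
    // Hypothetical property name, not an existing ZooKeeper setting.
    static final String DISABLE_PROP = "test.disableSystemExit";

    private GuardedExit() {}

    static void exit(int code) {
        // Boolean.getBoolean returns true only if the system property exists and equals "true".
        if (Boolean.getBoolean(DISABLE_PROP)) {
            throw new IllegalStateException(
                "System.exit(" + code + ") suppressed because -D" + DISABLE_PROP + "=true");
        }
        System.exit(code);
    }
}
```

With Surefire, a property like this could be passed to the forked JVM through the plugin's `systemPropertyVariables` configuration.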

@anmolnar
Contributor Author

anmolnar commented Nov 14, 2019

@eolivelli Adding some more context, because I think you misunderstood my fix.

The problem with the original code is that while the thread is waiting for the CommitProcessor thread to finish in this.join(millis), it gets interrupted by something else (the JUnit runner?).
The code catches the InterruptedException and goes on to test whether CommitProcessor is still running with this.isAlive().
There's some chance here that CP is still running, in which case System.exit() will be called, shutting down the entire process.

My fix changes the original behavior to avoid further processing by returning from the entire shutdown method if the calling thread has been interrupted. This significantly changes the original implementation, hence we need to review it very carefully.
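
The race described above can be reproduced in isolation. This is a standalone sketch, not the CommitProcessor code: the thread names, the latches, and the class `JoinInterruptDemo` are all made up for the demonstration. A "caller" thread joins a worker with a timeout, is interrupted while the worker is still blocked, swallows the exception, and then observes `isAlive()` returning true, which is the state in which the original code would go on to call System.exit().

```java
import java.util.concurrent.CountDownLatch;

final class JoinInterruptDemo {
    private JoinInterruptDemo() {}

    /** Returns true if join() was interrupted and the worker was still alive afterwards. */
    static boolean interruptedWhileWorkerAlive() throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);        // keeps the worker running
        CountDownLatch callerRunning = new CountDownLatch(1);  // caller has started

        Thread worker = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // not expected in this demo
            }
        }, "worker");
        worker.start();

        boolean[] observed = new boolean[2]; // [0] = join interrupted, [1] = worker alive
        Thread caller = new Thread(() -> {
            callerRunning.countDown();
            try {
                worker.join(5_000);
            } catch (InterruptedException e) {
                observed[0] = true; // swallowed, as in the code under discussion
            }
            observed[1] = worker.isAlive();
        }, "caller");
        caller.start();

        callerRunning.await();
        caller.interrupt(); // stands in for the unidentified interrupter
        caller.join();      // also makes the writes to 'observed' visible here

        release.countDown(); // let the worker finish so the demo exits cleanly
        worker.join();
        return observed[0] && observed[1];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupted while worker alive: " + interruptedWhileWorkerAlive());
    }
}
```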

@eolivelli
Contributor

@anmolnar now I understand your fix better.
It is possible that Ant is trying to interrupt application threads; this is not done in Maven.

To me it is not very clear whether we can skip calling the next processor. This is a behavior change.

We could also simply revert the original fix and re-open the JIRA; this way the master branch will be clear and we will have time to fix the original patch.
@lvfangmin how does this plan sound to you?

@anmolnar
Contributor Author

@eolivelli Sounds good to me.

It's not just Ant. I cannot run the tests with Maven on my local machine either. That's why I don't understand how the Maven build could be green: https://builds.apache.org/view/S-Z/view/ZooKeeper/job/zookeeper-master-maven/521/org.apache.zookeeper$zookeeper/testReport/

Btw, I cannot see that QuorumRequestPipelineTest was run in that particular build.

@anmolnar
Contributor Author

Look at this. The test hasn't succeeded since build no. 514, and the job has not reported it as an error:
https://builds.apache.org/view/S-Z/view/ZooKeeper/job/zookeeper-master-maven/514/org.apache.zookeeper$zookeeper/consoleText

The error is the same:

Crashed tests:
org.apache.zookeeper.server.quorum.QuorumRequestPipelineTest
ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?

@eolivelli
Contributor

Yep, it is a known and annoying problem in Surefire: if the forked VM crashes, the problem is not reported anywhere except the logs.

Btw, it is better not to call System.exit in tests.
I wonder if a good approach is to have a global system utility to call System.exit.
In tests we can override the behaviour with something that logs an error or fails the test.
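
One way such a utility might look, as a rough sketch only: the class `ExitHandler` and its methods are hypothetical names, not an existing API in this codebase. Production code would route every exit through `ExitHandler.exit(...)`, which defaults to `System::exit`; a test would install its own handler in setup and restore the default in teardown.

```java
import java.util.Objects;
import java.util.function.IntConsumer;

/** Hypothetical central exit point whose behaviour tests can override. */
final class ExitHandler {
    // Default behaviour: really terminate the JVM.
    private static volatile IntConsumer handler = System::exit;

    private ExitHandler() {}

    /** Install a replacement, e.g. one that records the code or fails the test. */
    static void setHandler(IntConsumer replacement) {
        handler = Objects.requireNonNull(replacement, "replacement");
    }

    /** Restore the default System.exit behaviour. */
    static void reset() {
        handler = System::exit;
    }

    static void exit(int code) {
        handler.accept(code);
    }
}
```

A test handler could, for instance, throw an `AssertionError` carrying the exit code, so the failure shows up in the test report instead of as a crashed fork.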

I see the same problem when using Kafka Server in "unit" tests of downstream applications: the embedded Kafka Server breaks the JVM, and this is really annoying.

@anmolnar
Contributor Author

@eolivelli Makes sense. If it's a limitation of the Surefire plugin, I think the best option would be to revert the patch and avoid using System.exit anywhere in our codebase.

The workaround that you suggested would also be acceptable to me. I'm closing this one and pushing the revert. @lvfangmin fyi.

@anmolnar anmolnar closed this Nov 15, 2019
@anmolnar anmolnar deleted the ZOOKEEPER-3598 branch November 15, 2019 10:00
@lvfangmin
Contributor

If the CommitProcessor thread.join is interrupted and the thread is still alive, we cannot call nextProcessor.shutdown; otherwise it will affect correctness. System.exit is the safe way to avoid the potential correctness issue here.

@anmolnar can you help me understand why tearDown interrupted the CommitProcessor shutdown()? I checked the unit test and didn't see where we interrupt the code. Is it from JUnit? Maybe we should change the unit test, if that's possible, instead of reverting the code, which will actually affect prod.

@anmolnar
Contributor Author

@lvfangmin Good question, I have no idea.
I didn't find the place where the actual interrupt was called, but I have seen the debugger go into the InterruptedException catch branch.
