Skip to content

KAFKA-14244: Add guard against accidental calls to halt JVM during testing#12666

Closed
C0urante wants to merge 5 commits intoapache:trunkfrom
C0urante:kafka-14244
Closed

KAFKA-14244: Add guard against accidental calls to halt JVM during testing#12666
C0urante wants to merge 5 commits intoapache:trunkfrom
C0urante:kafka-14244

Conversation

@C0urante
Copy link
Copy Markdown
Contributor

@C0urante C0urante commented Sep 20, 2022

Jira

Uses an automatically-registered JUnit 5 extension that installs a SecurityManager which intercepts attempts to terminate the JVM that appear to have come from test cases, and then reports those attempts by causing tests to fail. If the test that attempted to terminate the JVM is running at the time of the attempt, that test will fail. If not, a random test will fail at a later point (in order to fail the CI build and surface the issue), with a message reporting that a prior test appears to have leaked a thread which has attempted to terminate the JVM after that test completed.

The InheritableThreadLocal class is used to track which tests attempt to terminate the JVM, including cases where the attempt occurs in a separate thread, and even when the attempt has occurred after the test has already completed.

Although the SecurityManager API was deprecated and slated for removal in JEP-411, there is no existing alternative for intercepting attempts to terminate the JVM (see https://bugs.openjdk.org/browse/JDK-8199704). Once an alternative becomes available, we can use it instead of the SecurityManager API.

This is heavily inspired by the system-rules library, with some modifications to support parallel testing.

It's worth noting that there is a potentially much-simpler fix where we modify the Exit wrapper classes to use InheritableThreadLocal instances to track custom procedures. However, this does not automatically prevent the JVM from being accidentally terminated during testing (tests still have to remember to use the wrapper class), and it does not provide a way to fail the build when unexpected attempts to terminate the JVM take place, potentially swallowing dangerous failures during testing and CI.

Uses an automatically-registered JUnit 5 extension that installs a custom default exit procedure which intercepts attempts to terminate the JVM that appear to have come from test cases, and then reports those attempts by causing tests to fail. If the test that attempted to terminate the JVM is running at the time of the attempt, that test will fail. If not, a random test will fail at a later point (in order to fail the CI build and surface the issue), with a message reporting that a prior test appears to have leaked a thread which has attempted to terminate the JVM after that test completed.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante C0urante added the tests Test fixes (including flaky tests) label Sep 20, 2022
Comment thread build.gradle
includeTags "integration"
includeTags "org.apache.kafka.test.IntegrationTest"
// Both engines are needed to run JUnit 4 tests alongside JUnit 5 tests.
// Both engines are needed to run JUnit 4 tests alongside JUnit 5 tests.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated whitespace fix; this line was not aligned with others in my IDE.

Comment thread build.gradle
excludeTags "integration"
excludeTags "org.apache.kafka.test.IntegrationTest"
// Both engines are needed to run JUnit 4 tests alongside JUnit 5 tests.
// Both engines are needed to run JUnit 4 tests alongside JUnit 5 tests.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated whitespace fix; this line was not aligned with others in my IDE.

@showuon
Copy link
Copy Markdown
Member

showuon commented Sep 21, 2022

@C0urante , interesting solution. But I'm wondering how you test it and prove this can really catch Exit and report error?

@divijvaidya
Copy link
Copy Markdown
Member

This is a good idea and definitely an improvement over what already exists. But I think that the downside of taking a dependency on something that is deprecated as soon as JDK 18 is quite significant. Perhaps, we can solve the underlying problem in a different way!

Let me explain my point of view on this.

Some of our tests spawn new threads. The expectation for the happy case path is that these threads are terminated by the test gracefully at the end of test execution. In case when these threads are not terminated, they may cause erroneous/undeterministic behaviour such as calling System.Exit().

Instead of focusing on Exit() (which is not the underlying cause, it's a side effect), could we try to formulate a solution which ensures that the test would fail if threads created by that test are not terminated at the end of the execution?

I don't have a proposal towards a solution but there are ideas that we could explore such as ensuring all threads created by a test have a property in their thread local (this could be set in @before and removed in @after).

Thoughts?

@C0urante
Copy link
Copy Markdown
Contributor Author

@showuon Thanks for taking a look. I did some local testing by injecting calls to System::exit into existing unit tests, both on the actual test thread and on separate threads with a little sleep time added to allow the test case to complete before attempting to terminate the JVM. If this PR seems promising I can take a stab at adding some testing logic for the extension/security manager; LMKWYT.

@C0urante
Copy link
Copy Markdown
Contributor Author

@divijvaidya Thank you for your thoughts. I don't quite agree that attempting to terminate the JVM is a symptom rather than the root cause, since there are cases where these attempts occur without threads being leaked (one such case was touched on in #9698). In addition, the main issue here is that these attempts tend to crop up only during CI builds and not locally, in which case the cost of allowing the JVM to be terminated is higher as debugging is harder and the build attempt is rendered useless.

I'm also not very worried about the deprecation of the SecurityManager API since it's unlikely that the API will be fully removed without adding in an alternative way to intercept calls to terminate the JVM; this use case is called out repeatedly in JEP-411 as a valuable one. I've updated the PR to be safe with Java 18 with some small conditional logic in the Gradle build file. This should lay some decent groundwork for the future if we want to apply different means of intercepting calls to terminate the JVM depending on which Java version we're building with.

That said, if you have the time to develop and implement an alternative proposal, I'd be happy to review a POC. I'm not certain we really can track thread creation and termination during tests without a serious refactoring of the code base, but if there's a lightweight way to do this, it would probably be worth considering either instead of or in addition to hardening our testing logic against JVM termination.

@divijvaidya
Copy link
Copy Markdown
Member

@C0urante Thank you for explaining. I stand corrected that thread leaks are the only cause of System.Exit. Your proposed approach sounds good to me. For the thread leak scenario, we can perhaps start using TestUtils.verifyNoUnexpectedThreads in modules other than core to as an additional guard.

@C0urante
Copy link
Copy Markdown
Contributor Author

C0urante commented Sep 23, 2022

@showuon I've added some tests for the custom extension that should demonstrate its effectiveness.

I should note that I also experimented with an approach that leveraged JUnit 5's test class ordering API to create a test that runs after all others have completed and checks for leaked threads that have tried to terminate the JVM, instead of performing that check after every test and failing potentially-unrelated tests in order to surface the problem. Unfortunately, I couldn't find a way to make this work with parallel testing, since the test would only be run on a single executor, and would often finish before other tests on other executors.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Oct 6, 2022

@C0urante You mentioned in your PR that we could potentially adjust the Exit classes, but it would not solve the problem as well. Is the main issue that we have System.exit/halt calls that are not going via the Exit classes or something else?

@C0urante
Copy link
Copy Markdown
Contributor Author

C0urante commented Oct 10, 2022

@ijuma It's partially that we sometimes have direct calls to System::exit (although it's rare that these would make it past review), and partially that we wouldn't have a way to automatically fail tests that make unexpected calls to Exit::exit or Exit::halt.

I think if we could find a way to address the latter point, it'd be worth it to just modify the existing Exit classes instead of installing the custom extension here. But I haven't found a way to do that yet.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Oct 14, 2022

I think there's no way around the junit extension, but I was wondering if we could use that with the Exit classes versus the security manager approach. Then we don't have to depend on the deprecated security manager. But if that's too difficult, then we can go with the security manager approach and adjust it for newer JDK versions when/if a new api is introduced to replace it.

@C0urante
Copy link
Copy Markdown
Contributor Author

@ijuma Sorry for the delay.

The tradeoff involved is whether we want to account for direct calls to System::exit or not. If we're confident enough in our review process to catch those before they get merged, or at least to quickly identify them after they've been merged and started causing issues in CI, then it should be fine to remove the security manager and move everything into the Exit class. I'll give that a shot and see if I can push a new draft by the end of the week.

@C0urante
Copy link
Copy Markdown
Contributor Author

C0urante commented Nov 1, 2022

I've updated the PR to not use a custom SecurityManager, and drop support for catching direct calls to System::exit.

@divijvaidya @ijuma If you have time, let me know what you think; hoping this is readable to fresh eyes, but want to make sure.

@C0urante
Copy link
Copy Markdown
Contributor Author

C0urante commented Jun 6, 2023

Closing due to lack of interest; we can revisit this if necessary.

@C0urante C0urante closed this Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants