KAFKA-9875: Make integration tests more resilient #8578
vvcephei merged 1 commit into apache:trunk from vvcephei:kafka-9875-integration-test-resilience
vvcephei
left a comment
Hey @guozhangwang, do you mind taking a look at this, since it overlaps with some of your recent contributions?
I've standardized all the usages to be just "app", followed by the generated name, since the generated name contains the same information that we previously hand-wrote into the prefix or suffix. All we really need to do is ensure that the app id won't collide with a group name that we might use in a verification consumer, for example. For that reason, I've never used the generated name "plain", but always scoped it to the usage (app id, group id, input topic, etc.).
It's not super important to apply these ideas universally, but I felt it would make it easier to write more tests like it in the future if I just made a full pass on all the tests to make them all look the same.
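To illustrate the naming scheme described above, here is a minimal sketch. The class `TestNames`, the helper `safeUniqueTestName` with this signature, and the `app-`/`group-`/`input-` prefixes are all assumptions made for illustration, not the utility added in this PR. The idea is only that one generated name is derived from the test class and method, and each resource scopes it with its own usage prefix so an app id can't collide with a consumer group id.

```java
class TestNames {
    // Hypothetical helper: combine class and method names, then replace any
    // character outside [a-zA-Z0-9._-] so the result is safe as a resource name.
    static String safeUniqueTestName(String className, String methodName) {
        String raw = className + "-" + methodName;
        return raw.replaceAll("[^a-zA-Z0-9._-]", "_");
    }

    public static void main(String[] args) {
        String name = safeUniqueTestName("MyIntegrationTest", "shouldCount[caching=true]");
        // Never use the generated name "plain"; always scope it to the usage.
        String appId = "app-" + name;
        String groupId = "group-" + name;
        String inputTopic = "input-" + name;
        System.out.println(appId);
        System.out.println(groupId);
        System.out.println(inputTopic);
    }
}
```

Because the prefixes differ, the three names stay distinct even though they share the same generated suffix.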
This is really the fix for KAFKA-9875. The other change just hopefully reduces the probability that ignoring the exceptions could cause subsequent failures (e.g., if the topics don't get deleted before the next test, at least the next one will have different topic names).
I've verified that all usages of this method are ok to ignore potential exceptions. Namely, as long as the test logic itself doesn't want to ensure that any topics got deleted, and as long as this method is the last line in the method, then it should be fine just to ignore failures here.
I also considered just deleting the method, but if it does succeed, then it leaves less garbage around for subsequent tests, so it feels better to at least attempt a cleanup.
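As a rough sketch of what "attempt the cleanup, but ignore failures" can look like: the class `BestEffortCleanup`, the method `deleteTopicsQuietly`, and the `Consumer<String>` standing in for the real topic-deletion call are hypothetical, and this is not the code changed in this PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class BestEffortCleanup {
    // Tries to delete every topic; a failure is logged and skipped instead of
    // propagating, so it can't fail a test whose assertions already passed.
    // `deleter` is a hypothetical stand-in for the real deletion call.
    static boolean deleteTopicsQuietly(List<String> topics, Consumer<String> deleter) {
        boolean allDeleted = true;
        for (String topic : topics) {
            try {
                deleter.accept(topic);
            } catch (RuntimeException e) {
                // Swallowed on purpose: cleanup after the test is optional.
                System.err.println("Ignoring failure to delete " + topic + ": " + e.getMessage());
                allDeleted = false;
            }
        }
        return allDeleted;
    }

    public static void main(String[] args) {
        boolean ok = deleteTopicsQuietly(Arrays.asList("input-a", "output-a"), topic -> {
            if (topic.startsWith("output")) {
                throw new IllegalStateException("timed out");
            }
        });
        System.out.println("all deleted: " + ok);
    }
}
```

This is only safe under the conditions stated above: the call is the last thing the test does, and the test doesn't itself need to assert that the topics are gone.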
req: Actually deleting topics after the test is critical for some tests: I've encountered some cases where the same topics are reused mistakenly across different test cases within a single class. But I feel that it is better to put the topic deletion in the @After function while leaving cleanUp() as part of the test function itself.
I share your concern, but I'm not sure about the conclusion.
Yes, if there is state (such as a topic) that leaks from one test to the next, it can certainly cause difficult-to-debug failures. However, there are multiple things we can do to prevent/mitigate it:
- delete state after tests (not to leave any garbage behind)
- delete state before the tests (to ensure a clean slate for the test)
- choose unique names for all resources of each test (this is where the other part of this PR comes in)
Any one of these should be sufficient to prevent state from leaking in between tests, and most of these tests do all three. In other words, we have 3x redundancy guarding against such test pollution. If you look at all three of these measures, the clean up after tests is actually the most optional, since tests can't tolerate failures in the clean up before (because it also creates necessary topics), and choosing unique topic names per test is bulletproof and easy to fix (once we know what the problem is).
Whether the cleanup is part of the test or in the @After method, the outcome is the same: if the method throws an exception, the test will fail. The downside of @After is that it requires you to store the topic names in mutable class-level fields, which actually makes it more awkward to choose unique names per test.
Sounds good. I think this is convincing :)
just fixing the formatting.
guozhangwang
left a comment
Nice improvement.
I made a pass; except for the last req comment, I do not feel strongly about the other nits, so feel free to ignore them.
Is it safer to encode the appID as part of the dir path to avoid collision?
I wavered on this point, but each time you call tempDirectory, it should give you a completely new directory:
* Create a temporary relative directory in the default temporary-file directory with the given prefix.
*
* @param prefix The prefix of the temporary directory, if null using "kafka-" as default prefix
So the prefix seems to be nice only for documenting which test a directory is for, not for enforcing any kind of test/directory uniqueness. I felt like it added more noise than value, so I just dropped all the prefixes.
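A small sketch of why the prefix doesn't affect uniqueness, using the JDK's `Files.createTempDirectory` directly. That `tempDirectory` behaves the same way is an assumption here, and the wrapper `TempDirs.newTempDir` is a hypothetical name for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TempDirs {
    // Hypothetical wrapper: each call creates a brand-new directory whose name
    // starts with `prefix`; the JDK picks the rest of the name.
    static Path newTempDir(String prefix) {
        try {
            Path dir = Files.createTempDirectory(prefix);
            dir.toFile().deleteOnExit();
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path first = newTempDir("kafka-");
        Path second = newTempDir("kafka-");
        // Same prefix, two different directories.
        System.out.println(first);
        System.out.println(second);
        System.out.println("distinct: " + !first.equals(second));
    }
}
```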
Yeah, I was thinking about debugging purposes as well. If it is more noise than usefulness, I'm fine with dropping it.
nit: we can put kafkaStreams in a try block.
Unfortunately, we generally can't use try-with-resources for these tests, since that makes the kafkaStreams reference out of scope for the finally block. We'd have to allow a reference to kafkaStreams to escape the try {} block to reference it either in finally or in an After method, which is just as messy as it currently is, if not more.
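To make the scoping point concrete, here is a simplified sketch with a hypothetical `Resource` class standing in for kafkaStreams. A variable declared in the try-with-resources header is visible only inside the try body, so a `finally` block can't refer to it; declaring the reference before a plain `try` keeps it in scope.

```java
class ScopeDemo {
    // Hypothetical stand-in for an AutoCloseable such as kafkaStreams.
    static final class Resource implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    // Does not compile, because `r` is out of scope in the finally block:
    //
    //     try (Resource r = new Resource()) { ... } finally { cleanUp(r); }
    //
    // Declaring the reference outside the try keeps it usable in finally.
    static String runWithExplicitFinally() {
        Resource resource = new Resource();
        StringBuilder log = new StringBuilder();
        try {
            log.append("used;");
        } finally {
            resource.close();
            // Follow-up cleanup can still reference the resource here.
            log.append("closed=").append(resource.closed);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(runWithExplicitFinally());
    }
}
```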
nit: ditto here, we can put driver in the try block. And ditto elsewhere.
Thanks for the review, @guozhangwang. I've addressed (or responded to) your comments.
LGTM! Please feel free to merge after rebasing.
Thanks, @guozhangwang!
The ticket is for a flaky test that failed to clean up topics after the test, which
isn't strictly necessary for test success.
(after verifying it's always the last invocation inside a finally block,
so it won't break any test semantics)
a new test utility to generate safe, unique, descriptive names.