KAFKA-9875: Make integration tests more resilient by vvcephei · Pull Request #8578 · apache/kafka

vvcephei · 2020-04-28T22:46:53Z

The ticket is for a flaky test that failed to clean up topics after the test, which
isn't strictly necessary for test success.

alter the "clean up after test" method to never throw an exception
(after verifying it's always the last invocation inside a finally block,
so it won't break any test semantics)
consolidated the naming of all integration tests' app ids, topics, etc., by introducing
a new test utility to generate safe, unique, descriptive names.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

vvcephei

Hey @guozhangwang , do you mind taking a look at this, since it overlaps with some of your recent contributions?

vvcephei · 2020-04-28T22:48:29Z

I've standardized all the usages to be just "app", followed by the generated name, since the generated name contains the same information that we previously hand-wrote into the prefix or suffix. All we really need to do is ensure that the app id won't collide with a group name that we might use in a verification consumer, for example. For that reason, I've never used the generated name "plain", but always scoped it to the usage (app id, group id, input topic, etc.).

It's not super important to apply these ideas universally, but I felt it would make it easier to write more tests like it in the future if I just made a full pass on all the tests to make them all look the same.

vvcephei · 2020-04-28T22:56:13Z

This is really the fix for KAFKA-9875. The other change just hopefully reduces the probability that ignoring the exceptions could cause subsequent failures (e.g., if the topics don't get deleted before the next test, at least the next one will have different topic names).

I've verified that all usages of this method are ok to ignore potential exceptions. Namely, as long as the test logic itself doesn't want to ensure that any topics got deleted, and as long as this method is the last line in the method, then it should be fine just to ignore failures here.

I also considered just deleting the method, but if it does succeed, then it leaves less garbage around for subsequent tests, so it feels better to at least attempt a cleanup.

req: Actually deleting topics after test is critical for some tests: I've encountered some cases where the same topics are reused mistakenly across different test cases within the single class. But I feel that it is better to put the topic deletion in the @after function while leaving cleanUp() as part of the test function itself.

I share your concern, but I'm not sure about the conclusion.

Yes, if there is state (such as a topic) that leaks from one test to the next, it can certainly cause difficult-to-debug failures. However, there are multiple things we can do to prevent/mitigate it:

delete state after tests (not to leave any garbage behind)

delete state before the tests (to ensure a clean slate for the test)

choose unique names for all resources of each test (this is where the other part of this PR comes in)

Any one of these should be sufficient to prevent state from leaking in between tests, and most of these tests do all three. In other words, we have 3x redundancy guarding against such test pollution. If you look at all three of these measures, the clean up after tests is actually the most optional, since tests can't tolerate failures in the clean up before (because it also creates necessary topics), and choosing unique topic names per test is bulletproof and easy to fix (once we know what the problem is).

Whether the cleanup is part of the test or in the @After method, the outcome is the same, if the method throws an exception, the test will fail. The downside of After is that it requires you to store the topic names in mutable class-level fields, which actually makes it more awkward to choose unique names per test.

Sounds good. I think this is convincing :)

vvcephei · 2020-04-28T22:56:33Z

just fixing the formatting.

guozhangwang

Nice improvement.

I made a pass; except the last req comment I do not feel strong about other nits, so feel free to ignore.

guozhangwang · 2020-04-29T01:05:00Z

req: Actually deleting topics after test is critical for some tests: I've encountered some cases where the same topics are reused mistakenly across different test cases within the single class. But I feel that it is better to put the topic deletion in the @after function while leaving cleanUp() as part of the test function itself.

guozhangwang · 2020-04-29T01:11:29Z

Is it safer to encode the appID as part of the dir path to avoid collision?

I wavered on this point, but each time you call tempDirectory, it should give you a completely new directory:

* Create a temporary relative directory in the default temporary-file directory with the given prefix. * * @param prefix The prefix of the temporary directory, if null using "kafka-" as default prefix

So the prefix seems to be nice only for documenting which test a directory is for, not for enforcing any kind of test/directory uniqueness. I felt like it added more noise than value, so I just dropped all the prefixes.

Yeah I was thinking about debugging purposes as well. If it is more noise the usefulness, I'm fine with dropping it.

guozhangwang · 2020-04-29T01:12:31Z

nit: we can put kafkaStreams in a try block.

Unfortunately, we generally can't use try-with-resources for these tests, since that makes the kafkaStreams reference out of scope for the finally block. We'd have to allow a reference to kafkaStreams to escape the try {} block to reference it either in finally or in an After method, which is just as messy as it currently is, if not more.

guozhangwang · 2020-04-29T01:13:37Z

nit: ditto here, we can put driver in the try block. And ditto elsewhere.

vvcephei · 2020-04-29T03:29:19Z

Thanks for the review, @guozhangwang . I've addressed (or responded to) your comments.

guozhangwang · 2020-04-29T19:29:28Z

LGTM! Please feel free to merge after rebasing.

vvcephei · 2020-04-29T19:49:04Z

Thanks, @guozhangwang !

…/master` * apache-github/trunk: (45 commits) MINOR: Fix broken JMX link in docs by adding missing starting double quote (apache#8587) KAFKA-9652: Fix throttle metric in RequestChannel and request log due to KIP-219 (apache#8567) KAFKA-9922: Update demo instructions in examples README (apache#8559) KAFKA-9830: Implement AutoCloseable in ErrorReporter and subclasses (apache#8442) KAFKA-9875: Make integration tests more resilient (apache#8578) KAFKA-9932: Don't load configs from ZK when the log has already been loaded (apache#8582) KAFKA-9925: decorate pseudo-topics with app id (apache#8574) KAFKA-9832: fix attempt to commit non-running tasks (apache#8580) KAFKA-9127: don't create StreamThreads for global-only topology (apache#8540) MINOR: add support for kafka 2.4 and 2.5 to downgrade test KAFKA-9176: Retry on getting local stores from KafkaStreams (apache#8568) KAFKA-9823: Follow-up, check state for handling commit error response (apache#8548) KAFKA-6145: KIP-441: Add TaskAssignor class config (apache#8541) MINOR: Fix partition numbering from 0 to P-1 instead of P in docs (apache#8572) KAFKA-9921: disable caching on stores configured to retain duplicates (apache#8564) Minor: remove redundant check in auto preferred leader election (apache#8566) MINOR: Update the link to the Raft paper in docs (apache#8560) MINOR: Fix typos in config properties in MM2 test (apache#8561) MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters. (apache#7982) MINOR: document how to escape json parameters to ducktape tests (apache#8546) ...

There was a minor conflict in gradle.properties because the default Scala version changed upstream to Scala 2.13. I kept the upstream change. Related to this, I have updated Jenkinsfile to compile and validate with Scala 2.12 in a separate stage so that we ensure we maintain compatibility. Unlike Apache Kafka, we only run the tests with the default Scala version, which is now 2.13. * apache-github/trunk: (45 commits) MINOR: Fix broken JMX link in docs by adding missing starting double quote (apache#8587) KAFKA-9652: Fix throttle metric in RequestChannel and request log due to KIP-219 (apache#8567) KAFKA-9922: Update demo instructions in examples README (apache#8559) KAFKA-9830: Implement AutoCloseable in ErrorReporter and subclasses (apache#8442) KAFKA-9875: Make integration tests more resilient (apache#8578) KAFKA-9932: Don't load configs from ZK when the log has already been loaded (apache#8582) KAFKA-9925: decorate pseudo-topics with app id (apache#8574) KAFKA-9832: fix attempt to commit non-running tasks (apache#8580) KAFKA-9127: don't create StreamThreads for global-only topology (apache#8540) MINOR: add support for kafka 2.4 and 2.5 to downgrade test KAFKA-9176: Retry on getting local stores from KafkaStreams (apache#8568) KAFKA-9823: Follow-up, check state for handling commit error response (apache#8548) KAFKA-6145: KIP-441: Add TaskAssignor class config (apache#8541) MINOR: Fix partition numbering from 0 to P-1 instead of P in docs (apache#8572) KAFKA-9921: disable caching on stores configured to retain duplicates (apache#8564) Minor: remove redundant check in auto preferred leader election (apache#8566) MINOR: Update the link to the Raft paper in docs (apache#8560) MINOR: Fix typos in config properties in MM2 test (apache#8561) MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters. (apache#7982) MINOR: document how to escape json parameters to ducktape tests (apache#8546) ...

vvcephei commented Apr 28, 2020

View reviewed changes

guozhangwang reviewed Apr 29, 2020

View reviewed changes

KAFKA-9875: Make integration tests more resilient

1884686

vvcephei merged commit dc4d439 into apache:trunk Apr 29, 2020

vvcephei deleted the kafka-9875-integration-test-resilience branch April 29, 2020 22:11

Conversation

vvcephei commented Apr 28, 2020

Committer Checklist (excluded from commit message)

Uh oh!

vvcephei left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guozhangwang left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vvcephei commented Apr 29, 2020

Uh oh!

guozhangwang commented Apr 29, 2020

Uh oh!

vvcephei commented Apr 29, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants