Skip to content

KAFKA-9875: Make integration tests more resilient#8578

Merged
vvcephei merged 1 commit intoapache:trunkfrom
vvcephei:kafka-9875-integration-test-resilience
Apr 29, 2020
Merged

KAFKA-9875: Make integration tests more resilient#8578
vvcephei merged 1 commit intoapache:trunkfrom
vvcephei:kafka-9875-integration-test-resilience

Conversation

@vvcephei
Copy link
Copy Markdown
Contributor

The ticket is for a flaky test that failed to clean up topics after the test, which
isn't strictly necessary for test success.

  • alter the "clean up after test" method to never throw an exception
    (after verifying it's always the last invocation inside a finally block,
    so it won't break any test semantics)
  • consolidated the naming of all integration tests' app ids, topics, etc., by introducing
    a new test utility to generate safe, unique, descriptive names.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Contributor Author

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @guozhangwang , do you mind taking a look at this, since it overlaps with some of your recent contributions?

Comment on lines 105 to 106
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've standardized all the usages to be just "app", followed by the generated name, since the generated name contains the same information that we previously hand-wrote into the prefix or suffix. All we really need to do is ensure that the app id won't collide with a group name that we might use in a verification consumer, for example. For that reason, I've never used the generated name "plain", but always scoped it to the usage (app id, group id, input topic, etc.).

It's not super important to apply these ideas universally, but I felt it would make it easier to write more tests like it in the future if I just made a full pass on all the tests to make them all look the same.

Comment on lines 149 to 171
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really the fix for KAFKA-9875. The other change just hopefully reduces the probability that ignoring the exceptions could cause subsequent failures (e.g., if the topics don't get deleted before the next test, at least the next one will have different topic names).

I've verified that all usages of this method are ok to ignore potential exceptions. Namely, as long as the test logic itself doesn't want to ensure that any topics got deleted, and as long as this method is the last line in the method, then it should be fine just to ignore failures here.

I also considered just deleting the method, but if it does succeed, then it leaves less garbage around for subsequent tests, so it feels better to at least attempt a cleanup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req: Actually deleting topics after test is critical for some tests: I've encountered some cases where the same topics are reused mistakenly across different test cases within the single class. But I feel that it is better to put the topic deletion in the @after function while leaving cleanUp() as part of the test function itself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share your concern, but I'm not sure about the conclusion.

Yes, if there is state (such as a topic) that leaks from one test to the next, it can certainly cause difficult-to-debug failures. However, there are multiple things we can do to prevent/mitigate it:

  • delete state after tests (not to leave any garbage behind)
  • delete state before the tests (to ensure a clean slate for the test)
  • choose unique names for all resources of each test (this is where the other part of this PR comes in)

Any one of these should be sufficient to prevent state from leaking in between tests, and most of these tests do all three. In other words, we have 3x redundancy guarding against such test pollution. If you look at all three of these measures, the clean up after tests is actually the most optional, since tests can't tolerate failures in the clean up before (because it also creates necessary topics), and choosing unique topic names per test is bulletproof and easy to fix (once we know what the problem is).

Whether the cleanup is part of the test or in the @After method, the outcome is the same, if the method throws an exception, the test will fail. The downside of After is that it requires you to store the topic names in mutable class-level fields, which actually makes it more awkward to choose unique names per test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I think this is convincing :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew! :)

Comment on lines 109 to 111
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fixing the formatting.

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement.

I made a pass; except the last req comment I do not feel strong about other nits, so feel free to ignore.

Comment on lines 149 to 171
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req: Actually deleting topics after test is critical for some tests: I've encountered some cases where the same topics are reused mistakenly across different test cases within the single class. But I feel that it is better to put the topic deletion in the @after function while leaving cleanUp() as part of the test function itself.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safer to encode the appID as part of the dir path to avoid collision?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wavered on this point, but each time you call tempDirectory, it should give you a completely new directory:

     * Create a temporary relative directory in the default temporary-file directory with the given prefix.
     *
     * @param prefix The prefix of the temporary directory, if null using "kafka-" as default prefix

So the prefix seems to be nice only for documenting which test a directory is for, not for enforcing any kind of test/directory uniqueness. I felt like it added more noise than value, so I just dropped all the prefixes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking about debugging purposes as well. If it is more noise the usefulness, I'm fine with dropping it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can put kafkaStreams in a try block.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we generally can't use try-with-resources for these tests, since that makes the kafkaStreams reference out of scope for the finally block. We'd have to allow a reference to kafkaStreams to escape the try {} block to reference it either in finally or in an After method, which is just as messy as it currently is, if not more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ditto here, we can put driver in the try block. And ditto elsewhere.

@vvcephei
Copy link
Copy Markdown
Contributor Author

Thanks for the review, @guozhangwang . I've addressed (or responded to) your comments.

@guozhangwang
Copy link
Copy Markdown
Contributor

LGTM! Please feel free to merge after rebasing.

@vvcephei
Copy link
Copy Markdown
Contributor Author

Thanks, @guozhangwang !

@vvcephei vvcephei merged commit dc4d439 into apache:trunk Apr 29, 2020
@vvcephei vvcephei deleted the kafka-9875-integration-test-resilience branch April 29, 2020 22:11
ijuma added a commit to confluentinc/kafka that referenced this pull request Apr 30, 2020
…/master`

* apache-github/trunk: (45 commits)
  MINOR: Fix broken JMX link in docs by adding missing starting double quote (apache#8587)
  KAFKA-9652: Fix throttle metric in RequestChannel and request log due to KIP-219 (apache#8567)
  KAFKA-9922: Update demo instructions in examples README (apache#8559)
  KAFKA-9830: Implement AutoCloseable in ErrorReporter and subclasses (apache#8442)
  KAFKA-9875: Make integration tests more resilient (apache#8578)
  KAFKA-9932: Don't load configs from ZK when the log has already been loaded (apache#8582)
  KAFKA-9925: decorate pseudo-topics with app id (apache#8574)
  KAFKA-9832: fix attempt to commit non-running tasks (apache#8580)
  KAFKA-9127: don't create StreamThreads for global-only topology (apache#8540)
  MINOR: add support for kafka 2.4 and 2.5 to downgrade test
  KAFKA-9176: Retry on getting local stores from KafkaStreams (apache#8568)
  KAFKA-9823: Follow-up, check state for handling commit error response (apache#8548)
  KAFKA-6145: KIP-441: Add TaskAssignor class config (apache#8541)
  MINOR: Fix partition numbering from 0 to P-1 instead of P in docs (apache#8572)
  KAFKA-9921: disable caching on stores configured to retain duplicates (apache#8564)
  Minor: remove redundant check in auto preferred leader election (apache#8566)
  MINOR: Update the link to the Raft paper in docs (apache#8560)
  MINOR: Fix typos in config properties in MM2 test (apache#8561)
  MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters. (apache#7982)
  MINOR: document how to escape json parameters to ducktape tests (apache#8546)
  ...
ijuma added a commit to confluentinc/kafka that referenced this pull request Apr 30, 2020
There was a minor conflict in gradle.properties because the default
Scala version changed upstream to Scala 2.13. I kept the upstream
change.

Related to this, I have updated Jenkinsfile to compile and validate
with Scala 2.12 in a separate stage so that we ensure we maintain
compatibility. Unlike Apache Kafka, we only run the tests with the
default Scala version, which is now 2.13.

* apache-github/trunk: (45 commits)
MINOR: Fix broken JMX link in docs by adding missing starting double
quote (apache#8587)
KAFKA-9652: Fix throttle metric in RequestChannel and request log due
to KIP-219 (apache#8567)
  KAFKA-9922: Update demo instructions in examples README (apache#8559)
KAFKA-9830: Implement AutoCloseable in ErrorReporter and subclasses
(apache#8442)
  KAFKA-9875: Make integration tests more resilient (apache#8578)
KAFKA-9932: Don't load configs from ZK when the log has already been
loaded (apache#8582)
  KAFKA-9925: decorate pseudo-topics with app id (apache#8574)
  KAFKA-9832: fix attempt to commit non-running tasks (apache#8580)
KAFKA-9127: don't create StreamThreads for global-only topology
(apache#8540)
  MINOR: add support for kafka 2.4 and 2.5 to downgrade test
  KAFKA-9176: Retry on getting local stores from KafkaStreams (apache#8568)
KAFKA-9823: Follow-up, check state for handling commit error response
(apache#8548)
  KAFKA-6145: KIP-441: Add TaskAssignor class config (apache#8541)
MINOR: Fix partition numbering from 0 to P-1 instead of P in docs
(apache#8572)
KAFKA-9921: disable caching on stores configured to retain duplicates
(apache#8564)
Minor: remove redundant check in auto preferred leader election
(apache#8566)
  MINOR: Update the link to the Raft paper in docs (apache#8560)
  MINOR: Fix typos in config properties in MM2 test (apache#8561)
MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters.
(apache#7982)
MINOR: document how to escape json parameters to ducktape tests
(apache#8546)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants