MINOR: improve resilience of Streams test producers#6028

Merged: guozhangwang merged 1 commit into apache:trunk from vvcephei:fix-streams-broker-system-test on Jan 4, 2019

Conversation

@vvcephei
Contributor

Some Streams system tests have failed during the setup phase
due to the producer having retries disabled and getting some
transient error from the broker.

This patch adds a retries parameter to the VerifiableProducer
(default unchanged), and sets retries to 10 for Streams tests.

It also sets acks equal to the number of brokers for Streams tests.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Contributor Author

@vvcephei vvcephei left a comment

@guozhangwang @bbejeck , do you mind taking a look at this when you get the chance?

Contributor Author

The retries parameter is ignored if EOS is enabled.

Also, along with passing the retries config along to the producer, we set the delivery timeout high enough to accommodate the desired retry value.
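As a rough illustration of the two points in this comment, here is a minimal Python sketch of how a test harness might derive producer overrides. The helper name `producer_overrides`, its parameters, and the sizing rule for the delivery timeout are assumptions made for illustration only; they are not taken from the patch. Only the config keys (`retries`, `delivery.timeout.ms`) are real Kafka producer settings.

```python
def producer_overrides(retries, eos_enabled, request_timeout_ms=30_000,
                       retry_backoff_ms=100, linger_ms=0):
    """Return hypothetical producer config overrides for a test producer."""
    if eos_enabled:
        # Per the comment above, the retries parameter is ignored under EOS.
        return {}
    if retries is None:
        # No override requested: leave the producer defaults untouched.
        return {}
    # Assumed sizing rule (not from the patch): leave room for the first
    # attempt plus every retry, each costing at most one request timeout
    # plus one backoff interval.
    attempts = retries + 1
    delivery_timeout_ms = linger_ms + attempts * (request_timeout_ms + retry_backoff_ms)
    return {
        "retries": retries,
        "delivery.timeout.ms": delivery_timeout_ms,
    }


if __name__ == "__main__":
    print(producer_overrides(retries=10, eos_enabled=False))
    print(producer_overrides(retries=10, eos_enabled=True))
```

The point of the sketch is only that a retry count is of little use if the overall delivery timeout expires before the retries can be attempted, so the two values have to be chosen together.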

Contributor Author

Another measure to improve resilience: make sure that messages produced during test setup are acknowledged by all brokers, so that broker bounce operations never fail a test due to unreplicated data.

Contributor Author

Cleanup: TODO comments are not an effective project planning mechanism. If there's something that needs to be done, we should create a Jira. But really, it just sounds like a slightly different alternative implementation of the same test, which we may or may not want to do. Thus, I'm just deleting it.

Member

@bbejeck bbejeck left a comment

Thanks for this, @vvcephei. Looks good to me. Just two minor comments:

  1. Can we rerun the system test, but set the DUCTAPE_ARGS to something like --repeat 25?
  2. Since the VerifiableProducer is used across all projects, we should also kick off an all-project system test run just to be sure.

@mjsax mjsax added the streams label Dec 13, 2018
@vvcephei
Contributor Author

The "flaky test" run from above failed during setup while downloading Java from Oracle:

    worker6: Connecting to download.oracle.com (download.oracle.com)|23.44.163.29|:443... connected.
    worker6: HTTP request sent, awaiting response... 200 OK
    worker6: Length: 191753373 (183M) [application/x-gzip]
    worker6: Saving to: ‘jdk-8u191-linux-x64.tar.gz’
    worker6: 
    worker6:      0K ........ ........ ........ ........ ........ ........  1% 38.2M 5s
    worker6:   3072K ........ ........ ........ ........ ........ ........  3% 82.9M 3s
    worker6:   6144K ........ ........ ........ ........ ........ ........  4% 75.6M 3s
    worker6:   9216K ........ ........ ........ ........ ........ ........  6% 46.5M 3s
    worker6:  12288K ........ ........ ........ ........ ........ ........  8% 56.7M 3s
    worker6:  15360K ........ ........ ........ ........ ........ ........  9% 86.2M 3s
    worker6:  18432K ........ ........ ........ ........ ........ ........ 11% 74.5M 3s
    worker6:  21504K ........ ........ ........ ........ ........ ........ 13% 66.2M 3s
    worker6:  24576K ........ ........ ........ ........ ........ ........ 14% 81.0M 2s
    worker6:  27648K ........ ........ ........ ........ ........ ........ 16% 88.0M 2s
    worker6:  30720K ........ ........ ........ ........ ........ ........ 18% 78.8M 2s
    worker6:  33792K ........ ........ ........ ........ ........ ........ 19% 88.5M 2s
    worker6:  36864K ........ ........ ........ ........ ........ ........ 21% 69.8M 2s
    worker6:  39936K ........ ........ ........ ........ ........ ........ 22% 79.8M 2s
    worker6:  43008K ........ ........ ........ ........ ........ ........ 24% 42.6M 2s
    worker6:  46080K ........ ........ ........ ........ ........ ........ 26% 66.3M 2s
    worker6:  49152K ........ ........ ........ ........ ........ ........ 27% 52.8M 2s
    worker6:  52224K ........ ........ ........ ........ ........ ........ 29% 68.4M 2s
    worker6:  55296K ........ ........ ........ ........ ........ ........ 31% 76.4M 2s
    worker6:  58368K ........ ........ ........ ........ ........ ........ 32% 82.7M 2s
    worker6:  61440K ........ ........ ........ ........ ........ ........ 34% 33.5M 2s
    worker6:  64512K                                                       34%  265M=1.0s
    worker6: 
    worker6: 2018-12-12 22:33:22 (63.2 MB/s) - Read error at byte 66082158/191753373 (Connection reset by peer). download failed
    worker6: Oracle JDK 8 is NOT installed.
    worker6: dpkg: error processing package oracle-java8-installer (--configure):
    worker6:  subprocess installed post-installation script returned error exit status 1
    worker6: Errors were encountered while processing:
    worker6:  oracle-java8-installer
    worker6: E: Sub-process /usr/bin/dpkg returned an error code (1)
    worker6: + echo 'ERROR: JDK install failed'
    worker6: ERROR: JDK install failed
    worker6: + exit 1
    ==> worker6: An error occurred. The error will be shown after all tasks complete.

re-running as https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2133

@vvcephei
Contributor Author

I'll analyze the system test failures and report whether I think they are related.

@vvcephei
Contributor Author

The kafka.cluster.PartitionTest.testDelayedFetchAfterAppendRecords failure in https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/1191/ must be unrelated.

Retest this, please.

@vvcephei
Contributor Author

While analyzing the failed system tests, I found that my setting in BaseStreamsTest of acks==num_brokers breaks the (only, apparently) test that sets num_brokers > 1: StreamsStandbyTask.

This is because the VerifiableProducer class only allows the values 1, 0, and -1 (with -1 meaning "all"). This suits me fine, as I initially wanted to set it to "all". I'm updating the setting to -1 and re-running the Streams system tests.

After analysis, I do not believe the client or core system test failures above are caused by the changes to VerifiableProducer, so I won't repeat all tests, just the Streams tests x5.
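To make the failure mode concrete, here is a small Python sketch of the kind of validation described above. The names `ALLOWED_ACKS` and `validate_acks` are hypothetical and do not come from VerifiableProducer; the sketch only mirrors the stated rule that 1, 0, and -1 are the accepted values.

```python
# Values the comment above says are accepted; -1 means "all".
ALLOWED_ACKS = (0, 1, -1)


def validate_acks(acks):
    """Return acks unchanged if it is an accepted value, else raise."""
    if acks not in ALLOWED_ACKS:
        raise ValueError("acks must be one of %s, got %r" % (ALLOWED_ACKS, acks))
    return acks


if __name__ == "__main__":
    # With a single broker, acks == num_brokers happens to be a legal value.
    print(validate_acks(1))
    # With several brokers it is rejected, which is why only a multi-broker
    # test would trip over the original setting.
    try:
        validate_acks(3)
    except ValueError as err:
        print("rejected:", err)
    # Using -1 requests acknowledgement from all in-sync replicas regardless
    # of how many brokers the test starts.
    print(validate_acks(-1))
```

This also suggests why the problem went unnoticed in most Streams tests: with one broker, setting acks to the broker count yields 1, which passes validation.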

@guozhangwang
Contributor

@vvcephei I saw some of the system tests failed, are they relevant?

@vvcephei
Contributor Author

@guozhangwang Thanks for the reminder... I haven't looked yet. I'll let you know tomorrow.

@mjsax
Member

mjsax commented Dec 20, 2018

@vvcephei PR shows conflicts. Can you rebase? Thx.

@vvcephei
Contributor Author

vvcephei commented Jan 3, 2019

@guozhangwang
Contributor

All three of the x1 runs had 1 failure, but they all failed on different tests. I'll look more into it later.

Thanks @vvcephei

@vvcephei
Contributor Author

vvcephei commented Jan 3, 2019

The x5 runs all timed out at the configured 800-minute mark. (This means that instead of 5 runs, we got 2.9 runs.)

Apparently, when this happens, we don't get unique test results for the three runs, because the output for all three jobs lists the same url:

FWIW, in that run, all the tests passed (except the last, which was aborted after 16 seconds when the build timed out).

Also, I have analyzed the x1 runs above, and all three failures were unrelated. Two were Kafka startup timing out, and the third was "Never saw processing of AGGREGATED" in Never saw processing of AGGREGATED line 111 (timeout watching the processor logs).

@vvcephei
Contributor Author

vvcephei commented Jan 4, 2019

Ok, since the last runs took longer than the configured timeout, I updated the timeout policy to time out only if the tests stop logging, and kicked off 3 more Streams x5 runs.

They failed 1, 2, and 1 time respectively, as noted below:

For one thing, none of these seem to be related to the change I made. I think if anything, it would result in an overcount of processed messages, not a failure to reach some state.

There is one test, test_upgrade_optimized_topology that failed one out of 5 times in each build, which might be a problem worth investigating...

WDYT?

@bbejeck
Member

bbejeck commented Jan 4, 2019

> There is one test, test_upgrade_optimized_topology that failed one out of 5 times in each build, which might be a problem worth investigating...

The test_upgrade_optimized_topology was upgraded by #6063 with 25 consecutive runs on branch builder, but another look wouldn't hurt.

If those are the only failures then I would agree that merging this PR should be safe.

@ijuma
Member

ijuma commented Jan 4, 2019

Why are we setting retries? The point of KIP-91 was to move away from that approach.

@guozhangwang
Contributor

@ijuma during startup phases there are failure causes other than the ones KIP-91 tries to fix, e.g. not-enough-replicas. Without retries, the producer would drop the messages and cause the test to fail.
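The argument can be illustrated with a toy Python simulation. Nothing here uses the Kafka client: `TransientBrokerError`, `send_with_retries`, and `make_flaky_send` are hypothetical stand-ins, with the flaky sender modelling a broker that rejects the first few sends (for example, while replicas are still coming up).

```python
class TransientBrokerError(Exception):
    """Stand-in for a retriable broker error such as not-enough-replicas."""


def send_with_retries(send, record, retries):
    """Call send(record), retrying up to `retries` times on transient errors."""
    attempts = 0
    while True:
        try:
            return send(record)
        except TransientBrokerError:
            if attempts >= retries:
                # Retry budget exhausted: the record is lost to the caller.
                raise
            attempts += 1


def make_flaky_send(failures_before_success):
    """Build a fake send function that fails a fixed number of times first."""
    state = {"remaining": failures_before_success}

    def send(record):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientBrokerError("not enough replicas yet")
        return "acked:" + record

    return send


if __name__ == "__main__":
    # With a retry budget, the setup record survives three transient failures.
    print(send_with_retries(make_flaky_send(3), "setup-record", retries=10))
    # With retries disabled, the first transient failure drops the record.
    try:
        send_with_retries(make_flaky_send(3), "setup-record", retries=0)
    except TransientBrokerError as err:
        print("dropped:", err)
```

In a test setup phase, a dropped record of this kind means the expected input never arrives, so the test fails for reasons unrelated to what it is meant to verify.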

@guozhangwang guozhangwang merged commit ef9204d into apache:trunk Jan 4, 2019
@vvcephei vvcephei deleted the fix-streams-broker-system-test branch January 4, 2019 22:04
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
Some Streams system tests have failed during the setup phase
due to the producer having retries disabled and getting some
transient error from the broker.

This patch adds a retries parameter to the VerifiableProducer
(default unchanged), and sets retries to 10 for Streams tests.

It also sets acks equal to the number of brokers for Streams tests.

Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>