MINOR: improve resilience of Streams test producers#6028
MINOR: improve resilience of Streams test producers#6028guozhangwang merged 1 commit intoapache:trunkfrom vvcephei:fix-streams-broker-system-test
Conversation
vvcephei
left a comment
There was a problem hiding this comment.
@guozhangwang @bbejeck , do you mind taking a look at this when you get the chance?
There was a problem hiding this comment.
The retries parameter is ignored if EOS is enabled.
Also, along with passing the retries config along to the producer, we set the delivery timeout high enough to accomodate the desired retry value.
There was a problem hiding this comment.
Another measure to improve resilience: make sure that messages produced during test setup are acknowledged by all brokers so we can be sure that during broker bounce operations we'll never fail a test due to expectedly unreplicated data.
There was a problem hiding this comment.
Cleanup: TODO comments are not an effective project planning mechanism. If there's something that needs to be done, we should create a Jira. But really, it just sounds like a slightly different alternative implementation of the same test, which we may or may not want to do. Thus, I'm just deleting it.
bbejeck
left a comment
There was a problem hiding this comment.
Thanks for this @vvcephei looks good to me. Just two minor comments
- Can we rerun the system test, but set the
DUCTAPE_ARGSto something like--repeat 25? - Since the
verfiiableProduceris used across all projects, we should also kick off an all project system test run just to be sure.
|
@bbejeck Thanks for that suggestion.
|
|
The "flaky test" run from above failed during setup while downloading Java from Oracle: re-running as https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2133 |
|
I'll analyze the system test failures and report whether I think they are related. |
|
The Retest this, please. |
|
While analyzing the failed system tests, I found that my setting in BaseStreamsTest of This is because the VerifiableProducer class only allows the values 1, 0, and -1 (with -1 meaning "all"). this suits me fine, as I initially wanted to set it to "all". I'm updating the setting to -1 and re-running the streams system tests. After analysis, I do not believe the client or core system test failures above are caused by the changes to VerifiableProducer, so I'll no repeat all tests, just streams tests x5. |
|
Running streams system tests x5 x4 (total of 20 runs): |
|
@vvcephei I saw some of the system tests failed, are they relevant? |
|
@guozhangwang Thanks for the reminder... I haven't looked yet. I'll let you know tomorrow. |
|
@vvcephei PR shows conflicts. Can you rebase. Thx. |
|
Ok, this time, I did three single all-streams runs and three x5 runs, so I could get some results faster:
x5 runs are still pending. All three of the x1 runs had 1 failure, but they all failed on different tests. I'll look more into it later. |
Thanks @vvcephei |
|
The x5 runs all timed out at the configured 800-minute mark. (this means instead of 5 runs, we got 2.9 runs) Apparently, when this happens, we don't get unique test results for the three runs, because the output for all three jobs lists the same url: FWIW, in that run, all the tests passed (except the last, which was aborted after 16 seconds when the build timed out). Also, I have analyzed the x1 runs above, and all three failures were unrelated. Two were Kafka startup timing out, and the third was "Never saw processing of AGGREGATED" in |
|
Ok, since the last runs took longer than the configured timeout, I updated the timeout policy to timeout only if the tests stop logging and kicked off 3 more streams x5 runs. They failed 1, 2, and 1 time respectively, as noted below:
For one thing, none of these seem to be related to the change I made. I think if anything, it would result in an overcount of processed messages, not a failure to reach some state. There is one test, WDYT? |
The If those are the only failures then I would agree that merging this PR should be safe. |
|
Why are we setting retries? The point of KIP-91 was to move away from that approach. |
|
@ijuma during startup phases there are other reasons than the ones KIP-91 tries to fix, e.g. not-enough-replica, without retries producer would drop the messages and cause test to fail. |
Some Streams system tests have failed during the setup phase due to the producer having retries disabled and getting some transient error from the broker. This patch adds a retries parameter to the VerifiableProducer (default unchanged), and sets retries to 10 for Streams tests. It also sets acks equal to the number of brokers for Streams tests. Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Some Streams system tests have failed during the setup phase
due to the producer having retries disabled and getting some
transient error from the broker.
This patch adds a
retriesparameter to the VerifiableProducer(default unchanged), and sets
retriesto 10 for Streams tests.It also sets
acksequal to the number of brokers for Streams tests.Committer Checklist (excluded from commit message)