KAFKA-7940: Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (#6330)
rajinisivaram merged 3 commits into apache:trunk from stanislavkozlovski:KAFKA-7940-test-custom-quota-callback
Conversation
@stanislavkozlovski Thanks for the PR. I think it would be better to just remove that failing assertion, or assert that it is less than 100 (or some large enough value) as a sanity check.
@rajinisivaram I added a value that is still somewhat reasonable -
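The sanity-check shape discussed above (a generous upper bound rather than a tight one) can be sketched as follows. This is a minimal, self-contained illustration: the `withinSanityBound` helper, the cap of 100, and the observed value 23 are illustrative stand-ins, not the actual code that was merged.

```scala
object Main {
  // Sanity-check shape: a generous upper bound instead of a tight bound
  // that can race with asynchronous config-change notifications.
  def withinSanityBound(calls: Int, cap: Int = 100): Boolean = calls <= cap

  def main(args: Array[String]): Unit = {
    val quotaLimitCalls = 23 // e.g. a value observed after one test run
    assert(withinSanityBound(quotaLimitCalls), s"Too many quotaLimit calls $quotaLimitCalls")
    println("sanity check passed")
  }
}
```

The point of the loose bound is that it still catches runaway call counts (a real bug) while tolerating a few extra calls from in-flight notifications.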
rajinisivaram left a comment:
@stanislavkozlovski Thanks for the PR, LGTM. Will merge after builds complete.
This PR seems to address https://issues.apache.org/jira/browse/KAFKA-7940 -- the PR title should not have been MINOR, but should use the ticket number. If it fixed the issue, please link the PR to the ticket and update it (i.e., close the ticket). This should be cherry-picked to
Sorry about that, I had it in my fork's branch name but forgot to update the PR title.
Merged to 2.2 as well.
Thanks @stanislavkozlovski and @rajinisivaram!
AK/trunk: (36 commits)
- KAFKA-7962: Avoid NPE for StickyAssignor (apache#6308)
- Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (apache#6330)
- KAFKA-7918: Inline generic parameters Pt. II: RocksDB Bytes Store and Memory LRU Caches (apache#6327)
- MINOR: fix parameter naming (apache#6316)
- KAFKA-7956: In ShutdownableThread, immediately complete the shutdown if the thread has not been started (apache#6218)
- MINOR: Refactor replica log dir fetching for improved logging (apache#6313)
- [TRIVIAL] Remove unused StreamsGraphNode#repartitionRequired (apache#6227)
- MINOR: Increase produce timeout to 120 seconds (apache#6326)
- KAFKA-7918: Inline generic parameters Pt. I: in-memory key-value store (apache#6293)
- MINOR: Fix line break issue in upgrade notes (apache#6320)
- KAFKA-7972: Use automatic RPC generation in SaslHandshake
- MINOR: Enable capture of full stack trace in StreamTask#process (apache#6310)
- KAFKA-7938: Fix test flakiness in DeleteConsumerGroupsTest (apache#6312)
- KAFKA-7937: Fix Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsNotExistingGroup (apache#6311)
- MINOR: Update docs to say 2.2 (apache#6315)
- KAFKA-7672: force write checkpoint during StreamTask#suspend (apache#6115)
- KAFKA-7961: Ignore assignment for un-subscribed partitions (apache#6304)
- KAFKA-7672: Restoring tasks need to be closed upon task suspension (apache#6113)
- KAFKA-7864: validate partitions are 0-based (apache#6246)
- KAFKA-7492: Updated javadocs for aggregate and reduce methods returning null behavior. (apache#6285)
- ...
This test has been seen to fail with:
All of these failures are in three test jobs where the same tests are seen to fail:
https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/28/
https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/15
https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/14/
Reproduce Attempts
I tried to reproduce this flaky test locally by running `testCustomQuotaCallback` over and over again.

First off, I had the issue where `GroupedUserQuotaCallback` would accumulate mutations and always fail. I thought it would be better if we reset the variables on every `tearDown()` call in `CustomQuotaCallbackTest.scala`.

After changing that, I ran the test over and over again. I commented out the lines after line 107 (`assertTrue(s"Too many quotaLimit calls $quotaLimitCalls", quotaLimitCalls(ClientQuotaType.REQUEST).get <= serverCount)`), since any changes after that point are ruled out from having an impact.

I encountered a lot of red herrings while debugging, but I think I managed to pinpoint the cause: a race condition in ZooKeeper dynamic config change notifications.
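The reset-on-`tearDown()` idea mentioned above can be sketched as a minimal, self-contained simulation. Note that `GroupedCallbackState`, its `quotaLimitCalls` map, and `reset()` are illustrative stand-ins for the callback's shared mutable state, not the real `CustomQuotaCallbackTest` fixtures:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for shared mutable state that accumulates
// across test runs unless explicitly cleared.
object GroupedCallbackState {
  val quotaLimitCalls: Map[String, AtomicInteger] = Map(
    "REQUEST" -> new AtomicInteger,
    "PRODUCE" -> new AtomicInteger,
    "FETCH"   -> new AtomicInteger)

  // Reset every counter; the real test would call this from tearDown()
  // so one run's mutations cannot fail the next run's assertions.
  def reset(): Unit = quotaLimitCalls.values.foreach(_.set(0))
}

object Main {
  def main(args: Array[String]): Unit = {
    GroupedCallbackState.quotaLimitCalls("REQUEST").addAndGet(7)
    GroupedCallbackState.reset()
    println(GroupedCallbackState.quotaLimitCalls("REQUEST").get) // prints 0
  }
}
```

Without such a reset, counters carried over from a previous run make every subsequent run's upper-bound assertion fail regardless of the race being investigated.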
This is basically the trailing result of the `user.configureAndWaitForQuota` call. That test code verifies that the quotas are set on the metrics (`kafka/core/src/test/scala/integration/kafka/api/BaseQuotaTest.scala`, line 297 in `2627a1b`).
The problem is that there is a race condition with the code that updates said metrics: it subsequently calls `ClientQuotaManager#updateQuotaMetricConfigs()` (`kafka/core/src/main/scala/kafka/server/ClientQuotaManager.scala`, lines 449 and 496 in `2627a1b`), which in turn calls `quotaCallback.quotaLimit`, incrementing the `quotaLimitCalls` variable we check in the test. I found `UserConfigHandler#processConfigChanges()` to be the culprit.

Solution
We basically need a way to guarantee that `UserConfigHandler#processConfigChanges()` has exited. If the test code's `user.configureAndWaitForQuota()` method had a way to do that, it would be perfect. I, unfortunately, could not come up with a good way to address this.

I was thinking that we could correlate the expected `quotaLimitCalls`. I see they had three different kinds of values across tests (with `Thread.sleep` enabled). Part of those are populated from the test code's `user.configureAndWaitForQuota` (2 request `quota()` calls for each 1 produce and 1 fetch `quota()` call). I can't exactly tell the correlation, though.
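A toy model of the counting described above, under the stated (unverified) assumption that each configure step produces 2 REQUEST `quota()` calls and one each for produce and fetch; all names here are illustrative, not the real callback's:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

object Main {
  // Hypothetical tally of quota() invocations per quota type,
  // mirroring the role of the quotaLimitCalls map the test inspects.
  val quotaCalls: mutable.Map[String, AtomicInteger] = mutable.Map(
    "REQUEST" -> new AtomicInteger,
    "PRODUCE" -> new AtomicInteger,
    "FETCH"   -> new AtomicInteger)

  // Assumed per-configure counts from the observation above:
  // 2 REQUEST quota() calls for each 1 produce and 1 fetch quota() call.
  def configureAndWaitForQuota(): Unit = {
    quotaCalls("REQUEST").addAndGet(2)
    quotaCalls("PRODUCE").incrementAndGet()
    quotaCalls("FETCH").incrementAndGet()
  }

  def main(args: Array[String]): Unit = {
    (1 to 3).foreach(_ => configureAndWaitForQuota())
    println(quotaCalls("REQUEST").get) // prints 6
    println(quotaCalls("PRODUCE").get) // prints 3
  }
}
```

If the ratio held deterministically, the expected totals could be predicted from the number of configure calls; the difficulty described above is that the observed values did not correlate cleanly.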
I brainstormed for a bit but could not figure out a smarter way to guarantee that this test passes unless we add a `Thread.sleep()`. We could potentially add a `waitUntilTrue(quotaLimitCalls(ClientQuotaType.REQUEST) >= 18)` and then reset, but that seems even more hacky and error-prone.

Otherwise, we could simply not check `quotaLimitCalls(ClientQuotaType.REQUEST).get <= serverCount`, or allow for some leeway there.
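The `waitUntilTrue`-then-reset idea could look roughly like this. It is a sketch only: the polling helper is a hand-rolled stand-in for Kafka's `TestUtils.waitUntilTrue`, the counter is a plain `AtomicInteger` rather than the real `quotaLimitCalls` map, and the threshold 18 is the value mentioned above, not a derived constant:

```scala
import java.util.concurrent.atomic.AtomicInteger

object Main {
  // Hand-rolled stand-in for TestUtils.waitUntilTrue: poll until the
  // condition holds or the timeout elapses.
  def waitUntilTrue(condition: () => Boolean, msg: String, timeoutMs: Long = 15000L): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (!condition()) {
      if (System.currentTimeMillis() > deadline) throw new AssertionError(msg)
      Thread.sleep(10)
    }
  }

  def main(args: Array[String]): Unit = {
    val requestQuotaLimitCalls = new AtomicInteger(0)
    // Simulate config-change notifications trickling in on another thread.
    val notifier = new Thread(() =>
      (1 to 20).foreach { _ => requestQuotaLimitCalls.incrementAndGet(); Thread.sleep(5) })
    notifier.start()
    // Wait for the expected call count before resetting, instead of
    // asserting an upper bound while notifications may still be in flight.
    waitUntilTrue(() => requestQuotaLimitCalls.get >= 18, "quotaLimit was not called enough times")
    requestQuotaLimitCalls.set(0)
    notifier.join()
  }
}
```

As noted above, this trades the original race for a dependence on a magic expected count, which is why it felt more hacky and error-prone than simply loosening the bound.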