KAFKA-14345: Fix flakiness with more accurate bound in (Dynamic)ConnectionQuotaTest #12806
gharris1727 wants to merge 12 commits into apache:trunk
…ionQuotaTest Signed-off-by: Greg Harris <greg.harris@aiven.io>
|
Hey @gharris1727 |
Signed-off-by: Greg Harris <greg.harris@aiven.io>
|
@divijvaidya Thanks for pointing me to that PR with a similar flaky test. I tried applying this bound to the situation in that test and it didn't hold, and I managed to deterministically replicate the behavior in my unit test with a parameterized test. Here are the results, just for discussion's sake: I was testing the minInterval=0 case when I was developing this bound, but it appears that the theoretical bound doesn't apply in some cases. At this time I'm not sure yet whether the bound is wrong or the rate limiting is wrong. |
Signed-off-by: Greg Harris <greg.harris@aiven.io>
|
My worst-case bounds test had a typo that made the results invalid. I've fixed that and this is the current behavior: I also experimented with the fix from #12045 and it caused all of the test cases to pass. |
Signed-off-by: Greg Harris <greg.harris@aiven.io>
|
Okay, I've just loosened the bound to compensate for the variable-length windows. This means that the flaky tests should now be resolved, without changing the behavior of the window algorithm. |
Signed-off-by: Greg Harris <greg.harris@aiven.io>
|
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed. |
|
@divijvaidya Are you interested in reviewing this? This is a KIP-less alternative to solve the flakiness in this test. |
|
Hi @gharris1727 I am afraid that fixing only the tests hides the real problem/bug with the quotas implementation. If we merge this PR, yes, the tests will not be flaky, but aren't we then accepting the current behaviour of the quotas implementation as expected behaviour, i.e. that the deviation could be ± epsilon and that epsilon could exceed the threshold (max.connection.creation.rate) set for a windowing period? Having said that, the answer might be that the current behaviour is accepted behaviour. In that case, I would be comfortable with this change if it is accompanied by a change in the docs explaining the current expectations from the windowing algorithm, so that users at least know what to expect from the quota implementation. Alternatively, the fix, i.e. #12045, doesn't require a KIP (it doesn't change any public interfaces). Please feel free to pick up that PR (or duplicate it). I won't have time any time soon to work on it. I will be happy to provide a review. What do you think? |
|
Hey @divijvaidya .
It is not possible to eliminate the deviation visible to the outside observer, and these tests will always need to include an error term. Here's what the error bounds will be for the default window configuration and various observation times with the variable-width and fixed-width windowing algorithms: This PR implements the variable-width limits because the variable-width algorithm is currently on trunk, not because it is more accepted or correct. I'm personally ambivalent about which algorithm should be used. This PR replaces the unmotivated and incorrect hardcoded constants with computed bounds, and that is beneficial with or without the fixed-width algorithm.
If you don't have time or interest to work on that PR and/or KIP, then I think a test-only fix is appropriate until someone is interested in picking up the PR. The flaky tests have already made us aware of the odd behavior of the variable-width algorithm, so more ongoing failures aren't helpful.
I don't think that what this PR addresses needs to be explained in the documentation.
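To illustrate why an outside observer always needs an error term, here is a minimal sketch using a hypothetical fixed-window counter. It is not Kafka's quota implementation; the function names (`admit`, `max_count_in_interval`, `observer_bound`) and the numbers are assumptions chosen for illustration. The limiter never exceeds its per-window limit, yet an observer whose interval straddles a window boundary can count up to twice that limit.

```python
import math
from collections import Counter


def admit(arrivals_ms, limit_per_window, window_ms):
    """Return the arrival times accepted by a toy fixed-window counter.

    Hypothetical limiter: at most `limit_per_window` accepts in each
    aligned window [k * window_ms, (k + 1) * window_ms).
    """
    used = Counter()
    accepted = []
    for t in sorted(arrivals_ms):
        window = t // window_ms
        if used[window] < limit_per_window:
            used[window] += 1
            accepted.append(t)
    return accepted


def max_count_in_interval(timestamps_ms, observe_ms):
    """Largest number of timestamps in any half-open interval [s, s + observe_ms)."""
    ts = sorted(timestamps_ms)
    best = 0
    lo = 0
    for hi, t in enumerate(ts):
        # Drop timestamps that can no longer share an interval with ts[hi].
        while ts[lo] <= t - observe_ms:
            lo += 1
        best = max(best, hi - lo + 1)
    return best


def observer_bound(limit_per_window, window_ms, observe_ms):
    """An interval of observe_ms touches at most ceil(observe_ms / window_ms) + 1 windows."""
    return limit_per_window * (math.ceil(observe_ms / window_ms) + 1)


# A flood that arrives just before and just after a window boundary.
arrivals = [999] * 50 + [1000] * 50
accepted = admit(arrivals, limit_per_window=10, window_ms=1000)
worst = max_count_in_interval(accepted, observe_ms=1000)
bound = observer_bound(10, 1000, 1000)
print(len(accepted), worst, bound)  # prints "20 20 20"
```

The observer sees 20 connections in one second against a nominal limit of 10 per second, so a test that asserts on the observed rate has to allow for that slack regardless of which windowing algorithm sits behind it.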
|
|
Hi @divijvaidya As this is still causing flaky failures ~2% of the time, I'm still interested in getting this fix merged. Thanks! |
|
This PR is being marked as stale since it has not had any activity in 90 days. If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed. |
|
This PR has been closed since it has not had any activity in 120 days. If you feel like this |
Signed-off-by: Greg Harris <greg.harris@aiven.io>
The existing hard-coded error bound in the test is inaccurate, such that certain patterns
of request arrival cause this test to exceed the bound slightly. This has appeared in CI as
a flaky test run which just barely exceeds the listed bound.
Instead of a hard-coded error bound which can become inaccurate if the test's parameters change,
compute the worst-case error bound based on the knowledge of the underlying windowing algorithm.
Full derivation and explanation of the worst-case bound is provided in a comment in the util function.
The bound is also unit tested in a mocked-time environment which simulates the behavior of the full
SocketServer's throttle servo loop, simulating a connection flood with various delays.
Also fix some IDE warnings, incorrect/misleading comments, and an AdminClient leak.
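
The general shape of a mocked-time bound test can be sketched as follows. This is only an illustration under stated assumptions: `MockClock`, `FixedWindowLimiter`, `flood`, and `worst_case_bound` are hypothetical names, and the toy fixed-window limiter stands in for the real windowing algorithm, whose derivation lives in the util function's comment. The loop mimics a throttle servo: accept when allowed, otherwise advance the mock clock by the throttle time.

```python
import math


class MockClock:
    """Manually advanced clock, so the test never sleeps."""

    def __init__(self, start_ms=0):
        self.now_ms = start_ms

    def advance(self, ms):
        self.now_ms += ms


class FixedWindowLimiter:
    """Toy limiter (an assumption, not Kafka's): `limit` accepts per aligned window."""

    def __init__(self, clock, limit, window_ms):
        self.clock = clock
        self.limit = limit
        self.window_ms = window_ms
        self.window = None
        self.used = 0

    def throttle_ms(self):
        """Record an accept and return 0, or return the wait until the next window."""
        window = self.clock.now_ms // self.window_ms
        if window != self.window:
            self.window, self.used = window, 0
        if self.used < self.limit:
            self.used += 1
            return 0
        return (window + 1) * self.window_ms - self.clock.now_ms


def flood(limit, window_ms, duration_ms, delay_ms, start_ms=0):
    """Count connections accepted during a flood lasting duration_ms of mock time."""
    clock = MockClock(start_ms)
    limiter = FixedWindowLimiter(clock, limit, window_ms)
    end_ms = start_ms + duration_ms
    accepted = 0
    while clock.now_ms < end_ms:
        wait = limiter.throttle_ms()
        if wait == 0:
            accepted += 1
            clock.advance(delay_ms)  # per-connection delay being simulated
        else:
            clock.advance(wait)  # back off for the reported throttle time
    return accepted


def worst_case_bound(limit, window_ms, duration_ms):
    """Computed from the toy algorithm: the flood touches at most ceil(D / W) + 1 windows."""
    return limit * (math.ceil(duration_ms / window_ms) + 1)


# Check the computed bound across several delays and start offsets.
for delay_ms in (0, 1, 7, 250):
    for start_ms in (0, 500, 999):
        count = flood(10, 1000, 1000, delay_ms, start_ms)
        assert count <= worst_case_bound(10, 1000, 1000)
```

Because the bound is a function of the limiter's parameters, changing the window size or flood duration in the test changes the asserted limit with it, which is the property a hard-coded constant lacks.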