
KAFKA-14345: Fix flakiness with more accurate bound in (Dynamic)ConnectionQuotaTest #12806

Closed
gharris1727 wants to merge 12 commits into apache:trunk from gharris1727:kafka-14345-flakey-quotas

Conversation

@gharris1727
Contributor

@gharris1727 gharris1727 commented Nov 1, 2022

Signed-off-by: Greg Harris greg.harris@aiven.io

The existing hard-coded error bound in the test is inaccurate, such that certain patterns
of request arrival cause this test to exceed the bound slightly. This has appeared in CI as
a flaky test run that just barely exceeds the listed bound.

Instead of a hard-coded error bound which can become inaccurate if the test's parameters change,
compute the worst-case error bound based on the knowledge of the underlying windowing algorithm.
Full derivation and explanation of the worst-case bound is provided in a comment in the util function.
The bound is also unit tested in a mocked-time environment that simulates the full
SocketServer's throttle servo loop under a connection flood with various delays.

Also fix some IDE warnings, incorrect/misleading comments, and an AdminClient leak.
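For intuition, the mocked-time servo loop can be sketched roughly as below. This is an illustrative model only, not Kafka's code or the test's actual helper; `run_flood`, its parameters, and the purge/accept rules are hypothetical simplifications of a variable-width sampled rate:

```python
# Illustrative mocked-time model of a variable-width sampled-rate limiter.
# This is NOT Kafka's implementation; all names and rules here are
# hypothetical simplifications for discussion.

def run_flood(rate_limit, num_samples, window_ms, duration_ms,
              attempt_interval_ms=1):
    """Attempt one connection per attempt_interval_ms; return how many the
    limiter accepts over duration_ms."""
    samples = []  # (window_start_ms, count) pairs, oldest first
    accepted = 0
    for now in range(0, duration_ms, attempt_interval_ms):
        # Purge windows that have aged out of the sampling period.
        oldest_allowed = now - num_samples * window_ms
        samples = [(s, c) for (s, c) in samples if s + window_ms > oldest_allowed]
        # Open the current window if needed.
        window_start = now - now % window_ms
        if not samples or samples[-1][0] != window_start:
            samples.append((window_start, 0))
        total = sum(c for _, c in samples)
        # Variable-width elapsed time: from the start of the oldest retained
        # window until now, never less than one full window.
        elapsed_s = max(now - samples[0][0], window_ms) / 1000.0
        if (total + 1) / elapsed_s <= rate_limit:
            samples[-1] = (samples[-1][0], samples[-1][1] + 1)
            accepted += 1
    return accepted
```

With a 5 conn/sec limit, two 1-second samples, and a 10-second flood, the accepted count lands near (but not exactly at) the nominal 50, which is why the test needs a derived error bound rather than a hard-coded one.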

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@gharris1727
Contributor Author

cc @apovzner @splett2 for awareness of this bound change.

@divijvaidya
Member

Hey @gharris1727
Since you are familiar with the windowing algorithm for throttling, please take a look at #12045 too; you will find similarities between it and this PR. I will also review this PR tomorrow.

@gharris1727
Contributor Author

gharris1727 commented Nov 1, 2022

@divijvaidya Thanks for pointing me to that PR with a similar flaky test. I tried applying this bound to the situation in that test and it didn't hold, and I managed to deterministically replicate the behavior in my unit test with a parameterized test. Here are the results, just for discussion's sake:

[1]  numSamples=1,  minInterval=0
[2]  numSamples=1,  minInterval=1
[3]  numSamples=1,  minInterval=2
[4]  numSamples=1,  minInterval=10
[5]  numSamples=1,  minInterval=20
[6]  numSamples=1,  minInterval=30
[7]  numSamples=1,  minInterval=39
[8]  numSamples=1,  minInterval=40
[9]  numSamples=2,  minInterval=0  PASS
[10] numSamples=2,  minInterval=1
[11] numSamples=2,  minInterval=2
[12] numSamples=2,  minInterval=10
[13] numSamples=2,  minInterval=20
[14] numSamples=2,  minInterval=30
[15] numSamples=2,  minInterval=39
[16] numSamples=2,  minInterval=40 PASS
[17] numSamples=10, minInterval=0  PASS
[18] numSamples=10, minInterval=1  PASS
[19] numSamples=10, minInterval=2
[20] numSamples=10, minInterval=10
[21] numSamples=10, minInterval=20
[22] numSamples=10, minInterval=30
[23] numSamples=10, minInterval=39
[24] numSamples=10, minInterval=40 PASS

I was testing the minInterval=0 case when I was developing this bound, but it appears that the theoretical bound doesn't apply in some cases. I'm not yet sure whether the bound is wrong or the rate limiting is wrong.
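For discussion's sake, the shape of that parameterized sweep is easy to reproduce outside the Kafka harness. In the hypothetical sketch below, `check_bound` stands in for "run the mocked-time flood and verify the computed bound"; the grid and names are illustrative, not the test's real code:

```python
import itertools

# Hypothetical parameter grid mirroring the results table above.
NUM_SAMPLES = [1, 2, 10]
MIN_INTERVALS_MS = [0, 1, 2, 10, 20, 30, 39, 40]

def sweep(check_bound):
    """Run check_bound(num_samples, min_interval_ms) for every combination
    in the grid and return the combinations that violate the bound."""
    return [(n, i)
            for n, i in itertools.product(NUM_SAMPLES, MIN_INTERVALS_MS)
            if not check_bound(n, i)]
```

A sweep like this makes it cheap to see exactly which (numSamples, minInterval) regions violate a candidate bound.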

@gharris1727
Contributor Author

My worst-case bounds test had a typo that made the results invalid. I've fixed that and this is the current behavior:

PASS numSamples = 1
PASS numSamples = 10
PASS minInterval = 0,1
PASS minInterval = 20,30,40
FAIL numSamples = 2 && 2 <= minInterval <= 19

I also experimented with the fix from #12045 and it caused all of the test cases to pass.

@gharris1727
Contributor Author

Okay, I've just loosened the bound to compensate for the variable-length windows. This means that the flaky tests should now be resolved, without changing the behavior of the windowing algorithm.
I believe that this is ready for review @divijvaidya @apovzner @splett2. Thanks!

@gharris1727 gharris1727 changed the title KAFKA-14345: Fix flakiness with more accurate bound in DynamicConnectionQuotaTest KAFKA-14345: Fix flakiness with more accurate bound in (Dynamic)ConnectionQuotaTest Nov 4, 2022
@jlprat jlprat added the tests Test fixes (including flaky tests) label Mar 29, 2023
@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale Stale PRs label Jul 15, 2023
@gharris1727
Contributor Author

@divijvaidya Are you interested in reviewing this? This is a KIP-less alternative to solve the flakiness in this test.

@github-actions github-actions Bot removed the stale Stale PRs label Aug 2, 2023
@divijvaidya
Member

Hi @gharris1727

I am afraid that fixing only the tests hides the real problem/bug in the quotas implementation. If we merge this PR, the tests will no longer be flaky, but aren't we then accepting the current behaviour of the quotas implementation as expected behaviour, i.e. that the deviation could be ± epsilon, where epsilon could exceed the threshold (max.connection.creation.rate) set for a windowing period?

Having said that, the answer might be that the current behaviour is the accepted behaviour. In that case, I would be comfortable with this change if it is accompanied by a change in the docs explaining the current expectations of the windowing algorithm, so that users at least know what to expect from the quota implementation.

Alternatively, the fix, i.e. #12045, doesn't require a KIP (it doesn't change any public interfaces). Please feel free to pick up that PR (or duplicate it). I won't have time any time soon to work on it, but I will be happy to provide a review.

What do you think?

@gharris1727
Contributor Author

Hey @divijvaidya .

aren't we then accepting the current behaviour of the quotas implementation as expected behaviour, i.e. that the deviation could be ± epsilon, where epsilon could exceed the threshold (max.connection.creation.rate) set for a windowing period?

It is not possible to eliminate the deviation visible to the outside observer, and these tests will always need to include an error term. Here's what the error bounds will be for the default window configuration and various observation times with the variable-width and fixed-width windowing algorithms:

millis  variable-width      fixed-width
11001   2.4444444444444446  2.2
20001   1.736842105263158   2.2
22001   1.736842105263158   1.5714285714285714
30001   1.5172413793103448  1.5714285714285714
33001   1.5172413793103448  1.375
40001   1.4102564102564104  1.375
44001   1.4102564102564104  1.2790697674418605
50001   1.346938775510204   1.2790697674418605
55001   1.346938775510204   1.2222222222222223
60001   1.305084745762712   1.2222222222222223
66001   1.305084745762712   1.1846153846153846
70001   1.2753623188405796  1.1846153846153846
77001   1.2753623188405796  1.1578947368421053
80001   1.2531645569620253  1.1578947368421053
88001   1.2531645569620253  1.1379310344827587
90001   1.2359550561797752  1.1379310344827587
99001   1.2359550561797752  1.1224489795918366

This PR implements the variable-width limits because the variable-width algorithm is what is currently on trunk, not because it is more accepted or correct. I'm personally ambivalent about which algorithm should be used. This PR replaces the unmotivated and incorrect hardcoded constants with computed bounds, and that is beneficial with or without the fixed-width algorithm.
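To make the two columns concrete, here is a hedged, self-contained illustration (not Kafka's implementation; the function and its parameters are hypothetical) of how each windowing style turns the same burst into a measured rate:

```python
# Illustrative comparison of the two windowing styles. NOT Kafka's code;
# names and parameters here are hypothetical.

def measured_rate(total_events, elapsed_since_oldest_ms, num_samples,
                  window_ms, fixed_width):
    if fixed_width:
        # Fixed-width: always divide by the full configured sampling period.
        divisor_ms = num_samples * window_ms
    else:
        # Variable-width: divide by the time spanned by the retained samples,
        # which can shrink to little more than a single window after old
        # samples are purged -- inflating the apparent rate.
        divisor_ms = max(elapsed_since_oldest_ms, window_ms)
    return total_events * 1000.0 / divisor_ms
```

For example, a burst of 10 events observed 1.1 s after the oldest retained sample, with two 1-second windows, measures as roughly 9.1/s under the variable-width rule but 5.0/s under the fixed-width rule, which is the gap the table above quantifies.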

Alternatively, the fix, i.e. #12045, doesn't require a KIP (it doesn't change any public interfaces). Please feel free to pick up that PR (or duplicate it). I won't have time any time soon to work on it, but I will be happy to provide a review.

If you don't have the time or interest to work on that PR and/or KIP, then I think a test-only fix is appropriate until someone is interested in picking it up. The flaky tests have already made us aware of the odd behavior of the variable-width algorithm, so further ongoing failures aren't helpful.

Having said that, the answer might be that the current behaviour is the accepted behaviour. In that case, I would be comfortable with this change if it is accompanied by a change in the docs explaining the current expectations of the windowing algorithm, so that users at least know what to expect from the quota implementation.

I don't think that what this PR addresses needs to be explained in the documentation.

  • The existing quotas documentation is very general, and focuses on the strategic usage of quotas. This extremely specific error bound would be out of place among the other documentation.
  • How users choose their rate limit depends on so many factors more important than this error term (such as network environment, hardware, etc.) that mentioning this error factor specifically would, I think, mislead users into thinking it is more important than it is.
  • Users will already be compensating for this error term when setting their limits. If someone observes that the effective rate limit (either for bursts or steady-state flows) is too high and affects their cluster health, they will lower their configured rate limit to compensate. They may be blaming the wrong thing (their hardware vs the rate limit algorithm) but the effect is still the same.

@gharris1727
Contributor Author

Hi @divijvaidya As this is still causing flaky failures ~2% of the time, I'm still interested in getting this fix merged. Thanks!

@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale Stale PRs label Dec 27, 2024
@github-actions

This PR has been closed since it has not had any activity in 120 days. If you feel like this
was a mistake, or you would like to continue working on it, please feel free to re-open the
PR and ask for a review.

@github-actions github-actions Bot added the closed-stale PRs that were closed due to inactivity label Jan 27, 2025
@github-actions github-actions Bot closed this Jan 27, 2025