HOTFIX: fix failed case RemoteLogManagerTest#testStopPartitionsWithDeletion #18474
…letion Signed-off-by: PoAn Yang <payang@apache.org>
34ae90a to
12d9936
TaiJuWu
left a comment
This can reproduce on my local. Thanks for the fix!
Yes, you're correct, but I used the following script to identify the commit that caused this test to fail, and it points to this specific commit. If I am wrong, please correct me. The weird thing is that trunk is green after this PR got merged ...
Ok, that's weird if it's failing consistently. Also, I have another PR I was working on 30 minutes ago and it didn't show this failure either. I'll take a closer look when I'm near a computer, but that won't be for a few hours.
I totally agree this is weird, but I still hope you can help check since you are the author 🙇
The failed case
@TaiJuWu I don't think your approach to identifying the problematic commit is valid because this test is clearly flaky and not failing every time. You can see the following trunk run where it passed: https://github.com/apache/kafka/actions/runs/12708697404/job/35427267406 I ran it locally multiple times and it also passed. So, why do we think this is a recent problem versus a test that has been flaky?
Ah, interesting, I can reproduce this if I run only
ijuma
left a comment
Thanks for the fix. It makes sense that we should return a new iterator on each invocation instead of always returning the same iterator. It looks like this test bug has existed for a while.
A few things that are unclear:
- Why does it fail consistently when invoked in isolation (locally) but not when the class test suite is executed (locally or remotely)?
- Why did it start failing more recently - the relevant commit didn't change this test in a meaningful way.
I'll go ahead and merge in the meantime, however.
It must be related to the race condition mentioned in the PR description. The failing case must be due to the winner of the race changing, i.e. this was always possible but is somehow more likely under certain scenarios.
…18474)
The test has become flakier recently and it's easy to reproduce by running the single test (vs running the class test suite).
The root cause is that the following functions both call `RemoteLogMetadataManager#listRemoteLogSegments`, which returns an iterator. If one function goes through the iterator first, the other can't get the expected result. I changed `thenReturn` to `thenAnswer` to avoid the issue.
The race is between:
* RLMExpirationTask#cleanupExpiredRemoteLogSegments
* RemoteLogManager#deleteRemoteLogPartition
Reviewers: Ismael Juma <ismael@juma.me.uk>
Signed-off-by: PoAn Yang <payang@apache.org>
@ijuma Thanks for your explanation. I also think this is flaky at the moment.
…emove-metadata-version-methods-for-versions-older-than-3.0
* apache-github/trunk:
  KAFKA-18340: Change Dockerfile to use log4j2 yaml instead log4j properties (apache#18378)
  MINOR: fix flaky RemoteLogManagerTest#testStopPartitionsWithDeletion (apache#18474)
  KAFKA-18311: Enforcing copartitioned topics (4/N) (apache#18397)
  KAFKA-18308; Update CoordinatorSerde (apache#18455)
  KAFKA-18440: Convert AuthorizationException to fatal error in AdminClient (apache#18435)
  KAFKA-17671: Create better documentation for transactions (apache#17454)
  KAFKA-18304; Introduce json converter generator (apache#18458)
  MINOR: Clean up classic group tests (apache#18473)
  KAFKA-18399 Remove ZooKeeper from KafkaApis (2/N): CONTROLLED_SHUTDOWN and ENVELOPE (apache#18422)
  MINOR: improve StreamThread periodic processing log (apache#18430)
Thanks for bringing it to my attention.

Not sure which commit makes the case fail. The CI result of my PR fails with RemoteLogManagerTest#testStopPartitionsWithDeletion. The trunk branch can also reproduce this.
The root cause is that the following functions both call RemoteLogMetadataManager#listRemoteLogSegments, which returns an iterator. If one function goes through the iterator first, the other can't get the expected result. I changed thenReturn to thenAnswer to avoid the issue.
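To make the shared-iterator problem concrete, here is a minimal plain-Java sketch. It does not use Mockito or any Kafka class; `IteratorStubSketch`, `sharedStub`, `freshStub`, and the segment names are hypothetical stand-ins. `sharedStub` roughly models a stub configured with `thenReturn(list.iterator())`, where one iterator is created up front and handed to every caller, and `freshStub` roughly models `thenAnswer(invocation -> list.iterator())`, where each call gets its own iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class IteratorStubSketch {
    // Models thenReturn(items.iterator()): a single iterator shared by all callers.
    static Supplier<Iterator<String>> sharedStub(List<String> items) {
        Iterator<String> shared = items.iterator();
        return () -> shared;
    }

    // Models thenAnswer(inv -> items.iterator()): a new iterator per call.
    static Supplier<Iterator<String>> freshStub(List<String> items) {
        return items::iterator;
    }

    // Consumes an iterator and returns how many elements it yielded.
    static int drain(Iterator<String> it) {
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    // Calls the stub twice, as two independent callers would,
    // and records how many elements each caller saw.
    static List<Integer> callTwice(Supplier<Iterator<String>> stub) {
        List<Integer> seen = new ArrayList<>();
        seen.add(drain(stub.get()));
        seen.add(drain(stub.get()));
        return seen;
    }

    public static void main(String[] args) {
        List<String> segments = List.of("segment-0", "segment-1");
        // The second caller finds the shared iterator already exhausted.
        System.out.println("shared: " + callTwice(sharedStub(segments))); // shared: [2, 0]
        // With a fresh iterator per call, both callers see every element.
        System.out.println("fresh:  " + callTwice(freshStub(segments)));  // fresh:  [2, 2]
    }
}
```

In the test, the two callers run concurrently, so which one drains the shared iterator first depends on timing. That would be consistent with the failure showing up only in some runs.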