
KAFKA-12384: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch#10389

Merged
chia7712 merged 8 commits into apache:trunk from chia7712:MINOR-10389
Apr 7, 2021

Conversation

@chia7712
Member

@chia7712 chia7712 commented Mar 23, 2021

issue: https://issues.apache.org/jira/browse/KAFKA-12384
The root cause is that we don't wait for the new leader to sync the high watermark with the followers, so sending a request to get offsets can encounter an OFFSET_NOT_AVAILABLE error.
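The general shape of such a fix is to poll until the leader has caught up instead of asserting immediately, in the style of Kafka's `TestUtils.waitUntilTrue`. A minimal Java sketch of that pattern (class and method names are illustrative, not the actual patch):

```java
import java.util.function.BooleanSupplier;

public final class WaitUtil {
    /**
     * Polls {@code condition} every {@code pollIntervalMs} until it returns
     * true or {@code timeoutMs} elapses. Returns whether it became true.
     */
    public static boolean waitUntilTrue(BooleanSupplier condition,
                                        long timeoutMs,
                                        long pollIntervalMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) return true;
            if (System.currentTimeMillis() >= deadline) return false;
            Thread.sleep(pollIntervalMs);
        }
    }
}
```

In the test this would guard the offset lookup, e.g. waiting until the new leader's high watermark reaches the expected end offset before sending the ListOffsets request.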

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@chia7712 chia7712 requested a review from dajac March 23, 2021 17:20
Member

@dengziming dengziming left a comment

This is nice. Should we trigger the Jenkins build multiple times to verify that the flaky test is fixed?

Comment thread core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala Outdated
@chia7712
Member Author

@dengziming thanks for your review!

Should we trigger the Jenkins build multiple times to verify that the flaky test is fixed?

Sure. I also looped ListOffsetsRequestTest 300 times locally; all passed.

@dajac
Member

dajac commented Mar 24, 2021

@chia7712 Have you been able to reproduce the flaky test locally? I tried a few times and I never could...

@chia7712
Member Author

chia7712 commented Mar 24, 2021

Have you been able to reproduce the flaky test locally? I tried a few times and I never could...

Yep. It requires a "slow" machine to make the sync between leader and followers slow. I opened 8 containers to loop that test locally, and that reproduces the error reliably.

(Screenshot: 2021-03-24, 3:41 PM)

@dajac
Member

dajac commented Mar 24, 2021

@chia7712 awesome!

@chia7712
Member Author

Merged trunk to trigger QA again.

@chia7712
Member Author

Unrelated failure. Will merge trunk to trigger QA again.

val partitionToLeader = TestUtils.createTopic(zkClient, topic, numPartitions = 1, replicationFactor = 3, servers)
val topicConfig = new Properties
// make sure we won't lose data when force-removing leader
topicConfig.setProperty(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")
Member

Is this required? Even with min isr 1, we still require all in sync replicas to ack for acks=all. How many brokers do we have in the test?

Member Author

IIRC, acks=all means "all in-sync replicas" (rather than "all replicas") have to receive the record. In other words, a produce request can complete after only one replica has received the record if min ISR is one. When we shut down the server with min ISR=1, the successfully acked record may NOT have been synced to the other brokers (there are 3 brokers in total). In short, the data could be lost after we shut down the current leader.
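The semantics being debated here can be stated mechanically. A simplified Java sketch of the model (illustrative helper names, not Kafka API; it ignores ISR membership changes between ack and shutdown): an acks=all write is acknowledged once every current ISR member has it, and the broker only accepts the write when the ISR is at least `min.insync.replicas` large.

```java
// Simplified model of acks=all durability, for illustration only.
public final class IsrSafety {
    /**
     * An acks=all-acked record lives on every replica that was in the ISR
     * at ack time. It survives losing the leader alone iff at least one
     * non-leader replica was in sync then.
     */
    public static boolean ackedRecordSurvivesLeaderLoss(int isrSizeAtAck) {
        return isrSizeAtAck >= 2;
    }

    /** The broker rejects acks=all writes when the ISR is below min ISR. */
    public static boolean writeAccepted(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }
}
```

So with replication factor 3 and min ISR 1, a write acked while the ISR had shrunk to just the leader could be lost on leader shutdown; raising min ISR to 2 makes the broker reject writes in that state instead.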

Member

Yeah, but if you have 3 brokers and you shutdown one of them, you should still have 2 brokers in the ISR. Are you saying that we are shrinking the ISR to 1 in this test? That would be unexpected.

Member Author

Are you saying that we are shrinking the ISR to 1 in this test? That would be unexpected.

No, I did not observe such an incident. Maybe I misread the log earlier, and my assumption does not actually happen (I looped it 1000 times). Will revert this change.

@chia7712
Member Author

@ijuma any suggestions? This test is still flaky :(

Member

@showuon showuon left a comment

@chia7712 , thanks for the fix. LGTM. Please also update the status in KAFKA-12384. Thank you.

@chia7712
Member Author

Please also update the status in KAFKA-12384. Thank you.

Oh, sorry that I did not notice the existing ticket. Will update the Jira :)

@chia7712
Member Author

chia7712 commented Apr 6, 2021

Unrelated error. Merging trunk to trigger QA again.

@chia7712 chia7712 changed the title MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch KAFKA-12384: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch Apr 6, 2021
@ijuma
Member

ijuma commented Apr 6, 2021

@chia7712 Is the test still flaky with the latest changes?

@chia7712
Member Author

chia7712 commented Apr 6, 2021

Is the test still flaky with the latest changes?

Yep. I looped this patch 100 times and all passed.

Member

@ijuma ijuma left a comment

LGTM, thanks!

@chia7712 chia7712 merged commit 174f0f9 into apache:trunk Apr 7, 2021
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 7, 2021
* apache-github/trunk:
  KAFKA-10769 Remove JoinGroupRequest#containsValidPattern as it is dup… (apache#9851)
  KAFKA-12384: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch (apache#10389)
  KAFKA-5146: remove Connect dependency from Streams module (apache#10131)
@chia7712 chia7712 deleted the MINOR-10389 branch March 25, 2024 15:21