KAFKA-8460: reduce the record size and increase the delay time #9775

Closed
showuon wants to merge 3 commits into apache:trunk from showuon:KAFKA-8460

Conversation

showuon (Member) commented Dec 22, 2020

Looking into this flaky test, the error messages are:

Timed out before consuming expected 1350 records. The number consumed was 1230.

https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk8/303/testReport/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/

Timed out before consuming expected 1350 records. The number consumed was 1200.

https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk8/305/testReport/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/

Timed out before consuming expected 1350 records. The number consumed was 1215.

https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk8/305/testReport/junit/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/

As we can see, the number of records consumed varies from run to run and is always close to 1350. After checking the test, I found it is expected to be slow because it verifies that we can consume from all partitions when fetch.max.bytes and max.partition.fetch.bytes are low. So I think the test has no bug; it just needs more time.
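
For context, here is a minimal sketch of the consumer settings this scenario exercises (the broker address, group id, and byte values below are illustrative assumptions, not the test's actual constants). With both limits this low, each fetch response carries only a little data per partition, so draining every partition is inherently slow:

    import java.util.Properties
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.serialization.ByteArrayDeserializer

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // illustrative
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "low-fetch-size-test")     // illustrative
    // The two limits under test: the cap on a whole fetch response and the cap
    // on the data returned for any single partition. Low values force the
    // consumer into many small round trips.
    props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "500")
    props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "100")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](
      props, new ByteArrayDeserializer, new ByteArrayDeserializer)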

What I did:

  1. Reduce the number of records per partition (from 15 to 10). This should speed up the test while still exercising the original scenario.
  2. Increase the timeout (from 60 to 90 seconds).

Hope this makes the test more reliable!

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

showuon (Member, Author) commented Dec 22, 2020

@chia7712, could you help review this small PR? Thanks.

chia7712 (Member) left a comment

@showuon Thanks for your patch. Please take a look at the following comments. Thanks!

// We produce 10 records for each topic partition. There are 3 topics with 30 partitions each,
// so the total producerRecords size should be 10 * 3 * 30 = 900.
val producerRecords = partitions.flatMap(sendRecords(producer, numRecords = 10, _))
val consumerRecords = consumeRecords(consumer, producerRecords.size, waitTimeMs = 90 * 1000)
chia7712 (Member):
Personally, 90 seconds is too long for a test case. Can't reducing the produce size alone resolve this issue?

showuon (Member, Author):

Agreed! I increased it to 90 seconds just in case. I think reducing the record count alone is good enough.

showuon (Member, Author):

Reverted back to 60 seconds now.

-  maxPollRecords: Int = Int.MaxValue): ArrayBuffer[ConsumerRecord[K, V]] = {
+  maxPollRecords: Int = Int.MaxValue,
+  waitTimeMs: Int = 60000): ArrayBuffer[ConsumerRecord[K, V]] = {
   val records = new ArrayBuffer[ConsumerRecord[K, V]]
chia7712 (Member):

Could you add an initial size? This buffer collects all returned records, so the default capacity is too small for this case.
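
A minimal sketch of what the suggested change might look like (assuming the helper's numRecords parameter is in scope; the full signature is not shown here):

    // Pre-size the buffer to the expected record count so it does not need to
    // grow and copy repeatedly while accumulating every returned record.
    val records = new ArrayBuffer[ConsumerRecord[K, V]](numRecords)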

showuon (Member, Author):

Nice catch! Updated. Thanks.

showuon (Member, Author) commented Dec 22, 2020

@chia7712, I was too optimistic. I saw only 7xx records consumed in recent builds:

Timed out before consuming expected 1350 records. The number consumed was 720.

https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk15/353/testReport/junit/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/
https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk11/333/testReport/junit/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/

I don't know how slow the CI system can get, so I reduced it to 5 records per partition, for a total of 450 records. FYI.
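
Under the same topic layout (3 topics, 30 partitions each), that change would look roughly like this in the test (a sketch based on the snippet quoted above):

    // 5 records per partition: 5 * 3 * 30 = 450 records in total.
    val producerRecords = partitions.flatMap(sendRecords(producer, numRecords = 5, _))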

showuon (Member, Author) commented Dec 24, 2020

@chia7712, the test failed in my PR builds:

Timed out before consuming expected 450 records. The number consumed was 325

It only consumed 325 records within 60 seconds! So slow! Do you think I should reduce the record count further?

showuon (Member, Author) commented Dec 30, 2020

Monitoring recent test failures, I think reducing the records to 450 should be good enough. What do you think? @chia7712

org.scalatest.exceptions.TestFailedException: Timed out before consuming expected 1350 records. The number consumed was 1275.
org.scalatest.exceptions.TestFailedException: Timed out before consuming expected 1350 records. The number consumed was 1005.

https://ci-builds.apache.org/job/Kafka/job/kafka-trunk-jdk11/342/testReport/junit/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/

chia7712 (Member):

@showuon Is there a potential bug that can slow down the consumer in this test case? Or is the slowness just caused by a busy Jenkins?

showuon closed this Jan 13, 2021
showuon deleted the KAFKA-8460 branch January 13, 2021 09:54
showuon (Member, Author) commented Jan 14, 2021

@chia7712, after further investigation, I found the test itself has some issues that cause the flakiness; it is not related to the record count. I opened another PR, #9877, to address it. Thanks.
