
KAFKA-8460: produce records with current timestamp #9877

Merged
chia7712 merged 5 commits into apache:trunk from showuon:KAFKA-8460
Jan 20, 2021

Conversation

@showuon
Member

@showuon showuon commented Jan 13, 2021

Originally, we make sure the consumer completes awaitAssignment, and then produce records. We send 30 records to each partition, across 3 topics with 30 partitions each, so it takes some time to process. If it exceeds 6 seconds, the consumer will leave the group because max.poll.interval.ms is set to 6 secs. The logs will then be deleted because the record timestamp we set is 0, so currentTime - 0 is greater than the retentionMs setting. After log deletion, the log start offset is incremented. So, later when the consumer comes back, it first does listOffsets and gets the new, unexpected start offset. That's why we sometimes cannot receive the expected number of records.

There are many ways to fix this issue, e.g. increasing max.poll.interval.ms, or sending the records with the current timestamp. I fixed it by producing the records before awaitAssignment, and then making sure we consume records right after awaitAssignment. This makes the test much more reliable.

The numRecords was changed to partitionCount because partitionCount is the original numRecords. I tried reducing the record count months ago and the test still failed, which is why I dug into this flaky test. :) So I just reverted the previous change.

PS: this flaky-test investigation took me a lot of time, and I finally found the root cause! Yeah~

[UPDATED] After discussion with the reviewer, we decided to produce records with the current timestamp to avoid breaking the original purpose of the test.
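
For illustration, a minimal sketch of the timestamp issue (the recordFor helper below is hypothetical, not the test's actual code):

import org.apache.kafka.clients.producer.ProducerRecord

// Hypothetical helper: stamp a record with either 0 (the old behavior) or the current
// wall-clock time (the new behavior).
def recordFor(topic: String, partition: Int, payload: Array[Byte],
              useCurrentTime: Boolean): ProducerRecord[Array[Byte], Array[Byte]] = {
  val timestamp: java.lang.Long = if (useCurrentTime) System.currentTimeMillis() else 0L
  new ProducerRecord[Array[Byte], Array[Byte]](topic, partition, timestamp, null, payload)
}

// Time-based retention deletes a segment roughly when now - largestTimestamp > retention.ms,
// so records stamped with 0 become eligible for deletion as soon as retention runs.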

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon
Member Author

showuon commented Jan 13, 2021

@hachikuji, could you help review this PR? Thanks.

@ijuma
Member

ijuma commented Jan 15, 2021

Not sure what the original purpose of the test was, but this changes what it's testing, right? Before, producing happened after subscription, and now it happens before.

@ijuma
Member

ijuma commented Jan 15, 2021

Can we just bump max.poll.interval.ms? It doesn't seem like there's any value in having that be low in this test.

@ijuma
Member

ijuma commented Jan 15, 2021

Also, thanks so much for diagnosing this! Very awesome. :)

@showuon showuon force-pushed the KAFKA-8460 branch 3 times, most recently from a717112 to fb751ac on January 18, 2021 06:35
@showuon
Member Author

showuon commented Jan 18, 2021

@ijuma, thanks for the comments. I agree with you that it might change what it was testing. In this commit (a43e72a), I updated:

  1. revert what I did
  2. increase max.poll.interval.ms to avoid an unnecessary consumer rejoin
  3. send records with the current timestamp, so the log cleaner won't delete the records if sending them takes too much time

I found that (3) is the main reason the test fails: the log cleaner deletes logs that are too old, and then we have nothing to consume. So we need to set the records to the current timestamp (or set log.cleanup.policy to compact only; of course, I prefer setting the timestamp :) ). I still increased max.poll.interval.ms because avoiding a consumer rejoin improves the test's performance.
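
For reference, a minimal sketch of what (2) looks like on the consumer side (the 60000 value is illustrative; the exact value in the PR may differ):

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

// Illustrative only: raise max.poll.interval.ms well above the time needed to produce
// all the records, so the group coordinator does not kick the consumer out mid-test.
val consumerConfig = new Properties()
consumerConfig.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 60000.toString)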

Thank you.


val consumerRecords = consumeRecords(consumer, producerRecords.size)

// the expected records should be 30 records * 3 topics * 30 partitions = 2700
Member


What is the purpose of this and the other comment?

Member Author


Let me remove it.

val producer = createProducer()
val producerRecords = partitions.flatMap(sendRecords(producer, numRecords = 15, _))
// it'll send 30 records to each topic partition (3 topics and 30 partitions each topic)
val producerRecords = partitions.flatMap(sendRecords(producer, numRecords = partitionCount, _, needCurrentTime = true))
Member


Why did we change the number of records to match the partition count?

Member Author


Actually, partitionCount is the original numRecords. I tried reducing the record count months ago and the test still failed; that's why I dug into this flaky test. :) So I just reverted the previous change.

Member


Sounds good. Worth mentioning it in the PR description.

Member Author


Thanks for the reminder. I've updated the PR description.


protected def sendRecords(producer: KafkaProducer[Array[Byte], Array[Byte]], numRecords: Int,
tp: TopicPartition): Seq[ProducerRecord[Array[Byte], Array[Byte]]] = {
tp: TopicPartition, needCurrentTime: Boolean): Seq[ProducerRecord[Array[Byte], Array[Byte]]] = {
Member


Do we need this boolean? Could we always set the current time? It seems like using the existing behavior may cause problems for many other tests too.

Member Author


Of course. I'm just afraid some tests rely on the original "fake time". Let me test it.

Member

@ijuma ijuma Jan 18, 2021


Sounds good. Yes, if some tests fail after this change, we can use the previous behavior in those tests (via a default argument instead of method overload). And the new behavior should be the default.

Member Author


OK, now the default sendRecords(producer, numRecords, tp) will send with the current timestamp. Some tests verify the timestamp after consuming, so we should keep the old behavior for those.
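
Roughly, the default-argument shape being described looks like this sketch (assuming the timestamp is exposed as a startingTimestamp parameter; the exact names in the PR may differ):

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition

// Sketch: by default the records get the current timestamp; tests that verify timestamps
// after consuming can pass an explicit startingTimestamp to keep deterministic values.
def sendRecords(producer: KafkaProducer[Array[Byte], Array[Byte]], numRecords: Int, tp: TopicPartition,
                startingTimestamp: Long = System.currentTimeMillis()): Seq[ProducerRecord[Array[Byte], Array[Byte]]] = {
  val records = (0 until numRecords).map { i =>
    val timestamp = startingTimestamp + i
    val record = new ProducerRecord(tp.topic, tp.partition, timestamp,
      s"key $i".getBytes, s"value $i".getBytes)
    producer.send(record)
    record
  }
  producer.flush()
  records
}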

this.consumerConfig.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "500")
this.consumerConfig.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "100")

// set the 'max.poll.interval.ms' larger (default is 6 seconds), to avoid the consumer got kicked out during sending records
Member


This comment can be more concise. Something like:

Avoid a rebalance while the records are being sent (the default is 6 seconds)

Member Author


Updated.

@showuon
Member Author

showuon commented Jan 18, 2021

Let's see if it'll break other tests.

@showuon showuon changed the title from "KAFKA-8460: consume records right after awaitAssignment" to "KAFKA-8460: produce records with current timestamp" on Jan 18, 2021
@showuon
Member Author

showuon commented Jan 18, 2021

Tests on JDK 11/15 passed. The JDK 8 test failure is addressed in #9888.

@ijuma
Member

ijuma commented Jan 19, 2021

Thanks. I tweaked things a bit in the last commit here:

1ecdd58

If you think it looks good, please add it to your PR and please merge trunk.

@ijuma
Member

ijuma commented Jan 19, 2021

@showuon This test seems to be failing a lot, so it would be good to integrate the proposed changes (if you agree) so that we can merge it.

@showuon
Member Author

showuon commented Jan 20, 2021

@ijuma, sorry, I was busy yesterday. I'll work on it today. Thanks.

Member

@ijuma ijuma left a comment


LGTM, thanks. Will merge if the tests pass.

@showuon
Member Author

showuon commented Jan 20, 2021

@ijuma, thanks for helping refactor the code. I've fixed the conflict and made sure every test extending BaseConsumerTest passes. Let's wait for the Jenkins test results. Thank you.

Member

@chia7712 chia7712 left a comment


@showuon Nice improvement! +1

@showuon
Member Author

showuon commented Jan 20, 2021

failed tests:

    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 15 / org.apache.kafka.clients.consumer.KafkaConsumerTest.testCloseWithTimeUnit()
    Build / JDK 15 / org.apache.kafka.clients.consumer.internals.FetcherTest.testEarlierOffsetResetArrivesLate()
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric

Most of the failures are in RocksDBMetricsTest, which is addressed in #9935. The other failing tests are not related to my change. Thanks.

@chia7712 chia7712 merged commit 918c332 into apache:trunk Jan 20, 2021