
KAFKA-8460: produce records with current timestamp #9877

Merged
chia7712 merged 5 commits into apache:trunk from showuon:KAFKA-8460
Jan 20, 2021

Conversation

@showuon
Member

@showuon showuon commented Jan 13, 2021

Originally, we make sure the consumer completes awaitAssignment, and then produce records. We send 30 records to each partition, across 3 topics with 30 partitions each, so it takes some time to process. If it exceeds 6 seconds, the consumer will leave the group because max.poll.interval.ms is set to 6 secs. The logs will then be deleted because the record timestamp we set is 0, so currentTime - 0 is greater than the retentionMs setting. After log deletion, the log start offset is incremented. So, later when the consumer comes back, it first does listOffsets and gets the new, unexpected start offset. That's why we sometimes cannot receive the expected number of records.

There are many ways to fix this issue, e.g. increasing max.poll.interval.ms, or sending the records with the current timestamp. I fixed it by producing the records before awaitAssignment, and then making sure we consume records right after awaitAssignment. This makes the test much more reliable.

The numRecords was changed to partitionCount because partitionCount is the original numRecords. I tried reducing the record count months ago and the test still failed, which is why I dug into this flaky test. :) So I just reverted the previous change.

PS: this flaky-test investigation took me a lot of time, and I finally found the root cause! Yeah~

[UPDATED] After discussion with the reviewer, we decided to produce records with the current timestamp to avoid breaking the original purpose of the test.
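
For illustration, a minimal sketch of the timestamp issue (the recordFor helper below is hypothetical, not the test's actual code):

import org.apache.kafka.clients.producer.ProducerRecord

// Hypothetical helper: stamp a record with either 0 (the old behavior) or the current
// wall-clock time (the new behavior).
def recordFor(topic: String, partition: Int, payload: Array[Byte],
              useCurrentTime: Boolean): ProducerRecord[Array[Byte], Array[Byte]] = {
  val timestamp: java.lang.Long = if (useCurrentTime) System.currentTimeMillis() else 0L
  new ProducerRecord[Array[Byte], Array[Byte]](topic, partition, timestamp, null, payload)
}

// Time-based retention deletes a segment roughly when now - largestTimestamp > retention.ms,
// so records stamped with 0 become eligible for deletion as soon as retention runs.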

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon
Member Author

showuon commented Jan 13, 2021

@hachikuji, could you help review this PR? Thanks.

@ijuma
Member

ijuma commented Jan 15, 2021

Not sure what the original purpose of the test was, but this changes what it's testing, right? Before, producing happened after subscription, and now it happens before.

@ijuma
Member

ijuma commented Jan 15, 2021

Can we just bump max.poll.interval.ms? It doesn't seem like there's any value in having that be low in this test.

@ijuma
Member

ijuma commented Jan 15, 2021

Also, thanks so much for diagnosing this! Very awesome. :)

@showuon showuon force-pushed the KAFKA-8460 branch 3 times, most recently from a717112 to fb751ac on January 18, 2021 06:35
@showuon
Member Author

showuon commented Jan 18, 2021

@ijuma, thanks for the comments. I agree with you that it might change what it was testing. In this commit (a43e72a), I updated:

  1. revert what I did
  2. increase max.poll.interval.ms to avoid an unnecessary consumer rejoin
  3. send records with the current timestamp, so the log cleaner won't delete the records if sending them takes too much time

I found that (3) is the main reason the test fails: the log cleaner deletes logs that are too old, and then we have nothing to consume. So we need to set the records to the current timestamp (or set log.cleanup.policy to compact only; of course, I prefer setting the timestamp :) ). I still increased max.poll.interval.ms because avoiding a consumer rejoin improves the test's performance.
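
For reference, a minimal sketch of what (2) looks like on the consumer side (the 60000 value is illustrative; the exact value in the PR may differ):

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

// Illustrative only: raise max.poll.interval.ms well above the time needed to produce
// all the records, so the group coordinator does not kick the consumer out mid-test.
val consumerConfig = new Properties()
consumerConfig.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 60000.toString)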

Thank you.


val consumerRecords = consumeRecords(consumer, producerRecords.size)

// the expected records should be 30 records * 3 topics * 30 partitions = 2700
Member


What is the purpose of this and the other comment?

Member Author


Let me remove it.

val producer = createProducer()
val producerRecords = partitions.flatMap(sendRecords(producer, numRecords = 15, _))
// it'll send 30 records to each topic partition (3 topics and 30 partitions each topic)
val producerRecords = partitions.flatMap(sendRecords(producer, numRecords = partitionCount, _, needCurrentTime = true))
Member


Why did we change the number of records to match the partition count?

Member Author


Actually, partitionCount is the original numRecords. I tried reducing the record count months ago and the test still failed; that's why I dug into this flaky test. :) So I just reverted the previous change.

Member


Sounds good. Worth mentioning it in the PR description.

Member Author


Thanks for the reminder. I've updated the PR description.


protected def sendRecords(producer: KafkaProducer[Array[Byte], Array[Byte]], numRecords: Int,
tp: TopicPartition): Seq[ProducerRecord[Array[Byte], Array[Byte]]] = {
tp: TopicPartition, needCurrentTime: Boolean): Seq[ProducerRecord[Array[Byte], Array[Byte]]] = {
Member


Do we need this boolean? Could we always set the current time? It seems like using the existing behavior may cause problems for many other tests too.

Member Author


Of course. I'm just afraid some tests rely on the original "fake time". Let me test it.

Member

@ijuma ijuma Jan 18, 2021


Sounds good. Yes, if some tests fail after this change, we can use the previous behavior in those tests (via a default argument instead of method overload). And the new behavior should be the default.

Member Author


OK, now the default sendRecords(producer, numRecords, tp) will send with the current timestamp. Some tests verify the timestamp after consuming, so we should keep the old behavior for those.
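
Roughly, the default-argument shape being described looks like this sketch (assuming the timestamp is exposed as a startingTimestamp parameter; the exact names in the PR may differ):

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition

// Sketch: by default the records get the current timestamp; tests that verify timestamps
// after consuming can pass an explicit startingTimestamp to keep deterministic values.
def sendRecords(producer: KafkaProducer[Array[Byte], Array[Byte]], numRecords: Int, tp: TopicPartition,
                startingTimestamp: Long = System.currentTimeMillis()): Seq[ProducerRecord[Array[Byte], Array[Byte]]] = {
  val records = (0 until numRecords).map { i =>
    val timestamp = startingTimestamp + i
    val record = new ProducerRecord(tp.topic, tp.partition, timestamp,
      s"key $i".getBytes, s"value $i".getBytes)
    producer.send(record)
    record
  }
  producer.flush()
  records
}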

this.consumerConfig.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "500")
this.consumerConfig.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "100")

// set the 'max.poll.interval.ms' larger (default is 6 seconds), to avoid the consumer got kicked out during sending records
Member


This comment can be more concise. Something like:

Avoid a rebalance while the records are being sent (the default is 6 seconds)

Member Author


Updated.

@showuon
Member Author

showuon commented Jan 18, 2021

Let's see if it'll break other tests.

@showuon showuon changed the title from "KAFKA-8460: consume records right after awaitAssignment" to "KAFKA-8460: produce records with current timestamp" on Jan 18, 2021
@showuon
Member Author

showuon commented Jan 18, 2021

Tests on JDK 11/15 passed. The JDK 8 test failure is addressed in #9888.

@ijuma
Member

ijuma commented Jan 19, 2021

Thanks. I tweaked things a bit in the last commit here:

1ecdd58

If you think it looks good, please add it to your PR and please merge trunk.

@ijuma
Member

ijuma commented Jan 19, 2021

@showuon This test seems to be failing a lot, so it would be good to integrate the proposed changes (if you agree) so that we can merge it.

@showuon
Member Author

showuon commented Jan 20, 2021

@ijuma, sorry, I was busy yesterday. I'll work on it today. Thanks.

Member

@ijuma ijuma left a comment


LGTM, thanks. Will merge if the tests pass.

@showuon
Member Author

showuon commented Jan 20, 2021

@ijuma, thanks for helping refactor the code. I've fixed the conflict and made sure every test extending BaseConsumerTest passes. Let's wait for the Jenkins test results. Thank you.

Member

@chia7712 chia7712 left a comment


@showuon Nice improvement! +1

@showuon
Member Author

showuon commented Jan 20, 2021

failed tests:

    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 8 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 15 / org.apache.kafka.clients.consumer.KafkaConsumerTest.testCloseWithTimeUnit()
    Build / JDK 15 / org.apache.kafka.clients.consumer.internals.FetcherTest.testEarlierOffsetResetArrivesLate()
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 15 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddCurSizeAllMemTablesMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddNumLiveVersionMetric
    Build / JDK 11 / org.apache.kafka.streams.state.internals.metrics.RocksDBMetricsTest.shouldAddEstimateNumKeysMetric

Most of the failures are in RocksDBMetricsTest, which is addressed in #9935. The other failing tests are not related to my change. Thanks.

@chia7712 chia7712 merged commit 918c332 into apache:trunk Jan 20, 2021