KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest #8600
mjsax merged 5 commits into apache:trunk
    private void produceAbortedMessages() throws Exception {
        final Properties properties = new Properties();
        properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "someid");
        properties.put(ProducerConfig.RETRIES_CONFIG, 1);
This might be the actual fix. Not sure why we set retries to one, but if we lost input data, we would never complete the result and the test would time out. (Maybe not relevant for aborted messages, but the same applies below.)
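To make the concern concrete, here is a minimal sketch of a transactional producer configuration that simply leaves `retries` unset, so the client falls back to its default. It uses plain string keys (`transactional.id`, `retries`, `bootstrap.servers`) so that it compiles without the Kafka client on the classpath; the class name, method name, and broker address are hypothetical and not taken from the test.

```java
import java.util.Properties;

// Hypothetical helper, not part of GlobalKTableEOSIntegrationTest.
public class TransactionalProducerPropsSketch {

    // Builds producer properties without overriding "retries".
    // Assumption: the caller passes these to a KafkaProducer elsewhere.
    static Properties transactionalProducerProps(final String bootstrapServers) {
        final Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("transactional.id", "someid");
        // Deliberately no props.put("retries", 1):
        // with a single retry, a transient send failure could drop an input
        // record, and a test waiting for the complete result would time out.
        return props;
    }

    public static void main(final String[] args) {
        final Properties props = transactionalProducerProps("localhost:9092");
        System.out.println("retries overridden: " + props.containsKey("retries"));
    }
}
```

Leaving the key out, rather than setting a larger number, keeps the test aligned with whatever default the client version under test ships with.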
guozhangwang left a comment:
I ran the tests locally and did not run into the failure, but sometimes the test hangs at:
[2020-05-01 16:30:48,226] INFO stream-client [app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c] State transition from CREATED to REBALANCING (org.apache.kafka.streams.KafkaStreams:280)
[2020-05-01 16:30:48,240] INFO Opening store globalStore in regular mode (org.apache.kafka.streams.state.internals.RocksDBTimestampedStore:100)
[2020-05-01 16:30:48,240] INFO global-stream-thread [app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-GlobalStreamThread] Restoring state for global store globalStore (org.apache.kafka.streams.processor.internals.GlobalStateManagerImpl:185)
[2020-05-01 16:30:48,242] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Cluster ID: jlS_Oi-0Rl6keBCVT10iwA (org.apache.kafka.clients.Metadata:280)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Subscribed to partition(s): globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 (org.apache.kafka.clients.consumer.KafkaConsumer:1115)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Seeking to EARLIEST offset of partition globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 (org.apache.kafka.clients.consumer.internals.SubscriptionState:566)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Resetting offset for partition globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 to offset 0. (org.apache.kafka.clients.consumer.internals.SubscriptionState:383)
It seems the rebalance never completes, which may be related to KIP-441. I think we can merge it as-is and see if the failure happens again.
Java 8 and 11 passed.

Retest this please.

Java 8 passed. Java 14 EOS test failed with: It seems we are missing one update, but it's unclear why or how an input record could get dropped. Will investigate further.

Retest this please.

All three runs timed out. Retest this please.

Java 14 passed. Java 11:

I still see the following issue locally: In addition, sometimes the test hangs as well (i.e., the above verification does not fail; the test just runs forever). I tried using a different assignor via

LGTM! Let's merge to trunk now.
* 'trunk' of github.com:apache/kafka:
  KAFKA-9290: Update IQ related JavaDocs (apache#8114)
  KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest (apache#8600)
  KAFKA-6145: Set HighAvailabilityTaskAssignor as default in streams_upgrade_test.py (apache#8613)
  KAFKA-9667: Connect JSON serde strip trailing zeros (apache#8230)
  MINOR: Log4j Improvements on Fetcher (apache#8629)
Reviewer: Guozhang Wang <guozhang@confluent.io>
Most changes thus improve the error message output in case a test fails.
Potential fix: remove producer config retries=1
Call for review @guozhangwang @abbccdda
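
As an illustration of what "improving the error message output" can look like in a test that waits for results, the sketch below polls a condition up to a deadline and, on timeout, fails with a caller-supplied description of the observed state instead of hanging or reporting a bare timeout. The class `BoundedWaitSketch` and its method are hypothetical names used for illustration only; this is not the helper used in the PR.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical utility, written for illustration only.
public class BoundedWaitSketch {

    // Polls `condition` until it holds or `timeoutMs` elapses.
    // On timeout, throws an AssertionError that includes `details.get()`,
    // evaluated lazily so it reflects the state at the time of failure.
    static void waitForCondition(final BooleanSupplier condition,
                                 final long timeoutMs,
                                 final Supplier<String> details) {
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError(
                    "Condition not met within " + timeoutMs + " ms: " + details.get());
            }
            try {
                Thread.sleep(10L); // short pause between polls
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw new AssertionError("Interrupted while waiting: " + details.get());
            }
        }
    }
}
```

A test could call it as `waitForCondition(() -> results.size() == expected, 30_000L, () -> "expected " + expected + " records but got " + results)`, where `results` and `expected` are placeholders, so that a missing update shows up directly in the failure message.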