Skip to content

KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest#8600

Merged
mjsax merged 5 commits intoapache:trunkfrom
mjsax:kafka-9928-flaky-global-ktable-eos
May 8, 2020
Merged

KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest#8600
mjsax merged 5 commits intoapache:trunkfrom
mjsax:kafka-9928-flaky-global-ktable-eos

Conversation

@mjsax
Copy link
Copy Markdown
Member

@mjsax mjsax commented May 1, 2020

Most changes thus improve the error message output in case a test fails.

Potential fix: remove producer config retries=1

Call for review @guozhangwang @abbccdda

@mjsax mjsax added streams tests Test fixes (including flaky tests) labels May 1, 2020
private void produceAbortedMessages() throws Exception {
final Properties properties = new Properties();
properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "someid");
properties.put(ProducerConfig.RETRIES_CONFIG, 1);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the actually fix. Not sure why we set retries to one, but if we would loose input data, we would never complete the result and the test would time out. (Maybe not relevant for aborted message, but same below)

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run the tests locally, did not run into the failure but sometimes it hangs on

[2020-05-01 16:30:48,226] INFO stream-client [app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c] State transition from CREATED to REBALANCING (org.apache.kafka.streams.KafkaStreams:280)
[2020-05-01 16:30:48,240] INFO Opening store globalStore in regular mode (org.apache.kafka.streams.state.internals.RocksDBTimestampedStore:100)
[2020-05-01 16:30:48,240] INFO global-stream-thread [app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-GlobalStreamThread] Restoring state for global store globalStore (org.apache.kafka.streams.processor.internals.GlobalStateManagerImpl:185)
[2020-05-01 16:30:48,242] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Cluster ID: jlS_Oi-0Rl6keBCVT10iwA (org.apache.kafka.clients.Metadata:280)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Subscribed to partition(s): globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 (org.apache.kafka.clients.consumer.KafkaConsumer:1115)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Seeking to EARLIEST offset of partition globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 (org.apache.kafka.clients.consumer.internals.SubscriptionState:566)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Resetting offset for partition globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 to offset 0. (org.apache.kafka.clients.consumer.internals.SubscriptionState:383)

Seems the rebalance never completes, which may be a related issue of KIP-441. I think we can merge it as-is and see if the failure happens again.

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented May 2, 2020

Java 8 and 11 passed.
Java 14: SmokeTestDriverIntegrationTest.shouldWorkWithRebalance

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented May 2, 2020

Retest this please.

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented May 4, 2020

Java 8 passed.
Java 11: org.apache.kafka.streams.integration.GlobalKTableIntegrationTest.shouldKStreamGlobalKTableLeftJoin (note it's not the EOS test)
Java 14:

org.apache.kafka.streams.integration.GlobalKTableEOSIntegrationTest.shouldKStreamGlobalKTableLeftJoin[exactly_once_beta]
org.apache.kafka.streams.integration.GlobalKTableIntegrationTest.shouldKStreamGlobalKTableLeftJoin

Java 14 EOS test failed with:

java.lang.AssertionError: Condition not met within timeout 30000. waiting for final values
  expected: {a=1+F, b=2+G, c=3+H, d=4+I, e=5+J}
  received: {a=1+F, b=2+G, c=3+C, d=4+I, e=5+J}

It seem we are missing one update, but it's unclear why/how an input record could get dropped... Will investigate further.

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented May 4, 2020

Retest this please.

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented May 4, 2020

All three runs timed out.

Retest this please.

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented May 5, 2020

Java 14 passed.
Java 8:

org.apache.kafka.streams.integration.EosIntegrationTest.shouldNotViolateEosIfOneTaskFailsWithState[exactly_once]
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses

Java 11: org.apache.kafka.streams.integration.GlobalKTableIntegrationTest.shouldKStreamGlobalKTableLeftJoin

@guozhangwang
Copy link
Copy Markdown
Contributor

I still see the following issue locally:

java.lang.AssertionError: Condition not met within timeout 30000. waiting for final values
  expected: {a=1+F, b=2+G, c=3+H, d=4+I, e=5+J}
  received: {a=1+A, b=2+G, c=3+H, d=4+I, e=5+J}

	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
	at org.apache.kafka.test.TestUtils.lambda$waitForCondition$17(TestUtils.java:381)
	at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:429)
	at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:397)
	at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:378)
	at org.apache.kafka.streams.integration.GlobalKTableEOSIntegrationTest.shouldKStreamGlobalKTableLeftJoin(GlobalKTableEOSIntegrationTest.java:205)

In addition, sometimes the test will hang as well (i.e. the above verification would not fail, the test just runs forever); I tried using different assignor via INTERNAL_TASK_ASSIGNOR_CLASS but the same hanging issue still exists.

@guozhangwang
Copy link
Copy Markdown
Contributor

LGTM! Let's merge to trunk now.

@mjsax mjsax merged commit 611831b into apache:trunk May 8, 2020
@mjsax mjsax deleted the kafka-9928-flaky-global-ktable-eos branch May 8, 2020 06:01
Kvicii pushed a commit to Kvicii/kafka that referenced this pull request May 8, 2020
* 'trunk' of github.com:apache/kafka:
  KAFKA-9290: Update IQ related JavaDocs (apache#8114)
  KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest (apache#8600)
  KAFKA-6145: Set HighAvailabilityTaskAssignor as default in streams_upgrade_test.py (apache#8613)
  KAFKA-9667: Connect JSON serde strip trailing zeros (apache#8230)
  MINOR: Log4j Improvements on Fetcher (apache#8629)
jwijgerd pushed a commit to buxapp/kafka that referenced this pull request May 14, 2020
Reviewer: Guozhang Wang <guozhang@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

streams tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants