KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest by mjsax · Pull Request #8600 · apache/kafka

mjsax · 2020-05-01T21:04:28Z

Most changes just improve the error message output in case a test fails.

Potential fix: remove producer config retries=1

mjsax · 2020-05-01T21:05:53Z

    private void produceAbortedMessages() throws Exception {
        final Properties properties = new Properties();
        properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "someid");
-        properties.put(ProducerConfig.RETRIES_CONFIG, 1);


This might be the actual fix. Not sure why we set retries to one, but if we lost input data, we would never complete the result and the test would time out. (Maybe not relevant for aborted messages, but the same applies below.)
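For illustration, a minimal sketch of the producer setup once the override is gone. It uses plain string keys ("transactional.id" and "retries", the values behind ProducerConfig.TRANSACTIONAL_ID_CONFIG and ProducerConfig.RETRIES_CONFIG) so it compiles without the Kafka client on the classpath. The class and method names are hypothetical and not taken from the test.

```java
import java.util.Properties;

// Hypothetical helper, not part of GlobalKTableEOSIntegrationTest.
final class AbortedProducerProps {
    // String values behind ProducerConfig.TRANSACTIONAL_ID_CONFIG / RETRIES_CONFIG.
    static final String TRANSACTIONAL_ID = "transactional.id";
    static final String RETRIES = "retries";

    // Builds producer properties for a transactional producer.
    // "retries" is deliberately not set, so the client's default retry
    // behavior applies instead of a single retry.
    static Properties transactionalProps(final String transactionalId) {
        final Properties properties = new Properties();
        properties.put(TRANSACTIONAL_ID, transactionalId);
        return properties;
    }
}
```

With this shape, `AbortedProducerProps.transactionalProps("someid")` carries only the transactional id, and a transient send failure is no longer limited to one retry by the test itself.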

guozhangwang

I ran the tests locally and did not run into the failure, but sometimes the test hangs at:

[2020-05-01 16:30:48,226] INFO stream-client [app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c] State transition from CREATED to REBALANCING (org.apache.kafka.streams.KafkaStreams:280)
[2020-05-01 16:30:48,240] INFO Opening store globalStore in regular mode (org.apache.kafka.streams.state.internals.RocksDBTimestampedStore:100)
[2020-05-01 16:30:48,240] INFO global-stream-thread [app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-GlobalStreamThread] Restoring state for global store globalStore (org.apache.kafka.streams.processor.internals.GlobalStateManagerImpl:185)
[2020-05-01 16:30:48,242] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Cluster ID: jlS_Oi-0Rl6keBCVT10iwA (org.apache.kafka.clients.Metadata:280)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Subscribed to partition(s): globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 (org.apache.kafka.clients.consumer.KafkaConsumer:1115)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Seeking to EARLIEST offset of partition globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 (org.apache.kafka.clients.consumer.internals.SubscriptionState:566)
[2020-05-01 16:30:48,244] INFO [Consumer clientId=app-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-2c965644-9f77-485f-a1bd-fa4eabb95b9c-global-consumer, groupId=null] Resetting offset for partition globalTable-GlobalKTableEOSIntegrationTestshouldKStreamGlobalKTableLeftJoin_exactly_once_-0 to offset 0. (org.apache.kafka.clients.consumer.internals.SubscriptionState:383)

It seems the rebalance never completes, which may be an issue related to KIP-441. I think we can merge it as-is and see if the failure happens again.

mjsax · 2020-05-02T02:03:57Z

Java 8 and 11 passed.
Java 14: SmokeTestDriverIntegrationTest.shouldWorkWithRebalance

mjsax · 2020-05-02T02:04:06Z

Retest this please.

mjsax · 2020-05-04T03:13:25Z

Java 8 passed.
Java 11: org.apache.kafka.streams.integration.GlobalKTableIntegrationTest.shouldKStreamGlobalKTableLeftJoin (note it's not the EOS test)
Java 14:

org.apache.kafka.streams.integration.GlobalKTableEOSIntegrationTest.shouldKStreamGlobalKTableLeftJoin[exactly_once_beta]
org.apache.kafka.streams.integration.GlobalKTableIntegrationTest.shouldKStreamGlobalKTableLeftJoin

Java 14 EOS test failed with:

java.lang.AssertionError: Condition not met within timeout 30000. waiting for final values
  expected: {a=1+F, b=2+G, c=3+H, d=4+I, e=5+J}
  received: {a=1+F, b=2+G, c=3+C, d=4+I, e=5+J}

It seems we are missing one update, but it's unclear why/how an input record could get dropped... Will investigate further.
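The failure output above is the kind of message a polling wait condition produces when its details are built lazily, which is what the "Use supplier to provide error message" commit in this PR is about. A minimal sketch of that pattern follows, assuming a hypothetical `waitForCondition` helper; it is not Kafka's `TestUtils` implementation, and the poll interval and message layout are illustrative only.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper illustrating a wait condition with a lazily built error message.
final class WaitSketch {

    // Polls the condition until it holds or the timeout elapses.
    // The details supplier is only invoked on failure, so the message
    // reflects the state at the moment the wait gave up.
    static void waitForCondition(final BooleanSupplier condition,
                                 final long timeoutMs,
                                 final Supplier<String> details) {
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError(
                    "Condition not met within timeout " + timeoutMs + ". " + details.get());
            }
            try {
                Thread.sleep(10L); // illustrative poll interval
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting. " + details.get());
            }
        }
    }

    // Formats expected vs. received values for the failure message.
    static String describe(final Map<String, String> expected,
                           final Map<String, String> received) {
        return "waiting for final values\n  expected: " + expected + "\n  received: " + received;
    }
}
```

Because the supplier runs only when the wait fails, the "received" map in the message shows the last observed state, which is what makes a single stale value such as `c=3+C` visible in the report.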

mjsax · 2020-05-04T03:15:53Z

Retest this please.

mjsax · 2020-05-04T17:55:14Z

All three runs timed out.

Retest this please.

mjsax · 2020-05-05T17:33:02Z

Java 14 passed.
Java 8:

org.apache.kafka.streams.integration.EosIntegrationTest.shouldNotViolateEosIfOneTaskFailsWithState[exactly_once]
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses

Java 11: org.apache.kafka.streams.integration.GlobalKTableIntegrationTest.shouldKStreamGlobalKTableLeftJoin

guozhangwang · 2020-05-05T20:20:44Z

I still see the following issue locally:

java.lang.AssertionError: Condition not met within timeout 30000. waiting for final values
  expected: {a=1+F, b=2+G, c=3+H, d=4+I, e=5+J}
  received: {a=1+A, b=2+G, c=3+H, d=4+I, e=5+J}

	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
	at org.apache.kafka.test.TestUtils.lambda$waitForCondition$17(TestUtils.java:381)
	at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:429)
	at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:397)
	at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:378)
	at org.apache.kafka.streams.integration.GlobalKTableEOSIntegrationTest.shouldKStreamGlobalKTableLeftJoin(GlobalKTableEOSIntegrationTest.java:205)

In addition, the test sometimes hangs (i.e. the above verification does not fail; the test just runs forever). I tried using a different assignor via INTERNAL_TASK_ASSIGNOR_CLASS, but the same hanging issue still exists.

guozhangwang · 2020-05-07T15:54:17Z

LGTM! Let's merge to trunk now.

* 'trunk' of github.com:apache/kafka:
  KAFKA-9290: Update IQ related JavaDocs (apache#8114)
  KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest (apache#8600)
  KAFKA-6145: Set HighAvailabilityTaskAssignor as default in streams_upgrade_test.py (apache#8613)
  KAFKA-9667: Connect JSON serde strip trailing zeros (apache#8230)
  MINOR: Log4j Improvements on Fetcher (apache#8629)

Reviewer: Guozhang Wang <guozhang@confluent.io>

KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest

e0d5f56

mjsax added streams tests Test fixes (including flaky tests) labels May 1, 2020

mjsax commented May 1, 2020

mjsax added 2 commits May 1, 2020 14:07

fix formatting

1c1d325

Use supplier to provide error message

dbec61d

guozhangwang approved these changes May 1, 2020

Another try to fix flakiness (plus some cleanup)

df27f9e

Improve wait condition

ab2e6a4

guozhangwang mentioned this pull request May 7, 2020

MINOR: Log4j Improvements on Fetcher #8629

Merged

3 tasks

mjsax merged commit 611831b into apache:trunk May 8, 2020

mjsax deleted the kafka-9928-flaky-global-ktable-eos branch May 8, 2020 06:01

guozhangwang mentioned this pull request May 9, 2020

KAFKA-9949: Fix Flaky GlobalKTableIntegrationTest #8635

Merged

3 tasks

jwijgerd pushed a commit to buxapp/kafka that referenced this pull request May 14, 2020

KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest (apache#8600)

840cfb0

Reviewer: Guozhang Wang <guozhang@confluent.io>

mjsax merged 5 commits into apache:trunk from mjsax:kafka-9928-flaky-global-ktable-eos
