Skip to content

KAFKA-9538: Flaky test: testResetOffsetsExportImportPlan#6561

Merged
hachikuji merged 2 commits intoapache:trunkfrom
huxihx:KAFKA-8211
Feb 13, 2020
Merged

KAFKA-9538: Flaky test: testResetOffsetsExportImportPlan#6561
hachikuji merged 2 commits intoapache:trunkfrom
huxihx:KAFKA-8211

Conversation

@huxihx
Copy link
Copy Markdown
Contributor

@huxihx huxihx commented Apr 11, 2019

https://issues.apache.org/jira/browse/KAFKA-8211

Reduced offset-committing interval from 5 seconds to 1 second, hoping consumer#committed returns offset more quickly. Besides, enriched the output message for the exceptional case.

More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.

Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@huxihx
Copy link
Copy Markdown
Contributor Author

huxihx commented Apr 11, 2019

retest this please

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Feb 9, 2020

@huxihx This PR shows conflict. Can you rebase the PR to resolve them?

@hachikuji @cmccabe This test fails very often -- can we bump up the priority to review this PR? Does the proposed fix make sense? I am not familiar with this test.

https://issues.apache.org/jira/browse/KAFKA-8211

Reduced offset-committing interval from 5 seconds to 1 seoncd, hoping consumer#committed returns offset more quickly. Besides enriched the output message for the exceptional case.
@huxihx
Copy link
Copy Markdown
Contributor Author

huxihx commented Feb 10, 2020

Occasionally, the group is not in inactive state when resetting offsets. That would return the empty offset to fail the assertion. We should ensure the group inactivity before running resetOffsets. Does it make sense? Btw, I ran the test locally 50 times consecutively and got no failure.

@@ -416,6 +416,9 @@ class ResetConsumerGroupOffsetTest extends ConsumerGroupCommandTest {
produceConsumeAndShutdown(topic = topic1, group = group1, totalMessages = 100, numConsumers = 2)
produceConsumeAndShutdown(topic = topic2, group = group2, totalMessages = 100, numConsumers = 5)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case doesn't really need 5 consumers. Can we change both of these to use a single consumer?

@hachikuji hachikuji changed the title KAFKA-8211: Flaky test: testResetOffsetsExportImportPlan KAFKA-9538: Flaky test: testResetOffsetsExportImportPlan Feb 11, 2020
@hachikuji
Copy link
Copy Markdown
Contributor

ok to test

hachikuji added a commit that referenced this pull request Feb 12, 2020
The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to #7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, #6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the patch!

@hachikuji
Copy link
Copy Markdown
Contributor

ok to test

hachikuji added a commit that referenced this pull request Feb 12, 2020
The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to #7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, #6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
rite2nikhil pushed a commit to confluentinc/kafka that referenced this pull request Feb 12, 2020
…#8095)

The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to apache#7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, apache#6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
hachikuji added a commit that referenced this pull request Feb 12, 2020
The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to #7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, #6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
hachikuji added a commit that referenced this pull request Feb 12, 2020
The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to #7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, #6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor

Having a hard time getting a build here. I verified the fix works locally, so I will go ahead and merge.

@hachikuji hachikuji merged commit 46e80db into apache:trunk Feb 13, 2020
stanislavkozlovski pushed a commit to stanislavkozlovski/kafka that referenced this pull request Feb 18, 2020
…#8095)

The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to apache#7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, apache#6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020
…#8095)

The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to apache#7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, apache#6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants