KAFKA-16390 add group.coordinator.rebalance.protocols=classic,consumer to broker configs when system tests need the new coordinator #16715

Merged
jolshan merged 2 commits into apache:trunk from frankvicky:KAFKA-16390 on Aug 2, 2024

Conversation

@frankvicky
Contributor

@frankvicky frankvicky commented Jul 29, 2024

Jira Link
Fix an issue that causes system tests to fail when using AsyncKafkaConsumer.
A configuration option, group.coordinator.rebalance.protocols, was introduced to specify the rebalance protocols used by the group coordinator. By default, the rebalance protocol is set to classic. When the new group coordinator is enabled, the rebalance protocols are set to classic,consumer.
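
As a rough illustration of the behavior described above, the following sketch renders the coordinator-related broker properties. The two property keys are the ones named in this PR; the helper `coordinator_properties` is hypothetical and is not part of kafkatest.

```python
def coordinator_properties(use_new_coordinator):
    """Hypothetical helper: render coordinator-related broker properties."""
    lines = []
    if use_new_coordinator:
        # Enable the new group coordinator and allow both rebalance protocols.
        lines.append("group.coordinator.new.enable=true")
        lines.append("group.coordinator.rebalance.protocols=classic,consumer")
    # With no override, the broker default applies (classic, per the description).
    return "\n".join(lines)


print(coordinator_properties(True))
```

When `use_new_coordinator` is false the helper emits nothing, so the broker would fall back to its own default.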

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@chia7712
Member

@frankvicky could you please attach the test result? I will verify this patch locally later.

@@ -782,7 +782,8 @@ def prop_file(self, node):

if self.use_new_coordinator:
    override_configs[config_property.NEW_GROUP_COORDINATOR_ENABLE] = 'true'
Member

Maybe we can remove this config? group.coordinator.rebalance.protocols=classic,consumer already covers it.

Member

I would keep it for now. It is nice to have the ability to set it explicitly in tests.

Member

> I would keep it for now. It is nice to have the ability to set it explicitly in tests.

That is fine with me. BTW, will we remove group.coordinator.new.enable in the future?

Member

My point was: why do we need the scenario where the server uses the new coordinator but the "consumer protocol" is disallowed?

Member

Yes. We will remove group.coordinator.new.enable in the near future. No, we don't have this case at the moment, but I think we could have such cases, especially in integration environments where we want to test the new coordinator without the "consumer protocol".
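
To make the scenario under discussion concrete, here is a minimal sketch of how a comma-separated `group.coordinator.rebalance.protocols` value could be checked against a consumer's `group_protocol`. The helper `protocol_allowed` is hypothetical and only illustrates the idea; it is not broker code. The fallback to `classic` follows the default stated in the PR description.

```python
def protocol_allowed(broker_props, group_protocol):
    """Hypothetical check: is group_protocol in the broker's allowed list?"""
    # Fall back to "classic", the default named in the PR description.
    raw = broker_props.get("group.coordinator.rebalance.protocols", "classic")
    allowed = {p.strip() for p in raw.split(",") if p.strip()}
    return group_protocol in allowed


# New coordinator enabled, but the "consumer protocol" is not allowed.
classic_only = {"group.coordinator.new.enable": "true"}
# New coordinator enabled with both protocols, as this PR configures it.
both = {
    "group.coordinator.new.enable": "true",
    "group.coordinator.rebalance.protocols": "classic,consumer",
}

print(protocol_allowed(classic_only, "consumer"))  # False
print(protocol_allowed(both, "consumer"))          # True
```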

@frankvicky
Contributor Author

Hi @chia7712
Following is the test result on my local machine:

================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.11.4
session_id:       2024-07-29--006
run time:         55 minutes 55.809 seconds
tests run:        28
passed:           28
flaky:            0
failed:           0
ignored:          0
================================================================================
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.0-4.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   2 minutes 5.924 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.0-4.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   2 minutes 7.900 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.0-4.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   2 minutes 6.928 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.0-4.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   2 minutes 4.668 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   2 minutes 9.259 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   2 minutes 7.732 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   2 minutes 8.855 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_consume_bench.topics=.consume_bench_topic.0-5.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   2 minutes 6.395 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_partitions.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   2 minutes 17.896 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_partitions.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   2 minutes 18.293 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_partitions.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   2 minutes 18.186 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_partitions.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   2 minutes 18.477 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_topics.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   1 minute 59.997 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_topics.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   2 minutes 4.979 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_topics.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   2 minutes 1.873 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_random_group_topics.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   1 minute 59.512 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_specified_group_partitions_should_raise.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   1 minute 19.657 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_specified_group_partitions_should_raise.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   1 minute 18.882 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_specified_group_partitions_should_raise.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   1 minute 15.674 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_multiple_consumers_specified_group_partitions_should_raise.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   1 minute 13.861 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_single_partition.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   2 minutes 1.614 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_single_partition.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   2 minutes 1.605 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_single_partition.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   1 minute 57.558 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_single_partition.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   1 minute 59.035 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_two_consumers_specified_group_topics.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False
status:     PASS
run time:   2 minutes 9.197 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_two_consumers_specified_group_topics.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic
status:     PASS
run time:   2 minutes 7.757 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_two_consumers_specified_group_topics.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer
status:     PASS
run time:   2 minutes 8.797 seconds
--------------------------------------------------------------------------------
test_id:    kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_two_consumers_specified_group_topics.metadata_quorum=ZK.use_new_coordinator=False
status:     PASS
run time:   2 minutes 3.356 seconds
--------------------------------------------------------------------------------

Comment thread tests/kafkatest/services/kafka/kafka.py Outdated
if self.use_new_coordinator:
    override_configs[config_property.NEW_GROUP_COORDINATOR_ENABLE] = 'true'

    override_configs[config_property.GROUP_COORDINATOR_REBALANCE_PROTOCOLS_CONFIG] = 'classic,consumer'
Member

Just to double-check: here we're adding both as a default, but still respecting any specific value that a test may have, right? (I guess the following logic on ln 788 ensures that?)

Contributor Author

Yes, you're right, the logic respects specific test values while adding necessary defaults.
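
The ordering described in this exchange (defaults first, test-specific values applied afterwards) can be sketched as follows. This is a simplified stand-in, not the code in kafka.py: the function name `build_overrides` is hypothetical, and it assumes test-specific overrides arrive as a list of `[key, value]` pairs.

```python
def build_overrides(use_new_coordinator, server_prop_overrides):
    """Hypothetical sketch: set defaults first, then let per-test values win."""
    override_configs = {}
    if use_new_coordinator:
        override_configs["group.coordinator.new.enable"] = "true"
        override_configs["group.coordinator.rebalance.protocols"] = "classic,consumer"
    # Applied last, so an explicit per-test value replaces the default above.
    for key, value in server_prop_overrides:
        override_configs[key] = value
    return override_configs


result = build_overrides(True, [["group.coordinator.rebalance.protocols", "classic"]])
print(result["group.coordinator.rebalance.protocols"])  # classic
```

Because the per-test pairs are written after the defaults, a test that pins the protocol list keeps its own value.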

Contributor

@kirktrue kirktrue left a comment

Thanks for the PR, @frankvicky!

This seems like a partial revert of the changes from #16482. If @dajac is happy with it, then I am as well 😄

Thanks!

@kirktrue kirktrue added the labels ctr (Consumer Threading Refactor (KIP-848)), consumer, tests (Test fixes (including flaky tests)), and KIP-848 (The Next Generation of the Consumer Rebalance Protocol) on Jul 29, 2024
@chia7712
Member

> This seems like a partial revert of the changes from #16482. If @dajac is happy with it, then I am as well 😄

Yep, this PR reverts a part of #16482. On another note, it seems that adding the new coordinator configs to kraft_broker_configs is more straightforward. Maybe we can follow the style of #16482? @dajac @frankvicky WDYT?

@frankvicky
Contributor Author

> This seems like a partial revert of the changes from #16482. If @dajac is happy with it, then I am as well 😄

> Yep, this PR reverts a part of #16482. On another note, it seems that adding the new coordinator configs to kraft_broker_configs is more straightforward. Maybe we can follow the style of #16482? @dajac @frankvicky WDYT?

I'm sold. 😺

@chia7712
Member

chia7712 commented Aug 1, 2024

@frankvicky any updates? I feel this PR needs to be backported to 3.9 to make the e2e tests stable. @cmccabe FYI

@frankvicky
Contributor Author

Hi @chia7712
I will update it ASAP.

@jolshan
Member

jolshan commented Aug 2, 2024

Can we update the title and/or description? There are many more tests that are failing (and hopefully should be fixed) by this change, more specifically any test with use_new_coordinator=True. Thanks for the fix!
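
One way to see which tests fall into that group is to filter ducktape test IDs on the `use_new_coordinator=True` parameter. The sketch below uses four IDs copied from the session report earlier in this thread; the helper `uses_new_coordinator` is hypothetical and relies only on the parameter appearing literally in the test ID.

```python
def uses_new_coordinator(test_id):
    """Hypothetical filter: True if the test ID was run with the new coordinator."""
    return "use_new_coordinator=True" in test_id


prefix = "kafkatest.tests.core.consume_bench_test.ConsumeBenchTest.test_single_partition."
test_ids = [
    prefix + "metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=False",
    prefix + "metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=classic",
    prefix + "metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer",
    prefix + "metadata_quorum=ZK.use_new_coordinator=False",
]

affected = [t for t in test_ids if uses_new_coordinator(t)]
print(len(affected))  # 2
```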

Member

@chia7712 chia7712 left a comment

LGTM. I ran consume_bench_test.py locally and it works.

@chia7712 chia7712 changed the title from "KAFKA-16390: consume_bench_test.py failed using AsyncKafkaConsumer" to "KAFKA-16390 add group.coordinator.rebalance.protocols=classic,consumer to broker configs when E2E needs the new coordinator" on Aug 2, 2024
@chia7712 chia7712 changed the title from "KAFKA-16390 add group.coordinator.rebalance.protocols=classic,consumer to broker configs when E2E needs the new coordinator" to "KAFKA-16390 add group.coordinator.rebalance.protocols=classic,consumer to broker configs when system tests need the new coordinator" on Aug 2, 2024
Member

@lianetm lianetm left a comment

Changes LGTM. Running on my local still but some already passed. Since it passes for @chia7712 too, LGTM.

@jolshan jolshan merged commit da14b5a into apache:trunk Aug 2, 2024
@jolshan
Member

jolshan commented Aug 2, 2024

I will also cherry-pick this to 3.9.

jolshan pushed a commit that referenced this pull request Aug 2, 2024
…mer` to broker configs when system tests need the new coordinator (#16715)

Fix an issue that causes system tests to fail when using AsyncKafkaConsumer.
A configuration option, group.coordinator.rebalance.protocols, was introduced to specify the rebalance protocols used by the group coordinator. By default, the rebalance protocol is set to classic. When the new group coordinator is enabled, the rebalance protocols are set to classic,consumer.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <djacot@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Kirk True <kirk@kirktrue.pro>, Justine Olshan <jolshan@confluent.io>
@chia7712
Member

chia7712 commented Aug 3, 2024

> Running on my local still but some already passed.

Hi @lianetm, do you mean "consume_bench_test.py" still has some failing tests on your local machine? I ran the test only once 😅 so maybe there are other flaky ones ...

Labels

  • consumer
  • ctr: Consumer Threading Refactor (KIP-848)
  • KIP-848: The Next Generation of the Consumer Rebalance Protocol
  • tests: Test fixes (including flaky tests)

6 participants