Skip to content

MINOR: Adds KRaft versions of most streams system tests#12458

Merged
vvcephei merged 6 commits intoapache:trunkfrom
AlanConfluent:updates_tests_kraft
Aug 26, 2022
Merged

MINOR: Adds KRaft versions of most streams system tests#12458
vvcephei merged 6 commits intoapache:trunkfrom
AlanConfluent:updates_tests_kraft

Conversation

@AlanConfluent
Copy link
Copy Markdown
Contributor

Adds the annotation @matrix(metadata_quorum=quorum.all_non_upgrade) to many existing tests, which runs them with each of zookeeper and remote_kraft nodes.

This skips tests which use various forms of Kafka versioning since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a followup PR.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @AlanConfluent !

It's great to see us tightening the knobs on KRaft.

In addition to addressing the review comments, can you post a link to the system test results with this change? It would be good to verify the impact before merging.

Thanks!
-John

@matrix(failure_mode=["clean_shutdown", "hard_shutdown", "clean_bounce", "hard_bounce"],
broker_type=["leader", "controller"],
num_threads=[1, 3],
sleep_time_secs=[120])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about explicitly specifying quorum.zk instead of relying on the default? It's maybe mildly more verbose, but it avoids having to think through overrides when you're debugging a test failure out of the blue.

If you're in favor, I'd suggest just removing all the default parameter values that got added here to be sure everything is explicitly configured.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this extra matrix configuration to avoid broker_type=="controller" for metadata_quorum=="quorum.remote_kraft"-- not sure if you noticed that, otherwise I would have just added metadata_quorum=quorum.all_non_upgrade to the existing matrix to cover both modes as you said. This is because killing the controller when it's separate from the broker is maybe not what the test is attempting to test and also requires other changes to accommodate killing off the separate KRaft controller node, but that seems a bit overkill for this purpose.

Is that what

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @AlanConfluent . I was actually making a slightly different suggestion, namely to explicitly include metadata_quorum=[quorum.zk] in this second matrix config, and to remove the default parameter value metadata_quorum=quorum.zk from the method signature.

Github charmingly won't let me do a suggestion to make this clearer because the diff includes a deleted line.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Done!

Comment on lines +80 to +82
self.wait_for_verification(processor_1, "ACTIVE_TASKS:2 STANDBY_TASKS:2", processor_1.STDOUT_FILE)
self.wait_for_verification(processor_2, "ACTIVE_TASKS:2 STANDBY_TASKS:2", processor_2.STDOUT_FILE)
self.wait_for_verification(processor_3, "ACTIVE_TASKS:2 STANDBY_TASKS:2", processor_3.STDOUT_FILE)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's up with this verification change? It looks like you're increasing the specificity of the verification here, which is good. I'm wondering if it has some relationship to the quorum change, of if it's just on the side.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I know how this change got made. I had made my changes in another repo locally and copied the file over to this one... Must have been made by another commit in the original repo. Will revert this and make sure there are no similar changes elsewhere.

@guozhangwang
Copy link
Copy Markdown
Contributor

In addition to addressing the review comments, can you post a link to the system test results with this change? It would be good to verify the impact before merging.

+1. I'd also love to learn how much system test time increase this one would incur.

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AlanConfluent ! A few meta questions:

  1. Seems the streams_cooperative_rebalance_upgrade_test.py is not included in this PR?
  2. I think for streams_application_upgrade_test.py we should also consider enabling kraft on the servers, to make sure that kraft works when streams itself upgrade.
  3. In streams_broker_compatibility_test.py when we test for broker versions > 3.1 we should also allow it to be kraft.
  4. This is not related to this PR, but it seems the test coverage for streams_application_ugprade_test and streams_upgrade_test have much overlaps. @vvcephei could you chime in here since you have much experience with the former class file. Could we dedup their coverage hence reduce our test time?
  5. If the current e2e time is too high, I feel maybe we can skip adding the kraft model for the following suite:
    a. shutdown_deadlock
    b. relational_smoke, since its dependency on kraft is exactly covered by smoke -- i.e. if there's an issue with relational_smoke due to kraft, then smoke itself should fail as well.
    c. named_repartition

def test_many_brokers_bounce(self, failure_mode, num_failures):
num_failures=[2],
metadata_quorum=quorum.all_non_upgrade)
def test_many_brokers_bounce(self, failure_mode, num_failures, metadata_quorum=quorum.zk):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since we already set the value range of metadata_quorum in the matrix, do we still need to set its default as quorum.zk? Seems the default value would never be used? Ditto elsewhere.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah just saw @vvcephei 's comment earlier, that makes sense. Please feel free to ignore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think if it's not defined, it'll default to using zk, but I just wanted to be a bit more explicit about the default.

broker_type=["leader"],
num_threads=[1, 3],
sleep_time_secs=[120],
metadata_quorum=[quorum.remote_kraft])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we only want to test remote_kraft but not collocated kraft?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what was suggested for tests which are not testing core Kafka functionality and could be affected, which I don't think this is. For those, the predefined variable quorum.all_non_upgrade exists and is equal to [quorum.remote_kraft, quorum.zk]. Also, given how long these tests run, I didn't want to run more variants than is necessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks!


@cluster(num_nodes=8)
def test_rolling_bounces_will_not_trigger_rebalance_under_static_membership(self):
self.zookeeper.start()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this change in streams_cooperative_rebalance_upgrade_test also?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm doing that in a followup PR

@AlanConfluent
Copy link
Copy Markdown
Contributor Author

AlanConfluent commented Aug 4, 2022

In addition to addressing the review comments, can you post a link to the system test results with this change? It would be good to verify the impact before merging.

+1. I'd also love to learn how much system test time increase this one would incur.

My run with my changes is here: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5067/

I'm trying to run the same tests on trunk and see, but there are unrelated failures. Will update when I get a run.

@AlanConfluent
Copy link
Copy Markdown
Contributor Author

Thanks @AlanConfluent ! A few meta questions:

  1. Seems the streams_cooperative_rebalance_upgrade_test.py is not included in this PR?

I didn't update any of the upgrade tests just yet. I was having a hard time running against different versions of Kafka and using KRaft -- this seems like a reasonable followup, so I didn't do it yet.

  1. I think for streams_application_upgrade_test.py we should also consider enabling kraft on the servers, to make sure that kraft works when streams itself upgrade.

I agree. This will be a followup.

  1. In streams_broker_compatibility_test.py when we test for broker versions > 3.1 we should also allow it to be kraft.

Maybe this was the main issue I was running into. I thought KRaft was available in earlier versions, but saw odd failures. I'll talk to you about what it would take to get this working.

  1. This is not related to this PR, but it seems the test coverage for streams_application_ugprade_test and streams_upgrade_test have much overlaps. @vvcephei could you chime in here since you have much experience with the former class file. Could we dedup their coverage hence reduce our test time?
  1. If the current e2e time is too high, I feel maybe we can skip adding the kraft model for the following suite:
    a. shutdown_deadlock
    b. relational_smoke, since its dependency on kraft is exactly covered by smoke -- i.e. if there's an issue with relational_smoke due to kraft, then smoke itself should fail as well.
    c. named_repartition
    Good to consider. Let's evaluate after I get a successful run of trunk.

@guozhangwang
Copy link
Copy Markdown
Contributor

Re-triggered the jenkins build.

@AlanConfluent
Copy link
Copy Markdown
Contributor Author

I made the change to switch some of these tests to just run with remote_kraft to minimize test run time.

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @AlanConfluent !

@vvcephei vvcephei merged commit 481fefb into apache:trunk Aug 26, 2022
@AlanConfluent
Copy link
Copy Markdown
Contributor Author

cmccabe pushed a commit that referenced this pull request Sep 14, 2022
Migrates Streams sustem tests to either use kraft brokers or to use both kraft and zk in a testing matrix.

This skips tests which use various forms of Kafka versioning since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a followup PR.

Reviewers: Guozhang Wang <guozhang@apache.org>, John Roesler <vvcephei@apache.org>
ijuma added a commit to confluentinc/kafka that referenced this pull request Sep 21, 2022
…eptember 2022)

`Jenkinsfile` was the only conflict and we ignore the changes since
they are not relevant to the Confluent build.

* apache-github/3.3: (61 commits)
  KAFKA-14214: Introduce read-write lock to StandardAuthorizer for consistent ACL reads. (apache#12628)
  KAFKA-14243: Temporarily disable unsafe downgrade (apache#12664)
  KAFKA-14240; Validate kraft snapshot state on startup (apache#12653)
  KAFKA-14233: disable testReloadUpdatedFilesWithoutConfigChange first to fix the build (apache#12658)
  KAFKA-14238;  KRaft metadata log should not delete segment past the latest snapshot (apache#12655)
  KAFKA-14156: Built-in partitioner may create suboptimal batches (apache#12570)
  MINOR: Adds KRaft versions of most streams system tests (apache#12458)
  MINOR; Add missing li end tag (apache#12640)
  MINOR: Mention that kraft is production ready in upgrade notes (apache#12635)
  MINOR: Add upgrade note regarding the Strictly Uniform Sticky Partitioner (KIP-794)  (apache#12630)
  KAFKA-14222; KRaft's memory pool should always allocate a buffer (apache#12625)
  KAFKA-14208; Do not raise wakeup in consumer during asynchronous offset commits (apache#12626)
  KAFKA-14196; Do not continue fetching partitions awaiting auto-commit prior to revocation (apache#12603)
  KAFKA-14215; Ensure forwarded requests are applied to broker request quota (apache#12624)
  MINOR; Remove end html tag from upgrade (apache#12605)
  Remove the html end tag from upgrade.html
  KAFKA-14205; Document how to replace the disk for the KRaft Controller (apache#12597)
  KAFKA-14203 Disable snapshot generation on broker after metadata errors (apache#12596)
  KAFKA-14216: Remove ZK reference from org.apache.kafka.server.quota.ClientQuotaCallback javadoc (apache#12617)
  KAFKA-14217: app-reset-tool.html should not show --zookeeper flag that no longer exists (apache#12618)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants