MINOR: Adds KRaft versions of most streams system tests#12458
MINOR: Adds KRaft versions of most streams system tests#12458vvcephei merged 6 commits intoapache:trunkfrom
Conversation
vvcephei
left a comment
There was a problem hiding this comment.
Thanks, @AlanConfluent !
It's great to see us tightening the knobs on KRaft.
In addition to addressing the review comments, can you post a link to the system test results with this change? It would be good to verify the impact before merging.
Thanks!
-John
| @matrix(failure_mode=["clean_shutdown", "hard_shutdown", "clean_bounce", "hard_bounce"], | ||
| broker_type=["leader", "controller"], | ||
| num_threads=[1, 3], | ||
| sleep_time_secs=[120]) |
There was a problem hiding this comment.
What do you think about explicitly specifying quorum.zk instead of relying on the default? It's maybe mildly more verbose, but it avoids having to think through overrides when you're debugging a test failure out of the blue.
If you're in favor, I'd suggest just removing all the default parameter values that got added here to be sure everything is explicitly configured.
There was a problem hiding this comment.
I added this extra matrix configuration to avoid broker_type=="controller" for metadata_quorum=="quorum.remote_kraft"-- not sure if you noticed that, otherwise I would have just added metadata_quorum=quorum.all_non_upgrade to the existing matrix to cover both modes as you said. This is because killing the controller when it's separate from the broker is maybe not what the test is attempting to test and also requires other changes to accommodate killing off the separate KRaft controller node, but that seems a bit overkill for this purpose.
Is that what
There was a problem hiding this comment.
Thanks, @AlanConfluent . I was actually making a slightly different suggestion, namely to explicitly include metadata_quorum=[quorum.zk] in this second matrix config, and to remove the default parameter value metadata_quorum=quorum.zk from the method signature.
Github charmingly won't let me do a suggestion to make this clearer because the diff includes a deleted line.
| self.wait_for_verification(processor_1, "ACTIVE_TASKS:2 STANDBY_TASKS:2", processor_1.STDOUT_FILE) | ||
| self.wait_for_verification(processor_2, "ACTIVE_TASKS:2 STANDBY_TASKS:2", processor_2.STDOUT_FILE) | ||
| self.wait_for_verification(processor_3, "ACTIVE_TASKS:2 STANDBY_TASKS:2", processor_3.STDOUT_FILE) |
There was a problem hiding this comment.
What's up with this verification change? It looks like you're increasing the specificity of the verification here, which is good. I'm wondering if it has some relationship to the quorum change, of if it's just on the side.
There was a problem hiding this comment.
Ahh, I know how this change got made. I had made my changes in another repo locally and copied the file over to this one... Must have been made by another commit in the original repo. Will revert this and make sure there are no similar changes elsewhere.
+1. I'd also love to learn how much system test time increase this one would incur. |
guozhangwang
left a comment
There was a problem hiding this comment.
Thanks @AlanConfluent ! A few meta questions:
- Seems the
streams_cooperative_rebalance_upgrade_test.pyis not included in this PR? - I think for
streams_application_upgrade_test.pywe should also consider enabling kraft on the servers, to make sure that kraft works when streams itself upgrade. - In
streams_broker_compatibility_test.pywhen we test for broker versions > 3.1 we should also allow it to be kraft. - This is not related to this PR, but it seems the test coverage for
streams_application_ugprade_testandstreams_upgrade_testhave much overlaps. @vvcephei could you chime in here since you have much experience with the former class file. Could we dedup their coverage hence reduce our test time? - If the current e2e time is too high, I feel maybe we can skip adding the kraft model for the following suite:
a. shutdown_deadlock
b. relational_smoke, since its dependency on kraft is exactly covered by smoke -- i.e. if there's an issue with relational_smoke due to kraft, then smoke itself should fail as well.
c. named_repartition
| def test_many_brokers_bounce(self, failure_mode, num_failures): | ||
| num_failures=[2], | ||
| metadata_quorum=quorum.all_non_upgrade) | ||
| def test_many_brokers_bounce(self, failure_mode, num_failures, metadata_quorum=quorum.zk): |
There was a problem hiding this comment.
nit: since we already set the value range of metadata_quorum in the matrix, do we still need to set its default as quorum.zk? Seems the default value would never be used? Ditto elsewhere.
There was a problem hiding this comment.
Ah just saw @vvcephei 's comment earlier, that makes sense. Please feel free to ignore.
There was a problem hiding this comment.
Yeah, I think if it's not defined, it'll default to using zk, but I just wanted to be a bit more explicit about the default.
| broker_type=["leader"], | ||
| num_threads=[1, 3], | ||
| sleep_time_secs=[120], | ||
| metadata_quorum=[quorum.remote_kraft]) |
There was a problem hiding this comment.
Why we only want to test remote_kraft but not collocated kraft?
There was a problem hiding this comment.
That's what was suggested for tests which are not testing core Kafka functionality and could be affected, which I don't think this is. For those, the predefined variable quorum.all_non_upgrade exists and is equal to [quorum.remote_kraft, quorum.zk]. Also, given how long these tests run, I didn't want to run more variants than is necessary.
|
|
||
| @cluster(num_nodes=8) | ||
| def test_rolling_bounces_will_not_trigger_rebalance_under_static_membership(self): | ||
| self.zookeeper.start() |
There was a problem hiding this comment.
Should we do this change in streams_cooperative_rebalance_upgrade_test also?
There was a problem hiding this comment.
I'm doing that in a followup PR
My run with my changes is here: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5067/ I'm trying to run the same tests on trunk and see, but there are unrelated failures. Will update when I get a run. |
I didn't update any of the upgrade tests just yet. I was having a hard time running against different versions of Kafka and using KRaft -- this seems like a reasonable followup, so I didn't do it yet.
I agree. This will be a followup.
Maybe this was the main issue I was running into. I thought KRaft was available in earlier versions, but saw odd failures. I'll talk to you about what it would take to get this working.
|
|
Re-triggered the jenkins build. |
|
I made the change to switch some of these tests to just run with remote_kraft to minimize test run time. |
vvcephei
left a comment
There was a problem hiding this comment.
Thanks, @AlanConfluent !
|
I think this was the ultimate run which covered everything under streams: |
Migrates Streams sustem tests to either use kraft brokers or to use both kraft and zk in a testing matrix. This skips tests which use various forms of Kafka versioning since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a followup PR. Reviewers: Guozhang Wang <guozhang@apache.org>, John Roesler <vvcephei@apache.org>
…eptember 2022) `Jenkinsfile` was the only conflict and we ignore the changes since they are not relevant to the Confluent build. * apache-github/3.3: (61 commits) KAFKA-14214: Introduce read-write lock to StandardAuthorizer for consistent ACL reads. (apache#12628) KAFKA-14243: Temporarily disable unsafe downgrade (apache#12664) KAFKA-14240; Validate kraft snapshot state on startup (apache#12653) KAFKA-14233: disable testReloadUpdatedFilesWithoutConfigChange first to fix the build (apache#12658) KAFKA-14238; KRaft metadata log should not delete segment past the latest snapshot (apache#12655) KAFKA-14156: Built-in partitioner may create suboptimal batches (apache#12570) MINOR: Adds KRaft versions of most streams system tests (apache#12458) MINOR; Add missing li end tag (apache#12640) MINOR: Mention that kraft is production ready in upgrade notes (apache#12635) MINOR: Add upgrade note regarding the Strictly Uniform Sticky Partitioner (KIP-794) (apache#12630) KAFKA-14222; KRaft's memory pool should always allocate a buffer (apache#12625) KAFKA-14208; Do not raise wakeup in consumer during asynchronous offset commits (apache#12626) KAFKA-14196; Do not continue fetching partitions awaiting auto-commit prior to revocation (apache#12603) KAFKA-14215; Ensure forwarded requests are applied to broker request quota (apache#12624) MINOR; Remove end html tag from upgrade (apache#12605) Remove the html end tag from upgrade.html KAFKA-14205; Document how to replace the disk for the KRaft Controller (apache#12597) KAFKA-14203 Disable snapshot generation on broker after metadata errors (apache#12596) KAFKA-14216: Remove ZK reference from org.apache.kafka.server.quota.ClientQuotaCallback javadoc (apache#12617) KAFKA-14217: app-reset-tool.html should not show --zookeeper flag that no longer exists (apache#12618) ...
Adds the annotation
@matrix(metadata_quorum=quorum.all_non_upgrade)to many existing tests, which runs them with each of zookeeper and remote_kraft nodes.This skips tests which use various forms of Kafka versioning since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a followup PR.
Committer Checklist (excluded from commit message)