Skip to content

KAFKA-15093: Add 3.4 and 3.5 Streams upgrade system tests#13860

Closed
mimaison wants to merge 2 commits intoapache:trunkfrom
mimaison:KAFKA-15093-streams
Closed

KAFKA-15093: Add 3.4 and 3.5 Streams upgrade system tests#13860
mimaison wants to merge 2 commits intoapache:trunkfrom
mimaison:KAFKA-15093-streams

Conversation

@mimaison
Copy link
Copy Markdown
Member

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@clolov clolov self-requested a review June 21, 2023 16:06
@mimaison
Copy link
Copy Markdown
Member Author

I copied what was done in previous releases. I'm unsure if any APIs have changed or what to really look for.
@mjsax @vvcephei Can you take a look?

Copy link
Copy Markdown
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It's really just a c&p (view minor updates required).

We should trigger a system test run before we merge to see it its working as expected.

@Override
public void init(final ProcessorContext<Void, Void> context) {
super.init(context);
System.out.println("[3.3] initializing processor: topic=" + topic + " taskId=" + context.taskId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
System.out.println("[3.3] initializing processor: topic=" + topic + " taskId=" + context.taskId());
System.out.println("[3.4] initializing processor: topic=" + topic + " taskId=" + context.taskId());


final Properties streamsProperties = Utils.loadProps(propFileName);

System.out.println("StreamsTest instance started (StreamsUpgradeTest v3.3)");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
System.out.println("StreamsTest instance started (StreamsUpgradeTest v3.3)");
System.out.println("StreamsTest instance started (StreamsUpgradeTest v3.4)");


@Override
public void init(final ProcessorContext<KOut, VOut> context) {
System.out.println("[3.3] initializing processor: topic=" + topic + "taskId=" + context.taskId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
System.out.println("[3.3] initializing processor: topic=" + topic + "taskId=" + context.taskId());
System.out.println("[3.4] initializing processor: topic=" + topic + "taskId=" + context.taskId());

@Override
public void init(final ProcessorContext<Void, Void> context) {
super.init(context);
System.out.println("[3.3] initializing processor: topic=" + topic + " taskId=" + context.taskId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
System.out.println("[3.3] initializing processor: topic=" + topic + " taskId=" + context.taskId());
System.out.println("[3.5] initializing processor: topic=" + topic + " taskId=" + context.taskId());


final Properties streamsProperties = Utils.loadProps(propFileName);

System.out.println("StreamsTest instance started (StreamsUpgradeTest v3.3)");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
System.out.println("StreamsTest instance started (StreamsUpgradeTest v3.3)");
System.out.println("StreamsTest instance started (StreamsUpgradeTest v3.5)");


@Override
public void init(final ProcessorContext<KOut, VOut> context) {
System.out.println("[3.3] initializing processor: topic=" + topic + "taskId=" + context.taskId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
System.out.println("[3.3] initializing processor: topic=" + topic + "taskId=" + context.taskId());
System.out.println("[3.5] initializing processor: topic=" + topic + "taskId=" + context.taskId());

@mimaison
Copy link
Copy Markdown
Member Author

@vvcephei
Copy link
Copy Markdown
Contributor

Good catch, @mimaison .

It looks like we could keep adding "upgrade from" versions on each release, but unless there's a compatibility issue, we don't need to. If you look at git blame, you can see that, in the past, we didn't add those cases one at a time, but in chunks when we had a compatibility problem to manage.

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Jul 26, 2023

Thanks for confirming @vvcephei -- as a matter of fact, we released KP-904 with 3.5, so it seems to be an issue. Let me do a PR to address this and also add a test to get a guard in place.

@mimaison
Copy link
Copy Markdown
Member Author

Are there compatibility issues with 3.4 or 3.5?

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Jul 26, 2023

With KIP-904 we changes some serialization format inside an internal repartition topic (cf https://cwiki.apache.org/confluence/display/KAFKA/KIP-904%3A+Kafka+Streams+-+Guarantee+subtractor+is+called+before+adder+if+key+has+not+changed)

To upgrade KS correctly, you need to do two rolling bounces (first one with upgrade.from = "3.4" -- or whatever version you are using), and second one with the config set back to null.

Thus, it seems, while the upgrade path for KP-904 itself works, the AssignorConfiguration you mentioned would not recognize "3.4" and thus crash Kafka Streams.

We did not get to finish the system test for this upgrade (it's a WIP PR: #13656), that I assume would have caught this issue. Guess I should have insistent to get the system test in place before the release (that is totally my fault)

So basically, if one uses KTable.groupBy().aggregate() and want to upgrade from 3.4 to 3.5, they cannot do it without resetting the application.

\cc @fqaiser94 @cadonna

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Jul 26, 2023

Did a PR: #14103

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Jul 26, 2023

I just uploaded the artifact for 3.5.1 to the S3 bucket we use for system tests -- can we add 3.5.1 in this PR right away (or even wait for 3.5.2 -- we might need a hotfix for this upgrade issue...?)

@divijvaidya
Copy link
Copy Markdown
Member

divijvaidya commented Jul 27, 2023

can we add 3.5.1 in this PR right away

I have added system tests to use 3.5.1 in #14069

@mimaison
Copy link
Copy Markdown
Member Author

System tests look better with #14103. In addition to a few flaky tests, I got 3 failures:
1.
Module: kafkatest.tests.streams.streams_broker_bounce_test Class: StreamsBrokerBounceTest Method: test_broker_type_bounce Arguments: { "broker_type": "controller", "failure_mode": "hard_shutdown", "metadata_quorum": "ZK", "num_threads": 1, "sleep_time_secs": 120 }
2.
Module: kafkatest.tests.streams.streams_broker_bounce_test Class: StreamsBrokerBounceTest Method: test_broker_type_bounce Arguments: { "broker_type": "leader", "failure_mode": "hard_shutdown", "metadata_quorum": "ZK", "num_threads": 1, "sleep_time_secs": 120 }
3.
Module: kafkatest.tests.streams.streams_upgrade_test Class: StreamsUpgradeTest Method: test_rolling_upgrade_with_2_bounces Arguments: { "from_version": "3.5.0", "to_version": "3.6.0-SNAPSHOT" }
I still need to look at the logs of all failures but at first sight 1. and 2. don't seem directly related.

@mimaison mimaison force-pushed the KAFKA-15093-streams branch from f07cf9d to 16c72d3 Compare July 28, 2023 13:36
Copy link
Copy Markdown
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mimaison
Copy link
Copy Markdown
Member Author

mimaison commented Aug 8, 2023

This depends on #14103 so we need to get that merged first

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Aug 18, 2023

@mimaison Merged the hotfix PR.

@mimaison mimaison force-pushed the KAFKA-15093-streams branch from 16c72d3 to a770085 Compare August 21, 2023 09:58
@mimaison
Copy link
Copy Markdown
Member Author

I rebased on top of trunk and reran the system tests in our CI. Many of the Streams upgrade system tests are flaky but I've seen all of them pass at least once apart from kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=3_5_1_to_version=3_6_0-SNAPSHOT. This one consistently failed across ~10 runs.

From my limited Streams understanding I don't see any obvious issues in the logs. @mjsax can you take a look and see if it's just really flaky or if there's a real issue. Is is passing in your CI? I've uploaded the logs from the last run to my Apache home directory: https://home.apache.org/~mimaison/streams-upgrade-failure.zip

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Sep 13, 2023

We also observed failing tests -- it's on our TODO list but might take some time until we get to it :( -- I'll keep you posted.

@mimaison
Copy link
Copy Markdown
Member Author

mimaison commented Oct 3, 2023

@mjsax These missing tests are starting to pile up. We're already missing them for 3.4 and 3.5, and 3.6 will be out shortly. If we were to break compatibility in Streams how would we notice it if we don't have these tests? Actually do we know whether the failures are caused by the tests or by issues in Streams? Figuring this out seems relatively high priority for me.

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 5, 2023

I identified one fundamental problem in 3.3 release -- a bug was introduced that fundamentally breaks KS systems tests using that version. This PR (fa03244) changes ProcessorParameter and allows this.processorSupplier to be null what breaks toString() method on the same class.

It was fixed via 0de0374 but it's only included in 3.4.1 release -- while it was backported to 3.3 branch, too, we never did a bug fix release afterwards and thus the system test that pull release binaries only cannot pickup the cherry-pick.

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 5, 2023

Another issue just introduced in trunk -> #14490

@mimaison
Copy link
Copy Markdown
Member Author

Thanks @mjsax for following up. So what's the path forward? Are we able to merge this or do we need to release new bugfix versions for 3.3 and 3.4 (and 3.5?)?

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 12, 2023

I just opened a PR to fix trunk -- if this PR is merged, I can cherry-pick it to older branches, and you can rebase this PR so we can verify everything is in order and we can merge this one.

#14539

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 13, 2023

while it was backported to 3.3 branch, too, we never did a bug fix release afterwards and thus the system test that pull release binaries only cannot pickup the cherry-pick

I could not reproduce the issue any longer and did more digging -- it think I hit this issue previously because we did not bump the Dockerfile to use the latest bug fix versions. Looking into the fix for this, it's actually shipped with 3.3.2 release, so I think we are fine.

@mimaison mimaison force-pushed the KAFKA-15093-streams branch from a770085 to ff352ce Compare October 13, 2023 09:07
@mimaison
Copy link
Copy Markdown
Member Author

I rebased this PR on trunk and rerun the Streams system tests and still got a bunch of failures:

kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=0_10_0_1_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=0_10_0_1
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=0_10_1_1
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=0_10_2_2
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=0_11_0_3
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=1_0_2
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=1_1_1
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=2_0_1
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=2_1_1
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=2_2_2
kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test.test_upgrade_to_cooperative_rebalance_upgrade_from_version=2_3_1
kafkatest.tests.streams.streams_broker_down_resilience_test.test_streams_runs_with_broker_down_initially_metadata_quorum=ISOLATED_KRAFT
kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=0_10_1_1_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=0_11_0_3_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=0_10_2_2_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=1_0_2_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=1_1_1_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.test_rolling_upgrade_with_2_bounces_from_version=3_5_1_to_version=3_7_0-SNAPSHOT
kafkatest.tests.streams.streams_broker_compatibility_test.test_compatible_brokers_eos_alpha_enabled_broker_version=0_11_0_3
kafkatest.tests.streams.streams_broker_compatibility_test.test_compatible_brokers_eos_disabled_broker_version=0_11_0_3
kafkatest.tests.streams.streams_broker_compatibility_test.test_fail_fast_on_incompatible_brokers_broker_version=0_10_1_1
kafkatest.tests.streams.streams_broker_compatibility_test.test_fail_fast_on_incompatible_brokers_broker_version=0_10_2_2
kafkatest.tests.streams.streams_broker_compatibility_test.test_fail_fast_on_incompatible_brokers_if_eos_v2_enabled_broker_version=0_11_0_3

Tests can be a bit flaky in our CI but many of these seem to fail consistently. @mjsax did you get a clean run or do you see these tests failing in your CI as well?

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 17, 2023

I rebased this PR on trunk

#14539 is not merged yet... So that's expected I guess. -- We actually also just found another bug that was fixed in the meantime: #14555

Let's see if we can push 14539 over the finish line first... Will ping here again after that's done; sorry that it take so long, but we are on it. Promised.

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 18, 2023

Summary about your test results:

  • streams_upgrade_test expected to still fail because 14539 was not merged yet
  • streams_cooperative_rebalance_upgrade_test -- seems to be flaky -- after re-running them multiple times, I got each one to pass at least once locally (we should still try to stabilize this test, but I don't think it's a broker to move forward with this PR)
  • streams_broker_down_resilience_test -- this test was broken by a recent commit (fcac880) which changes a log message the test expects -- I'll add a fix to 14539 for it
  • streams_broker_compatibility_test -- all test passed locally on the first run; not sure -- if we indeed see issues on Jenkins we need to figure out why it's flaky

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 18, 2023

Updated #14539 and re-triggered a system test run. Let's hope for the best :)

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Oct 20, 2023

Merged the other PR. Btw: should we split this PR into two, one to add 3.4 and one to add 3.5? So it's easier to cherry-pick? (Adding 3.5 should only go into 3.6 branch, but adding 3.4 should got into 3.6 and 3.5 branches.)

@mimaison
Copy link
Copy Markdown
Member Author

@mjsax Good call, I've split this PR in 2:

Closing this PR

@mimaison mimaison closed this Oct 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants