KAFKA-6054: Fix upgrade path from Kafka Streams v0.10.0#4746
mjsax merged 11 commits into apache:0.10.1 from
Conversation
This is a patch for 0.10.1 only. It sets up system testing with 0.10.0 code as discussed in #4636. I will port this PR to 0.10.2 and add the corresponding system test setup. Local build and tests passed. System tests passed, too: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1533/ Any insight into why Jenkins failed?

retest this please
guozhangwang
left a comment
Thanks for the PR @mjsax! I made a pass over it.
    self.driver.stop()

    def start_all_nodes_with_0100(self):
This is a meta comment: in newer versions we should consider parameterizing this instead of writing a new function for each from/to version pair.
Agreed. I'll upgrade the code accordingly when porting the PR to other branches.
    timeout_sec=60,
    err_msg="Never saw output 'processed 100 records from topic' on" + str(node2.account))

    # start second with 0.10.0

    "streams_stderr.1": {
        "path": STDERR_FILE + ".1",
        "collect_default": True},
    "streams_log.0-1": {
Why will we have six log files per round? From the code below it seems we will only need three per round.
Yes. But we don't know which ones we get because we shuffle the order of processors for rebalancing. The ones that don't exist are just not collected, but we need to list them all :(
I am more than happy to change the code, but I don't know a better solution... Is there anything better we could do? (I still don't know all the ducktape magic.)
Although we shuffle the processes, each of them will still be bounced exactly twice, once per round, right? In this case could we just use streams_log.0 and streams_log.1 on each processor instance?
We could. I chose the current naming scheme because it encodes the rebalance order in the file names, which makes debugging easier. Otherwise, you have to extract this information from the ducktape log. Let me know if you think this simplification is worth it, or if you prefer removing the counter and reducing the log files to three.
I see your point now, that makes sense. Let's keep it as is.
| "streams_stderr": { | ||
| "path": STDERR_FILE, | ||
| "collect_default": True}, | ||
| "streams_log.0": { |
Where can these log files (the ones without dashes) be generated? I can only find the ones with dashes created in rolling bounces.
    @SuppressWarnings("unchecked")
    public static void main(final String[] args) {
        String kafka = args.length > 0 ? args[0] : null;
If we expect all parameters except upgradeFrom to always be given, should we simply check args.length > 2 and error out on failure instead of setting nulls?
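A minimal sketch of the fail-fast check suggested here (class and method names are hypothetical, not taken from the PR): the three required arguments are validated up front, and only upgradeFrom stays optional.

```java
// Hypothetical sketch of the suggested fail-fast argument handling;
// names are illustrative, not from the PR.
public class ArgParseSketch {

    static String[] parseArgs(final String[] args) {
        // kafka, zookeeper, and stateDir are required; upgradeFrom is optional.
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "usage: <kafka> <zookeeper> <stateDir> [upgradeFrom]");
        }
        final String kafka = args[0];
        final String zookeeper = args[1];
        final String stateDir = args[2];
        final String upgradeFrom = args.length > 3 ? args[3] : null;
        return new String[] {kafka, zookeeper, stateDir, upgradeFrom};
    }
}
```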
    assertEquals(oldVersion.standbyTasks, decoded.standbyTasks);
    assertEquals(0, decoded.partitionsByHostState.size()); // should be empty as wasn't in V1
    assertEquals(2, decoded.version); // automatically upgraded to v2 on decode;
    assertEquals(1, decoded.version);
Is this change intentional? If yes, should we rename the test?
I think the test name is still fine -- AssignmentInfo.decode should be able to decode a version 1 assignment. Originally, it upgraded the AssignmentInfo to version 2, but I think this is actually not a good idea: if we receive a version 1 AssignmentInfo, we should return version 1 and not set version 2.
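The decode semantics described above can be illustrated with a simplified stand-in (not the real AssignmentInfo class, which also carries task assignments): the decoded object reports the wire version it was actually encoded with.

```java
// Simplified stand-in for AssignmentInfo to illustrate the decode
// semantics discussed above; the real class carries task assignments too.
public class AssignmentInfoSketch {
    public final int version;

    AssignmentInfoSketch(final int version) {
        this.version = version;
    }

    static AssignmentInfoSketch decode(final int wireVersion) {
        // Keep the received version instead of bumping it to the latest:
        // a version 1 payload decodes to a version 1 object, so callers
        // can tell what the peer actually sent.
        return new AssignmentInfoSketch(wireVersion);
    }
}
```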
    final String upgradeMode = (String) configs.get(StreamsConfig.UPGRADE_FROM_CONFIG);
    if (StreamsConfig.UPGRADE_FROM_0100.equals(upgradeMode)) {
        log.debug("Downgrading metadata version from 2 to 1 for upgrade from 0.10.0.x.");
nit: I'd suggest making it INFO, as it should be rare but important for troubleshooting.
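The version-selection logic under discussion amounts to something like this sketch, with names simplified from the PR's StreamsConfig usage (the constant value "0.10.0" matches the documented upgrade.from setting):

```java
// Sketch of choosing the rebalance metadata version based on the
// upgrade.from config; simplified from the PR, names illustrative.
public class MetadataVersionSketch {
    static final String UPGRADE_FROM_0100 = "0.10.0";
    static final int LATEST_VERSION = 2;

    static int versionToEncode(final String upgradeFrom) {
        if (UPGRADE_FROM_0100.equals(upgradeFrom)) {
            // During the first rolling bounce, encode version 1 metadata so
            // remaining 0.10.0.x instances can still parse it (logged at
            // INFO per the review suggestion).
            return 1;
        }
        return LATEST_VERSION;
    }
}
```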
    def connectPkgs = ['connect:api', 'connect:runtime', 'connect:json', 'connect:file']
    def pkgs = ['clients', 'examples', 'log4j-appender', 'tools', 'streams', 'streams:examples'] + connectPkgs
    def pkgs = ['clients', 'examples', 'log4j-appender', 'tools', 'streams', 'streams:examples', 'streams:upgrade-system-tests-0100'] + connectPkgs
We should exclude this from installAll, releaseTarGzAll, and uploadArchivesAll, otherwise it would be uploaded to Maven. I think we'd better always jar locally before the test than try to pull from a possibly polluted repo.
    from(project(':streams').configurations.runtime) { into("libs/") }
    from(project(':streams:examples').jar) { into("libs/") }
    from(project(':streams:examples').configurations.runtime) { into("libs/") }
    from(project(':streams:upgrade-system-tests-0100').jar) { into("libs/") }
Could we double check that build, install, releaseTarGz, and uploadArchives would not include this module, i.e., that it will only be built when running the system tests?
Not sure why we should exclude it from build or install? Can you elaborate?
Well, I guess it may not be too bad for build and install; it is just not necessary. Plus, install will put the jar into the mvn cache, which could be risky if we rely on always building the jar from the current branch. Again, it is for cleanliness more than correctness.
        fi
    done
    else
    for file in "$base_dir"/streams/upgrade-system-tests-0100/build/libs/kafka-streams-upgrade-system-tests*.jar;
Hmm... if we hard-code the version here, how could it be extended in newer versions with multiple possible values of UPGRADE_KAFKA_STREAMS_TEST_VERSION?
Yes. In newer versions we extend this and replace the hard-coded 0100 with $UPGRADE_KAFKA_STREAMS_TEST_VERSION -- I just omitted it here and will fix it when porting this PR to the 0.10.2 branch.
Updated this. Triggered system tests: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1556/ System tests passed.

Hmm... not so sure why

Retest this please
    (cf. <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-268%3A+Simplify+Kafka+Streams+Rebalance+Metadata+Upgrade">KIP-268</a>).
    <ul>
    <li> note: rolling bounce upgrade requires to upgrade to 0.10.1.2 (rolling bounce upgrade is not supported for upgrading from 0.10.0.x to 0.10.1.0 or 0.10.1.1) </li>
    <li> prepare your application instances for a rolling bounce and make sure that config <code>upgrade.from="0.10.0"</code> is set for new version 0.10.1.2 </li>
does this mean the config setting needs to change?
Yes. The newly started application must have this setting.
| jackson: "2.6.3", | ||
| jetty: "9.2.22.v20170606", | ||
| jersey: "2.22.2", | ||
| kafka0100: "0.10.0.1", |
super nit: can we change it to kafka_0100?
    junit: "junit:junit:$versions.junit",
    log4j: "log4j:log4j:$versions.log4j",
    joptSimple: "net.sf.jopt-simple:jopt-simple:$versions.jopt",
    kafkaStreams0100: "org.apache.kafka:kafka-streams:$versions.kafka0100",
    private static final Map<String, Object> PRODUCER_DEFAULT_OVERRIDES;
    static
    {
    static {
    public final TopicPartition partition;

    public AssignedPartition(TaskId taskId, TopicPartition partition) {
    AssignedPartition(TaskId taskId, TopicPartition partition) {
    String kafka = args[0];
    String zookeeper = args[1];
    String stateDir = args[2];
    String upgradeFrom = args.length > 3 ? args[3] : null;
| System.out.println("StreamsTest instance started (StreamsUpgradeTest v0.10.0)"); | ||
| System.out.println("kafka=" + kafka); | ||
| System.out.println("zookeeper=" + zookeeper); | ||
| System.out.println("stateDir=" + stateDir); |
    }
    String kafka = args[0];
    String zookeeper = args[1];
    String stateDir = args[2];
should we use the streams config instead of the single stateDir parameter introduced in #4714? Some others below as well.
I think for this older branch, it's easier to just stay with stateDir to avoid unnecessary refactoring -- for trunk it makes sense to change to Properties. I'll keep it in mind and fix it in the trunk PR. Ok?
@mjsax the reason the PR is failing is that the target branch (0.10.1) is missing jenkins.sh.

LGTM assuming Jenkins passes.
25d3685 to d061ff0 (force-pushed)
Thanks @ijuma!

Retest this please.

Retriggered system tests after latest updates: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1571/ Passed.

@guozhangwang @bbejeck @vvcephei I updated the docs section, too. Can you have a look? I am not sure how docs are deployed -- this should not end up on the web page immediately because we talk about a non-released version. Retest this please.
    </li>
    <li> Upgrading from 0.10.0.x to 0.10.1.0 or 0.10.1.1 requires an offline upgrade (rolling bounce upgrade is not supported)
    <ul>
    <li> note: rolling bounce upgrade is supported for upgrading from 0.10.0.x to 0.10.1.2 </li>
nit: this line is a bit verbose to me.
    (cf. <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-268%3A+Simplify+Kafka+Streams+Rebalance+Metadata+Upgrade">KIP-268</a>).
    <ul>
    <li> note: rolling bounce upgrade requires to upgrade to 0.10.1.2 (rolling bounce upgrade is not supported for upgrading from 0.10.0.x to 0.10.1.0 or 0.10.1.1) </li>
    <li> prepare your application instances for a rolling bounce and make sure that config <code>upgrade.from="0.10.0"</code> is set for new version 0.10.1.2 </li>
config upgrade.from is set to "0.10.0".
    <li> note: rolling bounce upgrade requires to upgrade to 0.10.1.2 (rolling bounce upgrade is not supported for upgrading from 0.10.0.x to 0.10.1.0 or 0.10.1.1) </li>
    <li> prepare your application instances for a rolling bounce and make sure that config <code>upgrade.from="0.10.0"</code> is set for new version 0.10.1.2 </li>
    <li> bounce each instance of your application once </li>
    <li> prepare your newly deployed 0.10.1.2 application instances for a second round of rolling bounces; make sure to remove config <code>upgrade.mode</code> </li>
to remove the value for ...
Note: Java 9 does not work on old

Updated.
vvcephei
left a comment
I asked a few questions and picked a couple of nits.
But overall, it's fine by me.
    <ul>
    <li> Upgrading from 0.10.0.x to 0.10.1.2 requires two rolling bounces with config <code>upgrade.from="0.10.0"</code> set for first upgrade phase
    (cf. <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-268%3A+Simplify+Kafka+Streams+Rebalance+Metadata+Upgrade">KIP-268</a>).
    <ul>
Seems like this should be an <ol>, with the exception of the 'note:' bullet.
I did bullet points on purpose instead of numbers. Let me know if you want to have numbers.
    <h5><a id="upgrade_1010_streams" href="#upgrade_1010_streams">Streams API changes in 0.10.1.0</a></h5>
    <ul>
    <li> Upgrading from 0.10.0.x to 0.10.1.2 requires two rolling bounces with config <code>upgrade.from="0.10.0"</code> set for first upgrade phase
An offline upgrade is also an option here, right?
        index = rand.nextInt(remaining);
    } else {
        index = rand.nextInt(numKeys);
    }
nit: using a ternary operator would let you make index final:
final int index = autoTerminate ? rand.nextInt(remaining) : rand.nextInt(numKeys);
    Runtime.getRuntime().addShutdownHook(new Thread() {
        @Override
        public void run() {
            running = false;
Sorry if this is a noob question... what's the value of this hook?
I couldn't find a usage with autoTerminate=false in which the code wasn't running in the main thread.
With an infinite loop in the main thread, the only way your hook is going to run is from a SIGTERM, which will terminate the main thread anyway, unless we trap it somewhere. In this case, the hook will set running := false after the while loop has already exited (killed via SIGTERM).
... I think
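The point above can be shown with a runnable sketch (all names hypothetical): the flag set by a shutdown hook only matters if a non-main thread polls it, since a SIGTERM ends an infinite loop on the main thread anyway. The demo below flips the flag directly to simulate the hook firing.

```java
// Hypothetical demo of the shutdown-hook-plus-flag pattern discussed
// above; not code from the PR.
public class ShutdownFlagSketch {
    static volatile boolean running = true;

    public static void main(final String[] args) throws InterruptedException {
        // A worker thread polling the flag is the only case where the
        // hook's write to `running` can actually stop a loop.
        final Thread worker = new Thread(() -> {
            long iterations = 0;
            while (running) {
                iterations++; // stand-in for the processing loop
            }
            System.out.println("worker stopped after " + iterations + " iterations");
        });
        worker.start();

        // The hook runs on SIGTERM or normal JVM exit and stops the worker.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> running = false));

        Thread.sleep(100);
        running = false; // simulate the hook firing for this demo
        worker.join();
    }
}
```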
    Runtime.getRuntime().addShutdownHook(new Thread() {
        @Override
        public void run() {
            streams.close();
might be nice to print a line before this, in case streams.close() hangs, preventing the JVM from exiting.
    monitor.wait_until('StreamsTest instance started', timeout_sec=15, err_msg="Never saw message indicating StreamsTest finished startup on " + str(node.account))

    if len(self.pids(node)) == 0:
        raise RuntimeError("No process ids recorded")
duplicate of the immediately following block?
Nice catch! I guess this happened while resolving conflicts during cherry-picking.
    timeout_sec=60,
    err_msg="Never saw output 'UPGRADE-TEST-CLIENT-CLOSED' on" + str(node.account))

    self.driver.stop()
I might have missed it... did we verify that everything is using the new metadata version after the second bounce?
Hmmm... Not really... Also not sure how to check this? Ideas? We could add additional DEBUG logs that print the version of the received AssignmentInfo. \cc @bbejeck @guozhangwang WDYT?
We only make sure that the instances did not crash and that they process data after the second rolling bounce.
I think this check would better be covered in a unit test or integration test, not a system test.
It's covered in StreamPartitionAssignorTest -- seems we are good then.
    first_other_node = first_other_processor.node
    second_other_node = second_other_processor.node

    # stop processor and wait for rebalance of others
"wait on rebalance" was not part of the instructions in the doc. Is this a necessary step?
I suppose it is the distinction between an offline upgrade and a rolling one...
Supposing that is the intent, maybe a better phrasing would be "stop processor and ensure the others continue making progress".
For users, it's not required. It's a system test thing that allows us to "track" the progress -- if we bounce instances without waiting, we introduce a race condition in the test because we don't know how many rebalances might actually be triggered: it could be a single rebalance or two, depending on how quickly the instance comes back online.
fd6ba48 to adbe6fa (force-pushed)
adbe6fa to 9db2a8f (force-pushed)
Updated this. Triggered system test: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1586/

Retest this please.

Retest this please

Retest this please

Retest this please.

LGTM. Please feel free to merge after Jenkins passes.

retest this please

Java 7 passed -- Java 8 timed out. Retest this please.

Just finished the review. LGTM, mod the failing tests ;)

Both Java 7 and Java 8 passed. Pushing one more doc typo fix and merging afterwards.

Merged to