KAFKA-6054: Add 'version probing' to Kafka Streams rebalance (#4636)
mjsax merged 19 commits into apache:trunk
Conversation
This is a WIP PR that also includes all changes from #4630 -- thus, it must eventually be rebased. I added a new config just for now to test the approach -- this might change after the KIP is done, but I wanted to get an initial passing system test set up (the system test also contains some commented-out code that allows the test to fail -- just FYI -- this part needs cleanup as well). I also needed to update the Docker setup to get the older jar files we need to run the test locally using Docker. We could backport this fix to 0.10.1, 0.10.2, 0.11.0, 1.0, and 1.1, too. But cherry-picking won't work, as too much code changed in between. Basically, the fix consists of setting the correct min-version in the assignment and a config that tells bouncing instances to stay on protocol version 1. So it's quite a change -- not sure if it's worth it, considering that not many people will use 0.10.0 anymore anyway (having said this, it might not even be required to fix the JIRA at all -- nevertheless, this PR gives an idea for the upcoming version change and how to write a system test for it). Please share your thoughts.
Force-pushed becd580 to af39ef9
Rebased this and cleaned up the code. This is a proper fix for the issue, including a system test. Still depends on a KIP that is WIP. Triggered 10 runs of the new system test: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1387/
Since upgrading won't be a common occurrence, should this be log.info? Just a thought -- I don't have a strong opinion either way.
EDIT: Thinking some more it's fine as debug, ignore my previous comment.
I am fine either way -- put as debug because it's usually not helpful information for the user.
Why all the log configs? I'm guessing separate ones for each version upgrade?
EDIT: Should have looked further down, each one is for each roll then?
Yes. In line https://github.com/apache/kafka/pull/4636/files#diff-21e906e43e313a665d07a8e1cd61a9c3R330 we move files around to roll them over -- it's super helpful for debugging to know which file belongs to what phase in the test.
Just playing devil's advocate here: do we want to create the internal topics ahead of time? Because if we change the SmokeTest streams application, we'll need to update these again.
We need to. We use the StreamsEosTestDriverService that expects a replication factor of 3; however, when we start up the 0.10.0.x version, it would by default create those topics with replication factor 1, and thus the test crashes later on.
Note that we pull in the 0.10.0.1 code as-is and cannot update the Streams app there to change the replication factor config to 3.
@mjsax thanks for the work on this and just one minor comment. Like you said above, we can't merge the changes to the pre-KIP-182 versions, but maybe it's worthwhile to cherry-pick the code changes to the
@bbejeck -- for old versions, we only pull in officially released artifacts. Thus, as long as there is no bug fix release for older versions, backporting does not help :(
guozhangwang
left a comment
The code change generally makes sense to me. Regarding the config names, there are some comments on the KIP itself, which we can continue discussing there.
nit: better log the latest supported version as well.
Force-pushed af39ef9 to 8f00806
Updated this. Also had a look into the system tests. One run failed. The issue is that it can happen that one instance gets only repartition topics assigned. For this case, we never get the expected output. I think there are the following options:
The last option would give us the most control. It might be the most difficult to implement, though. However, if we want more tests like this, it might be worth doing. Atm, being stuck with old client code if we want to run an older Streams version that we cannot change seems to be quite a limitation. Decoupling the test code from the library jars would be valuable. WDYT?
@mjsax I'm very much in favor of the last proposal and decoupling the library jars from the test code. I recently ran into the same difficulty with writing a rolling upgrade test against all versions and had to make some trade-offs because we can't change any test code.
I'm also in favor of the last approach. But for this specific failure case, even after we have implemented that approach, we still need to figure out how to modify the client code used in the system test suite. If we want to go further in this direction, e.g. (re-)compiling some modules from non-trunk code before the test, we likely need to modify ducktape as well. These should theoretically all be doable, but I'd rather we figure out all the details before digging into it.
Are these more log files intentional or for debugging only?
Both. Originally, I added this for debugging. But I think it's super useful to keep them. Otherwise, it's hard to inspect log/stdout/stderr files if there are multiple processors that get bounced (it's hard to tell from a single file which statements belong to which "group generation" -- rolling over the files makes debugging much easier). Thus, I would suggest keeping this change and also applying it to other system tests with a similar pattern.
Sorry for the radio silence. For the record, I also think it's a great idea to keep a copy of SmokeTest for each version that we want to test. It will free us up to alter the test scenario and simplify the test orchestration. I think we also wouldn't need to download the test artifacts anymore, which saves a little testing time.
Let's be specific on the scope of this config. For example:
Allows versions equal to or older than 1.1.0 to upgrade to newer versions, including 1.2.0, in a backward compatible way. When upgrading from other versions, this config never needs to be specified. Default is null ...
Force-pushed 8f00806 to f665293
vvcephei
left a comment
Are you ready to start collecting "final" reviews? If so, it looks good to me.
This PR is a little outdated -- I am going to do a new PR for
Force-pushed 69a851d to c7bedb3
Updated this to add a system test for the 1.1 release, increase the metadata to version 3, add version probing, and add a version probing system test. System test passed: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1671/ This should be the last PR to complete KIP-268. Call for review @guozhangwang @bbejeck @vvcephei @dguy
- don't reassign tasks of new instance but wait for second rebalance
Force-pushed d44014d to 6a077ec
Updated this. A couple of notes on what changed:
Also did some additional code cleanup.
Triggered system test (50 runs): https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1758/
bbejeck
left a comment
I took a pass, overall looks good, just have a couple of questions. The system test looks good, but I might need another look later.
 */
// TODO: currently we cannot get the full topic configurations and hence cannot allow topic configs without the prefix,
// this can be lifted once kafka.log.LogConfig is completely deprecated by org.apache.kafka.common.config.TopicConfig
@SuppressWarnings("WeakerAccess")
a question more for my information, why do we need this SuppressWarnings here?
Not strictly. However, IntelliJ shows a warning that those could be private because they are never used elsewhere. It's just to get rid of those warnings. I can also revert if you insist.
if (streamThread.versionProbingFlag.get()) {
    streamThread.versionProbingFlag.set(false);
} else {
    taskManager.suspendTasksAndState();
I'm probably missing something and brought this up before, but above in onPartitionsAssigned we create tasks with the assignment when not version probing. But in onPartitionsRevoked, if we are version probing, we flip the version probing flag, hence on assignment we create tasks. Why don't we flip the version probing flag in onPartitionsAssigned as an else statement on line 270, so we are only ever suspending and creating tasks during non-version-probing rebalances?
onPartitionsAssigned: if the version probing flag is set, it means the assignment is empty and we want to trigger a new rebalance. If we called taskManager.createTasks(assignment); we would close suspended tasks, and that is what we do not want to do at this point, because we hope to get those tasks assigned after the second rebalance.
onPartitionsRevoked: if the version probing flag is set, we don't want to suspend tasks either. Tasks are already suspended, but if we called taskManager.suspendTasksAndState(); again, we would lose the information about currently suspended tasks (but we need to keep this information; ie, we avoid an incorrect internal metadata update here).
The flow is the following:
- trigger first rebalance
- onPartitionsRevoked -> version probing flag not set: suspend tasks regularly
- onPartitionsAssigned -> version probing flag set by StreamsPartitionAssignor: we skip task creation as we will rebalance again (we cannot reset the flag here, because we need it in the next step)
- trigger second rebalance
- onPartitionsRevoked -> version probing flag is still set; we can reset the flag and skip suspending tasks to preserve metadata
- onPartitionsAssigned -> version probing flag not set: we do regular assignment and start processing
Does this make sense?
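To make that two-rebalance flow concrete, here is a minimal, self-contained sketch. This is not the actual Kafka Streams code: the listener class and the TaskManager interface are stand-ins for illustration, showing only how a single AtomicBoolean gates suspension and task creation across both rebalances.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the real task manager; names are illustrative.
interface TaskManager {
    void suspendTasksAndState();
    void createTasks();
}

// Simplified sketch of the version-probing flag handling described above.
class RebalanceListenerSketch {
    final AtomicBoolean versionProbingFlag = new AtomicBoolean(false);
    final TaskManager taskManager;

    RebalanceListenerSketch(final TaskManager taskManager) {
        this.taskManager = taskManager;
    }

    void onPartitionsRevoked() {
        if (versionProbingFlag.get()) {
            // second rebalance: tasks are already suspended; suspending again
            // would wipe the metadata about currently suspended tasks
            versionProbingFlag.set(false);
        } else {
            taskManager.suspendTasksAndState();
        }
    }

    void onPartitionsAssigned() {
        if (versionProbingFlag.get()) {
            // first rebalance with probing: assignment is empty and a second
            // rebalance will follow, so keep suspended tasks untouched
            return;
        }
        taskManager.createTasks();
    }
}
```

Walking a probing cycle through this sketch, tasks are suspended exactly once (first revocation) and created exactly once (second assignment), which is the invariant the discussion above is after.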
yep - thanks for the clarification
case StreamsConfig.UPGRADE_FROM_0110:
case StreamsConfig.UPGRADE_FROM_10:
case StreamsConfig.UPGRADE_FROM_11:
    log.info("Downgrading metadata version from {} to 2 for upgrade from " + upgradeFrom + ".x.", SubscriptionInfo.LATEST_SUPPORTED_VERSION);
nit: use "{}.x." vs. string concatenation
"path": STDERR_FILE,
"collect_default": True},
"streams_log.1": {
    "path": LOG_FILE + ".1",
will the system test results still get errors when these files aren't found?
Yes, we need to specify all used files here -- otherwise they won't be collected after the test finishes.
found = list(p.node.account.ssh_capture("grep \"Finished assignment for group\" %s" % p.LOG_FILE, allow_fail=True))
if len(found) == self.leader_counter[p] + 1:
    self.leader = p
    self.leader_counter[p] = self.leader_counter[p] + 1
Why the + 1? Is that for the leader to kick off version probing with a future version?
If a processor is the leader, it will print "Finished assignment for group" in the log. A processor can be the leader multiple times, and thus we count how many of those lines we have seen in leader_counter[p]. To identify the new leader, its count must increase by exactly one. We increase by one as we have found one occurrence, as expected.
guozhangwang
left a comment
Made another pass on non-testing code as I've reviewed the testing code before. Left some comments.
private final static int VERSION_THREE = 3;
private final static int EARLIEST_PROBEABLE_VERSION = VERSION_THREE;
protected int minUserMetadataVersion = UNKNOWN;
protected Set<Integer> supportedVersions = new HashSet<>();
This field is only used for testing purposes?
For now, yes. However, after we bump the metadata version to 4, parts of the logic of FutureStreamsPartitionAssignor will move into StreamsPartitionAssignor, and then StreamsPartitionAssignor will need this as well.
    return assignment;
}

private Map<String, Assignment> versionProbingAssignment(final Map<UUID, ClientMetadata> clientsMetadata,
nit: empty line space.
for (final ClientMetadata clientMetadata : clientsMetadata.values()) {
    for (final String consumerId : clientMetadata.consumers) {

        final List<TaskId> activeTasks = new ArrayList<>(clientMetadata.state.prevActiveTasks());
We could skip if futureConsumers.contains(consumerId)?
if (usedSubscriptionMetadataVersion > receivedAssignmentMetadataVersion
    && receivedAssignmentMetadataVersion >= EARLIEST_PROBEABLE_VERSION) {

if (info.version() == supportedVersion) {
info.version() could be replaced with receivedAssignmentMetadataVersion
Here's my reasoning on the cases:
1. receivedAssignmentMetadataVersion > supportedVersion: this should never happen.
2. receivedAssignmentMetadataVersion == supportedVersion: normal case; the leader only knows up to supportedVersion, and hence sends this version back.
3. receivedAssignmentMetadataVersion < supportedVersion: some other consumer used a version even older than supportedVersion; in this case, this consumer will again send the subscription with supportedVersion.
So it seems we do not need to distinguish 2) and 3) since for either case, line 763 and line 770 will actually assign usedSubscriptionMetadataVersion = supportedVersion right?
Or do you just want to distinguish the log entry? If that's the case I think simplifying this to:
if (receivedAssignmentMetadataVersion > supportedVersion) {
// throw runtime exception
} else {
if (receivedAssignmentMetadataVersion == supportedVersion)
// log normally
else
// log differently
usedSubscriptionMetadataVersion = supportedVersion;
}
I think unifying the assignment to usedSubscriptionMetadataVersion into a single line is harder to read for humans.
Using
usedSubscriptionMetadataVersion = receivedAssignmentMetadataVersion
makes clear that it's a downgrade, while
usedSubscriptionMetadataVersion = supportedVersion
makes clear it is an upgrade.
I applied the other suggestions, though.
final int supportedVersion = info.latestSupportedVersion();

if (usedSubscriptionMetadataVersion > receivedAssignmentMetadataVersion
    && receivedAssignmentMetadataVersion >= EARLIEST_PROBEABLE_VERSION) {
receivedAssignmentMetadataVersion >= EARLIEST_PROBEABLE_VERSION should be guaranteed at the server side as always right? If that is true, I'd suggest we refactor it as:
if (usedSubscriptionMetadataVersion > receivedAssignmentMetadataVersion) {
if (receivedAssignmentMetadataVersion < EARLIEST_PROBEABLE_VERSION) {
// throw illegal state exception.
}
// .. below logic
}
So that we can detect potential bugs.
This code is actually correct. If I changed it as suggested, it would break the manual upgrade path from version 1/2 to version 3. For this case, the received version can be smaller than the used subscription version and smaller than EARLIEST_PROBEABLE_VERSION.
Ah, you're right. My understanding is that to manually upgrade from version 1/2 to version 3, we set the upgrade.from config accordingly, so in the first rebalance everyone uses version 1/2 in the subscriptionInfo and assignmentInfo; then in the second rebalance some members send subscriptionInfo with version 3 and some send it with version 1/2 (they have not been bounced yet), so assignmentInfo with version 1/2 is sent back again.
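The version-selection cases discussed above can be sketched as a small decision function. This is an illustrative simplification, not the actual StreamsPartitionAssignor code; the class and method names are made up for the example, and it only models the downgrade/upgrade choice a member makes after receiving an assignment.

```java
// Hypothetical sketch of the subscription-version negotiation discussed above.
final class VersionNegotiation {
    static final int EARLIEST_PROBEABLE_VERSION = 3;

    // Returns the subscription version a member should use next, given the
    // version and the latest-supported version the leader sent back.
    static int nextSubscriptionVersion(final int usedVersion,
                                       final int receivedVersion,
                                       final int leaderSupportedVersion) {
        if (receivedVersion > leaderSupportedVersion) {
            // the leader should never send a version newer than it supports
            throw new IllegalStateException("received version newer than leader supports");
        }
        if (usedVersion > receivedVersion && receivedVersion >= EARLIEST_PROBEABLE_VERSION) {
            // version probing: the leader is on an older (but probe-able)
            // version, so downgrade and rejoin with that version
            return receivedVersion;
        }
        if (usedVersion < leaderSupportedVersion) {
            // the whole group supports a newer version: upgrade
            return leaderSupportedVersion;
        }
        return usedVersion;
    }
}
```

Note how a received version below EARLIEST_PROBEABLE_VERSION does not trigger the downgrade branch, which matches the manual 1/2-to-3 upgrade path described above.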
    break;
case VERSION_THREE:
    final int latestSupportedVersionGroupLeader = info.latestSupportedVersion();
We can reuse supportedVersion in line 753 above.
Actually how about renaming that field to latestLeaderSupportedVersion?
}

private ClientState(final Set<TaskId> activeTasks,
The added prevStandbyTasks seems not set anywhere? I.e. it will always be empty hashset?
It is set in
public void addPreviousStandbyTasks(final Set<TaskId> standbyTasks) {
prevStandbyTasks.addAll(standbyTasks);
prevAssignedTasks.addAll(standbyTasks);
}
@mjsax Please feel free to merge after those comments are addressed. I have no more feedback.
Retest this please.
Upgrade system test passed: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1762/
This PR fixes some regressions introduced into streams system tests and sets the upgrade tests to ignore until PR apache#4636 is merged as it has the fixes for the upgrade tests. Reviewers: Guozhang Wang <wangguoz@gmail.com>
…4636): implements KIP-268. Reviewers: Bill Bejeck <bill@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <guozhang@confluent.io>
No description provided.