Skip to content

MINOR: onControllerResignation should be invoked if triggerControllerMove is called#2935

Closed
ijuma wants to merge 6 commits intoapache:trunkfrom
ijuma:on-controller-resignation-if-trigger-controller-move
Closed

MINOR: onControllerResignation should be invoked if triggerControllerMove is called#2935
ijuma wants to merge 6 commits intoapache:trunkfrom
ijuma:on-controller-resignation-if-trigger-controller-move

Conversation

@ijuma
Copy link
Copy Markdown
Member

@ijuma ijuma commented Apr 28, 2017

Also update the test to be simpler since we can use a mock event to simulate the issue
more easily (thanks Jun for the suggestion). This should fix two issues:

  1. A transient test failure due to a NPE in ControllerFailoverTest.testMetadataUpdate:
Caused by: java.lang.NullPointerException
	at kafka.controller.ControllerBrokerRequestBatch.addUpdateMetadataRequestForBrokers(ControllerChannelManager.scala:338)
	at kafka.controller.KafkaController.sendUpdateMetadataRequest(KafkaController.scala:975)
	at kafka.controller.ControllerFailoverTest.testMetadataUpdate(ControllerFailoverTest.scala:141)

The test was creating an additional thread and it does not seem like it was doing the
appropriate synchronization (perhaps this became more of an issue after we changed
the Controller to be single-threaded and changed the locking)

  1. Setting activeControllerId.set(-1) in triggerControllerMove causes Reelect not to invoke onControllerResignation. Among other things, this causes an IllegalStateException to be thrown when KafkaScheduler.startup is invoked for the second time without the corresponding shutdown. We now simply call onControllerResignation as part of triggerControllerMove.

Finally, I included a few clean-ups:

  1. No longer update the broker state in onControllerFailover. This is no longer needed
    since we removed the RunningAsController state (KAFKA-3761).
  2. Trivial clean-ups in KafkaController
  3. Removed unused parameter in ZkUtils.getPartitionLeaderAndIsrForTopics

@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented Apr 28, 2017

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3277/
Test FAILed (JDK 7 and Scala 2.10).

@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented Apr 28, 2017

Review by @onurkaraman and @junrao.

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3282/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 28, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3273/
Test PASSed (JDK 8 and Scala 2.12).

@onurkaraman
Copy link
Copy Markdown
Contributor

I think it would be worth figuring out whether the transiently failing tests were actually coming from bb663d0 and exactly why (as in, what in that patch caused the change).

The triggerControllerMove method introduced in bb663d0 is literally the former ZookeeperLeaderElector.resign but with a different method name. Here's the method in 0.10.2: https://github.com/apache/kafka/blob/0.10.2/core/src/main/scala/kafka/server/ZookeeperLeaderElector.scala#L109 for comparison. I just replaced all 4 calls of ZookeeperLeaderElector.resign with triggerControllerMove.

So there's no change stemming from triggerControllerMove. Maybe, as you said, a change happened in Reelect or ControllerChange, but this would be worth figuring out.

@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented Apr 28, 2017

@onurkaraman, thanks, I was hoping you'd be able to track down if it was indeed bb663d0. It looks like the underlying issue was there too:

[2017-04-28 16:14:51,760] ERROR Error while electing or becoming leader on broker 0 (kafka.server.ZookeeperLeaderElector:105)
java.lang.IllegalStateException: This scheduler has already been started!
	at kafka.utils.KafkaScheduler.startup(KafkaScheduler.scala:78)
	at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:351)
	at kafka.controller.KafkaController.$anonfun$controllerElector$1(KafkaController.scala:160)
	at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:85)
	at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.$anonfun$handleDataDeleted$3(ZookeeperLeaderElector.scala:154)

The difference is that the NPE didn't seem to happen. Since the NPE requires a particular event ordering, it seems plausible that making it single threaded made this more likely than before.

@onurkaraman
Copy link
Copy Markdown
Contributor

Interesting. I ran the test 60 times and they all passed.

> for i in {1..60}; do ./gradlew cleanTest core:test -Dtest.single=ControllerFailoverTest; done

I ran it against my local trunk which just went up to commit bb663d0 (KAFKA-5028).

@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented May 1, 2017

retest this please

@asfbot
Copy link
Copy Markdown

asfbot commented May 1, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3321/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot
Copy link
Copy Markdown

asfbot commented May 1, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3326/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented May 1, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3317/
Test FAILed (JDK 8 and Scala 2.12).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if removing this line is the right move, as it will cause events to be processed by the controller even after triggering the controller move.

triggerControllerMove now just deletes the /controller znode. The Reelect corresponding to the /controller znode deletion will come behind events already in the queue. Since the PR removes the activeControllerId.set(-1) line, the controller will still think it's active and will process the queued events in between the current event and the Reelect event.

When I first saw the PR title, I thought you'd simply add a line calling onControllerResignation within triggerControllerMove. I think this makes more sense.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding onControllerResignation in that method first, but that causes another exception to be thrown. It may be that this test needs be rethought.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that perhaps we should improve the test in ControllerFailoverTest. My understanding is that the test tries to simulate a case that the controller gets into an illegal state and tries to see if the controller can recover from it. But the way that it simulates the illegal state is quite involved. Perhaps we can do the following. (1) Consolidate all try/catch illegalStateException to ControllerEventThread.doWork(). (2) Change the test by inserting a MockEvent type that throws illegalStateException.

@onurkaraman
Copy link
Copy Markdown
Contributor

By the way, your latest jenkins run still shows an NPE in the exact same spot in ControllerFailoverTest.testMetadataUpdate:
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3321/testReport/junit/kafka.controller/ControllerFailoverTest/testMetadataUpdate/

I took a peek at the test itself. It's quite complicated. Do we know what it's actually trying to test?

@ijuma ijuma force-pushed the on-controller-resignation-if-trigger-controller-move branch from 241b989 to 919901e Compare May 2, 2017 00:12
@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented May 2, 2017

@onurkaraman, yes, I noticed that the latest Jenkins run triggered that NPE again. I looked a bit more and it seems that the test is racy as it calls non thread safe from a different thread. That may be the source of the NPE. So, more than one issue in play.

@asfbot
Copy link
Copy Markdown

asfbot commented May 2, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3337/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot
Copy link
Copy Markdown

asfbot commented May 2, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3346/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented May 2, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3341/
Test PASSed (JDK 7 and Scala 2.10).

@ijuma ijuma force-pushed the on-controller-resignation-if-trigger-controller-move branch from 919901e to c188e61 Compare May 25, 2017 14:12
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this should be in the try/catch since it can throw an IllegalStateException and it seems like updateLeaderEpoch doesn't do anything with this field. But please double-check.

@ijuma ijuma force-pushed the on-controller-resignation-if-trigger-controller-move branch 2 times, most recently from 6e922ca to 8e5d282 Compare May 25, 2017 14:18
@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented May 25, 2017

@junrao @onurkaraman, here's another attempt that uses a mock event as suggested by Jun. I didn't move the IllegalStateException handling into ControllerEventManager because it also clears some state that the event manager doesn't have access to. I did extract a method, which seems like a reasonable compromise. Let me know what you think.

@asfbot
Copy link
Copy Markdown

asfbot commented May 25, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4391/
Test FAILed (JDK 7 and Scala 2.11).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these should be waitUntilTrue as they happen after onControllerResigned

@asfbot
Copy link
Copy Markdown

asfbot commented May 25, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4377/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot
Copy link
Copy Markdown

asfbot commented May 25, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4401/
Test FAILed (JDK 7 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented May 25, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4387/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot
Copy link
Copy Markdown

asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4413/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4399/
Test PASSed (JDK 8 and Scala 2.12).

@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented May 26, 2017

retest this please

Tests passed, but running them again, just in case.

@asfbot
Copy link
Copy Markdown

asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4418/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot
Copy link
Copy Markdown

asfbot commented May 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4432/
Test PASSed (JDK 7 and Scala 2.11).

Copy link
Copy Markdown
Contributor

@junrao junrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijuma : Thanks for the patch. LGTM. Just a few minor comments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing license header

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion probably doesn't need to be in waitUntilTrue()?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we don't really need epochMap. previousEpoch could be obtained from the previous controller. If a broker is not a controller, the controller epoch is not valid.

@ijuma ijuma force-pushed the on-controller-resignation-if-trigger-controller-move branch from fa7405f to e603ee9 Compare May 27, 2017 11:27
@ijuma
Copy link
Copy Markdown
Member Author

ijuma commented May 27, 2017

@junrao, I addressed your feedback and fixed a trivial merge conflict.

@asfbot
Copy link
Copy Markdown

asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4487/
Test PASSed (JDK 7 and Scala 2.11).

@asfbot
Copy link
Copy Markdown

asfbot commented May 27, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4473/
Test FAILed (JDK 8 and Scala 2.12).

@junrao
Copy link
Copy Markdown
Contributor

junrao commented May 30, 2017

@ijuma : Thanks for the latest patch. LGTM

asfgit pushed a commit that referenced this pull request May 30, 2017
…Move is called

Also update the test to be simpler since we can use a mock event to simulate the issue
more easily (thanks Jun for the suggestion). This should fix two issues:

1. A transient test failure due to a NPE in ControllerFailoverTest.testMetadataUpdate:

```text
Caused by: java.lang.NullPointerException
	at kafka.controller.ControllerBrokerRequestBatch.addUpdateMetadataRequestForBrokers(ControllerChannelManager.scala:338)
	at kafka.controller.KafkaController.sendUpdateMetadataRequest(KafkaController.scala:975)
	at kafka.controller.ControllerFailoverTest.testMetadataUpdate(ControllerFailoverTest.scala:141)
```

The test was creating an additional thread and it does not seem like it was doing the
appropriate synchronization (perhaps this became more of an issue after we changed
the Controller to be single-threaded and changed the locking)

2. Setting `activeControllerId.set(-1)` in `triggerControllerMove` causes `Reelect` not to invoke `onControllerResignation`. Among other things, this causes an `IllegalStateException` to be thrown when `KafkaScheduler.startup` is invoked for the second time without the corresponding `shutdown`. We now simply call `onControllerResignation` as part of `triggerControllerMove`.

Finally, I included a few clean-ups:

1. No longer update the broker state in `onControllerFailover`. This is no longer needed
since we removed the `RunningAsController` state (KAFKA-3761).
2. Trivial clean-ups in KafkaController
3. Removed unused parameter in `ZkUtils.getPartitionLeaderAndIsrForTopics`

Author: Ismael Juma <ismael@juma.me.uk>

Reviewers: Jun Rao <junrao@gmail.com>

Closes #2935 from ijuma/on-controller-resignation-if-trigger-controller-move

(cherry picked from commit 6021618)
Signed-off-by: Jun Rao <junrao@gmail.com>
@asfgit asfgit closed this in 6021618 May 31, 2017
@ijuma ijuma deleted the on-controller-resignation-if-trigger-controller-move branch June 18, 2017 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants