
KAFKA-8058: Fix ConnectClusterStateImpl.connectors() method #6384

Merged
rhauch merged 9 commits into apache:trunk from C0urante:kafka-8058
Apr 7, 2019

Conversation

@C0urante
Contributor

@C0urante C0urante commented Mar 7, 2019

This makes the ConnectClusterStateImpl.connectors() method synchronous, whereas before it was implicitly asynchronous with no way to tell whether it had completed or not.
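The shape of that change can be sketched in plain Java (illustrative names only; Connect's actual implementation uses its Herder and FutureCallback classes): the previously fire-and-forget call is bridged into a future that the caller blocks on with a timeout, so there is no longer a window where the method has returned but the result is incomplete.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiConsumer;

public class SyncBridge {
    // Stand-in for the asynchronous herder call: invokes the callback on another thread.
    static void connectorsAsync(BiConsumer<Throwable, List<String>> callback) {
        CompletableFuture.runAsync(() -> callback.accept(null, List.of("file-source", "jdbc-sink")));
    }

    // Synchronous wrapper: bridge the callback into a future and block with a timeout,
    // so callers can no longer observe a partially completed result.
    static List<String> connectors(long timeoutMs) {
        CompletableFuture<List<String>> future = new CompletableFuture<>();
        connectorsAsync((error, result) -> {
            if (error != null) {
                future.completeExceptionally(error);
            } else {
                future.complete(result);
            }
        });
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new RuntimeException("Failed to retrieve connector names", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(connectors(10_000)); // prints [file-source, jdbc-sink]
    }
}
```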

More detail can be found in the Jira.

Tested manually. A unit test has also been added for the ConnectClusterStateImpl class.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante
Contributor Author

C0urante commented Mar 7, 2019

@mageshn @wicknicks @rayokota could you guys take a look when you have a chance? 😃

Contributor

@wicknicks wicknicks left a comment

good catch, @C0urante. one comment.

Contributor

@mageshn mageshn left a comment

Nice catch @C0urante. LGTM.

Contributor

@rayokota rayokota left a comment

LGTM

Contributor

@wicknicks wicknicks left a comment

Thanks, @C0urante. LGTM.

@C0urante
Contributor Author

@rhauch @ewencp Hi guys! Got a minor bug fix for you; been reviewed and approved by three contributors and seems ready to be looked at by a committer or two. Would appreciate your review when you get the chance.

Contributor

@kkonstantine kkonstantine left a comment

Nice find, @C0urante!
I think using FutureCallback already allows you to do what you've implemented with the countdown latch.
I also have a question about whether we need to forward this request to the leader, as we do elsewhere.

@kkonstantine
Contributor

Forgot to mention that a unit test would be nice (for the whole class, even?), since it seems this never worked because it returned immediately.

@C0urante
Contributor Author

@kkonstantine I think I've addressed your two inline comments, and I've added a unit test for the ConnectClusterStateImpl.connectors() method. The other method (connectorHealth(...)) is synchronous and doesn't really look like it warrants an additional test.

As far as the ten-second timeout goes, I'm open to suggestions but it doesn't appear that rebalances or forwarding to master should be a factor.

Contributor

@kkonstantine kkonstantine left a comment

@C0urante this looks even better now IMO and on par with the rest of the code.
One nit and one more suggested improvement on the tests. Thanks for including those btw. Definitely on the right track.

@C0urante
Contributor Author

@kkonstantine thanks for the review! I've taken your comments into account and pushed the next iteration, which consists of changes to how the ConnectClusterStateImpl is tested. Hope this looks good and addresses your review remarks, but will be happy to keep iterating if not or if there's anything else that needs work.

Contributor

@kkonstantine kkonstantine left a comment

LGTM!

Contributor

@rhauch rhauch left a comment

Nice find, @C0urante! One really minor suggestion for a slight improvement in readability, but not a big deal.

This has never worked, right? This PR should be relatively easy to backport.

@C0urante
Contributor Author

@rhauch yep, this bug has been present since REST extensions were added to Connect. I've added your suggested change to the PR, let me know if there are any other alterations you'd like.

Contributor

@rhauch rhauch left a comment

Thanks, @C0urante! LGTM.

Contributor

@ewencp ewencp left a comment

Mostly looks good, one comment. This is the sort of thing that should come with a list of backports -- the bug appears to already be tagged with at least some info about previous versions that are also affected, so we probably don't want to close without backporting.

Regarding timeout, it should really be as conservative as possible without being longer than the minimum of (rebalance timeout, default REST API request timeout).

@C0urante
Contributor Author

@ewencp can you shed some light on why we'd want to take rebalances into account for the timeout? I could be wrong, but I don't think a rebalance would affect the latency of a call to Herder.connectors(...).

@ewencp
Contributor

ewencp commented Mar 21, 2019

Calling connectors() for DistributedHerder adds a request to the request queue; requests are processed serially in the thread that manages cluster membership so they get a consistent view of the Herder's state. Pretty much all the requests follow that pattern, though it's true that some could potentially be optimized to not require it if some other locking were put in place for certain parts of the state.

The thread that processes the requests is also the one managing the member and calling poll() on it, which can block on ensuring membership in the group, which means request processing latency can be affected by rebalances.
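That serialization can be sketched with a single-threaded executor (illustrative names only, not Connect's actual classes): any blocking membership work queued ahead of a request delays that request's completion, which is why the request timeout has to account for rebalances.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: requests and membership management share one thread, so a blocking
// poll()/rebalance delays every request queued behind it.
public class SerialHerder {
    final ExecutorService tickThread = Executors.newSingleThreadExecutor();

    // Simulates a request handled on the membership thread: the (simulated)
    // rebalance runs first, then the request is answered.
    Future<List<String>> connectors(long simulatedRebalanceMs) {
        return tickThread.submit(() -> {
            Thread.sleep(simulatedRebalanceMs); // stand-in for a blocking poll()
            return List.of("file-source", "jdbc-sink");
        });
    }

    void shutdown() {
        tickThread.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SerialHerder herder = new SerialHerder();
        long start = System.nanoTime();
        List<String> names = herder.connectors(200).get(10, TimeUnit.SECONDS);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The observed latency includes the simulated rebalance (>= 200 ms).
        System.out.println(names + " after ~" + elapsedMs + " ms");
        herder.shutdown();
    }
}
```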

@C0urante
Contributor Author

@ewencp thanks for the clarification; I did not realize that membership and request handling were taken care of with the same thread. I've taken your suggestion literally and just made the timeout the minimum of the timeout for the /connectors resource and the rebalance timeout (if running in distributed mode). LMK how this looks!

@C0urante
Contributor Author

C0urante commented Mar 25, 2019

BTW, I think we want to backport this to 2.0, which according to the KIP is when Connect REST extensions were introduced; AFAICT the code for the ConnectClusterStateImpl class hasn't been modified since.

Contributor

@ewencp ewencp left a comment

One comment that we could follow up on, but I'm fine merging as is.

        config, ConnectRestExtension.class);

        long herderRequestTimeoutMs = ConnectorsResource.REQUEST_TIMEOUT_MS;
        if (config instanceof DistributedConfig) {
Contributor

nit: I don't love instanceof solutions. It might be better to generalize the rebalance timeout to support a small/zero value for StandaloneConfig.

I wouldn't block merging on this, but this is an anti-pattern, so if there is a cleaner solution, it might be worth exploring.

Contributor

@C0urante, what do you think about doing something like the following instead of the instanceof block?

        Integer rebalanceTimeoutMs = config.getInt(DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG);
        if (rebalanceTimeoutMs != null) {
            herderRequestTimeoutMs = Math.min(herderRequestTimeoutMs, rebalanceTimeoutMs.longValue());
        }

This still uses DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG, but otherwise simply checks for the existence of that property to determine whether herderRequestTimeoutMs should be set to a new value.
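As a quick illustration of that behavior (plain Java; the null check is what makes the same code safe for standalone configs, where the rebalance property is simply absent):

```java
public class TimeoutSelection {
    // rebalanceTimeoutMs is null when the config has no rebalance.timeout.ms
    // property (standalone mode); otherwise the smaller of the two timeouts wins.
    static long effectiveTimeoutMs(long defaultMs, Integer rebalanceTimeoutMs) {
        long herderRequestTimeoutMs = defaultMs;
        if (rebalanceTimeoutMs != null) {
            herderRequestTimeoutMs = Math.min(herderRequestTimeoutMs, rebalanceTimeoutMs.longValue());
        }
        return herderRequestTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeoutMs(90_000L, null));   // standalone: prints 90000
        System.out.println(effectiveTimeoutMs(90_000L, 60_000)); // distributed: prints 60000
    }
}
```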

Contributor Author

@rhauch sgtm! I've added your code snippet verbatim to the PR

@rhauch rhauch merged commit 71e721f into apache:trunk Apr 7, 2019
rhauch pushed a commit that referenced this pull request Apr 7, 2019
Fixed the ConnectClusterStateImpl.connectors() method to throw an exception on timeout. Added unit test.

Author: Chris Egerton <chrise@confluent.io>

Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Robert Yokota <rayokota@gmail.com>, Arjun Satish <wicknicks@users.noreply.github.com>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6384 from C0urante:kafka-8058
@C0urante C0urante deleted the kafka-8058 branch April 7, 2019 19:13
jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019