
KAFKA-8058: Fix ConnectClusterStateImpl.connectors() method #6384

Merged
rhauch merged 9 commits into apache:trunk from C0urante:kafka-8058
Apr 7, 2019

Conversation

@C0urante
Contributor

@C0urante C0urante commented Mar 7, 2019

This makes the ConnectClusterStateImpl.connectors() method synchronous, whereas before it was implicitly asynchronous with no way to tell whether it had completed or not.
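The shape of that change can be sketched in plain Java (illustrative names only; Connect's actual implementation uses its Herder and FutureCallback classes): the previously fire-and-forget call is bridged into a future that the caller blocks on with a timeout, so there is no longer a window where the method has returned but the result is incomplete.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiConsumer;

public class SyncBridge {
    // Stand-in for the asynchronous herder call: invokes the callback on another thread.
    static void connectorsAsync(BiConsumer<Throwable, List<String>> callback) {
        CompletableFuture.runAsync(() -> callback.accept(null, List.of("file-source", "jdbc-sink")));
    }

    // Synchronous wrapper: bridge the callback into a future and block with a timeout,
    // so callers can no longer observe a partially completed result.
    static List<String> connectors(long timeoutMs) {
        CompletableFuture<List<String>> future = new CompletableFuture<>();
        connectorsAsync((error, result) -> {
            if (error != null) {
                future.completeExceptionally(error);
            } else {
                future.complete(result);
            }
        });
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new RuntimeException("Failed to retrieve connector names", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(connectors(10_000)); // prints [file-source, jdbc-sink]
    }
}
```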

More detail can be found in the Jira.

Tested manually. A unit test has also been added for the ConnectClusterStateImpl class.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante
Contributor Author

C0urante commented Mar 7, 2019

@mageshn @wicknicks @rayokota could you guys take a look when you have a chance? 😃

Contributor

@wicknicks wicknicks left a comment

good catch, @C0urante. one comment.

Contributor

@mageshn mageshn left a comment

Nice catch @C0urante. LGTM.

Contributor

@rayokota rayokota left a comment

LGTM

Contributor

@wicknicks wicknicks left a comment

Thanks, @C0urante. LGTM.

@C0urante
Contributor Author

@rhauch @ewencp Hi guys! Got a minor bug fix for you; been reviewed and approved by three contributors and seems ready to be looked at by a committer or two. Would appreciate your review when you get the chance.

Contributor

@kkonstantine kkonstantine left a comment

Nice find, @C0urante!
I think using FutureCallback already allows you to do what you've implemented with the countdown latch.
I also have a question about whether we need to forward this request to the leader, as we do elsewhere.

@kkonstantine
Contributor

Forgot to mention that a unit test would be nice (for the whole class, even?), since it seems this never worked because it returned immediately.

@C0urante
Contributor Author

@kkonstantine I think I've addressed your two inline comments, and I've added a unit test for the ConnectClusterStateImpl.connectors() method. The other method (connectorHealth(...)) is synchronous and doesn't really look like it warrants an additional test.

As far as the ten-second timeout goes, I'm open to suggestions but it doesn't appear that rebalances or forwarding to master should be a factor.

Contributor

@kkonstantine kkonstantine left a comment

@C0urante this looks even better now IMO and on par with the rest of the code.
One nit and one more suggested improvement on the tests. Thanks for including those btw. Definitely on the right track.

@C0urante
Contributor Author

@kkonstantine thanks for the review! I've taken your comments into account and pushed the next iteration, which consists of changes to how the ConnectClusterStateImpl is tested. Hope this looks good and addresses your review remarks, but will be happy to keep iterating if not or if there's anything else that needs work.

Contributor

@kkonstantine kkonstantine left a comment

LGTM!

Contributor

@rhauch rhauch left a comment

Nice find, @C0urante! One really minor suggestion for a slight improvement in readability, but not a big deal.

This has never worked, right? This PR should be relatively easy to backport.

@C0urante
Contributor Author

@rhauch yep, this bug has been present since REST extensions were added to Connect. I've added your suggested change to the PR, let me know if there are any other alterations you'd like.

Contributor

@rhauch rhauch left a comment

Thanks, @C0urante! LGTM.

Contributor

@ewencp ewencp left a comment

Mostly looks good, one comment. This is the sort of thing that should come with a list of backports -- the bug appears to already be tagged with at least some info about previous versions that are also affected, so we probably don't want to close without backporting.

Regarding timeout, it should really be as conservative as possible without being longer than the minimum of (rebalance timeout, default REST API request timeout).

@C0urante
Contributor Author

@ewencp can you shed some light on why we'd want to take rebalances into account for the timeout? I could be wrong, but I don't think a rebalance would affect the latency of a call to Herder.connectors(...).

@ewencp
Contributor

ewencp commented Mar 21, 2019

Calling connectors() for DistributedHerder adds a request to the request queue; requests are processed serially in the thread that manages cluster membership so they get a consistent view of the Herder's state. Pretty much all the requests follow that pattern, though it's true that some could potentially be optimized to not require it if some other locking were put in place for certain parts of the state.

The thread that processes the requests is also the one managing the member and calling poll() on it, which can block on ensuring membership in the group, which means request processing latency can be affected by rebalances.
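That serialization can be sketched with a single-threaded executor (illustrative names only, not Connect's actual classes): any blocking membership work queued ahead of a request delays that request's completion, which is why the request timeout has to account for rebalances.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: requests and membership management share one thread, so a blocking
// poll()/rebalance delays every request queued behind it.
public class SerialHerder {
    final ExecutorService tickThread = Executors.newSingleThreadExecutor();

    // Simulates a request handled on the membership thread: the (simulated)
    // rebalance runs first, then the request is answered.
    Future<List<String>> connectors(long simulatedRebalanceMs) {
        return tickThread.submit(() -> {
            Thread.sleep(simulatedRebalanceMs); // stand-in for a blocking poll()
            return List.of("file-source", "jdbc-sink");
        });
    }

    void shutdown() {
        tickThread.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SerialHerder herder = new SerialHerder();
        long start = System.nanoTime();
        List<String> names = herder.connectors(200).get(10, TimeUnit.SECONDS);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The observed latency includes the simulated rebalance (>= 200 ms).
        System.out.println(names + " after ~" + elapsedMs + " ms");
        herder.shutdown();
    }
}
```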

@C0urante
Contributor Author

@ewencp thanks for the clarification; I did not realize that membership and request handling were taken care of with the same thread. I've taken your suggestion literally and just made the timeout the minimum of the timeout for the /connectors resource and the rebalance timeout (if running in distributed mode). LMK how this looks!

@C0urante
Contributor Author

C0urante commented Mar 25, 2019

BTW, I think we want to backport this to 2.0, which according to the KIP is when Connect REST extensions were introduced; AFAICT the code for the ConnectClusterStateImpl class hasn't been modified since.

Contributor

@ewencp ewencp left a comment

One comment that we could follow up on, but I'm fine merging as is.

        config, ConnectRestExtension.class);

        long herderRequestTimeoutMs = ConnectorsResource.REQUEST_TIMEOUT_MS;
        if (config instanceof DistributedConfig) {
Contributor

nit: I don't love instanceof solutions. It might be better to generalize the rebalance timeout to support a small/zero value for StandaloneConfig.

I wouldn't block merging on this, but this is an anti-pattern, so if there is a cleaner solution, it might be worth exploring.

Contributor

@C0urante, what do you think about doing something like the following instead of the instanceof block?

        Integer rebalanceTimeoutMs = config.getInt(DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG);
        if (rebalanceTimeoutMs != null) {
            herderRequestTimeoutMs = Math.min(herderRequestTimeoutMs, rebalanceTimeoutMs.longValue());
        }

This still uses DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG, but otherwise simply checks for the existence of that property to determine whether herderRequestTimeoutMs should be set to a new value.
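As a quick illustration of that behavior (plain Java; the null check is what makes the same code safe for standalone configs, where the rebalance property is simply absent):

```java
public class TimeoutSelection {
    // rebalanceTimeoutMs is null when the config has no rebalance.timeout.ms
    // property (standalone mode); otherwise the smaller of the two timeouts wins.
    static long effectiveTimeoutMs(long defaultMs, Integer rebalanceTimeoutMs) {
        long herderRequestTimeoutMs = defaultMs;
        if (rebalanceTimeoutMs != null) {
            herderRequestTimeoutMs = Math.min(herderRequestTimeoutMs, rebalanceTimeoutMs.longValue());
        }
        return herderRequestTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeoutMs(90_000L, null));   // standalone: prints 90000
        System.out.println(effectiveTimeoutMs(90_000L, 60_000)); // distributed: prints 60000
    }
}
```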

Contributor Author

@rhauch sgtm! I've added your code snippet verbatim to the PR

@rhauch rhauch merged commit 71e721f into apache:trunk Apr 7, 2019
rhauch pushed a commit that referenced this pull request Apr 7, 2019
Fixed the ConnectClusterStateImpl.connectors() method to throw an exception on timeout. Added unit test.

Author: Chris Egerton <chrise@confluent.io>

Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Robert Yokota <rayokota@gmail.com>, Arjun Satish <wicknicks@users.noreply.github.com>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6384 from C0urante:kafka-8058
@C0urante C0urante deleted the kafka-8058 branch April 7, 2019 19:13
jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019