KAFKA-5608: Add --wait option for JmxTool and use in system tests to avoid race between JmxTool and monitored services #3547
ewencp wants to merge 8 commits into apache:trunk from ewencp/wait-jmx-metrics
Conversation
…avoid race between JmxTool and monitored services
This is still WIP and needs testing, but I've kicked off a round of tests here. It turns out there are still multiple problems with JmxTool / JmxMixin. @hachikuji discovered the issue that is likely the source of most of the issues from the previous patch -- JmxTool may start up fast enough that the requested metrics are not available yet. In that case, it will never discover them; it'll just run with an empty list and hang. This matches the code, and I've verified with valid & invalid metrics just to be sure. This adds some logic to wait for requested metrics to be available before it starts reporting. Since you can't do this safely for empty lists or object name patterns, it tries to protect against misuse. However, within system tests, afaict, we don't use anything but an explicit list of names.

The other case that I found in these test failures is when the application being observed is too fast. JmxMixin may detect that the JMX port is listening, but by the time it actually runs, connects, and tries to query, the application has exited. This one will be harder to fix. I think the long-term solution is to ditch the separate JmxTool entirely and use MetricsReporters instead.

This patch is actually a second attempt to fix this issue -- the initial retries for connections were added in KAFKA-4558, and this is probably our second/third/etc iteration on about 3 or 4 areas related to JmxTool that have been challenging to get reliable.

/cc @hachikuji @apurvam @ijuma
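The wait-for-requested-metrics logic described above can be sketched roughly as follows. This is a Python illustration only, not the actual JmxTool code; `wait_for_object_names` and `found_names_fn` are hypothetical names standing in for querying the MBean server:

```python
import time

def wait_for_object_names(query_names, found_names_fn, max_retries=10, backoff_s=0.5):
    # Keep polling until every requested name is visible, or give up
    # after max_retries attempts. found_names_fn stands in for querying
    # the MBean server (mbsc.queryNames in the real tool).
    wanted = set(query_names)
    found = set()
    for _ in range(max_retries):
        found = set(found_names_fn())
        if wanted <= found:
            return found
        time.sleep(backoff_s)
    raise TimeoutError("missing object names: %s" % sorted(wanted - found))
```

Note this only works for an explicit list of names: with a pattern query there is no way to know how many matches to expect, which is why the patch guards against that case.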
Refer to this link for build results (access rights to CI server needed):
ijuma left a comment
Thanks for the PR. Left a couple of comments.
if (names != null) {
  System.err.println("Could not find all object names, retrying")
  retries += 1
  Thread.sleep(500)
Is there a reason not to make this lower? Maybe 100?
In general we've needed to increase these values. The 500 in the connection timeout was previously also an issue -- we had some tests that would exceed that. We sometimes see 5s of just JVM startup time before we ever even get a log message...
The only reason it's relevant is because we're limited by the number of retries. Perhaps we should use a timeout instead?
A fixed one like we effectively have via # of retries now or do you mean via a user-provided config?
It could be hard-coded. I was just thinking it is a more intuitive way to achieve the same result. A bit weird to have to increase the backoff to make sure we're giving enough time to the tool.
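The deadline-based approach discussed here decouples the total wait from the per-attempt backoff. A minimal sketch (illustrative helper, not the actual tool code):

```python
import time

def wait_until(predicate, timeout_s=10.0, backoff_s=0.1):
    # A wall-clock deadline bounds the total wait, so tuning the backoff
    # no longer changes how long the tool is willing to wait overall --
    # the "weird" coupling between backoff and retry count goes away.
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_s)
```

With this shape, lowering the backoff to 100ms only makes the tool more responsive; it cannot shorten the overall time budget.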
  Thread.sleep(500)
}
names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
} while (wait && retries < maxNumRetries && !queries.equals(names))
The queries.equals(names) check is a bit fragile since the collection type is Iterable. If one of them is a Set and the other is a Seq, for example, the comparison will always return false. Maybe we should make sure both instances are a Set before comparing. That also seems to make more sense since we don't care about ordering.
Yeah, I had thought about that too, will update with that change.
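The collection-equality pitfall ijuma points out has a direct analog in most languages; for instance, in Python a list and a set never compare equal even when they hold exactly the same elements:

```python
# The same MBean names held as a list and as a set:
queries = ["kafka.server:type=ReplicaManager", "kafka.server:type=BrokerTopicMetrics"]
names = {"kafka.server:type=BrokerTopicMetrics", "kafka.server:type=ReplicaManager"}

# A sequence never compares equal to a set, even with identical elements...
assert queries != names
# ...so normalize both sides to sets before comparing. This also makes
# the comparison order-insensitive, which is what we actually want here.
assert set(queries) == names
```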
I agree that it would be good to get rid of
There seems to be a lot of variability currently, which makes it more difficult to evaluate the impact of this patch. For example, the last few nightly runs:

In any case, on to evaluating the test failures. One class of failures shows the following in

At first I thought these were due to an rsyncing issue, but then realized they are all related to version upgrades. Unfortunately, I'm not sure there will be a good fix for this. One option is to try to force those old versions to use the newer version of JmxTool. This should be pretty reliable since JmxTool changes infrequently, but I think it requires some gymnastics with the path resolution/versioning stuff. The other would be to half revert the removal of waiting for the first line of output and make it conditional on the version, but at that point it seems easier to revert & add a note about why things are the way they are (despite the fact that there still seems to be a potential race condition in the test).

The remaining 2 failures from kafkatest.tests.core.throttling_test.ThrottlingTest.test_throttled_reassignment fail with this:

From the

And synchronization issues cause some lines to merge before the newline is written, then adding another newline. Not sure yet how this could happen.
Turns out the version stuff doesn't look so bad, I've updated the patch and am running the subset of tests that failed here.
The system tests passed and I've updated the code to address comments, so I think the only remaining issue is that one set of tests that weirdly seems to have 2 instances of JmxTool running concurrently. Additionally, it looks like this only backports cleanly to 0.11.0 and 0.10.2, and the version check relies on 0.11.0.0 version numbers being available, so I don't think this will work for 0.10.2 either. So I'd probably revert the changes on those branches and we'd only rely on this for 0.11.0 and newer.
Thanks for the patch. If I understand correctly, the issue being addressed here was revealed by #3447. In that PR, we no longer wait for the first line of output before starting the tool, so it could start before the metrics are ready, causing the tool to hang. And we had PR #3447 to fix errors like
@apurvam That's correct, #3447 now allows

is because I'm trying to get the conditions here to finally test the conditions we really want. This is the third iteration. The check that the first line had been written was the first attempt, and was a heuristic when

This isn't really a great solution either since it relies on some fixed timeouts in
Using a metric reporter seems like a good option to explore. Maybe create a JIRA?
Yeah, I'll follow up. Am just trying to get this done quickly so we can get the failures resolved. Since the above tests pass, the last issue is that one test that seems to run JmxTool multiple times. Still not sure what's going on, and I'm having trouble reproducing anywhere but on Jenkins, but from one of the logs, it does indeed appear it is starting multiple times:

I'll continue trying to track down how that could happen (seems like it could possibly have something to do with console consumer running in the background and calling
List(null)

val names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)
Do you know if we are using patterns anywhere? If the goal is to ultimately remove the JmxTool and we don't use this capability, maybe we could drop support now. If that makes sense, we could do it in a follow-up to avoid holding up this patch any further.
JmxTool is a public tool that users use. We'd need to deprecate before removing. Technically, the addition of the option should probably be a KIP.
I see. In that case, do we need the --wait option at all? Could this not be the default behavior?
Using --wait breaks the tool when used with patterns. Since we can't say whether users are using patterns or not, I don't think we can turn it on by default. And I wouldn't be surprised if they are since it allows you to gather metrics for all client IDs, all brokers, etc.
Oh, I think I see what you were saying re: default, just if no patterns were used. It would be a change in behavior. Currently even if you have an extra or typo metric, you still get output. If we made this the default, you would not.
Perhaps it would be more palatable if, instead of exiting, we just log a warning when the wait time expires and not all queries have been matched?
Hmm, maybe. The tradeoff there is that users running it directly now suffer a 10s timeout (it was only 5s, but while we were at it, since we hit timeout issues with JmxTool in Confluent's own tests, I figured it made sense to bump it up). 10s isn't too bad, but since there's also output about retrying (which is useful for debugging in system tests), it'll just spam their console for a while (the tradeoff of 100 vs 500ms).
Most of this could be resolved by using real logging, but since the tool doesn't do that we have fewer options here.
Ok. I'll leave it up to you since the tradeoffs seem about even. I slightly prefer changing the default behavior since a little extra latency in cases when the query doesn't match seems like not a big deal for a long-running service like this (and I think it's unlikely we have many users of this tool anyway or we wouldn't be considering deprecation). It also avoids the K-word (you brought it up, not me).
This seems to be a valid hypothesis, since the throttling test is the only one which actually calls
Oh I see, there may be a race condition in the

Perhaps the fix would be to not start the jmx tool from
hachikuji left a comment
Left a couple minor comments, but LGTM overall.
val names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)

var names : Iterable[ObjectName] = null
nit: remove space after names?
}

if (wait && !queries.toSet.equals(if (names == null) null else names.toSet)) {
  System.err.println(s"Could not find all requested object names after $waitTimeoutMs ms.")
Maybe we can mention the ones we didn't find?
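Reporting which names were never matched, as suggested here, is just a set difference. A sketch with hypothetical names (the real code would do the equivalent on the Scala side):

```python
def missing_object_names(queries, found):
    # The requested names that the MBean server never returned,
    # sorted so the error message is stable across runs.
    return sorted(set(queries) - set(found))
```

For example, `missing_object_names(["a", "b", "c"], ["a"])` yields `["b", "c"]`, which could be appended to the timeout message.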
  Thread.sleep(100)
}
names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
} while (wait && System.currentTimeMillis - start < waitTimeoutMs && !queries.toSet.equals(if (names == null) null else names.toSet))
nit: perhaps turn queries.toSet.equals(if (names == null) null else names.toSet) into an inner def since it is also used below?
@apurvam That's actually not definitely true, because all starting the console consumer service does is start the background thread. You're not really guaranteed

I actually just synchronized the relevant calls in ConsoleConsumer and ran those tests here, and they are now passing. Given that they failed multiple times in a row previously, I think we should be good. I've kicked off another round of those tests to repeatedly validate.
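The kind of synchronization described here — guarding the JMX start call so the foreground thread and the console consumer's background thread can't both perform it — might look like the following. The class and method names are illustrative stand-ins, not the actual ducktape code:

```python
import threading

class JmxToolGuard:
    # Illustrative sketch of the synchronized JmxMixin calls; the real
    # change guards the relevant ConsoleConsumer methods with a lock.
    def __init__(self):
        self.lock = threading.Lock()
        self.starts = 0

    def maybe_start_jmx_tool(self):
        with self.lock:
            # Under the lock, only the first caller performs the start;
            # any concurrent caller sees the tool already running and
            # returns without launching a second instance.
            if self.starts == 0:
                self.starts += 1
```

Without the lock, two threads could both observe "not started" and each launch JmxTool, which matches the duplicate-instance symptom seen in the throttling test logs.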
…foreground and background threads may invoke them
self.logger.debug("collecting following jmx objects: %s", self.jmx_object_names)
self.start_jmx_tool(idx, node)
with self.lock:
    self._init_jmx_attributes()
Maybe this line and the one below could be moved into start_jmx_tool? If we go that far, we could do the locking in there as well.
start_jmx_tool is completely generic for anything this is mixed into. _init_jmx_metrics is specific to ConsoleConsumer. That's actually why I renamed it -- it clarifies that this method is internal to this specific class.
I think we might be able to move _init_jmx_metrics into the constructor if we are sure that all uses guarantee that the JmxMixin constructor is called with the appropriate parameters and that the choice of metrics is not modified dynamically after the service is created.
def _init_jmx_attributes(self):
    # Must hold lock
    if self.new_consumer is True:
nit: since we're here, maybe we could simplify this if and the one below?
Done. There's a bunch of other cleanup that we'd probably catch via a simple pep8/flake8 checker against the entire code base (with the drawback that this may make any future backports quite painful).
@hachikuji Addressed those last few comments, making changes where I think it's safe. Last round of tests against the threading race condition also passed. Hopefully this is good to go and we can just do some follow ups after this:
Latest changes LGTM. Feel free to merge when you think it's ready.
…avoid race between JmxTool and monitored services

Author: Ewen Cheslack-Postava <me@ewencp.org>
Author: Ewen Cheslack-Postava <ewen@confluent.io>

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>

Closes #3547 from ewencp/wait-jmx-metrics

(cherry picked from commit f50af9c)
Signed-off-by: Ewen Cheslack-Postava <me@ewencp.org>
…since we now have a more reliable test using JMX"

This reverts commit 40e36fc. See https://issues.apache.org/jira/browse/KAFKA-5608 and #3547 for more details.
…since we now have a more reliable test using JMX"

This reverts commit 7f96aa3. See https://issues.apache.org/jira/browse/KAFKA-5608 and #3547 for more details.
Ok, so this is now merged to trunk and 0.11.0. I've reverted the commit that caused the problems on 0.10.2 and 0.10.1. I've also filed:
var names: Iterable[ObjectName] = null
def namesSet = if (names == null) null else names.toSet
def foundAllObjects() = !queries.toSet.equals(namesSet)
Seems like this is actually doing the reverse of what it's claiming to do. The calling code uses it in a way that's OK in terms of behaviour, but it's a bit confusing.
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)

var names: Iterable[ObjectName] = null
def namesSet = if (names == null) null else names.toSet
Isn't there a potential NPE here if wait == true && !hasPatternQueries?
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Yeva Byzek <yeva@confluent.io>