
KAFKA-5608: Add --wait option for JmxTool and use in system tests to avoid race between JmxTool and monitored services#3547

Closed
ewencp wants to merge 8 commits into apache:trunk from ewencp:wait-jmx-metrics

Conversation

@ewencp
Contributor

@ewencp ewencp commented Jul 19, 2017

No description provided.

…avoid race between JmxTool and monitored services
@ewencp
Contributor Author

ewencp commented Jul 19, 2017

This is still wip and needs testing, but I've kicked off a round of tests here. It turns out there are still multiple problems with JmxTool / JmxMixin. @hachikuji discovered the issue that is likely a source of most of the issues from the previous patch -- JmxTool may start up fast enough that the requested metrics are not available yet. In that case, it will never discover them, it'll just run with an empty list and hang. This matches the code, and I've verified with valid & invalid metrics just to be sure. This adds some logic to wait for requested metrics to be available before it starts reporting. Since you can't do this safely for empty lists or object name patterns, it tries to protect against misuse. However, within system tests, afaict, we don't use anything but an explicit list of names.

The other case that I found in these test failures is when the application being observed is too fast. JmxMixin may detect that the JMX port is listening, but by the time it actually runs, connects, and tries to query, the application has exited. This one will be harder to fix.

I think the long term solution is to ditch the separate JmxTool entirely and use MetricsReporters instead. This patch is actually a second attempt to fix this issue -- the initial retries for connections were added in KAFKA-4558 and this is probably our second/third/etc iteration on about 3 or 4 areas related to JmxTool that have been challenging to get reliable.

/cc @hachikuji @apurvam @ijuma
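The wait-for-metrics logic described above can be sketched as a polling loop. This is a hedged illustration, not the actual JmxTool code: `query_names`, `requested`, and the timeout values are hypothetical stand-ins for the MBean query, the `--object-name` list, and the tool's retry settings.

```python
import time

def wait_for_object_names(query_names, requested, timeout_s=10.0, interval_s=0.5):
    """Poll until every requested MBean name is registered, or time out.

    query_names -- callable returning the set of currently registered names
    requested   -- explicit list of object names (non-empty, no patterns),
                   since "all present" is undefined for patterns/empty lists
    """
    if not requested:
        raise ValueError("--wait requires an explicit, non-empty name list")
    deadline = time.monotonic() + timeout_s
    while True:
        missing = set(requested) - set(query_names())
        if not missing:
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("metrics never appeared: %s" % sorted(missing))
        time.sleep(interval_s)
```

Guarding against empty lists up front mirrors the misuse protection mentioned above, since an empty request would otherwise "succeed" instantly while the tool reports nothing.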

@asfgit

asfgit commented Jul 19, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6158/
Test PASSed (JDK 7 and Scala 2.11).

@asfgit

asfgit commented Jul 19, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6142/
Test PASSed (JDK 8 and Scala 2.12).

Member

@ijuma ijuma left a comment

Thanks for the PR. Left a couple of comments.

if (names != null) {
System.err.println("Could not find all object names, retrying")
retries += 1
Thread.sleep(500)
Member

Is there a reason not to make this lower? Maybe 100?

Contributor Author

In general we've needed to increase these values. The 500 in the connection timeout was previously also an issue -- we had some tests that would exceed that. We sometimes see 5s of just JVM startup time before we ever even get a log message...

Contributor

The only reason it's relevant is because we're limited by the number of retries. Perhaps we should use a timeout instead?

Contributor Author

A fixed one, like we effectively have via the # of retries now, or do you mean via a user-provided config?

Contributor

It could be hard-coded. I was just thinking it is a more intuitive way to achieve the same result. A bit weird to have to increase the backoff to make sure we're giving enough time to the tool.
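The deadline-based approach suggested here could look like the following sketch (a hypothetical helper, not the PR's Scala code): the timeout alone bounds the total wait, while the backoff only controls polling frequency, whereas with a retry count the bound is retries × backoff, so shrinking the backoff also shrinks the total wait.

```python
import time

def wait_until(check, timeout_s=10.0, backoff_s=0.1):
    """Deadline-based wait: timeout_s bounds the total wait on its own;
    backoff_s only sets how often we re-check. With a fixed retry count,
    by contrast, total wait = retries * backoff."""
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_s)
    return True
```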

Thread.sleep(500)
}
names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
} while (wait && retries < maxNumRetries && !queries.equals(names))
Member

The queries.equals(names) check is a bit fragile as the collection type is Iterable. If one of them is a Set and the other is a Seq, they will always be false, for example. Maybe we should make sure both instances are a Set before comparing. It also seems to make more sense since we don't care about ordering.

Contributor Author

Yeah, I had thought about that too, will update with that change.
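The pitfall ijuma describes has a direct analog in Python, shown here as a hedged illustration (the object names are made up): collections of different kinds never compare equal even with identical elements, so normalizing both sides to sets both fixes the comparison and matches the intent that ordering doesn't matter.

```python
# list vs set: never equal, even with the same elements
requested = ["kafka.server:name=a", "kafka.server:name=b"]
found = {"kafka.server:name=b", "kafka.server:name=a"}

assert requested != found        # different collection kinds
assert set(requested) == found   # normalized to sets, comparison succeeds
```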

@ijuma
Member

ijuma commented Jul 19, 2017

I agree that it would be good to get rid of JmxTool in the future.

@ewencp
Contributor Author

ewencp commented Jul 19, 2017

There seems to be a lot of variability currently which makes it more difficult to evaluate the impact of this patch. For example, the last few nightly runs:

In any case, on to evaluating the test failures. One class of failures show the following in jmx_tool.err.log:

Exception in thread "main" joptsimple.UnrecognizedOptionException: wait is not a recognized option
        at joptsimple.OptionException.unrecognizedOption(OptionException.java:108)
        at joptsimple.OptionParser.handleLongOptionToken(OptionParser.java:510)
        at joptsimple.OptionParserState$2.handleArgument(OptionParserState.java:56)
        at joptsimple.OptionParser.parse(OptionParser.java:396)
        at kafka.tools.JmxTool$.main(JmxTool.scala:77)
        at kafka.tools.JmxTool.main(JmxTool.scala)
  • 2 failures of kafkatest.tests.client.message_format_change_test.MessageFormatChangeTest.test_compatibility
  • 1 failure of kafkatest.tests.core.upgrade_test.TestUpgrade.test_upgrade
  • 6 failures of kafkatest.tests.core.compatibility_test_new_broker_test.ClientCompatibilityTestNewBroker.test_compatibility
  • 11 failures of kafkatest.tests.core.upgrade_test.TestUpgrade.test_upgrade

At first I thought these were due to an rsyncing issue, but then realized they are all related to version upgrades. Unfortunately, I'm not sure there will be a good fix for this. One option is to try to force those old versions to use the newer version of JmxTool. This should be pretty reliable since JmxTool changes infrequently, but I think requires some gymnastics with the path resolution/versioning stuff. The other would be to half revert the removal of waiting for the first line of output and make it conditional on the version, but at that point it seems easier to revert & add a note about why things are the way they are (despite the fact that it seems there is still potentially a race condition in the test).

The remaining 2 failures from kafkatest.tests.core.throttling_test.ThrottlingTest.test_throttled_reassignment fail with this:

[DEBUG - 2017-07-19 06:03:56,451 - remoteaccount - _log - lineno:158]: ubuntu@worker10: Running ssh command: cat /mnt/jmx_tool.log
[INFO  - 2017-07-19 06:03:56,522 - background_thread - _protected_worker - lineno:38]: BackgroundThreadService threw exception:
[INFO  - 2017-07-19 06:03:56,522 - background_thread - _protected_worker - lineno:39]: Traceback (most recent call last):
  File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/services/background_thread.py", line 35, in _protected_worker
    self._worker(idx, node)
  File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/services/console_consumer.py", line 276, in _worker
    self.read_jmx_output(idx, node)
  File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/services/monitor/jmx.py", line 101, in read_jmx_output
    stats = [float(field) for field in line.split(',')]
ValueError: could not convert string to float:

From the jmx_tool.log, it looks like there are somehow two instances of JmxTool running:

"time","kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer:assigned-partitions"
"time","kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer:assigned-partitions"
1500444002113,0.0
1500444002115,0.0
...snip...
1500444019120,3.0
1500444020113,3.0
1500444020118,3.0
1500444021115,3.01500444021123,3.0

And synchronization issues cause some lines to merge before the newline is written, with an extra newline added later. Not sure yet how this could happen.
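A defensive parser would sidestep this failure mode. This is a hypothetical sketch, not the actual jmx.py code: it skips the header row and any merged or truncated line instead of raising ValueError.

```python
def parse_jmx_lines(lines, expected_fields=2):
    """Parse JmxTool CSV output, skipping header and malformed lines.

    Hypothetical defensive variant of the parsing in jmx.py that raised
    ValueError on a merged line like '1500444021115,3.01500444021123,3.0'.
    """
    rows = []
    for line in lines:
        fields = line.strip().split(',')
        if len(fields) != expected_fields:
            continue  # merged, truncated, or empty line
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            continue  # header row or other non-numeric garbage
    return rows
```

Of course, silently dropping lines trades a crash for lost samples, so the real fix is still to stop two JmxTool instances from writing to the same file.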

@asfgit

asfgit commented Jul 19, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6173/
Test FAILed (JDK 7 and Scala 2.11).

@asfgit

asfgit commented Jul 19, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6157/
Test FAILed (JDK 8 and Scala 2.12).

@ewencp
Contributor Author

ewencp commented Jul 19, 2017

Turns out the version stuff doesn't look so bad, I've updated the patch and am running the subset of tests that failed here.

@ewencp
Contributor Author

ewencp commented Jul 19, 2017

The system tests passed and I've updated the code to address comments, so I think the only remaining issue is that one set of tests that weirdly seems to have 2 instances of JmxTool running concurrently.

Additionally, it looks like this only backports cleanly to 0.11.0 and 0.10.2, and the version check relies on 0.11.0.0 version numbers being available, so I don't think this will work for 0.10.2 either. So I'd probably revert the changes on those branches, and we'd only rely on this for 0.11.0 and newer.

@asfgit

asfgit commented Jul 19, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6174/
Test PASSed (JDK 7 and Scala 2.11).

@asfgit

asfgit commented Jul 19, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6158/
Test PASSed (JDK 8 and Scala 2.12).

@apurvam
Contributor

apurvam commented Jul 19, 2017

Thanks for the patch. If I understand correctly, the issue being addressed here was revealed by #3447. In that PR, we no longer wait for the first line of output before starting the tool. So it could start before the metrics are ready, causing the tool to hang.

And we had PR #3447 in order to fix errors like "JMXTool took too long to start", correct?

@ewencp
Contributor Author

ewencp commented Jul 19, 2017

@apurvam That's correct, #3447 now allows JmxTool to start so fast that the metrics haven't fully registered, and JmxTool does not currently handle that well, resulting in a hang.

#3447 came about because I'm trying to get the checks here to finally test the conditions we really want. This is the third iteration. The check that the first line had been written was the first attempt -- a heuristic from when JmxMixin was first added. We found that was incorrect, which led to your change to JmxTool to add some retries to handle potential delay. However, just being able to connect isn't sufficient (which is why your change to JmxTool and my change to JmxMixin to check connectivity with nc before starting JmxTool did not fully work).

This isn't really a great solution either since it relies on some fixed timeouts in JmxTool (which I've just increased), but it's the most accurate we've gotten so far. That's why I mentioned possibly trying an alternative approach using MetricsReporters in the future since it wouldn't rely on this multi-process coordination.

@hachikuji
Contributor

Using a metric reporter seems like a good option to explore. Maybe create a JIRA?

@ewencp
Contributor Author

ewencp commented Jul 19, 2017

Yeah, I'll follow up. Am just trying to get this done quickly so we can get the failures resolved.

Since above tests pass, last issue is that one test that seems to run JmxTool multiple times. Still not sure what's going on and having trouble reproducing anywhere but on Jenkins, but from one of the logs, it does indeed appear it is starting multiple times:

[DEBUG - 2017-07-19 21:55:18,964 - console_consumer - _worker - lineno:260]: collecting following jmx objects: ['kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer']
[DEBUG - 2017-07-19 21:55:18,965 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-19 21:55:18,971 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: ps ax | grep -i console_consumer | grep java | grep -v grep | awk '{print $1}'
[DEBUG - 2017-07-19 21:55:19,045 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-19 21:55:19,071 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-19 21:55:19,190 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-19 21:55:19,234 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-19 21:55:19,239 - jmx - start_jmx_tool - lineno:78]: ubuntu@worker1: Start JmxTool 1 command: /opt/kafka-dev/bin/kafka-run-class.sh kafka.tools.JmxTool --reporting-interval 1000 --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --wait --object-name kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer --attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> /mnt/jmx_tool.err.log &
[DEBUG - 2017-07-19 21:55:19,239 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: /opt/kafka-dev/bin/kafka-run-class.sh kafka.tools.JmxTool --reporting-interval 1000 --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --wait --object-name kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer --attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> /mnt/jmx_tool.err.log &
[DEBUG - 2017-07-19 21:55:19,295 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: nc -z 127.0.0.1 9192
[DEBUG - 2017-07-19 21:55:19,301 - jmx - start_jmx_tool - lineno:78]: ubuntu@worker1: Start JmxTool 1 command: /opt/kafka-dev/bin/kafka-run-class.sh kafka.tools.JmxTool --reporting-interval 1000 --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --wait --object-name kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer --attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> /mnt/jmx_tool.err.log &
[DEBUG - 2017-07-19 21:55:19,301 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: /opt/kafka-dev/bin/kafka-run-class.sh kafka.tools.JmxTool --reporting-interval 1000 --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9192/jmxrmi --wait --object-name kafka.consumer:type=consumer-coordinator-metrics,client-id=console-consumer --attributes assigned-partitions 1>> /mnt/jmx_tool.log 2>> /mnt/jmx_tool.err.log &
[DEBUG - 2017-07-19 21:55:19,345 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-19 21:55:19,346 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-19 21:55:19,855 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-19 21:55:19,856 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-19 21:55:20,363 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"
[DEBUG - 2017-07-19 21:55:20,364 - remoteaccount - _log - lineno:158]: ubuntu@worker1: Running ssh command: test -z "$(cat /mnt/jmx_tool.log)"

I'll continue trying to track down how that could happen (seems like it could possibly have something to do with console consumer running in the background and calling start_jmx_tool but produce_consume_validate calling has_partitions_assigned in the test, which in turn calls start_jmx_tool), and hopefully reproduce. But I think we might want to also merge as is (and revert the other patch on 0.10.1 and 0.10.2) if people are happy with the current state so that we can get the vast majority of tests back to passing. I can also @ignore that test for the moment if it's inconvenient until we can sort out the issue.


val names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)
Contributor

Do you know if we are using patterns anywhere? If the goal is to ultimately remove the JmxTool and we don't use this capability, maybe we could drop support now. If that makes sense, we could do it in a follow-up to avoid holding up this patch any further.

Contributor Author

JmxTool is a public tool that users use. We'd need to deprecate before removing. Technically the addition of the option should probably be a KIP

Contributor

I see. In that case, do we need the --wait option at all? Could this not be the default behavior?

Contributor Author

Using --wait breaks the tool when used with patterns. Since we can't say whether users are using patterns or not, I don't think we can turn it on by default. And I wouldn't be surprised if they are since it allows you to gather metrics for all client IDs, all brokers, etc.

Contributor Author

Oh, I think I see what you were saying re: default -- just if no patterns were used. It would be a change in behavior: currently, even if you have an extra or mistyped metric, you still get output. If we made this the default, you would not.

Contributor

@hachikuji hachikuji Jul 19, 2017

Perhaps it would be more palatable if, instead of exiting, we just log a warning when the wait time expires and not all queries have been matched?

Contributor Author

Hmm, maybe. The tradeoff there is that users running it directly now suffer a 10s timeout (it was only 5s, but since we hit timeout issues with JmxTool in Confluent's own tests, I figured it made sense to bump it up while we were at it). 10s isn't too bad, but since there's also output about retrying (which is useful for debugging in system tests), it'll just spam their console for a while (the tradeoff of 100 vs 500ms).

Most of this could be resolved by using real logging, but since the tool doesn't do that we have fewer options here.

Contributor

Ok. I'll leave it up to you since the tradeoffs seem about even. I slightly prefer changing the default behavior since a little extra latency in cases when the query doesn't match seems like not a big deal for a long-running service like this (and I think it's unlikely we have many users of this tool anyway or we wouldn't be considering deprecation). It also avoids the K-word (you brought it up, not me).

@asfgit

asfgit commented Jul 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6178/
Test PASSed (JDK 7 and Scala 2.11).

@asfgit

asfgit commented Jul 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6162/
Test FAILed (JDK 8 and Scala 2.12).

@apurvam
Contributor

apurvam commented Jul 20, 2017

I'll continue trying to track down how that could happen (seems like it could possibly have something to do with console consumer running in the background and calling start_jmx_tool but produce_consume_validate calling has_partitions_assigned in the test, which in turn calls start_jmx_tool), and hopefully reproduce. But I think we might want to also merge as is (and revert the other patch on 0.10.1 and 0.10.2) if people are happy with the current state so that we can get the vast majority of tests back to passing. I can also @ignore that test for the moment if it's inconvenient until we can sort out the issue.

This seems to be a valid hypothesis, since the throttling test is the only one which actually calls has_partitions_assigned, which results in multiple calls to start_jmx_tool. Still not sure why this patch would expose a problem with this behavior though.

@apurvam
Contributor

apurvam commented Jul 20, 2017

Oh I see, there may be a race condition in the self.started check in start_jmx_tool which is being exposed by these changes somehow so that it returns false both times. When we waited for the first line of output from the consumer, we would always start the jmx tool from has_partitions_assigned, and the second call would be no-op. But now we probably start from both places more often than not.

Perhaps the fix would be to not start the jmx tool from has_partition_assigned since with the changes in #3447 it should already have been started as soon as the console consumer was started.
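The race described above -- two callers both observing started == False -- can be closed by doing the check-and-set under a lock, making the start idempotent. A minimal sketch with hypothetical names (the real JmxMixin differs):

```python
import threading

class JmxToolStarter:
    """Hypothetical sketch of an idempotent start: the started check and
    flag update happen under one lock, so concurrent callers launch at
    most one JmxTool instance."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = False
        self.launch_count = 0

    def start_jmx_tool(self):
        with self._lock:
            if self._started:
                return False   # someone else already started it
            self._started = True
        self.launch_count += 1  # stand-in for actually launching the process
        return True
```

Only the single caller that wins the check-and-set proceeds past the lock, so the launch itself need not be held inside the critical section.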

Contributor

@hachikuji hachikuji left a comment

Left a couple minor comments, but LGTM overall.

val names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)

var names : Iterable[ObjectName] = null
Contributor

nit: remove space after names?

}

if (wait && !queries.toSet.equals(if (names == null) null else names.toSet)) {
System.err.println(s"Could not find all requested object names after $waitTimeoutMs ms.")
Contributor

Maybe we can mention the ones we didn't find?

Thread.sleep(100)
}
names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
} while (wait && System.currentTimeMillis - start < waitTimeoutMs && !queries.toSet.equals(if (names == null) null else names.toSet))
Contributor

nit: perhaps turn queries.toSet.equals(if (names == null) null else names.toSet) into an inner def since it is also used below?

@ewencp
Contributor Author

ewencp commented Jul 20, 2017

@apurvam That's not necessarily true, because all that starting the console consumer service does is start the background thread. You're not really guaranteed start_jmx_tool has been called unless you do some additional synchronization.

I actually just synchronized the relevant calls in ConsoleConsumer and ran those tests here and they are now passing. Given that they failed multiple times in a row previously, I think we should be good. I've kicked off another round of those tests to repeatedly validate.

@asfgit

asfgit commented Jul 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6182/
Test PASSed (JDK 7 and Scala 2.11).

self.logger.debug("collecting following jmx objects: %s", self.jmx_object_names)
self.start_jmx_tool(idx, node)
with self.lock:
self._init_jmx_attributes()
Contributor

@hachikuji hachikuji Jul 20, 2017

Maybe this line and the one below could be moved into start_jmx_tool? If we go that far, we could do the locking in there as well.

Contributor Author

start_jmx_tool is completely generic for anything this is mixed into. _init_jmx_attributes is specific to ConsoleConsumer. That's actually why I renamed it -- it clarifies that this method is internal to this specific class.

I think we might be able to move _init_jmx_attributes into the constructor if we are sure that all uses guarantee that the JmxMixin constructor is called with the appropriate parameters and that the choice of metrics is not modified dynamically after the service is created.


def _init_jmx_attributes(self):
# Must hold lock
if self.new_consumer is True:
Contributor

nit: since we're here, maybe we could simplify this if and the one below?

Contributor Author

Done. There's a bunch of other cleanup that we'd probably catch via a simple pep8/flake8 checker against the entire code base (with the drawback that this may make any future backports quite painful).

@asfgit

asfgit commented Jul 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6166/
Test PASSed (JDK 8 and Scala 2.12).

@ewencp
Contributor Author

ewencp commented Jul 20, 2017

@hachikuji Addressed those last few comments, making changes where I think it's safe. Last round of tests against the threading race condition also passed.

Hopefully this is good to go and we can just do some follow ups after this:

  • At a minimum, file a JIRA to consider replacing JmxTool with a MetricsReporter in our system tests since this multi-process stuff has been the source of many weeks worth of debugging and fixes with at least 3 different attempts to get things right (and yet may still not be quite right, e.g. I think the assignment metric might get updated before offsets are reloaded and fetch requests make it to the broker, which I think is the real requirement for some of these tests).
  • Determine what the usage of JmxTool is and whether it's used much beyond very simple local use cases + system tests. If it doesn't have much other usage, tweak the defaults to work better for system tests, and possibly figure out a path to deprecation/move to system test-specific jar.
  • There are a number of PEP8/flake8 checks which would help clean up the code and make it more idiomatic. These will probably only make sense to actually change once we have automated checkers in CI to help enforce the rules for those that aren't very familiar with idiomatic python.

@hachikuji
Contributor

Latest changes LGTM. Feel free to merge when you think it's ready.

@asfgit

asfgit commented Jul 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6168/
Test PASSed (JDK 8 and Scala 2.12).

@asfgit

asfgit commented Jul 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6184/
Test PASSed (JDK 7 and Scala 2.11).

asfgit pushed a commit that referenced this pull request Jul 20, 2017
…avoid race between JmxTool and monitored services

Author: Ewen Cheslack-Postava <me@ewencp.org>
Author: Ewen Cheslack-Postava <ewen@confluent.io>

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>

Closes #3547 from ewencp/wait-jmx-metrics

(cherry picked from commit f50af9c)
Signed-off-by: Ewen Cheslack-Postava <me@ewencp.org>
@asfgit asfgit closed this in f50af9c Jul 20, 2017
asfgit pushed a commit that referenced this pull request Jul 20, 2017
…since we now have a more reliable test using JMX"

This reverts commit 40e36fc.

See https://issues.apache.org/jira/browse/KAFKA-5608 and #3547 for more details.
asfgit pushed a commit that referenced this pull request Jul 20, 2017
…since we now have a more reliable test using JMX"

This reverts commit 7f96aa3.

See https://issues.apache.org/jira/browse/KAFKA-5608 and #3547 for more details.
@ewencp
Contributor Author

ewencp commented Jul 20, 2017

Ok, so this is now merged to trunk and 0.11.0. I've reverted the commit that caused the problems on 0.10.2 and 0.10.1. I've also filed:

Member

@ijuma ijuma left a comment

Thanks for the quick follow-up. I agree that it's best to revert the change for older branches. I left a couple of comments with a PR to address them (if you agree):

#3553


var names: Iterable[ObjectName] = null
def namesSet = if (names == null) null else names.toSet
def foundAllObjects() = !queries.toSet.equals(namesSet)
Member

Seems like this is actually doing the reverse of what it's claiming to do. The calling code uses it in a way that's OK in terms of behaviour, but it's a bit confusing.

val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)

var names: Iterable[ObjectName] = null
def namesSet = if (names == null) null else names.toSet
Member

Isn't there a potential NPE here if wait == true && !hasPatternQueries?

cmccabe added a commit to confluentinc/kafka that referenced this pull request May 7, 2021
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Yeva Byzek <yeva@confluent.io>