KAFKA-5608: Add --wait option for JmxTool and use in system tests to avoid race between JmxTool and monitored services #3547
ewencp wants to merge 8 commits into apache:trunk from ewencp/wait-jmx-metrics
Conversation
…avoid race between JmxTool and monitored services
This is still WIP and needs testing, but I've kicked off a round of tests here. It turns out there are still multiple problems with JmxTool / JmxMixin. @hachikuji discovered the issue that is likely the source of most of the issues from the previous patch -- JmxTool may start up fast enough that the requested metrics are not available yet. In that case, it will never discover them; it'll just run with an empty list and hang. This matches the code, and I've verified with valid & invalid metrics just to be sure. This adds some logic to wait for requested metrics to be available before it starts reporting. Since you can't do this safely for empty lists or object name patterns, it tries to protect against misuse. However, within system tests, afaict, we don't use anything but an explicit list of names.

The other case that I found in these test failures is when the application being observed is too fast. JmxMixin may detect that the JMX port is listening, but by the time it actually runs, connects, and tries to query, the application has exited. This one will be harder to fix. I think the long-term solution is to ditch the separate JmxTool entirely and use MetricsReporters instead.

This patch is actually a second attempt to fix this issue -- the initial retries for connections were added in KAFKA-4558, and this is probably our second/third/etc iteration on about 3 or 4 areas related to JmxTool that have been challenging to get reliable.

/cc @hachikuji @apurvam @ijuma
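The wait-for-requested-metrics logic described above can be sketched roughly as follows. This is a Python illustration only, not the actual JmxTool code; `wait_for_object_names` and `found_names_fn` are hypothetical names standing in for querying the MBean server:

```python
import time

def wait_for_object_names(query_names, found_names_fn, max_retries=10, backoff_s=0.5):
    # Keep polling until every requested name is visible, or give up
    # after max_retries attempts. found_names_fn stands in for querying
    # the MBean server (mbsc.queryNames in the real tool).
    wanted = set(query_names)
    found = set()
    for _ in range(max_retries):
        found = set(found_names_fn())
        if wanted <= found:
            return found
        time.sleep(backoff_s)
    raise TimeoutError("missing object names: %s" % sorted(wanted - found))
```

Note this only works for an explicit list of names: with a pattern query there is no way to know how many matches to expect, which is why the patch guards against that case.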
Refer to this link for build results (access rights to CI server needed):
ijuma left a comment
Thanks for the PR. Left a couple of comments.
if (names != null) {
  System.err.println("Could not find all object names, retrying")
  retries += 1
  Thread.sleep(500)
Is there a reason not to make this lower? Maybe 100?
In general we've needed to increase these values. The 500 in the connection timeout was previously also an issue -- we had some tests that would exceed that. We sometimes see 5s of just JVM startup time before we ever even get a log message...
The only reason it's relevant is because we're limited by the number of retries. Perhaps we should use a timeout instead?
A fixed one like we effectively have via # of retries now or do you mean via a user-provided config?
It could be hard-coded. I was just thinking it is a more intuitive way to achieve the same result. A bit weird to have to increase the backoff to make sure we're giving enough time to the tool.
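The deadline-based approach discussed here decouples the total wait from the per-attempt backoff. A minimal sketch (illustrative helper, not the actual tool code):

```python
import time

def wait_until(predicate, timeout_s=10.0, backoff_s=0.1):
    # A wall-clock deadline bounds the total wait, so tuning the backoff
    # no longer changes how long the tool is willing to wait overall --
    # the "weird" coupling between backoff and retry count goes away.
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_s)
```

With this shape, lowering the backoff to 100ms only makes the tool more responsive; it cannot shorten the overall time budget.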
  Thread.sleep(500)
}
names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
} while (wait && retries < maxNumRetries && !queries.equals(names))
The queries.equals(names) check is a bit fragile since the collection type is Iterable. If one of them is a Set and the other is a Seq, for example, the comparison will always return false. Maybe we should make sure both instances are a Set before comparing. That also seems to make more sense since we don't care about ordering.
Yeah, I had thought about that too, will update with that change.
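The collection-equality pitfall ijuma points out has a direct analog in most languages; for instance, in Python a list and a set never compare equal even when they hold exactly the same elements:

```python
# The same MBean names held as a list and as a set:
queries = ["kafka.server:type=ReplicaManager", "kafka.server:type=BrokerTopicMetrics"]
names = {"kafka.server:type=BrokerTopicMetrics", "kafka.server:type=ReplicaManager"}

# A sequence never compares equal to a set, even with identical elements...
assert queries != names
# ...so normalize both sides to sets before comparing. This also makes
# the comparison order-insensitive, which is what we actually want here.
assert set(queries) == names
```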
I agree that it would be good to get rid of
There seems to be a lot of variability currently, which makes it more difficult to evaluate the impact of this patch. For example, the last few nightly runs:

In any case, on to evaluating the test failures. One class of failures shows the following in

At first I thought these were due to an rsyncing issue, but then realized they are all related to version upgrades. Unfortunately, I'm not sure there will be a good fix for this. One option is to try to force those old versions to use the newer version of JmxTool. This should be pretty reliable since JmxTool changes infrequently, but I think it requires some gymnastics with the path resolution/versioning stuff. The other would be to half revert the removal of waiting for the first line of output and make it conditional on the version, but at that point it seems easier to revert & add a note about why things are the way they are (despite the fact that there still seems to be a potential race condition in the test).

The remaining 2 failures from kafkatest.tests.core.throttling_test.ThrottlingTest.test_throttled_reassignment fail with this:

From the

And synchronization issues cause some lines to merge before the newline is written, then adding another newline. Not sure yet how this could happen.
Turns out the version stuff doesn't look so bad, I've updated the patch and am running the subset of tests that failed here.
The system tests passed and I've updated the code to address comments, so I think the only remaining issue is that one set of tests that weirdly seems to have 2 instances of JmxTool running concurrently. Additionally, it looks like this only backports cleanly to 0.11.0 and 0.10.2, and the version check relies on 0.11.0.0 version numbers being available, so I don't think this will work for 0.10.2 either. So I'd probably revert the changes on those branches and we'd only rely on this for 0.11.0 and newer.
Thanks for the patch. If I understand correctly, the issue being addressed here was revealed by #3447. In that PR, we no longer wait for the first line of output before starting the tool, so it could start before the metrics are ready, causing the tool to hang. And we had PR #3447 to fix errors like
@apurvam That's correct, #3447 now allows

is because I'm trying to get the conditions here to finally test the conditions we really want. This is the third iteration. The check that the first line had been written was the first attempt, and was a heuristic when

This isn't really a great solution either since it relies on some fixed timeouts in
Using a metric reporter seems like a good option to explore. Maybe create a JIRA?
Yeah, I'll follow up. Am just trying to get this done quickly so we can get the failures resolved. Since the above tests pass, the last issue is that one test that seems to run JmxTool multiple times. Still not sure what's going on, and I'm having trouble reproducing anywhere but on Jenkins, but from one of the logs, it does indeed appear it is starting multiple times:

I'll continue trying to track down how that could happen (seems like it could possibly have something to do with console consumer running in the background and calling
List(null)

val names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)
Do you know if we are using patterns anywhere? If the goal is to ultimately remove the JmxTool and we don't use this capability, maybe we could drop support now. If that makes sense, we could do it in a follow-up to avoid holding up this patch any further.
JmxTool is a public tool that users use. We'd need to deprecate before removing. Technically, the addition of the option should probably be a KIP.
I see. In that case, do we need the --wait option at all? Could this not be the default behavior?
Using --wait breaks the tool when used with patterns. Since we can't say whether users are using patterns or not, I don't think we can turn it on by default. And I wouldn't be surprised if they are since it allows you to gather metrics for all client IDs, all brokers, etc.
Oh, I think I see what you were saying re: default, just if no patterns were used. It would be a change in behavior. Currently even if you have an extra or typo metric, you still get output. If we made this the default, you would not.
Perhaps it would be more palatable if, instead of exiting, we just log a warning when the wait time expires and not all queries have been matched?
Hmm, maybe. The tradeoff there is that users running it directly now suffer a 10s timeout (it was only 5s, but while we were at it, since we hit timeout issues with JmxTool in Confluent's own tests, I figured it made sense to bump it up). 10s isn't too bad, but since there's also output about retrying (which is useful for debugging in system tests), it'll just spam their console for a while (the tradeoff of 100 vs 500ms).
Most of this could be resolved by using real logging, but since the tool doesn't do that we have fewer options here.
Ok. I'll leave it up to you since the tradeoffs seem about even. I slightly prefer changing the default behavior since a little extra latency in cases when the query doesn't match seems like not a big deal for a long-running service like this (and I think it's unlikely we have many users of this tool anyway or we wouldn't be considering deprecation). It also avoids the K-word (you brought it up, not me).
This seems to be a valid hypothesis, since the throttling test is the only one which actually calls
Oh I see, there may be a race condition in the

Perhaps the fix would be to not start the jmx tool from
hachikuji left a comment
Left a couple minor comments, but LGTM overall.
val names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)

var names : Iterable[ObjectName] = null
nit: remove space after names?
}

if (wait && !queries.toSet.equals(if (names == null) null else names.toSet)) {
  System.err.println(s"Could not find all requested object names after $waitTimeoutMs ms.")
Maybe we can mention the ones we didn't find?
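Reporting which names were never matched, as suggested here, is just a set difference. A sketch with hypothetical names (the real code would do the equivalent on the Scala side):

```python
def missing_object_names(queries, found):
    # The requested names that the MBean server never returned,
    # sorted so the error message is stable across runs.
    return sorted(set(queries) - set(found))
```

For example, `missing_object_names(["a", "b", "c"], ["a"])` yields `["b", "c"]`, which could be appended to the timeout message.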
  Thread.sleep(100)
}
names = queries.flatMap((name: ObjectName) => mbsc.queryNames(name, null).asScala)
} while (wait && System.currentTimeMillis - start < waitTimeoutMs && !queries.toSet.equals(if (names == null) null else names.toSet))
nit: perhaps turn queries.toSet.equals(if (names == null) null else names.toSet) into an inner def since it is also used below?
@apurvam That's actually not definitely true, because all starting the console consumer service does is start the background thread. You're not really guaranteed

I actually just synchronized the relevant calls in ConsoleConsumer and ran those tests here, and they are now passing. Given that they failed multiple times in a row previously, I think we should be good. I've kicked off another round of those tests to repeatedly validate.
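The kind of synchronization described here — guarding the JMX start call so the foreground thread and the console consumer's background thread can't both perform it — might look like the following. The class and method names are illustrative stand-ins, not the actual ducktape code:

```python
import threading

class JmxToolGuard:
    # Illustrative sketch of the synchronized JmxMixin calls; the real
    # change guards the relevant ConsoleConsumer methods with a lock.
    def __init__(self):
        self.lock = threading.Lock()
        self.starts = 0

    def maybe_start_jmx_tool(self):
        with self.lock:
            # Under the lock, only the first caller performs the start;
            # any concurrent caller sees the tool already running and
            # returns without launching a second instance.
            if self.starts == 0:
                self.starts += 1
```

Without the lock, two threads could both observe "not started" and each launch JmxTool, which matches the duplicate-instance symptom seen in the throttling test logs.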
…foreground and background threads may invoke them
self.logger.debug("collecting following jmx objects: %s", self.jmx_object_names)
self.start_jmx_tool(idx, node)
with self.lock:
    self._init_jmx_attributes()
Maybe this line and the one below could be moved into start_jmx_tool? If we go that far, we could do the locking in there as well.
start_jmx_tool is completely generic for anything this is mixed into. _init_jmx_metrics is specific to ConsoleConsumer. That's actually why I renamed it -- it clarifies that this method is internal to this specific class.
I think we might be able to move _init_jmx_metrics into the constructor if we are sure that all uses guarantee that the JmxMixin constructor is called with the appropriate parameters and that the choice of metrics is not modified dynamically after the service is created.
def _init_jmx_attributes(self):
    # Must hold lock
    if self.new_consumer is True:
nit: since we're here, maybe we could simplify this if and the one below?
Done. There's a bunch of other cleanup that we'd probably catch via a simple pep8/flake8 checker against the entire code base (with the drawback that this may make any future backports quite painful).
@hachikuji Addressed those last few comments, making changes where I think it's safe. Last round of tests against the threading race condition also passed. Hopefully this is good to go and we can just do some follow ups after this:
Latest changes LGTM. Feel free to merge when you think it's ready.
…avoid race between JmxTool and monitored services

Author: Ewen Cheslack-Postava <me@ewencp.org>
Author: Ewen Cheslack-Postava <ewen@confluent.io>

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>

Closes #3547 from ewencp/wait-jmx-metrics

(cherry picked from commit f50af9c)
Signed-off-by: Ewen Cheslack-Postava <me@ewencp.org>
…since we now have a more reliable test using JMX"

This reverts commit 40e36fc. See https://issues.apache.org/jira/browse/KAFKA-5608 and #3547 for more details.
…since we now have a more reliable test using JMX"

This reverts commit 7f96aa3. See https://issues.apache.org/jira/browse/KAFKA-5608 and #3547 for more details.
Ok, so this is now merged to trunk and 0.11.0. I've reverted the commit that caused the problems on 0.10.2 and 0.10.1. I've also filed:
var names: Iterable[ObjectName] = null
def namesSet = if (names == null) null else names.toSet
def foundAllObjects() = !queries.toSet.equals(namesSet)
Seems like this is actually doing the reverse of what it's claiming to do. The calling code uses it in a way that's OK in terms of behaviour, but it's a bit confusing.
val hasPatternQueries = queries.exists((name: ObjectName) => name.isPattern)

var names: Iterable[ObjectName] = null
def namesSet = if (names == null) null else names.toSet
Isn't there a potential NPE here if wait == true && !hasPatternQueries?
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Yeva Byzek <yeva@confluent.io>