perf: eliminate double string concat in remote-task-runner shutdown logging #12097
|
This pull request introduces 1 alert when merging 6a5bb4b into 476d0bf - view on LGTM.com new alerts:
|
|
Since the Something like this: Another thing that confuses me is that the flamechart shows that there are some bottlenecks at the logging, not the In other words, at INFO logging level, the gain does not seem significant. |
|
I like your suggestion to move it up to the interface level. WRT warn / error / etc. You're right; there are two perspectives on this:
|
|
Some extra data, .. With reference to the TaskQueue-Manager thread, a capture with info enabled: With respect to this thread only, the ie: If you take the image just above ^, and eliminate the This becomes more obvious when coupled to #12096 and the yet-to-be-raised PR to reduce contention on the TaskQueue loop. |
|
Capturing some test case results from #12099, we have a more concrete example: Alternating between these two lines: The existing approach to always format the message completes the TaskQueueScaleTest in 15.x seconds: Using the new way to only format if necessary completes the test in 6.x seconds: |
| default void shutdown(String taskid, String reasonFormat, Object... reasonArgs) | ||
| { | ||
| shutdown(taskid, StringUtils.format(reasonFormat, args)); | ||
| String reason = (getLogger().isInfoEnabled()) ? StringUtils.format(reasonFormat, reasonArgs) : null; |
better to add some comments here to elaborate why we do this here
|
Hi @FrankChen021 I moved the detailed log output to debug instead of info; I think this is a better balance. Let me know your thoughts. In 5685eaf. |
| /** | ||
| * Get the logger. Not expected to be called by consumers. | ||
| */ | ||
| Logger getLogger(); |
I would probably change the default implementation to throw an UnsupportedOperationException. This will also have the benefit of not having to change classes implementing TaskRunner interface where they don't have a logger available (like test classes).
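A minimal sketch of the suggestion above, assuming a simplified stand-in for the TaskRunner interface. SketchTaskRunner, BareRunner and LoggingRunner are hypothetical names, and java.util.logging.Logger stands in for Druid's own Logger class:

```java
import java.util.logging.Logger;

class DefaultGetLoggerSketch
{
  // Hypothetical, simplified stand-in for the TaskRunner interface.
  interface SketchTaskRunner
  {
    void shutdown(String taskId, String reason);

    // The default throws, so implementations without a logger
    // (for example test doubles) do not need to be changed.
    default Logger getLogger()
    {
      throw new UnsupportedOperationException("getLogger() is not supported by this runner");
    }
  }

  // An implementation that has no logger and does not override getLogger().
  static class BareRunner implements SketchTaskRunner
  {
    @Override
    public void shutdown(String taskId, String reason)
    {
      // no-op in this sketch
    }
  }

  // An implementation that does have a logger and exposes it.
  static class LoggingRunner extends BareRunner
  {
    private static final Logger LOG = Logger.getLogger("LoggingRunner");

    @Override
    public Logger getLogger()
    {
      return LOG;
    }
  }

  // Returns true if calling getLogger() on the runner throws UnsupportedOperationException.
  static boolean throwsUnsupported(SketchTaskRunner runner)
  {
    try {
      runner.getLogger();
      return false;
    }
    catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(throwsUnsupported(new BareRunner()));
    System.out.println(throwsUnsupported(new LoggingRunner()));
  }
}
```

One trade-off to keep in mind: any default method that calls getLogger() would then throw for implementations that do not override it, unless that call is guarded.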
| { | ||
| AtomicBoolean wasCalled = new AtomicBoolean(false); | ||
|
|
||
| public boolean getWasCalled() |
nit: maybe a better method name would be wasCalled() ?
| * EasyMock does not support mocking of toString, so this provides a custom | ||
| * object implementation to track whether toString was called. | ||
| */ | ||
| public static class ToStringMock |
nit: We should probably have these inner classes ToStringMock and MockTaskRunner as private static.
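Taken together, the two nits above could look roughly like this sketch. It is not the PR's actual test code: ToStringTrackingSketch, buildReason and toStringCalledWhen are hypothetical names, and plain String.format is used only to keep the sketch free of Druid dependencies (Druid code would go through StringUtils.format):

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ToStringTrackingSketch
{
  // Records whether toString() was invoked, as a hand-written
  // alternative to mocking toString() with a mocking framework.
  private static class ToStringMock
  {
    private final AtomicBoolean called = new AtomicBoolean(false);

    public boolean wasCalled()
    {
      return called.get();
    }

    @Override
    public String toString()
    {
      called.set(true);
      return "ToStringMock";
    }
  }

  // Hypothetical helper: formats the argument only when the debug flag is set.
  static String buildReason(boolean debugEnabled, Object arg)
  {
    return debugEnabled
           ? String.format("Task is not in knownTaskIds[%s]", arg)
           : "Task is not in knownTaskIds";
  }

  // Reports whether building the reason touched toString() on the argument.
  static boolean toStringCalledWhen(boolean debugEnabled)
  {
    ToStringMock mock = new ToStringMock();
    buildReason(debugEnabled, mock);
    return mock.wasCalled();
  }

  public static void main(String[] args)
  {
    System.out.println(toStringCalledWhen(true));
    System.out.println(toStringCalledWhen(false));
  }
}
```

Keeping the helper class private static limits it to the enclosing test class, and wasCalled() reads naturally in assertions.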
| shutdown(taskid, StringUtils.format(reasonFormat, args)); | ||
| // only calculate the 'reason' string for debug level logging | ||
| // in large clusters the 'reasonArgs' may be very large / expensive if it includes a list of tasks | ||
| String reason = (getLogger().isDebugEnabled()) ? StringUtils.format(reasonFormat, reasonArgs) : "debug log disabled"; |
If debug logging is not enabled, then the following log line won't be reported.
https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java#L582
Looking at the callers, I see only TaskQueue#shutdown() that is passing in the task ids. So, I propose we modify code in TaskQueue#manageInternal() to something like this:
// Kill tasks that shouldn't be running
final Set<String> knownTaskIds = tasks
    .stream()
    .map(Task::getId)
    .collect(Collectors.toSet());
final Set<String> tasksToKill = Sets.difference(runnerTaskFutures.keySet(), knownTaskIds);
if (!tasksToKill.isEmpty()) {
  log.info("Asking taskRunner to clean up %,d tasks.", tasksToKill.size());
  // On large installations running several thousands of tasks,
  // concatenating the list of known task ids can be computationally expensive.
  boolean logKnownTaskIds = log.isDebugEnabled();
  String reason = logKnownTaskIds
                  ? String.format("Task is not in knownTaskIds[%s]", knownTaskIds)
                  : "Task is not in knownTaskIds";
  for (final String taskId : tasksToKill) {
    try {
      taskRunner.shutdown(taskId, reason);
    }
    catch (Exception e) {
      log.warn(e, "TaskRunner failed to clean up task: %s", taskId);
    }
  }
}
If we do this, then you won't have to add a new getLogger() method in TaskRunner interface.
I like this solution, I'll swap over to this implementation.
thanks @samarthjain , this solution is much simpler - fixes applied
let me know if all good and i can squash it
0402f70 to d4aaef7
|
+1, looks good. I will merge after CI completes. |
d4aaef7 to 0b6cf82
0b6cf82 to 5df557d
|
Looking at the CI failures, it looks like the build timed out because |
|
this PR has broken the ci, forbidden api check is failing, |
|
Looks like the last rebase you pushed reverted your String.format changes, @jasonk000 . I will submit a PR to fix CI, @clintropolis |
|
Submitted #12304 to fix CI. |
|
Oops my bad, we had that fixed last night but I broke it again. Sorry! Thx @samarthjain for addressing it. |


Description
Improve the performance of TaskQueue::manage on large installations by removing unnecessary String concat / construction.

Once the changes in #12096 are applied, the performance of the TaskQueue loop becomes much tighter, and is dominated by the logging calls.

Screenshot of profiler showing TaskQueue-Mana... thread:

Screenshot of profile flamegraph for this thread, highlighting the format calls in the stack:

During the task clean up loop, the shutdown() call is issued multiple times to the RemoteTaskRunner - note especially it uses the three-argument invocation of shutdown() method:

druid/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java
Lines 328 to 335 in 476d0bf

This hits the default implementation, which constructs a "reason" argument and passes it on to the two-argument shutdown():

druid/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskRunner.java
Lines 91 to 94 in 476d0bf

In particular on large clusters with thousands of tasks, this default void shutdown call performs a StringUtils.format on a task set with thousands of task IDs, and it serialises all of them to a String.

Even in the event this log line is turned off, the log is still constructed, and then discarded.

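To illustrate the cost described above, here is a small self-contained sketch. It is not Druid code: EagerFormatSketch and its methods are hypothetical, a boolean flag stands in for the logger's info level, and a counter stands in for the profiler. It shows that a format-then-delegate shape builds the full reason string once per task even when nothing is logged:

```java
import java.util.ArrayList;
import java.util.List;

class EagerFormatSketch
{
  // Counts how often the reason string is built; a crude stand-in for a profiler.
  private static int formatCalls;

  private static String countingFormat(String pattern, Object... args)
  {
    formatCalls++;
    return String.format(pattern, args);
  }

  // Stand-in for the two-argument shutdown(): the reason is only used when info logging is on.
  private static void shutdownWithReason(boolean infoEnabled, String taskId, String reason)
  {
    if (infoEnabled) {
      System.out.println("Shutdown [" + taskId + "] because: " + reason);
    }
  }

  // Mirrors the shape of the default three-argument shutdown(): it formats unconditionally.
  private static void shutdownWithFormat(
      boolean infoEnabled,
      String taskId,
      String reasonFormat,
      Object... reasonArgs
  )
  {
    shutdownWithReason(infoEnabled, taskId, countingFormat(reasonFormat, reasonArgs));
  }

  // Shuts down every task with info logging disabled and returns the number of format calls.
  static int formatsWithInfoDisabled(int taskCount)
  {
    List<String> knownTaskIds = new ArrayList<>();
    for (int i = 0; i < taskCount; i++) {
      knownTaskIds.add("task_" + i);
    }
    formatCalls = 0;
    for (String taskId : knownTaskIds) {
      // Each call serialises the whole list of task IDs, then discards the result.
      shutdownWithFormat(false, taskId, "task is not in knownTaskIds[%s]", knownTaskIds);
    }
    return formatCalls;
  }

  public static void main(String[] args)
  {
    System.out.println(formatsWithInfoDisabled(1000));
  }
}
```

With N tasks the loop performs N formats of an N-element collection, so the wasted work grows roughly quadratically with the number of tasks.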
Key changed/added classes in this PR

This introduces an @Override public void shutdown() with arguments, and has it perform the log construction and issue only if the info level is enabled. This results in significantly lower CPU consumption in this loop.

This follows up the changes in #12096, and also follows the mailing list discussion here:
https://lists.apache.org/thread/9jgdwrodwsfcg98so6kzfhdmn95gzyrj
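
The override described above could look roughly like the following sketch. It is an illustration under assumptions, not the code in this PR: SketchTaskRunner and SketchRemoteRunner are hypothetical stand-ins for TaskRunner and RemoteTaskRunner, java.util.logging replaces Druid's Logger, and the counter exists only to make the behaviour observable:

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedShutdownSketch
{
  // Hypothetical, simplified stand-in for the TaskRunner interface.
  interface SketchTaskRunner
  {
    void shutdown(String taskId, String reason);

    // Default behaviour: always build the reason string, then delegate.
    default void shutdown(String taskId, String reasonFormat, Object... reasonArgs)
    {
      shutdown(taskId, String.format(reasonFormat, reasonArgs));
    }
  }

  // Hypothetical runner that overrides the three-argument form with a level guard.
  static class SketchRemoteRunner implements SketchTaskRunner
  {
    private final Logger log = Logger.getAnonymousLogger();
    int reasonsFormatted = 0;

    SketchRemoteRunner(Level level)
    {
      log.setUseParentHandlers(false); // keep the sketch quiet on the console
      log.setLevel(level);
    }

    @Override
    public void shutdown(String taskId, String reason)
    {
      // The actual shutdown work would go here.
    }

    @Override
    public void shutdown(String taskId, String reasonFormat, Object... reasonArgs)
    {
      // Only pay for formatting when the message will actually be logged.
      if (log.isLoggable(Level.INFO)) {
        reasonsFormatted++;
        log.info("Shutdown [" + taskId + "] because: " + String.format(reasonFormat, reasonArgs));
      }
      // The actual shutdown work would go here, without needing the reason string.
    }
  }

  // Issues one three-argument shutdown through the interface and reports how many reasons were built.
  static int reasonsFormattedAt(Level level)
  {
    SketchRemoteRunner runner = new SketchRemoteRunner(level);
    SketchTaskRunner asInterface = runner;
    List<String> knownTaskIds = Arrays.asList("task_1", "task_2");
    asInterface.shutdown("task_3", "task is not in knownTaskIds[%s]", knownTaskIds);
    return runner.reasonsFormatted;
  }

  public static void main(String[] args)
  {
    System.out.println(reasonsFormattedAt(Level.INFO));
    System.out.println(reasonsFormattedAt(Level.OFF));
  }
}
```

Because the override replaces the interface default, callers that use the three-argument form need no changes; the guard is applied inside the runner.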
This PR has: