Skip to content

perf: eliminate double string concat in remote-task-runner shutdown logging#12097

Merged
samarthjain merged 1 commit intoapache:masterfrom
jasonk000:rtr-logging-shortcut
Mar 3, 2022
Merged

perf: eliminate double string concat in remote-task-runner shutdown logging#12097
samarthjain merged 1 commit intoapache:masterfrom
jasonk000:rtr-logging-shortcut

Conversation

@jasonk000
Copy link
Copy Markdown
Contributor

Description

Improve the performance of TaskQueue::manage on large installations by removing unnecessary String concat / construction.

Once the changes in #12096 are applied, the performance of the TaskQueue loop becomes much tighter, and is dominated by the logging calls.

Screenshot of profiler showing TaskQueue-Mana... thread:
image

Screenshot of profile flamegraph for this thread, highlighting the format calls in the stack:
image

During the task clean up loop, the shutdown() call is issued multiple times to the RemoteTaskRunner - note especially it uses the three-argument invocation of shutdown() method:

for (final String taskId : tasksToKill) {
try {
taskRunner.shutdown(
taskId,
"task is not in knownTaskIds[%s]",
knownTaskIds
);
}

This hits the default implementation, which constructs a "reason" argument and passes it on to the two-argument shutdown():

default void shutdown(String taskid, String reasonFormat, Object... args)
{
shutdown(taskid, StringUtils.format(reasonFormat, args));
}

In particular on large clusters with thousands of tasks, this default void shutdown call performs a StringUtils.format on a task set with thousands of task IDs, and it serialises all of them to a String.

Even in the event this log line is turned off, the log is still constructed, and then discarded.

Key changed/added classes in this PR

This introduces an @Override public void shutdown() with arguments, and has it perform the log construction and issue only if the info level is enabled. This results in significantly lower CPU consumption in this loop.

This follows up the changes in #12096, and also follows the mailing list discussion here:
https://lists.apache.org/thread/9jgdwrodwsfcg98so6kzfhdmn95gzyrj

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • been tested in a test Druid cluster (as part of a larger block of changes).

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Dec 23, 2021

This pull request introduces 1 alert when merging 6a5bb4b into 476d0bf - view on LGTM.com

new alerts:

  • 1 for Implicit conversion from array to string

@FrankChen021
Copy link
Copy Markdown
Member

Since the reason is passed to shutdown only for logging, why do we do what you have changed in the interface so that all subclasses gain this improvement?

Something ike this:

Logger getLogger();

default void shutdown(String taskid, String reasonFormat, Object... args) 
 { 
   getLogger().info("Shutdown [%s] because: [" + reasonFormat + "]" + reasonFormat, taskid, args);
   shutdown(taskid); 
 } 

Another thing that confuses me is that the flamechart shows that there's some bottlenecks at the logging, not the StringUtils.format, so to gain the performance improvement, the right thing here to do is to change the logging level to WARN or ERROR right? What you solved here is just to eliminate some string construction under WARN or ERROR logging level, right?

In another word, at INFO logging level, the gain seems not significant.

@jasonk000
Copy link
Copy Markdown
Contributor Author

I like your suggestion to move it up to the interface level.

WRT warn / error / etc. You're right; there's two perspectives on this:

  1. Yes, the biggest benefit is if logging is set to WARN/ERROR. In my stack, with thousands of tasks running, I specifically set warn level for this class in log4j xml as it is noisy, expensive and provides little benefit for regular operations, but this benefit is mitigated with the current codebase.

  2. Even in the case of info logging being enabled, the triple string concat is unnecessary.

@jasonk000
Copy link
Copy Markdown
Contributor Author

jasonk000 commented Dec 24, 2021

Some extra data, ..

With reference to the TaskQueue-Manager thread, a capture with info enabled:
image

With respect to this thread only, the TaskRunner::shutdown call is 82% of CPU in the manage() loop, and, the useful part of the loop, findWorkerRunningTask consumes ~18%, the rest is logging related. By (1) turning off INFO logging on this path, and (2) avoiding the StringUtils.format() call, we get approx 4x more capacity through the TaskQueue-Manager thread.
image


ie: If you take the image just above ^, and eliminate the Logger::info and StringUtils::format paths, it opens a lot of capacity through this loop.

This becomes more obvious when coupled to #12096 and the yet-to-be-raised PR to reduce contention on the TaskQueue loop.

@jasonk000
Copy link
Copy Markdown
Contributor Author

Capturing some test case results from #12099 we have more concrete example:

Alternating between these two lines:

  default void shutdown(String taskid, String reasonFormat, Object... reasonArgs)
  {
    // String reason = (getLogger().isInfoEnabled()) ? StringUtils.format(reasonFormat, reasonArgs) : null;
    String reason = StringUtils.format(reasonFormat, reasonArgs);
    shutdown(taskid, reason);
  }

The existing approach to always format the message completes the TaskQueueScaleTest in 15.x seconds:

[INFO] Running org.apache.druid.indexing.overlord.TaskQueueScaleTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.553 s - in org.apache.druid.indexing.overlord.TaskQueueScaleTest

Using the new way to only format if necessary completes the test in 6.x seconds:

[INFO] Running org.apache.druid.indexing.overlord.TaskQueueScaleTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.529 s - in org.apache.druid.indexing.overlord.TaskQueueScaleTest

default void shutdown(String taskid, String reasonFormat, Object... reasonArgs)
{
shutdown(taskid, StringUtils.format(reasonFormat, args));
String reason = (getLogger().isInfoEnabled()) ? StringUtils.format(reasonFormat, reasonArgs) : null;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add some comments here to elaborate why we do this here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 7c563ea and fb3cf82

@jasonk000
Copy link
Copy Markdown
Contributor Author

Hi @FrankChen021 I moved the detailed log output to debug instead of info; I think this is a better balance. Let me know your thoughts.

In 5685eaf.

/**
* Get the logger. Not expected to be called by consumers.
*/
Logger getLogger();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably change the default implementation to throw an UnsupportedOperationException. This will also have the benefit of not having to change classes implementing TaskRunner interface where they don't have a logger available (like test classes).

{
AtomicBoolean wasCalled = new AtomicBoolean(false);

public boolean getWasCalled()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a better method name would be wasCalled() ?

* EasyMock does not support mocking of toString, so this provides a custom
* object implementation to track whether toString was called.
*/
public static class ToStringMock
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We should probably have these inner classes ToStringMock and MockTaskRunner as private static.

shutdown(taskid, StringUtils.format(reasonFormat, args));
// only calculate the 'reason' string for debug level logging
// in large clusters the 'reasonArgs' may be very large / expensive if it includes a list of tasks
String reason = (getLogger().isDebugEnabled()) ? StringUtils.format(reasonFormat, reasonArgs) : "debug log disabled";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@samarthjain samarthjain Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the callers, I see only TaskQueue#shutdown() that is passing in the task ids. So, I propose we modify code in TaskQueue#manageInternal() to something like this:

// Kill tasks that shouldn't be running
    final Set<String> knownTaskIds = tasks
        .stream()
        .map(Task::getId)
        .collect(Collectors.toSet());
    final Set<String> tasksToKill = Sets.difference(runnerTaskFutures.keySet(), knownTaskIds);
    if (!tasksToKill.isEmpty()) {
      log.info("Asking taskRunner to clean up %,d tasks.", tasksToKill.size());
      // On large installations running several thousands of tasks,
      // concatenating the list of known task ids can be compupationally expensive.
      boolean logKnownTaskIds = log.isDebugEnabled();
      String reason = logKnownTaskIds
                      ? String.format("Task is not in knownTaskIds[%s]", knownTaskIds)
                      : "Task is not in knownTaskIds";
      for (final String taskId : tasksToKill) {
        try {
          taskRunner.shutdown(
              taskId, reason
          );
        }
        catch (Exception e) {
          log.warn(e, "TaskRunner failed to clean up task: %s", taskId);
        }
      }
    }

If we do this, then you won't have to add a new getLogger() method in TaskRunner interface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this solution, I'll swap over to this implementation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @samarthjain , this solution is much simpler - fixes applied

let me know if all good and i can squash it

@jasonk000 jasonk000 force-pushed the rtr-logging-shortcut branch 2 times, most recently from 0402f70 to d4aaef7 Compare March 3, 2022 00:01
@samarthjain
Copy link
Copy Markdown
Contributor

+1, looks good. I will merge after CI completes.

@jasonk000 jasonk000 force-pushed the rtr-logging-shortcut branch from d4aaef7 to 0b6cf82 Compare March 3, 2022 01:16
@jasonk000 jasonk000 force-pushed the rtr-logging-shortcut branch from 0b6cf82 to 5df557d Compare March 3, 2022 21:37
@samarthjain
Copy link
Copy Markdown
Contributor

Looking at the CI failures, it looks like the build timed out because HashPartitionMultiPhaseParallelIndexingTest never completed. To be double sure, I ran the test locally and it completed fine. I am going to go ahead and merge this PR since it looks like tests have been flaky in general off late and I don't think logging changes introduced by this PR can contribute to the failures.

@samarthjain samarthjain merged commit 3619395 into apache:master Mar 3, 2022
@clintropolis
Copy link
Copy Markdown
Member

this PR has broken the ci, forbidden api check is failing, String.format shouldn't be called directly, use StringUtils.format instead

@samarthjain
Copy link
Copy Markdown
Contributor

Looks like the last rebase you pushed reverted your String.format changes, @jasonk000 . I will submit a PR to fix CI, @clintropolis

@samarthjain
Copy link
Copy Markdown
Contributor

Submitted #12304 to fix CI.

@jasonk000
Copy link
Copy Markdown
Contributor Author

jasonk000 commented Mar 4, 2022

Oops my bad, we had that fixed last night but I broke it again. Sorry! Thx @samarthjain for addressing it.

@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants